# RULER NIAH Standalone Test Plan

## Overview

This document describes how to independently test nano-vllm's CPU offload functionality using the RULER benchmark's NIAH (Needle-In-A-Haystack) task data.

## Background

### Problem Being Investigated

When running 32K-sequence-length tests in CPU offload mode, the model outputs garbled text instead of finding the magic number. The issue was traced to:

- **Root Cause**: The ring buffer's `max_seq_len` was set equal to `max_model_len` (32768)
- **Issue**: When prefill uses ~32K tokens, decode needs to store KV at positions 32768 and beyond, but the ring buffer only has indices 0-32767
- **Fix Applied**: In `nanovllm/kvcache/__init__.py`, set `max_seq_len = max_model_len + 512` so decode positions past the prefill have valid slots
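
The failure mode can be illustrated with a toy slot computation (a hypothetical sketch; the real indexing lives in `offload_engine.py` and may differ in detail):

```python
def ring_slot(position: int, max_seq_len: int) -> int:
    # Hypothetical ring-buffer indexing: the KV entry for a token at
    # `position` lands in slot position % max_seq_len.
    return position % max_seq_len

# Before the fix: buffer sized exactly to max_model_len (32768 slots).
# The first decode token (position 32768) wraps around to slot 0,
# clobbering prefill KV -> garbled output.
print(ring_slot(32768, 32768))        # 0

# After the fix: 512 slots of headroom for decode tokens.
print(ring_slot(32768, 32768 + 512))  # 32768
```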

### Test Objective

Verify that the fix works correctly by running a standalone test with actual RULER NIAH data.

## Step 1: Copy Test Data

```
/home/zijie/Code/x-attention/eval/RULER/scripts/benchmark_root/full_fuse_16_llama3.1-8b-chat/synthetic/32768/data/niah_single_1/validation.jsonl
```

### Data Format

Each line is a JSON object:

```json
{
  "index": 0,
  "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nA special magic number is hidden within the following text...",
  "outputs": ["8930103"],
  "length": 32768
}
```

- `input`: Full prompt with Llama 3.1 chat template (~122K characters, ~30K tokens)
- `outputs`: Expected answer (the magic number to find)
- `length`: Target sequence length in tokens

### Copy Command

```bash
mkdir -p /home/zijie/Code/nano-vllm/tests/data/ruler_niah
cp /home/zijie/Code/x-attention/eval/RULER/scripts/benchmark_root/full_fuse_16_llama3.1-8b-chat/synthetic/32768/data/niah_single_1/validation.jsonl \
   /home/zijie/Code/nano-vllm/tests/data/ruler_niah/niah_single_1_32k.jsonl
```
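
After copying, a quick sanity check (a sketch; the field names are taken from the data format above) confirms the file parses and has the expected schema:

```python
import json
from pathlib import Path


def check_jsonl(path: Path) -> int:
    """Parse every line, verify the NIAH schema, and return the sample count."""
    count = 0
    with open(path) as f:
        for line in f:
            sample = json.loads(line)
            assert {"index", "input", "outputs", "length"} <= set(sample)
            assert isinstance(sample["outputs"], list) and sample["outputs"]
            count += 1
    return count


# e.g. check_jsonl(Path("tests/data/ruler_niah/niah_single_1_32k.jsonl"))
```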

## Step 2: Create Test Script

Create `/home/zijie/Code/nano-vllm/tests/test_ruler_niah_32k.py`:

```python
"""
Standalone test for RULER NIAH task with 32K context length.

This test verifies that CPU offload mode correctly handles long sequences
where prefill tokens approach max_model_len.

Usage:
    python tests/test_ruler_niah_32k.py
"""

import json
from pathlib import Path

from nanovllm import LLM
from nanovllm.config import SamplingParams

# Configuration
MODEL_PATH = "/data/models/Llama-3.1-8B-Instruct"
DATA_FILE = Path(__file__).parent / "data/ruler_niah/niah_single_1_32k.jsonl"
MAX_MODEL_LEN = 32768
MAX_NEW_TOKENS = 50

# CPU offload settings
ENABLE_CPU_OFFLOAD = True
NUM_GPU_BLOCKS = 4
BLOCK_SIZE = 1024


def load_test_sample(filepath: Path, index: int = 0) -> dict:
    """Load a single test sample from a JSONL file."""
    with open(filepath) as f:
        for i, line in enumerate(f):
            if i == index:
                return json.loads(line)
    raise ValueError(f"Sample index {index} not found")


def test_niah_single():
    """Test the NIAH single-needle task with a 32K context."""
    print("=" * 60)
    print("RULER NIAH 32K Standalone Test")
    print("=" * 60)

    # Load test data
    sample = load_test_sample(DATA_FILE, index=0)
    prompt = sample["input"]
    expected = sample["outputs"][0]

    print(f"Prompt length: {len(prompt)} characters")
    print(f"Expected answer: {expected}")
    print()

    # Initialize model with CPU offload
    print("Initializing LLM with CPU offload...")
    llm = LLM(
        model=MODEL_PATH,
        max_model_len=MAX_MODEL_LEN,
        enable_cpu_offload=ENABLE_CPU_OFFLOAD,
        num_gpu_blocks=NUM_GPU_BLOCKS,
        kvcache_block_size=BLOCK_SIZE,
        enforce_eager=True,  # Disable CUDA graphs for debugging
    )

    # Generate
    print("Generating response...")
    sampling_params = SamplingParams(
        temperature=0.0,  # Greedy decoding
        max_tokens=MAX_NEW_TOKENS,
    )

    outputs = llm.generate([prompt], sampling_params)
    generated_text = outputs[0].outputs[0].text

    print()
    print("=" * 60)
    print("Results")
    print("=" * 60)
    print(f"Expected:  {expected}")
    print(f"Generated: {generated_text[:200]}...")
    print()

    # Check whether the expected number appears in the output
    if expected in generated_text:
        print("SUCCESS: Magic number found in output!")
        return True
    else:
        print("FAILED: Magic number NOT found in output")
        print(f"Full output: {generated_text}")
        return False


def test_multiple_samples(num_samples: int = 5):
    """Test multiple NIAH samples."""
    print("=" * 60)
    print(f"Testing {num_samples} NIAH samples with 32K context")
    print("=" * 60)

    # Initialize the model once and reuse it across samples
    llm = LLM(
        model=MODEL_PATH,
        max_model_len=MAX_MODEL_LEN,
        enable_cpu_offload=ENABLE_CPU_OFFLOAD,
        num_gpu_blocks=NUM_GPU_BLOCKS,
        kvcache_block_size=BLOCK_SIZE,
        enforce_eager=True,
    )

    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=MAX_NEW_TOKENS,
    )

    correct = 0
    for i in range(num_samples):
        sample = load_test_sample(DATA_FILE, index=i)
        prompt = sample["input"]
        expected = sample["outputs"][0]

        outputs = llm.generate([prompt], sampling_params)
        generated_text = outputs[0].outputs[0].text

        if expected in generated_text:
            print(f"Sample {i}: PASS (found {expected})")
            correct += 1
        else:
            print(f"Sample {i}: FAIL (expected {expected}, got: {generated_text[:50]}...)")

    print()
    print(f"Accuracy: {correct}/{num_samples} ({100 * correct / num_samples:.1f}%)")
    return correct == num_samples


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1 and sys.argv[1] == "--all":
        success = test_multiple_samples(5)
    else:
        success = test_niah_single()

    sys.exit(0 if success else 1)
```

## Step 3: Run Test

### Single Sample Test

```bash
cd /home/zijie/Code/nano-vllm
CUDA_VISIBLE_DEVICES=2,3,4,5 python tests/test_ruler_niah_32k.py
```

### All 5 Samples

```bash
cd /home/zijie/Code/nano-vllm
CUDA_VISIBLE_DEVICES=2,3,4,5 python tests/test_ruler_niah_32k.py --all
```

## Step 4: Expected Results

### Before Fix (Bug)

- Output: Garbled text like "not only has been replaced by thesiums..."
- Score: 0% (magic number not found)
- Time: ~80 seconds per sample

### After Fix (Expected)

- Output: The magic number (e.g., "8930103")
- Score: ~100% (magic number found)
- Time: ~80 seconds per sample (same, as the compute is unchanged)

## Debugging Tips

### Enable Verbose Logging

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

### Check Ring Buffer Size

In the logs, verify:

```
OffloadEngine initializing: num_layers=32, num_kv_buffers=4, max_seq_len=33280
```

The `max_seq_len` should be `32768 + 512 = 33280` (not 32768).
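
This check can be scripted (a sketch; the regex is an assumption based on the log format shown above):

```python
import re


def ring_buffer_len(log_line: str) -> int:
    """Extract max_seq_len from an OffloadEngine init log line."""
    m = re.search(r"max_seq_len=(\d+)", log_line)
    if m is None:
        raise ValueError("max_seq_len not found in log line")
    return int(m.group(1))


line = "OffloadEngine initializing: num_layers=32, num_kv_buffers=4, max_seq_len=33280"
assert ring_buffer_len(line) == 32768 + 512  # the fixed size, not 32768
```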

### Monitor GPU Memory

```bash
watch -n 1 nvidia-smi
```

With CPU offload, GPU memory for the KV cache should be ~640MB (ring buffer only).
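
A back-of-envelope sizing sketch, assuming fp16 KV, Llama-3.1-8B's 8 KV heads with head_dim 128, and 4 resident per-layer K/V buffers (matching `num_kv_buffers=4` in the log; the actual allocation may differ):

```python
def kv_buffer_bytes(num_buffers: int, kv_heads: int, head_dim: int,
                    max_seq_len: int, dtype_bytes: int = 2) -> int:
    # Each resident buffer holds K and V (factor of 2) for one layer
    # across the full ring of max_seq_len token slots.
    return num_buffers * 2 * kv_heads * head_dim * max_seq_len * dtype_bytes


mib = kv_buffer_bytes(num_buffers=4, kv_heads=8, head_dim=128,
                      max_seq_len=33280) / 2**20
print(f"{mib:.0f} MiB")  # 520 MiB
```

This lands in the same ballpark as the ~640MB figure observed in `nvidia-smi`; alignment padding and auxiliary buffers would account for the gap.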

## Related Files

| File | Description |
|------|-------------|
| `nanovllm/kvcache/__init__.py` | Fix location: `max_seq_len = max_model_len + 512` |
| `nanovllm/kvcache/offload_engine.py` | Ring buffer allocation |
| `nanovllm/engine/model_runner.py` | Layer-wise offload prefill/decode |
| `nanovllm/kvcache/hybrid_manager.py` | CPU block management |

## Test Data Details

### NIAH Task Description

The NIAH (Needle-In-A-Haystack) task tests the model's ability to retrieve a specific piece of information (the "needle") from a large context (the "haystack").

- **Needle**: A magic number associated with a keyword (e.g., "worried-purse")
- **Haystack**: ~30K tokens of distractor text
- **Task**: Extract the magic number when asked
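
A toy version of this construction (illustrative only, not RULER's actual generator):

```python
# Build a miniature haystack with a single needle, mirroring how
# NIAH samples are structured conceptually.
needle = "The special magic number for worried-purse is 8930103."
filler = "The grass is green. The sky is blue. The sun is yellow. "
haystack = filler * 100

prompt = (
    "A special magic number is hidden within the following text. "
    "Make sure to memorize it.\n\n"
    + haystack + needle + " " + haystack
    + "\nWhat is the special magic number for worried-purse?"
)

assert "8930103" in prompt  # the needle survives embedding in the haystack
```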

### Sample Prompt Structure

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

A special magic number is hidden within the following text. Make sure to memorize it. I will quiz you about the number afterwards.

[... ~30K tokens of haystack text ...]

The special magic number for worried-purse is 8930103.

[... more haystack text ...]

What is the special magic number for worried-purse mentioned in the provided text?
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The special magic number for worried-purse mentioned in the provided text is
```

The model should complete with: `8930103`