# RULER NIAH Standalone Test Plan

## Overview

This document describes how to independently test nano-vllm's CPU offload functionality using the RULER benchmark's NIAH (Needle-In-A-Haystack) task data.

## Background

### Problem Being Investigated

When running 32K-sequence-length tests in CPU offload mode, the model outputs garbled text instead of finding the magic number. The issue was traced to:

- **Root Cause**: The ring buffer's `max_seq_len` was set equal to `max_model_len` (32768)
- **Issue**: When prefill uses ~32K tokens, decode needs to store KV at positions 32768 and beyond, but the ring buffer only has indices 0-32767
- **Fix Applied**: In `nanovllm/kvcache/__init__.py`, set `max_seq_len = max_model_len + 512` so decode positions past the prefill have valid slots
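
The failure mode can be illustrated with a toy slot computation (a hypothetical sketch; the real indexing lives in `offload_engine.py` and may differ in detail):

```python
def ring_slot(position: int, max_seq_len: int) -> int:
    # Hypothetical ring-buffer indexing: the KV entry for a token at
    # `position` lands in slot position % max_seq_len.
    return position % max_seq_len

# Before the fix: buffer sized exactly to max_model_len (32768 slots).
# The first decode token (position 32768) wraps around to slot 0,
# clobbering prefill KV -> garbled output.
print(ring_slot(32768, 32768))        # 0

# After the fix: 512 slots of headroom for decode tokens.
print(ring_slot(32768, 32768 + 512))  # 32768
```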

### Test Objective

Verify that the fix works correctly by running a standalone test with actual RULER NIAH data.

## Step 1: Copy Test Data

```
/home/zijie/Code/x-attention/eval/RULER/scripts/benchmark_root/full_fuse_16_llama3.1-8b-chat/synthetic/32768/data/niah_single_1/validation.jsonl
```

### Data Format

Each line is a JSON object:

```json
{
  "index": 0,
  "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nA special magic number is hidden within the following text...",
  "outputs": ["8930103"],
  "length": 32768
}
```

- `input`: Full prompt with Llama 3.1 chat template (~122K characters, ~30K tokens)
- `outputs`: Expected answer (the magic number to find)
- `length`: Target sequence length in tokens

### Copy Command

```bash
mkdir -p /home/zijie/Code/nano-vllm/tests/data/ruler_niah
cp /home/zijie/Code/x-attention/eval/RULER/scripts/benchmark_root/full_fuse_16_llama3.1-8b-chat/synthetic/32768/data/niah_single_1/validation.jsonl \
   /home/zijie/Code/nano-vllm/tests/data/ruler_niah/niah_single_1_32k.jsonl
```
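
After copying, a quick sanity check (a sketch; the field names are taken from the data format above) confirms the file parses and has the expected schema:

```python
import json
from pathlib import Path


def check_jsonl(path: Path) -> int:
    """Parse every line, verify the NIAH schema, and return the sample count."""
    count = 0
    with open(path) as f:
        for line in f:
            sample = json.loads(line)
            assert {"index", "input", "outputs", "length"} <= set(sample)
            assert isinstance(sample["outputs"], list) and sample["outputs"]
            count += 1
    return count


# e.g. check_jsonl(Path("tests/data/ruler_niah/niah_single_1_32k.jsonl"))
```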

## Step 2: Create Test Script

Create `/home/zijie/Code/nano-vllm/tests/test_ruler_niah_32k.py`:

```python
"""
Standalone test for RULER NIAH task with 32K context length.

This test verifies that CPU offload mode correctly handles long sequences
where prefill tokens approach max_model_len.

Usage:
    python tests/test_ruler_niah_32k.py
"""

import json
from pathlib import Path

from nanovllm import LLM
from nanovllm.config import SamplingParams

# Configuration
MODEL_PATH = "/data/models/Llama-3.1-8B-Instruct"
DATA_FILE = Path(__file__).parent / "data/ruler_niah/niah_single_1_32k.jsonl"
MAX_MODEL_LEN = 32768
MAX_NEW_TOKENS = 50

# CPU offload settings
ENABLE_CPU_OFFLOAD = True
NUM_GPU_BLOCKS = 4
BLOCK_SIZE = 1024


def load_test_sample(filepath: Path, index: int = 0) -> dict:
    """Load a single test sample from a JSONL file."""
    with open(filepath) as f:
        for i, line in enumerate(f):
            if i == index:
                return json.loads(line)
    raise ValueError(f"Sample index {index} not found")


def test_niah_single():
    """Test the NIAH single-needle task with a 32K context."""
    print("=" * 60)
    print("RULER NIAH 32K Standalone Test")
    print("=" * 60)

    # Load test data
    sample = load_test_sample(DATA_FILE, index=0)
    prompt = sample["input"]
    expected = sample["outputs"][0]

    print(f"Prompt length: {len(prompt)} characters")
    print(f"Expected answer: {expected}")
    print()

    # Initialize model with CPU offload
    print("Initializing LLM with CPU offload...")
    llm = LLM(
        model=MODEL_PATH,
        max_model_len=MAX_MODEL_LEN,
        enable_cpu_offload=ENABLE_CPU_OFFLOAD,
        num_gpu_blocks=NUM_GPU_BLOCKS,
        kvcache_block_size=BLOCK_SIZE,
        enforce_eager=True,  # Disable CUDA graphs for debugging
    )

    # Generate
    print("Generating response...")
    sampling_params = SamplingParams(
        temperature=0.0,  # Greedy decoding
        max_tokens=MAX_NEW_TOKENS,
    )

    outputs = llm.generate([prompt], sampling_params)
    generated_text = outputs[0].outputs[0].text

    print()
    print("=" * 60)
    print("Results")
    print("=" * 60)
    print(f"Expected:  {expected}")
    print(f"Generated: {generated_text[:200]}...")
    print()

    # Check whether the expected number appears in the output
    if expected in generated_text:
        print("SUCCESS: Magic number found in output!")
        return True
    else:
        print("FAILED: Magic number NOT found in output")
        print(f"Full output: {generated_text}")
        return False


def test_multiple_samples(num_samples: int = 5):
    """Test multiple NIAH samples."""
    print("=" * 60)
    print(f"Testing {num_samples} NIAH samples with 32K context")
    print("=" * 60)

    # Initialize the model once and reuse it across samples
    llm = LLM(
        model=MODEL_PATH,
        max_model_len=MAX_MODEL_LEN,
        enable_cpu_offload=ENABLE_CPU_OFFLOAD,
        num_gpu_blocks=NUM_GPU_BLOCKS,
        kvcache_block_size=BLOCK_SIZE,
        enforce_eager=True,
    )

    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=MAX_NEW_TOKENS,
    )

    correct = 0
    for i in range(num_samples):
        sample = load_test_sample(DATA_FILE, index=i)
        prompt = sample["input"]
        expected = sample["outputs"][0]

        outputs = llm.generate([prompt], sampling_params)
        generated_text = outputs[0].outputs[0].text

        if expected in generated_text:
            print(f"Sample {i}: PASS (found {expected})")
            correct += 1
        else:
            print(f"Sample {i}: FAIL (expected {expected}, got: {generated_text[:50]}...)")

    print()
    print(f"Accuracy: {correct}/{num_samples} ({100 * correct / num_samples:.1f}%)")
    return correct == num_samples


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1 and sys.argv[1] == "--all":
        success = test_multiple_samples(5)
    else:
        success = test_niah_single()

    sys.exit(0 if success else 1)
```

## Step 3: Run Test

### Single Sample Test

```bash
cd /home/zijie/Code/nano-vllm
CUDA_VISIBLE_DEVICES=2,3,4,5 python tests/test_ruler_niah_32k.py
```

### All 5 Samples

```bash
cd /home/zijie/Code/nano-vllm
CUDA_VISIBLE_DEVICES=2,3,4,5 python tests/test_ruler_niah_32k.py --all
```

## Step 4: Expected Results

### Before Fix (Bug)

- Output: Garbled text like "not only has been replaced by thesiums..."
- Score: 0% (magic number not found)
- Time: ~80 seconds per sample

### After Fix (Expected)

- Output: The magic number (e.g., "8930103")
- Score: ~100% (magic number found)
- Time: ~80 seconds per sample (same, as the compute is unchanged)

## Debugging Tips

### Enable Verbose Logging

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

### Check Ring Buffer Size

In the logs, verify:

```
OffloadEngine initializing: num_layers=32, num_kv_buffers=4, max_seq_len=33280
```

The `max_seq_len` should be `32768 + 512 = 33280` (not 32768).
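
This check can be scripted (a sketch; the regex is an assumption based on the log format shown above):

```python
import re


def ring_buffer_len(log_line: str) -> int:
    """Extract max_seq_len from an OffloadEngine init log line."""
    m = re.search(r"max_seq_len=(\d+)", log_line)
    if m is None:
        raise ValueError("max_seq_len not found in log line")
    return int(m.group(1))


line = "OffloadEngine initializing: num_layers=32, num_kv_buffers=4, max_seq_len=33280"
assert ring_buffer_len(line) == 32768 + 512  # the fixed size, not 32768
```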

### Monitor GPU Memory

```bash
watch -n 1 nvidia-smi
```

With CPU offload, GPU memory for the KV cache should be ~640MB (ring buffer only).
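
A back-of-envelope sizing sketch, assuming fp16 KV, Llama-3.1-8B's 8 KV heads with head_dim 128, and 4 resident per-layer K/V buffers (matching `num_kv_buffers=4` in the log; the actual allocation may differ):

```python
def kv_buffer_bytes(num_buffers: int, kv_heads: int, head_dim: int,
                    max_seq_len: int, dtype_bytes: int = 2) -> int:
    # Each resident buffer holds K and V (factor of 2) for one layer
    # across the full ring of max_seq_len token slots.
    return num_buffers * 2 * kv_heads * head_dim * max_seq_len * dtype_bytes


mib = kv_buffer_bytes(num_buffers=4, kv_heads=8, head_dim=128,
                      max_seq_len=33280) / 2**20
print(f"{mib:.0f} MiB")  # 520 MiB
```

This lands in the same ballpark as the ~640MB figure observed in `nvidia-smi`; alignment padding and auxiliary buffers would account for the gap.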

## Related Files

| File | Description |
|------|-------------|
| `nanovllm/kvcache/__init__.py` | Fix location: `max_seq_len = max_model_len + 512` |
| `nanovllm/kvcache/offload_engine.py` | Ring buffer allocation |
| `nanovllm/engine/model_runner.py` | Layer-wise offload prefill/decode |
| `nanovllm/kvcache/hybrid_manager.py` | CPU block management |

## Test Data Details

### NIAH Task Description

The NIAH (Needle-In-A-Haystack) task tests the model's ability to retrieve a specific piece of information (the "needle") from a large context (the "haystack").

- **Needle**: A magic number associated with a keyword (e.g., "worried-purse")
- **Haystack**: ~30K tokens of distractor text
- **Task**: Extract the magic number when asked
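
A toy version of this construction (illustrative only, not RULER's actual generator):

```python
# Build a miniature haystack with a single needle, mirroring how
# NIAH samples are structured conceptually.
needle = "The special magic number for worried-purse is 8930103."
filler = "The grass is green. The sky is blue. The sun is yellow. "
haystack = filler * 100

prompt = (
    "A special magic number is hidden within the following text. "
    "Make sure to memorize it.\n\n"
    + haystack + needle + " " + haystack
    + "\nWhat is the special magic number for worried-purse?"
)

assert "8930103" in prompt  # the needle survives embedding in the haystack
```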

### Sample Prompt Structure

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

A special magic number is hidden within the following text. Make sure to memorize it. I will quiz you about the number afterwards.

[... ~30K tokens of haystack text ...]

The special magic number for worried-purse is 8930103.

[... more haystack text ...]

What is the special magic number for worried-purse mentioned in the provided text?
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The special magic number for worried-purse mentioned in the provided text is
```

The model should complete with: `8930103`