nano-vllm/CLAUDE.md
Zijie Tian 6180055ed8 📝 docs: add chunked attention solutions guide and update doc index
Add comprehensive documentation analyzing the 32K chunked offload
accuracy issues with proposed solutions covering LSE precision,
ring buffer state management, and position encoding validation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 04:48:20 +08:00

CLAUDE.md

This file provides guidance to Claude Code when working with this repository.

Overview

Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. It supports Qwen3 models and offers CPU offload for long-context inference.

Documentation Index

| Document | Purpose |
|----------|---------|
| docs/architecture_guide.md | Core components, CPU offload system design, ring buffer architecture, stream configuration |
| docs/sparse_policy_architecture.md | SparsePolicy abstraction: prefill/decode delegation, pipeline modes, policy implementations |
| docs/sparse_policy_implementation_guide.md | How to implement a custom SparsePolicy: required methods, hooks, ring buffer pipeline pattern |
| docs/sparse_attention_guide.md | Block sparse attention methods (XAttention, FlexPrefill, MInference, AvgPool, Quest), computation flow, algorithms |
| docs/xattention_algorithm_guide.md | XAttention algorithm in detail: stride reshape, Triton kernels, BSA dependency, block selection algorithm |
| docs/block_sparse_attn_interface.md | BSA (Block Sparse Attention) interface reference: function signatures, usage examples, constraints |
| docs/debugging_guide.md | PyTorch hooks for debugging, hook positions, tensor comparison, memory profiling |
| docs/optimization_guide.md | Performance optimizations: sgDMA (15x), Triton merge (4.3x), N-way pipeline (2x) |
| docs/known_issues.md | Documented bugs and fixes: partial last block bug, block size 4096 race condition |
| docs/ruler_benchmark_results_32k.md | RULER benchmark results (32K context): 13 tasks, 92.3% accuracy, CPU offload performance |
| docs/ruler_32k_chunked_offload_issue.md | ⚠️ OPEN ISSUE: 32K chunked offload accuracy problem (20% error rate in RULER) |
| docs/chunked_attention_solutions.md | 🔧 SOLUTIONS: Code analysis and proposed fixes for the chunked attention accuracy problem |

GPU Mutex for Multi-Instance Debugging

IMPORTANT: When running multiple Claude instances for parallel debugging, different rules apply based on script type:

Benchmarks (bench*.py) - Exclusive GPU Access Required

Before running any bench*.py script, Claude MUST wait for exclusive GPU access:

# Check and wait for GPU to be free
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
  echo "GPU busy, waiting 10s..."
  sleep 10
done
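
If the wait needs to happen from inside a Python driver rather than the shell, the same polling loop could be sketched as follows. This is a minimal sketch, not part of the repository: the function names `query_compute_pids` and `wait_for_free_gpu` are hypothetical, and the `query` and `sleep` parameters exist only so the loop can be exercised without a GPU.

```python
import subprocess
import time
from typing import Callable, List


def query_compute_pids() -> List[str]:
    """Return the PIDs nvidia-smi reports as running compute work."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def wait_for_free_gpu(
    query: Callable[[], List[str]] = query_compute_pids,
    poll_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until no compute process is listed.

    Returns how many polls found the GPU busy (hypothetical helper).
    """
    busy_polls = 0
    while query():
        busy_polls += 1
        print(f"GPU busy, waiting {poll_seconds:g}s...")
        sleep(poll_seconds)
    return busy_polls
```

Calling `wait_for_free_gpu()` with no arguments mirrors the shell loop above: it polls every 10 seconds and returns once nvidia-smi lists no compute processes.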

Other Scripts (tests, examples) - No Special Requirements

For non-benchmark scripts, exclusive GPU access is NOT required. Multiple nanovllm processes can run simultaneously on different GPUs - each process automatically selects a unique port for torch.distributed communication.
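
How nanovllm picks its port is not described here. One common technique, shown purely as an illustration and not as the project's actual implementation, is to bind a socket to port 0 and let the operating system assign an unused port. The helper name `find_free_port` is hypothetical.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0.

    Hypothetical helper: the port is released when the socket closes, so
    another process could claim it before it is reused (a small race window).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]
```

The returned port could then be handed to torch.distributed, for example through a `tcp://127.0.0.1:<port>` init method, so that concurrent processes do not collide on a fixed default port.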

Multi-Instance Development with PYTHONPATH

IMPORTANT: When running multiple Claude instances on different worktrees, do NOT use pip install -e . globally as it will affect other instances.

Use PYTHONPATH directly - no pip install needed:

# Set PYTHONPATH to point to the project root directory
PYTHONPATH=/path/to/your/worktree:$PYTHONPATH python <script.py>

# Example: running tests
PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py

Benefits:

  • No pip install required
  • Code changes take effect immediately (no reinstall needed)
  • Each worktree is completely isolated
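
To confirm that a process really picks up the code from the intended worktree, the import origin can be inspected before running anything expensive. The sketch below uses only the standard library; the helper names are hypothetical, and the package name `nanovllm` in the usage note is an assumption based on the process name mentioned earlier.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def module_origin(name: str) -> Optional[Path]:
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).resolve()


def is_loaded_from(name: str, worktree: str) -> bool:
    """True if the module resolves to a file somewhere under `worktree`."""
    origin = module_origin(name)
    if origin is None:
        return False
    return Path(worktree).resolve() in origin.parents
```

For example, `is_loaded_from("nanovllm", "/path/to/your/worktree")` should be true when PYTHONPATH is set as shown above, and false if a globally installed copy shadows the worktree.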

Configuration

| Parameter | Default | Notes |
|-----------|---------|-------|
| kvcache_block_size | 1024 | Tokens per block (4096 now works after the race condition fix) |
| max_num_batched_tokens | 16384 | Set equal to max_model_len for long context |
| gpu_memory_utilization | 0.9 | Fraction of GPU memory to use |
| enable_cpu_offload | False | Enable for long context |
| enforce_eager | False | Set True to disable CUDA graphs |
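
As an illustration of how these parameters fit together for a long-context run, the sketch below collects the defaults from the table and derives the long-context overrides. `EngineOptions` and `long_context_options` are hypothetical names, and whether the engine accepts these values as keyword arguments in exactly this form is an assumption.

```python
from dataclasses import asdict, dataclass


@dataclass
class EngineOptions:
    """Defaults copied from the configuration table (hypothetical container)."""

    kvcache_block_size: int = 1024
    max_num_batched_tokens: int = 16384
    gpu_memory_utilization: float = 0.9
    enable_cpu_offload: bool = False
    enforce_eager: bool = False


def long_context_options(max_model_len: int) -> dict:
    """Build options for long context: batch budget equals max_model_len, offload on."""
    opts = EngineOptions(
        max_num_batched_tokens=max_model_len,
        enable_cpu_offload=True,
    )
    kwargs = asdict(opts)
    kwargs["max_model_len"] = max_model_len
    return kwargs
```

`long_context_options(32768)` yields a dictionary with `max_num_batched_tokens` and `max_model_len` both set to 32768 and `enable_cpu_offload` set to True, leaving the other defaults unchanged.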

Benchmarking

Files: bench.py (GPU), bench_offload.py (CPU offload), bench_vllm.py (comparison)

Common Issues:

  1. max_num_batched_tokens < max_model_len: Set equal for long context
  2. CUDA graph dimension mismatch: Ensure input_len + output_len <= max_model_len
  3. RoPE out of bounds: Check model's max_position_embeddings in config.json
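
The three checks above can be run before launching a benchmark. The following is a minimal sketch with a hypothetical function name; it assumes the RoPE error arises when max_model_len exceeds the model's max_position_embeddings, which is an interpretation of item 3 rather than something stated in the source.

```python
from typing import List


def preflight_issues(
    input_len: int,
    output_len: int,
    max_model_len: int,
    max_num_batched_tokens: int,
    max_position_embeddings: int,
) -> List[str]:
    """Return a list of likely misconfigurations (empty if none are found)."""
    issues = []
    # Issue 1: the batch token budget should cover the full context length.
    if max_num_batched_tokens < max_model_len:
        issues.append("max_num_batched_tokens < max_model_len")
    # Issue 2: prompt plus generated tokens must fit in the model length.
    if input_len + output_len > max_model_len:
        issues.append("input_len + output_len > max_model_len")
    # Issue 3 (assumed cause): positions beyond the RoPE table are out of bounds.
    if max_model_len > max_position_embeddings:
        issues.append("max_model_len > max_position_embeddings")
    return issues
```

The value for `max_position_embeddings` would be read from the model's config.json, as noted in item 3.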

Model Limits:

  • Qwen3-0.6B/4B: 40960 tokens
  • Qwen2.5-7B-Instruct-1M: 1048576 tokens

Performance (Qwen3-0.6B):

  • GPU: ~18k tok/s (prefill), ~100 tok/s (decode)
  • CPU Offload (16K): ~14k tok/s (prefill)
  • CPU Offload (32K): ~13k tok/s (prefill)

Author: Zijie Tian