# CLAUDE.md

This file provides guidance to Claude Code when working with this repository.

## Overview
Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports multiple model architectures (Qwen3, Qwen2, Llama) with CPU offload for long-context inference.
## GPU Mutex for Multi-Instance Debugging

**IMPORTANT**: When running multiple Claude instances for parallel debugging, different rules apply depending on the script type:
### Benchmarks (`bench*.py`) - Exclusive GPU Access Required

Before running any `bench*.py` script, Claude MUST wait for exclusive GPU access:

```bash
# Check and wait for the GPU to be free
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
  echo "GPU busy, waiting 10s..."
  sleep 10
done
```
### Other Scripts (tests, examples) - No Special Requirements

For non-benchmark scripts, exclusive GPU access is NOT required. Multiple nanovllm processes can run simultaneously on different GPUs - each process automatically selects a unique port for torch.distributed communication.
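The unique-port selection can be reproduced with the standard bind-to-port-0 trick. The sketch below shows the general pattern only; nanovllm's actual port-picking code may differ:

```python
import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port - the usual way to pick a
    unique torch.distributed MASTER_PORT per process. Illustrative
    sketch, not nanovllm's actual implementation."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 = let the kernel choose
        return s.getsockname()[1]

port = find_free_port()
print(port)  # different for each concurrently running process
```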
## Multi-Instance Development with PYTHONPATH

**IMPORTANT**: When running multiple Claude instances on different worktrees, do NOT use `pip install -e .` globally, as it will affect other instances.
Use `PYTHONPATH` directly - no pip install needed:

```bash
# Set PYTHONPATH to point to the project root directory
PYTHONPATH=/path/to/your/worktree:$PYTHONPATH python <script.py>

# Example: running tests
PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
```
Benefits:
- No `pip install` required
- Code changes take effect immediately (no reinstall needed)
- Each worktree is completely isolated
## Documentation Index

| Document | Purpose |
|---|---|
| `docs/architecture_guide.md` | Core components, layer-wise CPU offload design, prefill/decode flows, implementation details |
| `docs/multi_model_support.md` | Model registry system, adding new models (Qwen3/Llama), architecture differences, RoPE scaling |
| `docs/cuda_graph_offload_guide.md` | CUDA graph support for the CPU offload decode path, 4x decode speedup |
| `docs/sparse_attention_guide.md` | Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow |
| `docs/block_sparse_attention_lib.md` | MIT-Han-Lab Block-Sparse-Attention library reference: sparse modes, API, performance |
| `docs/sparse_prefill_integration_plan.md` | Integration plan for MInference/XAttention/FlexPrefill with a unified BlockMask interface |
| `docs/sparse_offload_integration.md` | Sparse policy integration with layerwise offload, `requires_block_selection` interface design |
| `docs/layerwise_offload_memory_analysis.md` | Memory allocation analysis with theoretical formulas and empirical validation (< 5% error) |
| `docs/debugging_guide.md` | PyTorch hooks for debugging, tensor comparison, memory profiling |
| `docs/gpu_only_performance_issue.md` | GPU-only mode slower than offload due to PagedAttention scatter overhead, optimization proposals |
| `docs/offload_accuracy_issue.md` | BUG: CPU offload mode reaches 66% accuracy vs 100% without offload on the RULER NIAH benchmark |
| `docs/64k_memory_analysis.md` | 64k inference memory analysis: GPU-only vs offload, OOM root cause (fragmentation), RTX 3090 limitations |
| `docs/xattention_integration.md` | XAttention integration guide: algorithm, implementation, design decisions, and testing |
| `docs/xattention_analysis.md` | XAttention algorithm analysis: chunked estimation, block sparse attention, integration design |
| `docs/development_notes.md` | Development notes and scratchpad for ongoing work |
## Configuration

| Parameter | Default | Notes |
|---|---|---|
| `kvcache_block_size` | 4096 | Tokens per block |
| `max_num_batched_tokens` | 16384 | Set equal to `max_model_len` for long context |
| `gpu_memory_utilization` | 0.9 | GPU memory fraction |
| `enable_cpu_offload` | False | Enable for long context |
| `num_gpu_blocks` | 2 | GPU blocks for offload mode |
| `num_kv_buffers` | 4 | Ring buffer size (1-4); lower = less memory but slower decode |
| `enforce_eager` | False | Set True to disable CUDA graphs |
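For sizing `num_gpu_blocks` and `kvcache_block_size`, the resident GPU KV-cache footprint in offload mode is roughly `num_gpu_blocks * kvcache_block_size` tokens times the per-token KV size. The sketch below uses illustrative Qwen3-4B-like shapes (36 layers, 8 KV heads, head_dim 128, fp16) - these are assumptions for the back-of-envelope math, not values read from the model's `config.json`:

```python
# Rough GPU KV-cache footprint for offload mode (back-of-envelope only).
# Model shapes below are illustrative assumptions, not read from config.
num_layers, num_kv_heads, head_dim, dtype_bytes = 36, 8, 128, 2  # fp16
kvcache_block_size = 4096   # tokens per block (default above)
num_gpu_blocks = 2          # default above

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K + V
total_mib = num_gpu_blocks * kvcache_block_size * bytes_per_token / 2**20
print(bytes_per_token, total_mib)  # 147456 bytes/token, 1152.0 MiB resident
```

With these shapes, two 4096-token blocks keep about 1.15 GiB of KV cache on the GPU; the rest lives in CPU memory.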
## Benchmarking

Files: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (comparison)

Common Issues:
- `max_num_batched_tokens < max_model_len`: set them equal for long context
- CUDA graph dimension mismatch: ensure `input_len + output_len <= max_model_len`
- RoPE out of bounds: check the model's `max_position_embeddings` in `config.json`
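The first two pitfalls can be caught before launching a run with a small pre-flight check. `check_config` is a hypothetical helper written for this document, not part of nanovllm:

```python
def check_config(max_model_len: int, max_num_batched_tokens: int,
                 input_len: int, output_len: int) -> None:
    """Pre-flight checks for the common benchmark pitfalls listed above
    (hypothetical helper, not part of nanovllm)."""
    assert max_num_batched_tokens >= max_model_len, \
        "set max_num_batched_tokens = max_model_len for long context"
    assert input_len + output_len <= max_model_len, \
        "CUDA graph dimension mismatch: input_len + output_len > max_model_len"

# Passes silently for a valid long-context benchmark configuration:
check_config(max_model_len=40960, max_num_batched_tokens=40960,
             input_len=32768, output_len=256)
```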
Model Limits:
- Qwen3-0.6B/4B: 40960 tokens
- Qwen2.5-7B-Instruct-1M: 1048576 tokens
- Llama-3.1-8B-Instruct: 131072 tokens
- 64k on RTX 3090/4090 (24GB): requires CPU offload + optimizations, see `docs/64k_memory_analysis.md`
Performance (Qwen3-4B, CPU Offload):
- Prefill: ~5700-8000 tok/s (varies with context length)
- Decode with CUDA graph: ~50 tok/s (TPOT ~19 ms)
- Decode in eager mode: ~12 tok/s (TPOT ~80 ms)
- CUDA graph speedup: ~4x decode throughput
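Decode throughput and TPOT (time per output token) are reciprocals, so the figures above can be cross-checked directly:

```python
# tok/s = 1000 / TPOT_ms; sanity-checking the decode numbers above.
def tokens_per_second(tpot_ms: float) -> float:
    return 1000.0 / tpot_ms

graph = tokens_per_second(19)   # ~52.6 tok/s with CUDA graphs
eager = tokens_per_second(80)   # 12.5 tok/s in eager mode
print(round(graph, 1), round(eager, 1), round(graph / eager, 1))
```

The ratio 80/19 is about 4.2, consistent with the ~4x CUDA graph speedup reported above.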
Author: Zijie Tian