# CLAUDE.md
This file provides guidance to Claude Code when working with this repository.
## Overview
Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports Qwen3 models with CPU offload for long-context inference.
## Documentation Index
| Document | Purpose |
|---|---|
| docs/architecture_guide.md | Core components, CPU offload system design, ring buffer architecture, stream configuration |
| docs/sparse_policy_architecture.md | SparsePolicy abstraction: prefill/decode delegation, pipeline modes, policy implementations |
| docs/sparse_policy_implementation_guide.md | How to implement a custom SparsePolicy: required methods, hooks, ring buffer pipeline pattern |
| docs/sparse_attention_guide.md | Block sparse attention methods (XAttention, FlexPrefill, MInference, AvgPool, Quest), computation flow, algorithms |
| docs/xattention_algorithm_guide.md | XAttention algorithm deep dive: stride reshape, Triton kernels, BSA dependency, block selection algorithm |
| docs/xattn_kernels_guide.md | XAttention Triton kernels: flat_group_gemm (anti-diagonal summation), softmax_fuse_block_sum (block aggregation) |
| docs/xattn_chunked_prefill.md | XAttention chunked prefill: API, usage, consistency requirements |
| docs/xattn_bsa_policy_design.md | XAttention BSA Policy: algorithm design, performance benchmarks (128K), memory management, density statistics |
| docs/block_sparse_attn_interface.md | BSA (Block Sparse Attention) interface documentation: function signatures, usage examples, constraints |
| docs/debugging_guide.md | PyTorch hooks for debugging, hook positions, tensor comparison, memory profiling |
| docs/optimization_guide.md | Performance optimizations: sgDMA (15x), Triton merge (4.3x), N-way pipeline (2x) |
| docs/known_issues.md | Documented bugs and fixes: partial last block bug, block size 4096 race condition |
| docs/ruler_benchmark_results_32k.md | RULER benchmark results (32K context): 13 tasks, 92.3% accuracy, CPU offload performance |
| docs/ruler_32k_chunked_offload_issue.md | ⚠️ OPEN ISSUE: 32K chunked offload accuracy problem (20% error rate in RULER) |
| docs/chunked_attention_solutions.md | 🔧 SOLUTIONS: code analysis and fixes for the chunked attention accuracy problem |
| docs/nsys_wrong_event_order_bug.md | 🐛 NSYS BUG: debugging notes on the ring buffer pipeline triggering out-of-order nsys timestamps |
| docs/cpu_scheduling_latency_analysis.md | ⚡ PERF: CPU scheduling latency analysis, sources of inter-kernel gaps, directions for improving GPU utilization |
| docs/bench_offload_results.md | 📊 BENCH: CPU offload benchmark results, Full vs XAttention comparison (32K/128K) |
| docs/cpu_offload_optimization_strategies.md | 🚀 OPT: CPU offload optimization strategies: chunk size, CUDA Graph, recent research (InfiniGen/ShadowKV) |
| docs/gpu_only_xattn_guide.md | 🚀 GPU-Only XAttention: memory preallocation, performance analysis (32K +15%, 64K +41%), CUDA Graph limitations |
## Rules Index

| Rule | Purpose |
|---|---|
| .claude/rules/multi-gpu-debugging.md | Multi-GPU debugging: GPU allocation (1-2 for validation, rest for exploration), single-task validation policy |
| .claude/rules/gpu-testing.md | GPU type detection, card assignment, needle test requirements |
| .claude/rules/sparse-policy.md | SparsePolicy implementation requirements |
| .claude/rules/planning-with-files.md | Planning file management for complex tasks |
| .claude/rules/gpu-monitor.md | GPU memory monitoring: must use the gpu-monitor agent; manual nvidia-smi polling loops are forbidden |
## GPU Mutex for Multi-Instance Debugging

**IMPORTANT:** When running multiple Claude instances for parallel debugging, different rules apply depending on the script type:

### Benchmarks (bench*.py) - Exclusive GPU Access Required

Before running any bench*.py script, Claude MUST wait for exclusive GPU access:
```bash
# Check and wait for the GPU to be free
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
    echo "GPU busy, waiting 10s..."
    sleep 10
done
```
### Other Scripts (tests, examples) - No Special Requirements
For non-benchmark scripts, exclusive GPU access is NOT required. Multiple nanovllm processes can run simultaneously on different GPUs - each process automatically selects a unique port for torch.distributed communication.
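The standard way to obtain such a port (a sketch of the general technique; nanovllm's actual selection code may differ) is to bind to port 0 and let the OS assign a free ephemeral port:

```python
import socket

def find_free_port() -> int:
    # Binding to port 0 asks the OS for any free ephemeral port.
    # The port is released on close and can then be passed to
    # torch.distributed (e.g. via MASTER_PORT) before workers start.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

There is a small race window between releasing the port and the distributed backend rebinding it, which is acceptable for local debugging runs.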
## Multi-Instance Development with PYTHONPATH

**IMPORTANT:** When running multiple Claude instances on different worktrees, do NOT use `pip install -e .` globally, as it will affect the other instances.

Use PYTHONPATH directly; no pip install is needed:
```bash
# Set PYTHONPATH to point to the project root directory
PYTHONPATH=/path/to/your/worktree:$PYTHONPATH python <script.py>

# Example: running tests
PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
```
Benefits:
- No `pip install` required
- Code changes take effect immediately (no reinstall needed)
- Each worktree is completely isolated
## Configuration
| Parameter | Default | Notes |
|---|---|---|
| `kvcache_block_size` | 1024 | Tokens per block (4096 now works after the race condition fix) |
| `max_num_batched_tokens` | 16384 | Set equal to `max_model_len` for long context |
| `gpu_memory_utilization` | 0.9 | Fraction of GPU memory to use |
| `enable_cpu_offload` | False | Enable for long context |
| `enforce_eager` | False | Set True to disable CUDA graphs |
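As a sketch, a 32K-context CPU-offload run would override the defaults above like this. The keyword names mirror the table; verify them against the actual constructor signature before use:

```python
# Hypothetical kwargs for a 32K-context CPU-offload run.
# Names mirror the Configuration table above; the real constructor
# may spell them differently.
llm_kwargs = dict(
    kvcache_block_size=1024,
    max_model_len=32768,
    max_num_batched_tokens=32768,  # set equal to max_model_len
    gpu_memory_utilization=0.9,
    enable_cpu_offload=True,       # required for long context
    enforce_eager=False,
)

# Long-context rule from the table: the batched-token budget must
# cover the full model length.
assert llm_kwargs["max_num_batched_tokens"] == llm_kwargs["max_model_len"]
```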
## Benchmarking

Files: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (comparison)
**Offload Mode Constraint:** When using `enable_cpu_offload=True`, only test with context length ≥ 32K. Shorter contexts don't exercise the chunked offload pipeline properly.
**Common Issues:**
- `max_num_batched_tokens < max_model_len`: set them equal for long context
- CUDA graph dimension mismatch: ensure `input_len + output_len <= max_model_len`
- RoPE out of bounds: check the model's `max_position_embeddings` in config.json
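The first two checks above can be collected into a small helper before launching a run (an illustrative sketch, not part of the codebase; the RoPE check additionally requires reading `max_position_embeddings` from the model's config.json):

```python
def check_run_config(input_len: int, output_len: int,
                     max_model_len: int,
                     max_num_batched_tokens: int) -> list:
    """Return a list of problems matching the common issues above."""
    problems = []
    if max_num_batched_tokens < max_model_len:
        problems.append("max_num_batched_tokens < max_model_len: "
                        "set them equal for long context")
    if input_len + output_len > max_model_len:
        problems.append("input_len + output_len exceeds max_model_len "
                        "(CUDA graph dimension mismatch)")
    return problems
```

For example, `check_run_config(30000, 3000, 32768, 16384)` flags both issues, while a consistent config returns an empty list.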
**Model Limits:**
- Qwen3-0.6B/4B: 40960 tokens
- Qwen2.5-7B-Instruct-1M: 1048576 tokens
**Performance (Qwen3-0.6B):**
- GPU: ~18k tok/s (prefill), ~100 tok/s (decode)
- CPU Offload (16K): ~14k tok/s (prefill)
- CPU Offload (32K): ~13k tok/s (prefill)
Author: Zijie Tian