
Zijie Tian 6180055ed8 📝 docs: add chunked attention solutions guide and update doc index

Add comprehensive documentation analyzing the 32K chunked offload
accuracy issues with proposed solutions covering LSE precision,
ring buffer state management, and position encoding validation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-20 04:48:20 +08:00

4.7 KiB


CLAUDE.md

This file provides guidance to Claude Code when working with this repository.

Overview

Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. It supports Qwen3 models, with CPU offload for long-context inference.

Documentation Index

Document	Purpose
`docs/architecture_guide.md`	Core components, CPU offload system design, ring buffer architecture, stream configuration
`docs/sparse_policy_architecture.md`	SparsePolicy abstraction: prefill/decode delegation, pipeline modes, policy implementations
`docs/sparse_policy_implementation_guide.md`	How to implement custom SparsePolicy: required methods, hooks, ring buffer pipeline pattern
`docs/sparse_attention_guide.md`	Block sparse attention methods (XAttention, FlexPrefill, MInference, AvgPool, Quest), computation flow, algorithms
`docs/xattention_algorithm_guide.md`	XAttention algorithm in detail: stride reshape, Triton kernels, BSA dependency, block selection algorithm
`docs/block_sparse_attn_interface.md`	BSA (Block Sparse Attention) interface documentation: function signatures, usage examples, constraints
`docs/debugging_guide.md`	PyTorch hooks for debugging, hook positions, tensor comparison, memory profiling
`docs/optimization_guide.md`	Performance optimizations: sgDMA (15x), Triton merge (4.3x), N-way pipeline (2x)
`docs/known_issues.md`	Documented bugs and fixes: partial last block bug, block size 4096 race condition
`docs/ruler_benchmark_results_32k.md`	RULER benchmark results (32K context): 13 tasks, 92.3% accuracy, CPU offload performance
`docs/ruler_32k_chunked_offload_issue.md`	⚠️ OPEN ISSUE: 32K chunked offload accuracy problem (20% error rate in RULER)
`docs/chunked_attention_solutions.md`	🔧 SOLUTIONS: Code analysis and solutions for the chunked attention accuracy problem

GPU Mutex for Multi-Instance Debugging

IMPORTANT: When running multiple Claude instances for parallel debugging, different rules apply based on script type:

Benchmarks (`bench*.py`) - Exclusive GPU Access Required

Before running any bench*.py script, Claude MUST wait for exclusive GPU access:

# Check and wait for GPU to be free
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
  echo "GPU busy, waiting 10s..."
  sleep 10
done
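For benchmark drivers written in Python, the same wait can be sketched as below. This is a minimal sketch, not part of the repository: the function names `query_gpu_compute_pids` and `wait_for_free_gpu` are hypothetical, and the only external command used is the `nvidia-smi` query from the shell loop above. The query and sleep functions are injectable so the loop can be exercised without a GPU.

```python
import subprocess
import time
from typing import Callable, List


def query_gpu_compute_pids() -> List[str]:
    """Return the PIDs nvidia-smi reports as running compute work (hypothetical helper)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    # One PID per line; drop blank lines.
    return [line.strip() for line in out.splitlines() if line.strip()]


def wait_for_free_gpu(
    query: Callable[[], List[str]] = query_gpu_compute_pids,
    poll_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until no compute process is listed.

    Returns the number of polls that found the GPU busy.
    """
    busy_polls = 0
    while query():
        busy_polls += 1
        print(f"GPU busy, waiting {poll_seconds:g}s...")
        sleep(poll_seconds)
    return busy_polls
```

Calling `wait_for_free_gpu()` at the top of a benchmark script would mirror the shell loop. Like the shell version, it only observes the GPU and does not reserve it, so two instances that stop waiting at the same moment can still collide.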

Other Scripts (tests, examples) - No Special Requirements

For non-benchmark scripts, exclusive GPU access is NOT required. Multiple nanovllm processes can run simultaneously on different GPUs - each process automatically selects a unique port for torch.distributed communication.
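This file does not show how the port is chosen. One common way to obtain an unused port, shown here only as an illustration and not as nanovllm's actual mechanism, is to bind a socket to port 0 and let the operating system assign one. The helper name `pick_free_port` is hypothetical.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port (illustrative helper, not project code)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Port 0 tells the OS to choose any free port.
        s.bind((host, 0))
        return s.getsockname()[1]
```

The port is released when the socket closes, so another process could claim it before it is reused. That window is usually small enough for local debugging sessions.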

Multi-Instance Development with PYTHONPATH

IMPORTANT: When running multiple Claude instances on different worktrees, do NOT use pip install -e . globally as it will affect other instances.

Use PYTHONPATH directly - no pip install needed:

# Set PYTHONPATH to point to the project root directory
PYTHONPATH=/path/to/your/worktree:$PYTHONPATH python <script.py>

# Example: running tests
PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py

Benefits:

No pip install required
Code changes take effect immediately (no reinstall needed)
Each worktree is completely isolated
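To confirm that a script picks up the intended worktree, it can help to check where a module resolves from. The sketch below uses the standard library's `importlib.util.find_spec`. The helper name `module_origin` is hypothetical, and the import name `nanovllm` in the usage note is assumed from the process name mentioned earlier in this file.

```python
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the file a top-level module would be loaded from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin
```

Running `print(module_origin("nanovllm"))` with `PYTHONPATH` set as above should print a path inside the current worktree. A path under `site-packages` would suggest a global install is shadowing it.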

Configuration

Parameter	Default	Notes
`kvcache_block_size`	1024	Tokens per block (4096 now works after race condition fix)
`max_num_batched_tokens`	16384	Set equal to max_model_len for long context
`gpu_memory_utilization`	0.9	GPU memory fraction
`enable_cpu_offload`	False	Enable for long context
`enforce_eager`	False	Set True to disable CUDA graphs
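The defaults in the table can be mirrored in a small dataclass for experiments. This is a sketch only: `ConfigSketch` and `for_long_context` are hypothetical names, not the project's configuration class, and only the parameter names and default values come from the table.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ConfigSketch:
    """Hypothetical mirror of the table above; not the project's real config class."""

    kvcache_block_size: int = 1024        # tokens per block
    max_num_batched_tokens: int = 16384
    gpu_memory_utilization: float = 0.9   # fraction of GPU memory to use
    enable_cpu_offload: bool = False
    enforce_eager: bool = False           # True disables CUDA graphs


def for_long_context(base: ConfigSketch, max_model_len: int) -> ConfigSketch:
    """Apply the long-context notes: batch budget equals max_model_len, offload enabled."""
    return replace(
        base,
        max_num_batched_tokens=max_model_len,
        enable_cpu_offload=True,
    )
```

For example, `for_long_context(ConfigSketch(), 32768)` keeps the default block size and memory fraction while raising the batch budget to 32768 and turning on CPU offload.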

Benchmarking

Files: bench.py (GPU), bench_offload.py (CPU offload), bench_vllm.py (comparison)

Common Issues:

max_num_batched_tokens < max_model_len: Set max_num_batched_tokens equal to max_model_len for long context
CUDA graph dimension mismatch: Ensure input_len + output_len <= max_model_len
RoPE out of bounds: Check model's max_position_embeddings in config.json
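The three checks above can be run as a pre-flight step before launching a benchmark. The function below is a hypothetical sketch, not code from the repository. It takes plain integers, so it makes no assumptions about how the project stores its configuration.

```python
from typing import List


def preflight_problems(
    input_len: int,
    output_len: int,
    max_model_len: int,
    max_num_batched_tokens: int,
    max_position_embeddings: int,
) -> List[str]:
    """Return a description of each common issue the given settings would trigger."""
    problems = []
    if max_num_batched_tokens < max_model_len:
        problems.append(
            "max_num_batched_tokens < max_model_len: set them equal for long context"
        )
    if input_len + output_len > max_model_len:
        problems.append(
            "input_len + output_len > max_model_len: CUDA graph dimension mismatch"
        )
    if max_model_len > max_position_embeddings:
        problems.append(
            "max_model_len > max_position_embeddings: RoPE out of bounds"
        )
    return problems
```

An empty list means none of the three conditions was hit. It does not guarantee that the run will succeed.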

Model Limits:

Qwen3-0.6B/4B: 40960 tokens
Qwen2.5-7B-Instruct-1M: 1048576 tokens
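A model's limit can be read from its `config.json` rather than memorized. The sketch below assumes a local model directory containing a `config.json` with a `max_position_embeddings` key, as referenced under Common Issues. The helper name `read_max_position_embeddings` is hypothetical.

```python
import json
from pathlib import Path
from typing import Union


def read_max_position_embeddings(model_dir: Union[str, Path]) -> int:
    """Read max_position_embeddings from <model_dir>/config.json."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Raises KeyError if the key is absent, which is preferable to guessing a limit.
    return int(config["max_position_embeddings"])
```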

Performance (Qwen3-0.6B):

GPU: ~18k tok/s (prefill), ~100 tok/s (decode)
CPU Offload (16K): ~14k tok/s (prefill)
CPU Offload (32K): ~13k tok/s (prefill)

Author: Zijie Tian
