
Zijie Tian 2771312565 [docs] Add sparse prefill integration plan from int-minference analysis

Consolidated analysis from int-minference-1/2/3 branches into a unified
integration plan for MInference, XAttention, and FlexPrefill strategies.

Key design decisions:
- Backward compatible: Keep existing SparsePolicy interface
- Unified BlockMask intermediate representation for new strategies
- XAttention/FlexPrefill use block_sparse_attn_func kernel
- MInference can optionally use block_sparse_attn (Phase 4)

Five-phase implementation plan:
1. BlockMask + block_sparse_attn wrapper
2. XAttention implementation
3. FlexPrefill implementation
4. Optional MInference refactoring
5. Integration and testing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-10 23:33:09 +08:00

4.7 KiB


CLAUDE.md

This file provides guidance to Claude Code when working with this repository.

Overview

Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. It supports multiple model architectures (Qwen3, Qwen2, Llama) and offers CPU offload for long-context inference.

GPU Mutex for Multi-Instance Debugging

IMPORTANT: When running multiple Claude instances for parallel debugging, different rules apply based on script type:

Benchmarks (`bench*.py`) - Exclusive GPU Access Required

Before running any bench*.py script, Claude MUST wait for exclusive GPU access:

# Check and wait for GPU to be free
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
  echo "GPU busy, waiting 10s..."
  sleep 10
done
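
For benchmark drivers written in Python, the same wait can be sketched as a small helper. This is an illustrative sketch, not part of nanovllm: `gpu_compute_pids` and `wait_for_free_gpu` are hypothetical names, and the optional `timeout_s` is an added assumption (the shell loop above waits indefinitely, which is also the default here).

```python
import subprocess
import time


def gpu_compute_pids():
    """Return the PIDs (as strings) of processes holding a compute context.

    Runs the same nvidia-smi query as the shell loop.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def wait_for_free_gpu(query=gpu_compute_pids, poll_s=10.0,
                      timeout_s=None, sleep=time.sleep):
    """Block until `query()` reports no compute processes.

    Returns True once the GPU is free, or False if `timeout_s`
    (hypothetical, optional) elapses first.
    """
    waited = 0.0
    while query():
        if timeout_s is not None and waited >= timeout_s:
            return False
        print(f"GPU busy, waiting {poll_s:g}s...")
        sleep(poll_s)
        waited += poll_s
    return True
```

Passing `query` and `sleep` as parameters keeps the helper testable on a machine without a GPU.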

Other Scripts (tests, examples) - Port Conflict Check Only

For non-benchmark scripts, exclusive GPU access is NOT required. However, check for distributed port conflicts before running:

# Wait until port 29500 (default torch distributed port) is free
while lsof -i :29500 >/dev/null 2>&1; do
  echo "Port 29500 in use, waiting 10s..."
  sleep 10
done

Note: nanovllm's distributed port handling is not yet robust; two processes competing for the same port will cause errors. This check prevents that issue.
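
Where `lsof` is unavailable, a rough equivalent can be written with the Python standard library. This is a sketch under assumptions: `port_in_use` and `wait_for_port` are hypothetical helpers, and they only detect a listener on the local host, whereas `lsof -i` also reports other sockets involving the port.

```python
import socket
import time


def port_in_use(port=29500, host="127.0.0.1"):
    """Return True if a TCP listener accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return s.connect_ex((host, port)) == 0


def wait_for_port(port=29500, poll_s=10.0, sleep=time.sleep):
    """Poll until nothing is listening on `port`."""
    while port_in_use(port):
        print(f"Port {port} in use, waiting {poll_s:g}s...")
        sleep(poll_s)
```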

Multi-Instance Development with PYTHONPATH

IMPORTANT: When running multiple Claude instances on different worktrees, do NOT use pip install -e . globally as it will affect other instances.

Use PYTHONPATH directly - no pip install needed:

# Set PYTHONPATH to point to the project root directory
PYTHONPATH=/path/to/your/worktree:$PYTHONPATH python <script.py>

# Example: running tests
PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py

Benefits:

- No pip install required
- Code changes take effect immediately (no reinstall needed)
- Each worktree is completely isolated
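
To confirm that a script really picks up the intended worktree, the import origin can be inspected before running anything heavy. The helper below is a minimal sketch; `module_comes_from` is a hypothetical name, and the usage note assumes the package imports as `nanovllm`.

```python
import importlib.util
from pathlib import Path


def module_comes_from(module_name, root):
    """Return True if `module_name` would be imported from under `root`."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        return False  # module not found, or it has no file on disk
    origin = Path(spec.origin).resolve()
    return Path(root).resolve() in origin.parents
```

For example, `module_comes_from("nanovllm", "/path/to/your/worktree")` should return True when `PYTHONPATH` is set as shown, and False if a globally installed copy shadows the worktree.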

Documentation Index

Document	Purpose
`docs/architecture_guide.md`	Core components, layer-wise CPU offload design, prefill/decode flows, implementation details
`docs/multi_model_support.md`	Model registry system, adding new models (Qwen3/Llama), architecture differences, RoPE scaling
`docs/cuda_graph_offload_guide.md`	CUDA graph support for CPU offload decode path, 4x decode speedup
`docs/sparse_attention_guide.md`	Block sparse attention methods (MInference, FlexPrefill, XAttention, Quest), computation flow
`docs/sparse_prefill_integration_plan.md`	Integration plan for MInference/XAttention/FlexPrefill with unified BlockMask interface
`docs/sparse_offload_integration.md`	Sparse policy integration with layerwise offload, `requires_block_selection` interface design
`docs/layerwise_offload_memory_analysis.md`	Memory allocation analysis with theoretical formulas and empirical validation (< 5% error)
`docs/debugging_guide.md`	PyTorch hooks for debugging, tensor comparison, memory profiling
`docs/gpu_only_performance_issue.md`	GPU-only mode slower than offload due to PagedAttention scatter overhead, optimization proposals

Configuration

Parameter	Default	Notes
`kvcache_block_size`	4096	Tokens per block
`max_num_batched_tokens`	16384	Set equal to `max_model_len` for long context
`gpu_memory_utilization`	0.9	GPU memory fraction
`enable_cpu_offload`	False	Enable for long context
`num_gpu_blocks`	2	GPU blocks for offload mode
`num_kv_buffers`	4	Ring buffer size for decode pipeline
`enforce_eager`	False	Set True to disable CUDA graphs
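
Because `kvcache_block_size` is the number of tokens per block, the block count for a given context length is a ceiling division. The sketch below only illustrates that arithmetic; `blocks_needed` is a hypothetical helper, not a nanovllm function.

```python
def blocks_needed(num_tokens, kvcache_block_size=4096):
    """Number of KV-cache blocks required to hold `num_tokens` tokens."""
    # Ceiling division using integers only.
    return -(-num_tokens // kvcache_block_size)


print(blocks_needed(40960))    # 10
print(blocks_needed(131072))   # 32
print(blocks_needed(4097))     # 2 (one token spills into a second block)
```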

Benchmarking

Files: bench.py (GPU), bench_offload.py (CPU offload), bench_vllm.py (comparison)

Common Issues:

- max_num_batched_tokens < max_model_len: Set equal for long context
- CUDA graph dimension mismatch: Ensure input_len + output_len <= max_model_len
- RoPE out of bounds: Check model's max_position_embeddings in config.json
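
The three checks above can be run before launching a benchmark. This is a hedged sketch: `preflight` is a hypothetical helper, its argument names mirror the parameters mentioned in this file, and it assumes a Hugging Face style `config.json` containing `max_position_embeddings` in the model directory.

```python
import json
from pathlib import Path


def preflight(input_len, output_len, max_model_len,
              max_num_batched_tokens, model_dir=None):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    # Issue 1: batch budget smaller than the context window.
    if max_num_batched_tokens < max_model_len:
        problems.append("max_num_batched_tokens < max_model_len")
    # Issue 2: prompt plus generation exceeds the context window.
    if input_len + output_len > max_model_len:
        problems.append("input_len + output_len > max_model_len")
    # Issue 3: context window exceeds the model's position limit.
    if model_dir is not None:
        cfg = json.loads((Path(model_dir) / "config.json").read_text())
        limit = cfg.get("max_position_embeddings")
        if limit is not None and max_model_len > limit:
            problems.append(
                f"max_model_len > max_position_embeddings ({limit})")
    return problems
```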

Model Limits:

- Qwen3-0.6B/4B: 40960 tokens
- Qwen2.5-7B-Instruct-1M: 1048576 tokens
- Llama-3.1-8B-Instruct: 131072 tokens

Performance (Qwen3-4B, CPU Offload):

- Prefill: ~5700-8000 tok/s (varies by context length)
- Decode with CUDA Graph: ~50 tok/s (TPOT ~19ms)
- Decode Eager Mode: ~12 tok/s (TPOT ~80ms)
- CUDA Graph speedup: ~4x decode throughput
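
The throughput and TPOT (time per output token) figures above are two views of the same measurement, since tok/s = 1000 / TPOT in ms. A quick check of the arithmetic, using only the numbers listed here:

```python
def tokens_per_second(tpot_ms):
    """Convert time per output token (milliseconds) to tokens per second."""
    return 1000.0 / tpot_ms


graph = tokens_per_second(19)   # about 52.6 tok/s
eager = tokens_per_second(80)   # 12.5 tok/s
speedup = graph / eager         # about 4.2x

print(f"{graph:.1f} tok/s, {eager:.1f} tok/s, {speedup:.1f}x")
```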

Author: Zijie Tian
