
Zijie Tian 39d12a0416 📈 feat: add MemoryObserver for GPU-CPU communication tracking

Implement MemoryObserver to track memory transfers between GPU and CPU:
- H2D (Host to Device): CPU → GPU transfers
- D2H (Device to Host): GPU → CPU transfers
- D2D (Device to Device): GPU buffer copies
- Supports prefill/decode phase separation

Integration points in offload_engine.py:
- load_to_slot_layer: H2D with is_prefill parameter
- offload_slot_layer_to_cpu, offload_prefill_buffer_async: D2H
- write_to_prefill_buffer, write_to_decode_buffer: D2D
- load_block_sample_from_cpu, load_block_full_from_cpu: H2D

Add bench_offload.py integration for memory stats printing.

Benchmark results (Llama-3.1-8B, 64K context):
- Full Policy: Prefill H2D 262.13 GB
- XAttention: Prefill H2D 386.62 GB (1.48x)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>

2026-01-28 04:06:45 +08:00


CLAUDE.md

This file provides guidance to Claude Code when working with this repository.

Overview

Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. It supports Qwen3 models and uses CPU offload for long-context inference.

Documentation Index

Document	Purpose
`docs/architecture_guide.md`	Core components, CPU offload system design, ring buffer architecture, stream configuration
`docs/sparse_policy_architecture.md`	SparsePolicy abstraction: prefill/decode delegation, pipeline modes, policy implementations
`docs/sparse_policy_implementation_guide.md`	How to implement custom SparsePolicy: required methods, hooks, ring buffer pipeline pattern
`docs/sparse_attention_guide.md`	Block sparse attention methods (XAttention, FlexPrefill, MInference, AvgPool, Quest), computation flow, algorithms
`docs/xattention_algorithm_guide.md`	XAttention algorithm in detail: stride reshape, Triton kernels, BSA dependency, block selection algorithm
`docs/xattn_kernels_guide.md`	XAttention Triton kernels: flat_group_gemm (anti-diagonal summation), softmax_fuse_block_sum (block aggregation)
`docs/xattn_chunked_prefill.md`	XAttention chunked prefill: API, usage, consistency requirements
`docs/xattn_bsa_policy_design.md`	XAttention BSA Policy: algorithm design, performance benchmarks (128K), memory management, density statistics
`docs/block_sparse_attn_interface.md`	BSA (Block Sparse Attention) interface documentation: function signatures, usage examples, constraints
`docs/debugging_guide.md`	PyTorch hooks for debugging, hook positions, tensor comparison, memory profiling
`docs/optimization_guide.md`	Performance optimizations: sgDMA (15x), Triton merge (4.3x), N-way pipeline (2x)
`docs/known_issues.md`	Documented bugs and fixes: partial last block bug, block size 4096 race condition
`docs/ruler_benchmark_results_32k.md`	RULER benchmark results (32K context): 13 tasks, 92.3% accuracy, CPU offload performance
`docs/ruler_32k_chunked_offload_issue.md`	⚠️ OPEN ISSUE: 32K chunked offload accuracy problem (20% error rate in RULER)
`docs/chunked_attention_solutions.md`	🔧 SOLUTIONS: Code analysis and solutions for the chunked attention accuracy problem
`docs/nsys_wrong_event_order_bug.md`	🐛 NSYS BUG: Debugging notes on the ring buffer pipeline triggering out-of-order nsys timestamps
`docs/cpu_scheduling_latency_analysis.md`	⚡ PERF: CPU scheduling latency analysis, sources of inter-kernel gaps, directions for improving GPU utilization
`docs/bench_offload_results.md`	📊 BENCH: CPU offload performance test results, Full vs XAttention comparison (32K/128K)
`docs/cpu_offload_optimization_strategies.md`	🚀 OPT: CPU offload optimization strategies: chunk size, CUDA Graph, recent research (InfiniGen/ShadowKV)
`docs/gpu_only_xattn_guide.md`	🚀 GPU-Only XAttention: memory preallocation, performance analysis (32K +15%, 64K +41%), CUDA Graph limitations
`docs/xattn_performance_analysis.md`	📊 XAttention performance analysis: NVTX markers, effect of block size, estimate vs compute time comparison
`docs/observer_architecture.md`	📊 Observer architecture: InferenceObserver (TTFT/TPOT), MemoryObserver (H2D/D2H/D2D) design
`docs/memory_communication_benchmark.md`	📊 Communication volume tests: Full vs XAttention communication volume comparison (32K/64K), per-phase statistics

Rules Index

Rule	Purpose
`.claude/rules/multi-gpu-debugging.md`	Multi-GPU debugging: GPU allocation (1-2 for validation, rest for exploration), single-task validation policy
`.claude/rules/gpu-testing.md`	GPU type detection, card assignment, needle test requirements
`.claude/rules/sparse-policy.md`	SparsePolicy implementation requirements
`.claude/rules/planning-with-files.md`	Planning file management for complex tasks
`.claude/rules/gpu-monitor.md`	GPU memory monitoring: must use the gpu-monitor agent; manual nvidia-smi loops are prohibited

GPU Mutex for Multi-Instance Debugging

IMPORTANT: When running multiple Claude instances for parallel debugging, different rules apply based on script type:

Benchmarks (`bench*.py`) - Exclusive GPU Access Required

Before running any bench*.py script, Claude MUST wait for exclusive GPU access:

# Check and wait for GPU to be free
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
  echo "GPU busy, waiting 10s..."
  sleep 10
done
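
For scripts that launch benchmarks from Python, the same wait loop can be sketched as below. This is a minimal sketch, not part of the repository: the function names `query_gpu_compute_pids` and `wait_for_free_gpu` are hypothetical, and the only external call is the same `nvidia-smi` query used in the shell loop above. The query function is injectable so the polling logic can be exercised without a GPU.

```python
import subprocess
import time
from typing import Callable, List


def query_gpu_compute_pids() -> List[str]:
    """Return the PIDs that nvidia-smi reports as running compute work."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def wait_for_free_gpu(
    query: Callable[[], List[str]] = query_gpu_compute_pids,
    poll_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until the query reports no compute processes.

    Returns the number of polls that found the GPU busy.
    """
    busy_polls = 0
    while query():
        print(f"GPU busy, waiting {poll_seconds:g}s...")
        sleep(poll_seconds)
        busy_polls += 1
    return busy_polls
```

Calling `wait_for_free_gpu()` with no arguments right before starting a `bench*.py` run mirrors the shell loop. Like the shell version, it only checks that the GPU is idle at that moment and does not reserve it.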

Other Scripts (tests, examples) - No Special Requirements

For non-benchmark scripts, exclusive GPU access is NOT required. Multiple nanovllm processes can run simultaneously on different GPUs - each process automatically selects a unique port for torch.distributed communication.
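
One common way to pick a unique port is to ask the operating system for a free one by binding to port 0. The sketch below illustrates that technique only. It is an assumption about how such a selection could work, not a description of nanovllm's actual implementation, and `find_free_port` is a hypothetical helper name.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port on the given host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 tells the OS to choose any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


# Possible use with torch.distributed (not executed here):
#   port = find_free_port()
#   init_method = f"tcp://127.0.0.1:{port}"
```

The port is released when the socket closes, so another process could claim it before it is reused. For a handful of local processes this race is usually acceptable.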

Multi-Instance Development with PYTHONPATH

IMPORTANT: When running multiple Claude instances on different worktrees, do NOT use pip install -e . globally as it will affect other instances.

Use PYTHONPATH directly - no pip install needed:

# Set PYTHONPATH to point to the project root directory
PYTHONPATH=/path/to/your/worktree:$PYTHONPATH python <script.py>

# Example: running tests
PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py

Benefits:

No pip install required
Code changes take effect immediately (no reinstall needed)
Each worktree is completely isolated
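
To confirm that an instance really imports from its own worktree, a quick check of the module's file location can help. This is a minimal sketch with hypothetical helper names (`module_origin`, `is_loaded_from`). The package name `nanovllm` in the usage comment is an assumption based on the process name mentioned above.

```python
import importlib
from pathlib import Path


def module_origin(module_name: str) -> Path:
    """Return the resolved path of the file a module was imported from."""
    module = importlib.import_module(module_name)
    return Path(module.__file__).resolve()


def is_loaded_from(module_name: str, root: str) -> bool:
    """True if the module's file lives somewhere under the given root directory."""
    return Path(root).resolve() in module_origin(module_name).parents


# Possible use inside a worktree (package name is an assumption):
#   assert is_loaded_from("nanovllm", "/path/to/your/worktree")
```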

Configuration

Parameter	Default	Notes
`kvcache_block_size`	1024	Tokens per block (4096 now works after race condition fix)
`max_num_batched_tokens`	16384	Set equal to max_model_len for long context
`gpu_memory_utilization`	0.9	GPU memory fraction
`enable_cpu_offload`	False	Enable for long context
`enforce_eager`	False	Set True to disable CUDA graphs
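
The defaults in the table, and the long-context adjustments its notes describe, can be written down as a small sketch. The class `OffloadConfigSketch` and the helper `for_long_context` are hypothetical and are not nanovllm's real configuration class. `max_model_len` has no default in the table, so it is left as a required field.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OffloadConfigSketch:
    """Illustrative mirror of the documented parameters and their defaults."""

    max_model_len: int  # no default is documented, so the caller must supply it
    kvcache_block_size: int = 1024
    max_num_batched_tokens: int = 16384
    gpu_memory_utilization: float = 0.9
    enable_cpu_offload: bool = False
    enforce_eager: bool = False


def for_long_context(cfg: OffloadConfigSketch) -> OffloadConfigSketch:
    """Apply the table's long-context notes: enable offload and match the batch budget."""
    return replace(
        cfg,
        enable_cpu_offload=True,
        max_num_batched_tokens=cfg.max_model_len,
    )
```

For example, `for_long_context(OffloadConfigSketch(max_model_len=32768))` yields a configuration with CPU offload enabled and `max_num_batched_tokens` raised to 32768, while the other defaults stay unchanged.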

Benchmarking

Files: bench.py (GPU), bench_offload.py (CPU offload), bench_vllm.py (comparison)

Offload Mode Constraint: When using enable_cpu_offload=True, only test with context length ≥ 32K. Shorter contexts don't exercise the chunked offload pipeline properly.

Common Issues:

max_num_batched_tokens < max_model_len: Set the two equal for long context
CUDA graph dimension mismatch: Ensure input_len + output_len <= max_model_len
RoPE out of bounds: Check model's max_position_embeddings in config.json
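
The three issues above can be caught before launching a benchmark with a small pre-flight check. This is a sketch under assumptions: `preflight_problems` and `read_max_position_embeddings` are hypothetical helpers, not part of the repository, and the only external detail relied on is the `max_position_embeddings` key in the model's config.json mentioned above.

```python
import json
from pathlib import Path
from typing import List


def read_max_position_embeddings(model_dir: str) -> int:
    """Read max_position_embeddings from the model directory's config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return int(config["max_position_embeddings"])


def preflight_problems(
    max_model_len: int,
    max_num_batched_tokens: int,
    input_len: int,
    output_len: int,
    max_position_embeddings: int,
) -> List[str]:
    """Return a description of each documented issue the settings would trigger."""
    problems = []
    if max_num_batched_tokens < max_model_len:
        problems.append("max_num_batched_tokens < max_model_len: set them equal")
    if input_len + output_len > max_model_len:
        problems.append("input_len + output_len exceeds max_model_len")
    if max_model_len > max_position_embeddings:
        problems.append("max_model_len exceeds max_position_embeddings (RoPE)")
    return problems
```

An empty list means none of the three documented conditions applies. It does not guarantee that the run will succeed.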

Model Limits:

Qwen3-0.6B/4B: 40960 tokens
Qwen2.5-7B-Instruct-1M: 1048576 tokens

Performance (Qwen3-0.6B):

GPU: ~18k tok/s (prefill), ~100 tok/s (decode)
CPU Offload (16K): ~14k tok/s (prefill)
CPU Offload (32K): ~13k tok/s (prefill)
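
As a back-of-the-envelope reading of these figures, prefill time is roughly the prompt length divided by prefill throughput. The sketch below assumes that "16K" and "32K" mean 16384 and 32768 tokens and treats the approximate throughputs above as exact, so the results are rough estimates only.

```python
def prefill_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Estimate prefill time from prompt length and prefill throughput."""
    return num_tokens / tokens_per_second


# Rough estimates from the CPU offload figures above.
offload_16k = prefill_seconds(16 * 1024, 14_000)  # a little over 1 second
offload_32k = prefill_seconds(32 * 1024, 13_000)  # about 2.5 seconds
print(f"16K: {offload_16k:.2f}s, 32K: {offload_32k:.2f}s")
```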

Author: Zijie Tian
