nano-vllm

Author	SHA1	Message	Date
Zijie Tian	11a867f6fb	🐛 fix: skip GQA buffer allocation in XAttention offload mode In offload mode, GQA expansion buffers (_k_expanded, _v_expanded) are not needed since compute_chunked_prefill() handles GQA inline. Previously, these buffers were always allocated based on max_model_len, causing OOM on 24GB GPUs (e.g., RTX 3090) when max_model_len=1M (16GB buffer). Changes: - Add enable_cpu_offload parameter to alloc_policy_metadata() in base class - Skip GQA buffer allocation when enable_cpu_offload=True in XAttentionBSAPolicy - Pass enable_cpu_offload from model_runner to policy Memory savings: ~16GB for 1M seq, ~1.1GB for 72K seq Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 02:57:18 +08:00
Zijie Tian	51bd678335	📊 feat: distinguish compute density and communication density in DensityObserver - Add record_comm_density() call in select_blocks to track CPU block selection - Add get_per_layer_comm_density() method for detailed analysis - Update print_summary() to show both densities and H2D savings ratio - Set DensityObserver mode (offload/gpu_only) in test_ruler.py - Update get_summary() to return both density types Key insight: Comm density can be 100% even when compute density is ~37% because sparse BSA blocks are distributed across all CPU blocks. Since CPU block granularity is 32x coarser (4096 vs 128 tokens), any() aggregation across heads/Q-blocks results in all CPU blocks being needed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 01:43:17 +08:00
Zijie Tian	829b311c02	🐛 fix: stream synchronization for XAttention estimate kernels in offload mode - Wrap all compute kernels in select_blocks with compute_stream context (Pass 1 historical blocks, Pass 1 current chunk, Step 2 merge, Pass 2 historical blocks, Pass 2 current chunk, Step 4 block selection) - Fix K data mismatch between Pass 1 and Pass 2 by ensuring wait_slot_layer syncs with compute_stream where kernels actually run - Remove STRONG SYNC code from offload_engine.py (now handled by events) - Remove debug print statements and torch.save code - Consolidate fallback conditions in compute_with_xattn - Change default chunk_size from 16384 to 4096 for density alignment The bug caused Pass 1 and Pass 2 to see different K data from the same CPU block because compute kernels ran on default stream while wait_slot_layer only synced compute_stream. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 01:30:23 +08:00
Zijie Tian	aeed6ccdfb	✅ test: add GPU-only density alignment verification test Add test to verify XAttention density calculation in GPU-only mode matches independent xattn_estimate calls. Changes: - Add tests/test_gpuonly_density_alignment.py: loads saved Q/K from xattn_bsa.py, calls xattn_estimate independently, compares results - Enhance debug save in xattn_bsa.py: now saves Q, K tensors and xattn_estimate parameters for external verification - Set _DEBUG_SAVE_MASK = False by default Usage: 1. Set _DEBUG_SAVE_MASK = True in xattn_bsa.py 2. Run GPU-only inference with XAttention (e.g., test_ruler.py) 3. Run tests/test_gpuonly_density_alignment.py to verify alignment Verified on 4k/8k/16k/32k/64k contexts - all pass with exact match. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-02-02 11:14:46 +08:00
Zijie Tian	6c55c4d2a3	♻️ refactor: rewrite select_blocks with 3-stage KV chunking algorithm Implement correct 3-stage KV chunking for XAttention offload mode: - Stage 1: Compute partial softmax stats (m, l) for each KV chunk - Stage 2: Merge all partial stats to get global normalization factors - Stage 3: Normalize with global stats and compute block sums Key fixes: - Add wait_all_prefill_offloads() before loading CPU blocks to ensure async offload completion (fixes stale data bug) - Pre-allocate m/l partial buffers and block_sums buffer This produces identical density to GPU-only xattn_estimate while using O(S×C) peak memory instead of O(S²). Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-02-02 10:10:10 +08:00
Zijie Tian	8ab53e7331	🚧 WIP: add DEBUG code for XAttention KV chunking density verification Add instrumentation to compare GPU-only vs Offload mode density: - Layer 0 DEBUG output for both modes - Accumulate selected/total counts across chunks - Proper causal mask with Q offset handling - Skip normal offload logic for isolated testing Test results (threshold=1.0 achieves alignment): - 32K: GPU-only 0.9999, Offload 0.9999 (diff ~0%) - 64K: GPU-only 0.9995, Offload 0.9995 (diff ~0%) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-02-01 17:33:23 +08:00
Zijie Tian	2e96d1d97d	WIP: Enhance sparse attention with density tracking and block selection improvements - Added analysis documentation for xattn density alignment. - Refactored ModelRunner to pre-allocate policy metadata buffers regardless of CPU offload configuration. - Updated FullAttentionPolicy and SparsePolicy to accept query and key tensors for block selection. - Enhanced QuestPolicy to utilize query tensor for block selection and improved handling of selected blocks. - Expanded XAttentionBSAPolicy to support chunked prefill and improved attention score computation with historical and current chunk handling. - Introduced DensityObserver to track compute and communication density for sparse attention layers. - Updated attention layer to ensure block selection is always called, improving robustness in first chunk scenarios. - Added tests for attention kernel behavior with enhanced input patterns.	2026-01-31 14:48:23 +08:00
Zijie Tian	f6ac4ccdde	✨ feat: add DensityObserver for XAttention sparse attention density tracking - Add DensityObserver class to track per-layer density statistics - Integrate DensityObserver into compute_prefill for GPU-only mode - Fix stride parameter not being passed to xattn_estimate - Add density statistics output to test_ruler.py for XATTN_BSA - Add comprehensive density benchmark documentation Key changes: - nanovllm/utils/density_observer.py: New Observer for density tracking - xattn_bsa.py: Add stride param to xattn_estimate, integrate DensityObserver - test_ruler.py: Enable DensityObserver and print summary for XATTN_BSA - docs/xattn_density_benchmark.md: Benchmark results for 4K-32K contexts Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-30 16:26:56 +08:00
Zijie Tian	8d19e61446	⚡️ perf: replace Triton merge with FlashInfer merge_state Use FlashInfer's optimized merge_state kernel for attention output merging in chunked prefill. End-to-end improvement: +0.8% (32K) to +2.4% (64K). Key changes: - Add merge_attention_outputs_flashinfer() with LSE format conversion - FlashInfer uses log2, flash_attn uses ln: convert via LOG2_E/LN_2 - Keep original Triton kernel for fallback Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 10:04:38 +08:00
Zijie Tian	2c2383c786	⚡️ perf: optimize XAttention estimate with hierarchical block sum Replace slow softmax_fuse_block_sum (block_size=4096) with optimized hierarchical approach (estimate_block_size=1024): - Add estimate_block_size parameter to XAttentionBSAPolicy (default 1024) - Rewrite select_blocks to use hierarchical aggregation: 1. Fine-grained softmax with small block size (15x faster kernel) 2. Aggregate to CPU block level via reshape + sum 3. Score + threshold selection (replaces mask + voting) Performance improvement (CPU Offload mode): - softmax_fuse_block_sum: 48% → 1% of total time (44x faster) - 128K: XAttention now +2.4% faster than Full (was -59%) - 64K: -3.8% (was -21%) - 32K: -6.0% (was -14%) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-28 06:47:13 +08:00
Zijie Tian	3da9b8aef2	⚡️ perf: optimize XAttention estimate phase with K-only loading Add load_k_only_to_slot_layer() to OffloadEngine for estimate phase: - Only load K (not K+V) during block selection in select_blocks() - Reduces H2D transfer by 50% in estimate phase - 64K context: XAttn/Full ratio drops from 1.48x to 0.99x - 32K context: XAttn/Full ratio drops from 1.67x to 1.20x The estimate phase uses flat_group_gemm_fuse_reshape(Q, K) which only requires K for attention score computation. V is unused. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-28 06:24:20 +08:00
Zijie Tian	7b5d3b34eb	📈 feat: add NVTX markers to XAttention for profiling Add NVTX range markers to track XAttention performance: - GPU-only: xattn_estimate, xattn_bsa_compute - Offload: xattn_estimate_gemm, xattn_estimate_softmax, xattn_estimate_find_blocks, xattn_compute_historical, xattn_compute_current, xattn_compute_merge These markers enable detailed nsys profiling to identify performance bottlenecks in estimate vs compute phases. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-28 00:57:11 +08:00
Zijie Tian	a504bd873d	⚡ perf: pre-allocate GQA buffers in XAttention policy Add alloc_policy_metadata() method to SparsePolicy base class for pre-allocating GPU buffers during initialization. This avoids dynamic memory allocation during forward pass. Changes: - Add alloc_policy_metadata() to SparsePolicy base class - Implement GQA buffer pre-allocation in XAttentionBSAPolicy - Call alloc_policy_metadata() in model_runner for GPU-only mode - Modify compute_prefill() to reuse pre-allocated buffers - Add --gpu-util parameter to bench.py Memory savings: - Previously: 2x GQA expansion (~2GB for 64K) - Now: 1x pre-allocated buffer (~1GB for 64K, reused) Tested: - GPU-only 32K: 5602 tok/s (512MB pre-allocated) - GPU-only 64K: 4821 tok/s (1GB pre-allocated, gpu_util=0.7) - Offload Full: PASSED (no changes to offload path) - Offload XAttention: PASSED (uses compute_chunked_prefill) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 05:49:23 +08:00
Zijie Tian	076656c9c2	✨ feat: add GPU-only XAttention BSA sparse attention support - Implement compute_prefill() in XAttentionBSAPolicy for GPU-only mode - Uses xattn_estimate to compute sparse block mask - Uses block_sparse_attn_func for efficient sparse attention - Handles GQA by expanding K/V heads - Falls back to flash_attn for paged KV cache (prefix cache) - Implement compute_decode() by delegating to FullAttentionPolicy - Add --policy xattn option to bench.py Verified: RULER 32k niah_single_1 5/5 samples passed (100%) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 05:19:24 +08:00
Zijie Tian	aea3812230	♻️ refactor: unify KV cache operations through OffloadEngine - Add write_to_prefill_buffer() and write_to_decode_buffer() methods - Add chunk_idx parameter to load_to_slot_layer() for NVTX labeling - Replace direct copy_() calls with OffloadEngine methods in attention.py - Update all load_to_slot_layer() calls to pass chunk_idx - NVTX markers now show chunk info: "H2D: L{layer} Chunk{chunk} CPU[{block}]->Slot[{slot}]" All KV cache data transfers in chunked offload mode now go through OffloadEngine, enabling better profiling and consistent management. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 02:20:59 +08:00
Zijie Tian	ed3c8bb4b8	🐛 fix: memory leak in XAttentionBSAPolicy select_blocks Fix severe memory leak (64GB -> 4GB growth) by: - Remove unused sparse_metadata storage (was accumulating attn_scores) - Delete intermediate tensor list (attn_scores_list) after use - Explicitly delete intermediate tensors before return Before: 16GB -> 80GB during 128K prefill After: 16GB -> 19.8GB during 128K prefill Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 09:30:18 +08:00
Zijie Tian	5eb35982bf	🔧 feat: add density statistics tracking to sparse policies Add statistics tracking to compare block selection between policies: - XAttentionBSAPolicy: track available/selected blocks per chunk - FullAttentionPolicy: track total blocks (always 100% density) - Add reset_stats(), get_density_stats(), print_density_stats() methods - Use logger.debug for per-chunk density logging Results on 32K niah_single_1: - Full: 100% density across all chunks - XAttn BSA: 90% -> 73% density (saves ~25-30% blocks in later chunks) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 08:53:22 +08:00
Zijie Tian	4d1e40152d	✨ feat(xattn): implement compute_chunked_prefill with ring buffer pipeline - Copy compute_chunked_prefill implementation from FullAttentionPolicy - Set default threshold to 0.95 for accuracy testing - Remove debug code (sys.exit, verbose prints) - Use ring buffer pipeline for historical block loading - Merge with current chunk attention using flash_attn_with_lse RULER NIAH test passed with 5/5 samples (100% accuracy). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 08:27:40 +08:00
Zijie Tian	832b352afa	✨ feat(xattn): implement select_blocks with majority voting aggregation Implement XAttention-based block selection for sparse attention: - Use flat_group_gemm_fuse_reshape to compute Q@K^T attention scores - Apply softmax_fuse_block_sum to aggregate into block-level attention - Use find_blocks_chunked for threshold-based block selection - Handle GQA by aggregating within KV head groups first - Use majority voting (>50%) across heads instead of any() for better sparsity - Align block_size with CPU offload block size (1024 tokens / stride = 128) Test results show ~45% density at chunk 40 (down from 100% with any() aggregation). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 08:19:05 +08:00
Zijie Tian	a50b4c2ac2	♻️ refactor: move select_blocks from policy to attention layer Move block selection logic from compute_chunked_prefill/decode methods to attention.py caller. This improves separation of concerns: - attention.py now calls select_blocks() before compute_chunked_*() - Policy methods receive pre-selected blocks via selected_blocks parameter - Enables sparse policies to implement custom block selection without modifying the compute path Changes: - policy.py: Add selected_blocks parameter to abstract methods - full_policy.py: Remove internal select_blocks calls, use passed blocks - xattn_bsa.py: Sync signatures for prefill/decode methods - attention.py: Add select_blocks calls before policy delegation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 05:21:28 +08:00
Zijie Tian	ca32ea6f93	[WIP] Before refactoring compute_chunked_prefill.	2026-01-23 03:36:12 +08:00
Zijie Tian	999858e82f	feat: add xattn kernels test and update testing rules - Add test_xattn_kernels.py demonstrating flat_group_gemm_fuse_reshape and softmax_fuse_block_sum Triton kernels with structured data - Update testing.md with new test code style guidelines - Update xattn.py and xattn_bsa.py with improvements Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 03:01:25 +08:00
Zijie Tian	b97b0b96a0	[WIP] Before refactoring the nanovllm sparse policy.	2026-01-19 22:34:44 +08:00
Zijie Tian	b5da802dff	[WIP] Before integrating the xattn operator.	2026-01-19 21:19:21 +08:00
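
Commit 6c55c4d2a3 describes a 3-stage KV chunking scheme: partial softmax stats (m, l) per KV chunk, a merge into global normalization factors, then normalization and block sums. The following is a minimal pure-Python sketch of that idea for a single query row. It is not the repository's kernel code; the function names are hypothetical, and it assumes chunk_size is a multiple of block_size so that no block straddles a chunk boundary.

```python
import math

def softmax_block_sums_direct(scores, block_size):
    """Reference: full softmax over one query row, then sum per KV block."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [sum(probs[i:i + block_size]) for i in range(0, len(probs), block_size)]

def softmax_block_sums_chunked(scores, chunk_size, block_size):
    """Same result, but each stage only looks at one KV chunk at a time.

    Assumes chunk_size is a multiple of block_size (illustrative constraint).
    """
    chunks = [scores[i:i + chunk_size] for i in range(0, len(scores), chunk_size)]

    # Stage 1: partial stats per chunk (local max m, local sum l of exp(s - m)).
    partial = []
    for chunk in chunks:
        m = max(chunk)
        l = sum(math.exp(s - m) for s in chunk)
        partial.append((m, l))

    # Stage 2: merge partial stats into global normalization factors.
    m_global = max(m for m, _ in partial)
    l_global = sum(l * math.exp(m - m_global) for m, l in partial)

    # Stage 3: revisit each chunk, normalize with the global stats, sum per block.
    block_sums = []
    for chunk in chunks:
        probs = [math.exp(s - m_global) / l_global for s in chunk]
        block_sums.extend(
            sum(probs[i:i + block_size]) for i in range(0, len(probs), block_size)
        )
    return block_sums
```

Because stage 3 rescales with the merged (m, l) rather than each chunk's local values, the block sums agree with the unchunked softmax up to floating-point rounding, which is the property the commit relies on to match the GPU-only density.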

24 Commits
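
Commit 51bd678335 notes that communication density can reach 100% while compute density stays far lower, because any() aggregation marks a 4096-token CPU block as needed whenever one of its 128-token blocks is selected. A small sketch of that effect, under simplifying assumptions: selection is reduced to a one-dimensional list of booleans (the real policy also aggregates across heads and Q-blocks), and the `densities` helper and both selection patterns are hypothetical.

```python
def densities(selected_fine, fine_per_cpu_block):
    """Return (compute_density, comm_density) for a flat list of block flags.

    selected_fine holds one bool per fine-grained (compute) block.
    """
    compute_density = sum(selected_fine) / len(selected_fine)
    # A CPU block must be transferred if any fine block inside it is selected.
    cpu_needed = [
        any(selected_fine[i:i + fine_per_cpu_block])
        for i in range(0, len(selected_fine), fine_per_cpu_block)
    ]
    comm_density = sum(cpu_needed) / len(cpu_needed)
    return compute_density, comm_density

# 4096-token CPU blocks vs 128-token BSA blocks: 32 fine blocks per CPU block.
FINE_PER_CPU = 4096 // 128

# Scattered selection: every third fine block is kept (about 1/3 compute density).
scattered = [i % 3 == 0 for i in range(8 * FINE_PER_CPU)]
# Clustered selection: the same number of blocks, packed into the leading CPU blocks.
clustered = sorted(scattered, reverse=True)

scattered_result = densities(scattered, FINE_PER_CPU)
clustered_result = densities(clustered, FINE_PER_CPU)
```

Both patterns select the same fraction of fine blocks, but the scattered one touches every CPU block while the clustered one touches only a few. H2D savings therefore depend on where the selected blocks fall, not only on how many there are.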