nano-vllm

Author	SHA1	Message	Date
Zijie Tian	8d19e61446	⚡️ perf: replace Triton merge with FlashInfer merge_state Use FlashInfer's optimized merge_state kernel for attention output merging in chunked prefill. End-to-end improvement: +0.8% (32K) to +2.4% (64K). Key changes: - Add merge_attention_outputs_flashinfer() with LSE format conversion - FlashInfer uses log2, flash_attn uses ln: convert via LOG2_E/LN_2 - Keep original Triton kernel for fallback Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 10:04:38 +08:00
Zijie Tian	2c2383c786	⚡️ perf: optimize XAttention estimate with hierarchical block sum Replace slow softmax_fuse_block_sum (block_size=4096) with optimized hierarchical approach (estimate_block_size=1024): - Add estimate_block_size parameter to XAttentionBSAPolicy (default 1024) - Rewrite select_blocks to use hierarchical aggregation: 1. Fine-grained softmax with small block size (15x faster kernel) 2. Aggregate to CPU block level via reshape + sum 3. Score + threshold selection (replaces mask + voting) Performance improvement (CPU Offload mode): - softmax_fuse_block_sum: 48% → 1% of total time (44x faster) - 128K: XAttention now +2.4% faster than Full (was -59%) - 64K: -3.8% (was -21%) - 32K: -6.0% (was -14%) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-28 06:47:13 +08:00
Zijie Tian	3da9b8aef2	⚡️ perf: optimize XAttention estimate phase with K-only loading Add load_k_only_to_slot_layer() to OffloadEngine for estimate phase: - Only load K (not K+V) during block selection in select_blocks() - Reduces H2D transfer by 50% in estimate phase - 64K context: XAttn/Full ratio drops from 1.48x to 0.99x - 32K context: XAttn/Full ratio drops from 1.67x to 1.20x The estimate phase uses flat_group_gemm_fuse_reshape(Q, K) which only requires K for attention score computation. V is unused. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-28 06:24:20 +08:00
Zijie Tian	39d12a0416	📈 feat: add MemoryObserver for GPU-CPU communication tracking Implement MemoryObserver to track memory transfers between GPU and CPU: - H2D (Host to Device): CPU → GPU transfers - D2H (Device to Host): GPU → CPU transfers - D2D (Device to Device): GPU buffer copies - Supports prefill/decode phase separation Integration points in offload_engine.py: - load_to_slot_layer: H2D with is_prefill parameter - offload_slot_layer_to_cpu, offload_prefill_buffer_async: D2H - write_to_prefill_buffer, write_to_decode_buffer: D2D - load_block_sample_from_cpu, load_block_full_from_cpu: H2D Add bench_offload.py integration for memory stats printing. Benchmark results (Llama-3.1-8B, 64K context): - Full Policy: Prefill H2D 262.13 GB - XAttention: Prefill H2D 386.62 GB (1.48x) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-28 04:06:45 +08:00
Zijie Tian	7b5d3b34eb	📈 feat: add NVTX markers to XAttention for profiling Add NVTX range markers to track XAttention performance: - GPU-only: xattn_estimate, xattn_bsa_compute - Offload: xattn_estimate_gemm, xattn_estimate_softmax, xattn_estimate_find_blocks, xattn_compute_historical, xattn_compute_current, xattn_compute_merge These markers enable detailed nsys profiling to identify performance bottlenecks in estimate vs compute phases. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-28 00:57:11 +08:00
Zijie Tian	a504bd873d	⚡ perf: pre-allocate GQA buffers in XAttention policy Add alloc_policy_metadata() method to SparsePolicy base class for pre-allocating GPU buffers during initialization. This avoids dynamic memory allocation during forward pass. Changes: - Add alloc_policy_metadata() to SparsePolicy base class - Implement GQA buffer pre-allocation in XAttentionBSAPolicy - Call alloc_policy_metadata() in model_runner for GPU-only mode - Modify compute_prefill() to reuse pre-allocated buffers - Add --gpu-util parameter to bench.py Memory savings: - Previously: 2x GQA expansion (~2GB for 64K) - Now: 1x pre-allocated buffer (~1GB for 64K, reused) Tested: - GPU-only 32K: 5602 tok/s (512MB pre-allocated) - GPU-only 64K: 4821 tok/s (1GB pre-allocated, gpu_util=0.7) - Offload Full: PASSED (no changes to offload path) - Offload XAttention: PASSED (uses compute_chunked_prefill) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 05:49:23 +08:00
Zijie Tian	076656c9c2	✨ feat: add GPU-only XAttention BSA sparse attention support - Implement compute_prefill() in XAttentionBSAPolicy for GPU-only mode - Uses xattn_estimate to compute sparse block mask - Uses block_sparse_attn_func for efficient sparse attention - Handles GQA by expanding K/V heads - Falls back to flash_attn for paged KV cache (prefix cache) - Implement compute_decode() by delegating to FullAttentionPolicy - Add --policy xattn option to bench.py Verified: RULER 32k niah_single_1 5/5 samples passed (100%) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 05:19:24 +08:00
Zijie Tian	09b2136e9f	✨ feat: integrate sparse policy architecture into GPU-only mode - Add compute_prefill() and compute_decode() GPU-only methods to SparsePolicy base class - Implement GPU-only methods in FullAttentionPolicy using flash_attn - Add sparse_policy parameter to GPUOnlyManager - Update create_kvcache_manager() to create FullAttentionPolicy for GPU-only mode - Route GPU-only attention through sparse_policy in attention.py - Pass kvcache_manager to context for policy access - Add --enable-policy flag to bench.py for testing - Handle warmup phase when kvcache_manager is not yet allocated This allows GPU-only mode to use the same policy architecture as CPU offload mode, enabling future sparse attention implementations (Quest, XAttention) in GPU-only mode. Performance verified: ~4890 tok/s (unchanged from baseline) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 05:08:02 +08:00
Zijie Tian	18bc433f09	⚡ perf: improve NVTX profiling with colored ranges and configurable slots - Switch from torch.cuda.nvtx to nvtx package for colored range support - Add color coding: blue for H2D, green for D2H decode, orange for D2H prefill - Add --num-gpu-blocks parameter to profile_offload.sh - Include slot count in output filename for easier comparison Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 03:42:05 +08:00
Zijie Tian	aea3812230	♻️ refactor: unify KV cache operations through OffloadEngine - Add write_to_prefill_buffer() and write_to_decode_buffer() methods - Add chunk_idx parameter to load_to_slot_layer() for NVTX labeling - Replace direct copy_() calls with OffloadEngine methods in attention.py - Update all load_to_slot_layer() calls to pass chunk_idx - NVTX markers now show chunk info: "H2D: L{layer} Chunk{chunk} CPU[{block}]->Slot[{slot}]" All KV cache data transfers in chunked offload mode now go through OffloadEngine, enabling better profiling and consistent management. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 02:20:59 +08:00
Zijie Tian	7c41032a2e	✨ feat: add configurable stride and chunk_size for XAttention BSA - Add sparse_chunk_size config option (default: 16384) - Pass stride, chunk_size, use_triton through factory function - Add --sparse-stride CLI option to test_ruler.py Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 10:37:04 +08:00
Zijie Tian	ed3c8bb4b8	🐛 fix: memory leak in XAttentionBSAPolicy select_blocks Fix severe memory leak (64GB -> 4GB growth) by: - Remove unused sparse_metadata storage (was accumulating attn_scores) - Delete intermediate tensor list (attn_scores_list) after use - Explicitly delete intermediate tensors before return Before: 16GB -> 80GB during 128K prefill After: 16GB -> 19.8GB during 128K prefill Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 09:30:18 +08:00
Zijie Tian	5eb35982bf	🔧 feat: add density statistics tracking to sparse policies Add statistics tracking to compare block selection between policies: - XAttentionBSAPolicy: track available/selected blocks per chunk - FullAttentionPolicy: track total blocks (always 100% density) - Add reset_stats(), get_density_stats(), print_density_stats() methods - Use logger.debug for per-chunk density logging Results on 32K niah_single_1: - Full: 100% density across all chunks - XAttn BSA: 90% -> 73% density (saves ~25-30% blocks in later chunks) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 08:53:22 +08:00
Zijie Tian	4d1e40152d	✨ feat(xattn): implement compute_chunked_prefill with ring buffer pipeline - Copy compute_chunked_prefill implementation from FullAttentionPolicy - Set default threshold to 0.95 for accuracy testing - Remove debug code (sys.exit, verbose prints) - Use ring buffer pipeline for historical block loading - Merge with current chunk attention using flash_attn_with_lse RULER NIAH test passed with 5/5 samples (100% accuracy). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 08:27:40 +08:00
Zijie Tian	832b352afa	✨ feat(xattn): implement select_blocks with majority voting aggregation Implement XAttention-based block selection for sparse attention: - Use flat_group_gemm_fuse_reshape to compute Q@K^T attention scores - Apply softmax_fuse_block_sum to aggregate into block-level attention - Use find_blocks_chunked for threshold-based block selection - Handle GQA by aggregating within KV head groups first - Use majority voting (>50%) across heads instead of any() for better sparsity - Align block_size with CPU offload block size (1024 tokens / stride = 128) Test results show ~45% density at chunk 40 (down from 100% with any() aggregation). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 08:19:05 +08:00
Zijie Tian	a50b4c2ac2	♻️ refactor: move select_blocks from policy to attention layer Move block selection logic from compute_chunked_prefill/decode methods to attention.py caller. This improves separation of concerns: - attention.py now calls select_blocks() before compute_chunked_*() - Policy methods receive pre-selected blocks via selected_blocks parameter - Enables sparse policies to implement custom block selection without modifying the compute path Changes: - policy.py: Add selected_blocks parameter to abstract methods - full_policy.py: Remove internal select_blocks calls, use passed blocks - xattn_bsa.py: Sync signatures for prefill/decode methods - attention.py: Add select_blocks calls before policy delegation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 05:21:28 +08:00
Zijie Tian	ca32ea6f93	[WIP] Before refactor the compute)_chunked_prefill.	2026-01-23 03:36:12 +08:00
Zijie Tian	999858e82f	feat: add xattn kernels test and update testing rules - Add test_xattn_kernels.py demonstrating flat_group_gemm_fuse_reshape and softmax_fuse_block_sum Triton kernels with structured data - Update testing.md with new test code style guidelines - Update xattn.py and xattn_bsa.py with improvements Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 03:01:25 +08:00
Zijie Tian	78050aef9f	🐛 fix: resolve CPU KV cache state leakage between requests Root Cause: - OffloadEngine.reset() cleared GPU buffers but NOT CPU cache - Previous request's KV cache data persisted in CPU memory, contaminating subsequent requests Fixes: - Add k_cache_cpu.zero_() and v_cache_cpu.zero_() to OffloadEngine.reset() - Add clear_decode_tracking(seq) call in HybridKVCacheManager.deallocate() Results: - niah_single_1 accuracy improved from ~80% to 94% (+14%) - Remaining ~6% errors are model limitations, not state leakage Also: - Update docs/ruler_32k_chunked_offload_issue.md with fix details - Remove debug planning files (findings.md, progress.md, task_plan.md) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 01:12:21 +08:00
Zijie Tian	690456dbf9	♻️ refactor: create ops module and move chunked_attention - Create nanovllm/ops/ module for low-level attention operators - Move chunked_attention.py from kvcache/ to ops/ - Update imports in full_policy.py (3 locations) - Fix: remove dead code in OffloadEngine.reset() referencing non-existent layer_k/v_buffer_a/b attributes Verified with needle test (32K offload): PASSED Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 02:50:14 +08:00
Zijie Tian	07f5220f40	Merge branch 'tzj/minference' of ssh://git.zijie-tian.site:2222/zijie-tian/nano-vllm into tzj/minference	2026-01-20 02:27:10 +08:00
Zijie Tian	b1f292cf22	Merge branch 'tzj/minference' of ssh://git.zijie-tian.site:2222/zijie-tian/nano-vllm into tzj/minference	2026-01-20 02:16:39 +08:00
Zijie Tian	fa7601f4b8	♻️ refactor: remove cross-layer pipeline and rename compute_chunked_prefill - Remove cross-layer pipeline from OffloadEngine (saves ~1GB GPU memory for long sequences) - Delete layer_k/v_buffer_a/b double buffers - Remove start_decode_pipeline, get_decode_layer_kv, end_decode_pipeline methods - Remove pipeline state tracking variables - Simplify decode to use ring buffer pipeline only (more efficient for long sequences) - Rename compute_chunked_attention → compute_chunked_prefill for clarity - Add mandatory needle test requirements: --enable-offload --input-len 32768 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 02:10:40 +08:00
Zijie Tian	4593f42ec3	♻️ refactor: migrate chunked decode attention to SparsePolicy Move decode attention computation from attention.py to SparsePolicy: - Add compute_chunked_decode abstract method to SparsePolicy base class - Implement compute_chunked_decode in FullAttentionPolicy with: - Ring buffer pipeline (_decode_ring_buffer_pipeline) - Cross-layer pipeline (_decode_with_layer_pipeline) - Decode buffer handling - Simplify _chunked_decode_attention to only validate and delegate - Remove _decode_ring_buffer_pipeline and _decode_with_layer_pipeline from attention.py - Add supports_decode check for policy validation This completes the SparsePolicy v5 refactoring where both prefill and decode paths now delegate all computation to the sparse policy. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 01:32:17 +08:00
Zijie Tian	baa4be7e2e	♻️ refactor: migrate chunked prefill attention to SparsePolicy Move all chunked prefill attention computation from attention.py to SparsePolicy.compute_chunked_attention(). This is the v4 architecture refactoring for sparse attention policies. Changes: - Add compute_chunked_attention abstract method to SparsePolicy base - Add offload_engine parameter to select_blocks for policies needing KV access during block selection - Implement compute_chunked_attention in FullAttentionPolicy with complete ring buffer pipeline logic - Simplify attention.py to delegate all chunked prefill to policy - Remove redundant _sync_load_previous_chunks and _ring_buffer_pipeline_load methods from Attention class Test: test_needle.py --enable-offload PASSED Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 00:58:46 +08:00
Zijie Tian	b97b0b96a0	[WIP] Before refactor the nanovllm sparse policy.	2026-01-19 22:34:44 +08:00
Zijie Tian	b5da802dff	[WIP] Before integrate the xattn operator.	2026-01-19 21:19:21 +08:00
Zijie Tian	9e6fdc0650	[WIP] Before plan execute.	2026-01-19 03:30:44 +08:00
Zijie Tian	50520a6c3c	[fix] fixed request to request error.	2026-01-19 00:55:26 +08:00
Zijie Tian	6575099a06	[refactor] Cleanup unused code after perf_opt merge Removed ~460 lines of unused/redundant code from offload_engine.py: - CUDA gather methods (gathered_h2d_*, update_gather_indices) - Legacy async transfer methods (prefetch_block_async, offload_block_async) - Legacy sync/wait methods (wait_for_block, wait_all_transfers, sync_indices) - Legacy compatibility methods (load_to_compute_layer, wait_compute_layer) - Unused gather_indices tensors and memory calculations Updated class docstring to reflect current architecture. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-07 06:25:21 +08:00
Zijie Tian	8fd25d72d7	Merge perf_opt-1 and perf_opt-2 branches Combines two performance optimization features: - perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache) - perf_opt-2: Per-layer prefill buffer for async offload Both features are complementary and improve CPU offload performance. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-07 06:03:44 +08:00
Zijie Tian	ccf27d3a74	[claudesquad] update from 'perf_opt-1' on 07 Jan 26 05:58 CST	2026-01-07 05:58:23 +08:00
Zijie Tian	0ad86eb449	[claudesquad] update from 'perf_opt-2' on 07 Jan 26 05:58 CST	2026-01-07 05:58:10 +08:00
Zijie Tian	2a6e0a2c02	[feat] Added Quest Sparsity Policy.	2026-01-07 03:29:21 +08:00
Zijie Tian	c99a6f3d3f	[WIP] Before add Quest policy.	2026-01-07 02:32:30 +08:00
Zijie Tian	0e691f2d85	[WIP] move metadata to GPU.	2026-01-06 23:32:32 +08:00
Zijie Tian	690492e074	[WIP] Before refactor policies.	2026-01-06 20:47:55 +08:00
Zijie Tian	535f2037ab	[WIP] Before fix bench_offload.py.	2026-01-06 18:41:08 +08:00
Zijie Tian	e554d5482b	[refactor] Delete unnesscessory test, and refacrtor the offload prefix cache.	2026-01-05 20:31:42 +08:00
Zijie Tian	247c5312d9	[fix] Fixed decode misalign.	2026-01-05 19:00:44 +08:00
Zijie Tian	054aaff403	[fix] Fixed needle test bug.	2026-01-05 18:34:09 +08:00
Zijie Tian	74ee6d0895	[WIP] need to fix model to normally decode.	2026-01-01 05:18:27 +08:00
Zijie Tian	965c8aff12	[WIP] need change flashattention to debug.	2026-01-01 00:58:22 +08:00
Zijie Tian	30462fe89a	[WIP] Before fix needle.	2025-12-31 23:35:25 +08:00
Zijie Tian	ccd1b3d4ab	[WIP] Before modify nanovllm CPU-GPU kvcache.	2025-12-31 22:41:07 +08:00
Zijie Tian	484d0de9f9	[feat] Added debug hook to offload_engine.py.	2025-12-31 19:44:39 +08:00
Zijie Tian	89f8020d38	[WIP] fixing attention compute error.	2025-12-30 00:31:48 +08:00
Zijie Tian	82ed34fc2d	[opt] optimize nanovllm performance compareable with vllm.	2025-12-25 03:47:07 +08:00
Zijie Tian	16fcf8350b	[WIP] replace merge attention with triton kernel.	2025-12-25 01:07:05 +08:00
Zijie Tian	6ec1b23982	[WIP] NEED to modify communication.	2025-12-24 21:57:51 +08:00

1 2

68 Commits