nano-vllm

Author	SHA1	Message	Date
Zijie Tian	2e96d1d97d	WIP: Enhance sparse attention with density tracking and block selection improvements - Added analysis documentation for xattn density alignment. - Refactored ModelRunner to pre-allocate policy metadata buffers regardless of CPU offload configuration. - Updated FullAttentionPolicy and SparsePolicy to accept query and key tensors for block selection. - Enhanced QuestPolicy to utilize query tensor for block selection and improved handling of selected blocks. - Expanded XAttentionBSAPolicy to support chunked prefill and improved attention score computation with historical and current chunk handling. - Introduced DensityObserver to track compute and communication density for sparse attention layers. - Updated attention layer to ensure block selection is always called, improving robustness in first chunk scenarios. - Added tests for attention kernel behavior with enhanced input patterns.	2026-01-31 14:48:23 +08:00
Zijie Tian	726e4b58cf	✨ feat: add GLM-4-9B-Chat-1M model support Add support for GLM-4 model architecture with the following changes: - Add glm4.py with ChatGLMForCausalLM, GLM4Model, GLM4Attention, GLM4MLP - Add GLM4RotaryEmbedding with interleaved partial rotation (rotary_dim = head_dim // 2) - Add apply_rotary_emb_interleaved function for GLM-4 style RoPE - Add GLM-4 weight name conversion and loading in loader.py - Add GLM-4 chat template conversion in test_ruler.py - Add trust_remote_code=True for GLM-4 config loading Key GLM-4 specific adaptations: - QKV bias enabled (add_qkv_bias: true) - RoPE with rope_ratio scaling (base = 10000 * rope_ratio) - Interleaved RoPE (pairs adjacent elements, not first/second half) - Partial rotation (only half of head_dim is rotated) - Uses multi_query_group_num instead of num_key_value_heads - Uses kv_channels instead of head_dim - Uses ffn_hidden_size instead of intermediate_size Tested with RULER niah_single_1 (5 samples): 100% accuracy Both GPU-only and CPU offload modes verified Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 13:15:57 +08:00
Zijie Tian	4fe7dfb239	🔀 merge: integrate tzj/minference-exp (GPU-only sparse attention) Merge GPU-only sparse attention support from tzj/minference-exp branch: GPU-only mode additions: - Add compute_prefill/compute_decode methods to SparsePolicy base class - Add GPU-only attention routing in attention.py - Add alloc_policy_metadata() for pre-allocating GQA buffers - Add XAttention + BSA sparse attention for GPU-only prefill - Add kvcache_manager to set_context() for policy access bench.py enhancements: - Add --model argument for configurable model path - Add --policy argument (full, xattn) for sparse policy selection - Add --enable-policy flag for FullAttentionPolicy routing - Add --enforce-eager option to disable CUDA graphs - Add --gpu-util option for GPU memory utilization Documentation: - Add gpu_only_xattn_guide.md with performance analysis - Add gpu_only_sparse_integration.md baseline document - Add gpu-vram-requirement.md rule for GPU-only mode Both CPU offload and GPU-only paths are preserved and functional. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-27 09:25:36 +08:00
Zijie Tian	0437311068	⚡ feat: add Phase 5 CUDA Graph optimization for chunked prefill Implement extended CUDA Graph coverage for CPU offload path: - Add graphed_layers.py with N+2 graph architecture (EmbedGraph, FirstGraph, InterGraphs, LastGraph) - Support both prefill (seq_len=chunk_size) and decode (seq_len=1) graph modes - Extend graph coverage to ~70-80% including qkv_proj, rotary, o_proj - Only attention core remains in eager mode for dynamic offload Performance: Prefill throughput improved ~5.6% (3782 -> 3995 tok/s at 32K) Also adds: - --enforce-eager flag to bench_offload.py for comparison - Offload mode constraint documentation in CLAUDE.md Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 07:38:40 +08:00
Zijie Tian	09b2136e9f	✨ feat: integrate sparse policy architecture into GPU-only mode - Add compute_prefill() and compute_decode() GPU-only methods to SparsePolicy base class - Implement GPU-only methods in FullAttentionPolicy using flash_attn - Add sparse_policy parameter to GPUOnlyManager - Update create_kvcache_manager() to create FullAttentionPolicy for GPU-only mode - Route GPU-only attention through sparse_policy in attention.py - Pass kvcache_manager to context for policy access - Add --enable-policy flag to bench.py for testing - Handle warmup phase when kvcache_manager is not yet allocated This allows GPU-only mode to use the same policy architecture as CPU offload mode, enabling future sparse attention implementations (Quest, XAttention) in GPU-only mode. Performance verified: ~4890 tok/s (unchanged from baseline) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 05:08:02 +08:00
Zijie Tian	aea3812230	♻️ refactor: unify KV cache operations through OffloadEngine - Add write_to_prefill_buffer() and write_to_decode_buffer() methods - Add chunk_idx parameter to load_to_slot_layer() for NVTX labeling - Replace direct copy_() calls with OffloadEngine methods in attention.py - Update all load_to_slot_layer() calls to pass chunk_idx - NVTX markers now show chunk info: "H2D: L{layer} Chunk{chunk} CPU[{block}]->Slot[{slot}]" All KV cache data transfers in chunked offload mode now go through OffloadEngine, enabling better profiling and consistent management. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 02:20:59 +08:00
Zijie Tian	a50b4c2ac2	♻️ refactor: move select_blocks from policy to attention layer Move block selection logic from compute_chunked_prefill/decode methods to attention.py caller. This improves separation of concerns: - attention.py now calls select_blocks() before compute_chunked_*() - Policy methods receive pre-selected blocks via selected_blocks parameter - Enables sparse policies to implement custom block selection without modifying the compute path Changes: - policy.py: Add selected_blocks parameter to abstract methods - full_policy.py: Remove internal select_blocks calls, use passed blocks - xattn_bsa.py: Sync signatures for prefill/decode methods - attention.py: Add select_blocks calls before policy delegation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 05:21:28 +08:00
Zijie Tian	ca32ea6f93	[WIP] Before refactor the compute)_chunked_prefill.	2026-01-23 03:36:12 +08:00
Zijie Tian	fa7601f4b8	♻️ refactor: remove cross-layer pipeline and rename compute_chunked_prefill - Remove cross-layer pipeline from OffloadEngine (saves ~1GB GPU memory for long sequences) - Delete layer_k/v_buffer_a/b double buffers - Remove start_decode_pipeline, get_decode_layer_kv, end_decode_pipeline methods - Remove pipeline state tracking variables - Simplify decode to use ring buffer pipeline only (more efficient for long sequences) - Rename compute_chunked_attention → compute_chunked_prefill for clarity - Add mandatory needle test requirements: --enable-offload --input-len 32768 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 02:10:40 +08:00
Zijie Tian	4593f42ec3	♻️ refactor: migrate chunked decode attention to SparsePolicy Move decode attention computation from attention.py to SparsePolicy: - Add compute_chunked_decode abstract method to SparsePolicy base class - Implement compute_chunked_decode in FullAttentionPolicy with: - Ring buffer pipeline (_decode_ring_buffer_pipeline) - Cross-layer pipeline (_decode_with_layer_pipeline) - Decode buffer handling - Simplify _chunked_decode_attention to only validate and delegate - Remove _decode_ring_buffer_pipeline and _decode_with_layer_pipeline from attention.py - Add supports_decode check for policy validation This completes the SparsePolicy v5 refactoring where both prefill and decode paths now delegate all computation to the sparse policy. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 01:32:17 +08:00
Zijie Tian	baa4be7e2e	♻️ refactor: migrate chunked prefill attention to SparsePolicy Move all chunked prefill attention computation from attention.py to SparsePolicy.compute_chunked_attention(). This is the v4 architecture refactoring for sparse attention policies. Changes: - Add compute_chunked_attention abstract method to SparsePolicy base - Add offload_engine parameter to select_blocks for policies needing KV access during block selection - Implement compute_chunked_attention in FullAttentionPolicy with complete ring buffer pipeline logic - Simplify attention.py to delegate all chunked prefill to policy - Remove redundant _sync_load_previous_chunks and _ring_buffer_pipeline_load methods from Attention class Test: test_needle.py --enable-offload PASSED Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 00:58:46 +08:00
Zijie Tian	b97b0b96a0	[WIP] Before refactor the nanovllm sparse policy.	2026-01-19 22:34:44 +08:00
Zijie Tian	b5da802dff	[WIP] Before integrate the xattn operator.	2026-01-19 21:19:21 +08:00
Zijie Tian	9e6fdc0650	[WIP] Before plan execute.	2026-01-19 03:30:44 +08:00
Zijie Tian	03a8c033cb	[claudesquad] update from 'add-llama-1' on 10 Jan 26 21:03 CST	2026-01-10 21:03:45 +08:00
Zijie Tian	8fd25d72d7	Merge perf_opt-1 and perf_opt-2 branches Combines two performance optimization features: - perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache) - perf_opt-2: Per-layer prefill buffer for async offload Both features are complementary and improve CPU offload performance. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-07 06:03:44 +08:00
Zijie Tian	ccf27d3a74	[claudesquad] update from 'perf_opt-1' on 07 Jan 26 05:58 CST	2026-01-07 05:58:23 +08:00
Zijie Tian	0ad86eb449	[claudesquad] update from 'perf_opt-2' on 07 Jan 26 05:58 CST	2026-01-07 05:58:10 +08:00
Zijie Tian	58a06501c1	Merge branch 'zijie/debug_chunk-2' into tzj/minference	2026-01-07 03:30:38 +08:00
Zijie Tian	2a6e0a2c02	[feat] Added Quest Sparsity Policy.	2026-01-07 03:29:21 +08:00
Zijie Tian	2fe50bab50	[claudesquad] update from 'debug_chunk-2' on 07 Jan 26 03:27 CST	2026-01-07 03:27:27 +08:00
Zijie Tian	c99a6f3d3f	[WIP] Before add Quest policy.	2026-01-07 02:32:30 +08:00
Zijie Tian	7cc8a394a5	[fix] Fixed bench_offload.py, BUT performance DEGRAD.	2026-01-06 18:46:48 +08:00
Zijie Tian	535f2037ab	[WIP] Before fix bench_offload.py.	2026-01-06 18:41:08 +08:00
Zijie Tian	247c5312d9	[fix] Fixed decode misalign.	2026-01-05 19:00:44 +08:00
Zijie Tian	054aaff403	[fix] Fixed needle test bug.	2026-01-05 18:34:09 +08:00
Zijie Tian	772313db8f	[refactor] Refactor the kvcache offload.	2026-01-04 19:37:03 +08:00
Zijie Tian	74ee6d0895	[WIP] need to fix model to normally decode.	2026-01-01 05:18:27 +08:00
Zijie Tian	965c8aff12	[WIP] need change flashattention to debug.	2026-01-01 00:58:22 +08:00
Zijie Tian	30462fe89a	[WIP] Before fix needle.	2025-12-31 23:35:25 +08:00
Zijie Tian	484d0de9f9	[feat] Added debug hook to offload_engine.py.	2025-12-31 19:44:39 +08:00
Zijie Tian	89f8020d38	[WIP] fixing attention compute error.	2025-12-30 00:31:48 +08:00
Zijie Tian	82ed34fc2d	[opt] optimize nanovllm performance compareable with vllm.	2025-12-25 03:47:07 +08:00
Zijie Tian	6ec1b23982	[WIP] NEED to modify communication.	2025-12-24 21:57:51 +08:00
Zijie Tian	782437c486	[WIP] remove num_prefetch_blocks varible.	2025-12-24 18:22:26 +08:00
Zijie Tian	051f2295c9	[feat] Added sparse KVcache feature, NEED VERIFY.	2025-12-22 08:51:02 +08:00
Zijie Tian	91a0f09a24	[feat] Optimized with ASYNC offload.	2025-12-15 07:21:35 +08:00
Zijie Tian	b8b6478506	[feat] Need to optimized with async prefetch.	2025-12-15 06:58:40 +08:00
Zijie Tian	1081ab51ea	[refactor] Refactor offload code to multi-chunk.	2025-12-15 01:13:58 +08:00
Zijie Tian	61edb8a344	[feat] Finished offload. Still need optimize performance.	2025-12-12 02:27:40 +08:00
Zijie Tian	9b8165af5a	[fix] Fixed kvcache offload problem.	2025-12-12 01:35:30 +08:00
Zijie Tian	60d24f7c12	[feat] Added bench_offload.py and GreedySampler.	2025-12-12 00:24:08 +08:00
Zijie Tian	babfa17354	[refactor] Translate into english, void Chinese due to claude.	2025-12-11 00:30:24 +08:00
Zijie Tian	e85c2b4776	[fix] Fixed kvcache offload bugs.	2025-12-10 22:34:00 +08:00
Zijie Tian	190df5f70d	[refactor] Refactor current gpu and cpu block allocation strategy.	2025-12-10 21:23:31 +08:00
Zijie Tian	0a247ccb1b	[feat] Added `num_gpu_blocks` limit gpu blocks.	2025-12-10 20:17:42 +08:00
Zijie Tian	87055cc5ce	[refactor] Implement real chunked prefill mechenism.	2025-12-10 18:34:01 +08:00
Zijie Tian	0b6f19242d	[feat] Added chunked prefill and kvcache offload mechenism.	2025-12-10 03:47:37 +08:00
Zijie Tian	761929390e	[bench] Added vllm vs nano-vllm bench.	2025-12-10 00:44:57 +08:00
GeeeekExplorer	6ef2a4f630	compile random sampling	2025-08-31 22:55:34 +08:00

1 2

63 Commits