nano-vllm

Author	SHA1	Message	Date
Zijie Tian	2e96d1d97d	WIP: Enhance sparse attention with density tracking and block selection improvements - Added analysis documentation for xattn density alignment. - Refactored ModelRunner to pre-allocate policy metadata buffers regardless of CPU offload configuration. - Updated FullAttentionPolicy and SparsePolicy to accept query and key tensors for block selection. - Enhanced QuestPolicy to utilize query tensor for block selection and improved handling of selected blocks. - Expanded XAttentionBSAPolicy to support chunked prefill and improved attention score computation with historical and current chunk handling. - Introduced DensityObserver to track compute and communication density for sparse attention layers. - Updated attention layer to ensure block selection is always called, improving robustness in first chunk scenarios. - Added tests for attention kernel behavior with enhanced input patterns.	2026-01-31 14:48:23 +08:00
Zijie Tian	f6ac4ccdde	✨ feat: add DensityObserver for XAttention sparse attention density tracking - Add DensityObserver class to track per-layer density statistics - Integrate DensityObserver into compute_prefill for GPU-only mode - Fix stride parameter not being passed to xattn_estimate - Add density statistics output to test_ruler.py for XATTN_BSA - Add comprehensive density benchmark documentation Key changes: - nanovllm/utils/density_observer.py: New Observer for density tracking - xattn_bsa.py: Add stride param to xattn_estimate, integrate DensityObserver - test_ruler.py: Enable DensityObserver and print summary for XATTN_BSA - docs/xattn_density_benchmark.md: Benchmark results for 4K-32K contexts Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-30 16:26:56 +08:00
Zijie Tian	726e4b58cf	✨ feat: add GLM-4-9B-Chat-1M model support Add support for GLM-4 model architecture with the following changes: - Add glm4.py with ChatGLMForCausalLM, GLM4Model, GLM4Attention, GLM4MLP - Add GLM4RotaryEmbedding with interleaved partial rotation (rotary_dim = head_dim // 2) - Add apply_rotary_emb_interleaved function for GLM-4 style RoPE - Add GLM-4 weight name conversion and loading in loader.py - Add GLM-4 chat template conversion in test_ruler.py - Add trust_remote_code=True for GLM-4 config loading Key GLM-4 specific adaptations: - QKV bias enabled (add_qkv_bias: true) - RoPE with rope_ratio scaling (base = 10000 * rope_ratio) - Interleaved RoPE (pairs adjacent elements, not first/second half) - Partial rotation (only half of head_dim is rotated) - Uses multi_query_group_num instead of num_key_value_heads - Uses kv_channels instead of head_dim - Uses ffn_hidden_size instead of intermediate_size Tested with RULER niah_single_1 (5 samples): 100% accuracy Both GPU-only and CPU offload modes verified Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 13:15:57 +08:00
Zijie Tian	39d12a0416	📈 feat: add MemoryObserver for GPU-CPU communication tracking Implement MemoryObserver to track memory transfers between GPU and CPU: - H2D (Host to Device): CPU → GPU transfers - D2H (Device to Host): GPU → CPU transfers - D2D (Device to Device): GPU buffer copies - Supports prefill/decode phase separation Integration points in offload_engine.py: - load_to_slot_layer: H2D with is_prefill parameter - offload_slot_layer_to_cpu, offload_prefill_buffer_async: D2H - write_to_prefill_buffer, write_to_decode_buffer: D2D - load_block_sample_from_cpu, load_block_full_from_cpu: H2D Add bench_offload.py integration for memory stats printing. Benchmark results (Llama-3.1-8B, 64K context): - Full Policy: Prefill H2D 262.13 GB - XAttention: Prefill H2D 386.62 GB (1.48x) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-28 04:06:45 +08:00
Zijie Tian	c16bfcf40f	♻️ refactor: restructure Observer as base class with InferenceObserver - Refactor Observer into base class with common enable/disable/reset interface - Create InferenceObserver subclass for TTFT/TPOT metrics - Fix TTFT calculation timing: compute after prefill completes instead of at decode start (fixes max_tokens=1 returning TTFT=0) - Integrate InferenceObserver into bench.py and bench_offload.py for accurate internal timing metrics vs external wall-clock time - Add get_summary() and print_summary() methods for structured output Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-28 03:15:33 +08:00
Zijie Tian	b8b6478506	[feat] Needs to be optimized with async prefetch.	2025-12-15 06:58:40 +08:00
Zijie Tian	1081ab51ea	[refactor] Refactor offload code to multi-chunk.	2025-12-15 01:13:58 +08:00
Zijie Tian	9b8165af5a	[fix] Fixed kvcache offload problem.	2025-12-12 01:35:30 +08:00
Zijie Tian	babfa17354	[refactor] Translate into English, avoid Chinese due to Claude.	2025-12-11 00:30:24 +08:00
Zijie Tian	e85c2b4776	[fix] Fixed kvcache offload bugs.	2025-12-10 22:34:00 +08:00
Zijie Tian	0a247ccb1b	[feat] Added `num_gpu_blocks` to limit GPU blocks.	2025-12-10 20:17:42 +08:00
Zijie Tian	01f19ee4a6	[feat] Added logger into nanovllm.	2025-12-10 19:53:38 +08:00
Zijie Tian	0b6f19242d	[feat] Added chunked prefill and kvcache offload mechanism.	2025-12-10 03:47:37 +08:00
Zijie Tian	204fe2b38f	[feat] Added metric into tqdm bar.	2025-12-10 00:52:13 +08:00
GeeeekExplorer	cde3fc22c2	simplify	2025-06-21 17:19:15 +08:00
GeeeekExplorer	bc0ad5a116	better	2025-06-17 23:33:38 +08:00
GeeeekExplorer	fc778a4da9	better	2025-06-15 10:36:45 +08:00
GeeeekExplorer	98a1551a7d	support CUDA_VISIBLE_DEVICES	2025-06-12 23:14:01 +08:00
GeeeekExplorer	fee58d44e4	fix	2025-06-12 01:00:31 +08:00
GeeeekExplorer	08c84ec08d	multi file loader	2025-06-12 01:00:09 +08:00
GeeeekExplorer	a5a4909e6a	init commit	2025-06-10 00:27:01 +08:00

21 Commits