nano-vllm

Author	SHA1	Message	Date
Zijie Tian	51bd678335	📊 feat: distinguish compute density and communication density in DensityObserver - Add record_comm_density() call in select_blocks to track CPU block selection - Add get_per_layer_comm_density() method for detailed analysis - Update print_summary() to show both densities and H2D savings ratio - Set DensityObserver mode (offload/gpu_only) in test_ruler.py - Update get_summary() to return both density types Key insight: Comm density can be 100% even when compute density is ~37% because sparse BSA blocks are distributed across all CPU blocks. Since CPU block granularity is 32x coarser (4096 vs 128 tokens), any() aggregation across heads/Q-blocks results in all CPU blocks being needed. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 01:43:17 +08:00
Zijie Tian	f6ac4ccdde	✨ feat: add DensityObserver for XAttention sparse attention density tracking - Add DensityObserver class to track per-layer density statistics - Integrate DensityObserver into compute_prefill for GPU-only mode - Fix stride parameter not being passed to xattn_estimate - Add density statistics output to test_ruler.py for XATTN_BSA - Add comprehensive density benchmark documentation Key changes: - nanovllm/utils/density_observer.py: New Observer for density tracking - xattn_bsa.py: Add stride param to xattn_estimate, integrate DensityObserver - test_ruler.py: Enable DensityObserver and print summary for XATTN_BSA - docs/xattn_density_benchmark.md: Benchmark results for 4K-32K contexts Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-30 16:26:56 +08:00
Zijie Tian	e436ec861f	⚙️ config: update test_ruler.py defaults - max_new_tokens: 128 → 16 (sufficient for NIAH answers) - block_size: 1024 → 4096 (better performance) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 14:21:23 +08:00
Zijie Tian	45efcf0db1	✨ feat: add --dtype parameter to test_ruler.py Support models with float32 default dtype (e.g., Nemotron). FlashAttention requires fp16/bf16, so dtype must be specified. Usage: --dtype bfloat16 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 13:56:15 +08:00
Zijie Tian	726e4b58cf	✨ feat: add GLM-4-9B-Chat-1M model support Add support for GLM-4 model architecture with the following changes: - Add glm4.py with ChatGLMForCausalLM, GLM4Model, GLM4Attention, GLM4MLP - Add GLM4RotaryEmbedding with interleaved partial rotation (rotary_dim = head_dim // 2) - Add apply_rotary_emb_interleaved function for GLM-4 style RoPE - Add GLM-4 weight name conversion and loading in loader.py - Add GLM-4 chat template conversion in test_ruler.py - Add trust_remote_code=True for GLM-4 config loading Key GLM-4 specific adaptations: - QKV bias enabled (add_qkv_bias: true) - RoPE with rope_ratio scaling (base = 10000 * rope_ratio) - Interleaved RoPE (pairs adjacent elements, not first/second half) - Partial rotation (only half of head_dim is rotated) - Uses multi_query_group_num instead of num_key_value_heads - Uses kv_channels instead of head_dim - Uses ffn_hidden_size instead of intermediate_size Tested with RULER niah_single_1 (5 samples): 100% accuracy Both GPU-only and CPU offload modes verified Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 13:15:57 +08:00
Zijie Tian	7c41032a2e	✨ feat: add configurable stride and chunk_size for XAttention BSA - Add sparse_chunk_size config option (default: 16384) - Pass stride, chunk_size, use_triton through factory function - Add --sparse-stride CLI option to test_ruler.py Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 10:37:04 +08:00
Zijie Tian	1ab4676396	♻️ refactor: consolidate RULER test files and document root cause - test_ruler.py: add --fresh-llm, --sample-indices, --json-output options - test_ruler.py: consolidate test_ruler_single_sample.py, test_ruler_sequential.py, test_ruler_samples.py - docs: update chunked offload issue with root cause (state leakage confirmed) - docs: add single-sample test results showing 100% accuracy for niah_single_1 Deleted redundant test files: - tests/test_ruler_single_sample.py - tests/test_ruler_sequential.py - tests/test_ruler_samples.py Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 23:41:17 +08:00
Zijie Tian	b1f292cf22	Merge branch 'tzj/minference' of ssh://git.zijie-tian.site:2222/zijie-tian/nano-vllm into tzj/minference	2026-01-20 02:16:39 +08:00
Zijie Tian	b5da802dff	[WIP] Before integrating the xattn operator.	2026-01-19 21:19:21 +08:00
Zijie Tian	50520a6c3c	[fix] fixed request-to-request error.	2026-01-19 00:55:26 +08:00
Zijie Tian	e6e0dc5d7d	✨ feat: add comprehensive RULER benchmark testing - Add test_ruler.py from tzj/vs_offload branch with 13 RULER tasks - Add comprehensive documentation for RULER benchmark results - Update CLAUDE.md with new documentation index entry - Add architecture, debugging, optimization, and known issues guides - Test 32K context with CPU offload: 92.3% accuracy across all tasks - Parallel execution on 4 GPUs with detailed performance metrics Benchmark results: - 13 RULER tasks total (niah_single, multikey, multiquery, multivalue, qa, cwe, fwe, vt) - 26 samples tested with 92.3% overall accuracy - CPU offload stable at 32K context length - Parallel GPU execution achieving 4x speedup Key findings: - Single needle tasks: 100% accuracy - Multi-value and recall tasks: 100% accuracy - Multi-query tasks: 50% accuracy (most challenging) - QA tasks: 100% accuracy - Total execution time: ~220 seconds (parallel)	2026-01-18 20:34:06 +08:00

11 Commits
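
The "key insight" in commit 51bd678335 (communication density can reach 100% while compute density stays low) can be illustrated with a small, self-contained sketch. It is not the repository's DensityObserver. The function names, the nested-list mask layout (heads x Q-blocks x K-blocks), and the toy selection pattern are assumptions made for illustration, and causal masking is ignored. Only the two block sizes (128 and 4096 tokens) and the any() aggregation across heads and Q-blocks come from the commit message.

```python
BSA_BLOCK = 128                   # tokens per sparse-attention block (from the commit message)
CPU_BLOCK = 4096                  # tokens per CPU offload block (from the commit message)
RATIO = CPU_BLOCK // BSA_BLOCK    # 32 sparse blocks fit in one CPU block


def compute_density(mask):
    """Fraction of (head, Q-block, K-block) entries that are selected."""
    total = 0
    selected = 0
    for head in mask:
        for row in head:
            total += len(row)
            selected += sum(row)
    return selected / total


def comm_density(mask, ratio=RATIO):
    """Fraction of CPU blocks that must be transferred.

    A CPU block is needed if ANY head and ANY Q-block selects at least
    one of the sparse K-blocks it contains (any() aggregation).
    """
    num_k_blocks = len(mask[0][0])
    num_cpu_blocks = (num_k_blocks + ratio - 1) // ratio
    needed = [False] * num_cpu_blocks
    for head in mask:
        for row in head:
            for k, is_selected in enumerate(row):
                if is_selected:
                    needed[k // ratio] = True
    return sum(needed) / num_cpu_blocks


def toy_mask(num_heads=4, num_q=8, num_k=256, period=3):
    """Hypothetical pattern: each row keeps every `period`-th K-block,
    shifted by head and Q-block index, so selections are spread out."""
    return [
        [[(k + h + q) % period == 0 for k in range(num_k)] for q in range(num_q)]
        for h in range(num_heads)
    ]


mask = toy_mask()
print(f"compute density: {compute_density(mask):.2%}")
print(f"comm density:    {comm_density(mask):.2%}")
```

In this toy pattern roughly one third of the sparse blocks are selected, yet every 4096-token CPU block contains at least one selected 128-token block, so all CPU blocks still have to be copied to the GPU. Communication density only drops when the selected blocks cluster inside a few CPU blocks.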
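
Commit 726e4b58cf describes GLM-4 style RoPE as interleaved (adjacent elements are paired), partial (only half of head_dim is rotated), and scaled through base = 10000 * rope_ratio. The following pure-Python sketch shows what those three properties mean for a single head vector. It is not the code in glm4.py: the function name, its signature, and the example values are assumptions, and the frequency formula is the standard RoPE one rather than something confirmed by the log.

```python
import math


def rotate_interleaved_partial(x, position, rotary_dim, rope_ratio=1.0):
    """Rotate the first `rotary_dim` elements of one head vector.

    Pairs are adjacent elements (x[0], x[1]), (x[2], x[3]), ...
    rather than (first half, second half). Elements at index
    >= rotary_dim pass through unchanged (partial rotation).
    """
    base = 10000.0 * rope_ratio          # scaling rule from the commit message
    out = list(x)
    for i in range(0, rotary_dim, 2):
        # Standard RoPE frequency for pair index i // 2 (assumed here).
        inv_freq = base ** (-i / rotary_dim)
        angle = position * inv_freq
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = x[i], x[i + 1]
        out[i] = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out


head_dim = 8
rotary_dim = head_dim // 2               # only half of head_dim is rotated
vec = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rotated = rotate_interleaved_partial(vec, position=5, rotary_dim=rotary_dim)
print(rotated[:rotary_dim])              # rotated part
print(rotated[rotary_dim:])              # untouched part
```

Because each pair undergoes a plain 2D rotation, the length of every (x[i], x[i+1]) pair is preserved, and position 0 leaves the vector unchanged. Both properties make convenient sanity checks when comparing against a reference implementation.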