nano-vllm

Author	SHA1	Message	Date
Zijie Tian	52b12a89e3	📋 docs: add changelog for 2026-02-05 Document today's changes: - GQA buffer OOM fix (saves 16GB for 1M seq in offload mode) - Tests directory cleanup (removed 16 files, -4306 lines) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 03:16:39 +08:00
Zijie Tian	11a867f6fb	🐛 fix: skip GQA buffer allocation in XAttention offload mode In offload mode, GQA expansion buffers (_k_expanded, _v_expanded) are not needed since compute_chunked_prefill() handles GQA inline. Previously, these buffers were always allocated based on max_model_len, causing OOM on 24GB GPUs (e.g., RTX 3090) when max_model_len=1M (16GB buffer). Changes: - Add enable_cpu_offload parameter to alloc_policy_metadata() in base class - Skip GQA buffer allocation when enable_cpu_offload=True in XAttentionBSAPolicy - Pass enable_cpu_offload from model_runner to policy Memory savings: ~16GB for 1M seq, ~1.1GB for 72K seq Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 02:57:18 +08:00
Zijie Tian	af4da454ba	📊 docs: add XAttention offload profiling analysis for 32K context - Profile XAttn vs Full attention using nsys NVTX markers - Key finding: estimate (41%) + find_blocks (37%) dominate, compute only 21% - Chunk7 comparison: XAttn (38ms) vs Full (35ms) - XAttn slightly slower - Identify optimization opportunities: reduce find_blocks overhead, merge estimate passes Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-02-05 02:49:59 +08:00
Zijie Tian	ef37d4f1a8	🐛 docs: document XAttention offload GQA buffer OOM issue Document OOM issue when using XAttention BSA + CPU offload with large models (GLM-4-9B) on 24GB GPUs. Issue: 8GB allocation for k_expanded buffer fails due to using num_heads instead of num_kv_heads in GQA models. Root cause analysis and proposed fix included. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 02:46:50 +08:00
Zijie Tian	c8a5ef04c0	📝 docs: add test_ruler.py usage guide and rule - Add comprehensive test_ruler.py usage guide with verified commands - Add .claude/rules/test-ruler.md to enforce documentation-first approach - Update CLAUDE.md documentation index Tested commands on RTX 3090 (GPU 4): - 32K/64K offload + XAttn BSA - Multi-dataset, JSON output, quiet mode - GLM-4 model support Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 02:46:44 +08:00
Zijie Tian	54fd302fa8	📝 docs: add XAttention density alignment verification results - Add verification doc comparing GPU-only vs Offload mode density - Test results: 32K (0.37% diff), 64K (0.09% diff) - alignment successful - Both modes achieve 100% accuracy on RULER niah_single_1 Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-02-05 01:59:11 +08:00
Zijie Tian	1eb7521994	📝 docs: add XAttention density types documentation Document the difference between compute density (BSA block level) and communication density (CPU block level). Key finding: Even with 37% compute density, comm density can be 100% due to any() aggregation across heads/Q-positions spreading sparse blocks across all CPU blocks. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 01:44:11 +08:00
Zijie Tian	1ea5afd886	📝 docs: add XAttention offload stream sync fix documentation - Document the CUDA stream synchronization bug in XAttention BSA - Include root cause analysis with stream timing diagrams - Add test commands and verification results (100% accuracy) - Update CLAUDE.md documentation index Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 01:32:50 +08:00
Zijie Tian	a1c68a733e	📊 docs: add XAttention memory benchmark for 24GB GPUs - Add memory analysis for Qwen3-0.6B @ 32K context - Document 24GB VRAM feasibility (RTX 3090/4090) - Recommend gpu-utilization=0.28 for 24GB GPUs - Include KV cache breakdown and model estimations - Update CLAUDE.md index Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-02-02 14:38:27 +08:00
Zijie Tian	dc51972777	📝 docs: update density alignment test with Offload mode results - Rename doc to "Density Alignment Test Results" (covers both modes) - Add Offload mode test results (3.7K-64.9K tokens, all passed) - Add Layer 5 GPU-only test results (threshold=0.9, density=6.24%) - Enhance test script to support both GPU-only and Offload data formats - Add batch testing commands for all data files - Update CLAUDE.md index Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-02-02 14:22:40 +08:00
Zijie Tian	232fcf043e	📝 docs: add GPU-only density alignment test results Document test results verifying XAttention density calculation in GPU-only mode matches independent xattn_estimate calls. Test results (Llama-3.1-8B-Instruct, threshold=0.9): - 4k: Layer 0 density 63.8%, verified ✅ - 8k: Layer 0 density 65.0%, verified ✅ - 16k: Layer 0 density 61.6%, verified ✅ - 32k: Layer 0 density 50.2%, verified ✅ - 64k: Layer 0 density 37.0%, verified ✅ All tests show exact match (attn_sums diff=0, mask exact match). Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-02-02 11:22:34 +08:00
Zijie Tian	6e34efd58a	📝 docs: add storage overhead analysis and batch tests for KV chunking - Update xattn_kv_chunking_kernels.md with: - Detailed storage overhead analysis (O(S) vs O(S²)) - Peak memory optimization (8x reduction) - Support for independent Q/KV chunk sizes - Batch verification results (3K-64K seqlen) - ASCII pipeline diagram - Add test_xattn_kv_chunking_batch.py for batch validation - Fix causal mask post-processing in alignment test - Update CLAUDE.md documentation index Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-02-01 19:22:36 +08:00
Zijie Tian	5acd5558d6	feat: add KV chunking support for XAttention softmax kernels Implement three-phase KV chunking for sparse attention estimation: 1. softmax_compute_partial_stats: compute (m, l) per KV chunk 2. merge_softmax_stats: merge partial stats on host 3. softmax_normalize_and_block_sum: normalize with global stats This allows computing sparse attention masks without storing full raw attention scores in GPU memory, reducing peak memory usage from O(q_len * k_full_len) to O(q_len * k_chunk_len). Key changes: - Add softmax_partial_stats_kernel with causal mask support - Add softmax_normalize_block_sum_kernel with kv_offset parameter - Add Python wrappers for new kernels - Update test script to validate KV chunking alignment - Add documentation for the new kernels Test results show perfect alignment with xattn_estimate API: - Density difference: 0.000000 - Mask difference: 0.0044% Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-02-01 18:53:26 +08:00
Zijie Tian	8035e4db3d	📝 docs: add XAttention KV chunking density test results Document the verification test for XAttention Triton kernel KV chunking: - 32K and 64K test results with threshold 0.9/0.95/1.0 - Key finding: threshold=1.0 achieves alignment (~0% diff) - threshold<1.0 shows 10-13% difference due to per-chunk threshold application - Conclusion: softmax normalization is correct, issue is threshold accumulation Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-02-01 17:36:19 +08:00
Zijie Tian	f6ac4ccdde	✨ feat: add DensityObserver for XAttention sparse attention density tracking - Add DensityObserver class to track per-layer density statistics - Integrate DensityObserver into compute_prefill for GPU-only mode - Fix stride parameter not being passed to xattn_estimate - Add density statistics output to test_ruler.py for XATTN_BSA - Add comprehensive density benchmark documentation Key changes: - nanovllm/utils/density_observer.py: New Observer for density tracking - xattn_bsa.py: Add stride param to xattn_estimate, integrate DensityObserver - test_ruler.py: Enable DensityObserver and print summary for XATTN_BSA - docs/xattn_density_benchmark.md: Benchmark results for 4K-32K contexts Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-30 16:26:56 +08:00
Zijie Tian	a239bfb40d	📚 docs: add new model integration guide Summarizes lessons learned from GLM-4 integration: - Config field mapping (multi_query_group_num, kv_channels, etc.) - RoPE variants (interleaved vs half, partial vs full rotation) - EOS token handling for multi-EOS models - Weight name conversion patterns - Verification checklist Also updates CLAUDE.md to reflect GLM-4 support. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 13:36:24 +08:00
Zijie Tian	8d19e61446	⚡️ perf: replace Triton merge with FlashInfer merge_state Use FlashInfer's optimized merge_state kernel for attention output merging in chunked prefill. End-to-end improvement: +0.8% (32K) to +2.4% (64K). Key changes: - Add merge_attention_outputs_flashinfer() with LSE format conversion - FlashInfer uses log2, flash_attn uses ln: convert via LOG2_E/LN_2 - Keep original Triton kernel for fallback Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 10:04:38 +08:00
Zijie Tian	4484ebbb77	📚 docs: add 1M+ context length models reference list - Add comprehensive list of 1M+ context models from Hugging Face - Categorize by type: text-only LLM vs vision-language models - Separate ≤10B (practical) from >10B (resource-intensive) models - Include Qwen, GLM, InternLM, Llama, MiniMax, Gradient AI series - Add VRAM requirements and technical comparison table - Update CLAUDE.md documentation index Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 09:04:55 +08:00
Zijie Tian	2c2383c786	⚡️ perf: optimize XAttention estimate with hierarchical block sum Replace slow softmax_fuse_block_sum (block_size=4096) with optimized hierarchical approach (estimate_block_size=1024): - Add estimate_block_size parameter to XAttentionBSAPolicy (default 1024) - Rewrite select_blocks to use hierarchical aggregation: 1. Fine-grained softmax with small block size (15x faster kernel) 2. Aggregate to CPU block level via reshape + sum 3. Score + threshold selection (replaces mask + voting) Performance improvement (CPU Offload mode): - softmax_fuse_block_sum: 48% → 1% of total time (44x faster) - 128K: XAttention now +2.4% faster than Full (was -59%) - 64K: -3.8% (was -21%) - 32K: -6.0% (was -14%) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-28 06:47:13 +08:00
Zijie Tian	c90dc196b2	📝 docs: add estimate block_size performance analysis Document the performance impact of block_size on softmax_fuse_block_sum: - Current 4096 (reshaped 512) is the WORST point: 95ms - Optimal 1024 (reshaped 128): 6ms - 15x faster - Performance follows U-shaped curve Add tests/bench_estimate_block_size.py for benchmarking and propose hierarchical block sum approach for optimization. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-28 06:24:28 +08:00
Zijie Tian	3da9b8aef2	⚡️ perf: optimize XAttention estimate phase with K-only loading Add load_k_only_to_slot_layer() to OffloadEngine for estimate phase: - Only load K (not K+V) during block selection in select_blocks() - Reduces H2D transfer by 50% in estimate phase - 64K context: XAttn/Full ratio drops from 1.48x to 0.99x - 32K context: XAttn/Full ratio drops from 1.67x to 1.20x The estimate phase uses flat_group_gemm_fuse_reshape(Q, K) which only requires K for attention score computation. V is unused. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-28 06:24:20 +08:00
Zijie Tian	39d12a0416	📈 feat: add MemoryObserver for GPU-CPU communication tracking Implement MemoryObserver to track memory transfers between GPU and CPU: - H2D (Host to Device): CPU → GPU transfers - D2H (Device to Host): GPU → CPU transfers - D2D (Device to Device): GPU buffer copies - Supports prefill/decode phase separation Integration points in offload_engine.py: - load_to_slot_layer: H2D with is_prefill parameter - offload_slot_layer_to_cpu, offload_prefill_buffer_async: D2H - write_to_prefill_buffer, write_to_decode_buffer: D2D - load_block_sample_from_cpu, load_block_full_from_cpu: H2D Add bench_offload.py integration for memory stats printing. Benchmark results (Llama-3.1-8B, 64K context): - Full Policy: Prefill H2D 262.13 GB - XAttention: Prefill H2D 386.62 GB (1.48x) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-28 04:06:45 +08:00
Zijie Tian	f3e4611e3b	📝 docs: add XAttention performance analysis documentation Add comprehensive performance analysis for XAttention: - NVTX marker locations and usage - Block size impact on offload mode (4096 vs 1024) - Detailed timing breakdown for estimate vs compute phases - softmax_fuse_block_sum_kernel analysis - Optimization recommendations Key findings: - block_size=4096 is 2x faster than 1024 for 64K context - find_blocks_chunked is bottleneck (40%) at block_size=4096 - estimate_gemm becomes bottleneck (24%) at block_size=1024 Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-28 00:57:20 +08:00
Zijie Tian	e874229adc	📝 docs: add comprehensive GPU-only vs Offload benchmark results - Add --block-size argument to bench.py for configurable KV cache block size - Update bench_offload_results.md with complete benchmark analysis: - GPU-only: XAttention shows +15% to +41% speedup - CPU Offload: XAttention shows -14% to -59% slowdown - Block size 4096 recommended for best performance - Document why XAttention hurts Offload mode (transfer bottleneck) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 22:32:07 +08:00
Zijie Tian	4fe7dfb239	🔀 merge: integrate tzj/minference-exp (GPU-only sparse attention) Merge GPU-only sparse attention support from tzj/minference-exp branch: GPU-only mode additions: - Add compute_prefill/compute_decode methods to SparsePolicy base class - Add GPU-only attention routing in attention.py - Add alloc_policy_metadata() for pre-allocating GQA buffers - Add XAttention + BSA sparse attention for GPU-only prefill - Add kvcache_manager to set_context() for policy access bench.py enhancements: - Add --model argument for configurable model path - Add --policy argument (full, xattn) for sparse policy selection - Add --enable-policy flag for FullAttentionPolicy routing - Add --enforce-eager option to disable CUDA graphs - Add --gpu-util option for GPU memory utilization Documentation: - Add gpu_only_xattn_guide.md with performance analysis - Add gpu_only_sparse_integration.md baseline document - Add gpu-vram-requirement.md rule for GPU-only mode Both CPU offload and GPU-only paths are preserved and functional. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-27 09:25:36 +08:00
Zijie Tian	6da116de98	📝 docs: add GPU-Only XAttention guide with performance analysis Add comprehensive documentation for GPU-only XAttention BSA mode: - Architecture design and SparsePolicy interface - Memory pre-allocation mechanism (alloc_policy_metadata) - Performance analysis: 32K +15%, 64K +41% vs baseline - CUDA Graph limitations explanation (variable seq_len in prefill) - nsys profiling tools usage guide Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 07:21:46 +08:00
Zijie Tian	0d31b3f71f	📝 docs: add CPU offload optimization strategies guide - Document chunk size optimization (simplest, most effective) - Analyze CUDA Graph limitations for offload scenarios - Cover CUDA Graph applicability for MLP/Proj layers - Survey frontier research: InfiniGen, ShadowKV, L2 Prefetch, KVPR - Add optimization priority recommendations Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 04:44:36 +08:00
Zijie Tian	05ce57ee8e	📝 docs: add GPU-only sparse policy integration baseline Document baseline performance before integrating sparse attention to GPU-only mode: - GPU-only Full Attention: 4869 tok/s (32K prefill) - CPU Offload Full Attention: 1500 tok/s (3.2x slower) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 04:36:31 +08:00
Zijie Tian	73c9dc46ff	✨ feat: add XAttention BSA support to bench_offload.py - Add --model parameter (default: Llama-3.1-8B-Instruct) - Add --enable-xattn flag for XAttention BSA sparse prefill - Add --xattn-threshold and --xattn-stride parameters - Change default num-gpu-blocks from 6 to 4 - Add benchmark results doc with Full vs XAttn comparison (32K/128K) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 04:20:16 +08:00
Zijie Tian	0619accd1c	📝 docs: add CPU scheduling latency analysis for chunked attention - Document kernel gap analysis showing 77-81% CPU scheduling overhead - Identify GPU utilization at 12.8% with potential to reach 39.5% - Outline optimization directions: CUDA Graph, Triton fusion, C++ extension - Add documentation index entry in CLAUDE.md Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 03:42:12 +08:00
Zijie Tian	3100724666	📝 docs: add nsys wrong event order bug investigation - Document ring buffer pipeline triggering nsys timestamp bug - Update profile_offload.sh to use test_ruler.py with options - Add reference to new doc in CLAUDE.md Root cause: 4-slot ring buffer pipeline (4 transfer streams + 1 compute stream) triggers event ordering bug in nsys < 2024.2 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 04:32:05 +08:00
Zijie Tian	da5e13e2bb	📝 docs: update XAttention BSA Policy with benchmarks and memory management Add new sections to xattn_bsa_policy_design.md: - Performance benchmarks: 128K context comparison (Full vs XAttn BSA) - Density trend analysis across chunks - Memory leak issue and fix (64GB -> 4GB reduction) - Memory monitoring guide with gpu-monitor agent - Density statistics API documentation - Known issues and optimization directions Update CLAUDE.md description to reflect new content. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 09:35:18 +08:00
Zijie Tian	ad361c2c3b	📝 docs: add XAttention BSA Policy design documentation - Create docs/xattn_bsa_policy_design.md with: - Algorithm overview and data flow diagram - select_blocks implementation details - GQA-aware aggregation and majority voting - compute_chunked_prefill ring buffer pipeline - Parameter configuration and usage examples - Performance characteristics and limitations - Update CLAUDE.md documentation index Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 08:36:56 +08:00
Zijie Tian	edc006463b	docs: add XAttention kernels guide - Document flat_group_gemm_fuse_reshape and softmax_fuse_block_sum kernels - Explain anti-diagonal sum principle and stride sampling - Add GPU-specific BLOCK_M/N constraints (RTX 3090 vs A100) - Show Q/K can have different lengths (chunked prefill support) - Update CLAUDE.md with doc reference Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 03:22:25 +08:00
Zijie Tian	bc92c1fdb8	feat: add xattn_estimate_chunked for chunked prefill support - Add xattn_estimate_chunked function ported from COMPASS - Support chunked prefill with q_start_pos parameter - Ensure 100% consistency with standard xattn_estimate when using matching chunk_size parameter - Add test and documentation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 01:13:17 +08:00
Zijie Tian	5d722968ff	[docs] Added cuda_graph_guide.md	2026-01-21 21:56:24 +08:00
Zijie Tian	42cf124343	📝 docs: add CUDA Graph memory mechanism guide Document CUDA Graph memory behavior based on actual testing: - Memory overhead at each stage (model, cache, warmup, capture, replay) - StaticCache is the main overhead (~144MB for 1K tokens) - Graph capture adds minimal overhead (~8MB) - Graph replay requires zero additional allocation - Performance improvement: ~2.8x decode throughput Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 02:59:21 +08:00
Zijie Tian	78050aef9f	🐛 fix: resolve CPU KV cache state leakage between requests Root Cause: - OffloadEngine.reset() cleared GPU buffers but NOT CPU cache - Previous request's KV cache data persisted in CPU memory, contaminating subsequent requests Fixes: - Add k_cache_cpu.zero_() and v_cache_cpu.zero_() to OffloadEngine.reset() - Add clear_decode_tracking(seq) call in HybridKVCacheManager.deallocate() Results: - niah_single_1 accuracy improved from ~80% to 94% (+14%) - Remaining ~6% errors are model limitations, not state leakage Also: - Update docs/ruler_32k_chunked_offload_issue.md with fix details - Remove debug planning files (findings.md, progress.md, task_plan.md) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 01:12:21 +08:00
Zijie Tian	1ab4676396	♻️ refactor: consolidate RULER test files and document root cause - test_ruler.py: add --fresh-llm, --sample-indices, --json-output options - test_ruler.py: consolidate test_ruler_single_sample.py, test_ruler_sequential.py, test_ruler_samples.py - docs: update chunked offload issue with root cause (state leakage confirmed) - docs: add single-sample test results showing 100% accuracy for niah_single_1 Deleted redundant test files: - tests/test_ruler_single_sample.py - tests/test_ruler_sequential.py - tests/test_ruler_samples.py Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 23:41:17 +08:00
Zijie Tian	6180055ed8	📝 docs: add chunked attention solutions guide and update doc index Add comprehensive documentation analyzing the 32K chunked offload accuracy issues with proposed solutions covering LSE precision, ring buffer state management, and position encoding validation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 04:48:20 +08:00
Zijie Tian	4cbd451af7	📝 docs: add BSA interface documentation and cleanup temp files - Add docs/block_sparse_attn_interface.md with BSA function signatures - Update CLAUDE.md documentation index - Remove obsolete DEBUG_SUMMARY.md and test_report_sparse_policy_refactor.md - Add notes.md to .gitignore Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 04:27:19 +08:00
Zijie Tian	e440c45e73	📝 docs: add XAttention algorithm guide based on COMPASS implementation - Create docs/xattention_algorithm_guide.md with detailed algorithm explanation - Stride reshape (inverse mode) for Q/K interleaved sampling - Triton kernels: flat_group_gemm_fuse_reshape, softmax_fuse_block_sum - Block selection via find_blocks_chunked with cumulative threshold - BSA (block_sparse_attn) dependency for sparse computation - Update docs/sparse_attention_guide.md XAttention section with accurate description - Add documentation index entry in CLAUDE.md Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 02:50:03 +08:00
Zijie Tian	07f5220f40	Merge branch 'tzj/minference' of ssh://git.zijie-tian.site:2222/zijie-tian/nano-vllm into tzj/minference	2026-01-20 02:27:10 +08:00
Zijie Tian	37aecd4d52	📝 docs: add SparsePolicy implementation guide and update rules - Create docs/sparse_policy_implementation_guide.md with comprehensive guide - Rewrite .claude/rules/sparse-policy.md with mandatory base class requirements - Add new doc reference to CLAUDE.md documentation index Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 02:25:46 +08:00
Zijie Tian	b1f292cf22	Merge branch 'tzj/minference' of ssh://git.zijie-tian.site:2222/zijie-tian/nano-vllm into tzj/minference	2026-01-20 02:16:39 +08:00
Zijie Tian	16fbcf9e4c	docs: add RULER 32K chunked offload issue documentation - Document accuracy degradation issue in 32K context with chunked offload - Add detailed hypothesis analysis and debugging approach - Include 4-slot ring buffer experiment results Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 02:16:21 +08:00
Zijie Tian	fa7601f4b8	♻️ refactor: remove cross-layer pipeline and rename compute_chunked_prefill - Remove cross-layer pipeline from OffloadEngine (saves ~1GB GPU memory for long sequences) - Delete layer_k/v_buffer_a/b double buffers - Remove start_decode_pipeline, get_decode_layer_kv, end_decode_pipeline methods - Remove pipeline state tracking variables - Simplify decode to use ring buffer pipeline only (more efficient for long sequences) - Rename compute_chunked_attention → compute_chunked_prefill for clarity - Add mandatory needle test requirements: --enable-offload --input-len 32768 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 02:10:40 +08:00
Zijie Tian	e5a17c832c	📝 docs: add SparsePolicy architecture documentation Add comprehensive documentation for the SparsePolicy abstraction: - SparsePolicy base class and abstract methods - FullAttentionPolicy prefill/decode flow - Ring buffer and cross-layer pipeline modes - Code conventions and testing guidelines Update CLAUDE.md documentation index with reference. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 01:36:09 +08:00
Zijie Tian	b5da802dff	[WIP] Before integrating the xattn operator.	2026-01-19 21:19:21 +08:00
Zijie Tian	e6e0dc5d7d	✨ feat: add comprehensive RULER benchmark testing - Add test_ruler.py from tzj/vs_offload branch with 13 RULER tasks - Add comprehensive documentation for RULER benchmark results - Update CLAUDE.md with new documentation index entry - Add architecture, debugging, optimization, and known issues guides - Test 32K context with CPU offload: 92.3% accuracy across all tasks - Parallel execution on 4 GPUs with detailed performance metrics Benchmark results: - 13 RULER tasks total (niah_single, multikey, multiquery, multivalue, qa, cwe, fwe, vt) - 26 samples tested with 92.3% overall accuracy - CPU offload stable at 32K context length - Parallel GPU execution achieving 4x speedup Key findings: - Single needle tasks: 100% accuracy - Multi-value and recall tasks: 100% accuracy - Multi-query tasks: 50% accuracy (most challenging) - QA tasks: 100% accuracy - Total execution time: ~220 seconds (parallel)	2026-01-18 20:34:06 +08:00
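
The three-phase KV chunking described in commit 5acd5558d6 (partial stats per KV chunk, a merge of those stats, then normalization with the global stats followed by a block sum) can be illustrated with a small pure-Python sketch. This is not the repository's Triton code: the function names, signatures, and list-of-lists layout below are hypothetical, the causal mask and kv_offset handling are omitted, and the chunk size is assumed to be a multiple of the block size.

```python
import math

def partial_stats(scores_chunk):
    """Phase 1: per-row (m, l) for one KV chunk, where m is the row max
    and l is the sum of exp(x - m) over the chunk."""
    stats = []
    for row in scores_chunk:
        m = max(row)
        l = sum(math.exp(x - m) for x in row)
        stats.append((m, l))
    return stats

def merge_stats(per_chunk_stats):
    """Phase 2: merge the per-chunk (m, l) pairs into global per-row (m, l)."""
    merged = []
    for r in range(len(per_chunk_stats[0])):
        m_glob = max(stats[r][0] for stats in per_chunk_stats)
        # Rescale each chunk's partial sum to the global max before adding.
        l_glob = sum(stats[r][1] * math.exp(stats[r][0] - m_glob)
                     for stats in per_chunk_stats)
        merged.append((m_glob, l_glob))
    return merged

def normalize_and_block_sum(scores_chunk, global_stats, block):
    """Phase 3: normalize one chunk with the global stats, then sum the
    resulting probabilities over consecutive KV blocks of size `block`."""
    out = []
    for row, (m, l) in zip(scores_chunk, global_stats):
        probs = [math.exp(x - m) / l for x in row]
        out.append([sum(probs[i:i + block]) for i in range(0, len(probs), block)])
    return out

def chunked_block_sums(scores, chunk, block):
    """Run all three phases; only one KV chunk of scores is touched at a time."""
    k_len = len(scores[0])
    chunks = [[row[c:c + chunk] for row in scores] for c in range(0, k_len, chunk)]
    stats = merge_stats([partial_stats(ch) for ch in chunks])
    sums = [[] for _ in scores]
    for ch in chunks:
        for r, blocks in enumerate(normalize_and_block_sum(ch, stats, block)):
            sums[r].extend(blocks)
    return sums

def reference_block_sums(scores, block):
    """Unchunked reference: full-row softmax followed by the same block sum."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        probs = [e / total for e in exps]
        out.append([sum(probs[i:i + block]) for i in range(0, len(probs), block)])
    return out
```

In this sketch the chunked path and the unchunked reference agree up to floating-point rounding, because the merged (m, l) pair is mathematically identical to the statistics of the full row. The memory saving comes from phases 1 and 3 each needing only one chunk of scores at a time.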


51 Commits