Test that xattn_estimate produces the same results as manually calling:
- flat_group_gemm_fuse_reshape
- softmax_fuse_block_sum
- find_blocks_chunked
Uses real KV cache data from results/kvcache/ directory.
Verifies density calculation matches between high-level API and kernels.
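A minimal sketch of the equivalence check, with hypothetical import paths and shapes (the kernel names are real; their signatures and the dump filenames here are guesses):

```python
import torch
# Hypothetical import path; the real kernels live in the xattn sources.
from xattn.kernels import (xattn_estimate, flat_group_gemm_fuse_reshape,
                           softmax_fuse_block_sum, find_blocks_chunked)

# Real dumps live under results/kvcache/; file names are illustrative.
q = torch.load("results/kvcache/q.pt", map_location="cuda")
k = torch.load("results/kvcache/k.pt", map_location="cuda")

# High-level API under test.
blocks_api, density_api = xattn_estimate(q, k, threshold=0.9, stride=16)

# Manual pipeline: strided scores -> block-summed softmax -> block selection.
scores = flat_group_gemm_fuse_reshape(q, k, stride=16)
block_sums = softmax_fuse_block_sum(scores, block_size=128)
blocks_manual = find_blocks_chunked(block_sums, threshold=0.9)

assert torch.equal(blocks_api, blocks_manual)
density_manual = blocks_manual.float().mean().item()
assert abs(density_api - density_manual) < 1e-6
```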
Document the verification test for XAttention Triton kernel KV chunking:
- 32K and 64K test results with threshold 0.9/0.95/1.0
- Key finding: threshold=1.0 achieves alignment (~0% diff)
- threshold<1.0 shows a 10-13% difference due to per-chunk threshold application
- Conclusion: softmax normalization is correct; the issue is per-chunk threshold accumulation (illustrated below)
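A toy illustration of the accumulation effect, in pure torch with invented numbers: thresholding each chunk's renormalized mass independently keeps more blocks than one global pass.

```python
import torch

# Block-level attention mass of one query row over 8 key blocks.
probs = torch.tensor([0.30, 0.25, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02])
threshold = 0.9

def blocks_to_reach(p: torch.Tensor, thr: float) -> int:
    """Smallest top-k prefix whose cumulative mass reaches thr."""
    csum = p.sort(descending=True).values.cumsum(0)
    return int((csum < thr).sum().item()) + 1

print(blocks_to_reach(probs, threshold))  # global pass: 6 of 8 blocks

# Per-chunk: split into two 4-block chunks, renormalize, threshold each.
total = sum(blocks_to_reach(c / c.sum(), threshold) for c in probs.split(4))
print(total)  # 7 of 8 blocks -- strictly more than the global pass
```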
- Added analysis documentation for xattn density alignment.
- Refactored ModelRunner to pre-allocate policy metadata buffers regardless of CPU offload configuration.
- Updated FullAttentionPolicy and SparsePolicy to accept query and key tensors for block selection.
- Enhanced QuestPolicy to utilize query tensor for block selection and improved handling of selected blocks.
- Expanded XAttentionBSAPolicy to support chunked prefill and improved attention score computation with historical and current chunk handling.
- Introduced DensityObserver to track compute and communication density for sparse attention layers.
- Updated attention layer to ensure block selection is always called, improving robustness in first chunk scenarios.
- Added tests for attention kernel behavior with enhanced input patterns.
- Add DensityObserver class to track per-layer density statistics (see the sketch after this change list)
- Integrate DensityObserver into compute_prefill for GPU-only mode
- Fix stride parameter not being passed to xattn_estimate
- Add density statistics output to test_ruler.py for XATTN_BSA
- Add comprehensive density benchmark documentation
Key changes:
- nanovllm/utils/density_observer.py: New Observer for density tracking
- xattn_bsa.py: Add stride param to xattn_estimate, integrate DensityObserver
- test_ruler.py: Enable DensityObserver and print summary for XATTN_BSA
- docs/xattn_density_benchmark.md: Benchmark results for 4K-32K contexts
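A minimal sketch of what the observer could look like (only the class name comes from the change list above; the interface is an assumption):

```python
from collections import defaultdict

class DensityObserver:
    """Tracks per-layer density: selected blocks / total blocks per call."""

    def __init__(self):
        self.enabled = False
        self._stats = defaultdict(list)  # layer_id -> list of densities

    def record(self, layer_id: int, num_selected: int, num_total: int):
        if self.enabled and num_total > 0:
            self._stats[layer_id].append(num_selected / num_total)

    def print_summary(self):
        for layer, vals in sorted(self._stats.items()):
            print(f"layer {layer:3d}: mean density {sum(vals)/len(vals):.3f}")
```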
Support models with float32 default dtype (e.g., Nemotron).
FlashAttention requires fp16/bf16, so dtype must be specified.
Usage: --dtype bfloat16
Separate Qwen2 from Qwen3 implementation:
- Qwen2: Uses QKV bias, no QK norm
- Qwen3: No QKV bias, optional QK norm (see sketch below)
Tested with Qwen2.5-7B-Instruct-1M, RULER niah_single_1 passed.
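A condensed sketch of the structural split (hypothetical stub; real head counts, the GQA split, and norm placement come from the HF config and model code; nn.RMSNorm requires torch >= 2.4):

```python
import torch
from torch import nn

class QwenAttentionStub(nn.Module):
    """Qwen2: biased QKV projections, no QK norm.
    Qwen3: bias-free projections, per-head RMSNorm on q and k."""

    def __init__(self, hidden: int, heads: int, head_dim: int, is_qwen3: bool):
        super().__init__()
        bias = not is_qwen3
        self.q_proj = nn.Linear(hidden, heads * head_dim, bias=bias)
        self.k_proj = nn.Linear(hidden, heads * head_dim, bias=bias)
        self.q_norm = nn.RMSNorm(head_dim) if is_qwen3 else nn.Identity()
        self.k_norm = nn.RMSNorm(head_dim) if is_qwen3 else nn.Identity()
        self.heads, self.head_dim = heads, head_dim

    def project(self, x: torch.Tensor):
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.heads, self.head_dim)
        k = self.k_proj(x).view(b, s, self.heads, self.head_dim)
        return self.q_norm(q), self.k_norm(k)  # norms are Identity for Qwen2
```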
Summarizes lessons learned from GLM-4 integration:
- Config field mapping (multi_query_group_num, kv_channels, etc.)
- RoPE variants (interleaved vs half, partial vs full rotation)
- EOS token handling for multi-EOS models
- Weight name conversion patterns
- Verification checklist
Also updates CLAUDE.md to reflect GLM-4 support.
GLM-4 uses multiple EOS tokens [151329, 151336, 151338], where 151336
(<|user|>) should also stop generation. Previously, only the first EOS
token from the tokenizer was used, so generation always hit max_tokens
(a sketch of the fix follows the change list).
Changes:
- config.py: Change eos type to int | list[int]
- llm_engine.py: Read eos_token_id from hf_config (contains full list)
- scheduler.py: Use set for efficient multi-EOS lookup
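A sketch of the resulting stop check (class and field names here are illustrative, not the actual scheduler code):

```python
class StopChecker:
    def __init__(self, eos_token_id: int | list[int]):
        # Accept both the old single-int config and the full list from
        # hf_config; a set makes the per-token membership test O(1).
        ids = [eos_token_id] if isinstance(eos_token_id, int) else eos_token_id
        self.eos_ids = set(ids)  # e.g. {151329, 151336, 151338} for GLM-4

    def is_finished(self, last_token: int, n_out: int, max_tokens: int) -> bool:
        return last_token in self.eos_ids or n_out >= max_tokens
```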
Add support for GLM-4 model architecture with the following changes:
- Add glm4.py with ChatGLMForCausalLM, GLM4Model, GLM4Attention, GLM4MLP
- Add GLM4RotaryEmbedding with interleaved partial rotation (rotary_dim = head_dim // 2)
- Add apply_rotary_emb_interleaved function for GLM-4 style RoPE
- Add GLM-4 weight name conversion and loading in loader.py
- Add GLM-4 chat template conversion in test_ruler.py
- Add trust_remote_code=True for GLM-4 config loading
Key GLM-4 specific adaptations:
- QKV bias enabled (add_qkv_bias: true)
- RoPE with rope_ratio scaling (base = 10000 * rope_ratio)
- Interleaved RoPE (pairs adjacent elements, not first/second half; sketched after this list)
- Partial rotation (only half of head_dim is rotated)
- Uses multi_query_group_num instead of num_key_value_heads
- Uses kv_channels instead of head_dim
- Uses ffn_hidden_size instead of intermediate_size
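A pure-torch sketch of that interleaved partial rotation (shapes and the cos/sin layout are assumptions; the in-tree implementation may differ):

```python
import torch

def apply_rotary_emb_interleaved(x: torch.Tensor, cos: torch.Tensor,
                                 sin: torch.Tensor) -> torch.Tensor:
    """x: [..., head_dim]; only the first half (rotary_dim = head_dim // 2)
    is rotated, and rotation pairs adjacent elements (x0,x1), (x2,x3), ...
    cos/sin: broadcastable to [..., rotary_dim // 2]."""
    rotary_dim = x.shape[-1] // 2
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    out1 = x1 * cos - x2 * sin
    out2 = x1 * sin + x2 * cos
    # Re-interleave the rotated pairs, then append the untouched half.
    rotated = torch.stack((out1, out2), dim=-1).flatten(-2)
    return torch.cat((rotated, x_pass), dim=-1)
```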
Tested with RULER niah_single_1 (5 samples): 100% accuracy
Both GPU-only and CPU offload modes verified
Use FlashInfer's optimized merge_state kernel for attention output merging
in chunked prefill. End-to-end improvement: +0.8% (32K) to +2.4% (64K).
Key changes:
- Add merge_attention_outputs_flashinfer() with LSE format conversion
- FlashInfer uses log2, flash_attn uses natural log (ln): convert via LOG2_E/LN_2 (sketched below)
- Keep original Triton kernel for fallback
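A sketch of the merge with the base conversion (the constants are exact; treat the flashinfer.merge_state call shape as an assumption to verify against its docs):

```python
import math
import torch
import flashinfer

LOG2_E = math.log2(math.e)  # multiply to convert ln -> log2
LN_2 = math.log(2.0)        # multiply to convert log2 -> ln

def merge_attention_outputs_flashinfer(o_a, lse_a, o_b, lse_b):
    """Merge two partial attention outputs (e.g. history chunk + current
    chunk). flash_attn emits natural-log LSE; FlashInfer expects log2."""
    o, lse = flashinfer.merge_state(o_a, lse_a * LOG2_E,
                                    o_b, lse_b * LOG2_E)
    return o, lse * LN_2  # back to ln for any downstream flash_attn use
```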
- Add comprehensive list of 1M+ context models from Hugging Face
- Categorize by type: text-only LLM vs vision-language models
- Separate ≤10B (practical) from >10B (resource-intensive) models
- Include Qwen, GLM, InternLM, Llama, MiniMax, Gradient AI series
- Add VRAM requirements and technical comparison table
- Update CLAUDE.md documentation index
Validate the hierarchical estimation approach for XAttention:
- Test 1: Math equivalence (diff = 0.0) between hierarchical and direct
- Test 2: Score + threshold selection strategy (replaces mask + voting)
- Test 3: Performance benchmark (41x speedup)
Uses pure torch + xattn kernels, independent of nanovllm framework.
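The Test 1 equivalence is just associativity of block sums; a toy torch version with invented shapes:

```python
import torch

torch.manual_seed(0)
scores = torch.randn(1, 8, 4096, dtype=torch.float64)  # [batch, heads, seq]

# Direct: sum into 64 coarse blocks of 64 scores each.
direct = scores.view(1, 8, 64, 64).sum(-1)

# Hierarchical: 512 fine blocks of 8, then merge 8 fine blocks per coarse.
fine = scores.view(1, 8, 512, 8).sum(-1)
hierarchical = fine.view(1, 8, 64, 8).sum(-1)

# Same additions, different grouping: equal up to float associativity
# (Test 1 above reports diff = 0.0 for the actual kernels).
print((direct - hierarchical).abs().max().item())
```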
Document the performance impact of block_size on softmax_fuse_block_sum:
- Current 4096 (reshaped 512) is the worst point: 95 ms
- Optimal 1024 (reshaped 128): 6 ms, 15x faster
- Performance follows U-shaped curve
Add tests/bench_estimate_block_size.py for benchmarking and propose
hierarchical block sum approach for optimization.
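A sketch of the kind of sweep such a benchmark can run (triton.testing.do_bench is a real utility; the reduction here is a stand-in for the actual fused kernel):

```python
import torch
import triton.testing

def sweep_block_sizes(scores, block_sizes=(512, 1024, 2048, 4096, 8192)):
    """Time a block-sum reduction at several block sizes to expose the
    U-shaped curve (placeholder op, not softmax_fuse_block_sum itself)."""
    seq = scores.shape[-1]
    for bs in block_sizes:
        fn = lambda: scores.view(*scores.shape[:-1], seq // bs, bs).sum(-1)
        print(f"block_size={bs:5d}: {triton.testing.do_bench(fn):.2f} ms")

sweep_block_sizes(torch.randn(1, 32, 65536, device="cuda"))
```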
Add load_k_only_to_slot_layer() to OffloadEngine for estimate phase:
- Only load K (not K+V) during block selection in select_blocks()
- Reduces H2D transfer by 50% in estimate phase
- 64K context: XAttn/Full ratio drops from 1.48x to 0.99x
- 32K context: XAttn/Full ratio drops from 1.67x to 1.20x
The estimate phase uses flat_group_gemm_fuse_reshape(Q, K), which
requires only K for attention-score computation; V is unused.
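A sketch of the split, assuming pinned host buffers k_cpu/v_cpu and GPU ring-buffer slots k_gpu/v_gpu (names and indexing invented):

```python
import torch

class OffloadEngineSketch:
    def load_to_slot_layer(self, layer: int, block: int, slot: int):
        # Compute phase needs both K and V.
        self.k_gpu[layer, slot].copy_(self.k_cpu[layer, block], non_blocking=True)
        self.v_gpu[layer, slot].copy_(self.v_cpu[layer, block], non_blocking=True)

    def load_k_only_to_slot_layer(self, layer: int, block: int, slot: int):
        # Estimate phase scores Q against K only, so skipping V halves
        # the H2D traffic during block selection.
        self.k_gpu[layer, slot].copy_(self.k_cpu[layer, block], non_blocking=True)
```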
Add a specialized agent for NVIDIA Nsys profiling that handles:
- Profile data collection using framework scripts
- Statistical analysis of kernel timing and memory transfers
- Timeline analysis for GPU-CPU overlap efficiency
- Comparative analysis between different configurations
- Refactor Observer into base class with common enable/disable/reset interface
- Create InferenceObserver subclass for TTFT/TPOT metrics
- Fix TTFT calculation timing: compute after prefill completes instead of
  at decode start (fixes max_tokens=1 returning TTFT=0; sketched below)
- Integrate InferenceObserver into bench.py and bench_offload.py for
accurate internal timing metrics vs external wall-clock time
- Add get_summary() and print_summary() methods for structured output
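A sketch of the TTFT placement fix (hook names invented):

```python
import time

class InferenceObserverSketch:
    """TTFT is taken when prefill completes. The old placement, at the
    first decode step, never fired when max_tokens=1 and reported 0."""

    def on_request_start(self):
        self.t_start = time.perf_counter()
        self.ttft = 0.0

    def on_prefill_done(self):
        self.ttft = time.perf_counter() - self.t_start
```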
Add comprehensive performance analysis for XAttention:
- NVTX marker locations and usage
- Block size impact on offload mode (4096 vs 1024)
- Detailed timing breakdown for estimate vs compute phases
- softmax_fuse_block_sum_kernel analysis
- Optimization recommendations
Key findings:
- block_size=4096 is 2x faster than 1024 for 64K context
- find_blocks_chunked is the bottleneck (40%) at block_size=4096
- estimate_gemm becomes the bottleneck (24%) at block_size=1024
- Add --ctx-len parameter (32k/64k/128k) for context length selection
- Auto-configure max-model-len and data-dir based on context length
- Add error handling to delete .nsys-rep file on test failure
- Remove set -e to allow proper error handling
- Update output filename format to include context length
Merge GPU-only sparse attention support from tzj/minference-exp branch:
**GPU-only mode additions:**
- Add compute_prefill/compute_decode methods to SparsePolicy base class
- Add GPU-only attention routing in attention.py
- Add alloc_policy_metadata() for pre-allocating GQA buffers
- Add XAttention + BSA sparse attention for GPU-only prefill
- Add kvcache_manager to set_context() for policy access
**bench.py enhancements:**
- Add --model argument for configurable model path
- Add --policy argument (full, xattn) for sparse policy selection
- Add --enable-policy flag for FullAttentionPolicy routing
- Add --enforce-eager option to disable CUDA graphs
- Add --gpu-util option for GPU memory utilization
**Documentation:**
- Add gpu_only_xattn_guide.md with performance analysis
- Add gpu_only_sparse_integration.md baseline document
- Add gpu-vram-requirement.md rule for GPU-only mode
Both CPU offload and GPU-only paths are preserved and functional.
Implement extended CUDA Graph coverage for CPU offload path:
- Add graphed_layers.py with N+2 graph architecture (EmbedGraph, FirstGraph, InterGraphs, LastGraph)
- Support both prefill (seq_len=chunk_size) and decode (seq_len=1) graph modes
- Extend graph coverage to ~70-80% including qkv_proj, rotary, o_proj
- Only the attention core remains in eager mode, to keep offload scheduling dynamic (generic capture pattern sketched below)
Performance: Prefill throughput improved ~5.6% (3782 -> 3995 tok/s at 32K)
Also adds:
- --enforce-eager flag to bench_offload.py for comparison
- Offload mode constraint documentation in CLAUDE.md
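The N+2 split itself lives in graphed_layers.py; a generic sketch of capturing one of those graphs over static buffers (standard torch pattern, not the in-tree code):

```python
import torch

def capture(module, static_in: torch.Tensor):
    """Capture module's forward into a CUDA graph with static I/O buffers.
    Shapes are frozen at capture time, which is why prefill and decode
    need separate graphs (seq_len = chunk_size vs 1)."""
    side = torch.cuda.Stream()              # warm up off the main stream
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        module(static_in)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = module(static_in)      # recorded, not executed eagerly
    return graph, static_out

# Replay: static_in.copy_(new_input); graph.replay(); read static_out.
```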
- Add "Policy 不能为 None (CRITICAL)" section
- Document that sparse_policy must always be at least FullAttentionPolicy
- Document warmup phase as the only exception where kvcache_manager can be None
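As a guard, the invariant amounts to roughly this (call-site names assumed):

```python
def resolve_policy(kvcache_manager):
    if kvcache_manager is None:
        return None  # warmup only: run the plain attention path
    # Everywhere else a policy must exist, at minimum FullAttentionPolicy.
    assert kvcache_manager.sparse_policy is not None, \
        "sparse_policy must be at least FullAttentionPolicy"
    return kvcache_manager.sparse_policy
```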
- Add compute_prefill() and compute_decode() GPU-only methods to SparsePolicy base class
- Implement GPU-only methods in FullAttentionPolicy using flash_attn
- Add sparse_policy parameter to GPUOnlyManager
- Update create_kvcache_manager() to create FullAttentionPolicy for GPU-only mode
- Route GPU-only attention through sparse_policy in attention.py
- Pass kvcache_manager to context for policy access
- Add --enable-policy flag to bench.py for testing
- Handle warmup phase when kvcache_manager is not yet allocated
This allows GPU-only mode to use the same policy architecture as CPU offload mode,
enabling future sparse attention implementations (Quest, XAttention) in GPU-only mode.
Performance verified: ~4890 tok/s (unchanged from baseline)
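A sketch of the baseline policy's GPU-only methods (flash_attn_func is the real flash_attn entry point; the method signatures are assumptions):

```python
import torch
from flash_attn import flash_attn_func

class FullAttentionPolicySketch:
    """GPU-only baseline: no block selection, dense causal attention."""

    def compute_prefill(self, q, k, v):
        # q/k/v: [batch, seq, heads, head_dim]
        return flash_attn_func(q, k, v, causal=True)

    def compute_decode(self, q, k, v):
        # One query token attending over the whole cached prefix.
        return flash_attn_func(q, k, v, causal=True)
```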
Document baseline performance before integrating sparse attention
to GPU-only mode:
- GPU-only Full Attention: 4869 tok/s (32K prefill)
- CPU Offload Full Attention: 1500 tok/s (3.2x slower)
- Add --model parameter (default: Llama-3.1-8B-Instruct)
- Add --enable-xattn flag for XAttention BSA sparse prefill
- Add --xattn-threshold and --xattn-stride parameters
- Change default num-gpu-blocks from 6 to 4
- Add benchmark results doc with Full vs XAttn comparison (32K/128K)
- Document kernel gap analysis showing 77-81% CPU scheduling overhead
- Identify GPU utilization at 12.8% with potential to reach 39.5%
- Outline optimization directions: CUDA Graph, Triton fusion, C++ extension
- Add documentation index entry in CLAUDE.md
- Switch from torch.cuda.nvtx to nvtx package for colored range support
- Add color coding: blue for H2D, green for D2H decode, orange for D2H prefill
- Add --num-gpu-blocks parameter to profile_offload.sh
- Include slot count in output filename for easier comparison
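The switch, in miniature (nvtx.start_range/end_range and the color argument come from the nvtx package):

```python
import nvtx
import torch

# Before: torch.cuda.nvtx ranges cannot carry a color.
torch.cuda.nvtx.range_push("H2D")
torch.cuda.nvtx.range_pop()

# After: the nvtx package supports the color scheme listed above.
rng = nvtx.start_range("H2D: load KV block", color="blue")
nvtx.end_range(rng)
```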
- Add write_to_prefill_buffer() and write_to_decode_buffer() methods
- Add chunk_idx parameter to load_to_slot_layer() for NVTX labeling
- Replace direct copy_() calls with OffloadEngine methods in attention.py
- Update all load_to_slot_layer() calls to pass chunk_idx
- NVTX markers now show chunk info: "H2D: L{layer} Chunk{chunk} CPU[{block}]->Slot[{slot}]"
All KV cache data transfers in chunked offload mode now go through
OffloadEngine, enabling better profiling and consistent management.
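A sketch of one such wrapper method using the label format above (buffer fields are assumptions):

```python
import nvtx

class OffloadEngineTransfer:
    def load_to_slot_layer(self, layer: int, block: int, slot: int,
                           chunk_idx: int):
        # Every H2D copy is routed through here, so each transfer gets a
        # consistent, chunk-labeled NVTX range on the timeline.
        label = f"H2D: L{layer} Chunk{chunk_idx} CPU[{block}]->Slot[{slot}]"
        with nvtx.annotate(label, color="blue"):
            self.k_gpu[layer, slot].copy_(self.k_cpu[layer, block],
                                          non_blocking=True)
            self.v_gpu[layer, slot].copy_(self.v_cpu[layer, block],
                                          non_blocking=True)
```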
- Document ring buffer pipeline triggering nsys timestamp bug
- Update profile_offload.sh to use test_ruler.py with options
- Add reference to new doc in CLAUDE.md
Root cause: the 4-slot ring-buffer pipeline (4 transfer streams +
1 compute stream) triggers an event-ordering bug in nsys < 2024.2.
- Add .claude/rules/gpu-monitor.md requiring gpu-monitor agent for all GPU memory monitoring tasks
- Update CLAUDE.md rules index with reference to new rule