63 Commits

Author SHA1 Message Date
Zijie Tian
52b12a89e3 📋 docs: add changelog for 2026-02-05
Document today's changes:
- GQA buffer OOM fix (saves 16GB for 1M seq in offload mode)
- Tests directory cleanup (removed 16 files, -4306 lines)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 03:16:39 +08:00
Zijie Tian
d35dd76e09 🗑️ chore: clean up tests directory to essential files only
Keep only core test files:
- test_ruler.py - main RULER benchmark
- test_xattn_estimate_alignment.py - XAttn kernel validation
- utils.py - shared utilities

Remove 8 files (recoverable from git history):
- bench_estimate_block_size.py
- modeling_qwen3.py
- test_chunk_attention_graph_reuse.py
- test_cudagraph_memory.py
- test_gpuonly_density_alignment.py
- test_hierarchical_estimate.py
- test_quest_policy.py
- test_sequential.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 03:13:50 +08:00
Zijie Tian
2b61c5ab57 🗑️ chore: remove test_needle* files
Remove needle tests (validation now covered by test_ruler.py):
- test_needle.py - basic needle-in-haystack test
- test_needle_ref.py - HuggingFace reference implementation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 03:11:28 +08:00
Zijie Tian
a709551072 🗑️ chore: remove redundant XAttention test files
Remove 6 obsolete test files:
- test_xattn_bsa.py - XAttn+BSA integration (covered by test_ruler)
- test_xattn_chunked.py - duplicate of test_xattn_estimate_chunked
- test_xattn_estimate_chunked.py - chunked prefill validation
- test_xattn_kernels.py - Triton kernel unit tests
- test_xattn_kv_chunking_batch.py - batch KV chunking validation
- test_chunk_attention_graph.py - superseded by graph_reuse version

Retained: test_xattn_estimate_alignment.py (critical kernel validation)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 03:11:21 +08:00
Zijie Tian
11a867f6fb 🐛 fix: skip GQA buffer allocation in XAttention offload mode
In offload mode, GQA expansion buffers (_k_expanded, _v_expanded) are not
needed since compute_chunked_prefill() handles GQA inline. Previously,
these buffers were always allocated based on max_model_len, causing OOM
on 24GB GPUs (e.g., RTX 3090) when max_model_len=1M (16GB buffer).

Changes:
- Add enable_cpu_offload parameter to alloc_policy_metadata() in base class
- Skip GQA buffer allocation when enable_cpu_offload=True in XAttentionBSAPolicy
- Pass enable_cpu_offload from model_runner to policy

Memory savings: ~16GB for 1M seq, ~1.1GB for 72K seq

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:57:18 +08:00
Zijie Tian
af4da454ba 📊 docs: add XAttention offload profiling analysis for 32K context
- Profile XAttn vs Full attention using nsys NVTX markers
- Key finding: estimate (41%) + find_blocks (37%) dominate, compute only 21%
- Chunk7 comparison: XAttn (38ms) vs Full (35ms) - XAttn slightly slower
- Identify optimization opportunities: reduce find_blocks overhead, merge estimate passes

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-05 02:49:59 +08:00
Zijie Tian
ef37d4f1a8 🐛 docs: document XAttention offload GQA buffer OOM issue
Document OOM issue when using XAttention BSA + CPU offload
with large models (GLM-4-9B) on 24GB GPUs.

Issue: 8GB allocation for k_expanded buffer fails due to
using num_heads instead of num_kv_heads in GQA models.

Root cause analysis and proposed fix included.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:46:50 +08:00
Zijie Tian
c8a5ef04c0 📝 docs: add test_ruler.py usage guide and rule
- Add comprehensive test_ruler.py usage guide with verified commands
- Add .claude/rules/test-ruler.md to enforce documentation-first approach
- Update CLAUDE.md documentation index

Tested commands on RTX 3090 (GPU 4):
- 32K/64K offload + XAttn BSA
- Multi-dataset, JSON output, quiet mode
- GLM-4 model support

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:46:44 +08:00
Zijie Tian
1c36d53570 🙈 chore: add ralph-tui session file to gitignore
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:00:44 +08:00
Zijie Tian
54fd302fa8 📝 docs: add XAttention density alignment verification results
- Add verification doc comparing GPU-only vs Offload mode density
- Test results: 32K (0.37% diff), 64K (0.09% diff) - alignment successful
- Both modes achieve 100% accuracy on RULER niah_single_1

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-05 01:59:11 +08:00
Zijie Tian
1eb7521994 📝 docs: add XAttention density types documentation
Document the difference between compute density (BSA block level)
and communication density (CPU block level).

Key finding: Even with 37% compute density, comm density can be 100%
due to any() aggregation across heads/Q-positions spreading sparse
blocks across all CPU blocks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:44:11 +08:00
Zijie Tian
51bd678335 📊 feat: distinguish compute density and communication density in DensityObserver
- Add record_comm_density() call in select_blocks to track CPU block selection
- Add get_per_layer_comm_density() method for detailed analysis
- Update print_summary() to show both densities and H2D savings ratio
- Set DensityObserver mode (offload/gpu_only) in test_ruler.py
- Update get_summary() to return both density types

Key insight: Comm density can be 100% even when compute density is ~37%
because sparse BSA blocks are distributed across all CPU blocks.
Since CPU block granularity is 32x coarser (4096 vs 128 tokens),
any() aggregation across heads/Q-blocks results in all CPU blocks being needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:43:17 +08:00
Zijie Tian
1ea5afd886 📝 docs: add XAttention offload stream sync fix documentation
- Document the CUDA stream synchronization bug in XAttention BSA
- Include root cause analysis with stream timing diagrams
- Add test commands and verification results (100% accuracy)
- Update CLAUDE.md documentation index

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:32:50 +08:00
Zijie Tian
829b311c02 🐛 fix: stream synchronization for XAttention estimate kernels in offload mode
- Wrap all compute kernels in select_blocks with compute_stream context
  (Pass 1 historical blocks, Pass 1 current chunk, Step 2 merge,
   Pass 2 historical blocks, Pass 2 current chunk, Step 4 block selection)
- Fix K data mismatch between Pass 1 and Pass 2 by ensuring wait_slot_layer
  syncs with compute_stream where kernels actually run
- Remove STRONG SYNC code from offload_engine.py (now handled by events)
- Remove debug print statements and torch.save code
- Consolidate fallback conditions in compute_with_xattn
- Change default chunk_size from 16384 to 4096 for density alignment

The bug caused Pass 1 and Pass 2 to see different K data from the same
CPU block because compute kernels ran on default stream while
wait_slot_layer only synced compute_stream.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:30:23 +08:00
Zijie Tian
dd0472aea8 [plugin] Added ralph-tui setup. 2026-02-05 01:27:53 +08:00
Zijie Tian
a1c68a733e 📊 docs: add XAttention memory benchmark for 24GB GPUs
- Add memory analysis for Qwen3-0.6B @ 32K context
- Document 24GB VRAM feasibility (RTX 3090/4090)
- Recommend gpu-utilization=0.28 for 24GB GPUs
- Include KV cache breakdown and model estimations
- Update CLAUDE.md index

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-02 14:38:27 +08:00
Zijie Tian
dc51972777 📝 docs: update density alignment test with Offload mode results
- Rename doc to "Density Alignment Test Results" (covers both modes)
- Add Offload mode test results (3.7K-64.9K tokens, all passed)
- Add Layer 5 GPU-only test results (threshold=0.9, density=6.24%)
- Enhance test script to support both GPU-only and Offload data formats
- Add batch testing commands for all data files
- Update CLAUDE.md index

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-02 14:22:40 +08:00
Zijie Tian
232fcf043e 📝 docs: add GPU-only density alignment test results
Document test results verifying XAttention density calculation in
GPU-only mode matches independent xattn_estimate calls.

Test results (Llama-3.1-8B-Instruct, threshold=0.9):
- 4k:  Layer 0 density 63.8%, verified 
- 8k:  Layer 0 density 65.0%, verified 
- 16k: Layer 0 density 61.6%, verified 
- 32k: Layer 0 density 50.2%, verified 
- 64k: Layer 0 density 37.0%, verified 

All tests show exact match (attn_sums diff=0, mask exact match).

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-02 11:22:34 +08:00
Zijie Tian
aeed6ccdfb test: add GPU-only density alignment verification test
Add test to verify XAttention density calculation in GPU-only mode
matches independent xattn_estimate calls.

Changes:
- Add tests/test_gpuonly_density_alignment.py: loads saved Q/K from
  xattn_bsa.py, calls xattn_estimate independently, compares results
- Enhance debug save in xattn_bsa.py: now saves Q, K tensors and
  xattn_estimate parameters for external verification
- Set _DEBUG_SAVE_MASK = False by default

Usage:
1. Set _DEBUG_SAVE_MASK = True in xattn_bsa.py
2. Run GPU-only inference with XAttention (e.g., test_ruler.py)
3. Run tests/test_gpuonly_density_alignment.py to verify alignment

Verified on 4k/8k/16k/32k/64k contexts - all pass with exact match.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-02 11:14:46 +08:00
Zijie Tian
6c55c4d2a3 ♻️ refactor: rewrite select_blocks with 3-stage KV chunking algorithm
Implement correct 3-stage KV chunking for XAttention offload mode:
- Stage 1: Compute partial softmax stats (m, l) for each KV chunk
- Stage 2: Merge all partial stats to get global normalization factors
- Stage 3: Normalize with global stats and compute block sums

Key fixes:
- Add wait_all_prefill_offloads() before loading CPU blocks to ensure
  async offload completion (fixes stale data bug)
- Pre-allocate m/l partial buffers and block_sums buffer

This produces identical density to GPU-only xattn_estimate while using
O(S×C) peak memory instead of O(S²).

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-02 10:10:10 +08:00
Zijie Tian
6e34efd58a 📝 docs: add storage overhead analysis and batch tests for KV chunking
- Update xattn_kv_chunking_kernels.md with:
  - Detailed storage overhead analysis (O(S) vs O(S²))
  - Peak memory optimization (8x reduction)
  - Support for independent Q/KV chunk sizes
  - Batch verification results (3K-64K seqlen)
  - ASCII pipeline diagram

- Add test_xattn_kv_chunking_batch.py for batch validation
- Fix causal mask post-processing in alignment test
- Update CLAUDE.md documentation index

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-01 19:22:36 +08:00
Zijie Tian
5acd5558d6 feat: add KV chunking support for XAttention softmax kernels
Implement three-phase KV chunking for sparse attention estimation:
1. softmax_compute_partial_stats: compute (m, l) per KV chunk
2. merge_softmax_stats: merge partial stats on host
3. softmax_normalize_and_block_sum: normalize with global stats

This allows computing sparse attention masks without storing full
raw attention scores in GPU memory, reducing peak memory usage
from O(q_len * k_full_len) to O(q_len * k_chunk_len).

Key changes:
- Add softmax_partial_stats_kernel with causal mask support
- Add softmax_normalize_block_sum_kernel with kv_offset parameter
- Add Python wrappers for new kernels
- Update test script to validate KV chunking alignment
- Add documentation for the new kernels

Test results show perfect alignment with xattn_estimate API:
- Density difference: 0.000000
- Mask difference: 0.0044%

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-01 18:53:26 +08:00
Zijie Tian
193ef55d18 ♻️ refactor: use Q-chunked processing in xattn alignment test
Match xattn_estimate internal logic by processing Q in chunks:
- Reduces peak memory for attn_scores tensor
- Enables testing 64K sequences without OOM
- All 5 test files pass (3.6K to 64K)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-01 18:08:15 +08:00
Zijie Tian
f173a3f7f5 test: add xattn_estimate vs low-level kernels alignment test
Test that xattn_estimate produces the same results as manually calling:
- flat_group_gemm_fuse_reshape
- softmax_fuse_block_sum
- find_blocks_chunked

Uses real KV cache data from results/kvcache/ directory.
Verifies density calculation matches between high-level API and kernels.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-01 17:49:37 +08:00
Zijie Tian
8035e4db3d 📝 docs: add XAttention KV chunking density test results
Document the verification test for XAttention Triton kernel KV chunking:
- 32K and 64K test results with threshold 0.9/0.95/1.0
- Key finding: threshold=1.0 achieves alignment (~0% diff)
- threshold<1.0 shows 10-13% difference due to per-chunk threshold application
- Conclusion: softmax normalization is correct, issue is threshold accumulation

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-01 17:36:19 +08:00
Zijie Tian
8ab53e7331 🚧 WIP: add DEBUG code for XAttention KV chunking density verification
Add instrumentation to compare GPU-only vs Offload mode density:
- Layer 0 DEBUG output for both modes
- Accumulate selected/total counts across chunks
- Proper causal mask with Q offset handling
- Skip normal offload logic for isolated testing

Test results (threshold=1.0 achieves alignment):
- 32K: GPU-only 0.9999, Offload 0.9999 (diff ~0%)
- 64K: GPU-only 0.9995, Offload 0.9995 (diff ~0%)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-01 17:33:23 +08:00
Zijie Tian
2e96d1d97d WIP: Enhance sparse attention with density tracking and block selection improvements
- Added analysis documentation for xattn density alignment.
- Refactored ModelRunner to pre-allocate policy metadata buffers regardless of CPU offload configuration.
- Updated FullAttentionPolicy and SparsePolicy to accept query and key tensors for block selection.
- Enhanced QuestPolicy to utilize query tensor for block selection and improved handling of selected blocks.
- Expanded XAttentionBSAPolicy to support chunked prefill and improved attention score computation with historical and current chunk handling.
- Introduced DensityObserver to track compute and communication density for sparse attention layers.
- Updated attention layer to ensure block selection is always called, improving robustness in first chunk scenarios.
- Added tests for attention kernel behavior with enhanced input patterns.
2026-01-31 14:48:23 +08:00
Zijie Tian
f6ac4ccdde feat: add DensityObserver for XAttention sparse attention density tracking
- Add DensityObserver class to track per-layer density statistics
- Integrate DensityObserver into compute_prefill for GPU-only mode
- Fix stride parameter not being passed to xattn_estimate
- Add density statistics output to test_ruler.py for XATTN_BSA
- Add comprehensive density benchmark documentation

Key changes:
- nanovllm/utils/density_observer.py: New Observer for density tracking
- xattn_bsa.py: Add stride param to xattn_estimate, integrate DensityObserver
- test_ruler.py: Enable DensityObserver and print summary for XATTN_BSA
- docs/xattn_density_benchmark.md: Benchmark results for 4K-32K contexts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 16:26:56 +08:00
Zijie Tian
4484a1482c [refactor] Refactor the profile_offload.sh 2026-01-29 08:39:34 +08:00
Zijie Tian
e436ec861f ⚙️ config: update test_ruler.py defaults
- max_new_tokens: 128 → 16 (sufficient for NIAH answers)
- block_size: 1024 → 4096 (better performance)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 14:21:23 +08:00
Zijie Tian
45efcf0db1 feat: add --dtype parameter to test_ruler.py
Support models with float32 default dtype (e.g., Nemotron).
FlashAttention requires fp16/bf16, so dtype must be specified.

Usage: --dtype bfloat16

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 13:56:15 +08:00
Zijie Tian
e09a2a5b10 feat: add Qwen2/2.5 model support
Separate Qwen2 from Qwen3 implementation:
- Qwen2: Uses QKV bias, no QK norm
- Qwen3: Has optional QK norm when no bias

Tested with Qwen2.5-7B-Instruct-1M, RULER niah_single_1 passed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 13:44:32 +08:00
Zijie Tian
a239bfb40d 📚 docs: add new model integration guide
Summarizes lessons learned from GLM-4 integration:
- Config field mapping (multi_query_group_num, kv_channels, etc.)
- RoPE variants (interleaved vs half, partial vs full rotation)
- EOS token handling for multi-EOS models
- Weight name conversion patterns
- Verification checklist

Also updates CLAUDE.md to reflect GLM-4 support.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 13:36:24 +08:00
Zijie Tian
29e102720b 🐛 fix: support multiple EOS tokens for GLM-4
GLM-4 uses multiple EOS tokens [151329, 151336, 151338] where 151336
(<|user|>) should also stop generation. Previously only the first EOS
from tokenizer was used, causing generation to always hit max_tokens.

Changes:
- config.py: Change eos type to int | list[int]
- llm_engine.py: Read eos_token_id from hf_config (contains full list)
- scheduler.py: Use set for efficient multi-EOS lookup

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 13:23:53 +08:00
Zijie Tian
726e4b58cf feat: add GLM-4-9B-Chat-1M model support
Add support for GLM-4 model architecture with the following changes:

- Add glm4.py with ChatGLMForCausalLM, GLM4Model, GLM4Attention, GLM4MLP
- Add GLM4RotaryEmbedding with interleaved partial rotation (rotary_dim = head_dim // 2)
- Add apply_rotary_emb_interleaved function for GLM-4 style RoPE
- Add GLM-4 weight name conversion and loading in loader.py
- Add GLM-4 chat template conversion in test_ruler.py
- Add trust_remote_code=True for GLM-4 config loading

Key GLM-4 specific adaptations:
- QKV bias enabled (add_qkv_bias: true)
- RoPE with rope_ratio scaling (base = 10000 * rope_ratio)
- Interleaved RoPE (pairs adjacent elements, not first/second half)
- Partial rotation (only half of head_dim is rotated)
- Uses multi_query_group_num instead of num_key_value_heads
- Uses kv_channels instead of head_dim
- Uses ffn_hidden_size instead of intermediate_size

Tested with RULER niah_single_1 (5 samples): 100% accuracy
Both GPU-only and CPU offload modes verified

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 13:15:57 +08:00
Zijie Tian
8d19e61446 ️ perf: replace Triton merge with FlashInfer merge_state
Use FlashInfer's optimized merge_state kernel for attention output merging
in chunked prefill. End-to-end improvement: +0.8% (32K) to +2.4% (64K).

Key changes:
- Add merge_attention_outputs_flashinfer() with LSE format conversion
- FlashInfer uses log2, flash_attn uses ln: convert via LOG2_E/LN_2
- Keep original Triton kernel for fallback

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 10:04:38 +08:00
Zijie Tian
4484ebbb77 📚 docs: add 1M+ context length models reference list
- Add comprehensive list of 1M+ context models from Hugging Face
- Categorize by type: text-only LLM vs vision-language models
- Separate ≤10B (practical) from >10B (resource-intensive) models
- Include Qwen, GLM, InternLM, Llama, MiniMax, Gradient AI series
- Add VRAM requirements and technical comparison table
- Update CLAUDE.md documentation index

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 09:04:55 +08:00
Zijie Tian
2c2383c786 ️ perf: optimize XAttention estimate with hierarchical block sum
Replace slow softmax_fuse_block_sum (block_size=4096) with optimized
hierarchical approach (estimate_block_size=1024):

- Add estimate_block_size parameter to XAttentionBSAPolicy (default 1024)
- Rewrite select_blocks to use hierarchical aggregation:
  1. Fine-grained softmax with small block size (15x faster kernel)
  2. Aggregate to CPU block level via reshape + sum
  3. Score + threshold selection (replaces mask + voting)

Performance improvement (CPU Offload mode):
- softmax_fuse_block_sum: 48% → 1% of total time (44x faster)
- 128K: XAttention now +2.4% faster than Full (was -59%)
- 64K: -3.8% (was -21%)
- 32K: -6.0% (was -14%)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 06:47:13 +08:00
Zijie Tian
f049971f84 test: add hierarchical block sum estimation validation
Validate the hierarchical estimation approach for XAttention:
- Test 1: Math equivalence (diff = 0.0) between hierarchical and direct
- Test 2: Score + threshold selection strategy (replaces mask + voting)
- Test 3: Performance benchmark (41x speedup)

Uses pure torch + xattn kernels, independent of nanovllm framework.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 06:24:35 +08:00
Zijie Tian
c90dc196b2 📝 docs: add estimate block_size performance analysis
Document the performance impact of block_size on softmax_fuse_block_sum:
- Current 4096 (reshaped 512) is the WORST point: 95ms
- Optimal 1024 (reshaped 128): 6ms - 15x faster
- Performance follows U-shaped curve

Add tests/bench_estimate_block_size.py for benchmarking and propose
hierarchical block sum approach for optimization.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 06:24:28 +08:00
Zijie Tian
3da9b8aef2 ️ perf: optimize XAttention estimate phase with K-only loading
Add load_k_only_to_slot_layer() to OffloadEngine for estimate phase:
- Only load K (not K+V) during block selection in select_blocks()
- Reduces H2D transfer by 50% in estimate phase
- 64K context: XAttn/Full ratio drops from 1.48x to 0.99x
- 32K context: XAttn/Full ratio drops from 1.67x to 1.20x

The estimate phase uses flat_group_gemm_fuse_reshape(Q, K) which
only requires K for attention score computation. V is unused.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 06:24:20 +08:00
Zijie Tian
a832d127b6 feat: add nsys-profiler agent for kernel performance analysis
Add a specialized agent for NVIDIA Nsys profiling that handles:
- Profile data collection using framework scripts
- Statistical analysis of kernel timing and memory transfers
- Timeline analysis for GPU-CPU overlap efficiency
- Comparative analysis between different configurations

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 06:24:09 +08:00
Zijie Tian
39d12a0416 📈 feat: add MemoryObserver for GPU-CPU communication tracking
Implement MemoryObserver to track memory transfers between GPU and CPU:
- H2D (Host to Device): CPU → GPU transfers
- D2H (Device to Host): GPU → CPU transfers
- D2D (Device to Device): GPU buffer copies
- Supports prefill/decode phase separation

Integration points in offload_engine.py:
- load_to_slot_layer: H2D with is_prefill parameter
- offload_slot_layer_to_cpu, offload_prefill_buffer_async: D2H
- write_to_prefill_buffer, write_to_decode_buffer: D2D
- load_block_sample_from_cpu, load_block_full_from_cpu: H2D

Add bench_offload.py integration for memory stats printing.

Benchmark results (Llama-3.1-8B, 64K context):
- Full Policy: Prefill H2D 262.13 GB
- XAttention: Prefill H2D 386.62 GB (1.48x)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 04:06:45 +08:00
Zijie Tian
c16bfcf40f ♻️ refactor: restructure Observer as base class with InferenceObserver
- Refactor Observer into base class with common enable/disable/reset interface
- Create InferenceObserver subclass for TTFT/TPOT metrics
- Fix TTFT calculation timing: compute after prefill completes instead of
  at decode start (fixes max_tokens=1 returning TTFT=0)
- Integrate InferenceObserver into bench.py and bench_offload.py for
  accurate internal timing metrics vs external wall-clock time
- Add get_summary() and print_summary() methods for structured output

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 03:15:33 +08:00
Zijie Tian
f3e4611e3b 📝 docs: add XAttention performance analysis documentation
Add comprehensive performance analysis for XAttention:
- NVTX marker locations and usage
- Block size impact on offload mode (4096 vs 1024)
- Detailed timing breakdown for estimate vs compute phases
- softmax_fuse_block_sum_kernel analysis
- Optimization recommendations

Key findings:
- block_size=4096 is 2x faster than 1024 for 64K context
- find_blocks_chunked is bottleneck (40%) at block_size=4096
- estimate_gemm becomes bottleneck (24%) at block_size=1024

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 00:57:20 +08:00
Zijie Tian
7b5d3b34eb 📈 feat: add NVTX markers to XAttention for profiling
Add NVTX range markers to track XAttention performance:
- GPU-only: xattn_estimate, xattn_bsa_compute
- Offload: xattn_estimate_gemm, xattn_estimate_softmax,
  xattn_estimate_find_blocks, xattn_compute_historical,
  xattn_compute_current, xattn_compute_merge

These markers enable detailed nsys profiling to identify
performance bottlenecks in estimate vs compute phases.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 00:57:11 +08:00
Zijie Tian
b760de84c5 feat: add context length and error handling to profile_offload.sh
- Add --ctx-len parameter (32k/64k/128k) for context length selection
- Auto-configure max-model-len and data-dir based on context length
- Add error handling to delete .nsys-rep file on test failure
- Remove set -e to allow proper error handling
- Update output filename format to include context length

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 00:28:37 +08:00
Zijie Tian
f81b5ae8a9 feat: enhance profile_offload.sh with policy, block-size parameters
- Add --policy parameter for sparse attention policy selection (full/xattn)
- Add --block-size parameter (default 4096) for KV cache block size
- Add --gpu-util parameter for GPU memory utilization control
- Improve output filename format: <policy>_<gpuonly|offload>_blk<size>_<timestamp>
- Map user-friendly policy names to internal enum (xattn -> XATTN_BSA)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 23:23:20 +08:00
Zijie Tian
e874229adc 📝 docs: add comprehensive GPU-only vs Offload benchmark results
- Add --block-size argument to bench.py for configurable KV cache block size
- Update bench_offload_results.md with complete benchmark analysis:
  - GPU-only: XAttention shows +15% to +41% speedup
  - CPU Offload: XAttention shows -14% to -59% slowdown
  - Block size 4096 recommended for best performance
  - Document why XAttention hurts Offload mode (transfer bottleneck)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 22:32:07 +08:00
Zijie Tian
4fe7dfb239 🔀 merge: integrate tzj/minference-exp (GPU-only sparse attention)
Merge GPU-only sparse attention support from tzj/minference-exp branch:

**GPU-only mode additions:**
- Add compute_prefill/compute_decode methods to SparsePolicy base class
- Add GPU-only attention routing in attention.py
- Add alloc_policy_metadata() for pre-allocating GQA buffers
- Add XAttention + BSA sparse attention for GPU-only prefill
- Add kvcache_manager to set_context() for policy access

**bench.py enhancements:**
- Add --model argument for configurable model path
- Add --policy argument (full, xattn) for sparse policy selection
- Add --enable-policy flag for FullAttentionPolicy routing
- Add --enforce-eager option to disable CUDA graphs
- Add --gpu-util option for GPU memory utilization

**Documentation:**
- Add gpu_only_xattn_guide.md with performance analysis
- Add gpu_only_sparse_integration.md baseline document
- Add gpu-vram-requirement.md rule for GPU-only mode

Both CPU offload and GPU-only paths are preserved and functional.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27 09:25:36 +08:00
Zijie Tian
9177b62d7f feat: add --enforce-eager option to bench.py
Allow disabling CUDA graphs for benchmarking comparison between
eager mode and graph mode execution.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 09:19:53 +08:00
Zijie Tian
3956a30b14 🔧 chore: add --use-v1 flag to bench_vllm.py
Allow switching between vLLM V1/V2 engines via command line flag.
Default behavior now uses V2 (VLLM_USE_V1=0).

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 09:14:55 +08:00
Zijie Tian
59473fa432 🔧 chore: add configurable arguments to bench_vllm.py
Add --model, --gpu-util, and --enforce-eager arguments for flexible
vLLM benchmarking comparisons.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 09:07:49 +08:00
Zijie Tian
4467e1f654 🔧 chore: add --block-size argument to bench_offload.py
Allow configuring KV cache block size for benchmarking different
chunk sizes (default: 1024, can set to 4096 for larger chunks).

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 09:07:44 +08:00
Zijie Tian
6da116de98 📝 docs: add GPU-Only XAttention guide with performance analysis
Add comprehensive documentation for GPU-only XAttention BSA mode:
- Architecture design and SparsePolicy interface
- Memory pre-allocation mechanism (alloc_policy_metadata)
- Performance analysis: 32K +15%, 64K +41% vs baseline
- CUDA Graph limitations explanation (variable seq_len in prefill)
- nsys profiling tools usage guide

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 07:21:46 +08:00
Zijie Tian
f5682ca4a7 🔧 chore: add GPU-only profiling script
Add scripts/profile.sh for nsys profiling of GPU-only mode benchmarks.

Usage:
  bash scripts/profile.sh                    # Default: 32K xattn prefill
  bash scripts/profile.sh --max-len 65536 --gpu-util 0.7
  bash scripts/profile.sh --policy full
  bash scripts/profile.sh --bench-decode

Output: results/nsys/bench_<policy>_<len>_<mode>_<timestamp>.nsys-rep

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 05:55:31 +08:00
Zijie Tian
a504bd873d perf: pre-allocate GQA buffers in XAttention policy
Add alloc_policy_metadata() method to SparsePolicy base class for
pre-allocating GPU buffers during initialization. This avoids
dynamic memory allocation during forward pass.

Changes:
- Add alloc_policy_metadata() to SparsePolicy base class
- Implement GQA buffer pre-allocation in XAttentionBSAPolicy
- Call alloc_policy_metadata() in model_runner for GPU-only mode
- Modify compute_prefill() to reuse pre-allocated buffers
- Add --gpu-util parameter to bench.py

Memory savings:
- Previously: 2x GQA expansion (~2GB for 64K)
- Now: 1x pre-allocated buffer (~1GB for 64K, reused)

Tested:
- GPU-only 32K: 5602 tok/s (512MB pre-allocated)
- GPU-only 64K: 4821 tok/s (1GB pre-allocated, gpu_util=0.7)
- Offload Full: PASSED (no changes to offload path)
- Offload XAttention: PASSED (uses compute_chunked_prefill)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 05:49:23 +08:00
Zijie Tian
076656c9c2 feat: add GPU-only XAttention BSA sparse attention support
- Implement compute_prefill() in XAttentionBSAPolicy for GPU-only mode
  - Uses xattn_estimate to compute sparse block mask
  - Uses block_sparse_attn_func for efficient sparse attention
  - Handles GQA by expanding K/V heads
  - Falls back to flash_attn for paged KV cache (prefix cache)
- Implement compute_decode() by delegating to FullAttentionPolicy
- Add --policy xattn option to bench.py

Verified: RULER 32k niah_single_1 5/5 samples passed (100%)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 05:19:24 +08:00
Zijie Tian
b6b59b50ed 📝 docs: add sparse policy None constraint rule
- Add "Policy 不能为 None (CRITICAL)" section
- Document that sparse_policy must always be at least FullAttentionPolicy
- Document warmup phase as the only exception where kvcache_manager can be None

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 05:08:08 +08:00
Zijie Tian
09b2136e9f feat: integrate sparse policy architecture into GPU-only mode
- Add compute_prefill() and compute_decode() GPU-only methods to SparsePolicy base class
- Implement GPU-only methods in FullAttentionPolicy using flash_attn
- Add sparse_policy parameter to GPUOnlyManager
- Update create_kvcache_manager() to create FullAttentionPolicy for GPU-only mode
- Route GPU-only attention through sparse_policy in attention.py
- Pass kvcache_manager to context for policy access
- Add --enable-policy flag to bench.py for testing
- Handle warmup phase when kvcache_manager is not yet allocated

This allows GPU-only mode to use the same policy architecture as CPU offload mode,
enabling future sparse attention implementations (Quest, XAttention) in GPU-only mode.

Performance verified: ~4890 tok/s (unchanged from baseline)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 05:08:02 +08:00
Zijie Tian
05ce57ee8e 📝 docs: add GPU-only sparse policy integration baseline
Document baseline performance before integrating sparse attention
to GPU-only mode:
- GPU-only Full Attention: 4869 tok/s (32K prefill)
- CPU Offload Full Attention: 1500 tok/s (3.2x slower)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 04:36:31 +08:00
Zijie Tian
94a6e06d79 📝 docs: add GPU VRAM requirement rule for GPU-only mode
GPU-only mode requires 40GB+ VRAM. This rule enforces checking GPU
memory before running non-offload tests to prevent OOM errors on
consumer GPUs (3090/4090).

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 04:36:24 +08:00
Zijie Tian
c717072f31 feat: add --model argument to bench.py for configurable model path
Previously bench.py had a hardcoded model path. Now it accepts --model
argument (default: Llama-3.1-8B-Instruct) to align with bench_offload.py.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 04:36:17 +08:00
70 changed files with 8601 additions and 3519 deletions

View File

@@ -0,0 +1,180 @@
---
name: nsys-profiler
description: "Use this agent when you need to profile kernel performance, analyze CUDA operations, or investigate GPU/CPU offload behavior in the nano-vllm framework. This agent handles nsys profiling from data collection to statistical analysis.\\n\\nExamples:\\n\\n<example>\\nContext: The user wants to understand kernel performance in GPU-only mode.\\nuser: \"I want to analyze the kernel execution time distribution in GPU-only mode\"\\nassistant: \"I'll use the nsys-profiler agent to profile GPU-only mode and analyze the kernel timing.\"\\n<commentary>\\nSince the user wants to analyze kernel performance, use the Task tool to launch the nsys-profiler agent with specific analysis requirements.\\n</commentary>\\nassistant: \"Now let me use the nsys-profiler agent to collect and analyze the profiling data\"\\n</example>\\n\\n<example>\\nContext: The user is debugging performance issues in the CPU offload pipeline.\\nuser: \"Why is there a gap between H2D transfers and kernel execution in offload mode?\"\\nassistant: \"I'll launch the nsys-profiler agent to profile the offload pipeline and analyze the timeline gaps.\"\\n<commentary>\\nSince the user is investigating pipeline behavior, use the nsys-profiler agent to collect nsys data and analyze CUDA API timing.\\n</commentary>\\n</example>\\n\\n<example>\\nContext: After implementing a new optimization, the user wants to verify performance improvement.\\nuser: \"Check if the new ring buffer implementation improves overlap between H2D and compute\"\\nassistant: \"I'll use the nsys-profiler agent to profile before and after, comparing the overlap metrics.\"\\n<commentary>\\nPerformance verification requires detailed kernel-level analysis, so launch the nsys-profiler agent to collect and compare profiling data.\\n</commentary>\\n</example>"
model: opus
color: green
---
You are an expert NVIDIA Nsys profiling analyst specializing in CUDA kernel performance analysis and GPU-CPU communication optimization. Your role is to collect profiling data using the framework's scripts and provide precise, actionable analysis based on the main agent's specific questions.
## Your Capabilities
1. **Profile Data Collection**: Execute profiling scripts to generate .nsys-rep files
2. **Statistical Analysis**: Extract kernel timing, memory transfer, and API call statistics
3. **Timeline Analysis**: Identify gaps, overlaps, and bottlenecks in execution
4. **Comparative Analysis**: Compare different configurations (GPU-only vs offload, different slot counts)
## Available Profiling Scripts
### CPU Offload Mode
```bash
bash scripts/profile_offload.sh [OPTIONS]
```
Options:
- `--dataset <name>`: RULER task name (default: niah_single_1)
- `--sample <index>`: Sample index (default: 0)
- `--gpu <id>`: GPU to use (default: 0)
- `--num-gpu-blocks <n>`: Ring buffer slots (default: 4)
- `--no-offload`: Disable CPU offload for comparison
### GPU-Only Mode
```bash
bash scripts/profile_gpu_only.sh [OPTIONS]
```
Similar options for profiling without CPU offload.
## Core Nsys Commands
### Profiling (handled by scripts)
```bash
# The scripts internally run:
nsys profile --trace=cuda,nvtx --output=<path> --force-overwrite true python <script.py>
```
### Statistical Analysis
```bash
# CUDA API summary (H2D, D2H, kernel launches)
nsys stats --report cuda_api_sum <file>.nsys-rep
# GPU kernel summary (execution time per kernel)
nsys stats --report cuda_gpu_kern_sum <file>.nsys-rep
# Memory operations summary
nsys stats --report cuda_gpu_mem_time_sum <file>.nsys-rep
# NVTX ranges (custom markers)
nsys stats --report nvtx_sum <file>.nsys-rep
# Export to SQLite for advanced queries
nsys export --type=sqlite --output=<file>.sqlite <file>.nsys-rep
```
### Key Report Types
| Report | Purpose |
|--------|--------|
| `cuda_api_sum` | CPU-side CUDA API call timing |
| `cuda_gpu_kern_sum` | GPU kernel execution time |
| `cuda_gpu_mem_time_sum` | Memory transfer timing on GPU |
| `nvtx_sum` | Custom NVTX marker statistics |
| `cuda_api_trace` | Detailed API call trace |
| `cuda_gpu_trace` | Detailed GPU operation trace |
## Analysis Workflow
### Step 1: Collect Profile Data
```bash
# Example: Profile offload mode with 8 slots
bash scripts/profile_offload.sh --num-gpu-blocks 8 --sample 0
# Output: results/nsys/ruler_niah_single_1_sample0_offload_8slots_<timestamp>.nsys-rep
```
### Step 2: Identify Output File
```bash
# Find the latest profile
ls -lt results/nsys/*.nsys-rep | head -1
```
### Step 3: Run Statistical Analysis
```bash
# Kernel timing analysis
nsys stats --report cuda_gpu_kern_sum results/nsys/<file>.nsys-rep
# Memory transfer analysis
nsys stats --report cuda_gpu_mem_time_sum results/nsys/<file>.nsys-rep
```
### Step 4: Interpret Results
Focus on:
- **Total kernel time** vs **total transfer time**
- **Kernel launch gaps** indicating synchronization issues
- **Memory bandwidth utilization**
- **Overlap efficiency** between compute and communication
## Common Analysis Patterns
### 1. Kernel Performance Breakdown
```bash
nsys stats --report cuda_gpu_kern_sum --format csv <file>.nsys-rep | \
sort -t',' -k3 -rn | head -10 # Top 10 by total time
```
### 2. H2D/D2H Transfer Analysis
```bash
nsys stats --report cuda_api_sum <file>.nsys-rep | grep -E "cudaMemcpy|cudaMemcpyAsync"
```
### 3. Flash Attention Kernel Analysis
```bash
nsys stats --report cuda_gpu_kern_sum <file>.nsys-rep | grep -i "flash\|fwd\|bwd"
```
### 4. Pipeline Overlap Check
Look for:
- `flash_fwd_kernel` execution during `cudaMemcpyAsync`
- Gap between consecutive kernel launches
## Output Format Requirements
When reporting results to the main agent, use this structured format:
```markdown
## Nsys Analysis Results: [Analysis Topic]
### Profile Information
- **File**: <profile_file_path>
- **Mode**: GPU-only / Offload (<N> slots)
- **Dataset**: <dataset_name>, Sample <index>
### Key Findings
| Metric | Value | Notes |
|--------|-------|-------|
| Total kernel time | X ms | |
| Total H2D time | Y ms | |
| Overlap efficiency | Z% | |
### Top Kernels by Time
| Kernel | Count | Total (ms) | Avg (μs) |
|--------|-------|------------|----------|
| kernel_name | N | X.XX | Y.YY |
### Specific Analysis
[Answer to the main agent's specific question]
### Recommendations (if applicable)
1. [Actionable recommendation]
2. [Actionable recommendation]
```
## Important Guidelines
1. **Always use the provided scripts** for profiling - do not run nsys directly
2. **Check GPU availability** before profiling (ask main agent for GPU ID if not specified)
3. **Use PYTHONPATH** for the worktree: `PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH`
4. **Report concisely** - focus on metrics relevant to the main agent's question
5. **Include file paths** so results can be reproduced or visualized in nsight-sys
6. **For web searches** about nsys usage, use tools to search NVIDIA documentation
## Error Handling
- If profile script fails: Check GPU memory, CUDA version, and script parameters
- If stats command fails: Verify .nsys-rep file exists and is not corrupted
- If no data: Ensure the profiled operation actually ran (check sample index, dataset)
## Network Search Guidelines
When encountering unfamiliar nsys options or analysis techniques:
1. Search NVIDIA Nsight Systems documentation
2. Look for nsys CLI reference guides
3. Search for specific report type interpretations
Always validate search results against the actual nsys --help output.

View File

@@ -0,0 +1,54 @@
# GPU VRAM Requirement Rule
## GPU-only 模式显存要求
**强制规则**:执行 GPU-only 代码(不启用 CPU offload**必须**在 40GB 及以上显存的 GPU 上进行测试。
### 检测方法
在运行 GPU-only 测试之前,**必须**先检查 GPU 显存:
```bash
nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
```
### GPU 分类
| GPU 型号 | 显存 | GPU-only 测试 |
|----------|------|---------------|
| A100 40GB | 40GB | ✅ 允许 |
| A100 80GB | 80GB | ✅ 允许 |
| H100 80GB | 80GB | ✅ 允许 |
| A6000 | 48GB | ✅ 允许 |
| RTX 3090 | 24GB | ❌ **禁止**(仅 offload 模式) |
| RTX 4090 | 24GB | ❌ **禁止**(仅 offload 模式) |
### 执行流程
1. **检测 GPU 显存**(必须)
2. **显存 >= 40GB**:继续执行 GPU-only 测试
3. **显存 < 40GB****停止**,提示用户:
> "当前 GPU 显存为 XXX GB不满足 GPU-only 模式的最低 40GB 要求。请使用 `--enable-offload` 参数启用 CPU offload 模式。"
### 代码示例
```python
# 在运行 GPU-only benchmark 之前
import subprocess
result = subprocess.run(
["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
capture_output=True, text=True
)
vram_mb = int(result.stdout.strip().split('\n')[0])
if vram_mb < 40000: # 40GB = 40000MB
raise RuntimeError(f"GPU VRAM ({vram_mb}MB) < 40GB. Use --enable-offload for this GPU.")
```
### 适用范围
| 脚本 | 适用此规则 |
|------|-----------|
| `bench.py` | ✅ 必须检查显存 |
| `bench_offload.py` | ❌ 不适用(始终使用 offload |
| `tests/test_*.py --enable-offload` | ❌ 不适用 |
| `tests/test_*.py` (无 offload) | ✅ 必须检查显存 |

View File

@@ -1,5 +1,39 @@
# Sparse Policy 代码规范
## Policy 不能为 None (CRITICAL)
**强制规则**: `sparse_policy` 参数**永远不能为 None**,必须至少为 `FullAttentionPolicy`
```python
# ❌ 错误:允许 None
sparse_policy = getattr(config, 'sparse_policy', None)
# ✅ 正确:显式处理 None默认使用 FULL
sparse_policy_type = getattr(config, 'sparse_policy', None)
if sparse_policy_type is None:
sparse_policy_type = SparsePolicyType.FULL
```
**原因**:
1. 统一的 API所有代码路径都通过 policy 进行 attention 计算
2. 避免空指针:消除 `policy.xxx` 调用时的 None 检查
3. 简化逻辑:不需要 `if policy is not None` 的分支
**唯一例外Warmup 阶段**
`model_runner.warmup_model()` 期间kvcache_manager 还未分配。此时 `attention.py` 使用 flash_attn fallback
```python
# attention.py 中的 warmup 处理
if context.kvcache_manager is None:
# Warmup phase: use flash_attn directly
return flash_attn_varlen_func(...) if context.is_prefill else flash_attn_with_kvcache(...)
```
这是唯一允许 kvcache_manager 为 None 的情况。正式推理时policy 必须存在。
---
## 基类要求 (MANDATORY)
每个 `SparsePolicy` 子类 **必须** 遵守以下要求:

View File

@@ -0,0 +1,90 @@
# test_ruler.py 使用规则
## 强制规则
**执行 `test_ruler.py` 前必须查阅文档**,禁止运行 `--help` 或猜测参数。
| 禁止 | 原因 |
|------|------|
| `python tests/test_ruler.py --help` | 浪费交互,文档已有完整说明 |
| 猜测参数格式 | 容易出错,降低效率 |
## 必读文档
**[`docs/test_ruler_usage_guide.md`](../docs/test_ruler_usage_guide.md)** - 包含:
- 完整参数说明
- 已验证的命令示例
- GPU 模式选择指南
- max-model-len 设置指南
## 快速参考
### 标准命令格式
```bash
CUDA_VISIBLE_DEVICES=<GPU> PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/<MODEL> \
--data-dir tests/data/ruler_<CTX> \
--datasets <TASK> \
--num-samples <N> \
--max-model-len <LEN> \
--enable-offload \
[--sparse-policy XATTN_BSA] \
[--sparse-threshold 0.9]
```
### 常用参数速查
| 参数 | 用途 | 示例 |
|------|------|------|
| `--datasets` | 指定任务 | `niah_single_1,qa_1` |
| `--num-samples` | 样本数 | `1`, `10`, `0`(全部) |
| `--sample-indices` | 指定索引 | `0,5,10` |
| `--enable-offload` | CPU offload | RTX 3090 必须 |
| `--sparse-policy` | 稀疏策略 | `XATTN_BSA` |
| `--json-output` | JSON 输出 | 脚本使用 |
| `--quiet` | 安静模式 | 减少输出 |
### max-model-len 速查
| 数据目录 | max-model-len |
|---------|---------------|
| ruler_32k | 40960 |
| ruler_64k | 72000 |
| ruler_128k | 135000 |
### 常用命令模板
**32K Offload + XAttn**:
```bash
CUDA_VISIBLE_DEVICES=<GPU> PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_32k \
--datasets niah_single_1 \
--num-samples 1 \
--max-model-len 40960 \
--enable-offload \
--sparse-policy XATTN_BSA
```
**64K Offload + XAttn**:
```bash
CUDA_VISIBLE_DEVICES=<GPU> PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_64k \
--datasets niah_single_1 \
--num-samples 1 \
--max-model-len 72000 \
--enable-offload \
--sparse-policy XATTN_BSA
```
## 执行前检查清单
- [ ] 用户指定了 GPU否则询问
- [ ] RTX 3090/4090必须 `--enable-offload`
- [ ] data-dir 与 max-model-len 匹配?
- [ ] 需要 density 统计?添加 `--sparse-policy XATTN_BSA`

1
.gitignore vendored
View File

@@ -240,3 +240,4 @@ findings_*.md
progress_*.md
notes.md
Snipaste*
.ralph-tui/session-meta.json

12
.ralph-tui/config.toml Normal file
View File

@@ -0,0 +1,12 @@
# Ralph TUI Configuration
# Generated by setup wizard
# See: ralph-tui config help
configVersion = "2.1"
tracker = "json"
agent = "claude"
maxIterations = 30
autoCommit = false
[trackerOptions]
[agentOptions]

View File

@@ -4,7 +4,7 @@ This file provides guidance to Claude Code when working with this repository.
## Overview
Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports Qwen3 models with CPU offload for long-context inference.
Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports Qwen3, Llama-3, and GLM-4 models with CPU offload for long-context inference.
## Documentation Index
@@ -16,8 +16,10 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
| [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (XAttention, FlexPrefill, MInference, AvgPool, Quest), computation flow, algorithms |
| [`docs/xattention_algorithm_guide.md`](docs/xattention_algorithm_guide.md) | XAttention 算法详解: stride reshape、Triton kernels、BSA 依赖、块选择算法 |
| [`docs/xattn_kernels_guide.md`](docs/xattn_kernels_guide.md) | XAttention Triton kernels: flat_group_gemm (反对角线求和)、softmax_fuse_block_sum (block 聚合) |
| [`docs/xattn_kv_chunking_kernels.md`](docs/xattn_kv_chunking_kernels.md) | XAttention KV Chunking: 三阶段 softmax、存储开销分析 (O(S) vs O(S²))、峰值显存优化 (8x)、Q/KV 独立分块 |
| [`docs/xattn_chunked_prefill.md`](docs/xattn_chunked_prefill.md) | XAttention chunked prefill: API、使用方式、一致性要求 |
| [`docs/xattn_bsa_policy_design.md`](docs/xattn_bsa_policy_design.md) | XAttention BSA Policy: 算法设计、性能基准(128K)、内存管理、density 统计 |
| [`docs/xattn_density_benchmark.md`](docs/xattn_density_benchmark.md) | 📊 XAttention Density Benchmark: 4K-32K context、stride 参数、per-layer density 分析 |
| [`docs/block_sparse_attn_interface.md`](docs/block_sparse_attn_interface.md) | BSA (Block Sparse Attention) 接口文档: 函数签名、使用示例、约束条件 |
| [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, hook positions, tensor comparison, memory profiling |
| [`docs/optimization_guide.md`](docs/optimization_guide.md) | Performance optimizations: sgDMA (15x), Triton merge (4.3x), N-way pipeline (2x) |
@@ -29,6 +31,23 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
| [`docs/cpu_scheduling_latency_analysis.md`](docs/cpu_scheduling_latency_analysis.md) | ⚡ PERF: CPU 调度延迟分析kernel 间隙来源GPU 利用率优化方向 |
| [`docs/bench_offload_results.md`](docs/bench_offload_results.md) | 📊 BENCH: CPU offload 性能测试结果Full vs XAttention 对比 (32K/128K) |
| [`docs/cpu_offload_optimization_strategies.md`](docs/cpu_offload_optimization_strategies.md) | 🚀 OPT: CPU offload 优化策略chunk size、CUDA Graph、前沿研究(InfiniGen/ShadowKV) |
| [`docs/gpu_only_xattn_guide.md`](docs/gpu_only_xattn_guide.md) | 🚀 GPU-Only XAttention: 内存预分配、性能分析 (32K +15%, 64K +41%)、CUDA Graph 限制 |
| [`docs/xattn_performance_analysis.md`](docs/xattn_performance_analysis.md) | 📊 XAttention 性能分析: NVTX 标记、block size 影响、estimate vs compute 耗时对比 |
| [`docs/observer_architecture.md`](docs/observer_architecture.md) | 📊 Observer 架构: InferenceObserver (TTFT/TPOT)、MemoryObserver (H2D/D2H/D2D) 设计 |
| [`docs/memory_communication_benchmark.md`](docs/memory_communication_benchmark.md) | 📊 通信量测试: Full vs XAttention 通信量对比 (32K/64K)、阶段分离统计 |
| [`docs/estimate_block_size_performance.md`](docs/estimate_block_size_performance.md) | 🔥 PERF: estimate 阶段 block_size 性能分析softmax_fuse_block_sum 最优点 (512-1024),当前 4096 慢 15x |
| [`docs/long_context_models_1m.md`](docs/long_context_models_1m.md) | 📚 REF: 1M+ 上下文长度模型列表 (Qwen/GLM/InternLM/Llama/VL)≤10B 推荐模型 |
| [`docs/new_model_integration_guide.md`](docs/new_model_integration_guide.md) | 🔧 GUIDE: 新模型整合指南 - 配置映射、RoPE变体、EOS处理、权重转换、验证清单 |
| [`docs/xattn_density_alignment_analysis.md`](docs/xattn_density_alignment_analysis.md) | 📊 ANALYSIS: GPU-only vs Offload 模式 density 对齐分析chunked softmax 边界效应5-7% 差异根因 |
| [`docs/xattn_kv_chunking_density_test.md`](docs/xattn_kv_chunking_density_test.md) | 🧪 TEST: XAttention KV chunking density 验证threshold=1.0 对齐threshold<1.0 差异 10-13% |
| [`docs/gpuonly_density_alignment_test.md`](docs/gpuonly_density_alignment_test.md) | ✅ TEST: Density 对齐验证 (GPU-only + Offload, 4K-64K)xattn_estimate vs KV chunking 完全一致 |
| [`docs/xattn_memory_benchmark.md`](docs/xattn_memory_benchmark.md) | 📊 BENCH: XAttention 内存基准测试Qwen3-0.6B 32K 在 24GB 显存可行 (gpu-util=0.28) |
| [`docs/xattn_offload_stream_sync_fix.md`](docs/xattn_offload_stream_sync_fix.md) | 🐛 FIX: XAttention Offload stream 同步 bugPass1/Pass2 K 数据不一致compute_stream 包装 |
| [`docs/xattn_density_types.md`](docs/xattn_density_types.md) | 📊 Compute vs Comm density: BSA block (128) vs CPU block (4096) 粒度,聚合效应导致 comm=100% |
| [`docs/xattn_density_alignment_verification.md`](docs/xattn_density_alignment_verification.md) | ✅ VERIFIED: GPU-only vs Offload density 对齐验证 (32K 差异 0.37%, 64K 差异 0.09%) |
| [`docs/test_ruler_usage_guide.md`](docs/test_ruler_usage_guide.md) | 📖 GUIDE: test_ruler.py 使用指南RULER benchmark 测试命令,已验证的命令示例 |
| [`docs/xattn_offload_profiling_32k.md`](docs/xattn_offload_profiling_32k.md) | 📊 PROFILE: XAttn vs Full 32K nsys 分析estimate 占 41%find_blocks 占 37%compute 仅 21% |
| [`docs/changelog_2026-02-05.md`](docs/changelog_2026-02-05.md) | 📋 CHANGELOG: GQA buffer OOM 修复 (节省 16GB)tests 目录清理 (-4306 行) |
## Rules Index
@@ -39,6 +58,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
| [`.claude/rules/sparse-policy.md`](.claude/rules/sparse-policy.md) | SparsePolicy implementation requirements |
| [`.claude/rules/planning-with-files.md`](.claude/rules/planning-with-files.md) | Planning file management for complex tasks |
| [`.claude/rules/gpu-monitor.md`](.claude/rules/gpu-monitor.md) | **GPU memory monitoring**: 必须使用 gpu-monitor agent禁止手动 nvidia-smi 循环 |
| [`.claude/rules/test-ruler.md`](.claude/rules/test-ruler.md) | **test_ruler.py 规则**: 禁止 --help必须查阅文档含快速参考和命令模板 |
## GPU Mutex for Multi-Instance Debugging
@@ -93,6 +113,13 @@ PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python tests/test_needle.py
**Files**: `bench.py` (GPU), `bench_offload.py` (CPU offload), `bench_vllm.py` (comparison)
**GPU-only 测试模型选择**:
| GPU | 显存 | GPU-only 测试模型 |
|-----|------|------------------|
| RTX 3090 | 24GB | **Qwen3-0.6B** (必须7B+ 模型会 OOM) |
| A100 | 40GB+ | Qwen3-0.6B / 4B / 7B 均可 |
**Offload Mode Constraint**: When using `enable_cpu_offload=True`, only test with context length ≥ 32K. Shorter contexts don't exercise the chunked offload pipeline properly.
**Common Issues**:

View File

@@ -2,6 +2,7 @@ import os
import time
from random import randint, seed
from nanovllm import LLM, SamplingParams
from nanovllm.utils.observer import InferenceObserver
def bench_decode(llm, num_seqs, input_len, output_len):
@@ -14,13 +15,17 @@ def bench_decode(llm, num_seqs, input_len, output_len):
llm.generate(prompt_token_ids, sampling_params, use_tqdm=False)
t = time.time() - t
# Calculate metrics
prefill_tokens = num_seqs * input_len
# Get metrics from InferenceObserver
ttft_ms = InferenceObserver.ttft / 1e6
tpot_ms = InferenceObserver.tpot / 1e6
# Calculate throughput from observer metrics
decode_tokens = num_seqs * output_len
decode_throughput = decode_tokens / t
decode_throughput = 1000.0 / tpot_ms if tpot_ms > 0 else 0 # tokens/s per sequence
print(f"[Decode] Input: {num_seqs}x{input_len}tok, Output: {decode_tokens}tok, Time: {t:.2f}s")
print(f" Throughput: {decode_throughput:.2f} tok/s (includes prefill overhead)")
print(f" TTFT: {ttft_ms:.2f}ms, TPOT: {tpot_ms:.2f}ms")
print(f" Decode Throughput: {decode_throughput:.2f} tok/s (from observer)")
def bench_prefill(llm, num_seqs, input_len):
@@ -33,31 +38,69 @@ def bench_prefill(llm, num_seqs, input_len):
t = time.time()
llm.generate(prompt_token_ids, sampling_params, use_tqdm=False)
t = time.time() - t
# Get TTFT from InferenceObserver
ttft_ms = InferenceObserver.ttft / 1e6
ttft_s = ttft_ms / 1000.0
total_input_tokens = num_seqs * input_len
throughput = total_input_tokens / t
print(f"[Prefill] Input: {total_input_tokens}tok ({num_seqs}x{input_len}), Time: {t:.2f}s, Throughput: {throughput:.2f}tok/s")
# Use observer TTFT for accurate prefill throughput
throughput_observer = total_input_tokens / ttft_s if ttft_s > 0 else 0
throughput_external = total_input_tokens / t
print(f"[Prefill] Input: {total_input_tokens}tok ({num_seqs}x{input_len})")
print(f" External Time: {t:.2f}s, Throughput: {throughput_external:.2f}tok/s")
print(f" Observer TTFT: {ttft_ms:.2f}ms, Throughput: {throughput_observer:.2f}tok/s")
def main():
import argparse
from nanovllm.config import SparsePolicyType
parser = argparse.ArgumentParser(description="Benchmark nanovllm GPU performance")
parser.add_argument("--model", type=str, default="~/models/Llama-3.1-8B-Instruct",
help="Model path (default: ~/models/Llama-3.1-8B-Instruct)")
parser.add_argument("--input-len", type=int, default=None, help="Input length in tokens")
parser.add_argument("--output-len", type=int, default=64, help="Output length for decode benchmark (default: 64)")
parser.add_argument("--max-len", type=int, default=32*1024, help="Max model length (default: 32K)")
parser.add_argument("--bench-decode", action="store_true", help="Run decode benchmark (default: prefill only)")
parser.add_argument("--bench-all", action="store_true", help="Run both prefill and decode benchmarks")
# Sparse policy option (GPU-only mode now supports policy routing)
parser.add_argument("--policy", type=str, default=None,
choices=["full", "xattn"],
help="Sparse policy: full (FullAttention), xattn (XAttention+BSA)")
parser.add_argument("--enable-policy", action="store_true",
help="Enable sparse policy routing (FullAttentionPolicy by default)")
parser.add_argument("--gpu-util", type=float, default=0.9,
help="GPU memory utilization (default: 0.9)")
parser.add_argument("--block-size", type=int, default=1024,
help="KV cache block size (default: 1024)")
parser.add_argument("--enforce-eager", action="store_true",
help="Disable CUDA graphs (default: False)")
args = parser.parse_args()
path = os.path.expanduser("~/models/Qwen3-4B-Instruct-2507/")
path = os.path.expanduser(args.model)
max_len = args.max_len
print(f"\n[nanovllm GPU] max_len={max_len}")
# Configure sparse policy
if args.policy == "xattn":
sparse_policy = SparsePolicyType.XATTN_BSA
print(f"\n[nanovllm GPU + XAttention BSA] max_len={max_len}")
elif args.policy == "full" or args.enable_policy:
sparse_policy = SparsePolicyType.FULL
print(f"\n[nanovllm GPU + Policy] sparse_policy=FULL, max_len={max_len}")
else:
sparse_policy = None
print(f"\n[nanovllm GPU] max_len={max_len}")
llm = LLM(
path,
enforce_eager=False,
enforce_eager=args.enforce_eager,
max_model_len=max_len,
max_num_batched_tokens=max_len,
sparse_policy=sparse_policy,
gpu_memory_utilization=args.gpu_util,
kvcache_block_size=args.block_size,
)
# Warmup

View File

@@ -2,6 +2,15 @@ import os
import time
from random import randint, seed
from nanovllm import LLM, SamplingParams
from nanovllm.utils.observer import InferenceObserver
from nanovllm.utils.memory_observer import MemoryObserver
def print_memory_stats():
"""Print MemoryObserver communication statistics"""
fmt = MemoryObserver._fmt_bytes
print(f"[Memory] Prefill H2D: {fmt(MemoryObserver.prefill_h2d_bytes)}, D2H: {fmt(MemoryObserver.prefill_d2h_bytes)}")
print(f" Decode H2D: {fmt(MemoryObserver.decode_h2d_bytes)}, D2H: {fmt(MemoryObserver.decode_d2h_bytes)}")
def bench_decode(llm, num_seqs, input_len, output_len):
@@ -14,16 +23,18 @@ def bench_decode(llm, num_seqs, input_len, output_len):
llm.generate(prompt_token_ids, sampling_params, use_tqdm=False)
t = time.time() - t
# Calculate metrics
prefill_tokens = num_seqs * input_len
decode_tokens = num_seqs * output_len
# Get metrics from InferenceObserver
ttft_ms = InferenceObserver.ttft / 1e6
tpot_ms = InferenceObserver.tpot / 1e6
# Approximate: assume prefill takes ~input_len/prefill_speed, rest is decode
# For more accurate measurement, we'd need internal timing
decode_throughput = decode_tokens / t # This includes prefill time, so it's a lower bound
# Calculate throughput from observer metrics
decode_tokens = num_seqs * output_len
decode_throughput = 1000.0 / tpot_ms if tpot_ms > 0 else 0 # tokens/s per sequence
print(f"[Decode] Input: {num_seqs}x{input_len}tok, Output: {decode_tokens}tok, Time: {t:.2f}s")
print(f" Throughput: {decode_throughput:.2f} tok/s (includes prefill overhead)")
print(f" TTFT: {ttft_ms:.2f}ms, TPOT: {tpot_ms:.2f}ms")
print(f" Decode Throughput: {decode_throughput:.2f} tok/s (from observer)")
print_memory_stats()
def bench_prefill(llm, num_seqs, input_len):
@@ -36,9 +47,20 @@ def bench_prefill(llm, num_seqs, input_len):
t = time.time()
llm.generate(prompt_token_ids, sampling_params, use_tqdm=False)
t = time.time() - t
# Get TTFT from InferenceObserver
ttft_ms = InferenceObserver.ttft / 1e6
ttft_s = ttft_ms / 1000.0
total_input_tokens = num_seqs * input_len
throughput = total_input_tokens / t
print(f"[Prefill] Input: {total_input_tokens}tok ({num_seqs}x{input_len}), Time: {t:.2f}s, Throughput: {throughput:.2f}tok/s")
# Use observer TTFT for accurate prefill throughput
throughput_observer = total_input_tokens / ttft_s if ttft_s > 0 else 0
throughput_external = total_input_tokens / t
print(f"[Prefill] Input: {total_input_tokens}tok ({num_seqs}x{input_len})")
print(f" External Time: {t:.2f}s, Throughput: {throughput_external:.2f}tok/s")
print(f" Observer TTFT: {ttft_ms:.2f}ms, Throughput: {throughput_observer:.2f}tok/s")
print_memory_stats()
def main():
@@ -66,6 +88,7 @@ def main():
parser.add_argument("--input-len", type=int, default=None, help="Input length in tokens")
parser.add_argument("--output-len", type=int, default=64, help="Output length for decode benchmark (default: 64)")
parser.add_argument("--num-gpu-blocks", type=int, default=4, help="Number of GPU blocks (default: 4)")
parser.add_argument("--block-size", type=int, default=1024, help="KV cache block size (default: 1024)")
parser.add_argument("--max-len", type=int, default=32*1024, help="Max model length (default: 32K)")
parser.add_argument("--bench-decode", action="store_true", help="Run decode benchmark (default: prefill only)")
parser.add_argument("--bench-all", action="store_true", help="Run both prefill and decode benchmarks")
@@ -75,6 +98,9 @@ def main():
path = os.path.expanduser(args.model)
max_len = args.max_len
# Enable MemoryObserver for communication stats
MemoryObserver._enabled = True
# Setup policy configuration
if args.enable_quest:
sparse_policy = SparsePolicyType.QUEST
@@ -86,7 +112,7 @@ def main():
sparse_policy = SparsePolicyType.FULL
print("\n[Full Attention] baseline (no sparse)")
print(f"[Config] max_len={max_len}, num_gpu_blocks={args.num_gpu_blocks}")
print(f"[Config] max_len={max_len}, num_gpu_blocks={args.num_gpu_blocks}, block_size={args.block_size}")
llm = LLM(
path,
@@ -95,6 +121,7 @@ def main():
max_num_batched_tokens=max_len,
enable_cpu_offload=True,
num_gpu_blocks=args.num_gpu_blocks,
kvcache_block_size=args.block_size,
sparse_policy=sparse_policy,
# Quest parameters
sparse_topk_blocks=args.topk,

View File

@@ -1,5 +1,14 @@
import os
os.environ["VLLM_USE_V1"] = "1"
import sys
# Parse --use-v1 flag before importing vllm
use_v1 = "--use-v1" in sys.argv
if use_v1:
os.environ["VLLM_USE_V1"] = "1"
sys.argv.remove("--use-v1")
else:
os.environ["VLLM_USE_V1"] = "0"
import time
from random import randint, seed
from vllm import LLM, SamplingParams
@@ -44,24 +53,28 @@ def bench_prefill(llm, num_seqs, input_len):
def main():
import argparse
parser = argparse.ArgumentParser(description="Benchmark vLLM performance (for comparison)")
parser.add_argument("--model", type=str, default="~/models/Llama-3.1-8B-Instruct",
help="Model path (default: ~/models/Llama-3.1-8B-Instruct)")
parser.add_argument("--input-len", type=int, default=None, help="Input length in tokens")
parser.add_argument("--output-len", type=int, default=64, help="Output length for decode benchmark (default: 64)")
parser.add_argument("--max-len", type=int, default=32*1024, help="Max model length (default: 32K)")
parser.add_argument("--gpu-util", type=float, default=0.9, help="GPU memory utilization (default: 0.9)")
parser.add_argument("--enforce-eager", action="store_true", help="Disable CUDA Graphs (use eager mode)")
parser.add_argument("--bench-decode", action="store_true", help="Run decode benchmark (default: prefill only)")
parser.add_argument("--bench-all", action="store_true", help="Run both prefill and decode benchmarks")
args = parser.parse_args()
path = os.path.expanduser("~/models/Qwen3-4B-Instruct-2507/")
path = os.path.expanduser(args.model)
max_len = args.max_len
print(f"\n[vLLM] max_len={max_len}")
print(f"\n[vLLM] max_len={max_len}, gpu_util={args.gpu_util}, enforce_eager={args.enforce_eager}")
llm = LLM(
path,
enforce_eager=False,
enforce_eager=args.enforce_eager,
max_model_len=max_len,
max_num_seqs=128,
gpu_memory_utilization=0.9,
gpu_memory_utilization=args.gpu_util,
)
# Warmup

View File

@@ -9,8 +9,6 @@
| GPU | NVIDIA A100-SXM4-80GB |
| 模型 | Llama-3.1-8B-Instruct |
| GPU slots | 4 |
| Block size | 1024 tokens |
| Chunk size | 2048 tokens |
## Sparse Policy 配置
@@ -21,69 +19,181 @@
## 测试结果
### 32K 上下文
### Block Size 4096 (推荐)
| 策略 | 输入长度 | 耗时 | 吞吐量 | 相对性能 |
|------|----------|------|--------|----------|
| Full Attention | 32767 tok | 20.64s | **1587.74 tok/s** | baseline |
| XAttention BSA | 32767 tok | 27.95s | **1172.33 tok/s** | 0.74x |
#### GPU-only 模式
### 128K 上下文
| 上下文 | Full Attention | XAttention | 相对性能 |
|--------|----------------|------------|----------|
| 32K | 4863 tok/s | 5587 tok/s | **+14.9%** ✅ |
| 64K | 3373 tok/s | 4766 tok/s | **+41.3%** ✅ |
| 策略 | 输入长度 | 耗时 | 吞吐量 | 相对性能 |
|------|----------|------|--------|----------|
| Full Attention | 131071 tok | 237.18s | **552.63 tok/s** | baseline |
| XAttention BSA | 131071 tok | 281.17s | **466.17 tok/s** | 0.84x |
#### CPU Offload 模式 (优化后, 2026-01-28)
### KV Cache 配置
| 上下文 | Full Attention | XAttention | 相对性能 |
|--------|----------------|------------|----------|
| 32K | 4678 tok/s | 4398 tok/s | **-6.0%** |
| 64K | 3331 tok/s | 3203 tok/s | **-3.8%** |
| 128K | 2144 tok/s | 2196 tok/s | **+2.4%** ✅ |
| 上下文 | GPU Memory | CPU Memory | Total |
|--------|------------|------------|-------|
| 32K | 512 MB (4 blocks) | 4096 MB (32 blocks) | 4608 MB |
| 128K | 512 MB (4 blocks) | 16384 MB (128 blocks) | 16896 MB |
#### CPU Offload 模式 (优化前, 2026-01-27)
## 分析
| 上下文 | Full Attention | XAttention | 相对性能 |
|--------|----------------|------------|----------|
| 32K | 4648 tok/s | 4002 tok/s | **-13.9%** ❌ |
| 64K | 3329 tok/s | 2642 tok/s | **-20.6%** ❌ |
| 128K | 2122 tok/s | 867 tok/s | **-59.1%** ❌ |
### XAttention 性能特点
### Block Size 256 (小 block 测试)
1. **32K 上下文**: XAttention 比 Full 慢 26%
2. **128K 上下文**: XAttention 比 Full 慢 16%
#### CPU Offload 模式 (64K)
随着上下文增长XAttention 的相对性能有所提升74% → 84%),但仍未超过 Full Attention。
| 策略 | 耗时 | 吞吐量 | 相对性能 |
|------|------|--------|----------|
| Full Attention | 401.04s | 163.41 tok/s | baseline |
| XAttention BSA | 390.35s | 167.89 tok/s | **+2.7%** ✅ |
### 原因分析
### Block Size 1024 (历史测试)
1. **tau=0.95 阈值较高**: 需要覆盖 95% 累积注意力,实际跳过的 block 较少
2. **估计开销**: `xattn_estimate_chunked` 需要对每个 chunk 计算稀疏 mask
3. **BSA kernel overhead**: Block sparse kernel 有额外的 mask 处理和索引开销
4. **Offload 瓶颈**: CPU→GPU 传输是主要瓶颈,稀疏注意力节省的是计算而非传输
#### CPU Offload 模式
### 适用场景
| 上下文 | Full Attention | XAttention | 相对性能 |
|--------|----------------|------------|----------|
| 32K | 1587.74 tok/s | 1172.33 tok/s | -26% |
| 128K | 552.63 tok/s | 466.17 tok/s | -16% |
XAttention BSA 更适合以下场景:
- 更长的上下文256K+),稀疏收益更明显
- 计算密集型任务(非 offload 模式),传输不是瓶颈
- 较低的 tau 阈值(如 0.8),增加稀疏性
## 关键发现
### 1. GPU-only vs CPU Offload 模式差异
| 模式 | XAttention 效果 | 原因 |
|------|-----------------|------|
| **GPU-only** | ✅ 显著加速 (+15% ~ +41%) | 计算是瓶颈,稀疏注意力减少 FLOPs |
| **CPU Offload (优化后)** | ✅ 长上下文略有收益 | estimate_block_size 优化减少估计开销 |
| **CPU Offload (优化前)** | ❌ 性能下降 (-14% ~ -59%) | 传输是瓶颈,稀疏估计增加额外开销 |
### 2. Block Size 对性能的影响
| Block Size | 64K Full (Offload) | 特点 |
|------------|-------------------|------|
| 4096 | 3329 tok/s | ⭐ 最佳性能 |
| 1024 | ~1500 tok/s | 中等 |
| 256 | 163 tok/s | 极慢20x 下降) |
**原因**: 更小的 block = 更多的 blocks = 更多 H2D 传输开销
### 3. XAttention 在小 Block Size 下反转
当 block size = 256 时XAttention 反而略有优势 (+2.7%)
- 256 个 blocks (vs 16 个 @ 4096)
- 稀疏跳过的 blocks 比例更明显
- 但绝对性能极差,不推荐使用
### 4. estimate_block_size 优化效果 (2026-01-28)
```
Offload 模式 XAttention 相对性能变化:
优化前 优化后 改进
32K: -13.9% -6.0% +7.9pp
64K: -20.6% -3.8% +16.8pp
128K: -59.1% +2.4% +61.5pp ✅
```
优化内容:
- `estimate_block_size` 从 4096 改为 1024
- `softmax_fuse_block_sum` kernel 时间从 48% 降到 1% (44x 加速)
- 选择策略从 mask + voting 改为 score + threshold
优化后结论:
- **128K 长上下文 XAttention 反超 Full Attention**
- 短上下文仍有少量开销,但已显著减少
## 结论
### 推荐配置 (优化后, 2026-01-28)
| 场景 | 推荐策略 | Block Size |
|------|----------|------------|
| GPU-only (VRAM 充足) | XAttention | 4096 |
| CPU Offload (128K+) | XAttention | 4096 |
| CPU Offload (32K-64K) | Full Attention 或 XAttention | 4096 |
### XAttention 适用条件 (优化后)
**适合**:
- GPU-only 模式(计算密集)
- CPU Offload + 长上下文128K+)有正向收益
- 长上下文64K+)收益更大
⚠️ **中性**:
- CPU Offload + 中等上下文32K-64K略慢 3-6%,可接受
**不推荐**:
- 短上下文(<32K收益不明显
## 运行命令
```bash
# Full Attention (32K)
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --max-len 32768
# GPU-only 模式
CUDA_VISIBLE_DEVICES=0 python bench.py --max-len 65536 --block-size 4096 --gpu-util 0.7
CUDA_VISIBLE_DEVICES=0 python bench.py --max-len 65536 --block-size 4096 --gpu-util 0.7 --policy xattn
# XAttention BSA (32K)
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --max-len 32768 --enable-xattn
# CPU Offload 模式 (推荐 block-size 4096)
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --max-len 65536 --block-size 4096
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --max-len 65536 --block-size 4096 --enable-xattn
# Full Attention (128K)
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --max-len 131072
# XAttention BSA (128K)
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --max-len 131072 --enable-xattn
# CPU Offload 模式 (小 block size 测试)
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --max-len 65536 --block-size 256
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --max-len 65536 --block-size 256 --enable-xattn
# 调整 XAttention 参数
CUDA_VISIBLE_DEVICES=0 python bench_offload.py --enable-xattn --xattn-threshold 0.8 --xattn-stride 16
```
## FlashInfer Merge 优化 (2026-01-28)
将 Triton 实现的 `merge_attention_outputs` 替换为 FlashInfer 的 `cascade.merge_state`
### 性能对比 (Full Attention, block-size 4096)
| 上下文 | Triton merge | FlashInfer merge | 提升 |
|--------|--------------|------------------|------|
| 32K | 4678 tok/s | 4717 tok/s | **+0.8%** |
| 64K | 3331 tok/s | 3411 tok/s | **+2.4%** |
| 128K | 2144 tok/s | 2178 tok/s | **+1.6%** |
### 关键发现
1. **端到端提升有限**0.8% ~ 2.4%merge 操作不是主要瓶颈
- H2D 传输占主导64K 传输 64GB
- Attention 计算是另一主要耗时
- Merge 在总耗时中占比很小
2. **Merge kernel 单独对比**(长序列时 FlashInfer 优势明显):
| seq_len | heads | Triton (ms) | FlashInfer (ms) | Speedup |
|---------|-------|-------------|-----------------|---------|
| 4096 | 32 | 0.129 | 0.087 | **1.49x** |
| 8192 | 32 | 0.251 | 0.147 | **1.70x** |
| 16384 | 32 | 0.499 | 0.274 | **1.82x** |
3. **短序列 FlashInfer 反而慢**格式转换开销squeeze, transpose, contiguous
### 技术细节
- **LSE 格式差异**FlashInfer 使用 log2flash_attn 使用 ln
- **转换系数**`LOG2_E = 1.4427`ln → log2`LN_2 = 0.6931`log2 → ln
- **FlashInfer attention JIT 问题**CUDA 版本兼容性问题,仅使用 merge_state
### 代码位置
- `nanovllm/ops/chunked_attention.py`: `merge_attention_outputs_flashinfer()`
- `nanovllm/kvcache/sparse/full_policy.py`: 3 处 import 更新
- `nanovllm/kvcache/sparse/xattn_bsa.py`: 1 处 import 更新
## 更新记录
- 2026-01-28: **FlashInfer merge 替换 Triton merge**,端到端提升 0.8% ~ 2.4%
- 2026-01-28: **estimate_block_size 优化后重新测试**128K XAttention 反超 Full (+2.4%)
- 2026-01-27: 添加 GPU-only vs Offload 对比block size 影响分析
- 2026-01-27: 初始测试Llama-3.1-8B-Instruct, A100 80GB

View File

@@ -0,0 +1,94 @@
# Changelog 2026-02-05
## Bug Fixes
### XAttention Offload GQA Buffer OOM Fix
**Issue**: `docs/issue_xattn_offload_gqa_buffer_oom.md`
**Problem**: 在 XAttention BSA + CPU Offload 模式下,`alloc_policy_metadata()` 分配了只有 GPU-only 模式才需要的 GQA expansion buffers (`_k_expanded`, `_v_expanded`),导致 24GB GPU (RTX 3090) 上 OOM。
**Root Cause**:
- GQA buffer 大小: `2 × num_heads × max_seq_len × head_dim × dtype_size`
- 对于 1M max_seq_len: 2 × 32 × 1048576 × 128 × 2 = **16 GB**
- Offload 模式的 `compute_chunked_prefill()` 不需要这些 buffer
**Fix** (commit `11a867f`):
1. `nanovllm/kvcache/sparse/policy.py`: 基类添加 `enable_cpu_offload` 参数
2. `nanovllm/kvcache/sparse/xattn_bsa.py`: offload 模式跳过 GQA buffer 分配
3. `nanovllm/engine/model_runner.py`: 传入 `enable_cpu_offload` 参数
**Memory Savings**:
| max_model_len | 修复前 | 修复后 |
|---------------|--------|--------|
| 72K | +1.1 GB | 0 GB |
| 1M | +16 GB | 0 GB |
**Verification**:
```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_64k \
--datasets niah_single_1 \
--num-samples 1 \
--max-model-len 72000 \
--enable-offload \
--sparse-policy XATTN_BSA
```
- 日志显示: `[XAttn] Offload mode: skipping GQA expansion buffers`
- 测试结果: 100% 准确率
---
## Code Cleanup
### Tests Directory Cleanup
**Commits**: `a709551`, `2b61c5a`, `d35dd76`
删除了 16 个冗余/过时的测试文件,保留核心测试:
**保留的文件** (4 个):
| 文件 | 用途 |
|------|------|
| `test_ruler.py` | 核心 RULER benchmark (13 tasks, 100 samples) |
| `test_xattn_estimate_alignment.py` | XAttn kernel 一致性验证 |
| `utils.py` | 共享工具函数 |
| `__init__.py` | 包标记 |
**删除的文件** (16 个, -4306 行):
| 类别 | 文件 | 删除原因 |
|------|------|----------|
| XAttn 测试 | `test_xattn_bsa.py` | 功能被 test_ruler 覆盖 |
| | `test_xattn_chunked.py` | 与 estimate_chunked 重复 |
| | `test_xattn_estimate_chunked.py` | chunked prefill 验证 |
| | `test_xattn_kernels.py` | Triton kernel 单元测试 |
| | `test_xattn_kv_chunking_batch.py` | batch 验证 |
| Needle 测试 | `test_needle.py` | 被 test_ruler NIAH 任务覆盖 |
| | `test_needle_ref.py` | HF 参考实现 |
| CUDA Graph | `test_chunk_attention_graph.py` | 被 graph_reuse 取代 |
| | `test_chunk_attention_graph_reuse.py` | 实验性功能 |
| | `test_cudagraph_memory.py` | 内存分析工具 |
| 其他 | `test_gpuonly_density_alignment.py` | GPU-only 密度测试 |
| | `test_hierarchical_estimate.py` | 分层估计测试 |
| | `test_quest_policy.py` | Quest 策略测试 |
| | `test_sequential.py` | 状态隔离测试 |
| | `bench_estimate_block_size.py` | 性能 benchmark |
| | `modeling_qwen3.py` | Qwen3 参考模型 |
**Note**: 所有删除的文件可从 git 历史恢复:
```bash
git checkout <commit-hash>^ -- tests/<filename>
```
---
## Summary
| 类型 | 数量 | 影响 |
|------|------|------|
| Bug Fix | 1 | 节省 16GB 显存 (1M seq) |
| 文件删除 | 16 | -4306 行代码 |
| 新增文档 | 1 | 本文件 |

View File

@@ -0,0 +1,258 @@
# Estimate Block Size 性能分析
本文档记录 XAttention estimate 阶段中 `block_size` 参数对 `softmax_fuse_block_sum` kernel 性能的影响。
## 问题背景
当前 `select_blocks` 中的 estimate 过程使用全局的 `kvcache_block_size`(通常为 4096
```python
# xattn_bsa.py: select_blocks()
block_size = ctx.block_size # 来自 kvcache_manager.block_size (4096)
reshaped_block_size = block_size // self.stride # 4096/8 = 512
block_sums = softmax_fuse_block_sum(
attn_scores,
reshaped_block_size, # 512 - 性能最差点!
...
)
```
这导致 `softmax_fuse_block_sum` kernel 使用 `reshaped_block_size=512`,而这正是性能曲线的最差点。
## Benchmark 结果
### 测试配置
- GPU: NVIDIA A100-SXM4-80GB
- NUM_HEADS: 32
- HEAD_DIM: 128
- STRIDE: 8
- 测试脚本: `tests/bench_estimate_block_size.py`
### softmax_fuse_block_sum 性能数据
| block_size | reshaped | 16K context | 32K context | 64K context |
|------------|----------|-------------|-------------|-------------|
| 64 | 8 | 4.86ms | 18.36ms | 70.83ms |
| 128 | 16 | 0.83ms | 3.12ms | 16.83ms |
| 256 | 32 | 0.63ms | 2.41ms | 11.24ms |
| 512 | 64 | **0.38ms** | **1.52ms** | 9.54ms |
| 1024 | 128 | 0.42ms | 1.54ms | **6.01ms** |
| 2048 | 256 | 1.08ms | 3.24ms | 12.81ms |
| **4096** | **512** | 9.66ms | 25.36ms | **95.32ms** |
### 性能曲线
```
softmax_fuse_block_sum 耗时 (64K context):
block_size=64 ████████████████████████████████████ 70.83ms
block_size=128 ████████ 16.83ms
block_size=256 █████ 11.24ms
block_size=512 ████ 9.54ms
block_size=1024 ███ 6.01ms ◀── 最优点
block_size=2048 ██████ 12.81ms
block_size=4096 ████████████████████████████████████████████████ 95.32ms ◀── 当前使用
```
### 关键发现
1. **性能呈 U 型曲线**:太小和太大的 block_size 都会导致性能下降
2. **最优点在 512-1024**:对应 `reshaped_block_size` 64-128
3. **当前配置 (4096) 是最差点**95.32ms vs 最优 6.01ms**慢 15.85x**
## 性能曲线解释
```
Performance (耗时)
│ ▲ 太小:
│ / - output blocks 数量多 (q_len / block_size)
│/ - grid 调度开销大
│ - 每个 thread block 工作量小
│ ┌─────────┐
│ / 最优 \
│ / 区域 \ ▲ 太大:
│/ \ - block_size 作为 tl.constexpr
│ \ - 寄存器压力增大 (可能 spill)
│ \ - shared memory 不足
│ \- L1 cache 效率下降
└──────────────────────────────────→ block_size
64 128 256 512 1024 2048 4096
最优点 (512-1024)
```
### Triton Kernel 内部分析
`softmax_fuse_block_sum_kernel` 中的关键约束:
```python
# 每个 thread block 处理的数据
offs_q = tl.arange(0, block_size) # block_size 个元素
m_i = tl.zeros([block_size], dtype=tl.float32) # 寄存器分配
# reshape 操作
X = tl.reshape(X, (block_size, segment_size // block_size, block_size))
# 当 block_size=512, segment_size=512 时 → (512, 1, 512) 的 3D tensor
```
`block_size` 过大时:
- 每个 thread block 需要更多寄存器
- `tl.arange(0, block_size)` 生成更大的向量
- reshape 操作的内存访问模式变差
## 优化建议
### 方案 1: 固定 estimate block_size
`select_blocks` 中使用固定的小 block_size 进行估计:
```python
# 建议修改
ESTIMATE_BLOCK_SIZE = 1024 # 或 512而非 ctx.block_size
reshaped_block_size = ESTIMATE_BLOCK_SIZE // self.stride # 128
```
**优点**:简单直接,预期提升 15x
**缺点**estimate 的 block 粒度与 CPU block 不一致,需要映射
### 方案 2: 两级 block 结构
- 外层使用 `kvcache_block_size` (4096) 管理 CPU blocks
- 内层使用 `estimate_block_size` (1024) 进行估计
- 估计结果聚合回 CPU block 粒度
### 方案 3: 自适应 block_size
根据 context length 动态选择 estimate block_size
| Context Length | Recommended block_size |
|----------------|------------------------|
| < 16K | 512 |
| 16K - 64K | 1024 |
| > 64K | 1024 |
## 与实际 Profiling 的对比
### Nsys Profiling 数据 (64K context, block_size=4096)
| 阶段 | 时间占比 | 说明 |
|------|----------|------|
| softmax_fuse_block_sum | **48.1%** | 最后一个 chunk |
| flash_fwd_kernel | 30.7% | 实际 attention 计算 |
| flat_group_gemm | 3.5% | estimate GEMM |
### 预期优化效果
如果将 estimate block_size 从 4096 改为 1024
| 指标 | 当前 (4096) | 优化后 (1024) | 提升 |
|------|-------------|---------------|------|
| softmax kernel | 95.32ms | 6.01ms | **15.85x** |
| estimate 阶段占比 | 48.1% | ~5% | 显著降低 |
| 总体 prefill 时间 | ~2s (最后chunk) | ~1.1s | ~1.8x |
## 测试命令
```bash
# 运行 benchmark
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH \
python tests/bench_estimate_block_size.py --gpu 0
# 指定单个 context length
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH \
python tests/bench_estimate_block_size.py --gpu 0 --ctx-len 65536
```
## 相关文件
| 文件 | 说明 |
|------|------|
| `nanovllm/kvcache/sparse/xattn_bsa.py` | XAttention BSA Policy 实现 |
| `nanovllm/ops/xattn.py` | Triton kernels |
| `tests/bench_estimate_block_size.py` | 性能测试脚本 |
| `docs/xattn_performance_analysis.md` | XAttention 整体性能分析 |
## 分级求和方案 (Hierarchical Block Sum)
使用小的 `estimate_block_size=1024` 计算细粒度 block_sums然后聚合到 CPU block 级别 (4096)。
### 数学等价性
```
方案1 (block_size=4096): softmax_fuse_block_sum → [1, heads, 1, 1]
方案2 (block_size=1024): softmax_fuse_block_sum → [1, heads, 4, 4] → sum → [1, heads]
验证结果: Max difference = 0.0 ✅ 完全等价
```
### 验证代码
`tests/test_hierarchical_estimate.py` - 纯 torch + xattn kernels 实现
### 性能提升
| 指标 | 当前 (4096) | 优化后 (1024) | 提升 |
|------|-------------|---------------|------|
| softmax kernel | 12.07 ms | 0.29 ms | **41x** |
| 端到端 estimate | 95 ms | ~6 ms | **15x** |
## ⚠️ 选择策略变更
**重要**: 分级求和方案使用新的选择策略:
| 特性 | 原策略 (mask + voting) | 新策略 (score + threshold) |
|------|------------------------|----------------------------|
| 输入 | `[batch, heads, q_blocks, k_blocks]` | `[batch, heads, num_cpu_blocks]` |
| 选择粒度 | Per-q-block | Per-chunk |
| 聚合方式 | majority voting | threshold on scores |
新策略更简洁,直接利用分级求和产生的 score避免了 mask 生成和 voting 的复杂逻辑。
## 实现状态 ✅ (2026-01-28)
### 已实现
分级求和方案已在 `xattn_bsa.py` 中实现:
```python
class XAttentionBSAPolicy:
def __init__(self, ..., estimate_block_size: int = 1024):
self.estimate_block_size = estimate_block_size # 新参数
def select_blocks(self, ...):
# Step 2: Hierarchical softmax_fuse_block_sum
reshaped_est_bs = estimate_bs // self.stride # 1024/8 = 128
block_sums_fine = softmax_fuse_block_sum(attn_scores, reshaped_est_bs, ...)
# Step 3: Aggregate to CPU block level
block_sums_coarse = block_sums_fine.view(..., num_cpu_blocks, ratio).sum(dim=-1)
cpu_block_scores = block_sums_coarse.sum(dim=2)
# Step 4: Score + threshold selection (replaces mask + voting)
scores_per_block = cpu_block_scores.mean(dim=(0, 1))
# ... cumulative threshold selection
```
### 实测结果 (Nsys Profiling)
| Kernel | 优化前 | 优化后 | 改进 |
|--------|--------|--------|------|
| softmax_fuse_block_sum 占比 | 48.1% | **1.1%** | **44x** |
| softmax_fuse_block_sum 平均时间 | ~2ms | 489us | **4x** |
### 端到端性能 (32K context)
| 指标 | FULL Policy | XATTN Policy | 改进 |
|------|-------------|--------------|------|
| Prefill throughput | 3511 tok/s | 3695 tok/s | +5% |
| TTFT | 9327 ms | 8863 ms | -5% |
## 结论
当前 estimate 阶段使用全局 `kvcache_block_size=4096` 导致 `softmax_fuse_block_sum` kernel 性能处于最差点。通过将 estimate block_size 改为 512-1024可以获得 **15x** 的性能提升,显著降低 estimate 阶段的开销。
**⚠️ 重要变更**: 选择策略从 `mask + majority voting` 改为 `score + threshold`,更简洁且更直接。

View File

@@ -0,0 +1,77 @@
# GPU-only Sparse Policy 整合
本文档记录将 sparse attention 策略整合到 GPU-only 模式的过程和性能对比。
## 背景
当前 sparse policyQuest、XAttention仅在 CPU offload 路径中实现。目标是将其扩展到 GPU-only 模式,以提升长上下文场景下的性能。
## 基准性能(优化前)
**测试环境**:
- GPU: NVIDIA A100-SXM4-80GB
- 模型: Llama-3.1-8B-Instruct
- 上下文长度: 32K tokens
- 日期: 2026-01-27
### Prefill Benchmark (32K context)
| 模式 | Throughput | Time | KV Cache 分配 |
|------|------------|------|---------------|
| **GPU-only (Full Attention)** | 4869.67 tok/s | 6.73s | 438 blocks (56GB GPU) |
| CPU Offload (Full Attention) | 1500.29 tok/s | 21.84s | 4 blocks GPU + 32 blocks CPU |
**性能比**: GPU-only 比 CPU Offload 快 **3.2x**
### 配置详情
**GPU-only 模式**:
```bash
CUDA_VISIBLE_DEVICES=0 python bench.py \
--model ~/models/Llama-3.1-8B-Instruct \
--max-len 32768
```
**CPU Offload 模式**:
```bash
CUDA_VISIBLE_DEVICES=0 python bench_offload.py \
--model ~/models/Llama-3.1-8B-Instruct \
--max-len 32768
```
### KV Cache 配置
| 参数 | GPU-only | CPU Offload |
|------|----------|-------------|
| block_size | 1024 tokens | 1024 tokens |
| per-token KV | 128 KB | 128 KB |
| per-block KV | 128 MB | 128 MB |
| GPU blocks | 438 | 4 |
| CPU blocks | 0 | 32 |
| Total memory | 56 GB | 4.6 GB |
## 目标
将以下 sparse policy 整合到 GPU-only 模式:
| Policy | 阶段 | 描述 |
|--------|------|------|
| Quest | Decode | Top-K block selection based on query-key scores |
| XAttention BSA | Prefill | Block sparse attention with cumulative threshold |
## 实现进度
- [ ] 分析现有 sparse policy 代码结构
- [ ] 设计 GPU-only sparse policy 接口
- [ ] 实现 GPU-only Quest decode
- [ ] 实现 GPU-only XAttention prefill
- [ ] 性能测试和对比
## 优化后性能
*待测试*
| 模式 | Throughput | Speedup vs Full |
|------|------------|-----------------|
| GPU-only + Quest (decode) | TBD | TBD |
| GPU-only + XAttn (prefill) | TBD | TBD |

View File

@@ -0,0 +1,296 @@
# GPU-Only XAttention 指南
本文档介绍 GPU-only 模式下 XAttention BSA 的实现、内存优化和性能分析。
## 概述
GPU-only 模式下,所有 KV cache 存储在 GPU 上,无需 CPU offload。XAttention 通过稀疏注意力加速 prefill 阶段。
### 执行路径对比
| 模式 | Prefill 方法 | Decode 方法 | KV 存储 |
|------|-------------|-------------|---------|
| GPU-only Full | `compute_prefill()` | `compute_decode()` | GPU |
| GPU-only XAttn | `compute_prefill()` | `compute_decode()` | GPU |
| CPU Offload | `compute_chunked_prefill()` | `compute_chunked_decode()` | CPU + GPU |
## 架构设计
### SparsePolicy 接口
```python
class SparsePolicy:
# GPU-only 方法
def compute_prefill(self, q, k, v, ...) -> Tensor
def compute_decode(self, q, k_cache, v_cache, ...) -> Tensor
# CPU Offload 方法
def compute_chunked_prefill(self, q, k, v, ...) -> Tensor
def compute_chunked_decode(self, q, ...) -> Tensor
# 初始化方法
def initialize(self, num_layers, ...) -> None # CPU offload metadata
def alloc_policy_metadata(self, num_heads, ...) -> None # GPU-only buffers
```
### XAttentionBSAPolicy 实现
```
GPU-only Prefill 流程:
┌─────────────────────────────────────────────────────────────┐
│ 1. GQA 扩展 (使用预分配 buffer) │
│ K: [seq, kv_heads, dim] → K_exp: [1, heads, seq, dim] │
│ │
│ 2. XAttention 估计 │
│ flat_group_gemm_fuse_reshape_kernel (Q@K^T) │
│ softmax_fuse_block_sum_kernel (block 重要性) │
│ → sparse mask │
│ │
│ 3. BSA 稀疏注意力 │
│ flash_fwd_block_kernel (只计算选中的 blocks) │
│ → output │
└─────────────────────────────────────────────────────────────┘
```
## 内存预分配
### 问题背景
XAttention 的 `compute_prefill()` 需要 GQA 扩展:
```python
# 之前: 动态分配 (~2GB for 64K)
K_exp = K.repeat_interleave(num_groups, dim=1) # 分配 1
k_bsa = k.repeat_interleave(num_groups, dim=1) # 分配 2 (重复!)
```
每次 prefill 都动态分配,导致:
- 内存碎片
- 分配延迟
- 可能 OOM
### 解决方案: alloc_policy_metadata()
在框架初始化时预分配 buffer
```python
class XAttentionBSAPolicy(SparsePolicy):
def alloc_policy_metadata(self, num_heads, num_kv_heads, head_dim,
max_seq_len, dtype, device):
# 预分配 GQA 扩展 buffer
shape = (1, num_heads, max_seq_len, head_dim)
self._k_expanded = torch.empty(shape, dtype=dtype, device=device)
self._v_expanded = torch.empty(shape, dtype=dtype, device=device)
def compute_prefill(self, q, k, v, ...):
seq_len = k.shape[0]
# 使用预分配 buffer 的 slice
K_exp = self._k_expanded[:, :, :seq_len, :]
# 原地 GQA 扩展
K_exp.view(...).copy_(K.unsqueeze(2).expand(...))
# 复用同一 buffer 给 BSA
k_bsa = K_exp.squeeze(0).transpose(0, 1)
```
### 内存使用
| 序列长度 | 预分配大小 | 说明 |
|---------|-----------|------|
| 32K | 512 MB | `2 * 32 * 32768 * 128 * 2 bytes` |
| 64K | 1024 MB | `2 * 32 * 65536 * 128 * 2 bytes` |
优化效果:
- 之前: ~2GB 动态分配 (xattn_estimate + BSA 各一次)
- 之后: ~1GB 预分配 (复用同一 buffer)
### 框架集成
```python
# model_runner.py - allocate_kv_cache()
def allocate_kv_cache(self):
# ... KV cache 分配 ...
# GPU-only 模式: 预分配 policy buffers
if not config.enable_cpu_offload:
self.kvcache_manager.sparse_policy.alloc_policy_metadata(
num_heads=num_heads,
num_kv_heads=num_kv_heads,
head_dim=head_dim,
max_seq_len=config.max_model_len,
dtype=dtype,
device=torch.device("cuda"),
)
```
## 性能分析
### 32K Prefill 性能
| Policy | Throughput | 相对提升 |
|--------|------------|----------|
| Baseline | 4880 tok/s | - |
| Full | 4892 tok/s | +0.2% |
| **XAttention** | **5602 tok/s** | **+15%** |
### 64K Prefill 性能
| Policy | Throughput | 相对提升 |
|--------|------------|----------|
| Baseline | 3386 tok/s | - |
| Full | 3355 tok/s | -0.9% |
| **XAttention** | **4775 tok/s** | **+41%** |
### Kernel 时间分解 (32K)
**XAttention:**
```
FFN GEMM: 3219 ms (54%)
BSA Attention: 1231 ms (21%)
XAttn Estimation: 415 ms (7%)
Other: 1020 ms (18%)
─────────────────────────────
Total: 5885 ms
```
**Full:**
```
FFN GEMM: 3244 ms (48%)
Dense Attention: 2861 ms (43%)
Other: 595 ms (9%)
─────────────────────────────
Total: 6700 ms
```
### 加速来源
```
Dense Attention: 2861 ms
BSA Attention: 1231 ms (节省 1630 ms, -57%)
XAttn Estimation: 415 ms (额外开销)
─────────────────────────────
净节省: 1215 ms (42% attention 时间)
```
## CUDA Graph 限制
### 为什么 Prefill 不能用 CUDA Graph
CUDA Graph 要求所有操作在 capture 时确定:
| 必须固定 | Prefill 的情况 |
|---------|---------------|
| Tensor 形状 | seq_len 可变 (1 ~ max_model_len) |
| Kernel grid | 依赖 seq_len |
| 内存地址 | 中间 tensor 大小变化 |
```python
# 不同请求的 seq_len 不同
request_1: prefill(seq_len=1024) # grid=(8, 32, 1)
request_2: prefill(seq_len=32768) # grid=(256, 32, 1)
```
### Decode 可以用 CUDA Graph
```python
# Decode 每次只处理 1 token
q: [batch_size, 1, heads, dim] # 形状固定
```
nanovllm 为每个 batch_size 预先 capture 一个 graph
```python
def capture_cudagraph(self):
for batch_size in [1, 2, 4, 8, ...]:
with torch.cuda.graph(g):
self.run_model(dummy_input, is_prefill=False)
self.graphs[batch_size] = g
```
### Nsys Profile 结果
```
XAttention 32K Prefill:
Total kernels: 41,904
Non-graph: 41,904 (100%)
Graph: 0
Full 32K Prefill:
Total kernels: 35,308
Non-graph: 35,308 (100%)
Graph: 0
```
**两者都是 100% NON-GRAPH**,这是 prefill 的本质特性。
## Profiling 工具
### 使用 profile.sh
```bash
# XAttention 32K
bash scripts/profile.sh --max-len 32768 --policy xattn
# Full 32K
bash scripts/profile.sh --max-len 32768 --policy full
# 64K (需要降低 gpu-util)
bash scripts/profile.sh --max-len 65536 --policy xattn --gpu-util 0.7
```
### 分析 nsys 结果
```bash
# 查看 kernel 统计
nsys stats --report cuda_gpu_kern_sum results/nsys/<file>.nsys-rep
# 用 sqlite 查询详细数据
sqlite3 results/nsys/<file>.sqlite "
SELECT
(SELECT value FROM StringIds WHERE id = shortName) as kernel,
COUNT(*) as count,
SUM(end-start)/1e6 as total_ms
FROM CUPTI_ACTIVITY_KIND_KERNEL
GROUP BY shortName
ORDER BY total_ms DESC
LIMIT 10
"
```
## 使用指南
### 启用 XAttention GPU-only
```python
from nanovllm import LLM
from nanovllm.config import SparsePolicyType
llm = LLM(
model_path,
max_model_len=32768,
sparse_policy=SparsePolicyType.XATTN_BSA,
gpu_memory_utilization=0.9, # 64K 时可能需要降低
)
```
### 命令行测试
```bash
# bench.py
python bench.py --max-len 32768 --policy xattn
# 64K 需要降低 gpu-util
python bench.py --max-len 65536 --policy xattn --gpu-util 0.7
```
### 最佳实践
1. **32K 及以下**: 使用默认 `gpu_memory_utilization=0.9`
2. **64K**: 降低到 `gpu_memory_utilization=0.7`
3. **Decode**: XAttention 自动 fallback 到 FullAttentionPolicy
4. **Paged KV Cache**: 当 `block_tables` 存在时自动 fallback 到 flash_attn
## 相关文档
- [Sparse Policy 架构](sparse_policy_architecture.md)
- [XAttention 算法详解](xattention_algorithm_guide.md)
- [BSA 接口文档](block_sparse_attn_interface.md)

View File

@@ -0,0 +1,246 @@
# Density Alignment Test Results
验证 GPU-only 和 Offload 模式下三阶段 KV chunking 流程的正确性。
## 测试配置
### GPU-only 模式
- **模型**: Qwen3-0.6B (28 layers, 16 heads, 8 KV heads, head_dim=128)
- **Threshold**: 0.9
- **Block Size**: 128 tokens (BSA block)
- **Stride**: 8
- **Chunk Size**: 16384 tokens
### Offload 模式
- **模型**: Llama-3.1-8B-Instruct (32 layers, 32 heads, 8 KV heads, head_dim=128)
- **Threshold**: 0.9
- **Block Size**: 128 tokens (BSA block)
- **Stride**: 4
- **Chunk Size**: 4096 tokens
## 三阶段 KV Chunking 对齐测试 (2026-02-02)
### 测试目的
验证 `xattn_estimate` 高层 API 与手动实现的三阶段 KV chunking 流程是否完全一致。
### 三阶段流程
```
┌─────────────────────────────────────────────────────────────┐
│ Stage 1: softmax_compute_partial_stats │
│ └── 每个 KV chunk 独立计算 partial stats (m_i, l_i) │
│ │
│ Stage 2: merge_softmax_stats │
│ └── Host 端合并所有 chunks: (m_global, l_global) │
│ │
│ Stage 3: softmax_normalize_and_block_sum │
│ └── 使用全局 stats 归一化并计算 block sums │
└─────────────────────────────────────────────────────────────┘
```
### 测试结果
#### CHUNK_SIZE = 16384 (默认)
| Context | Tokens | Q Chunks | KV Chunks | Density | Mask 差异 | attn_sums 差异 | 结果 |
|---------|--------|----------|-----------|---------|-----------|----------------|------|
| 4K | 3,692 | 1 | 1 | 63.84% | 0 | 0.0 | ✅ |
| 8K | 7,892 | 1 | 1 | 64.98% | 0 | 0.0 | ✅ |
| 16K | 15,689 | 1 | 1 | 61.63% | 0 | 0.0 | ✅ |
| 32K | 32,485 | 2 | 2 | 50.21% | 0 | 0.0 | ✅ |
| **64K** | **64,891** | **4** | **4** | **37.00%** | **0** | **0.0** | ✅ |
#### CHUNK_SIZE = 4096 (更多 chunks)
| Context | Tokens | Q Chunks | KV Chunks | Density | xattn_estimate vs KV chunking | 结果 |
|---------|--------|----------|-----------|---------|-------------------------------|------|
| 4K | 3,692 | 1 | 1 | 63.84% | 0.000000 | ✅ |
| 8K | 7,892 | 2 | 2 | 63.02% | 0.000000 | ✅ |
| 16K | 15,689 | 4 | 4 | 60.08% | 0.000000 | ✅ |
| 32K | 32,485 | 8 | 8 | 49.84% | 0.000000 | ✅ |
| **64K** | **64,891** | **16** | **16** | **36.91%** | **0.000000** | ✅ |
### 64K 详细验证 (CHUNK_SIZE=4096)
64K 序列使用 chunk_size=4096 时产生 16×16 的 chunk 矩阵:
```
seq_len: 64891, q_chunk_num: 16, kv_chunk_num: 16
Q chunk 0: merged 16 KV chunks → attn_sum shape=[1, 32, 32, 512]
Q chunk 1: merged 16 KV chunks → attn_sum shape=[1, 32, 32, 512]
...
Q chunk 15: merged 16 KV chunks → attn_sum shape=[1, 32, 32, 512]
```
每个 Q chunk 需要合并 16 个 KV chunks 的 softmax stats充分验证了 `merge_softmax_stats` 在大规模 chunk 合并场景下的正确性。
### 验证指标
| 指标 | 预期 | 所有长度实际结果 |
|------|------|------------------|
| attn_sums max diff | 0 | 0.000000e+00 |
| attn_sums mean diff | 0 | 0.000000e+00 |
| mask exact match | True | True |
| density diff | 0% | 0.000000% |
### 结论
**三阶段 KV chunking 与一次性处理完全等价,无任何精度损失。**
- 当 seq_len < CHUNK_SIZE (16384):单 chunk 处理
- 当 seq_len >= CHUNK_SIZE多 chunk 分段处理后合并,结果与一次性处理完全一致
---
## Offload 模式测试 (2026-02-02)
使用 Offload 模式保存的真实 KV cache 数据进行测试。
### 测试结果
| 文件 | Tokens | Layer | Saved Density | Computed Density | Q/KV Chunks | 结果 |
|------|--------|-------|---------------|------------------|-------------|------|
| `qkv_3688.pt` | 3.7K | 3 | 38.34% | 38.34% | 1/1 | ✅ PASSED |
| `qkv_7888.pt` | 7.9K | 3 | 29.06% | 27.56% | 2/2 | ✅ PASSED |
| `qkv_15685.pt` | 15.7K | 3 | 19.77% | 18.60% | 4/4 | ✅ PASSED |
| `qkv_32485.pt` | 32.5K | 5 | 15.71% | 15.62% | 8/8 | ✅ PASSED |
| `qkv_64891.pt` | 64.9K | 3 | 11.09% | 11.09% | 16/16 | ✅ PASSED |
### Layer 5 GPU-only 测试 (threshold=0.9)
| 指标 | 结果 |
|------|------|
| Q/K shape | `[1, 16, 21001, 128]` (21K tokens) |
| Density | 6.24% |
| xattn_estimate vs KV chunking | 完全一致 (0.0000%) |
| mask 差异 | 0 / 435600 blocks |
| attn_sums 差异 | max=0.0, mean=0.0 |
### 观察
1. **Density 随 context 增长而降低**: 3.7K (38%) → 64.9K (11%)
2. **xattn_estimate API 与三阶段 KV chunking 完全一致**: 所有长度差异均为 0.0000%
3. **Saved density vs Computed density 略有差异**: 这是因为 saved density 可能在不同 chunk 下记录,累积计算方式略有不同
---
## 附录xattn_bsa vs xattn_estimate 对齐
| Context | Tokens | Layer 0 Density | Compute Density | Min Layer | 验证结果 |
|---------|--------|-----------------|-----------------|-----------|----------|
| 4k | 3,692 | 63.8% | 52.9% | Layer 3 (31.3%) | ✅ PASSED |
| 8k | 7,892 | 65.0% | 52.5% | Layer 5 (27.3%) | ✅ PASSED |
| 16k | 15,689 | 61.6% | 47.8% | Layer 5 (23.5%) | ✅ PASSED |
| 32k | 32,485 | 50.2% | 40.1% | Layer 5 (18.5%) | ✅ PASSED |
| 64k | 64,891 | 37.0% | 29.6% | Layer 5 (12.4%) | ✅ PASSED |
## Density 计算公式
### Total (分母)
```python
# Causal mask: Q block i 只能看到 K block 0 到 i
causal_mask[i, j] = (j <= i + q_offset_blocks)
# Total = causal 区域内的 block 数 × batch × heads
total = causal_mask.sum() × batch × heads
= (n × (n+1) / 2) × 1 × 32 # n = valid_q_blocks
```
### Selected (分子)
```python
# 在 causal 区域内,被选中 (mask=True) 的 block 数量
selected = (mask & causal_mask).sum()
```
### Density
```python
density = selected / total
```
## 观察
1. **Density 随 context 增长而降低**: 4k (63.8%) → 64k (37.0%),这是因为长序列中 attention 更加分散
2. **Layer 5 通常是最稀疏的层**: 在所有长度测试中Layer 5 的 density 最低
3. **Layer 0 density 最高**: 第一层的 attention pattern 最密集,可能与 sink token 效应有关
4. **Threshold=0.9 对应 ~50% density**: 在 32k context 下threshold=0.9 意味着选择覆盖 90% attention 的 blocks实际 density 约 50%
## 使用方法
### Step 1: 启用 debug 保存
```python
# nanovllm/kvcache/sparse/xattn_bsa.py
_DEBUG_SAVE_MASK = True # 改为 True
```
### Step 2: 运行 GPU-only 推理
```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_32k \
--datasets niah_single_1 \
--num-samples 1 \
--max-model-len 40960 \
--sparse-policy XATTN_BSA \
--sparse-threshold 0.9
```
### Step 3: 运行 KV chunking 对齐验证
```bash
# 使用 GPU-only 保存的数据
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH \
python tests/test_xattn_estimate_alignment.py --gpuonly
# 使用 Offload 模式保存的数据 (默认)
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH \
python tests/test_xattn_estimate_alignment.py
# 指定自定义数据文件
python tests/test_xattn_estimate_alignment.py --data-file /path/to/data.pt
# 批量测试所有 Offload 数据
for f in results/kvcache/qkv_*.pt; do
echo "Testing: $(basename $f)"
python tests/test_xattn_estimate_alignment.py --data-file "$f"
done
```
### 批量测试所有长度
```bash
for ctx in 4k 8k 16k 32k 64k; do
case $ctx in
4k) max_len=5000 ;;
8k) max_len=9000 ;;
16k) max_len=17000 ;;
32k) max_len=34000 ;;
64k) max_len=65664 ;;
esac
echo "Testing $ctx..."
python tests/test_ruler.py \
--data-dir tests/data/ruler_$ctx \
--max-model-len $max_len \
--sparse-policy XATTN_BSA \
--num-samples 1 --quiet
python tests/test_xattn_estimate_alignment.py --gpuonly
done
```
## 相关文件
- `nanovllm/kvcache/sparse/xattn_bsa.py`: XAttention BSA Policy 实现
- `nanovllm/ops/xattn.py`: xattn_estimate 函数及三阶段 KV chunking kernels
- `tests/test_xattn_estimate_alignment.py`: KV chunking 对齐验证脚本

View File

@@ -0,0 +1,209 @@
# Issue: XAttention Offload Mode GQA Buffer OOM
## 问题描述
在使用 XAttention BSA (Block Sparse Attention) + CPU Offload 模式运行 GLM-4-9B 等大模型时,出现 CUDA OOM 错误。
### 错误信息
```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.00 GiB.
GPU 0 has a total capacity of 23.57 GiB of which 4.19 GiB is free.
```
### 复现环境
| 项目 | 值 |
|------|-----|
| 模型 | GLM-4-9B-Chat-1M |
| GPU | RTX 3090 (24GB) |
| Context Length | 32K |
| sparse_policy | XATTN_BSA |
| enable_cpu_offload | true |
| max_model_len | 1048576 (1M) |
### 错误位置
```
File "nanovllm/kvcache/sparse/xattn_bsa.py", line 246, in alloc_policy_metadata
self._k_expanded = torch.empty(shape, dtype=dtype, device=device)
```
---
## 问题分析
### 内存分配分析
`alloc_policy_metadata()` 在 KV cache 初始化时分配以下 buffer
| Buffer | 用途 | 大小 (GLM-4, 1M seq) |
|--------|------|----------------------|
| `_prefill_mask_buffer` | BSA mask | ~32 MB |
| `_m_partial_buffer` | KV chunking m stats | ~32 MB |
| `_l_partial_buffer` | KV chunking l stats | ~32 MB |
| `_block_sums_buffer` | Block sums | ~64 MB |
| **`_k_expanded`** | GQA K 扩展 | **~8 GB** |
| **`_v_expanded`** | GQA V 扩展 | **~8 GB** |
### GQA Buffer 计算
```python
shape = (1, num_heads, max_seq_len, head_dim)
= (1, 32, 1048576, 128)
size = 1 × 32 × 1048576 × 128 × 2 bytes (fp16)
= 8,589,934,592 bytes
= 8 GB per buffer
```
### 根本原因
1. **设计意图冲突**`_k_expanded``_v_expanded` 的文档注释明确说是 "for GPU-only mode"
2. **条件检查不完整**:代码只检查了 `num_heads == num_kv_heads` 来跳过分配,没有检查 offload 模式
3. **Offload 模式不需要这些 buffer**`compute_chunked_prefill()` 使用不同的计算路径,不依赖预分配的 GQA buffer
### 相关代码
```python
# xattn_bsa.py:238-247
# Only allocate GQA expansion buffers if GQA (num_heads != num_kv_heads)
if num_heads == num_kv_heads:
logger.info(f"[XAttn] No GQA expansion needed (num_heads == num_kv_heads = {num_heads})")
return # <-- 只检查了 GQA没检查 offload 模式
# Shape: [1, num_heads, max_seq_len, head_dim] for xattn_estimate format
shape = (1, num_heads, max_seq_len, head_dim)
self._k_expanded = torch.empty(shape, dtype=dtype, device=device) # <-- OOM here
self._v_expanded = torch.empty(shape, dtype=dtype, device=device)
```
---
## 解决思路
### 方案 1: 在 Offload 模式下跳过 GQA Buffer 分配 (推荐)
`alloc_policy_metadata()` 中添加 offload 模式检查:
```python
def alloc_policy_metadata(
self,
num_heads: int,
num_kv_heads: int,
head_dim: int,
max_seq_len: int,
dtype: torch.dtype,
device: torch.device,
enable_cpu_offload: bool = False, # <-- 新增参数
) -> None:
# ... 分配 mask buffer 和 KV chunking buffers (offload 模式需要)
# Skip GQA buffers in offload mode
# Chunked prefill uses compute_chunked_prefill() which doesn't need these
if enable_cpu_offload:
logger.info("[XAttn] Offload mode: skipping GQA expansion buffers")
return
# GPU-only mode: pre-allocate GQA buffers for compute_prefill()
if num_heads == num_kv_heads:
logger.info(f"[XAttn] No GQA expansion needed")
return
shape = (1, num_heads, max_seq_len, head_dim)
self._k_expanded = torch.empty(shape, dtype=dtype, device=device)
self._v_expanded = torch.empty(shape, dtype=dtype, device=device)
```
**需要修改的文件**
1. `nanovllm/kvcache/sparse/xattn_bsa.py` - `alloc_policy_metadata()` 方法
2. `nanovllm/engine/model_runner.py` - 调用 `alloc_policy_metadata()` 时传入 `enable_cpu_offload`
### 方案 2: 延迟分配 (Lazy Allocation)
只在 `compute_prefill()` 首次调用时分配 GQA bufferoffload 模式走 `compute_chunked_prefill()` 不会触发分配。
```python
def compute_prefill(self, ...):
# Lazy allocation on first use
if self._k_expanded is None and num_heads != num_kv_heads:
self._allocate_gqa_buffers(...)
...
```
### 方案 3: 基于 chunk_size 限制 buffer 大小
不预分配 max_seq_len 大小,而是只分配 chunk_size 大小:
```python
# 原来: max_seq_len (1M tokens) -> 8 GB
# 修改后: chunk_size (16K tokens) -> ~130 MB
buffer_len = self.chunk_size if enable_cpu_offload else max_seq_len
shape = (1, num_heads, buffer_len, head_dim)
```
---
## 验证方法
修复后运行以下命令验证:
```bash
cd /home/zijie/Code/COMPASS
GPULIST=0 ./scripts/run_ruler.sh glm4-9b-xattn-nanovllm synthetic xattn --task niah_single_1
```
预期结果:
- 不再出现 8GB allocation 的 OOM 错误
- 模型正常加载并完成推理
---
## 相关文档
- `docs/xattn_bsa_policy_design.md` - XAttention BSA Policy 设计文档
- `docs/gpu_only_xattn_guide.md` - GPU-Only XAttention 指南
## 优先级
**High** - 阻塞 9B+ 模型在 24GB 显存 GPU 上使用 XAttention + Offload 模式
---
## 修复状态
**✅ 已修复** (2026-02-05)
### 修复内容
采用方案 1在 offload 模式下跳过 GQA buffer 分配:
1. `nanovllm/kvcache/sparse/policy.py`: 基类添加 `enable_cpu_offload` 参数
2. `nanovllm/kvcache/sparse/xattn_bsa.py`: 实现 offload 模式检查,跳过 GQA buffer
3. `nanovllm/engine/model_runner.py`: 传入 `enable_cpu_offload` 参数
### 验证结果
```bash
# 64K offload 测试
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_64k \
--datasets niah_single_1 \
--num-samples 1 \
--max-model-len 72000 \
--enable-offload \
--sparse-policy XATTN_BSA
```
- ✅ 日志显示: `[XAttn] Offload mode: skipping GQA expansion buffers`
- ✅ 测试通过: 100% 准确率
- ✅ 内存节省: ~16 GB (for 1M max_seq_len)
### 内存对比
| 配置 | 修复前 | 修复后 |
|------|--------|--------|
| max_model_len=72K | +1.1 GB | 0 GB |
| max_model_len=1M | +16 GB | 0 GB |

View File

@@ -0,0 +1,184 @@
# 1M+ 上下文长度模型列表
本文档收集了 Hugging Face 上支持 1M (1,048,576) 及以上上下文长度的开源模型。
> 更新时间: 2026-01-28
---
## 一、纯语言模型 (≤10B 参数)
### 1. 官方原版模型
| 厂商 | 模型 | 上下文 | 规模 | 下载量 | 链接 |
|------|------|--------|------|--------|------|
| **Qwen** | Qwen2.5-7B-Instruct-1M | 1M | 7B | 69.3K | [HF](https://hf.co/Qwen/Qwen2.5-7B-Instruct-1M) |
| **THUDM** | GLM-4-9B-Chat-1M | 1M | 9B | 5.0K | [HF](https://hf.co/zai-org/glm-4-9b-chat-1m) |
| **InternLM** | InternLM2.5-7B-Chat-1M | 1M | 7B | 322 | [HF](https://hf.co/internlm/internlm2_5-7b-chat-1m) |
| **NVIDIA** | Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct | 1M | 8B | 2.9K | [HF](https://hf.co/nvidia/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct) |
| **LWM** | LWM-Text-1M | 1M | 7B | 75 | [HF](https://hf.co/LargeWorldModel/LWM-Text-1M) |
| **LWM** | LWM-Text-Chat-1M | 1M | 7B | 3.0K | [HF](https://hf.co/LargeWorldModel/LWM-Text-Chat-1M) |
### 2. Gradient AI 扩展系列 (基于 Llama 3)
| 模型 | 上下文 | 规模 | 下载量 | 链接 |
|------|--------|------|--------|------|
| Llama-3-8B-Instruct-Gradient-1048k | **1M** | 8B | 44.8K | [HF](https://hf.co/gradientai/Llama-3-8B-Instruct-Gradient-1048k) |
| Llama-3-8B-Instruct-Gradient-4194k | **4M** | 8B | 9 | [HF](https://hf.co/gradientai/Llama-3-8B-Instruct-Gradient-4194k) |
### 3. 社区衍生版本 (Abliterated)
| 模型 | 上下文 | 基础模型 | 下载量 | 链接 |
|------|--------|----------|--------|------|
| Qwen2.5-7B-Instruct-1M-abliterated | 1M | Qwen2.5-7B | 375 | [HF](https://hf.co/huihui-ai/Qwen2.5-7B-Instruct-1M-abliterated) |
| Nemotron-8B-UltraLong-1M-Abliterated | 1M | Nemotron-8B | 46 | [HF](https://hf.co/SicariusSicariiStuff/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct_Abliterated) |
---
## 二、视觉-语言模型 (≤10B 参数)
### Qwen3 VL 系列
#### Instruct 版本
| 模型 | 上下文 | 规模 | 下载量 | 链接 |
|------|--------|------|--------|------|
| Qwen3-VL-2B-Instruct-1M-GGUF | 1M | 2B | 824 | [HF](https://hf.co/unsloth/Qwen3-VL-2B-Instruct-1M-GGUF) |
| Qwen3-VL-4B-Instruct-1M-GGUF | 1M | 4B | 936 | [HF](https://hf.co/unsloth/Qwen3-VL-4B-Instruct-1M-GGUF) |
| Qwen3-VL-8B-Instruct-1M-GGUF | 1M | 8B | 962 | [HF](https://hf.co/unsloth/Qwen3-VL-8B-Instruct-1M-GGUF) |
#### Thinking 推理版本
| 模型 | 上下文 | 规模 | 下载量 | 链接 |
|------|--------|------|--------|------|
| Qwen3-VL-2B-Thinking-1M-GGUF | 1M | 2B | 808 | [HF](https://hf.co/unsloth/Qwen3-VL-2B-Thinking-1M-GGUF) |
| Qwen3-VL-4B-Thinking-1M-GGUF | 1M | 4B | 666 | [HF](https://hf.co/unsloth/Qwen3-VL-4B-Thinking-1M-GGUF) |
| Qwen3-VL-8B-Thinking-1M-GGUF | 1M | 8B | 4.6K | [HF](https://hf.co/unsloth/Qwen3-VL-8B-Thinking-1M-GGUF) |
---
## 三、推荐模型 (≤10B)
| 用途 | 推荐模型 | 理由 |
|------|----------|------|
| **通用对话** | Qwen2.5-7B-Instruct-1M | 官方支持RULER 93.1分Apache 2.0 |
| **中英双语** | GLM-4-9B-Chat-1M | 清华出品,中文优化 |
| **最长上下文** | Llama-3-8B-Gradient-4194k | 支持 4M 上下文 |
| **多模态** | Qwen3-VL-8B-Thinking-1M | 视觉理解 + 推理能力 |
| **无审查** | Qwen2.5-7B-Instruct-1M-abliterated | 移除安全限制 |
---
## 四、VRAM 需求参考
| 模型规模 | 1M 上下文 VRAM | 备注 |
|----------|----------------|------|
| 7B (FP16) | ~120GB | 需多卡 |
| 7B (INT4) | ~40GB | 单卡 A100 可行 |
| 8B (FP16) | ~130GB | 需多卡 |
| 9B (FP16) | ~140GB | 需多卡 |
---
## 五、技术对比
| 模型系列 | 扩展技术 | RULER 得分 | 许可证 |
|---------|---------|------------|--------|
| Qwen2.5-1M | Dual Chunk Attention | 93.1 | Apache 2.0 |
| GLM-4-1M | - | 89.9 | 自定义 |
| Gradient-Llama | 渐进式扩展 | - | Llama 3 |
| Nemotron-1M | NVIDIA 训练 | - | CC-BY-NC-4.0 |
| LWM-1M | RingAttention | - | 开源 |
---
---
# 附录:大参数模型 (>10B)
> 以下模型参数量超过 10B需要更多计算资源。
## A. 纯语言模型 (>10B)
### 官方模型
| 厂商 | 模型 | 上下文 | 规模 | 下载量 | 链接 |
|------|------|--------|------|--------|------|
| **Qwen** | Qwen2.5-14B-Instruct-1M | 1M | 14B | 4.7K | [HF](https://hf.co/Qwen/Qwen2.5-14B-Instruct-1M) |
| **MiniMax** | MiniMax-Text-01 | 1M | 456B MoE | 721 | [HF](https://hf.co/MiniMaxAI/MiniMax-Text-01) |
| **Gradient** | Llama-3-70B-Instruct-Gradient-1048k | 1M | 70B | 9 | [HF](https://hf.co/gradientai/Llama-3-70B-Instruct-Gradient-1048k) |
### Qwen3 Coder 系列 (MoE)
| 模型 | 上下文 | 总参数/激活参数 | 下载量 | 链接 |
|------|--------|-----------------|--------|------|
| Qwen3-Coder-30B-A3B-Instruct-1M-GGUF | 1M | 30B / 3B | 13.1K | [HF](https://hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-1M-GGUF) |
| Qwen3-Coder-480B-A35B-Instruct-1M | 1M | 480B / 35B | 50 | [HF](https://hf.co/unsloth/Qwen3-Coder-480B-A35B-Instruct-1M) |
| Qwen3-Coder-480B-A35B-Instruct-1M-GGUF | 1M | 480B / 35B | 1.7K | [HF](https://hf.co/unsloth/Qwen3-Coder-480B-A35B-Instruct-1M-GGUF) |
| Qwen3-Coder-42B-A3B-TOTAL-RECALL-1M | 1M | 42B / 3B | - | [HF](https://hf.co/DavidAU/Qwen3-Coder-42B-A3B-Instruct-TOTAL-RECALL-MASTER-CODER-M-1million-ctx) |
### 社区衍生版本
| 模型 | 上下文 | 规模 | 下载量 | 链接 |
|------|--------|------|--------|------|
| Qwen2.5-14B-Instruct-1M-abliterated | 1M | 14B | 147 | [HF](https://hf.co/huihui-ai/Qwen2.5-14B-Instruct-1M-abliterated) |
---
## B. 视觉-语言模型 (>10B)
### Meta Llama 4 系列 (MoE 多模态)
| 模型 | 上下文 | 总参数/激活参数 | 下载量 | 链接 |
|------|--------|-----------------|--------|------|
| Llama-4-Scout-17B-16E-Instruct | **10M** | 109B / 17B | 180K | [HF](https://hf.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) |
| Llama-4-Maverick-17B-128E-Instruct | **1M** | 400B / 17B | 32.6K | [HF](https://hf.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct) |
| Llama-4-Scout-17B-16E | 10M | 109B / 17B | 8.4K | [HF](https://hf.co/meta-llama/Llama-4-Scout-17B-16E) |
| Llama-4-Maverick-17B-128E | 1M | 400B / 17B | 368 | [HF](https://hf.co/meta-llama/Llama-4-Maverick-17B-128E) |
| Llama-4-Maverick-17B-128E-Instruct-FP8 | 1M | 400B / 17B | 29.6K | [HF](https://hf.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8) |
### Qwen3 VL 大模型系列
#### Dense 模型
| 模型 | 上下文 | 规模 | 下载量 | 链接 |
|------|--------|------|--------|------|
| Qwen3-VL-32B-Instruct-1M-GGUF | 1M | 32B | 1.2K | [HF](https://hf.co/unsloth/Qwen3-VL-32B-Instruct-1M-GGUF) |
| Qwen3-VL-32B-Thinking-1M-GGUF | 1M | 32B | 452 | [HF](https://hf.co/unsloth/Qwen3-VL-32B-Thinking-1M-GGUF) |
#### MoE 模型
| 模型 | 上下文 | 总参数/激活参数 | 下载量 | 链接 |
|------|--------|-----------------|--------|------|
| Qwen3-VL-30B-A3B-Instruct-1M-GGUF | 1M | 30B / 3B | 821 | [HF](https://hf.co/unsloth/Qwen3-VL-30B-A3B-Instruct-1M-GGUF) |
| Qwen3-VL-30B-A3B-Thinking-1M-GGUF | 1M | 30B / 3B | 944 | [HF](https://hf.co/unsloth/Qwen3-VL-30B-A3B-Thinking-1M-GGUF) |
| Qwen3-VL-235B-A22B-Instruct-1M-GGUF | 1M | 235B / 22B | 581 | [HF](https://hf.co/unsloth/Qwen3-VL-235B-A22B-Instruct-1M-GGUF) |
| Qwen3-VL-235B-A22B-Thinking-1M-GGUF | 1M | 235B / 22B | 733 | [HF](https://hf.co/unsloth/Qwen3-VL-235B-A22B-Thinking-1M-GGUF) |
#### MXFP4 量化版本
| 模型 | 上下文 | 规模 | 下载量 | 链接 |
|------|--------|------|--------|------|
| Qwen3-VL-30B-A3B-Instruct-1M-MXFP4_MOE-GGUF | 1M | 30B MoE | 689 | [HF](https://hf.co/noctrex/Qwen3-VL-30B-A3B-Instruct-1M-MXFP4_MOE-GGUF) |
| Qwen3-VL-30B-A3B-Thinking-1M-MXFP4_MOE-GGUF | 1M | 30B MoE | 565 | [HF](https://hf.co/noctrex/Qwen3-VL-30B-A3B-Thinking-1M-MXFP4_MOE-GGUF) |
| Qwen3-VL-235B-A22B-Instruct-1M-MXFP4_MOE-GGUF | 1M | 235B MoE | 136 | [HF](https://hf.co/noctrex/Qwen3-VL-235B-A22B-Instruct-1M-MXFP4_MOE-GGUF) |
| Qwen3-VL-235B-A22B-Thinking-1M-MXFP4_MOE-GGUF | 1M | 235B MoE | 244 | [HF](https://hf.co/noctrex/Qwen3-VL-235B-A22B-Thinking-1M-MXFP4_MOE-GGUF) |
---
## 统计汇总
| 类别 | ≤10B 模型数 | >10B 模型数 | 最大上下文 |
|------|-------------|-------------|-----------|
| 纯语言模型 | 10 | 8 | 4M |
| 视觉-语言模型 | 6 | 14 | 10M |
| **合计** | **16** | **22** | **10M** |
---
## 参考资源
- [Qwen2.5-1M 官方博客](https://qwenlm.github.io/blog/qwen2.5-1m/)
- [LongRoPE 论文](https://huggingface.co/papers/2402.13753)
- [InfiniteHiP 论文](https://huggingface.co/papers/2502.08910)
- [Top LLMs for Long Context Windows](https://www.siliconflow.com/articles/en/top-LLMs-for-long-context-windows)

View File

@@ -0,0 +1,120 @@
# Memory Communication Benchmark
GPU-CPU 通信量测试结果,对比 Full Policy 和 XAttention BSA Policy。
## 测试环境
- **模型**: Llama-3.1-8B-Instruct
- **GPU**: RTX 3090 (24GB)
- **配置**: `num_gpu_blocks=4`, `block_size=1024`, `enable_cpu_offload=True`
- **XAttention 参数**: `threshold=0.95`, `stride=8`
## 32K 上下文测试结果
| 指标 | Full Policy | XAttention | 比率 |
|------|-------------|------------|------|
| **Prefill H2D** | 66.57 GB | 111.12 GB | **1.67x** |
| Prefill D2H | 4.29 GB | 4.29 GB | 1.00x |
| TTFT | 8473 ms | 10367 ms | 1.22x |
### XAttention Block Selection (32K)
| 指标 | 数值 |
|------|------|
| 可用 blocks | 465 |
| 选中 blocks | 374 |
| 选择密度 | 80.4% |
## 64K 上下文测试结果
| 指标 | Full Policy | XAttention | 比率 |
|------|-------------|------------|------|
| **Prefill H2D** | 262.13 GB | 386.62 GB | **1.48x** |
| Prefill D2H | 8.46 GB | 8.46 GB | 1.00x |
| Decode H2D (32 tokens) | 262.13 GB | 262.13 GB | 1.00x |
| TTFT | 27081 ms | 33634 ms | 1.24x |
## 通信量比率对比 (K-only 优化前)
| 上下文长度 | XAttn/Full Prefill H2D 比率 |
|------------|----------------------------|
| 32K | 1.67x |
| 64K | 1.48x |
### 分析 (优化前)
1. **XAttention 通信量增加原因**
- Estimate 阶段:加载 **100%** 历史 blocks 的 **K+V**(用于 attention score 估计)
- Compute 阶段:加载 **选中的** blocks约 70-80%
- 理论比率:`1 + selection_density`
2. **64K 比率更低的原因**
- 更长上下文时attention 分布更稀疏
- XAttention 的 block 选择更有效(选中比例更低)
- First/last block 强制包含的影响相对减小
3. **Decode 阶段通信量相同**
- XAttention 仅支持 prefill 阶段
- Decode 阶段 fallback 到 Full Policy
---
## K-only 优化 (2026-01-28)
### 优化原理
XAttention 的 `select_blocks` 估计阶段只需要 K 来计算 attention scores
```python
# flat_group_gemm_fuse_reshape 只使用 Q 和 K
attn_scores = flat_group_gemm_fuse_reshape(Q, K_chunk, stride, ...)
```
V 在估计阶段完全不使用,但之前代码会同时加载 K 和 V造成 50% 通信量浪费。
### 优化实现
1. **新增方法**: `OffloadEngine.load_k_only_to_slot_layer()` - 只加载 K
2. **修改 select_blocks**: 使用只加载 K 的新方法
### 优化后测试结果
| 上下文 | Full Policy | XAttn (优化前) | XAttn (优化后) | 优化节省 |
|--------|-------------|---------------|---------------|---------|
| 32K | 66.57 GB | 111.12 GB | **79.76 GB** | **28.2%** |
| 64K | 262.13 GB | 386.62 GB | **258.78 GB** | **33.1%** |
### XAttn/Full 比率变化
| 上下文 | 优化前比率 | 优化后比率 |
|--------|-----------|-----------|
| 32K | 1.67x | **1.20x** |
| 64K | 1.48x | **0.99x** |
### 结论
优化后64K 上下文的 XAttention 通信量与 Full Policy 基本持平 (0.99x)
而 32K 也从 1.67x 降到 1.20x。这说明估计阶段的 K-only 优化非常有效
## 测试命令
```bash
# 32K Full Policy
python bench_offload.py --max-len 32768 --input-len 32000
# 32K XAttention
python bench_offload.py --max-len 32768 --input-len 32000 --enable-xattn
# 64K Full Policy
python bench_offload.py --max-len 65536 --input-len 64000
# 64K XAttention
python bench_offload.py --max-len 65536 --input-len 64000 --enable-xattn
# 包含 decode 测试
python bench_offload.py --max-len 65536 --input-len 64000 --bench-decode --output-len 32
```
## 相关文档
- [`observer_architecture.md`](observer_architecture.md) - Observer 架构设计
- [`xattn_bsa_policy_design.md`](xattn_bsa_policy_design.md) - XAttention BSA 算法设计

View File

@@ -0,0 +1,323 @@
# 新模型整合指南
本文档总结了将新模型如GLM-4整合到nanovllm的经验和常见问题。
## 整合流程概览
```
1. 分析模型配置 (config.json)
2. 创建模型文件 (nanovllm/models/<model>.py)
3. 实现权重加载 (nanovllm/utils/loader.py)
4. 处理特殊组件 (RoPE, Attention, etc.)
5. 处理tokenizer差异 (EOS tokens, chat template)
6. 验证输出正确性
```
---
## 1. 配置字段映射
不同模型使用不同的配置字段名称,需要建立映射关系:
| 标准字段 | GLM-4 | Qwen | Llama | 说明 |
|----------|-------|------|-------|------|
| `num_key_value_heads` | `multi_query_group_num` | `num_key_value_heads` | `num_key_value_heads` | KV heads数量 |
| `head_dim` | `kv_channels` | 计算得出 | 计算得出 | 每个head的维度 |
| `intermediate_size` | `ffn_hidden_size` | `intermediate_size` | `intermediate_size` | FFN隐藏层大小 |
| `max_position_embeddings` | `seq_length` | `max_position_embeddings` | `max_position_embeddings` | 最大位置 |
| `rope_theta` | `10000 * rope_ratio` | `rope_theta` | `rope_theta` | RoPE基础频率 |
### 代码示例
```python
# 在模型 __init__ 中处理配置差异
num_kv_heads = getattr(config, 'num_key_value_heads',
getattr(config, 'multi_query_group_num', num_heads))
head_dim = getattr(config, 'head_dim',
getattr(config, 'kv_channels', hidden_size // num_heads))
intermediate_size = getattr(config, 'intermediate_size',
getattr(config, 'ffn_hidden_size', None))
max_position = getattr(config, 'max_position_embeddings',
getattr(config, 'seq_length', 4096))
```
---
## 2. RoPE实现差异
RoPE是模型整合中**最容易出错**的部分。不同模型可能使用不同的RoPE变体
### 2.1 旋转方式
| 类型 | 描述 | 使用模型 |
|------|------|----------|
| **Half rotation** | 前半和后半分别旋转 `[x0,x1,...] → [x0*cos-x_{d/2}*sin, ...]` | Llama, Qwen |
| **Interleaved rotation** | 相邻元素配对旋转 `[x0,x1,...] → [x0*cos-x1*sin, x1*cos+x0*sin, ...]` | GLM-4 |
### 2.2 旋转维度
| 类型 | 描述 | 使用模型 |
|------|------|----------|
| **Full rotation** | 旋转整个head_dim | Llama, Qwen |
| **Partial rotation** | 只旋转head_dim的一部分其余pass-through | GLM-4 (rotary_dim = head_dim // 2) |
### 2.3 GLM-4 RoPE实现
```python
class GLM4RotaryEmbedding(nn.Module):
def __init__(self, head_dim, rotary_dim, ...):
# GLM-4只旋转一半维度
self.rotary_dim = rotary_dim # = head_dim // 2
def forward(self, positions, query, key):
# 分离旋转部分和pass-through部分
q_rot = query[..., :self.rotary_dim]
q_pass = query[..., self.rotary_dim:]
# 只对旋转部分应用interleaved RoPE
q_rot = apply_rotary_emb_interleaved(q_rot, cos, sin)
# 拼接回去
return torch.cat([q_rot, q_pass], dim=-1), ...
```
### 2.4 调试RoPE问题
**症状**:模型输出乱码或重复无意义的内容(如 "The. The. The..."
**调试方法**
```python
# 对比HuggingFace参考实现的输出
hf_q, hf_k = hf_model.apply_rotary_pos_emb(query, key, cos, sin)
my_q, my_k = my_rotary_emb(positions, query, key)
print(f"Q max diff: {(hf_q - my_q).abs().max()}") # 应该 < 1e-5
print(f"K max diff: {(hf_k - my_k).abs().max()}") # 应该 < 1e-5
```
---
## 3. 权重名称映射
不同模型的权重命名规范不同:
### 3.1 常见映射
| 组件 | Llama/Qwen | GLM-4 |
|------|------------|-------|
| Attention QKV | `q_proj`, `k_proj`, `v_proj` | `query_key_value` (合并) |
| Attention Output | `o_proj` | `dense` |
| MLP Gate | `gate_proj` | `dense_h_to_4h` (部分) |
| MLP Up | `up_proj` | `dense_h_to_4h` (部分) |
| MLP Down | `down_proj` | `dense_4h_to_h` |
| LayerNorm | `input_layernorm` | `input_layernorm` |
| Post-Attention LN | `post_attention_layernorm` | `post_attention_layernorm` |
### 3.2 实现权重转换
```python
def convert_glm4_weights(name, param):
"""将GLM-4权重名称转换为nanovllm格式"""
# 处理合并的QKV权重
if "query_key_value" in name:
# 拆分为q, k, v
q, k, v = param.split([q_size, kv_size, kv_size], dim=0)
return {"q_proj": q, "k_proj": k, "v_proj": v}
# 处理合并的gate+up权重
if "dense_h_to_4h" in name:
gate, up = param.chunk(2, dim=0)
return {"gate_proj": gate, "up_proj": up}
return {name: param}
```
---
## 4. EOS Token处理
### 4.1 问题
某些模型使用**多个EOS tokens**
| 模型 | EOS Token(s) | 说明 |
|------|--------------|------|
| Llama | `128001` | 单一EOS |
| Qwen | `151643` | 单一EOS |
| GLM-4 | `[151329, 151336, 151338]` | 多个endoftext, user, observation |
**问题**`tokenizer.eos_token_id` 只返回第一个导致模型不会在其他EOS token处停止。
### 4.2 解决方案
```python
# config.py - 支持多个EOS
eos: int | list[int] = -1
# llm_engine.py - 从hf_config读取完整EOS列表
eos_from_config = getattr(config.hf_config, 'eos_token_id', None)
if eos_from_config is not None:
config.eos = eos_from_config
else:
config.eos = self.tokenizer.eos_token_id
# scheduler.py - 使用set进行高效查找
self.eos_set = set(eos) if isinstance(eos, list) else {eos}
# 检查时使用 in 而不是 ==
if token_id in self.eos_set:
# 停止生成
```
### 4.3 调试EOS问题
**症状**模型总是生成到max_tokens才停止
**调试方法**
```python
# 检查EOS配置
print(f"tokenizer.eos_token_id: {tokenizer.eos_token_id}")
print(f"hf_config.eos_token_id: {config.hf_config.eos_token_id}")
# 检查输出中的EOS tokens
output = llm.generate([prompt], params)
for eos_id in [151329, 151336, 151338]:
if eos_id in output[0]['token_ids']:
print(f"Found EOS {eos_id} at position {output[0]['token_ids'].index(eos_id)}")
```
---
## 5. Chat Template
不同模型使用不同的对话模板:
| 模型 | 模板格式 |
|------|----------|
| Llama-3 | `<\|begin_of_text\|><\|start_header_id\|>user<\|end_header_id\|>\n{content}<\|eot_id\|><\|start_header_id\|>assistant<\|end_header_id\|>\n` |
| Qwen | `<\|im_start\|>user\n{content}<\|im_end\|>\n<\|im_start\|>assistant\n` |
| GLM-4 | `[gMASK]<sop><\|user\|>\n{content}<\|assistant\|>\n` |
### 实现模板转换
```python
def convert_to_model_prompt(prompt: str, model_type: str) -> str:
"""将标准prompt转换为模型特定格式"""
if model_type == "glm4":
return f"[gMASK]<sop><|user|>\n{prompt}<|assistant|>\n"
elif model_type == "llama3":
return f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
# ...
```
---
## 6. 验证清单
整合新模型后,按以下顺序验证:
### 6.1 权重加载验证
```python
# 检查所有权重是否正确加载
for name, param in model.named_parameters():
if param.abs().sum() == 0:
print(f"WARNING: {name} is all zeros!")
```
### 6.2 单层输出验证
```python
# 对比embedding层输出
my_emb = my_model.embed_tokens(input_ids)
hf_emb = hf_model.model.embed_tokens(input_ids)
print(f"Embedding diff: {(my_emb - hf_emb).abs().max()}") # < 1e-5
# 对比第一层输出
my_out = my_model.layers[0](my_emb, ...)
hf_out = hf_model.model.layers[0](hf_emb, ...)
print(f"Layer 0 diff: {(my_out - hf_out).abs().max()}") # < 1e-4
```
### 6.3 生成质量验证
```python
# 简单问答测试
prompt = "Hello, how are you?"
output = llm.generate([prompt], SamplingParams(max_tokens=50))
print(output[0]['text']) # 应该是连贯的回答
# 检查是否正确停止
print(f"Generated {len(output[0]['token_ids'])} tokens (max=50)")
```
### 6.4 RULER基准测试
```bash
# 运行1个sample快速验证
python tests/test_ruler.py --model <path> --num-samples 1
# 验证通过后运行完整测试
python tests/test_ruler.py --model <path> --num-samples 100
```
---
## 7. 常见问题速查
| 症状 | 可能原因 | 解决方案 |
|------|----------|----------|
| 输出乱码/重复 | RoPE实现错误 | 检查旋转方式(interleaved vs half)和旋转维度(full vs partial) |
| 数值爆炸(NaN/Inf) | 权重加载错误或dtype不匹配 | 检查权重映射确保dtype一致 |
| 不停止生成 | EOS token处理错误 | 从hf_config读取完整EOS列表 |
| 输出质量差 | LayerNorm或bias缺失 | 检查add_qkv_bias等配置 |
| 位置编码错误 | max_position_embeddings读取错误 | 检查配置字段名称(seq_length等) |
---
## 8. 文件结构
新模型整合需要修改/创建的文件:
```
nanovllm/
├── models/
│ └── <model>.py # 新建:模型定义
├── layers/
│ └── rotary_embedding.py # 修改如需特殊RoPE
├── utils/
│ └── loader.py # 修改:权重加载
├── config.py # 可能修改:新配置字段
└── engine/
├── llm_engine.py # 可能修改EOS处理
└── scheduler.py # 可能修改EOS检查
tests/
└── test_ruler.py # 修改chat template
```
---
## 附录GLM-4整合案例
### 遇到的问题及解决
1. **配置字段差异** → 添加getattr fallback链
2. **Interleaved RoPE** → 实现`apply_rotary_emb_interleaved`
3. **Partial rotation (head_dim//2)** → 实现`GLM4RotaryEmbedding`
4. **多EOS tokens** → 修改config/llm_engine/scheduler支持list
5. **合并的QKV权重** → 在loader中拆分
### 关键代码位置
- RoPE实现: `nanovllm/layers/rotary_embedding.py:GLM4RotaryEmbedding`
- 模型定义: `nanovllm/models/glm4.py`
- 权重加载: `nanovllm/utils/loader.py:load_glm4_weights`
- EOS处理: `nanovllm/engine/scheduler.py:eos_set`

View File

@@ -0,0 +1,194 @@
# Observer Architecture
nanovllm 的 Observer 架构用于统计推理过程中的关键指标采用类变量class variable模式实现全局状态管理。
## 架构概览
```
Observer (基类)
├── InferenceObserver - 推理时间指标 (TTFT, TPOT)
└── MemoryObserver - 内存传输统计 (H2D, D2H, D2D)
```
## 设计原则
### 1. 类变量模式
所有 Observer 使用类变量(而非实例变量)存储状态:
```python
class Observer:
"""Observer 基类"""
_enabled: bool = True # 类变量,控制是否启用
class InferenceObserver(Observer):
ttft: int = 0 # 类变量,全局共享
tpot: int = 0
ttft_start: int = 0
tpot_start: int = 0
```
**优点**
- 无需实例化,任何地方都可以直接访问
- 避免跨模块传递 observer 实例
- 适合全局统计场景
### 2. 启用/禁用控制
每个 Observer 可独立启用/禁用:
```python
# 启用 MemoryObserver
MemoryObserver._enabled = True
# 禁用后record_* 方法不会记录
MemoryObserver._enabled = False
```
### 3. 阶段分离
MemoryObserver 支持 prefill/decode 阶段分离统计:
```python
@classmethod
def record_h2d(cls, num_bytes: int, is_prefill: bool = True) -> None:
if not cls._enabled:
return
cls.h2d_bytes += num_bytes
cls.h2d_count += 1
if is_prefill:
cls.prefill_h2d_bytes += num_bytes
else:
cls.decode_h2d_bytes += num_bytes
```
## Observer 实现
### InferenceObserver
**位置**: `nanovllm/utils/observer.py`
**统计指标**
| 指标 | 说明 | 单位 |
|------|------|------|
| `ttft` | Time To First Token | 纳秒 |
| `tpot` | Time Per Output Token | 纳秒 |
| `ttft_start` | TTFT 计时开始点 | 纳秒 |
| `tpot_start` | TPOT 计时开始点 | 纳秒 |
**统计位置**
| 位置 | 代码 | 说明 |
|------|------|------|
| `scheduler.py:add()` | `InferenceObserver.ttft_start = perf_counter_ns()` | 开始计时 |
| `llm_engine.py:step()` | `InferenceObserver.ttft = ... - ttft_start` | Prefill 完成后计算 TTFT |
| `llm_engine.py:step()` | `InferenceObserver.tpot = ... - tpot_start` | Decode 时计算 TPOT |
### MemoryObserver
**位置**: `nanovllm/utils/memory_observer.py`
**统计指标**
| 指标 | 说明 |
|------|------|
| `h2d_bytes` / `h2d_count` | Host to Device 传输量/次数 |
| `d2h_bytes` / `d2h_count` | Device to Host 传输量/次数 |
| `d2d_bytes` / `d2d_count` | Device to Device 复制量/次数 |
| `prefill_h2d_bytes` / `prefill_d2h_bytes` | Prefill 阶段 H2D/D2H |
| `decode_h2d_bytes` / `decode_d2h_bytes` | Decode 阶段 H2D/D2H |
**统计位置** (均在 `offload_engine.py`)
| 方法 | 传输类型 | 说明 |
|------|----------|------|
| `load_to_slot_layer()` | H2D | 从 CPU 加载 block 到 GPU slot |
| `load_block_sample_from_cpu()` | H2D | 采样加载Quest |
| `load_block_full_from_cpu()` | H2D | 完整加载 block |
| `offload_slot_layer_to_cpu()` | D2H | GPU slot 卸载到 CPU |
| `offload_prefill_buffer_async()` | D2H | Prefill buffer 异步卸载 |
| `write_to_prefill_buffer()` | D2D | 写入 prefill buffer |
| `write_to_decode_buffer()` | D2D | 写入 decode buffer |
**重置位置**
| 位置 | 代码 |
|------|------|
| `llm_engine.py:generate()` | `MemoryObserver.complete_reset()` |
| `llm_engine.py:generate()` | `InferenceObserver.complete_reset()` |
## 使用示例
### 1. 启用并统计
```python
from nanovllm.utils.memory_observer import MemoryObserver
# 启用统计
MemoryObserver._enabled = True
# 运行推理
outputs = llm.generate(prompts, sampling_params)
# 获取结果
print(f"Prefill H2D: {MemoryObserver.prefill_h2d_bytes / 1e9:.2f} GB")
print(f"Decode H2D: {MemoryObserver.decode_h2d_bytes / 1e9:.2f} GB")
# 或使用 print_summary
MemoryObserver.print_summary()
```
### 2. 在 bench_offload.py 中
```python
from nanovllm.utils.memory_observer import MemoryObserver
# 启用
MemoryObserver._enabled = True
# benchmark 结束后打印
def print_memory_stats():
fmt = MemoryObserver._fmt_bytes
print(f"[Memory] Prefill H2D: {fmt(MemoryObserver.prefill_h2d_bytes)}")
print(f" Decode H2D: {fmt(MemoryObserver.decode_h2d_bytes)}")
```
### 3. 获取结构化数据
```python
summary = MemoryObserver.get_summary()
# {
# "total": {"h2d_bytes": ..., "d2h_bytes": ..., "d2d_bytes": ...},
# "prefill": {"h2d_bytes": ..., "d2h_bytes": ...},
# "decode": {"h2d_bytes": ..., "d2h_bytes": ...}
# }
```
## 添加新 Observer
1. 继承 `Observer` 基类
2. 定义类变量存储统计数据
3. 实现 `record_*` 方法(需检查 `_enabled`
4. 实现 `complete_reset()` 方法
5. 在相关代码位置添加 `record_*` 调用
6.`llm_engine.py:generate()` 中添加 reset 调用
```python
from nanovllm.utils.observer import Observer
class MyObserver(Observer):
_enabled: bool = False
my_metric: int = 0
@classmethod
def record_event(cls, value: int) -> None:
if not cls._enabled:
return
cls.my_metric += value
@classmethod
def complete_reset(cls) -> None:
cls.my_metric = 0
```
## 相关文档
- [`memory_communication_benchmark.md`](memory_communication_benchmark.md) - 通信量测试结果
- [`architecture_guide.md`](architecture_guide.md) - 整体架构指南

View File

@@ -0,0 +1,338 @@
# test_ruler.py 使用指南
RULER benchmark 综合测试工具,用于评估 LLM 长上下文能力。
**测试日期**: 2026-02-05
**测试 GPU**: RTX 3090 (GPU 4)
---
## 支持的任务
| 类别 | 任务 |
|------|------|
| NIAH (Needle-In-A-Haystack) | `niah_single_1/2/3`, `niah_multikey_1/2/3`, `niah_multiquery`, `niah_multivalue` |
| QA (Question Answering) | `qa_1`, `qa_2` |
| Recall | `cwe`, `fwe`, `vt` |
---
## 基本命令格式
```bash
CUDA_VISIBLE_DEVICES=<GPU_ID> PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py [OPTIONS]
```
---
## 参数说明
### 必要参数
| 参数 | 默认值 | 说明 |
|------|--------|------|
| `--model` | `~/models/Llama-3.1-8B-Instruct` | 模型路径 |
| `--data-dir` | `tests/data/ruler_64k` | 数据目录 |
| `--max-model-len` | 65664 | 最大上下文长度 |
### 数据选择
| 参数 | 默认值 | 说明 |
|------|--------|------|
| `--datasets` | 全部 | 逗号分隔的数据集名 |
| `--num-samples` | 0 (全部) | 每个数据集测试样本数 |
| `--sample-indices` | - | 指定样本索引 (如 `0,5,10`) |
### Offload 配置
| 参数 | 默认值 | 说明 |
|------|--------|------|
| `--enable-offload` | False | 启用 CPU offload 模式 |
| `--num-gpu-blocks` | 4 | GPU 上的 KV cache blocks 数量 |
| `--block-size` | 4096 | KV cache block 大小 (tokens) |
| `--num-kv-buffers` | 4 | Ring buffer 数量 |
| `--gpu-utilization` | 0.9 | GPU 显存利用率 |
### Sparse Attention 配置
| 参数 | 默认值 | 说明 |
|------|--------|------|
| `--sparse-policy` | - | 稀疏策略: `FULL`, `QUEST`, `XATTN_BSA` |
| `--sparse-threshold` | 0.9 | XAttn cumulative attention 阈值 |
| `--sparse-samples` | 128 | XAttn 每 chunk 采样数 |
| `--sparse-stride` | 8 | XAttn Q/K 下采样步长 |
### 输出控制
| 参数 | 说明 |
|------|------|
| `--quiet` / `-q` | 安静模式 |
| `--json-output` | JSON 格式输出 |
| `--fresh-llm` | 每个样本重新初始化 LLM |
### 其他
| 参数 | 默认值 | 说明 |
|------|--------|------|
| `--dtype` | auto | 模型数据类型 (`bfloat16`, `float16`) |
| `--use-cuda-graph` | False | 启用 CUDA Graph |
| `--max-new-tokens` | 16 | 最大生成 token 数 |
---
## 已验证的命令示例
以下命令均在 RTX 3090 (24GB) 上测试通过。
### 1. 基础 Offload 测试 (32K)
```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_32k \
--datasets niah_single_1 \
--num-samples 1 \
--max-model-len 40960 \
--enable-offload
```
**结果**: 100% 准确率, 耗时 ~16s
### 2. Offload + XAttention BSA (32K)
```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_32k \
--datasets niah_single_1 \
--num-samples 1 \
--max-model-len 40960 \
--enable-offload \
--sparse-policy XATTN_BSA \
--sparse-threshold 0.9
```
**结果**: 100% 准确率, compute density ~50%, 耗时 ~19s
### 3. Offload + XAttention BSA (64K)
```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_64k \
--datasets niah_single_1 \
--num-samples 1 \
--max-model-len 72000 \
--enable-offload \
--sparse-policy XATTN_BSA \
--sparse-threshold 0.9
```
**结果**: 100% 准确率, compute density ~37%, 耗时 ~52s
### 4. 多数据集多样本测试
```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_32k \
--datasets niah_single_1,qa_1 \
--num-samples 2 \
--max-model-len 40960 \
--enable-offload \
--sparse-policy XATTN_BSA
```
**结果**: 4/4 (100%), 耗时 ~71s
### 5. 指定样本索引测试
```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_32k \
--datasets niah_single_1 \
--sample-indices 0,5,10 \
--max-model-len 40960 \
--enable-offload
```
### 6. JSON 输出模式 (用于脚本)
```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_32k \
--datasets niah_single_1 \
--num-samples 1 \
--max-model-len 40960 \
--enable-offload \
--json-output
```
**输出格式**:
```json
{
"total_correct": 1,
"total_samples": 1,
"overall_accuracy": 1.0,
"avg_score": 1.0,
"time": 30.44,
"tasks": {"niah_single_1": {"correct": 1, "total": 1, "accuracy": 1.0}},
"failed_samples": {}
}
```
### 7. 安静模式
```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_32k \
--datasets niah_single_1 \
--num-samples 1 \
--max-model-len 40960 \
--enable-offload \
--quiet
```
### 8. 调整 GPU blocks 数量
```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_32k \
--datasets niah_single_1 \
--num-samples 1 \
--max-model-len 40960 \
--enable-offload \
--num-gpu-blocks 8 \
--sparse-policy XATTN_BSA
```
### 9. GLM-4 模型测试
```bash
CUDA_VISIBLE_DEVICES=4 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/GLM-4-9B-Chat-1M \
--data-dir tests/data/ruler_32k \
--datasets niah_single_1 \
--num-samples 1 \
--max-model-len 40960 \
--enable-offload \
--dtype bfloat16
```
**结果**: 100% 准确率, 耗时 ~17s
---
## 数据目录结构
```
tests/data/
├── ruler_4k/ # 4K context
├── ruler_8k/ # 8K context
├── ruler_16k/ # 16K context
├── ruler_32k/ # 32K context (推荐测试)
├── ruler_64k/ # 64K context
├── ruler_128k/ # 128K context
├── ruler_256k/ # 256K context
├── ruler_512k/ # 512K context
├── ruler_768k/ # 768K context
└── ruler_1m/ # 1M context
```
每个目录包含 13 个任务子目录,每个任务有 `validation.jsonl` 文件。
---
## GPU 与模式选择
| GPU 显存 | 推荐模式 | 说明 |
|---------|---------|------|
| 24GB (3090/4090) | `--enable-offload` | 必须使用 offload |
| 40GB+ (A100) | 两种模式均可 | 可测试 GPU-only |
**RTX 3090 限制**: 由于显存限制,必须使用 `--enable-offload` 参数。
---
## max-model-len 设置指南
| 数据目录 | 推荐 max-model-len | 说明 |
|---------|-------------------|------|
| ruler_4k | 5000 | 留出 output 空间 |
| ruler_8k | 9000 | |
| ruler_16k | 17000 | |
| ruler_32k | 40960 | |
| ruler_64k | 72000 | |
| ruler_128k | 135000 | |
**公式**: `max_model_len >= max_input_len + max_new_tokens`
---
## DensityObserver 输出
使用 `--sparse-policy XATTN_BSA` 时自动启用,输出示例:
```
============================================================
Density Statistics (XAttention BSA)
============================================================
[DensityObserver] Mode: offload
Compute density: 0.3691 (min: 0.3691 @ layer 0)
Comm density: 1.0000 (CPU block granularity)
Savings ratio: 0.0% H2D transfer reduction
Num layers: 1
Layer 0 density: 0.369052
```
| 指标 | 说明 |
|------|------|
| Compute density | BSA block (128 tokens) 粒度的计算密度 |
| Comm density | CPU block (4096 tokens) 粒度的通信密度 |
| Savings ratio | H2D 传输减少比例 |
---
## 常见问题
### 1. OOM 错误
**原因**: 显存不足
**解决**:
- 使用 `--enable-offload`
- 降低 `--gpu-utilization`
- 减少 `--num-gpu-blocks`
### 2. 模型加载失败
**原因**: 模型配置不兼容
**解决**:
- 检查 `--dtype` 参数 (GLM 模型需要 `--dtype bfloat16`)
- 确认模型路径正确
### 3. 准确率异常
**原因**: 状态泄漏
**解决**: 使用 `--fresh-llm` 参数为每个样本重新初始化 LLM
---
## 相关文档
- [`docs/xattn_density_types.md`](xattn_density_types.md) - Compute vs Comm density 解释
- [`docs/xattn_density_alignment_verification.md`](xattn_density_alignment_verification.md) - GPU-only vs Offload 对齐验证
- [`docs/ruler_benchmark_results_32k.md`](ruler_benchmark_results_32k.md) - RULER 32K 基准测试结果

View File

@@ -0,0 +1,142 @@
# XAttention Density Alignment Verification
验证 GPU-only 和 Offload 模式的 density 对齐情况。
**测试日期**: 2026-02-05
**测试模型**: Llama-3.1-8B-Instruct
**测试任务**: RULER niah_single_1
---
## 测试配置
| 参数 | 值 |
|------|-----|
| sparse_policy | XATTN_BSA |
| threshold | 0.9 |
| chunk_size | 4096 (已对齐) |
| stride | 8 |
| BSA block_size | 128 |
---
## 测试结果
### 32K Context
| 模式 | Layer 0 Density | Overall Density | 准确率 |
|------|-----------------|-----------------|--------|
| GPU-only | 0.502079 | 0.4012 | 100% |
| Offload | 0.498421 | 0.4984 | 100% |
| **差异** | **0.37%** | - | - |
### 64K Context
| 模式 | Layer 0 Density | Overall Density | 准确率 |
|------|-----------------|-----------------|--------|
| GPU-only | 0.369972 | 0.2963 | 100% |
| Offload | 0.369052 | 0.3691 | 100% |
| **差异** | **0.09%** | - | - |
---
## 关键修复
### Commit 829b311 - chunk_size 对齐 + Stream 同步修复
**问题**: 之前 GPU-only 和 Offload 模式的 density 差异达 10-13%
**根因**:
1. GPU-only 使用 `chunk_size=16384`Offload 使用 `chunk_size=4096`
2. Stream 同步 bug 导致 Pass 1/2 K 数据不一致
**修复**:
1.`XAttentionBSAPolicy.chunk_size` 默认值从 16384 改为 4096
2. 所有 compute kernels 包装在 `compute_stream` context 中
---
## 测试命令
### GPU-only 模式
```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_32k \
--datasets niah_single_1 \
--num-samples 1 \
--max-model-len 40960 \
--sparse-policy XATTN_BSA \
--sparse-threshold 0.9
```
### Offload 模式
```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_32k \
--datasets niah_single_1 \
--num-samples 1 \
--max-model-len 40960 \
--enable-offload \
--sparse-policy XATTN_BSA \
--sparse-threshold 0.9
```
---
## 详细日志
### 32K Offload 模式 Per-Chunk Density
```
Layer0 chunk: q_len=4096, k_len=4096, density=0.6234
Layer0 chunk: q_len=4096, k_len=8192, density=0.6239
Layer0 chunk: q_len=4096, k_len=12288, density=0.6026
Layer0 chunk: q_len=4096, k_len=16384, density=0.5695
Layer0 chunk: q_len=4096, k_len=20480, density=0.5285
Layer0 chunk: q_len=4096, k_len=24576, density=0.4891
Layer0 chunk: q_len=4096, k_len=28672, density=0.4514
Layer0 chunk: q_len=3813, k_len=32485, density=0.4208
```
### 64K Offload 模式 Per-Chunk Density
```
Layer0 chunk: q_len=4096, k_len=4096, density=0.6234
Layer0 chunk: q_len=4096, k_len=8192, density=0.6239
Layer0 chunk: q_len=4096, k_len=12288, density=0.6026
Layer0 chunk: q_len=4096, k_len=16384, density=0.5681
Layer0 chunk: q_len=4096, k_len=20480, density=0.5255
Layer0 chunk: q_len=4096, k_len=24576, density=0.4859
Layer0 chunk: q_len=4096, k_len=28672, density=0.4485
Layer0 chunk: q_len=4096, k_len=32768, density=0.4161
Layer0 chunk: q_len=4096, k_len=36864, density=0.3892
Layer0 chunk: q_len=4096, k_len=40960, density=0.3658
Layer0 chunk: q_len=4096, k_len=45056, density=0.3464
Layer0 chunk: q_len=4096, k_len=49152, density=0.3303
Layer0 chunk: q_len=4096, k_len=53248, density=0.3170
Layer0 chunk: q_len=4096, k_len=57344, density=0.3068
Layer0 chunk: q_len=4096, k_len=61440, density=0.2988
Layer0 chunk: q_len=3451, k_len=64891, density=0.2947
```
---
## 结论
1. **Density 对齐成功**: 差异从 10-13% 降到 <0.5%
2. **准确率一致**: 两种模式都达到 100% 准确率
3. **Density 随 context 增长下降**: 符合预期,更长的 context 稀疏性更高
---
## 相关文档
- [`docs/xattn_offload_stream_sync_fix.md`](xattn_offload_stream_sync_fix.md) - Stream 同步修复详情
- [`docs/xattn_density_types.md`](xattn_density_types.md) - Compute vs Comm density
- [`docs/gpuonly_density_alignment_test.md`](gpuonly_density_alignment_test.md) - 早期对齐测试

View File

@@ -0,0 +1,195 @@
# XAttention Density Benchmark
GPU-only 模式下 XAttention Block Sparse Attention 的 density 测试结果。
## 测试配置
| 参数 | 值 | 说明 |
|------|-----|------|
| Model | Llama-3.1-8B-Instruct | 32 layers, 32 heads, 8 KV heads |
| Block Size | 128 tokens | BSA kernel 固定要求 |
| Threshold | 0.9 / 0.95 | 累积注意力阈值 |
| Stride | 4 / 8 / 16 | Q/K 下采样步长 |
| Dataset | RULER niah_single_1 | Sample 0 |
| Mode | GPU-only | 无 CPU offload |
## Density 定义
```python
# Density = selected_blocks / total_causal_blocks
# 在 causal attention 下,只计算下三角区域的 blocks
# Overall density = 所有层的平均值
def compute_density(mask, causal=True):
"""
mask: [batch, heads, q_blocks, k_blocks] boolean tensor
"""
if causal:
causal_mask = torch.tril(torch.ones(q_blocks, k_blocks))
total = causal_mask.sum() * batch * heads
selected = (mask & causal_mask).sum()
return selected / total
```
## 测试结果
### threshold=0.9
#### Overall Density (平均)
| Context | stride=4 | stride=8 | stride=16 |
|---------|----------|----------|-----------|
| **4K** | 0.5220 (52.2%) | 0.5292 (52.9%) | 0.5430 (54.3%) |
| **8K** | 0.5152 (51.5%) | 0.5252 (52.5%) | 0.5396 (54.0%) |
| **16K** | 0.4682 (46.8%) | 0.4775 (47.8%) | 0.4888 (48.9%) |
| **32K** | 0.3700 (37.0%) | 0.4012 (40.1%) | 0.4196 (42.0%) |
#### Min Density (per layer)
| Context | stride=4 | stride=8 | stride=16 |
|---------|----------|----------|-----------|
| **4K** | 0.2805 (Layer 3) | 0.3132 (Layer 3) | 0.3376 (Layer 5) |
| **8K** | 0.2886 (Layer 5) | 0.2725 (Layer 5) | 0.2995 (Layer 5) |
| **16K** | 0.2247 (Layer 5) | 0.2349 (Layer 5) | 0.2451 (Layer 5) |
| **32K** | 0.1799 (Layer 5) | 0.1846 (Layer 5) | 0.1964 (Layer 5) |
### threshold=0.95
#### Overall Density (平均)
| Context | stride=4 | stride=8 | stride=16 |
|---------|----------|----------|-----------|
| **4K** | 0.6561 (65.6%) | 0.6699 (67.0%) | 0.6815 (68.2%) |
| **8K** | 0.6462 (64.6%) | 0.6584 (65.8%) | 0.6732 (67.3%) |
| **16K** | 0.6004 (60.0%) | 0.6114 (61.1%) | 0.6193 (61.9%) |
| **32K** | 0.4894 (48.9%) | 0.5203 (52.0%) | 0.5385 (53.9%) |
#### Min Density (per layer)
| Context | stride=4 | stride=8 | stride=16 |
|---------|----------|----------|-----------|
| **4K** | 0.3972 (Layer 3) | 0.4348 (Layer 5) | 0.4517 (Layer 4) |
| **8K** | 0.4004 (Layer 5) | 0.3906 (Layer 5) | 0.4239 (Layer 5) |
| **16K** | 0.3331 (Layer 5) | 0.3453 (Layer 5) | 0.3589 (Layer 5) |
| **32K** | 0.2656 (Layer 5) | 0.2784 (Layer 5) | 0.2917 (Layer 5) |
### threshold 对比 (stride=8)
| Context | threshold=0.9 | threshold=0.95 | 差异 |
|---------|---------------|----------------|------|
| **4K** | 0.5292 (52.9%) | 0.6699 (67.0%) | -14.1% |
| **8K** | 0.5252 (52.5%) | 0.6584 (65.8%) | -13.3% |
| **16K** | 0.4775 (47.8%) | 0.6114 (61.1%) | -13.4% |
| **32K** | 0.4012 (40.1%) | 0.5203 (52.0%) | -11.9% |
## 关键发现
### 1. Context Length 影响最大
Density 随 context length 显著下降threshold=0.9, stride=8
- 4K: 52.9% density
- 8K: 52.5% density
- 16K: 47.8% density
- 32K: 40.1% density
**结论**: 长序列有更多稀疏化机会XAttention 的优势在长序列上更明显。
### 2. Threshold 影响显著
threshold=0.9 比 0.95 的 density 低约 12-14%
- 0.9 更激进,选择更少的 blocks
- 0.95 更保守,保留更多 blocks
- 两者准确性都不受影响RULER NIAH 全部 PASS
### 3. Stride 影响较小
同一 context 下,不同 stride 的 density 差异约 2-5%
- stride 越大 → density 略高(采样越粗糙,选择更保守)
- stride=4 最激进stride=16 最保守
### 4. Min Density 集中在中间层
- 大多数情况下 min density 出现在 Layer 5
- 中间层的稀疏性最高,首尾层相对密集
- 这符合 Transformer 注意力模式的一般规律
### 5. 最佳稀疏化配置
32K + stride=4 + threshold=0.9 达到最低 density
- Overall: **37.0%** (节省 63% 计算)
- Min: **18.0%** (Layer 5)
### 6. 准确性稳定
所有配置下 RULER NIAH 测试都 PASS (score=1.0),说明:
- threshold=0.9 和 0.95 都足够保守,不损失准确性
- 不同 stride 不影响最终结果
## 推荐配置
| 场景 | threshold | stride | 说明 |
|------|-----------|--------|------|
| 精度优先 | 0.95 | 8 | 保守配置density ~52-67% |
| 平衡 | 0.9 | 8 | 默认配置density ~40-53% |
| 性能优先 | 0.9 | 4 | 激进配置density ~37-52% |
## 测试命令
```bash
# 基本测试
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--data-dir tests/data/ruler_32k \
--datasets niah_single_1 \
--sample-indices 0 \
--max-model-len 33792 \
--sparse-policy XATTN_BSA \
--sparse-threshold 0.9 \
--sparse-stride 8 \
--gpu-utilization 0.85
# 参数说明
# --sparse-policy XATTN_BSA 启用 XAttention Block Sparse Attention
# --sparse-threshold 0.9 累积注意力阈值 (0.9-0.99)
# --sparse-stride 8 Q/K 下采样步长 (4/8/16)
```
## DensityObserver 使用
```python
from nanovllm.utils.density_observer import DensityObserver
# 启用并重置
DensityObserver.enable()
DensityObserver.complete_reset()
# ... 运行 inference (compute_prefill 自动记录) ...
# 获取结果
summary = DensityObserver.get_summary()
# {
# "mode": "gpu_only",
# "overall_density": 0.40, # 所有层的平均值
# "per_layer_density": {0: 0.55, 1: 0.45, ...},
# "num_layers": 32
# }
# 获取最低 density
min_layer, min_density = DensityObserver.get_min_density()
# 打印摘要
DensityObserver.print_summary()
# [DensityObserver] Mode: gpu_only
# Overall density: 0.4012
# Min density: 0.1846 (layer 5)
# Num layers: 32
```
## 相关文件
| 文件 | 说明 |
|------|------|
| `nanovllm/kvcache/sparse/xattn_bsa.py` | XAttention BSA Policy 实现 |
| `nanovllm/utils/density_observer.py` | Density 统计 Observer |
| `nanovllm/ops/xattn.py` | xattn_estimate 核心算法 |
| `tests/test_ruler.py` | RULER benchmark 测试脚本 |

152
docs/xattn_density_types.md Normal file
View File

@@ -0,0 +1,152 @@
# XAttention Density Types: Compute vs Communication
XAttention BSA 统计两种不同粒度的 density它们反映不同的优化效果。
## 两种 Density 的定义
### 1. Compute Density计算密度
**粒度**: BSA block (128 tokens)
**公式**:
```
compute_density = selected_bsa_blocks / total_causal_bsa_blocks
```
**含义**: 实际需要计算 attention 的 blocks 占 causal 区域的比例。
**影响**: 决定 attention 计算量的减少。
### 2. Communication Density通信密度
**粒度**: CPU block (4096 tokens = 32 BSA blocks)
**公式**:
```
comm_density = selected_cpu_blocks / total_cpu_blocks
```
**含义**: 需要从 CPU 传输到 GPU 的 blocks 占总 blocks 的比例。
**影响**: 决定 H2D 传输量的减少。
## 为什么 Comm Density 通常高于 Compute Density
### 聚合效应
由于 CPU block 粒度是 BSA block 的 32 倍CPU block 选择使用 `any()` 聚合:
```python
# BSA mask: [B, H, Q_bsa, K_bsa]
# Reshape to CPU block level
mask_per_cpu = mask.view(B, H, Q_bsa, num_cpu_blocks, bsa_per_cpu)
# Any BSA block selected -> whole CPU block needed
cpu_needed = mask_per_cpu.any(dim=-1).any(dim=2).any(dim=1)
```
只要 CPU block 中**任意一个**:
- Head 选择了该 block
- Q position 选择了该 block
- BSA sub-block 被选中
则整个 CPU block 都需要传输。
### 示例
| 场景 | Compute Density | Comm Density | 说明 |
|------|-----------------|--------------|------|
| 64K context, threshold=0.9 | 37% | 100% | 稀疏 blocks 均匀分布在所有 CPU blocks |
| 32K context, threshold=0.9 | 50% | 100% | 同上 |
## 测试结果
### 测试命令
```bash
# Offload 模式测试
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.:$PYTHONPATH python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_64k \
--datasets niah_single_1 \
--num-samples 1 \
--max-model-len 72000 \
--enable-offload \
--sparse-policy XATTN_BSA \
--sparse-threshold 0.9
```
### 输出示例
```
[DensityObserver] Mode: offload
Compute density: 0.3691 (min: 0.3691 @ layer 0)
Comm density: 1.0000 (CPU block granularity)
Savings ratio: 0.0% H2D transfer reduction
Num layers: 1
Layer 0 density: 0.369052
```
## 关键发现
### 当前 XAttention 的通信优化局限
1. **Compute density 有效降低**: ~37% @ 64K context计算量减少 63%
2. **Comm density 没有降低**: 100%(通信量没有减少)
### 原因分析
Attention pattern 的特点:
- 不同 heads 关注不同位置
- 不同 Q positions 关注不同 K positions
- 稀疏选择分布在整个 sequence 上
这导致虽然每个 (head, Q, K) 组合只选择少量 blocks但聚合后覆盖了所有 CPU blocks。
### 潜在优化方向
1. **Per-head block selection**: 每个 head 独立选择 CPU blocks
2. **Block clustering**: 将相关 blocks 聚合到同一 CPU block
3. **Dynamic block size**: 根据 attention pattern 动态调整 CPU block 大小
## DensityObserver API
### 启用和重置
```python
from nanovllm.utils.density_observer import DensityObserver
DensityObserver.enable()
DensityObserver.complete_reset()
DensityObserver.set_mode("offload") # or "gpu_only"
```
### 记录
```python
# Compute density (GPU-only 模式自动记录)
DensityObserver.record(layer_id, mask, causal=True)
# Comm density (Offload 模式在 select_blocks 中记录)
DensityObserver.record_comm_density(layer_id, selected_cpu_blocks, total_cpu_blocks)
```
### 获取结果
```python
# 总体 density
overall_compute = DensityObserver.get_overall_density()
overall_comm = DensityObserver.get_overall_comm_density()
# Per-layer density
per_layer_compute = DensityObserver.get_per_layer_density()
per_layer_comm = DensityObserver.get_per_layer_comm_density()
# 打印摘要
DensityObserver.print_summary()
```
## 相关文件
- `nanovllm/utils/density_observer.py`: DensityObserver 实现
- `nanovllm/kvcache/sparse/xattn_bsa.py`: XAttention BSA Policyselect_blocks 中记录 comm density
- `tests/test_ruler.py`: RULER benchmark 测试脚本

View File

@@ -0,0 +1,122 @@
# XAttention KV Chunking Density 验证测试
## 背景
验证 XAttention Triton kernel 是否只能沿 Q 轴分 chunk不能沿 KV 轴分 chunk。
**假设**`softmax_fuse_block_sum` 需要完整的 K 来计算正确的归一化分母,分 chunk 后的 attention 分布与完整序列不同。
## 测试方法
1. **GPU-only 模式**:一次性对完整序列调用 `xattn_estimate`,记录 Layer 0 的 density
2. **Offload DEBUG 模式**:分 chunk 调用 `xattn_estimate`,累积 selected/total counts计算最终 density
3. 使用相同的 `_debug_k_full` buffer 收集完整 K cache确保输入数据一致
### 关键代码逻辑
```python
# Offload DEBUG: 每个 chunk 累积 selected/total
for each chunk:
K_full = _debug_k_full[:, :, :total_k_len, :] # 累积的 K
_, mask_chunk = xattn_estimate(Q_chunk, K_full, threshold=threshold, causal=True)
# 裁剪到有效区域,计算正确的 causal mask (考虑 Q 偏移量)
q_offset_blocks = k_blocks - q_blocks
causal_mask = indices <= (q_indices + q_offset_blocks)
selected += (mask_valid & causal_mask).sum()
total += causal_mask.sum()
density = selected / total
```
## 测试结果
### 64K 序列 (niah_single_1, 序列长度 64891)
| threshold | GPU-only selected | Offload selected | GPU-only density | Offload density | 差异 (selected) |
|-----------|------------------|------------------|------------------|-----------------|-----------------|
| **0.90** | 1,524,617 | 1,330,506 | **0.3700** | **0.3229** | 194,111 (12.7%) |
| **0.95** | 1,955,015 | 1,747,585 | **0.4744** | **0.4241** | 207,430 (10.6%) |
| **1.00** | 4,118,719 | 4,118,896 | **0.9995** | **0.9995** | -177 (~0%) |
- **total**: 4,120,896 (两种模式一致)
### 32K 序列 (niah_single_1, 序列长度 32485)
| threshold | GPU-only selected | Offload selected | GPU-only density | Offload density | 差异 (selected) |
|-----------|------------------|------------------|------------------|-----------------|-----------------|
| **0.90** | 520,314 | 466,937 | **0.5021** | **0.4506** | 53,377 (10.3%) |
| **0.95** | 647,765 | 602,953 | **0.6251** | **0.5818** | 44,812 (6.9%) |
| **1.00** | 1,036,295 | 1,036,264 | **0.9999** | **0.9999** | 31 (~0%) |
- **total**: 1,036,320 (两种模式一致)
### 汇总对比
| 序列长度 | threshold | GPU-only density | Offload density | density 差异 |
|---------|-----------|------------------|-----------------|--------------|
| 32K | 0.90 | 0.5021 | 0.4506 | 5.2% |
| 64K | 0.90 | 0.3700 | 0.3229 | 4.7% |
| 32K | 0.95 | 0.6251 | 0.5818 | 4.3% |
| 64K | 0.95 | 0.4744 | 0.4241 | 5.0% |
| 32K | 1.00 | 0.9999 | 0.9999 | ~0% |
| 64K | 1.00 | 0.9995 | 0.9995 | ~0% |
## 结论
### 1. Softmax 归一化本身是正确的
`threshold=1.0`(选择所有 blocksGPU-only 和 Offload 模式的 density 几乎完全对齐(差异 < 0.01%)。
这说明:
- `_debug_k_full` 正确收集了完整的 K cache
- 分 chunk 调用 `xattn_estimate`softmax 归一化在正确的 K 序列上计算
- causal mask 的 Q 偏移量处理正确
### 2. 问题在于 threshold 的应用方式
`threshold < 1.0`差异显著10-13%
- **GPU-only**:对完整序列一次性应用 threshold选择 cumulative attention >= threshold 的 blocks
- **Offload**:每个 chunk 独立应用 threshold累积 selected counts
每个 chunk 独立应用 threshold 会导致:
- 某些在 GPU-only 中被选中的 blocks在分 chunk 时因 attention 分布不同而未被选中
- 累积的 selected 比一次性计算的要少
### 3. XAttention Triton kernel 的 KV chunking 限制
**验证结论**XAttention 的 `xattn_estimate` 可以正确处理 KV chunkingsoftmax 归一化正确),但 **threshold-based block selection 不能简单累积**
如果要在 Offload 模式下获得与 GPU-only 一致的 block selection
1. 需要先累积所有 chunks 的 attention scores
2. 最后一次性应用 threshold 选择 blocks
或者接受 10-13% 的 density 差异,这对实际推理准确性的影响需要进一步评估。
## 测试命令
```bash
# GPU-only 模式
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py --dataset niah_single_1 --sample 0 \
--sparse-policy xattn_bsa --sparse-threshold 0.9
# Offload 模式 (64K)
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py --dataset niah_single_1 --sample 0 \
--sparse-policy xattn_bsa --sparse-threshold 0.9 --enable-offload
# Offload 模式 (32K)
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py --dataset niah_single_1 --sample 0 \
--sparse-policy xattn_bsa --sparse-threshold 0.9 --enable-offload \
--data-dir /home/zijie/Code/nano-vllm/tests/data/ruler_32k --max-model-len 34000
```
## 相关文件
- `nanovllm/kvcache/sparse/xattn_bsa.py`: DEBUG 代码位置
- `nanovllm/ops/xattn.py`: `xattn_estimate` 实现
- `nanovllm/utils/density_observer.py`: DensityObserver 实现

View File

@@ -0,0 +1,400 @@
# XAttention KV Chunking Kernels
## 概述
本文档描述了支持 KV 维度分 chunk 的 softmax kernels 实现。这些 kernels 允许在 CPU offload 场景下,沿 KV 维度分块计算 sparse attention estimation而不需要在 GPU 上保存完整的 raw attention scores。
## 背景
原始的 `softmax_fuse_block_sum` kernel 需要完整的 K 序列来计算正确的 softmax 归一化分母:
```
softmax(x_i) = exp(x_i) / Σ_j exp(x_j)
```
如果只有部分 K (KV chunk),分母 `Σ_j exp(x_j)` 不完整,导致归一化错误。
## 解决方案:三阶段计算
通过将 softmax 计算拆分为三个阶段,实现正确的 KV chunking
```
┌─────────────────────────────────────────────────────────────────┐
│ 三阶段 Pipeline │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ KV Chunk 0 │ │ KV Chunk 1 │ │ KV Chunk N │ │
│ │ attn_scores │ │ attn_scores │ │ attn_scores │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ 阶段 1: softmax_compute_partial_stats │ │
│ │ 计算每个 chunk 的 (m_partial, l_partial) │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ (m_0, l_0) (m_1, l_1) (m_N, l_N) │
│ │ │ │ │
│ └────────────────┬┴─────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ 阶段 2: merge_softmax_stats │ │
│ │ Host 端合并 → (m_global, l_global) │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────┼────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ 阶段 3: softmax_normalize_and_block_sum │ │
│ │ 使用全局 stats 归一化并计算 block sums │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ block_sums_0 block_sums_1 block_sums_N │
│ │ │ │ │
│ └────────────────┴────────────────┘ │
│ │ │
│ ▼ │
│ torch.cat → final mask │
│ │
└─────────────────────────────────────────────────────────────────┘
```
### 阶段 1: `softmax_compute_partial_stats`
计算每个 KV chunk 的 partial statistics
- `m_partial`: 该 chunk 内的最大值 (per query row)
- `l_partial`: 该 chunk 内的 partial sum = Σ exp(x - m_partial)
```python
m_partial, l_partial = softmax_compute_partial_stats(
attn_weights_kv, # [batch, heads, q_len, k_chunk_len]
reshaped_block_size,
segment_size,
scale,
chunk_start=chunk_start,
kv_offset=kv_offset, # KV chunk 在完整序列中的偏移
is_causal=True,
)
# 输出: m_partial, l_partial 形状为 [batch, heads, q_len]
```
### 阶段 2: `merge_softmax_stats`
Host 端合并所有 KV chunks 的 statistics
```python
m_global, l_global = merge_softmax_stats(m_chunks, l_chunks)
```
合并公式 (Online Softmax):
```
m_new = max(m_global, m_chunk)
l_new = l_global * exp(m_global - m_new) + l_chunk * exp(m_chunk - m_new)
```
### 阶段 3: `softmax_normalize_and_block_sum`
使用全局 statistics 归一化并计算 block sums
```python
attn_sum_kv = softmax_normalize_and_block_sum(
attn_weights_kv, # [batch, heads, q_len, k_chunk_len]
m_global, # [batch, heads, q_len]
l_global, # [batch, heads, q_len]
reshaped_block_size,
segment_size,
chunk_start=chunk_start,
real_q_len=real_q_len,
scale=scale,
kv_offset=kv_offset,
is_causal=True,
)
# 输出: [batch, heads, q_blocks, k_chunk_blocks]
```
## 数学等价性证明
原始 softmax 计算 (完整 K):
```
softmax(x_i) = exp(x_i - m) / Σ_j exp(x_j - m)
```
分 KV chunk 计算:
```
Chunk 0: m_0 = max(x[0:N/2]), l_0 = Σ exp(x[0:N/2] - m_0)
Chunk 1: m_1 = max(x[N/2:N]), l_1 = Σ exp(x[N/2:N] - m_1)
合并:
m_global = max(m_0, m_1)
l_global = l_0 * exp(m_0 - m_global) + l_1 * exp(m_1 - m_global)
= Σ exp(x[0:N] - m_global) # 等于全局 sum
归一化:
softmax(x_i) = exp(x_i - m_global) / l_global # 正确!
```
## Causal Mask 处理
两个 kernel 都正确处理了 causal attention
1. **`softmax_partial_stats_kernel`**: 通过 `kv_offset` 参数确定当前 KV chunk 在完整序列中的位置,正确计算 causal boundary
2. **`softmax_normalize_block_sum_kernel`**: 同样使用 `kv_offset`,对 causal boundary 之后的位置输出 0
## 存储开销分析
### 符号定义
| 符号 | 含义 | 典型值 |
|------|------|--------|
| S | seq_len | 64K |
| B | batch_size | 1 |
| H | num_heads | 32 |
| D | head_dim | 128 |
| T | stride | 4-8 |
| C | chunk_size | 16K |
| n | num_kv_chunks = ceil(S/C) | 4 |
### 原始方式 (无 KV chunking)
**attn_weights 峰值内存**:
```
[B, H, S/T, S/T] × 4 bytes = B × H × (S/T)² × 4
例: S=64K, T=4, B=1, H=32
= 1 × 32 × 16384² × 4 = 32 GB
```
### KV Chunking 方式的额外存储
#### 1. Partial Stats (每个 KV chunk)
```
m_partial: [B, H, C/T] × 4 bytes
l_partial: [B, H, C/T] × 4 bytes
单个 chunk = 2 × B × H × (C/T) × 4
= 2 × 1 × 32 × 4096 × 4 = 1 MB
```
#### 2. Global Stats
```
m_global: [B, H, S/T] × 4 bytes
l_global: [B, H, S/T] × 4 bytes
= 2 × B × H × (S/T) × 4
= 2 × 1 × 32 × 16384 × 4 = 4 MB
```
#### 3. 总额外开销
```
total_extra = n × partial_stats + global_stats
= 4 × 1MB + 4MB = 8 MB
```
### 存储开销随 seqlen 变化
| seqlen | num_chunks | 原始 attn_weights | 额外 stats | 比例 |
|--------|------------|-------------------|------------|------|
| 16K | 1 | 2 GB | 2 MB | 0.1% |
| 32K | 2 | 8 GB | 4 MB | 0.05% |
| 64K | 4 | 32 GB | 8 MB | 0.025% |
| 128K | 8 | 128 GB | 16 MB | 0.012% |
### 复杂度分析
| 存储组件 | 复杂度 | 说明 |
|----------|--------|------|
| 原始 attn_weights | O(S²) | 二次增长 |
| Partial/Global stats | O(S) | 线性增长 |
| **相对开销** | O(1/S) | **随 seqlen 递减** |
### 峰值显存优化
KV chunking 的主要收益是**峰值显存**从 O(S²) 降到 O(S×C)
```
原始: O(B × H × (S/T)²) # 完整 attn_weights
KV chunking: O(B × H × (S/T) × (C/T)) # 一次只处理一个 chunk
```
以 S=128K, C=16K 为例:
- 原始峰值: ~128 GB
- KV chunking 峰值: ~16 GB (降低 **8 倍**)
## 支持不同 Q/KV Chunk Size
三阶段 pipeline 支持 Q 和 KV 使用不同的 chunk size
```python
q_chunk_size = 8192 # Q 分块大小
kv_chunk_size = 16384 # KV 分块大小
for q_chunk_idx in range(q_chunk_num):
Q_chunk = Q[:, :, q_start:q_end, :] # [B, H, q_chunk_size, D]
for kv_chunk_idx in range(kv_chunk_num):
K_chunk = K[:, :, kv_start:kv_end, :] # [B, H, kv_chunk_size, D]
# ... 三阶段处理
```
### 测试验证结果
| Config | seq_len | Q chunks | KV chunks | density | 对齐 |
|--------|---------|----------|-----------|---------|------|
| Q=16K, KV=16K | 64891 | 4 | 4 | 0.1117 | ✓ 100% |
| Q=8K, KV=16K | 64891 | 8 | 4 | 0.1112 | ✓ 100% |
| Q=16K, KV=8K | 64891 | 4 | 8 | 0.1117 | ✓ 100% |
| Q=8K, KV=8K | 64891 | 8 | 8 | 0.1112 | ✓ 100% |
| Q=4K, KV=16K | 64891 | 16 | 4 | 0.1109 | ✓ 100% |
## API 参考
### `softmax_compute_partial_stats`
```python
def softmax_compute_partial_stats(
attn_weights_slice: torch.Tensor, # [batch, heads, q_len, k_chunk_len]
reshaped_block_size: int,
segment_size: int,
scale: float,
chunk_start: int = 0, # Q chunk 起始位置 (reshaped space)
kv_offset: int = 0, # KV chunk 偏移 (reshaped space)
is_causal: bool = False,
) -> Tuple[torch.Tensor, torch.Tensor]:
"""返回 (m, l) partial stats"""
```
### `merge_softmax_stats`
```python
def merge_softmax_stats(
m_chunks: list, # List of [batch, heads, q_len] tensors
l_chunks: list, # List of [batch, heads, q_len] tensors
) -> Tuple[torch.Tensor, torch.Tensor]:
"""返回 (m_global, l_global)"""
```
### `softmax_normalize_and_block_sum`
```python
def softmax_normalize_and_block_sum(
attn_weights_slice: torch.Tensor, # [batch, heads, q_len, k_chunk_len]
m_global: torch.Tensor, # [batch, heads, q_len]
l_global: torch.Tensor, # [batch, heads, q_len]
reshaped_block_size: int,
segment_size: int,
chunk_start: int,
real_q_len: int,
scale: float,
kv_offset: int = 0,
is_causal: bool = False,
) -> torch.Tensor:
"""返回 block sums [batch, heads, q_blocks, k_chunk_blocks]"""
```
## 使用示例
```python
from nanovllm.ops.xattn import (
flat_group_gemm_fuse_reshape,
softmax_compute_partial_stats,
softmax_normalize_and_block_sum,
merge_softmax_stats,
find_blocks_chunked,
)
# 对每个 Q chunk
for q_chunk_idx in range(q_chunk_num):
Q_chunk = Q_padded[:, :, q_start:q_end, :]
# 阶段 1: 每个 KV chunk 计算 partial stats
m_chunks, l_chunks = [], []
attn_weights_chunks = []
for kv_chunk_idx in range(kv_chunk_num):
K_chunk = K_padded[:, :, kv_start:kv_end, :]
kv_offset = kv_chunk_idx * kv_chunk_size // STRIDE
# 计算 raw scores
attn_weights = flat_group_gemm_fuse_reshape(
Q_chunk, K_chunk, STRIDE,
chunk_start=chunk_start,
chunk_end=chunk_end,
is_causal=False, # K 不完整
)
attn_weights_chunks.append(attn_weights)
# 计算 partial stats
m, l = softmax_compute_partial_stats(
attn_weights, block_size, segment_size, scale,
chunk_start=chunk_start,
kv_offset=kv_offset,
is_causal=True,
)
m_chunks.append(m)
l_chunks.append(l)
# 阶段 2: 合并 stats
m_global, l_global = merge_softmax_stats(m_chunks, l_chunks)
# 阶段 3: 归一化并计算 block sums
block_sums_list = []
for kv_chunk_idx, attn_weights in enumerate(attn_weights_chunks):
kv_offset = kv_chunk_idx * kv_chunk_size // STRIDE
block_sums = softmax_normalize_and_block_sum(
attn_weights, m_global, l_global,
block_size, segment_size, chunk_start, real_q_len, scale,
kv_offset=kv_offset,
is_causal=True,
)
block_sums_list.append(block_sums)
# 拼接并选择 blocks
attn_sum = torch.cat(block_sums_list, dim=-1)
mask = find_blocks_chunked(attn_sum, ...)
```
## 性能对比
| 方面 | 原始实现 | KV Chunking 实现 |
|------|---------|-----------------|
| Kernel 数量 | 1 | 2 (stats + normalize) |
| Raw scores 读取次数 | 2 | 2 |
| 额外内存 | 0 | O(B × H × S/T × 2) for (m, l) |
| Host 计算 | 无 | merge stats (轻量) |
| **峰值显存** | O(S²) | **O(S × C)** |
## 验证测试
### 批量测试结果
测试脚本 `tests/test_xattn_kv_chunking_batch.py` 验证了不同 seqlen 下的一致性:
```
| seq_len | stride | threshold | kv_chunks | density_api | density_kv | diff | mask_diff | status |
|---------|--------|-----------|-----------|-------------|------------|----------|-----------|--------|
| 3688 | 4 | 0.90 | 1 | 0.383405 | 0.383405 | 0.000000 | 0.0000% | PASS |
| 7888 | 4 | 0.90 | 1 | 0.290611 | 0.290611 | 0.000000 | 0.0000% | PASS |
| 15685 | 4 | 0.90 | 1 | 0.197724 | 0.197724 | 0.000000 | 0.0000% | PASS |
| 32485 | 4 | 0.90 | 2 | 0.159023 | 0.159023 | 0.000000 | 0.0000% | PASS |
| 64891 | 4 | 0.90 | 4 | 0.111656 | 0.111656 | 0.000000 | 0.0000% | PASS |
```
### 关键结论
1. **数学等价性**: density_diff = 0.000000 对于所有测试
2. **Mask 完全对齐**: mask_diff = 0.0000% 对于所有测试
3. **支持任意 Q/KV chunk size 组合**
## 相关文件
- `nanovllm/ops/xattn.py`: Kernel 实现
- `tests/test_xattn_estimate_alignment.py`: 单文件验证测试
- `tests/test_xattn_kv_chunking_batch.py`: 批量验证测试
- `docs/xattn_kernels_guide.md`: 原始 kernel 文档

View File

@@ -0,0 +1,154 @@
# XAttention Memory Benchmark
GPU-only 模式下 XAttention 的内存使用分析。
## 测试配置
### 硬件
- **GPU**: NVIDIA A100 80GB (用于基准测试)
- **目标**: 验证在 RTX 3090/4090 (24GB) 上的可行性
### 模型
- **Model**: Qwen3-0.6B (28 layers, 16 heads, 8 KV heads, head_dim=128)
- **Context Length**: 32K (max_model_len=40960)
### XAttention 配置
- **Sparse Policy**: XATTN_BSA
- **Threshold**: 0.9
- **Block Size**: 128 tokens (BSA block)
- **Stride**: 8
---
## 内存使用分析
### 基准测试 (gpu-utilization=0.9)
| 指标 | 数值 |
|------|------|
| KV Cache | 157 blocks × 448 MB = 70.3 GB |
| **峰值内存** | **73,949 MiB (72.2 GB)** |
| GPU 利用率 | 90.2% |
### 24GB 显存可行性测试
| gpu-utilization | KV Cache Blocks | KV Cache Size | 峰值内存 | 测试结果 |
|-----------------|-----------------|---------------|----------|----------|
| 0.25 | 39 blocks | 17.5 GB | **20.6 GB** | ✅ 5/5 PASSED |
| 0.28 | 44 blocks | 19.7 GB | **22.8 GB** | ✅ 5/5 PASSED |
---
## 24GB 显存推荐配置
适用于 **RTX 3090 / RTX 4090 (24GB)**
```bash
CUDA_VISIBLE_DEVICES=0 python tests/test_ruler.py \
--model ~/models/Qwen3-0.6B \
--data-dir tests/data/ruler_32k \
--datasets niah_single_1 \
--num-samples 5 \
--max-model-len 40960 \
--sparse-policy XATTN_BSA \
--sparse-threshold 0.9 \
--gpu-utilization 0.28
```
### 配置说明
| 参数 | 值 | 说明 |
|------|-----|------|
| `--gpu-utilization` | 0.28 | 限制 GPU 内存使用到 ~23GB |
| `--max-model-len` | 40960 | 支持 32K+ context |
| `--sparse-policy` | XATTN_BSA | 启用 XAttention 稀疏注意力 |
| `--sparse-threshold` | 0.9 | 选择覆盖 90% attention 的 blocks |
---
## 内存分解
### Qwen3-0.6B @ 32K Context
| 组件 | 计算公式 | 大小 |
|------|----------|------|
| 模型权重 | 0.6B × 2 bytes | ~1.2 GB |
| KV Cache (per-token) | 2 × 28 layers × 8 kv_heads × 128 head_dim × 2 bytes | 112 KB |
| KV Cache (per-block) | 112 KB × 4096 tokens | 448 MB |
| KV Cache (44 blocks) | 448 MB × 44 | 19.7 GB |
| XAttention Buffers | GQA + mask + KV chunking | ~0.3 GB |
| 中间激活 | 运行时分配 | ~1.5 GB |
| **总计** | | **~22.8 GB** |
---
## 性能指标
### RULER niah_single_1 (5 samples)
| 指标 | gpu-util=0.25 | gpu-util=0.28 | gpu-util=0.9 |
|------|---------------|---------------|--------------|
| 准确率 | 100% (5/5) | 100% (5/5) | 100% (5/5) |
| 耗时 | 11.4s | 11.5s | 11.6s |
| Compute Density | 24.77% | 24.77% | 24.77% |
| Min Layer Density | 4.29% (Layer 5) | 4.29% (Layer 5) | 4.29% (Layer 5) |
**结论**: 降低 gpu-utilization 不影响准确率和性能,只影响可支持的最大序列长度。
---
## 不同模型的估算
### KV Cache 公式
```
KV Cache per-token = 2 × num_layers × num_kv_heads × head_dim × dtype_size
KV Cache per-block = per-token × block_size
```
### 常见模型估算 (32K context, block_size=4096)
| 模型 | Layers | KV Heads | Head Dim | Per-Token | 32K Tokens | 24GB 可行? |
|------|--------|----------|----------|-----------|------------|------------|
| Qwen3-0.6B | 28 | 8 | 128 | 112 KB | 3.5 GB | ✅ 是 |
| Qwen3-4B | 36 | 8 | 128 | 144 KB | 4.5 GB | ✅ 是 |
| Llama-3.1-8B | 32 | 8 | 128 | 128 KB | 4.0 GB | ⚠️ 需要 offload |
| Qwen2.5-7B | 28 | 4 | 128 | 56 KB | 1.8 GB | ✅ 是 |
注: 8B 模型权重约 16GB加上 KV cache 超过 24GB需要 CPU offload。
---
## 使用建议
### RTX 3090/4090 (24GB)
1. **小模型 (≤4B)**:可直接使用 GPU-only + XAttention
```bash
--gpu-utilization 0.28 --sparse-policy XATTN_BSA
```
2. **大模型 (7B-8B)**:需要 CPU offload
```bash
--enable-offload --num-gpu-blocks 4 --num-cpu-blocks 32
```
### A100 (40GB/80GB)
1. **所有模型**:可使用 GPU-only 模式
```bash
--gpu-utilization 0.9 --sparse-policy XATTN_BSA
```
---
## 相关文件
- `tests/test_ruler.py`: RULER 测试脚本
- `nanovllm/kvcache/sparse/xattn_bsa.py`: XAttention BSA Policy 实现
- `docs/gpuonly_density_alignment_test.md`: Density 对齐验证
---
**Date**: 2026-02-02
**Author**: Zijie Tian

View File

@@ -0,0 +1,184 @@
# XAttention Offload Profiling - 32K Context
Nsys profiling 分析 XAttention vs Full Attention 在 Offload 模式下的性能。
**测试日期**: 2026-02-05
**测试模型**: Llama-3.1-8B-Instruct
**Context**: 32K tokens
**GPU**: A100-80GB (GPU 0)
---
## 测试配置
| 参数 | Full | XAttention |
|------|------|------------|
| Policy | FULL | XATTN_BSA |
| Block size | 4096 | 4096 |
| GPU blocks | 4 | 4 |
| Threshold | - | 0.95 |
| Density | 100% | ~50% |
---
## XAttention 各阶段时间统计
### NVTX Markers Summary
| 阶段 | 总时间(ms) | 调用次数 | 平均时间(ms) | 说明 |
|------|------------|----------|--------------|------|
| xattn_find_blocks | 1155.1 | 256 | 4.51 | 块选择 (threshold-based) |
| xattn_estimate_pass1 | 588.3 | 256 | 2.30 | 第一轮: partial stats |
| xattn_compute_historical | 512.0 | 224 | 2.29 | 历史 KV attention |
| xattn_estimate_pass2 | 501.6 | 256 | 1.96 | 第二轮: block sums |
| xattn_estimate_merge | 197.9 | 256 | 0.77 | 合并 softmax stats |
| xattn_compute_merge | 93.8 | 256 | 0.37 | 计算结果合并 |
| xattn_compute_current | 59.2 | 256 | 0.23 | 当前 chunk attention |
### 时间分配
```
Total XAttention overhead: 3108 ms
Estimate 阶段: 1288 ms (41.4%)
- pass1: 588 ms
- pass2: 502 ms
- merge: 198 ms
Find blocks: 1155 ms (37.2%)
Compute 阶段: 665 ms (21.4%)
- historical: 512 ms
- merge: 94 ms
- current: 59 ms
```
---
## Chunk7 (最后一个 chunk) 对比
### Per-Layer 时间
| Policy | Layer 0 | Layer 1 | ... | Layer 31 | Avg |
|--------|---------|---------|-----|----------|-----|
| Full | 36.5 ms | 33.6 ms | ... | 32.7 ms | ~35 ms |
| XAttn | 39.7 ms | 39.3 ms | ... | 38.5 ms | ~38 ms |
### 分析
Chunk7 是序列的最后 ~4K tokens (3813 tokens),此时:
- K 长度: 32485 tokens
- Density: 42.08%
**结论**: XAttention 在 Chunk7 比 Full 慢约 8%,原因:
1. Estimate 开销无法被稀疏计算收益抵消
2. 42% density 仍然较高,稀疏收益有限
---
## Full Attention Chunk7 详细数据
```
Layer Time(ms)
L0 36.5
L1 44.3
L2 43.7
L3 38.7
L4 34.2
L5 45.2
...
L31 32.7
Avg ~35
```
---
## XAttention Chunk7 详细数据
```
Layer Time(ms)
L0 39.7
L1 39.3
L2 37.1
L3 39.1
L4 38.7
L5 39.4
...
L31 38.5
Avg ~38
```
---
## 性能瓶颈分析
### 1. xattn_find_blocks 开销过高
- 平均 4.51 ms per call
- 占总时间 37.2%
- 原因: threshold-based 块选择涉及排序和累积求和
### 2. 两轮 estimate 开销
- Pass1 + Pass2 共 1090 ms
- 需要遍历所有 KV chunks 两次
- 可优化方向: 单轮 estimate
### 3. Compute 阶段相对高效
- 只占 21.4%
- 说明 BSA 稀疏计算本身效率不错
---
## 优化建议
### 短期
1. **减少 find_blocks 开销**
- 使用 top-k 而不是 threshold
- 预分配 mask buffer 避免动态分配
2. **合并 estimate 两轮**
- 在单轮中同时计算 stats 和 block sums
### 中期
1. **estimate 阶段使用更小的 block_size**
- 当前 block_size=4096 对 estimate 不友好
- 参考 `docs/estimate_block_size_performance.md`
2. **Pipeline estimate 和 H2D**
- 将 estimate 与下一个 chunk 的 H2D 重叠
### 长期
1. **预测式块选择**
- 基于历史 pattern 预测下一个 chunk 的重要 blocks
- 减少 estimate 开销
---
## 相关文件
- `results/nsys/full_offload_32k_blk4096_20260205_023257.nsys-rep`
- `results/nsys/xattn_offload_32k_blk4096_20260205_023435.nsys-rep`
---
## 命令
### Profile Full
```bash
bash scripts/profile_offload.sh --policy full --ctx-len 32k --gpu 0 --model ~/models/Llama-3.1-8B-Instruct
```
### Profile XAttention
```bash
bash scripts/profile_offload.sh --policy xattn --ctx-len 32k --gpu 0 --model ~/models/Llama-3.1-8B-Instruct
```
### 分析 NVTX
```bash
nsys stats --report nvtx_pushpop_sum <file>.nsys-rep
```

View File

@@ -0,0 +1,307 @@
# XAttention Offload Stream Synchronization Fix
修复 XAttention BSA Policy 在 Offload 模式下的 CUDA stream 同步 bug。
**修复日期**: 2026-02-05
**Commit**: `829b311`
**影响文件**: `nanovllm/kvcache/sparse/xattn_bsa.py`, `nanovllm/kvcache/offload_engine.py`
---
## 问题描述
### 症状
在 Offload 模式下运行 RULER benchmark 时XAttention BSA 的 `select_blocks` 方法中 Pass 1 和 Pass 2 从**同一个 CPU block** 加载的 K 数据不一致:
```
Pass 1: K_chunk sum = 745472.00 (正确)
Pass 2: K_chunk sum = 0.00 (错误,数据未加载完成)
```
这导致 attention 计算结果错误RULER 准确率下降。
### 复现条件
- 模式: Offload (`--enable-offload`)
- Context: ≥ 32K tokens
- 稀疏策略: `--sparse-policy XATTN_BSA`
---
## 根因分析
### Stream 配置回顾
nano-vllm 的 CPU offload 使用多个 CUDA streams 实现 pipeline
| Stream | 用途 |
|--------|------|
| `slot_transfer_streams[i]` | H2D 传输 (CPU → GPU slot) |
| `compute_stream` | Attention 计算 |
| `prefill_offload_streams[i]` | D2H 传输 (GPU → CPU cache) |
### 同步机制
`wait_slot_layer(slot)` 使用 event 机制同步:
```python
def wait_slot_layer(self, slot_idx: int):
"""Make compute_stream wait for H2D transfer completion."""
self.compute_stream.wait_event(self.ring_slot_ready[slot_idx])
```
### Bug 根因
`select_blocks` 方法中:
1. H2D 传输在 `slot_transfer_streams` 上执行
2. `wait_slot_layer``compute_stream` 等待传输完成
3. **但是** 后续的 compute kernels 在**默认 stream** 上执行,而不是 `compute_stream`
```python
# Bug 代码
offload_engine.load_k_only_to_slot_layer(slot, layer_id, cpu_block_id)
offload_engine.wait_slot_layer(slot) # compute_stream 等待
# 这些 kernel 在默认 stream 上运行,没有等待 H2D 完成!
k_block = offload_engine.get_k_for_slot(slot)
K_chunk = k_block.transpose(1, 2)
# ... 后续计算 ...
```
### 时序图
```
slot_transfer_stream: [====H2D====]
compute_stream: |wait|
default_stream: [kernel1][kernel2] ← 没有等待!
数据未就绪
```
---
## 修复方案
### 核心修改
将所有 estimate 阶段的 compute kernels 包装在 `with torch.cuda.stream(compute_stream):` 中:
```python
# 修复后代码
compute_stream = offload_engine.compute_stream
offload_engine.load_k_only_to_slot_layer(slot, layer_id, cpu_block_id)
offload_engine.wait_slot_layer(slot) # compute_stream 等待
# 所有计算在 compute_stream 上执行
with torch.cuda.stream(compute_stream):
k_block = offload_engine.get_k_for_slot(slot)
K_chunk = k_block.transpose(1, 2)
# ... 后续计算 ...
```
### 修复位置
`select_blocks` 方法中共 6 处需要修复:
| 位置 | 阶段 | 修复内容 |
|------|------|----------|
| Pass 1 历史 blocks | `xattn_estimate_pass1` | 历史 KV chunk 处理 |
| Pass 1 当前 chunk | `xattn_estimate_pass1` | 当前 GPU 上的 K 处理 |
| Step 2 合并 | `merge_softmax_stats` | softmax stats 合并 |
| Pass 2 历史 blocks | `xattn_estimate_pass2` | 带全局 stats 的 block_sum |
| Pass 2 当前 chunk | `xattn_estimate_pass2` | 当前 chunk 的 block_sum |
| Step 4 block 选择 | `find_blocks_chunked` | 最终 block 选择 |
### 时序图(修复后)
```
slot_transfer_stream: [====H2D====]
compute_stream: |wait|[kernel1][kernel2]
数据已就绪
```
---
## 代码变更详情
### 1. Pass 1 历史 blocks 处理
```python
# Before (bug)
for kv_chunk_idx, cpu_block_id in enumerate(available_blocks):
offload_engine.load_k_only_to_slot_layer(slot, layer_id, cpu_block_id)
offload_engine.wait_slot_layer(slot)
k_block = offload_engine.get_k_for_slot(slot) # 默认 stream
K_chunk = k_block.transpose(1, 2)
# ... compute ...
# After (fixed)
compute_stream = offload_engine.compute_stream
for kv_chunk_idx, cpu_block_id in enumerate(available_blocks):
offload_engine.load_k_only_to_slot_layer(slot, layer_id, cpu_block_id)
offload_engine.wait_slot_layer(slot)
with torch.cuda.stream(compute_stream): # 显式指定 stream
k_block = offload_engine.get_k_for_slot(slot)
K_chunk = k_block.transpose(1, 2)
# ... compute ...
```
### 2. 移除 STRONG SYNC
`offload_engine.py` 中移除了不必要的强同步:
```python
# Removed from load_to_slot_layer() and load_k_only_to_slot_layer()
# STRONG SYNC: Synchronize all prefill offload streams before H2D
# for offload_stream in self.prefill_offload_streams:
# offload_stream.synchronize()
```
这些同步现在由 event 机制正确处理,不再需要阻塞式同步。
### 3. 其他清理
- 移除 DEBUG print 语句
- 移除 `torch.save()` debug 代码
- 合并多个 fallback 条件
-`chunk_size` 默认值从 16384 改为 4096匹配 offload Q chunk size
---
## 测试验证
### 测试命令
**GPU 0 - Offload 模式测试**:
```bash
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/Llama-3.1-8B-Instruct \
--data-dir tests/data/ruler_32k \
--datasets niah_single_1 \
--num-samples 10 \
--max-model-len 40960 \
--enable-offload \
--sparse-policy XATTN_BSA \
--sparse-threshold 0.9
```
**GPU 1 - GPU-only 模式测试**:
```bash
CUDA_VISIBLE_DEVICES=1 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_ruler.py \
--model ~/models/Qwen3-0.6B \
--data-dir tests/data/ruler_32k \
--datasets niah_single_1 \
--num-samples 10 \
--max-model-len 40960 \
--sparse-policy XATTN_BSA \
--sparse-threshold 0.9
```
### 测试结果
| 模式 | 模型 | Context | Samples | Pass Rate | Density |
|------|------|---------|---------|-----------|---------|
| Offload | Llama-3.1-8B | 32K | 10/10 | **100%** | 9.53% |
| GPU-only | Qwen3-0.6B | 32K | 10/10 | **100%** | 9.84% |
### Density 对齐验证
| 模式 | Layer 0 Density | 差异 |
|------|-----------------|------|
| GPU-only | 9.84% | - |
| Offload | 9.53% | ~3% |
~3% 的差异是预期的,因为两种模式的 KV 累积模式不同:
- GPU-only: 一次性处理所有 KV
- Offload: 分 chunk 处理,每个 chunk 独立计算 softmax stats 后合并
---
## 技术细节
### 三阶段 KV Chunking 流程
```
┌─────────────────────────────────────────────────────────────┐
│ Stage 1: softmax_compute_partial_stats │
│ └── 每个 KV chunk 独立计算 partial stats (m_i, l_i) │
│ │
│ Stage 2: merge_softmax_stats │
│ └── Host 端合并所有 chunks: (m_global, l_global) │
│ │
│ Stage 3: softmax_normalize_and_block_sum │
│ └── 使用全局 stats 归一化并计算 block sums │
└─────────────────────────────────────────────────────────────┘
```
### Stream 配置要求
| 操作类型 | Stream | 原因 |
|----------|--------|------|
| H2D 传输 | `slot_transfer_streams` | 异步传输,不阻塞计算 |
| D2H 传输 | `prefill_offload_streams` | 异步 offload不阻塞计算 |
| Estimate kernels | `compute_stream` | 与 attention 计算共享,确保同步 |
| Attention kernels | `compute_stream` | 主计算流 |
### Event 同步机制
```python
# H2D 传输完成后记录 event
self.ring_slot_ready[slot_idx].record(slot_transfer_stream)
# 计算前等待 H2D 完成
self.compute_stream.wait_event(self.ring_slot_ready[slot_idx])
# 计算完成后记录 event用于下一轮 H2D
self.ring_slot_compute_done[slot_idx].record(compute_stream)
```
---
## 相关文档
- [`docs/architecture_guide.md`](architecture_guide.md): Stream 配置和 ring buffer 架构
- [`docs/xattn_kv_chunking_kernels.md`](xattn_kv_chunking_kernels.md): 三阶段 softmax kernels
- [`docs/gpuonly_density_alignment_test.md`](gpuonly_density_alignment_test.md): Density 对齐测试
- [`docs/xattn_bsa_policy_design.md`](xattn_bsa_policy_design.md): XAttention BSA Policy 设计
---
## 经验总结
### 1. Stream 同步的隐蔽性
CUDA stream 同步 bug 很难发现:
- 数据可能"大部分时间"正确(取决于时序)
- 错误表现为随机/间歇性的结果偏差
- 需要精确的 debug logging 才能定位
### 2. Event vs Synchronize
| 方法 | 优点 | 缺点 |
|------|------|------|
| `stream.wait_event()` | 非阻塞,保持 pipeline | 只同步指定 stream |
| `stream.synchronize()` | 保证完成 | 阻塞整个 stream破坏 pipeline |
**最佳实践**: 使用 event 进行精确同步,避免 synchronize 阻塞。
### 3. 调试方法
```python
# 打印 tensor sum 验证数据一致性
print(f"K_chunk sum = {K_chunk.sum().item()}")
# 保存中间结果进行离线比较
torch.save({'K': K_chunk, 'layer': layer_id}, f'/tmp/debug_{pass}_{chunk}.pt')
```

View File

@@ -0,0 +1,170 @@
# XAttention Performance Analysis
本文档记录 XAttention 在不同配置下的性能分析结果,包括 NVTX 标记位置、block size 影响和性能瓶颈。
## NVTX 标记
XAttention 代码中添加了 NVTX 标记用于 nsys profiling便于分析 estimate 和 compute 阶段的性能。
### 标记位置
| 模式 | 标记名称 | 文件位置 | 说明 |
|------|---------|---------|------|
| GPU-only | `xattn_estimate` | `xattn_bsa.py:compute_prefill` | xattn_estimate 调用 |
| GPU-only | `xattn_bsa_compute` | `xattn_bsa.py:compute_prefill` | BSA kernel 调用 |
| Offload | `xattn_estimate_gemm` | `xattn_bsa.py:select_blocks` | flat_group_gemm 循环 |
| Offload | `xattn_estimate_softmax` | `xattn_bsa.py:select_blocks` | softmax_fuse_block_sum |
| Offload | `xattn_estimate_find_blocks` | `xattn_bsa.py:select_blocks` | find_blocks_chunked |
| Offload | `xattn_compute_historical` | `xattn_bsa.py:compute_chunked_prefill` | 历史 chunks attention |
| Offload | `xattn_compute_current` | `xattn_bsa.py:compute_chunked_prefill` | 当前 chunk attention |
| Offload | `xattn_compute_merge` | `xattn_bsa.py:compute_chunked_prefill` | merge 操作 |
### 查看 NVTX 统计
```bash
# 生成 profile
bash scripts/profile_offload.sh --policy xattn --ctx-len 64k --block-size 4096 --gpu 0
# 查看 NVTX 统计
nsys stats --report nvtx_pushpop_sum results/nsys/<filename>.nsys-rep
```
## Block Size 对 Offload 模式的影响
### 测试配置
- Model: Llama-3.1-8B-Instruct
- Context: 64K tokens
- Mode: xattn + offload
- GPU: A100 40GB
### 性能对比
| 指标 | block_size=4096 | block_size=1024 | 变化 |
|------|----------------|-----------------|------|
| **总时间** | 27.7s | 55.5s | **2x 慢** |
| **Chunks 数量** | 16 | 64 | 4x |
| **CPU blocks** | 18 | 71 | ~4x |
### 各阶段耗时分布
#### block_size=4096
| 阶段 | 占比 | 总时间 | 平均时间 | 调用次数 |
|-----|------|--------|---------|---------|
| **xattn_estimate_find_blocks** | **39.7%** | 18.0s | 37.6ms | 480 |
| xattn_compute_historical | 4.4% | 2.0s | 4.2ms | 480 |
| xattn_estimate_gemm | 3.4% | 1.5s | 3.2ms | 480 |
| xattn_compute_current | 0.2% | 113ms | 0.22ms | 512 |
| xattn_compute_merge | 0.2% | 96ms | 0.19ms | 512 |
| xattn_estimate_softmax | 0.2% | 88ms | 0.18ms | 480 |
#### block_size=1024
| 阶段 | 占比 | 总时间 | 平均时间 | 调用次数 |
|-----|------|--------|---------|---------|
| **xattn_estimate_gemm** | **23.6%** | 22.6s | 11.4ms | 1984 |
| **xattn_compute_historical** | **16.9%** | 16.2s | 8.0ms | 2016 |
| xattn_estimate_find_blocks | 1.4% | 1.3s | 0.66ms | 1984 |
| xattn_compute_current | 0.5% | 433ms | 0.21ms | 2048 |
| xattn_compute_merge | 0.4% | 373ms | 0.18ms | 2048 |
| xattn_estimate_softmax | 0.2% | 222ms | 0.11ms | 1984 |
### 关键发现
1. **Block size 对性能影响显著**
- block_size=1024 比 4096 慢约 2x
- 更小的 block size 导致更多的 chunks增加调用次数
2. **性能瓶颈随 block size 变化**
- **block_size=4096**: 瓶颈是 `find_blocks_chunked` (39.7%)
- **block_size=1024**: 瓶颈转移到 `estimate_gemm` (23.6%) 和 `compute_historical` (16.9%)
3. **Amortization 效应**
- 大 block size 虽然单次 `find_blocks` 更慢 (37.6ms vs 0.66ms)
- 但调用次数少 (480 vs 1984),总时间反而更少
4. **find_blocks_chunked 的特殊性**
- 该函数主要在 CPU 上执行 block 选择逻辑
- 处理更大的数据量时开销显著增加
- block_size=4096 时占用 40% 时间,是主要优化目标
## softmax_fuse_block_sum_kernel 性能分析
`softmax_fuse_block_sum_kernel_non_causal` 是 XAttention 估计阶段的核心 Triton kernel。
### Kernel 结构
```python
# 每个 thread block 处理的数据形状
工作负载: [block_size, segment_size] # 单个 Q block 对所有 K 的注意力
# Pass 1: 计算全局 softmax 参数 (m_i, l_i)
for iter in range(num_iters): # num_iters = k_len / segment_size
X = load [block_size, segment_size]
compute max, sum for softmax normalization
# Pass 2: Normalize + Block Sum
for iter in range(num_iters):
X = load [block_size, segment_size]
X = softmax(X)
X = reshape(X, [block_size, segment_size/block_size, block_size])
X = sum(X, axis=2) # → [block_size, segment_size/block_size]
X = sum(X, axis=0) # → [segment_size/block_size]
store output
```
### 性能随 block_size 变化的因素
| 因素 | 小 block_size (64) | 大 block_size (256) |
|------|-------------------|---------------------|
| Grid 并行度 | 高 (更多 blocks) | 低 (更少 blocks) |
| 寄存器使用 | 低 | 高 (可能 spill) |
| L2 Cache 复用 | 差 | 好 |
| 输出大小 | 大 | 小 |
### 典型性能曲线
```
Performance
│ ┌─────┐
│ / \
│ / \
│ / \
│ / \
└────/───────────────\────────→ block_size
64 128 256 512
最优点通常在 128-256 之间
```
## 优化建议
1. **优先使用 block_size=4096**
- 减少 chunk 数量,降低调度开销
- 更好的 amortization 效果
2. **优化 find_blocks_chunked**
- 当前是 block_size=4096 的主要瓶颈
- 考虑 GPU 加速或批量处理
3. **Pipeline 优化**
- 利用多 slot 的 ring buffer 实现计算和传输 overlap
- 当前已实现,但 find_blocks 是 CPU 操作,无法 overlap
## 测试命令
```bash
# GPU-only 模式 (需要 40GB+ VRAM)
bash scripts/profile_offload.sh --policy xattn --ctx-len 64k --no-offload --gpu 0
# Offload 模式block_size=4096
bash scripts/profile_offload.sh --policy xattn --ctx-len 64k --block-size 4096 --gpu 0
# Offload 模式block_size=1024
bash scripts/profile_offload.sh --policy xattn --ctx-len 64k --block-size 1024 --gpu 0
# 128K context
bash scripts/profile_offload.sh --policy xattn --ctx-len 128k --block-size 4096 --gpu 0
```

View File

@@ -22,7 +22,7 @@ class Config:
tensor_parallel_size: int = 1
enforce_eager: bool = False
hf_config: AutoConfig | None = None
eos: int = -1
eos: int | list[int] = -1 # Single EOS token or list of EOS tokens (e.g., GLM-4)
kvcache_block_size: int = 1024
num_kvcache_blocks: int = -1
dtype: str | None = None # "float16", "bfloat16", or None (use model default)
@@ -57,8 +57,11 @@ class Config:
assert os.path.isdir(self.model)
assert self.kvcache_block_size % 256 == 0
assert 1 <= self.tensor_parallel_size <= 8
self.hf_config = AutoConfig.from_pretrained(self.model)
self.max_model_len = min(self.max_model_len, self.hf_config.max_position_embeddings)
self.hf_config = AutoConfig.from_pretrained(self.model, trust_remote_code=True)
# Get max position embeddings (GLM-4 uses seq_length instead of max_position_embeddings)
max_pos = getattr(self.hf_config, 'max_position_embeddings',
getattr(self.hf_config, 'seq_length', 4096))
self.max_model_len = min(self.max_model_len, max_pos)
assert self.max_num_batched_tokens >= self.max_model_len
# Override torch_dtype if user specified

View File

@@ -10,7 +10,8 @@ from nanovllm.sampling_params import SamplingParams
from nanovllm.engine.sequence import Sequence
from nanovllm.engine.scheduler import Scheduler
from nanovllm.engine.model_runner import ModelRunner
from nanovllm.utils.observer import Observer
from nanovllm.utils.observer import InferenceObserver
from nanovllm.utils.memory_observer import MemoryObserver
class LLMEngine:
@@ -29,8 +30,14 @@ class LLMEngine:
self.ps.append(process)
self.events.append(event)
self.model_runner = ModelRunner(config, 0, self.events)
self.tokenizer = AutoTokenizer.from_pretrained(config.model, use_fast=True)
config.eos = self.tokenizer.eos_token_id
self.tokenizer = AutoTokenizer.from_pretrained(config.model, use_fast=True, trust_remote_code=True)
# Get EOS token(s) from config (may be int or list, e.g., GLM-4 uses list)
# Prefer hf_config.eos_token_id which contains full list, fallback to tokenizer
eos_from_config = getattr(config.hf_config, 'eos_token_id', None)
if eos_from_config is not None:
config.eos = eos_from_config
else:
config.eos = self.tokenizer.eos_token_id
# Set Sequence.block_size to match the KV cache block size
Sequence.block_size = config.kvcache_block_size
self.scheduler = Scheduler(config, self.model_runner.kvcache_manager)
@@ -58,15 +65,18 @@ class LLMEngine:
print(f"[DEBUG LLMEngine.step] Mode={mode}, active_sequences={len(seqs)}")
if not is_prefill:
# The end of the prefill mode. Get TTFT.
if Observer.ttft_start != 0:
Observer.ttft = perf_counter_ns() - Observer.ttft_start
Observer.reset_ttft()
# The start of the decode mode. Get TPOT.
if Observer.tpot_start != 0:
Observer.tpot = perf_counter_ns() - Observer.tpot_start
Observer.tpot_start = perf_counter_ns()
# Decode mode: calculate TPOT from previous decode step
if InferenceObserver.tpot_start != 0:
InferenceObserver.tpot = perf_counter_ns() - InferenceObserver.tpot_start
InferenceObserver.tpot_start = perf_counter_ns()
token_ids = self.model_runner.call("run", seqs, is_prefill)
if is_prefill:
# Calculate TTFT after prefill completes (including chunked prefill)
if InferenceObserver.ttft_start != 0:
InferenceObserver.ttft = perf_counter_ns() - InferenceObserver.ttft_start
InferenceObserver.reset_ttft()
self.scheduler.postprocess(seqs, token_ids)
outputs = [(seq.seq_id, seq.completion_token_ids) for seq in seqs if seq.is_finished]
@@ -91,7 +101,8 @@ class LLMEngine:
log_level = os.environ.get('NANOVLLM_LOG_LEVEL', 'INFO')
debug_enabled = log_level.upper() == 'DEBUG'
Observer.complete_reset()
InferenceObserver.complete_reset()
MemoryObserver.complete_reset()
if use_tqdm:
pbar = tqdm(total=len(prompts), desc="Generating", dynamic_ncols=True)
if not isinstance(sampling_params, list):
@@ -128,8 +139,8 @@ class LLMEngine:
pbar.set_postfix({
"Prefill": f"{int(prefill_throughput)}tok/s",
"Decode": f"{int(decode_throughput)}tok/s",
"ttft": f"{float(Observer.ttft) / 1e6}ms",
"tpot": f"{float(Observer.tpot) / 1e6}ms",
"ttft": f"{float(InferenceObserver.ttft) / 1e6}ms",
"tpot": f"{float(InferenceObserver.tpot) / 1e6}ms",
})
for seq_id, token_ids in output:
outputs[seq_id] = token_ids

View File

@@ -30,6 +30,18 @@ def _find_free_port() -> int:
return s.getsockname()[1]
def get_num_kv_heads(hf_config) -> int:
"""Get number of KV heads from config (handles GLM-4's multi_query_group_num)."""
return getattr(hf_config, 'num_key_value_heads',
getattr(hf_config, 'multi_query_group_num', hf_config.num_attention_heads))
def get_head_dim(hf_config) -> int:
"""Get head dimension from config (handles GLM-4's kv_channels)."""
return getattr(hf_config, "head_dim",
getattr(hf_config, "kv_channels", hf_config.hidden_size // hf_config.num_attention_heads))
class ModelRunner:
def __init__(self, config: Config, rank: int, event: Event | list[Event]):
@@ -144,8 +156,8 @@ class ModelRunner:
used = total - free
peak = torch.cuda.memory_stats()["allocated_bytes.all.peak"]
current = torch.cuda.memory_stats()["allocated_bytes.all.current"]
num_kv_heads = hf_config.num_key_value_heads // self.world_size
head_dim = getattr(hf_config, "head_dim", hf_config.hidden_size // hf_config.num_attention_heads)
num_kv_heads = get_num_kv_heads(hf_config) // self.world_size
head_dim = get_head_dim(hf_config)
block_bytes = 2 * hf_config.num_hidden_layers * self.block_size * num_kv_heads * head_dim * hf_config.torch_dtype.itemsize
# Calculate max GPU blocks based on available memory
@@ -202,19 +214,37 @@ class ModelRunner:
dtype=hf_config.torch_dtype,
)
# Initialize sparse policy if manager has one (CPU offload mode)
# Initialize sparse policy if manager has one (works for both CPU offload and GPU-only modes)
if hasattr(self.kvcache_manager, 'sparse_policy') and self.kvcache_manager.sparse_policy is not None:
# Use CPU blocks for offload mode, GPU blocks for GPU-only mode
num_blocks_for_init = config.num_cpu_kvcache_blocks if config.enable_cpu_offload else config.num_kvcache_blocks
self.kvcache_manager.sparse_policy.initialize(
num_layers=hf_config.num_hidden_layers,
num_kv_heads=num_kv_heads,
head_dim=head_dim,
num_cpu_blocks=config.num_cpu_kvcache_blocks,
num_cpu_blocks=num_blocks_for_init,
dtype=hf_config.torch_dtype,
device=torch.device("cuda"),
)
# Pre-allocate policy metadata buffers
# - Offload mode: allocate chunked prefill buffers (mask, KV chunking stats)
# - GPU-only mode: additionally allocate GQA expansion buffers
num_heads = hf_config.num_attention_heads // self.world_size
self.kvcache_manager.sparse_policy.alloc_policy_metadata(
num_heads=num_heads,
num_kv_heads=num_kv_heads,
head_dim=head_dim,
max_seq_len=config.max_model_len,
dtype=hf_config.torch_dtype,
device=torch.device("cuda"),
enable_cpu_offload=config.enable_cpu_offload,
)
# Log policy info (handle both enum and None cases)
policy_name = config.sparse_policy.name if config.sparse_policy is not None else "FULL"
logger.info(
f"Sparse policy initialized: {config.sparse_policy.name} "
f"Sparse policy initialized: {policy_name} "
f"(topk={config.sparse_topk_blocks}, threshold={config.sparse_threshold_blocks})"
)
@@ -375,7 +405,16 @@ class ModelRunner:
cu_seqlens_q = torch.tensor(cu_seqlens_q, dtype=torch.int32, pin_memory=True).cuda(non_blocking=True)
cu_seqlens_k = torch.tensor(cu_seqlens_k, dtype=torch.int32, pin_memory=True).cuda(non_blocking=True)
slot_mapping = torch.tensor(slot_mapping, dtype=torch.int32, pin_memory=True).cuda(non_blocking=True)
set_context(True, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k, slot_mapping, None, block_tables)
set_context(
is_prefill=True,
cu_seqlens_q=cu_seqlens_q,
cu_seqlens_k=cu_seqlens_k,
max_seqlen_q=max_seqlen_q,
max_seqlen_k=max_seqlen_k,
slot_mapping=slot_mapping,
block_tables=block_tables,
kvcache_manager=getattr(self, 'kvcache_manager', None),
)
return input_ids, positions
def prepare_decode(self, seqs: list[Sequence]):
@@ -404,7 +443,13 @@ class ModelRunner:
context_lens = torch.tensor(context_lens, dtype=torch.int32, pin_memory=True).cuda(non_blocking=True)
# Use GPU physical block tables for attention
block_tables = self._prepare_gpu_block_tables(gpu_block_tables)
set_context(False, slot_mapping=slot_mapping, context_lens=context_lens, block_tables=block_tables)
set_context(
is_prefill=False,
slot_mapping=slot_mapping,
context_lens=context_lens,
block_tables=block_tables,
kvcache_manager=self.kvcache_manager,
)
return input_ids, positions
def _prepare_gpu_block_tables(self, gpu_block_tables: list[list[int]]):
@@ -713,7 +758,13 @@ class ModelRunner:
for bs in reversed(self.graph_bs):
graph = torch.cuda.CUDAGraph()
set_context(False, slot_mapping=slot_mapping[:bs], context_lens=context_lens[:bs], block_tables=block_tables[:bs])
set_context(
is_prefill=False,
slot_mapping=slot_mapping[:bs],
context_lens=context_lens[:bs],
block_tables=block_tables[:bs],
kvcache_manager=self.kvcache_manager,
)
outputs[:bs] = self.model(input_ids[:bs], positions[:bs]) # warmup
with torch.cuda.graph(graph, self.graph_pool):
outputs[:bs] = self.model(input_ids[:bs], positions[:bs]) # capture
@@ -749,8 +800,8 @@ class ModelRunner:
- LastGraph: o_proj → post_norm → mlp → final_norm
"""
hf_config = self.config.hf_config
num_kv_heads = hf_config.num_key_value_heads // self.world_size
head_dim = getattr(hf_config, "head_dim", hf_config.hidden_size // hf_config.num_attention_heads)
num_kv_heads = get_num_kv_heads(hf_config) // self.world_size
head_dim = get_head_dim(hf_config)
# Create Decode Graph Manager (seq_len=1)
self.decode_graph_manager = OffloadGraphManager(

View File

@@ -4,7 +4,7 @@ from typing import TYPE_CHECKING
from nanovllm.config import Config
from nanovllm.engine.sequence import Sequence, SequenceStatus
from nanovllm.utils.observer import Observer
from nanovllm.utils.observer import InferenceObserver
if TYPE_CHECKING:
from nanovllm.kvcache import KVCacheManager
@@ -15,7 +15,9 @@ class Scheduler:
def __init__(self, config: Config, kvcache_manager: "KVCacheManager"):
self.max_num_seqs = config.max_num_seqs
self.max_num_batched_tokens = config.max_num_batched_tokens
self.eos = config.eos
# Convert EOS to set for efficient lookup (supports single int or list)
eos = config.eos
self.eos_set = set(eos) if isinstance(eos, list) else {eos}
self.kvcache_manager = kvcache_manager
self.waiting: deque[Sequence] = deque()
self.running: deque[Sequence] = deque()
@@ -32,8 +34,8 @@ class Scheduler:
num_seqs = 0
num_batched_tokens = 0
while self.waiting and num_seqs < self.max_num_seqs:
if Observer.ttft_start == 0:
Observer.ttft_start = perf_counter_ns()
if InferenceObserver.ttft_start == 0:
InferenceObserver.ttft_start = perf_counter_ns()
seq = self.waiting[0]
# Check if sequence is too large
@@ -94,7 +96,7 @@ class Scheduler:
def postprocess(self, seqs: list[Sequence], token_ids: list[int]) -> list[bool]:
for seq, token_id in zip(seqs, token_ids):
seq.append_token(token_id)
if (not seq.ignore_eos and token_id == self.eos) or seq.num_completion_tokens == seq.max_tokens:
if (not seq.ignore_eos and token_id in self.eos_set) or seq.num_completion_tokens == seq.max_tokens:
seq.status = SequenceStatus.FINISHED
self.kvcache_manager.deallocate(seq)
self.running.remove(seq)

View File

@@ -25,7 +25,7 @@ def create_kvcache_manager(config: "Config") -> KVCacheManager:
Factory function to create the appropriate KV cache manager.
Decision logic:
1. If enable_cpu_offload=False: use GPUOnlyManager
1. If enable_cpu_offload=False: use GPUOnlyManager (optionally with sparse policy)
2. If enable_cpu_offload=True but all blocks fit in GPU: use GPUOnlyManager
3. If enable_cpu_offload=True and need CPU blocks: use HybridKVCacheManager
@@ -37,9 +37,44 @@ def create_kvcache_manager(config: "Config") -> KVCacheManager:
"""
if not getattr(config, 'enable_cpu_offload', False):
# Default: pure GPU mode
# Check if sparse policy is requested for GPU-only mode
from nanovllm.config import SparsePolicyType
sparse_policy_type = getattr(config, 'sparse_policy', None)
# Handle None case - use FULL as default
if sparse_policy_type is None:
sparse_policy_type = SparsePolicyType.FULL
sparse_policy = None
if sparse_policy_type != SparsePolicyType.FULL:
# Create sparse policy for GPU-only mode
from nanovllm.kvcache.sparse import create_sparse_policy
policy_kwargs = {}
if sparse_policy_type == SparsePolicyType.QUEST:
policy_kwargs = {
'topk_blocks': getattr(config, 'sparse_topk_blocks', 8),
'threshold_blocks': getattr(config, 'sparse_threshold_blocks', 4),
}
elif sparse_policy_type == SparsePolicyType.XATTN_BSA:
policy_kwargs = {
'block_size': getattr(config, 'sparse_block_size', 128),
'samples_per_chunk': getattr(config, 'sparse_samples_per_chunk', 128),
'threshold': getattr(config, 'sparse_threshold', 0.9),
'use_triton': getattr(config, 'sparse_use_triton', True),
'stride': getattr(config, 'sparse_stride', 8),
'chunk_size': getattr(config, 'sparse_chunk_size', 16384),
}
sparse_policy = create_sparse_policy(sparse_policy_type, **policy_kwargs)
else:
# FULL policy for GPU-only mode - always create for consistent API
from nanovllm.kvcache.sparse import FullAttentionPolicy
sparse_policy = FullAttentionPolicy()
return GPUOnlyManager(
num_blocks=config.num_kvcache_blocks,
block_size=config.kvcache_block_size,
sparse_policy=sparse_policy,
)
# CPU offload is enabled

View File

@@ -7,13 +7,16 @@ the KVCacheManager interface.
"""
from collections import deque
from typing import List, Tuple, Dict, Optional
from typing import List, Tuple, Dict, Optional, TYPE_CHECKING
import torch
from torch import Tensor
from nanovllm.engine.sequence import Sequence
from nanovllm.kvcache.base_manager import KVCacheManager
if TYPE_CHECKING:
from nanovllm.kvcache.sparse.policy import SparsePolicy
class Block:
"""Physical block in GPU memory."""
@@ -50,17 +53,28 @@ class GPUOnlyManager(KVCacheManager):
all data stays on GPU at fixed addresses.
"""
def __init__(self, num_blocks: int, block_size: int):
def __init__(
self,
num_blocks: int,
block_size: int,
sparse_policy: Optional["SparsePolicy"] = None,
):
"""
Initialize GPU-only manager.
Args:
num_blocks: Total number of blocks to manage
block_size: Tokens per block (default 256)
sparse_policy: Optional sparse attention policy for GPU-only mode
"""
self._block_size = block_size
self._num_blocks = num_blocks
# Sparse policy for GPU-only mode (optional)
self.sparse_policy = sparse_policy
# No offload engine in GPU-only mode
self.offload_engine = None
# Block metadata
self.blocks: List[Block] = [Block(i) for i in range(num_blocks)]

View File

@@ -17,6 +17,7 @@ from dataclasses import dataclass
from nanovllm.kvcache.kernels import gathered_copy_kv
from nanovllm.comm import memcpy_2d_async
from nanovllm.utils.logger import get_logger
from nanovllm.utils.memory_observer import MemoryObserver
# Import for type hints only (avoid circular import)
from typing import TYPE_CHECKING
@@ -376,7 +377,8 @@ class OffloadEngine:
self.ring_slot_compute_done[slot_idx].record()
def load_to_slot_layer(
self, slot_idx: int, layer_id: int, cpu_block_id: int, chunk_idx: int = -1
self, slot_idx: int, layer_id: int, cpu_block_id: int, chunk_idx: int = -1,
is_prefill: bool = True,
) -> None:
"""
Async load a single CPU block to a ring buffer slot for one layer.
@@ -393,6 +395,7 @@ class OffloadEngine:
layer_id: Layer index to load (for CPU cache indexing)
cpu_block_id: Source CPU block ID
chunk_idx: Optional chunk index for NVTX labeling (-1 means not specified)
is_prefill: True if in prefill phase, False if in decode phase (for MemoryObserver)
"""
logger.debug(f"Ring load: layer={layer_id}, CPU[{cpu_block_id}] -> GPU slot[{slot_idx}]")
@@ -425,6 +428,65 @@ class OffloadEngine:
self.ring_slot_ready[slot_idx].record(stream)
nvtx.pop_range()
# Record H2D transfer: K + V = 2 * block_bytes
MemoryObserver.record_h2d(2 * self.gpu_block_bytes, is_prefill=is_prefill)
def load_k_only_to_slot_layer(
self, slot_idx: int, layer_id: int, cpu_block_id: int, chunk_idx: int = -1,
is_prefill: bool = True,
) -> None:
"""
Async load only K (not V) from CPU block to GPU slot.
Used by XAttention estimate phase which only needs K for attention score
computation. Saves 50% communication compared to loading K+V.
Args:
slot_idx: Target GPU slot index
layer_id: Layer index to load (for CPU cache indexing)
cpu_block_id: Source CPU block ID
chunk_idx: Optional chunk index for NVTX labeling (-1 means not specified)
is_prefill: True if in prefill phase, False if in decode phase
"""
logger.debug(f"Ring load K-only: layer={layer_id}, CPU[{cpu_block_id}] -> GPU slot[{slot_idx}]")
stream = self.slot_transfer_streams[slot_idx]
if chunk_idx >= 0:
nvtx_label = f"H2D K-only: L{layer_id} Chunk{chunk_idx} CPU[{cpu_block_id}]->Slot[{slot_idx}]"
else:
nvtx_label = f"H2D K-only: L{layer_id} CPU[{cpu_block_id}]->Slot[{slot_idx}]"
nvtx.push_range(message=nvtx_label, color="cyan")
with torch.cuda.stream(stream):
stream.wait_event(self.ring_slot_compute_done[slot_idx])
stream.wait_event(self.ring_slot_offload_done[slot_idx])
# Only copy K, not V
self.k_cache_gpu[slot_idx].copy_(
self.k_cache_cpu[layer_id, cpu_block_id], non_blocking=True
)
self.ring_slot_ready[slot_idx].record(stream)
nvtx.pop_range()
# Record H2D transfer: K only = 1 * block_bytes
MemoryObserver.record_h2d(self.gpu_block_bytes, is_prefill=is_prefill)
def get_k_for_slot(self, slot_idx: int) -> Tensor:
"""
Get only K for a ring buffer slot (no V).
Used by XAttention estimate phase which only needs K for attention
score computation.
Args:
slot_idx: GPU slot index
Returns:
k_cache, shape: [1, block_size, kv_heads, head_dim]
"""
return self.k_cache_gpu[slot_idx].unsqueeze(0)
def wait_slot_layer(self, slot_idx: int) -> None:
"""
Wait for a slot's loading to complete.
@@ -499,6 +561,9 @@ class OffloadEngine:
self.ring_slot_offload_done[slot_idx].record(self.transfer_stream_main)
nvtx.pop_range()
# Record D2H transfer: K + V = 2 * block_bytes
MemoryObserver.record_d2h(2 * self.gpu_block_bytes, is_prefill=is_prefill)
# ----- KV access methods for ring buffer -----
def get_kv_for_slot(self, slot_idx: int) -> Tuple[Tensor, Tensor]:
@@ -745,6 +810,10 @@ class OffloadEngine:
self.prefill_v_buffer[layer_id, :num_tokens].copy_(v)
torch.cuda.nvtx.range_pop()
# Record D2D transfer: K + V
transfer_bytes = 2 * k.numel() * k.element_size()
MemoryObserver.record_d2d(transfer_bytes)
def write_to_decode_buffer(
self,
layer_id: int,
@@ -768,6 +837,10 @@ class OffloadEngine:
self.decode_v_buffer[layer_id, pos_in_block].copy_(v)
torch.cuda.nvtx.range_pop()
# Record D2D transfer: K + V (single token)
transfer_bytes = 2 * k.numel() * k.element_size()
MemoryObserver.record_d2d(transfer_bytes)
def offload_prefill_buffer_async(
self,
layer_id: int,
@@ -813,6 +886,9 @@ class OffloadEngine:
self.prefill_offload_events[layer_id].record(stream)
nvtx.pop_range()
# Record D2H transfer: K + V = 2 * block_bytes
MemoryObserver.record_d2h(2 * self.gpu_block_bytes, is_prefill=True)
def wait_all_prefill_offloads(self) -> None:
"""Wait for all prefill buffer offloads to complete."""
for stream in self.prefill_offload_streams:
@@ -851,6 +927,11 @@ class OffloadEngine:
v_sample = self.v_cache_cpu[
layer_id, cpu_block_id, :num_samples
].clone().cuda()
# Record H2D transfer: K + V samples
transfer_bytes = 2 * k_sample.numel() * k_sample.element_size()
MemoryObserver.record_h2d(transfer_bytes, is_prefill=True)
return k_sample, v_sample
def load_block_full_from_cpu(
@@ -877,4 +958,8 @@ class OffloadEngine:
v_full = self.v_cache_cpu[
layer_id, cpu_block_id
].clone().cuda()
# Record H2D transfer: K + V full block
MemoryObserver.record_h2d(2 * self.gpu_block_bytes, is_prefill=True)
return k_full, v_full

View File

@@ -47,6 +47,8 @@ class FullAttentionPolicy(SparsePolicy):
available_blocks: List[int],
offload_engine: "OffloadEngine",
ctx: PolicyContext,
q: torch.Tensor,
k: torch.Tensor,
) -> List[int]:
"""Return all blocks - no sparsity."""
# Update statistics (only for layer 0 to avoid overcounting)
@@ -76,6 +78,75 @@ class FullAttentionPolicy(SparsePolicy):
logger.info(f"[Full Policy] Density Stats: chunks={stats['num_chunks']}, "
f"blocks={stats['total_available_blocks']}, density=100.0%")
# =========================================================================
# GPU-only methods (non-chunked)
# =========================================================================
def compute_prefill(
self,
q: torch.Tensor,
k: torch.Tensor,
v: torch.Tensor,
cu_seqlens_q: torch.Tensor,
cu_seqlens_k: torch.Tensor,
max_seqlen_q: int,
max_seqlen_k: int,
softmax_scale: float,
layer_id: int,
block_tables: Optional[torch.Tensor] = None,
) -> torch.Tensor:
"""
GPU-only prefill attention using flash_attn_varlen_func.
This is the simplest implementation - just call flash attention directly.
For sparse policies, this method would implement block selection.
"""
from flash_attn import flash_attn_varlen_func
return flash_attn_varlen_func(
q, k, v,
cu_seqlens_q=cu_seqlens_q,
cu_seqlens_k=cu_seqlens_k,
max_seqlen_q=max_seqlen_q,
max_seqlen_k=max_seqlen_k,
softmax_scale=softmax_scale,
causal=True,
block_table=block_tables,
)
def compute_decode(
self,
q: torch.Tensor,
k_cache: torch.Tensor,
v_cache: torch.Tensor,
cache_seqlens: torch.Tensor,
softmax_scale: float,
layer_id: int,
block_tables: Optional[torch.Tensor] = None,
) -> torch.Tensor:
"""
GPU-only decode attention using flash_attn_with_kvcache.
This is the simplest implementation - just call flash attention directly.
For sparse policies, this method would implement block selection.
"""
from flash_attn import flash_attn_with_kvcache
# q is [batch, num_heads, head_dim], need to add seq dim
return flash_attn_with_kvcache(
q.unsqueeze(1), # [batch, 1, heads, dim]
k_cache,
v_cache,
cache_seqlens=cache_seqlens,
block_table=block_tables,
softmax_scale=softmax_scale,
causal=True,
)
# =========================================================================
# Chunked offload methods
# =========================================================================
def compute_chunked_prefill(
self,
q: torch.Tensor,
@@ -116,7 +187,11 @@ class FullAttentionPolicy(SparsePolicy):
Returns:
Attention output [seq_len, num_heads, head_dim]
"""
from nanovllm.ops.chunked_attention import flash_attn_with_lse, merge_attention_outputs
# Use FlashInfer-based implementations (more optimized)
from nanovllm.ops.chunked_attention import (
flash_attn_with_lse_flashinfer as flash_attn_with_lse,
merge_attention_outputs_flashinfer as merge_attention_outputs,
)
logger.debug(f"[DEBUG] FullPolicy.compute_chunked_prefill called, "
f"layer={layer_id}, chunk={current_chunk_idx}, num_tokens={num_tokens}, "
@@ -244,7 +319,11 @@ class FullAttentionPolicy(SparsePolicy):
Returns:
Attention output [batch_size, 1, num_heads, head_dim]
"""
from nanovllm.ops.chunked_attention import flash_attn_with_lse, merge_attention_outputs
# Use FlashInfer-based implementations (more optimized)
from nanovllm.ops.chunked_attention import (
flash_attn_with_lse_flashinfer as flash_attn_with_lse,
merge_attention_outputs_flashinfer as merge_attention_outputs,
)
# q shape: [batch_size, num_heads, head_dim] (single decode token per sequence)
q_batched = q.unsqueeze(1) # [batch, 1, heads, dim]
@@ -336,7 +415,11 @@ class FullAttentionPolicy(SparsePolicy):
Loads one block at a time, computes attention, and merges results.
Uses load_to_slot_layer / wait_slot_layer / get_kv_for_slot methods.
"""
from nanovllm.ops.chunked_attention import flash_attn_with_lse, merge_attention_outputs
# Use FlashInfer-based implementations (more optimized)
from nanovllm.ops.chunked_attention import (
flash_attn_with_lse_flashinfer as flash_attn_with_lse,
merge_attention_outputs_flashinfer as merge_attention_outputs,
)
num_blocks = len(cpu_block_table)
if num_blocks == 0:
@@ -353,7 +436,7 @@ class FullAttentionPolicy(SparsePolicy):
num_preload = min(num_slots, num_blocks)
for i in range(num_preload):
cpu_block_id = cpu_block_table[i]
offload_engine.load_to_slot_layer(load_slots[i], layer_id, cpu_block_id, chunk_idx=cpu_block_id)
offload_engine.load_to_slot_layer(load_slots[i], layer_id, cpu_block_id, chunk_idx=cpu_block_id, is_prefill=False)
# Phase 2: Process blocks with pipeline
for block_idx in range(num_blocks):
@@ -387,7 +470,7 @@ class FullAttentionPolicy(SparsePolicy):
next_block_idx = block_idx + num_slots
if next_block_idx < num_blocks:
next_cpu_block_id = cpu_block_table[next_block_idx]
offload_engine.load_to_slot_layer(current_slot, layer_id, next_cpu_block_id, chunk_idx=next_cpu_block_id)
offload_engine.load_to_slot_layer(current_slot, layer_id, next_cpu_block_id, chunk_idx=next_cpu_block_id, is_prefill=False)
# Merge with accumulated
with torch.cuda.stream(compute_stream):

View File

@@ -108,12 +108,45 @@ class SparsePolicy(ABC):
"""
pass
def alloc_policy_metadata(
self,
num_heads: int,
num_kv_heads: int,
head_dim: int,
max_seq_len: int,
dtype: torch.dtype,
device: torch.device,
enable_cpu_offload: bool = False,
) -> None:
"""
Pre-allocate GPU buffers for policy computation.
Called by the framework after KV cache allocation. Implementations should
use enable_cpu_offload to decide which buffers to allocate:
- Offload mode: allocate chunked prefill buffers (mask, KV chunking stats)
- GPU-only mode: additionally allocate GQA expansion buffers
This is separate from initialize() which is used for CPU offload metadata.
Args:
num_heads: Number of query heads
num_kv_heads: Number of KV heads (for GQA)
head_dim: Dimension per head
max_seq_len: Maximum sequence length (for buffer sizing)
dtype: Data type (typically float16/bfloat16)
device: Target device (cuda)
enable_cpu_offload: Whether CPU offload is enabled
"""
pass
@abstractmethod
def select_blocks(
self,
available_blocks: List[int],
offload_engine: "OffloadEngine",
ctx: PolicyContext,
q: torch.Tensor,
k: torch.Tensor,
) -> List[int]:
"""
Select which KV blocks to load for the current query chunk.
@@ -130,6 +163,8 @@ class SparsePolicy(ABC):
to load KV to make selection decisions).
ctx: PolicyContext with information about the current query
chunk, layer, phase (prefill/decode), etc.
q: Query tensor [seq_len, num_heads, head_dim] for current chunk
k: Key tensor [seq_len, num_kv_heads, head_dim] for current chunk
Returns:
List of block IDs to load (must be a subset of available_blocks).
@@ -191,6 +226,87 @@ class SparsePolicy(ABC):
"""
pass
# =========================================================================
# GPU-only methods (non-chunked)
# These methods are used when all KV cache is on GPU, no CPU offload needed.
# =========================================================================
def compute_prefill(
self,
q: torch.Tensor,
k: torch.Tensor,
v: torch.Tensor,
cu_seqlens_q: torch.Tensor,
cu_seqlens_k: torch.Tensor,
max_seqlen_q: int,
max_seqlen_k: int,
softmax_scale: float,
layer_id: int,
block_tables: Optional[torch.Tensor] = None,
) -> torch.Tensor:
"""
Compute GPU-only prefill attention (non-chunked).
This method is used when all KV cache resides on GPU (no CPU offload).
Override this to implement sparse prefill attention for GPU-only mode.
Default implementation raises NotImplementedError.
Args:
q: [total_q, num_heads, head_dim] query tensor (packed variable length)
k: [total_kv, num_kv_heads, head_dim] key tensor
v: [total_kv, num_kv_heads, head_dim] value tensor
cu_seqlens_q: [batch+1] cumulative sequence lengths for queries
cu_seqlens_k: [batch+1] cumulative sequence lengths for keys
max_seqlen_q: maximum query sequence length
max_seqlen_k: maximum key sequence length
softmax_scale: softmax scaling factor (1/sqrt(head_dim))
layer_id: transformer layer index
block_tables: [batch, max_blocks] paged attention block tables (optional)
Returns:
[total_q, num_heads, head_dim] attention output
"""
raise NotImplementedError(
f"{self.__class__.__name__} does not implement compute_prefill for GPU-only mode"
)
def compute_decode(
self,
q: torch.Tensor,
k_cache: torch.Tensor,
v_cache: torch.Tensor,
cache_seqlens: torch.Tensor,
softmax_scale: float,
layer_id: int,
block_tables: Optional[torch.Tensor] = None,
) -> torch.Tensor:
"""
Compute GPU-only decode attention (non-chunked).
This method is used when all KV cache resides on GPU (no CPU offload).
Override this to implement sparse decode attention for GPU-only mode.
Default implementation raises NotImplementedError.
Args:
q: [batch, num_heads, head_dim] query tensor (single token per sequence)
k_cache: [num_blocks, block_size, num_kv_heads, head_dim] paged key cache
v_cache: [num_blocks, block_size, num_kv_heads, head_dim] paged value cache
cache_seqlens: [batch] sequence lengths in cache
softmax_scale: softmax scaling factor (1/sqrt(head_dim))
layer_id: transformer layer index
block_tables: [batch, max_blocks] paged attention block tables (optional)
Returns:
[batch, 1, num_heads, head_dim] attention output
"""
raise NotImplementedError(
f"{self.__class__.__name__} does not implement compute_decode for GPU-only mode"
)
# =========================================================================
# Chunked offload methods (for CPU offload mode)
# =========================================================================
@abstractmethod
def compute_chunked_prefill(
self,

View File

@@ -191,13 +191,26 @@ class QuestPolicy(SparsePolicy):
def select_blocks(
self,
available_blocks: List[int],
offload_engine: "OffloadEngine",
ctx: PolicyContext,
q: torch.Tensor,
k: torch.Tensor,
) -> List[int]:
"""
Select Top-K blocks based on query-key similarity bounds.
If query is not available (some prefill scenarios), falls back
to loading all blocks.
Args:
available_blocks: List of CPU block IDs
offload_engine: OffloadEngine for loading KV (unused in Quest)
ctx: PolicyContext with metadata
q: Query tensor [seq_len, num_heads, head_dim] or [batch, seq_len, num_heads, head_dim]
k: Key tensor [seq_len, num_kv_heads, head_dim] (unused in Quest, uses metadata instead)
Returns:
Selected block IDs
"""
if self.metadata is None:
raise RuntimeError(
@@ -211,7 +224,7 @@ class QuestPolicy(SparsePolicy):
if n <= self.config.threshold_blocks:
return available_blocks
if ctx.query is None:
if q is None:
# No query available - cannot compute scores
return available_blocks
@@ -221,11 +234,10 @@ class QuestPolicy(SparsePolicy):
)
# Metadata is already on GPU, same device as query
device = ctx.query.device
device = q.device
# Compute upper bound scores
# query shape: [1, num_heads, head_dim] or [1, seq_len, num_heads, head_dim]
q = ctx.query
# query shape: [seq_len, num_heads, head_dim] or [batch, seq_len, num_heads, head_dim]
if q.dim() == 4:
# Prefill: use mean over sequence length
q = q.mean(dim=1) # [1, num_heads, head_dim]

File diff suppressed because it is too large Load Diff

View File

@@ -124,24 +124,47 @@ class Attention(nn.Module):
if k_cache.numel() and v_cache.numel():
store_kvcache(k, v, k_cache, v_cache, context.slot_mapping)
# Get sparse_policy from kvcache_manager (required, never None after warmup)
# During warmup, kvcache_manager is not yet allocated
if context.kvcache_manager is None:
# Warmup phase: use flash_attn directly
if context.is_prefill:
return flash_attn_varlen_func(
q, k, v,
max_seqlen_q=context.max_seqlen_q, cu_seqlens_q=context.cu_seqlens_q,
max_seqlen_k=context.max_seqlen_k, cu_seqlens_k=context.cu_seqlens_k,
softmax_scale=self.scale, causal=True,
)
else:
return flash_attn_with_kvcache(
q.unsqueeze(1), k_cache, v_cache,
cache_seqlens=context.context_lens, block_table=context.block_tables,
softmax_scale=self.scale, causal=True,
)
sparse_policy = context.kvcache_manager.sparse_policy
assert sparse_policy is not None, "sparse_policy must not be None"
if context.is_prefill:
if context.is_chunked_prefill:
# Chunked prefill: merge attention from previous KV
# Chunked prefill: merge attention from previous KV (CPU offload mode)
o = self._chunked_prefill_attention(q, k, v, context)
elif context.block_tables is not None: # prefix cache
k, v = k_cache, v_cache
o = flash_attn_varlen_func(q, k, v,
max_seqlen_q=context.max_seqlen_q, cu_seqlens_q=context.cu_seqlens_q,
max_seqlen_k=context.max_seqlen_k, cu_seqlens_k=context.cu_seqlens_k,
softmax_scale=self.scale, causal=True, block_table=context.block_tables)
else:
o = flash_attn_varlen_func(q, k, v,
max_seqlen_q=context.max_seqlen_q, cu_seqlens_q=context.cu_seqlens_q,
max_seqlen_k=context.max_seqlen_k, cu_seqlens_k=context.cu_seqlens_k,
softmax_scale=self.scale, causal=True, block_table=context.block_tables)
# GPU-only mode: use policy for attention
# Use paged attention if block_tables provided, else use k, v directly
if context.block_tables is not None:
k_for_attn, v_for_attn = k_cache, v_cache
else:
k_for_attn, v_for_attn = k, v
o = sparse_policy.compute_prefill(
q, k_for_attn, v_for_attn,
context.cu_seqlens_q, context.cu_seqlens_k,
context.max_seqlen_q, context.max_seqlen_k,
self.scale, self.layer_id,
context.block_tables,
)
else: # decode
if context.is_chunked_prefill:
# Chunked decode: need to load all KV from CPU+GPU
# Chunked decode: need to load all KV from CPU+GPU (CPU offload mode)
# Store current decode token to per-layer decode buffer
# This is needed because GPU cache has no layer dimension,
# so all layers would overwrite each other in decode_slot.
@@ -152,9 +175,12 @@ class Attention(nn.Module):
offload_engine.write_to_decode_buffer(self.layer_id, pos_in_block, k.squeeze(0), v.squeeze(0))
o = self._chunked_decode_attention(q, k, v, context)
else:
o = flash_attn_with_kvcache(q.unsqueeze(1), k_cache, v_cache,
cache_seqlens=context.context_lens, block_table=context.block_tables,
softmax_scale=self.scale, causal=True)
# GPU-only mode: use policy for attention
o = sparse_policy.compute_decode(
q, k_cache, v_cache,
context.context_lens, self.scale, self.layer_id,
context.block_tables,
)
return o
def _chunked_prefill_attention(
@@ -195,20 +221,19 @@ class Attention(nn.Module):
cpu_block_table = kvcache_manager.get_prefilled_cpu_blocks(seq)
# Step 2: Apply select_blocks to filter blocks (before calling compute_chunked_prefill)
selected_blocks = []
if cpu_block_table:
num_chunks = current_chunk_idx + 1
policy_ctx = PolicyContext(
query_chunk_idx=current_chunk_idx,
num_query_chunks=num_chunks,
layer_id=self.layer_id,
query=q, # Pass query for sparse policies that need it
is_prefill=True,
block_size=kvcache_manager.block_size,
total_kv_len=len(cpu_block_table) * kvcache_manager.block_size,
)
selected_blocks = sparse_policy.select_blocks(cpu_block_table, offload_engine, policy_ctx)
logger.debug(f"[DEBUG] select_blocks: {len(cpu_block_table)} -> {len(selected_blocks)} blocks")
# Always call select_blocks even for first chunk (cpu_block_table may be empty)
num_chunks = current_chunk_idx + 1
policy_ctx = PolicyContext(
query_chunk_idx=current_chunk_idx,
num_query_chunks=num_chunks,
layer_id=self.layer_id,
query=q, # Pass query for sparse policies that need it
is_prefill=True,
block_size=kvcache_manager.block_size,
total_kv_len=len(cpu_block_table) * kvcache_manager.block_size if cpu_block_table else 0,
)
selected_blocks = sparse_policy.select_blocks(cpu_block_table, offload_engine, policy_ctx, q, k)
logger.debug(f"[DEBUG] select_blocks: {len(cpu_block_table)} -> {len(selected_blocks)} blocks")
# [DEBUG] Verify execution path
logger.debug(f"[DEBUG] Calling sparse_policy.compute_chunked_prefill, "
@@ -294,7 +319,7 @@ class Attention(nn.Module):
block_size=kvcache_manager.block_size,
total_kv_len=len(cpu_block_table) * kvcache_manager.block_size,
)
selected_blocks = sparse_policy.select_blocks(cpu_block_table, offload_engine, policy_ctx)
selected_blocks = sparse_policy.select_blocks(cpu_block_table, offload_engine, policy_ctx, q, k)
logger.debug(f"[DEBUG] decode select_blocks: {len(cpu_block_table)} -> {len(selected_blocks)} blocks")
# [DEBUG] Verify execution path

View File

@@ -8,12 +8,43 @@ def apply_rotary_emb(
cos: torch.Tensor,
sin: torch.Tensor,
) -> torch.Tensor:
"""Non-interleaved RoPE (used by Llama, Qwen, etc.)"""
x1, x2 = torch.chunk(x.float(), 2, dim=-1)
y1 = x1 * cos - x2 * sin
y2 = x2 * cos + x1 * sin
return torch.cat((y1, y2), dim=-1).to(x.dtype)
def apply_rotary_emb_interleaved(
x: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor,
) -> torch.Tensor:
"""Interleaved RoPE (used by GLM-4, etc.)
Args:
x: [seq_len, num_heads, head_dim]
cos: [seq_len, 1, head_dim // 2]
sin: [seq_len, 1, head_dim // 2]
x is reshaped to [seq_len, num_heads, head_dim // 2, 2] where:
- x[..., 0] are even positions
- x[..., 1] are odd positions
"""
rot_dim = x.shape[-1]
# x_shaped: [seq_len, num_heads, rot_dim // 2, 2]
x_shaped = x.float().reshape(*x.shape[:-1], rot_dim // 2, 2)
# x_0, x_1: [seq_len, num_heads, rot_dim // 2]
x_0 = x_shaped[..., 0]
x_1 = x_shaped[..., 1]
# cos/sin: [seq_len, 1, rot_dim // 2] - broadcasts to num_heads
x_out = torch.stack([
x_0 * cos - x_1 * sin,
x_1 * cos + x_0 * sin,
], dim=-1)
return x_out.flatten(-2).to(x.dtype)
class RotaryEmbedding(nn.Module):
def __init__(
@@ -140,6 +171,76 @@ class Llama3RotaryEmbedding(nn.Module):
return query, key
class GLM4RotaryEmbedding(nn.Module):
"""
GLM-4 RoPE with interleaved rotation and partial rotation.
GLM-4 uses:
- Interleaved rotation (pairs adjacent elements, not first/second half)
- rope_ratio to scale base: base = 10000 * rope_ratio
- Partial rotation: only rotates first rotary_dim elements, rest pass through
- rotary_dim = head_dim // 2 (only half of head_dim is rotated)
"""
def __init__(
self,
head_size: int,
rotary_dim: int,
max_position_embeddings: int,
base: float,
) -> None:
super().__init__()
self.head_size = head_size
self.rotary_dim = rotary_dim # GLM-4: rotary_dim = head_dim // 2
# inv_freq shape: [rotary_dim // 2]
inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float) / rotary_dim))
t = torch.arange(max_position_embeddings, dtype=torch.float)
freqs = torch.einsum("i,j -> ij", t, inv_freq) # [max_pos, rotary_dim // 2]
cos = freqs.cos()
sin = freqs.sin()
# cache shape [max_pos, 1, rotary_dim // 2, 2]
cache = torch.stack((cos, sin), dim=-1).unsqueeze_(1)
self.register_buffer("cos_sin_cache", cache, persistent=False)
@torch.compile
def forward(
self,
positions: torch.Tensor,
query: torch.Tensor,
key: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor]:
"""
Apply RoPE to query and key.
Args:
positions: [seq_len]
query: [seq_len, num_heads, head_dim]
key: [seq_len, num_kv_heads, head_dim]
Returns:
Rotated query and key with same shapes as input.
"""
cache = self.cos_sin_cache[positions] # [seq_len, 1, rotary_dim // 2, 2]
cos = cache[..., 0] # [seq_len, 1, rotary_dim // 2]
sin = cache[..., 1] # [seq_len, 1, rotary_dim // 2]
# Split into rotated and pass-through parts
q_rot = query[..., :self.rotary_dim]
q_pass = query[..., self.rotary_dim:]
k_rot = key[..., :self.rotary_dim]
k_pass = key[..., self.rotary_dim:]
# Apply interleaved RoPE to rotated part
q_rot = apply_rotary_emb_interleaved(q_rot, cos, sin)
k_rot = apply_rotary_emb_interleaved(k_rot, cos, sin)
# Concatenate rotated and pass-through parts
query = torch.cat([q_rot, q_pass], dim=-1)
key = torch.cat([k_rot, k_pass], dim=-1)
return query, key
# Cache for RoPE instances (keyed by hashable parameters)
_rope_cache: dict[tuple, nn.Module] = {}
@@ -150,10 +251,11 @@ def get_rope(
max_position: int,
base: float,
rope_scaling: dict | None = None,
is_interleaved: bool = False,
):
# Create hashable cache key
if rope_scaling is None:
cache_key = (head_size, rotary_dim, max_position, base, None)
cache_key = (head_size, rotary_dim, max_position, base, None, is_interleaved)
else:
rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", "default"))
if rope_type == "llama3":
@@ -163,15 +265,19 @@ def get_rope(
rope_scaling["low_freq_factor"],
rope_scaling["high_freq_factor"],
rope_scaling["original_max_position_embeddings"],
is_interleaved,
)
else:
cache_key = (head_size, rotary_dim, max_position, base, rope_type)
cache_key = (head_size, rotary_dim, max_position, base, rope_type, is_interleaved)
if cache_key in _rope_cache:
return _rope_cache[cache_key]
if rope_scaling is None:
rope = RotaryEmbedding(head_size, rotary_dim, max_position, base)
if is_interleaved:
rope = GLM4RotaryEmbedding(head_size, rotary_dim, max_position, base)
else:
rope = RotaryEmbedding(head_size, rotary_dim, max_position, base)
else:
rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", "default"))
if rope_type == "llama3":

View File

@@ -3,7 +3,9 @@
from nanovllm.models.registry import register_model, get_model_class, MODEL_REGISTRY
# Import models to trigger registration
from nanovllm.models import qwen2
from nanovllm.models import qwen3
from nanovllm.models import llama
from nanovllm.models import glm4
__all__ = ["register_model", "get_model_class", "MODEL_REGISTRY"]

235
nanovllm/models/glm4.py Normal file
View File

@@ -0,0 +1,235 @@
"""GLM-4 model implementation for nano-vllm."""
import torch
from torch import nn
import torch.distributed as dist
from nanovllm.layers.activation import SiluAndMul
from nanovllm.layers.attention import Attention
from nanovllm.layers.layernorm import RMSNorm
from nanovllm.layers.linear import QKVParallelLinear, MergedColumnParallelLinear, RowParallelLinear
from nanovllm.layers.rotary_embedding import get_rope
from nanovllm.layers.embed_head import VocabParallelEmbedding, ParallelLMHead
from nanovllm.models.registry import register_model
class GLM4Attention(nn.Module):
def __init__(
self,
hidden_size: int,
num_heads: int,
num_kv_heads: int,
max_position: int = 1048576,
head_dim: int = 128,
rope_theta: float = 10000,
rope_scaling: dict | None = None,
) -> None:
super().__init__()
tp_size = dist.get_world_size()
self.total_num_heads = num_heads
assert self.total_num_heads % tp_size == 0
self.num_heads = self.total_num_heads // tp_size
self.total_num_kv_heads = num_kv_heads
assert self.total_num_kv_heads % tp_size == 0
self.num_kv_heads = self.total_num_kv_heads // tp_size
self.head_dim = head_dim
self.q_size = self.num_heads * self.head_dim
self.kv_size = self.num_kv_heads * self.head_dim
self.scaling = self.head_dim ** -0.5
self.qkv_proj = QKVParallelLinear(
hidden_size,
self.head_dim,
self.total_num_heads,
self.total_num_kv_heads,
bias=True, # GLM-4 has QKV bias
)
self.o_proj = RowParallelLinear(
self.total_num_heads * self.head_dim,
hidden_size,
bias=False, # GLM-4 has no output bias
)
# GLM-4 only rotates half of head_dim
rotary_dim = self.head_dim // 2
self.rotary_emb = get_rope(
self.head_dim,
rotary_dim=rotary_dim,
max_position=max_position,
base=rope_theta,
rope_scaling=rope_scaling,
is_interleaved=True, # GLM-4 uses interleaved RoPE
)
self.attn = Attention(
self.num_heads,
self.head_dim,
self.scaling,
self.num_kv_heads,
)
def forward(
self,
positions: torch.Tensor,
hidden_states: torch.Tensor,
) -> torch.Tensor:
qkv = self.qkv_proj(hidden_states)
q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
q = q.view(-1, self.num_heads, self.head_dim)
k = k.view(-1, self.num_kv_heads, self.head_dim)
v = v.view(-1, self.num_kv_heads, self.head_dim)
q, k = self.rotary_emb(positions, q, k)
o = self.attn(q, k, v)
output = self.o_proj(o.flatten(1, -1))
return output
class GLM4MLP(nn.Module):
def __init__(
self,
hidden_size: int,
intermediate_size: int,
) -> None:
super().__init__()
self.gate_up_proj = MergedColumnParallelLinear(
hidden_size,
[intermediate_size] * 2,
bias=False, # GLM-4 has no MLP bias
)
self.down_proj = RowParallelLinear(
intermediate_size,
hidden_size,
bias=False,
)
self.act_fn = SiluAndMul()
def forward(self, x):
gate_up = self.gate_up_proj(x)
x = self.act_fn(gate_up)
x = self.down_proj(x)
return x
class GLM4DecoderLayer(nn.Module):
def __init__(self, config) -> None:
super().__init__()
# GLM-4 config field mapping
hidden_size = config.hidden_size
num_heads = config.num_attention_heads
num_kv_heads = getattr(config, 'multi_query_group_num', num_heads)
head_dim = getattr(config, 'kv_channels', hidden_size // num_heads)
max_position = getattr(config, 'seq_length', 1048576)
rope_ratio = getattr(config, 'rope_ratio', 1)
rope_theta = 10000 * rope_ratio # GLM-4 uses rope_ratio to scale base
intermediate_size = getattr(config, 'ffn_hidden_size', getattr(config, 'intermediate_size', None))
rms_norm_eps = getattr(config, 'layernorm_epsilon', 1e-5)
self.self_attn = GLM4Attention(
hidden_size=hidden_size,
num_heads=num_heads,
num_kv_heads=num_kv_heads,
max_position=max_position,
head_dim=head_dim,
rope_theta=rope_theta,
rope_scaling=getattr(config, "rope_scaling", None),
)
self.mlp = GLM4MLP(
hidden_size=hidden_size,
intermediate_size=intermediate_size,
)
self.input_layernorm = RMSNorm(hidden_size, eps=rms_norm_eps)
self.post_attention_layernorm = RMSNorm(hidden_size, eps=rms_norm_eps)
def forward(
self,
positions: torch.Tensor,
hidden_states: torch.Tensor,
residual: torch.Tensor | None,
) -> tuple[torch.Tensor, torch.Tensor]:
if residual is None:
hidden_states, residual = self.input_layernorm(hidden_states), hidden_states
else:
hidden_states, residual = self.input_layernorm(hidden_states, residual)
hidden_states = self.self_attn(positions, hidden_states)
hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
hidden_states = self.mlp(hidden_states)
return hidden_states, residual
class GLM4Model(nn.Module):
def __init__(self, config) -> None:
super().__init__()
vocab_size = getattr(config, 'padded_vocab_size', config.vocab_size)
num_layers = getattr(config, 'num_layers', config.num_hidden_layers)
rms_norm_eps = getattr(config, 'layernorm_epsilon', 1e-5)
self.embed_tokens = VocabParallelEmbedding(vocab_size, config.hidden_size)
self.layers = nn.ModuleList([GLM4DecoderLayer(config) for _ in range(num_layers)])
self.norm = RMSNorm(config.hidden_size, eps=rms_norm_eps)
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
) -> torch.Tensor:
hidden_states = self.embed_tokens(input_ids)
residual = None
for layer in self.layers:
hidden_states, residual = layer(positions, hidden_states, residual)
hidden_states, _ = self.norm(hidden_states, residual)
return hidden_states
@register_model("ChatGLMModel", "ChatGLMForConditionalGeneration")
class ChatGLMForCausalLM(nn.Module):
"""
GLM-4 model for causal language modeling.
Weight mapping from HuggingFace to nanovllm:
- transformer.embedding.word_embeddings → model.embed_tokens
- transformer.encoder.layers.X.input_layernorm → model.layers.X.input_layernorm
- transformer.encoder.layers.X.self_attention.query_key_value → model.layers.X.self_attn.qkv_proj (split q/k/v)
- transformer.encoder.layers.X.self_attention.dense → model.layers.X.self_attn.o_proj
- transformer.encoder.layers.X.post_attention_layernorm → model.layers.X.post_attention_layernorm
- transformer.encoder.layers.X.mlp.dense_h_to_4h → model.layers.X.mlp.gate_up_proj (split gate/up)
- transformer.encoder.layers.X.mlp.dense_4h_to_h → model.layers.X.mlp.down_proj
- transformer.encoder.final_layernorm → model.norm
- transformer.output_layer → lm_head
"""
packed_modules_mapping = {
# QKV is merged in GLM-4 as query_key_value
"query_key_value": ("qkv_proj", None), # Special handling needed
# MLP gate and up are merged as dense_h_to_4h
"dense_h_to_4h": ("gate_up_proj", None), # Special handling needed
}
# Weight name mapping for loader
hf_to_nanovllm_mapping = {
"transformer.embedding.word_embeddings": "model.embed_tokens",
"transformer.encoder.final_layernorm": "model.norm",
"transformer.output_layer": "lm_head",
}
def __init__(self, config) -> None:
super().__init__()
vocab_size = getattr(config, 'padded_vocab_size', config.vocab_size)
self.config = config
self.model = GLM4Model(config)
self.lm_head = ParallelLMHead(vocab_size, config.hidden_size)
# GLM-4 does not tie embeddings
# if getattr(config, 'tie_word_embeddings', False):
# self.lm_head.weight.data = self.model.embed_tokens.weight.data
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
) -> torch.Tensor:
return self.model(input_ids, positions)
def compute_logits(
self,
hidden_states: torch.Tensor,
) -> torch.Tensor:
return self.lm_head(hidden_states)

207
nanovllm/models/qwen2.py Normal file
View File

@@ -0,0 +1,207 @@
import torch
from torch import nn
import torch.distributed as dist
from transformers import Qwen2Config
from nanovllm.layers.activation import SiluAndMul
from nanovllm.layers.attention import Attention
from nanovllm.layers.layernorm import RMSNorm
from nanovllm.layers.linear import QKVParallelLinear, MergedColumnParallelLinear, RowParallelLinear
from nanovllm.layers.rotary_embedding import get_rope
from nanovllm.layers.embed_head import VocabParallelEmbedding, ParallelLMHead
from nanovllm.models.registry import register_model
class Qwen2Attention(nn.Module):
"""Qwen2/2.5 Attention without QK norm (unlike Qwen3)."""
def __init__(
self,
hidden_size: int,
num_heads: int,
num_kv_heads: int,
max_position: int = 4096 * 32,
head_dim: int | None = None,
rope_theta: float = 10000,
rope_scaling: tuple | None = None,
) -> None:
super().__init__()
tp_size = dist.get_world_size()
self.total_num_heads = num_heads
assert self.total_num_heads % tp_size == 0
self.num_heads = self.total_num_heads // tp_size
self.total_num_kv_heads = num_kv_heads
assert self.total_num_kv_heads % tp_size == 0
self.num_kv_heads = self.total_num_kv_heads // tp_size
self.head_dim = head_dim or hidden_size // self.total_num_heads
self.q_size = self.num_heads * self.head_dim
self.kv_size = self.num_kv_heads * self.head_dim
self.scaling = self.head_dim ** -0.5
self.qkv_proj = QKVParallelLinear(
hidden_size,
self.head_dim,
self.total_num_heads,
self.total_num_kv_heads,
bias=True, # Qwen2/2.5 always uses bias
)
self.o_proj = RowParallelLinear(
self.total_num_heads * self.head_dim,
hidden_size,
bias=False,
)
self.rotary_emb = get_rope(
self.head_dim,
rotary_dim=self.head_dim,
max_position=max_position,
base=rope_theta,
rope_scaling=rope_scaling,
)
self.attn = Attention(
self.num_heads,
self.head_dim,
self.scaling,
self.num_kv_heads,
)
def forward(
self,
positions: torch.Tensor,
hidden_states: torch.Tensor,
) -> torch.Tensor:
qkv = self.qkv_proj(hidden_states)
q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
q = q.view(-1, self.num_heads, self.head_dim)
k = k.view(-1, self.num_kv_heads, self.head_dim)
v = v.view(-1, self.num_kv_heads, self.head_dim)
q, k = self.rotary_emb(positions, q, k)
o = self.attn(q, k, v)
output = self.o_proj(o.flatten(1, -1))
return output
class Qwen2MLP(nn.Module):
def __init__(
self,
hidden_size: int,
intermediate_size: int,
hidden_act: str,
) -> None:
super().__init__()
self.gate_up_proj = MergedColumnParallelLinear(
hidden_size,
[intermediate_size] * 2,
bias=False,
)
self.down_proj = RowParallelLinear(
intermediate_size,
hidden_size,
bias=False,
)
assert hidden_act == "silu"
self.act_fn = SiluAndMul()
def forward(self, x):
gate_up = self.gate_up_proj(x)
x = self.act_fn(gate_up)
x = self.down_proj(x)
return x
class Qwen2DecoderLayer(nn.Module):
def __init__(
self,
config: Qwen2Config,
) -> None:
super().__init__()
self.self_attn = Qwen2Attention(
hidden_size=config.hidden_size,
num_heads=config.num_attention_heads,
num_kv_heads=config.num_key_value_heads,
max_position=config.max_position_embeddings,
head_dim=getattr(config, 'head_dim', None),
rope_theta=getattr(config, "rope_theta", 1000000),
rope_scaling=getattr(config, "rope_scaling", None),
)
self.mlp = Qwen2MLP(
hidden_size=config.hidden_size,
intermediate_size=config.intermediate_size,
hidden_act=config.hidden_act,
)
self.input_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.post_attention_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
def forward(
self,
positions: torch.Tensor,
hidden_states: torch.Tensor,
residual: torch.Tensor | None,
) -> tuple[torch.Tensor, torch.Tensor]:
if residual is None:
hidden_states, residual = self.input_layernorm(hidden_states), hidden_states
else:
hidden_states, residual = self.input_layernorm(hidden_states, residual)
hidden_states = self.self_attn(positions, hidden_states)
hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
hidden_states = self.mlp(hidden_states)
return hidden_states, residual
class Qwen2Model(nn.Module):
def __init__(
self,
config: Qwen2Config,
) -> None:
super().__init__()
self.embed_tokens = VocabParallelEmbedding(config.vocab_size, config.hidden_size)
self.layers = nn.ModuleList([Qwen2DecoderLayer(config) for _ in range(config.num_hidden_layers)])
self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
) -> torch.Tensor:
hidden_states = self.embed_tokens(input_ids)
residual = None
for layer in self.layers:
hidden_states, residual = layer(positions, hidden_states, residual)
hidden_states, _ = self.norm(hidden_states, residual)
return hidden_states
@register_model("Qwen2ForCausalLM")
class Qwen2ForCausalLM(nn.Module):
packed_modules_mapping = {
"q_proj": ("qkv_proj", "q"),
"k_proj": ("qkv_proj", "k"),
"v_proj": ("qkv_proj", "v"),
"gate_proj": ("gate_up_proj", 0),
"up_proj": ("gate_up_proj", 1),
}
def __init__(
self,
config: Qwen2Config
) -> None:
super().__init__()
self.model = Qwen2Model(config)
self.lm_head = ParallelLMHead(config.vocab_size, config.hidden_size)
if config.tie_word_embeddings:
self.lm_head.weight.data = self.model.embed_tokens.weight.data
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
) -> torch.Tensor:
return self.model(input_ids, positions)
def compute_logits(
self,
hidden_states: torch.Tensor,
) -> torch.Tensor:
return self.lm_head(hidden_states)

View File

@@ -187,7 +187,7 @@ class Qwen3Model(nn.Module):
return hidden_states
@register_model("Qwen3ForCausalLM", "Qwen2ForCausalLM")
@register_model("Qwen3ForCausalLM")
class Qwen3ForCausalLM(nn.Module):
packed_modules_mapping = {
"q_proj": ("qkv_proj", "q"),

View File

@@ -414,6 +414,90 @@ def merge_attention_outputs(
return o_merged, lse_merged
# ============================================================
# FlashInfer-based implementations (recommended for merge only)
# ============================================================
# LSE conversion constants: FlashInfer uses log2, flash_attn uses ln
_LOG2_E = 1.4426950408889634 # math.log2(math.e) - ln -> log2
_LN_2 = 0.6931471805599453 # math.log(2) - log2 -> ln
# Check FlashInfer availability (only for merge_state, not attention kernel)
try:
from flashinfer.cascade import merge_state, merge_state_in_place
FLASHINFER_MERGE_AVAILABLE = True
except ImportError:
FLASHINFER_MERGE_AVAILABLE = False
def flash_attn_with_lse_flashinfer(
q: torch.Tensor,
k: torch.Tensor,
v: torch.Tensor,
softmax_scale: Optional[float] = None,
causal: bool = False,
) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Flash attention that returns output and LSE.
Uses flash_attn library (FlashInfer attention has JIT compatibility issues).
Args:
q: Query tensor [batch, seqlen_q, nheads_q, headdim]
k: Key tensor [batch, seqlen_k, nheads_kv, headdim]
v: Value tensor [batch, seqlen_k, nheads_kv, headdim]
softmax_scale: Scaling factor (default: 1/sqrt(headdim))
causal: Whether to apply causal masking
Returns:
out: Output tensor [batch, seqlen_q, nheads_q, headdim]
lse: Log-sum-exp tensor [batch, nheads_q, seqlen_q] (ln format)
"""
# Use flash_attn directly (FlashInfer attention JIT has CUDA version issues)
return flash_attn_with_lse(q, k, v, softmax_scale, causal)
def merge_attention_outputs_flashinfer(
o1: torch.Tensor,
lse1: torch.Tensor,
o2: torch.Tensor,
lse2: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Merge two attention outputs using FlashInfer's optimized kernel.
Args:
o1: First output [batch, seqlen_q, nheads, headdim]
lse1: First LSE [batch, nheads, seqlen_q] (ln format)
o2: Second output [batch, seqlen_q, nheads, headdim]
lse2: Second LSE [batch, nheads, seqlen_q] (ln format)
Returns:
o_merged: Merged output [batch, seqlen_q, nheads, headdim]
lse_merged: Merged LSE [batch, nheads, seqlen_q] (ln format)
"""
if not FLASHINFER_MERGE_AVAILABLE:
# Fallback to Triton implementation
return merge_attention_outputs(o1, lse1, o2, lse2)
# Convert to FlashInfer format
# o: [batch, seq, heads, dim] -> [seq, heads, dim]
# lse: [batch, heads, seq] -> [seq, heads] (convert ln -> log2)
v_a = o1.squeeze(0).contiguous()
s_a = (lse1.squeeze(0).transpose(0, 1).contiguous().float() * _LOG2_E)
v_b = o2.squeeze(0).contiguous()
s_b = (lse2.squeeze(0).transpose(0, 1).contiguous().float() * _LOG2_E)
# FlashInfer merge
v_merged, s_merged = merge_state(v_a, s_a, v_b, s_b)
# Convert back to flash_attn format
o_merged = v_merged.unsqueeze(0) # [1, seq, heads, dim]
lse_merged = (s_merged * _LN_2).transpose(0, 1).unsqueeze(0) # [1, heads, seq]
return o_merged, lse_merged
def chunked_attention_varlen(
q: torch.Tensor,
kv_chunks: List[Tuple[torch.Tensor, torch.Tensor]],

View File

@@ -218,6 +218,209 @@ def softmax_fuse_block_sum_kernel_non_causal(
tl.store(output_ptr + iter * segment_size // block_size, X.to(Out.type.element_ty))
# ============================================================
# KV Chunking Support Kernels
# ============================================================
@triton.jit
def softmax_partial_stats_kernel(
In,
M_out, # max per row
L_out, # sum per row (normalized by M_out)
scale,
input_stride_0,
input_stride_1,
input_stride_2,
stats_stride_0,
stats_stride_1,
k_len,
chunk_start, # Q start position (for causal)
kv_offset, # KV chunk offset (for causal)
segment_size: tl.constexpr,
block_size: tl.constexpr,
is_causal: tl.constexpr,
):
"""
Compute partial softmax statistics for a KV chunk.
For each query row, computes:
- m: max value in this chunk
- l: sum of exp(x - m) in this chunk
These can be merged across chunks using online softmax formula.
Input shape: [batch, heads, q_len, k_chunk_len]
Output shapes: M[batch, heads, q_len], L[batch, heads, q_len]
"""
block_id = tl.program_id(0)
head_id = tl.program_id(1)
batch_id = tl.program_id(2)
offs_q = tl.arange(0, block_size) + chunk_start + block_id * block_size
offs_k = tl.arange(0, segment_size)
num_iters = k_len // segment_size
# For causal: compute boundary
if is_causal:
# causal boundary: Q position where this KV chunk starts to be valid
# Q[i] can attend K[j] if i >= j
# For KV chunk at kv_offset, Q[i] can attend if i >= kv_offset
num_iters_before_causal = (chunk_start + (block_id + 1) * block_size - 1 - kv_offset) // segment_size
num_iters_before_causal = tl.minimum(num_iters_before_causal, num_iters)
num_iters_before_causal = tl.maximum(num_iters_before_causal, 0)
else:
num_iters_before_causal = num_iters
# Online softmax state
m_i = tl.zeros([block_size], dtype=tl.float32) - float("inf")
l_i = tl.zeros([block_size], dtype=tl.float32)
# Input pointer
input_ptr = In + batch_id * input_stride_0 + head_id * input_stride_1 + block_id * block_size * input_stride_2
input_ptr = input_ptr + tl.arange(0, segment_size) + tl.arange(0, block_size)[:, None] * input_stride_2
# Compute max and sum (before causal boundary)
for iter in range(0, num_iters_before_causal):
X = tl.load(input_ptr + iter * segment_size).to(tl.float32) * scale
m_local = tl.max(X, 1)
m_new = tl.maximum(m_i, m_local)
alpha = tl.math.exp2(m_i - m_new)
X = X - m_new[:, None]
l_local = tl.sum(tl.math.exp2(X), 1)
l_i = l_i * alpha + l_local
m_i = m_new
# Handle causal boundary
if is_causal:
for iter in range(num_iters_before_causal, num_iters_before_causal + 1):
if iter < num_iters:
X = tl.load(input_ptr + iter * segment_size).to(tl.float32) * scale
# causal mask: Q[i] >= K[j] + kv_offset
mask = offs_q[:, None] >= (offs_k[None, :] + iter * segment_size + kv_offset)
X = tl.where(mask, X, -1.0e6)
m_local = tl.max(X, 1)
m_new = tl.maximum(m_i, m_local)
alpha = tl.math.exp2(m_i - m_new)
X = X - m_new[:, None]
l_local = tl.sum(tl.math.exp2(X), 1)
l_i = l_i * alpha + l_local
m_i = m_new
# Output pointers
m_ptr = M_out + batch_id * stats_stride_0 + head_id * stats_stride_1 + block_id * block_size
l_ptr = L_out + batch_id * stats_stride_0 + head_id * stats_stride_1 + block_id * block_size
offs = tl.arange(0, block_size)
tl.store(m_ptr + offs, m_i.to(M_out.type.element_ty))
tl.store(l_ptr + offs, l_i.to(L_out.type.element_ty))
@triton.jit
def softmax_normalize_block_sum_kernel(
In,
Out,
M_global, # global max per row
L_global, # global sum per row
scale,
input_stride_0,
input_stride_1,
input_stride_2,
output_stride_0,
output_stride_1,
output_stride_2,
stats_stride_0,
stats_stride_1,
real_q_len,
k_len,
chunk_start,
kv_offset, # KV chunk offset (for causal)
segment_size: tl.constexpr,
block_size: tl.constexpr,
is_causal: tl.constexpr,
):
"""
Normalize with global stats and compute block sums for a KV chunk.
Uses pre-computed global m and l to correctly normalize softmax
across all KV chunks.
Input shape: [batch, heads, q_len, k_chunk_len]
Output shape: [batch, heads, q_blocks, k_chunk_blocks]
"""
block_id = tl.program_id(0)
head_id = tl.program_id(1)
batch_id = tl.program_id(2)
offs_q = tl.arange(0, block_size) + chunk_start + block_id * block_size
offs_k = tl.arange(0, segment_size)
num_iters = k_len // segment_size
# For causal: compute boundary
if is_causal:
num_iters_before_causal = (chunk_start + (block_id + 1) * block_size - 1 - kv_offset) // segment_size
num_iters_before_causal = tl.minimum(num_iters_before_causal, num_iters)
num_iters_before_causal = tl.maximum(num_iters_before_causal, 0)
else:
num_iters_before_causal = num_iters
# Load global stats
m_ptr = M_global + batch_id * stats_stride_0 + head_id * stats_stride_1 + block_id * block_size
l_ptr = L_global + batch_id * stats_stride_0 + head_id * stats_stride_1 + block_id * block_size
offs = tl.arange(0, block_size)
m_global = tl.load(m_ptr + offs).to(tl.float32)
l_global = tl.load(l_ptr + offs).to(tl.float32)
# Handle l_global = 0 (when all positions are masked)
l_global_safe = tl.where(l_global > 0, l_global, 1.0)
l_global_inv = 1.0 / l_global_safe
# Input pointer
input_ptr = In + batch_id * input_stride_0 + head_id * input_stride_1 + block_id * block_size * input_stride_2
input_ptr = input_ptr + tl.arange(0, segment_size) + tl.arange(0, block_size)[:, None] * input_stride_2
# Output pointer
output_ptr = Out + batch_id * output_stride_0 + head_id * output_stride_1 + block_id * output_stride_2
output_ptr = output_ptr + tl.arange(0, segment_size // block_size)
sum_mask = offs_q[:, None] < real_q_len
# Normalize and compute block sums (before causal boundary)
for iter in range(0, num_iters_before_causal):
X = tl.load(input_ptr + iter * segment_size).to(tl.float32) * scale
X = tl.exp2(X - m_global[:, None]) * l_global_inv[:, None]
X = tl.where(sum_mask, X, 0)
X = tl.reshape(X, (block_size, segment_size // block_size, block_size))
X = tl.sum(X, 2)
X = tl.sum(X, 0)
tl.store(output_ptr + iter * segment_size // block_size, X.to(Out.type.element_ty))
# Handle causal boundary
if is_causal:
for iter in range(num_iters_before_causal, num_iters_before_causal + 1):
if iter < num_iters:
X = tl.load(input_ptr + iter * segment_size).to(tl.float32) * scale
# causal mask: Q[i] >= K[j] + kv_offset
causal_mask = offs_q[:, None] >= (offs_k[None, :] + iter * segment_size + kv_offset)
X = tl.where(causal_mask, X, -1.0e6)
X = tl.exp2(X - m_global[:, None]) * l_global_inv[:, None]
X = tl.where(sum_mask, X, 0)
X = tl.reshape(X, (block_size, segment_size // block_size, block_size))
X = tl.sum(X, 2)
X = tl.sum(X, 0)
tl.store(output_ptr + iter * segment_size // block_size, X.to(Out.type.element_ty))
# Zero out future blocks
for iter in range(num_iters_before_causal + 1, num_iters):
X = tl.zeros([segment_size // block_size], dtype=tl.float32)
tl.store(output_ptr + iter * segment_size // block_size, X.to(Out.type.element_ty))
@triton.jit
def flat_group_gemm_fuse_reshape_kernel(
Q, K, Out,
@@ -380,6 +583,194 @@ def softmax_fuse_block_sum(
return output
def softmax_compute_partial_stats(
attn_weights_slice: torch.Tensor,
reshaped_block_size: int,
segment_size: int,
scale: float,
chunk_start: int = 0,
kv_offset: int = 0,
is_causal: bool = False,
) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Compute partial softmax statistics for a KV chunk.
This is the first step for KV-chunked softmax computation.
For each query row, computes:
- m: max value in this chunk
- l: sum of exp(x - m) in this chunk
These partial stats can be merged across KV chunks using
`merge_softmax_stats()`, then used with `softmax_normalize_and_block_sum()`.
Args:
attn_weights_slice: Raw attention scores [batch, heads, q_len, k_chunk_len]
reshaped_block_size: Block size in reshaped space
segment_size: Processing segment size
scale: Softmax scale factor
chunk_start: Q chunk start position (in reshaped space)
kv_offset: KV chunk offset (in reshaped space, for causal masking)
is_causal: Whether to apply causal masking
Returns:
Tuple of (m, l) where:
- m: [batch, heads, q_len] max values per row
- l: [batch, heads, q_len] partial sums per row
"""
batch_size, num_heads, q_len, k_len = attn_weights_slice.shape
assert q_len % reshaped_block_size == 0
assert k_len % segment_size == 0
assert attn_weights_slice.stride(-1) == 1
m_out = torch.empty(
(batch_size, num_heads, q_len),
dtype=torch.float32,
device=attn_weights_slice.device
)
l_out = torch.empty(
(batch_size, num_heads, q_len),
dtype=torch.float32,
device=attn_weights_slice.device
)
grid = (q_len // reshaped_block_size, num_heads, batch_size)
softmax_partial_stats_kernel[grid](
attn_weights_slice,
m_out,
l_out,
scale,
attn_weights_slice.stride(0),
attn_weights_slice.stride(1),
attn_weights_slice.stride(2),
m_out.stride(0),
m_out.stride(1),
k_len,
chunk_start,
kv_offset,
segment_size,
reshaped_block_size,
is_causal,
)
return m_out, l_out
def merge_softmax_stats(
m_chunks: list,
l_chunks: list,
) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Merge partial softmax statistics from multiple KV chunks.
Uses the online softmax merging formula:
m_new = max(m1, m2)
l_new = l1 * exp(m1 - m_new) + l2 * exp(m2 - m_new)
Args:
m_chunks: List of max tensors [batch, heads, q_len] from each chunk
l_chunks: List of sum tensors [batch, heads, q_len] from each chunk
Returns:
Tuple of (m_global, l_global) with same shape as inputs
"""
assert len(m_chunks) == len(l_chunks)
assert len(m_chunks) > 0
# Use log2 scale to match kernel (exp2)
LOG2E = 1.4426950408889634
m_global = m_chunks[0].clone()
l_global = l_chunks[0].clone()
for i in range(1, len(m_chunks)):
m_chunk = m_chunks[i]
l_chunk = l_chunks[i]
m_new = torch.maximum(m_global, m_chunk)
# exp2(m - m_new) = 2^(m - m_new)
l_global = l_global * torch.pow(2.0, m_global - m_new) + l_chunk * torch.pow(2.0, m_chunk - m_new)
m_global = m_new
return m_global, l_global
def softmax_normalize_and_block_sum(
attn_weights_slice: torch.Tensor,
m_global: torch.Tensor,
l_global: torch.Tensor,
reshaped_block_size: int,
segment_size: int,
chunk_start: int,
real_q_len: int,
scale: float,
kv_offset: int = 0,
is_causal: bool = False,
) -> torch.Tensor:
"""
Normalize with global stats and compute block sums for a KV chunk.
This is the second step for KV-chunked softmax computation.
Uses pre-computed global m and l (from `merge_softmax_stats()`)
to correctly normalize softmax values and compute block sums.
Args:
attn_weights_slice: Raw attention scores [batch, heads, q_len, k_chunk_len]
m_global: Global max values [batch, heads, q_len]
l_global: Global sum values [batch, heads, q_len]
reshaped_block_size: Block size in reshaped space
segment_size: Processing segment size
chunk_start: Start position for this chunk (for masking)
real_q_len: Actual Q length (before padding)
scale: Softmax scale factor
kv_offset: KV chunk offset (in reshaped space, for causal masking)
is_causal: Whether to apply causal masking
Returns:
Block-level attention sums [batch, heads, q_blocks, k_chunk_blocks]
"""
batch_size, num_heads, q_len, k_len = attn_weights_slice.shape
assert q_len % reshaped_block_size == 0
assert k_len % segment_size == 0
assert segment_size % reshaped_block_size == 0
assert attn_weights_slice.stride(-1) == 1
output = torch.empty(
(batch_size, num_heads, q_len // reshaped_block_size, k_len // reshaped_block_size),
dtype=attn_weights_slice.dtype,
device=attn_weights_slice.device
)
grid = (q_len // reshaped_block_size, num_heads, batch_size)
softmax_normalize_block_sum_kernel[grid](
attn_weights_slice,
output,
m_global,
l_global,
scale,
attn_weights_slice.stride(0),
attn_weights_slice.stride(1),
attn_weights_slice.stride(2),
output.stride(0),
output.stride(1),
output.stride(2),
m_global.stride(0),
m_global.stride(1),
real_q_len,
k_len,
chunk_start,
kv_offset,
segment_size,
reshaped_block_size,
is_causal,
)
return output
def flat_group_gemm_fuse_reshape(
query_states: torch.Tensor,
key_states: torch.Tensor,

View File

@@ -0,0 +1,327 @@
"""
DensityObserver - Sparse Attention Density 统计 Observer。
统计两种 density:
1. Compute Density (计算密度): 基于 BSA block size (128)
- density = selected_bsa_blocks / total_causal_bsa_blocks
- GPU-only 和 Offload 模式应该一致
2. Communication Density (通信密度): 基于 CPU block size (如 4096)
- comm_density = selected_cpu_blocks / total_cpu_blocks
- 仅用于 Offload 模式,由于粒度更粗,必然 >= compute density
统计位置:
- GPU-only: xattn_bsa.py compute_prefill() - 只记录 compute density
- Offload: xattn_bsa.py select_blocks() - 记录两种 density
对于 Offload 模式的 Density 计算:
- 不是简单的 avg 或 min
- 而是 sum(selected) / sum(total),正确处理不同 chunk 大小的权重
"""
from typing import List, Dict, Optional, Tuple
import torch
from nanovllm.utils.observer import Observer
class DensityObserver(Observer):
"""
Sparse Attention Density Observer。
记录每层的 density用于验证 GPU-only 和 Offload 模式的一致性。
使用方式:
DensityObserver.enable()
DensityObserver.complete_reset()
# ... run inference ...
DensityObserver.record(layer_id, mask, causal=True)
# 或者使用累积模式 (offload):
DensityObserver.record_counts(layer_id, selected, total)
# ...
DensityObserver.print_summary()
"""
_enabled: bool = False # 默认禁用
# 每层的 compute density 记录 (BSA block 粒度)
# key: layer_id, value: list of density values (每次 prefill chunk 一个)
_layer_densities: Dict[int, List[float]] = {}
# 每层的 communication density 记录 (CPU block 粒度,仅 offload 模式)
_layer_comm_densities: Dict[int, List[float]] = {}
# 累积模式: 记录 selected/total counts (用于 offload 模式)
# 这样可以在所有 chunks 完成后正确计算 density = sum(selected) / sum(total)
_layer_selected_counts: Dict[int, List[int]] = {}
_layer_total_counts: Dict[int, List[int]] = {}
# Mask shape 记录 (用于调试)
_last_q_blocks: int = 0
_last_k_blocks: int = 0
# 模式标记
_mode: str = "unknown" # "gpu_only" or "offload"
@classmethod
def set_mode(cls, mode: str) -> None:
"""设置当前模式 (gpu_only / offload)"""
cls._mode = mode
@classmethod
def record(
cls,
layer_id: int,
mask: torch.Tensor,
causal: bool = True,
) -> float:
"""
记录一层的 density (适用于 GPU-only 模式)。
Args:
layer_id: 层 ID
mask: [batch, heads, q_blocks, k_blocks] boolean tensor
causal: 是否考虑 causal mask (只计算下三角)
Returns:
density 值
"""
if not cls._enabled:
return 0.0
density = cls._compute_density(mask, causal)
# 记录
if layer_id not in cls._layer_densities:
cls._layer_densities[layer_id] = []
cls._layer_densities[layer_id].append(density)
# 记录 mask shape
cls._last_q_blocks = mask.shape[2]
cls._last_k_blocks = mask.shape[3]
return density
@classmethod
def record_counts(
cls,
layer_id: int,
selected_blocks: int,
total_blocks: int,
) -> None:
"""
记录一层的 selected/total block counts (适用于 offload 累积模式)。
使用累积计数而不是直接计算 density这样在所有 chunks 处理完后可以正确计算:
overall_density = sum(selected) / sum(total)
这比 avg(density) 更准确,因为不同 chunk 的 Q 和 K 长度不同。
Args:
layer_id: 层 ID
selected_blocks: 这个 chunk 选中的 blocks 数量
total_blocks: 这个 chunk 的 total possible blocks 数量
"""
if not cls._enabled:
return
# 初始化列表
if layer_id not in cls._layer_selected_counts:
cls._layer_selected_counts[layer_id] = []
if layer_id not in cls._layer_total_counts:
cls._layer_total_counts[layer_id] = []
# 累积记录
cls._layer_selected_counts[layer_id].append(selected_blocks)
cls._layer_total_counts[layer_id].append(total_blocks)
@classmethod
def record_comm_density(
cls,
layer_id: int,
selected_cpu_blocks: int,
total_cpu_blocks: int,
) -> float:
"""
记录一层的 communication density (CPU block 粒度)。
Args:
layer_id: 层 ID
selected_cpu_blocks: 选中的 CPU blocks 数量
total_cpu_blocks: 总 CPU blocks 数量
Returns:
communication density 值
"""
if not cls._enabled:
return 0.0
if total_cpu_blocks == 0:
return 1.0
comm_density = selected_cpu_blocks / total_cpu_blocks
# 记录
if layer_id not in cls._layer_comm_densities:
cls._layer_comm_densities[layer_id] = []
cls._layer_comm_densities[layer_id].append(comm_density)
return comm_density
@classmethod
def _compute_density(cls, mask: torch.Tensor, causal: bool) -> float:
"""计算 mask 的 density"""
batch, heads, q_blocks, k_blocks = mask.shape
if causal:
# 只计算下三角区域
causal_mask = torch.tril(
torch.ones(q_blocks, k_blocks, device=mask.device, dtype=torch.bool)
)
total_blocks = causal_mask.sum().item() * batch * heads
selected_blocks = (mask & causal_mask.unsqueeze(0).unsqueeze(0)).sum().item()
else:
total_blocks = mask.numel()
selected_blocks = mask.sum().item()
if total_blocks == 0:
return 1.0
return selected_blocks / total_blocks
@classmethod
def complete_reset(cls) -> None:
"""重置所有统计"""
cls._layer_densities = {}
cls._layer_comm_densities = {}
cls._layer_selected_counts = {}
cls._layer_total_counts = {}
cls._last_q_blocks = 0
cls._last_k_blocks = 0
cls._mode = "unknown"
@classmethod
def get_per_layer_density(cls) -> Dict[int, float]:
"""
获取每层的 density。
对于累积模式 (offload): density = sum(selected) / sum(total)
对于直接记录模式 (gpu_only): density = avg(density_values)
"""
result = {}
# 优先使用累积模式 (offload)
if cls._layer_selected_counts:
for layer_id in cls._layer_selected_counts:
selected_list = cls._layer_selected_counts.get(layer_id, [])
total_list = cls._layer_total_counts.get(layer_id, [])
total_selected = sum(selected_list)
total_total = sum(total_list)
if total_total > 0:
result[layer_id] = total_selected / total_total
else:
# 直接记录模式 (gpu_only)
for layer_id, densities in cls._layer_densities.items():
if densities:
result[layer_id] = sum(densities) / len(densities)
return result
@classmethod
def get_overall_density(cls) -> float:
"""
获取所有层的总体 compute density。
对于累积模式 (offload): density = sum(all_selected) / sum(all_total)
对于直接记录模式 (gpu_only): density = avg(all_density_values)
注意: 总体 density 不是简单的 avg(per_layer_density)
而是 sum(all_selected) / sum(all_total),这样可以正确处理权重。
"""
# 优先使用累积模式 (offload)
if cls._layer_selected_counts:
total_selected = 0
total_total = 0
for layer_id in cls._layer_selected_counts:
total_selected += sum(cls._layer_selected_counts[layer_id])
total_total += sum(cls._layer_total_counts.get(layer_id, []))
if total_total > 0:
return total_selected / total_total
return 0.0
# 直接记录模式 (gpu_only)
all_densities = []
for densities in cls._layer_densities.values():
all_densities.extend(densities)
if not all_densities:
return 0.0
return sum(all_densities) / len(all_densities)
@classmethod
def get_overall_comm_density(cls) -> float:
"""获取所有层的平均 communication density"""
all_densities = []
for densities in cls._layer_comm_densities.values():
all_densities.extend(densities)
if not all_densities:
return 0.0
return sum(all_densities) / len(all_densities)
@classmethod
def get_per_layer_comm_density(cls) -> Dict[int, float]:
"""
获取每层的 communication density (CPU block 粒度)。
Returns:
Dict[layer_id, avg_comm_density]
"""
result = {}
for layer_id, densities in cls._layer_comm_densities.items():
if densities:
result[layer_id] = sum(densities) / len(densities)
return result
@classmethod
def get_summary(cls) -> dict:
"""返回统计摘要"""
per_layer = cls.get_per_layer_density()
per_layer_comm = cls.get_per_layer_comm_density()
return {
"mode": cls._mode,
"overall_compute_density": cls.get_overall_density(),
"overall_comm_density": cls.get_overall_comm_density(),
"per_layer_compute_density": per_layer,
"per_layer_comm_density": per_layer_comm,
"num_layers": len(per_layer),
"last_mask_shape": {
"q_blocks": cls._last_q_blocks,
"k_blocks": cls._last_k_blocks,
},
}
@classmethod
def get_min_density(cls) -> Tuple[int, float]:
"""获取最低 density 的层和值"""
per_layer = cls.get_per_layer_density()
if not per_layer:
return -1, 0.0
min_layer = min(per_layer, key=per_layer.get)
return min_layer, per_layer[min_layer]
@classmethod
def print_summary(cls) -> None:
"""打印人类可读的摘要"""
per_layer = cls.get_per_layer_density()
overall = cls.get_overall_density()
min_layer, min_density = cls.get_min_density()
overall_comm = cls.get_overall_comm_density()
print(f"[DensityObserver] Mode: {cls._mode}")
print(f" Compute density: {overall:.4f} (min: {min_density:.4f} @ layer {min_layer})")
if overall_comm > 0:
# Offload mode: show both densities with explanation
print(f" Comm density: {overall_comm:.4f} (CPU block granularity)")
print(f" Savings ratio: {1 - overall_comm:.1%} H2D transfer reduction")
print(f" Num layers: {len(per_layer)}")
# 输出 layer 0 的 density 用于对比
if 0 in per_layer:
print(f" Layer 0 density: {per_layer[0]:.6f}")

View File

@@ -1,4 +1,5 @@
import os
import re
from glob import glob
import torch
from torch import nn
@@ -9,20 +10,146 @@ def default_weight_loader(param: nn.Parameter, loaded_weight: torch.Tensor):
param.data.copy_(loaded_weight)
# GLM-4 weight name mappings
GLM4_NAME_MAPPING = {
"transformer.embedding.word_embeddings": "model.embed_tokens",
"transformer.encoder.final_layernorm": "model.norm",
"transformer.output_layer": "lm_head",
}
GLM4_LAYER_MAPPING = {
"self_attention.query_key_value": "self_attn.qkv_proj",
"self_attention.dense": "self_attn.o_proj",
"mlp.dense_h_to_4h": "mlp.gate_up_proj",
"mlp.dense_4h_to_h": "mlp.down_proj",
}
def convert_glm4_weight_name(weight_name: str) -> tuple[str, str | None]:
"""
Convert GLM-4 weight name to nanovllm format.
Returns:
tuple: (converted_name, shard_id) where shard_id is used for packed modules
Returns (None, None) for weights that should be skipped
"""
# Skip rotary embedding weights (we use our own RoPE implementation)
if "rotary_pos_emb" in weight_name:
return None, None
# Check direct mappings first
for glm_name, nano_name in GLM4_NAME_MAPPING.items():
if weight_name.startswith(glm_name):
return weight_name.replace(glm_name, nano_name), None
# Handle layer weights: transformer.encoder.layers.X.xxx
layer_match = re.match(r"transformer\.encoder\.layers\.(\d+)\.(.+)", weight_name)
if layer_match:
layer_idx = layer_match.group(1)
remainder = layer_match.group(2)
# Handle packed modules (QKV and gate_up)
for glm_subname, nano_subname in GLM4_LAYER_MAPPING.items():
if remainder.startswith(glm_subname):
suffix = remainder[len(glm_subname):] # .weight or .bias
new_name = f"model.layers.{layer_idx}.{nano_subname}{suffix}"
# Determine shard_id for packed modules
if "qkv_proj" in nano_subname:
return new_name, "qkv" # Special marker for GLM4 QKV
elif "gate_up_proj" in nano_subname:
return new_name, "gate_up" # Special marker for GLM4 gate_up
else:
return new_name, None
# Handle non-packed layer weights (layernorms)
new_name = f"model.layers.{layer_idx}.{remainder}"
return new_name, None
# No mapping found, return original
return weight_name, None
def load_glm4_qkv(param: nn.Parameter, loaded_weight: torch.Tensor, config):
"""Load GLM-4 merged QKV weights by splitting into q, k, v."""
num_heads = config.num_attention_heads
num_kv_heads = getattr(config, 'multi_query_group_num', num_heads)
head_dim = getattr(config, 'kv_channels', config.hidden_size // num_heads)
q_size = num_heads * head_dim
kv_size = num_kv_heads * head_dim
# Split QKV: [q_size + kv_size + kv_size, hidden_size]
q, k, v = loaded_weight.split([q_size, kv_size, kv_size], dim=0)
# Load each part using the weight_loader
weight_loader = getattr(param, "weight_loader")
weight_loader(param, q, "q")
weight_loader(param, k, "k")
weight_loader(param, v, "v")
def load_glm4_gate_up(param: nn.Parameter, loaded_weight: torch.Tensor, config):
"""Load GLM-4 merged gate_up weights by splitting into gate, up."""
ffn_hidden_size = getattr(config, 'ffn_hidden_size', getattr(config, 'intermediate_size', None))
# Split gate_up: [ffn_hidden_size * 2, hidden_size]
gate, up = loaded_weight.split([ffn_hidden_size, ffn_hidden_size], dim=0)
# Load each part using the weight_loader
weight_loader = getattr(param, "weight_loader")
weight_loader(param, gate, 0) # gate_proj is shard 0
weight_loader(param, up, 1) # up_proj is shard 1
def is_glm4_model(model: nn.Module) -> bool:
"""Check if the model is a GLM-4 model."""
return model.__class__.__name__ in ("ChatGLMForCausalLM",)
def load_model(model: nn.Module, path: str):
packed_modules_mapping = getattr(model, "packed_modules_mapping", {})
is_glm4 = is_glm4_model(model)
config = getattr(model, "config", None)
for file in glob(os.path.join(path, "*.safetensors")):
with safe_open(file, "pt", "cpu") as f:
for weight_name in f.keys():
loaded_weight = f.get_tensor(weight_name)
# GLM-4 specific handling
if is_glm4:
param_name, shard_id = convert_glm4_weight_name(weight_name)
# Skip weights that don't need to be loaded
if param_name is None:
continue
if shard_id == "qkv":
param = model.get_parameter(param_name)
load_glm4_qkv(param, loaded_weight, config)
continue
elif shard_id == "gate_up":
param = model.get_parameter(param_name)
load_glm4_gate_up(param, loaded_weight, config)
continue
else:
# Regular weight, use converted name
param = model.get_parameter(param_name)
weight_loader = getattr(param, "weight_loader", default_weight_loader)
weight_loader(param, loaded_weight)
continue
# Original loading logic for other models
for k in packed_modules_mapping:
if k in weight_name:
v, shard_id = packed_modules_mapping[k]
param_name = weight_name.replace(k, v)
param = model.get_parameter(param_name)
weight_loader = getattr(param, "weight_loader")
weight_loader(param, f.get_tensor(weight_name), shard_id)
weight_loader(param, loaded_weight, shard_id)
break
else:
param = model.get_parameter(weight_name)
weight_loader = getattr(param, "weight_loader", default_weight_loader)
weight_loader(param, f.get_tensor(weight_name))
weight_loader(param, loaded_weight)

View File

@@ -0,0 +1,133 @@
"""
MemoryObserver - 内存传输统计 Observer。
统计 GPU-CPU 间的数据传输量:
- H2D (Host to Device): CPU → GPU
- D2H (Device to Host): GPU → CPU
- D2D (Device to Device): GPU → GPU (buffer copy)
"""
from nanovllm.utils.observer import Observer
class MemoryObserver(Observer):
"""
内存传输 Observer统计 GPU-CPU 间的数据传输量。
统计类型:
- H2D (Host to Device): CPU → GPU
- D2H (Device to Host): GPU → CPU
- D2D (Device to Device): GPU → GPU (buffer copy)
统计位置(均在 offload_engine.py
- H2D: load_to_slot_layer(), load_block_sample_from_cpu(), load_block_full_from_cpu()
- D2H: offload_slot_layer_to_cpu(), offload_prefill_buffer_async()
- D2D: write_to_prefill_buffer(), write_to_decode_buffer()
- 重置: llm_engine.py:generate() - 与 InferenceObserver 一起重置
"""
_enabled: bool = False # 默认禁用,需要显式启用
# H2D 统计
h2d_bytes: int = 0
h2d_count: int = 0
# D2H 统计
d2h_bytes: int = 0
d2h_count: int = 0
# D2D 统计
d2d_bytes: int = 0
d2d_count: int = 0
# 按阶段统计
prefill_h2d_bytes: int = 0
prefill_d2h_bytes: int = 0
decode_h2d_bytes: int = 0
decode_d2h_bytes: int = 0
@classmethod
def record_h2d(cls, num_bytes: int, is_prefill: bool = True) -> None:
"""记录 H2D 传输"""
if not cls._enabled:
return
cls.h2d_bytes += num_bytes
cls.h2d_count += 1
if is_prefill:
cls.prefill_h2d_bytes += num_bytes
else:
cls.decode_h2d_bytes += num_bytes
@classmethod
def record_d2h(cls, num_bytes: int, is_prefill: bool = True) -> None:
"""记录 D2H 传输"""
if not cls._enabled:
return
cls.d2h_bytes += num_bytes
cls.d2h_count += 1
if is_prefill:
cls.prefill_d2h_bytes += num_bytes
else:
cls.decode_d2h_bytes += num_bytes
@classmethod
def record_d2d(cls, num_bytes: int) -> None:
"""记录 D2D 传输"""
if not cls._enabled:
return
cls.d2d_bytes += num_bytes
cls.d2d_count += 1
@classmethod
def complete_reset(cls) -> None:
"""重置所有统计"""
cls.h2d_bytes = cls.h2d_count = 0
cls.d2h_bytes = cls.d2h_count = 0
cls.d2d_bytes = cls.d2d_count = 0
cls.prefill_h2d_bytes = cls.prefill_d2h_bytes = 0
cls.decode_h2d_bytes = cls.decode_d2h_bytes = 0
@classmethod
def get_summary(cls) -> dict:
"""返回统计摘要"""
return {
"total": {
"h2d_bytes": cls.h2d_bytes,
"h2d_count": cls.h2d_count,
"d2h_bytes": cls.d2h_bytes,
"d2h_count": cls.d2h_count,
"d2d_bytes": cls.d2d_bytes,
"d2d_count": cls.d2d_count,
},
"prefill": {
"h2d_bytes": cls.prefill_h2d_bytes,
"d2h_bytes": cls.prefill_d2h_bytes,
},
"decode": {
"h2d_bytes": cls.decode_h2d_bytes,
"d2h_bytes": cls.decode_d2h_bytes,
},
}
@classmethod
def _fmt_bytes(cls, b: int) -> str:
"""格式化字节数"""
if b >= 1e9:
return f"{b/1e9:.2f} GB"
if b >= 1e6:
return f"{b/1e6:.2f} MB"
if b >= 1e3:
return f"{b/1e3:.2f} KB"
return f"{b} B"
@classmethod
def print_summary(cls) -> None:
"""打印人类可读的摘要"""
fmt = cls._fmt_bytes
total = cls.h2d_bytes + cls.d2h_bytes + cls.d2d_bytes
print(f"[MemoryObserver] Total: {fmt(total)}")
print(f" H2D: {fmt(cls.h2d_bytes)} ({cls.h2d_count} ops)")
print(f" D2H: {fmt(cls.d2h_bytes)} ({cls.d2h_count} ops)")
print(f" D2D: {fmt(cls.d2d_bytes)} ({cls.d2d_count} ops)")
print(f" Prefill - H2D: {fmt(cls.prefill_h2d_bytes)}, D2H: {fmt(cls.prefill_d2h_bytes)}")
print(f" Decode - H2D: {fmt(cls.decode_h2d_bytes)}, D2H: {fmt(cls.decode_d2h_bytes)}")

View File

@@ -1,17 +1,106 @@
class Observer():
ttft_start = 0
tpot_start = 0
"""
Observer 基类和 InferenceObserver 实现。
ttft = 0
tpot = 0
Observer 架构:
- Observer: 基类,定义通用接口
- InferenceObserver: 推理性能观测TTFT/TPOT
- MemoryObserver: 内存传输观测(在 memory_observer.py 中定义)
"""
class Observer:
"""
Observer 基类,提供通用的启用/禁用、重置、输出接口。
所有 Observer 子类应继承此类并实现:
- complete_reset(): 重置所有统计数据
- get_summary(): 返回统计摘要 dict
- print_summary(): 打印人类可读的摘要
"""
_enabled: bool = True # 默认启用
@classmethod
def reset_ttft(cls):
def enable(cls) -> None:
"""启用 observer"""
cls._enabled = True
@classmethod
def disable(cls) -> None:
"""禁用 observer"""
cls._enabled = False
@classmethod
def is_enabled(cls) -> bool:
"""检查是否启用"""
return cls._enabled
@classmethod
def complete_reset(cls) -> None:
"""重置所有统计数据(子类实现)"""
raise NotImplementedError
@classmethod
def get_summary(cls) -> dict:
"""返回统计摘要(子类实现)"""
raise NotImplementedError
@classmethod
def print_summary(cls) -> None:
"""打印人类可读的摘要(子类可选覆盖)"""
import json
print(json.dumps(cls.get_summary(), indent=2))
class InferenceObserver(Observer):
"""
推理性能 Observer统计 TTFT 和 TPOT。
- TTFT (Time To First Token): 首个 token 生成延迟
- TPOT (Time Per Output Token): 每个输出 token 的平均延迟
统计位置:
- TTFT 开始: scheduler.py:35-36 - 第一个 sequence 从 waiting 队列取出时
- TTFT 结束: llm_engine.py:69-72 - prefill 完成后(包括 chunked prefill 所有 chunks
- TPOT 开始: llm_engine.py:65 - 每次 decode step 结束时
- TPOT 结束: llm_engine.py:62-63 - 下一次 decode step 开始时计算(测量上一次 decode 时间)
- 重置: llm_engine.py:97 - generate() 开始时
注意TPOT 需要至少 2 个输出 token 才能计算(测量 decode step 间隔)。
"""
# 时间戳 (nanoseconds)
ttft_start: int = 0
tpot_start: int = 0
# 统计结果 (nanoseconds)
ttft: int = 0
tpot: int = 0
@classmethod
def reset_ttft(cls) -> None:
"""重置 TTFT 计时器"""
cls.ttft_start = 0
@classmethod
def complete_reset(cls):
def complete_reset(cls) -> None:
"""重置所有统计数据"""
cls.ttft_start = 0
cls.tpot_start = 0
cls.ttft = 0
cls.tpot = 0
@classmethod
def get_summary(cls) -> dict:
"""返回统计摘要"""
return {
"ttft_ns": cls.ttft,
"ttft_ms": cls.ttft / 1e6,
"tpot_ns": cls.tpot,
"tpot_ms": cls.tpot / 1e6,
}
@classmethod
def print_summary(cls) -> None:
"""打印摘要"""
print(f"[InferenceObserver] TTFT: {cls.ttft / 1e6:.2f}ms, TPOT: {cls.tpot / 1e6:.2f}ms")

158
scripts/profile.sh Executable file
View File

@@ -0,0 +1,158 @@
#!/bin/bash
# Profile bench.py using NVIDIA Nsight Systems (GPU-only mode)
#
# Usage:
# bash scripts/profile.sh [options]
#
# Options:
# --max-len LENGTH Max sequence length (default: 32768)
# --policy POLICY Sparse policy: full, xattn (default: xattn)
# --gpu GPU_ID GPU to use (default: 0)
# --gpu-util UTIL GPU memory utilization (default: 0.9)
# --input-len LENGTH Input length (default: max-len - 1)
# --bench-decode Run decode benchmark instead of prefill
#
# Output:
# results/nsys/bench_<policy>_<max_len>_<timestamp>.nsys-rep
#
# Examples:
# bash scripts/profile.sh
# bash scripts/profile.sh --max-len 65536 --gpu-util 0.7
# bash scripts/profile.sh --policy full --max-len 32768
# bash scripts/profile.sh --bench-decode
set -e
# Default configuration
MAX_LEN="32768"
POLICY="xattn"
GPU_ID="0"
GPU_UTIL="0.9"
INPUT_LEN=""
BENCH_MODE="prefill"
# Parse arguments
while [[ $# -gt 0 ]]; do
case $1 in
--max-len)
MAX_LEN="$2"
shift 2
;;
--policy)
POLICY="$2"
shift 2
;;
--gpu)
GPU_ID="$2"
shift 2
;;
--gpu-util)
GPU_UTIL="$2"
shift 2
;;
--input-len)
INPUT_LEN="$2"
shift 2
;;
--bench-decode)
BENCH_MODE="decode"
shift
;;
-h|--help)
echo "Usage: $0 [options]"
echo ""
echo "Options:"
echo " --max-len LENGTH Max sequence length (default: 32768)"
echo " --policy POLICY Sparse policy: full, xattn (default: xattn)"
echo " --gpu GPU_ID GPU to use (default: 0)"
echo " --gpu-util UTIL GPU memory utilization (default: 0.9)"
echo " --input-len LENGTH Input length (default: max-len - 1)"
echo " --bench-decode Run decode benchmark instead of prefill"
exit 0
;;
*)
echo "Unknown option: $1"
exit 1
;;
esac
done
# Path configuration
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"
OUTPUT_DIR="$PROJECT_ROOT/results/nsys"
BENCH_SCRIPT="$PROJECT_ROOT/bench.py"
# Create output directory if needed
mkdir -p "$OUTPUT_DIR"
# Generate timestamp for unique filename
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
# Convert max_len to human-readable format (e.g., 32768 -> 32k)
if [ "$MAX_LEN" -ge 1024 ]; then
MAX_LEN_SUFFIX="$((MAX_LEN / 1024))k"
else
MAX_LEN_SUFFIX="${MAX_LEN}"
fi
OUTPUT_FILE="$OUTPUT_DIR/bench_${POLICY}_${MAX_LEN_SUFFIX}_${BENCH_MODE}_${TIMESTAMP}"
# Build bench.py arguments
BENCH_ARGS="--max-len $MAX_LEN --gpu-util $GPU_UTIL"
if [ -n "$POLICY" ]; then
BENCH_ARGS="$BENCH_ARGS --policy $POLICY"
fi
if [ -n "$INPUT_LEN" ]; then
BENCH_ARGS="$BENCH_ARGS --input-len $INPUT_LEN"
fi
if [ "$BENCH_MODE" = "decode" ]; then
BENCH_ARGS="$BENCH_ARGS --bench-decode"
fi
echo "============================================================"
echo "NVIDIA Nsight Systems Profiling (GPU-only)"
echo "============================================================"
echo "Bench script: $BENCH_SCRIPT"
echo "Policy: $POLICY"
echo "Max length: $MAX_LEN"
echo "GPU: $GPU_ID"
echo "GPU util: $GPU_UTIL"
echo "Bench mode: $BENCH_MODE"
echo "Output file: $OUTPUT_FILE.nsys-rep"
echo ""
# nsys profile options:
# --trace=cuda,nvtx : Trace CUDA API and NVTX markers
# --force-overwrite=true : Overwrite existing output file
# --output=<path> : Output file path (without .nsys-rep extension)
echo "Running nsys profile..."
echo "Command: python bench.py $BENCH_ARGS"
echo ""
CUDA_VISIBLE_DEVICES=$GPU_ID PYTHONPATH="$PROJECT_ROOT:$PYTHONPATH" \
nsys profile \
--trace=cuda,nvtx \
--force-overwrite=true \
--output="$OUTPUT_FILE" \
python "$BENCH_SCRIPT" $BENCH_ARGS
echo ""
echo "============================================================"
echo "Profiling completed successfully!"
echo "============================================================"
echo "Output file: $OUTPUT_FILE.nsys-rep"
echo ""
echo "To view results in GUI:"
echo " nsight-sys $OUTPUT_FILE.nsys-rep"
echo ""
echo "To export statistics:"
echo " nsys stats --report cuda_api_sum $OUTPUT_FILE.nsys-rep"
echo " nsys stats --report cuda_gpu_kern_sum $OUTPUT_FILE.nsys-rep"
echo " nsys stats --report cuda_gpu_mem_size_sum $OUTPUT_FILE.nsys-rep"
echo "============================================================"

View File

@@ -6,33 +6,47 @@
# bash scripts/profile_offload.sh [options]
#
# Options:
# --policy POLICY Sparse policy name (default: full)
# --ctx-len LENGTH Context length: 32k, 64k, 128k (default: 64k)
# --dataset DATASET Task name (default: niah_single_1)
# --sample INDEX Sample index (default: 0)
# --gpu GPU_ID GPU to use (default: 0)
# --num-gpu-blocks N Number of GPU blocks/slots (default: 4)
# --block-size SIZE KV cache block size (default: 4096)
# --no-offload Disable CPU offload
#
# Output:
# results/nsys/ruler_<dataset>_sample<index>_<timestamp>.nsys-rep
# results/nsys/<policy>_<gpuonly|offload>_<ctx-len>_blk<size>_<timestamp>.nsys-rep
#
# Examples:
# bash scripts/profile_offload.sh
# bash scripts/profile_offload.sh --dataset niah_single_1 --sample 5
# bash scripts/profile_offload.sh --gpu 1 --no-offload
# bash scripts/profile_offload.sh --num-gpu-blocks 8
set -e
# bash scripts/profile_offload.sh --policy xattn --ctx-len 128k --no-offload
# bash scripts/profile_offload.sh --policy full --ctx-len 32k --num-gpu-blocks 8
# Default configuration
POLICY="full"
CTX_LEN="64k"
DATASET="niah_single_1"
SAMPLE_INDEX="0"
GPU_ID="0"
NUM_GPU_BLOCKS="4"
BLOCK_SIZE="4096"
GPU_UTIL="0.9"
ENABLE_OFFLOAD="--enable-offload"
MODEL=""
DATA_DIR_OVERRIDE=""
# Parse arguments
while [[ $# -gt 0 ]]; do
case $1 in
--policy)
POLICY="$2"
shift 2
;;
--ctx-len)
CTX_LEN="$2"
shift 2
;;
--dataset)
DATASET="$2"
shift 2
@@ -53,13 +67,33 @@ while [[ $# -gt 0 ]]; do
NUM_GPU_BLOCKS="$2"
shift 2
;;
--gpu-util)
GPU_UTIL="$2"
shift 2
;;
--block-size)
BLOCK_SIZE="$2"
shift 2
;;
--model)
MODEL="$2"
shift 2
;;
--data-dir)
DATA_DIR_OVERRIDE="$2"
shift 2
;;
-h|--help)
echo "Usage: $0 [options]"
echo ""
echo "Options:"
echo " --policy POLICY Sparse policy name (default: full)"
echo " --ctx-len LENGTH Context length: 32k, 64k, 128k (default: 64k)"
echo " --block-size SIZE KV cache block size (default: 4096)"
echo " --dataset DATASET Task name (default: niah_single_1)"
echo " --sample INDEX Sample index (default: 0)"
echo " --gpu GPU_ID GPU to use (default: 0)"
echo " --gpu-util UTIL GPU memory utilization (default: 0.9)"
echo " --no-offload Disable CPU offload"
echo " --num-gpu-blocks N Number of GPU blocks/slots (default: 4)"
exit 0
@@ -76,27 +110,62 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"
OUTPUT_DIR="$PROJECT_ROOT/results/nsys"
TEST_SCRIPT="$PROJECT_ROOT/tests/test_ruler.py"
DATA_DIR="$PROJECT_ROOT/tests/data/ruler_${CTX_LEN}"
# Set max-model-len based on context length
case "$CTX_LEN" in
32k)
MAX_MODEL_LEN=36000
;;
64k)
MAX_MODEL_LEN=72000
;;
128k)
MAX_MODEL_LEN=144000
;;
256k)
MAX_MODEL_LEN=288000
;;
512k)
MAX_MODEL_LEN=576000
;;
1m)
MAX_MODEL_LEN=1100000
;;
*)
MAX_MODEL_LEN=72000
;;
esac
# Override DATA_DIR if specified
if [ -n "$DATA_DIR_OVERRIDE" ]; then
DATA_DIR="$DATA_DIR_OVERRIDE"
fi
# Create output directory if needed
mkdir -p "$OUTPUT_DIR"
# Generate timestamp for unique filename
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
OFFLOAD_SUFFIX=""
if [ -n "$ENABLE_OFFLOAD" ]; then
OFFLOAD_SUFFIX="_offload_${NUM_GPU_BLOCKS}slots"
OFFLOAD_TAG="offload"
else
OFFLOAD_TAG="gpuonly"
fi
OUTPUT_FILE="$OUTPUT_DIR/ruler_${DATASET}_sample${SAMPLE_INDEX}${OFFLOAD_SUFFIX}_${TIMESTAMP}"
OUTPUT_FILE="$OUTPUT_DIR/${POLICY}_${OFFLOAD_TAG}_${CTX_LEN}_blk${BLOCK_SIZE}_${TIMESTAMP}"
echo "============================================================"
echo "NVIDIA Nsight Systems Profiling"
echo "============================================================"
echo "Test script: $TEST_SCRIPT"
echo "Policy: $POLICY"
echo "Offload: $OFFLOAD_TAG"
echo "Context: $CTX_LEN"
echo "Block Size: $BLOCK_SIZE"
echo "Dataset: $DATASET"
echo "Sample: $SAMPLE_INDEX"
echo "GPU: $GPU_ID"
echo "GPU Blocks: $NUM_GPU_BLOCKS"
echo "Offload: ${ENABLE_OFFLOAD:-disabled}"
echo "Data Dir: $DATA_DIR"
echo "Output file: $OUTPUT_FILE.nsys-rep"
echo ""
@@ -110,17 +179,59 @@ echo ""
echo "Running nsys profile..."
echo ""
# Map policy name to internal enum name
# User-friendly name -> SparsePolicyType enum name
case "$POLICY" in
xattn)
POLICY_ENUM="XATTN_BSA"
;;
*)
POLICY_ENUM="$POLICY"
;;
esac
# Build sparse policy argument
SPARSE_POLICY_ARG=""
if [ -n "$POLICY_ENUM" ] && [ "$POLICY_ENUM" != "full" ]; then
SPARSE_POLICY_ARG="--sparse-policy $POLICY_ENUM"
fi
# Build model argument
MODEL_ARG=""
if [ -n "$MODEL" ]; then
MODEL_ARG="--model $MODEL"
fi
# Run nsys profile and capture exit code
CUDA_VISIBLE_DEVICES=$GPU_ID PYTHONPATH="$PROJECT_ROOT:$PYTHONPATH" \
nsys profile \
--trace=cuda,nvtx \
--force-overwrite=true \
--output="$OUTPUT_FILE" \
python "$TEST_SCRIPT" \
--data-dir "$DATA_DIR" \
--datasets "$DATASET" \
--sample-indices "$SAMPLE_INDEX" \
--num-gpu-blocks "$NUM_GPU_BLOCKS" \
--block-size "$BLOCK_SIZE" \
--max-model-len "$MAX_MODEL_LEN" \
--gpu-utilization "$GPU_UTIL" \
$ENABLE_OFFLOAD \
$SPARSE_POLICY_ARG \
$MODEL_ARG \
--quiet
EXIT_CODE=$?
# If test failed, delete the output file
if [ $EXIT_CODE -ne 0 ]; then
echo ""
echo "============================================================"
echo "Test FAILED! Cleaning up..."
echo "============================================================"
rm -f "$OUTPUT_FILE.nsys-rep"
echo "Deleted: $OUTPUT_FILE.nsys-rep"
exit $EXIT_CODE
fi
echo ""
echo "============================================================"

View File

@@ -1,757 +0,0 @@
"""
Custom Qwen3 implementation using only torch and transformers.
This file provides a clean reference implementation for understanding the model computation graph.
Computation Graph:
==================
Input: token_ids [batch, seq_len]
┌─────────────┐
│ Embedding │ embed_tokens: [vocab_size, hidden_size]
└─────────────┘
hidden_states [batch, seq_len, hidden_size]
┌─────────────────────────────────────────────────────────┐
│ Decoder Layer (x N) │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Self Attention Block │ │
│ │ │ │
│ │ input_layernorm (RMSNorm) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────┐ │ │
│ │ │ Qwen3Attention │ │ │
│ │ │ Q = q_proj(x) → q_norm → reshape │ │ │
│ │ │ K = k_proj(x) → k_norm → reshape │ │ │
│ │ │ V = v_proj(x) → reshape │ │ │
│ │ │ │ │ │ │
│ │ │ ▼ │ │ │
│ │ │ Q, K = apply_rotary_pos_emb(Q, K, cos, sin)│ │ │
│ │ │ │ │ │ │
│ │ │ ▼ │ │ │
│ │ │ attn_output = attention(Q, K, V) │ │ │
│ │ │ │ │ │ │
│ │ │ ▼ │ │ │
│ │ │ output = o_proj(attn_output) │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ hidden_states = residual + attn_output │ │
│ └───────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ MLP Block │ │
│ │ │ │
│ │ post_attention_layernorm (RMSNorm) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────┐ │ │
│ │ │ Qwen3MLP │ │ │
│ │ │ gate = gate_proj(x) │ │ │
│ │ │ up = up_proj(x) │ │ │
│ │ │ output = down_proj(silu(gate) * up) │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ hidden_states = residual + mlp_output │ │
│ └───────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
┌─────────────┐
│ norm │ final RMSNorm
└─────────────┘
┌─────────────┐
│ lm_head │ [hidden_size, vocab_size]
└─────────────┘
logits [batch, seq_len, vocab_size]
"""
import math
from typing import Optional, Tuple, List
import torch
import torch.nn as nn
import torch.nn.functional as F
class Qwen3RMSNorm(nn.Module):
"""RMSNorm implementation."""
def __init__(self, hidden_size: int, eps: float = 1e-6):
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.eps = eps
def forward(self, x: torch.Tensor) -> torch.Tensor:
input_dtype = x.dtype
x = x.float()
variance = x.pow(2).mean(-1, keepdim=True)
x = x * torch.rsqrt(variance + self.eps)
return self.weight * x.to(input_dtype)
class Qwen3RotaryEmbedding(nn.Module):
"""Rotary Position Embedding (RoPE)."""
def __init__(self, dim: int, max_position_embeddings: int = 32768, base: float = 10000.0):
super().__init__()
self.dim = dim
self.max_position_embeddings = max_position_embeddings
self.base = base
# Compute inverse frequencies
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
self.register_buffer("inv_freq", inv_freq, persistent=False)
@torch.no_grad()
def forward(self, x: torch.Tensor, position_ids: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Args:
x: Input tensor [batch, seq_len, num_heads, head_dim] or similar
position_ids: Position indices [batch, seq_len]
Returns:
cos, sin: [batch, seq_len, head_dim]
"""
# inv_freq: [dim/2]
# position_ids: [batch, seq_len]
inv_freq_expanded = self.inv_freq[None, :, None].float() # [1, dim/2, 1]
position_ids_expanded = position_ids[:, None, :].float() # [batch, 1, seq_len]
# freqs: [batch, dim/2, seq_len]
freqs = inv_freq_expanded @ position_ids_expanded
# freqs: [batch, seq_len, dim/2]
freqs = freqs.transpose(1, 2)
# Duplicate for full head_dim: [batch, seq_len, dim]
emb = torch.cat((freqs, freqs), dim=-1)
cos = emb.cos().to(x.dtype)
sin = emb.sin().to(x.dtype)
return cos, sin
def rotate_half(x: torch.Tensor) -> torch.Tensor:
"""Rotate half the hidden dims of the input."""
x1 = x[..., : x.shape[-1] // 2]
x2 = x[..., x.shape[-1] // 2 :]
return torch.cat((-x2, x1), dim=-1)
def apply_rotary_pos_emb(
q: torch.Tensor,
k: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Apply rotary position embeddings to Q and K.
Args:
q: [batch, num_heads, seq_len, head_dim]
k: [batch, num_kv_heads, seq_len, head_dim]
cos: [batch, seq_len, head_dim]
sin: [batch, seq_len, head_dim]
Returns:
q_embed, k_embed with same shapes as inputs
"""
# Unsqueeze for broadcasting: [batch, 1, seq_len, head_dim]
cos = cos.unsqueeze(1)
sin = sin.unsqueeze(1)
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
return q_embed, k_embed
class Qwen3Attention(nn.Module):
"""
Qwen3 Multi-Head Attention with Grouped Query Attention (GQA) support.
Data Flow:
---------
hidden_states [batch, seq_len, hidden_size]
├──► q_proj ──► q_norm ──► reshape ──► Q [batch, num_heads, seq_len, head_dim]
├──► k_proj ──► k_norm ──► reshape ──► K [batch, num_kv_heads, seq_len, head_dim]
└──► v_proj ──► reshape ──► V [batch, num_kv_heads, seq_len, head_dim]
apply_rotary_pos_emb(Q, K)
attention(Q, K, V) ──► attn_output [batch, num_heads, seq_len, head_dim]
reshape ──► o_proj ──► output [batch, seq_len, hidden_size]
"""
def __init__(
self,
hidden_size: int,
num_attention_heads: int,
num_key_value_heads: int,
head_dim: int,
max_position_embeddings: int = 32768,
rope_theta: float = 10000.0,
attention_bias: bool = False,
rms_norm_eps: float = 1e-6,
layer_idx: int = 0,
):
super().__init__()
self.hidden_size = hidden_size
self.num_heads = num_attention_heads
self.num_kv_heads = num_key_value_heads
self.head_dim = head_dim
self.num_kv_groups = num_attention_heads // num_key_value_heads
self.layer_idx = layer_idx
# Scaling factor
self.scaling = head_dim ** -0.5
# QKV projections
self.q_proj = nn.Linear(hidden_size, num_attention_heads * head_dim, bias=attention_bias)
self.k_proj = nn.Linear(hidden_size, num_key_value_heads * head_dim, bias=attention_bias)
self.v_proj = nn.Linear(hidden_size, num_key_value_heads * head_dim, bias=attention_bias)
self.o_proj = nn.Linear(num_attention_heads * head_dim, hidden_size, bias=attention_bias)
# QK normalization (Qwen3 specific)
self.q_norm = Qwen3RMSNorm(head_dim, eps=rms_norm_eps)
self.k_norm = Qwen3RMSNorm(head_dim, eps=rms_norm_eps)
# Rotary embeddings
self.rotary_emb = Qwen3RotaryEmbedding(
head_dim,
max_position_embeddings=max_position_embeddings,
base=rope_theta,
)
def forward(
self,
hidden_states: torch.Tensor,
position_ids: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
use_cache: bool = False,
output_qkv: bool = False,
) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor, torch.Tensor]], Optional[dict]]:
"""
Args:
hidden_states: [batch, seq_len, hidden_size]
position_ids: [batch, seq_len]
attention_mask: [batch, 1, seq_len, kv_seq_len] (causal mask)
past_key_value: (k_cache, v_cache) from previous steps
use_cache: Whether to return updated cache
output_qkv: Whether to output Q, K, V tensors for debugging
Returns:
output: [batch, seq_len, hidden_size]
past_key_value: Updated cache (if use_cache=True)
qkv_dict: {"q": Q, "k": K, "v": V} (if output_qkv=True)
"""
batch_size, seq_len, _ = hidden_states.shape
# === QKV Projections ===
q = self.q_proj(hidden_states) # [batch, seq_len, num_heads * head_dim]
k = self.k_proj(hidden_states) # [batch, seq_len, num_kv_heads * head_dim]
v = self.v_proj(hidden_states) # [batch, seq_len, num_kv_heads * head_dim]
# Reshape to [batch, seq_len, num_heads, head_dim]
q = q.view(batch_size, seq_len, self.num_heads, self.head_dim)
k = k.view(batch_size, seq_len, self.num_kv_heads, self.head_dim)
v = v.view(batch_size, seq_len, self.num_kv_heads, self.head_dim)
# === QK Normalization (Qwen3 specific) ===
q = self.q_norm(q)
k = self.k_norm(k)
# Transpose to [batch, num_heads, seq_len, head_dim]
q = q.transpose(1, 2)
k = k.transpose(1, 2)
v = v.transpose(1, 2)
# === Rotary Position Embeddings ===
cos, sin = self.rotary_emb(v, position_ids)
q, k = apply_rotary_pos_emb(q, k, cos, sin)
# === KV Cache Update ===
if past_key_value is not None:
k_cache, v_cache = past_key_value
k = torch.cat([k_cache, k], dim=2)
v = torch.cat([v_cache, v], dim=2)
new_past_key_value = (k, v) if use_cache else None
# === Grouped Query Attention (expand KV heads if needed) ===
if self.num_kv_groups > 1:
# Repeat KV for each query group
k = k.repeat_interleave(self.num_kv_groups, dim=1)
v = v.repeat_interleave(self.num_kv_groups, dim=1)
# === Attention Computation (using SDPA for memory efficiency) ===
# Use PyTorch's scaled_dot_product_attention which can use FlashAttention backend
# is_causal only works when q_len == kv_len (prefill), not during decode
q_len, kv_len = q.shape[2], k.shape[2]
is_causal = (q_len == kv_len) and (q_len > 1)
attn_output = F.scaled_dot_product_attention(
q, k, v,
attn_mask=None,
dropout_p=0.0,
is_causal=is_causal,
scale=self.scaling,
) # [batch, num_heads, seq_len, head_dim]
# === Output Projection ===
# Transpose back and reshape
attn_output = attn_output.transpose(1, 2).contiguous() # [batch, seq_len, num_heads, head_dim]
attn_output = attn_output.view(batch_size, seq_len, -1) # [batch, seq_len, hidden_size]
output = self.o_proj(attn_output)
# Optional QKV output for debugging
qkv_dict = None
if output_qkv:
qkv_dict = {
"q": q, # [batch, num_heads, seq_len, head_dim] (post-RoPE)
"k": k, # [batch, num_heads, kv_seq_len, head_dim] (post-RoPE, expanded)
"v": v, # [batch, num_heads, kv_seq_len, head_dim] (expanded)
}
return output, new_past_key_value, qkv_dict
class Qwen3MLP(nn.Module):
"""
Qwen3 MLP with SwiGLU activation.
Data Flow:
---------
hidden_states [batch, seq_len, hidden_size]
├──► gate_proj ──► gate [batch, seq_len, intermediate_size]
└──► up_proj ──► up [batch, seq_len, intermediate_size]
silu(gate) * up
down_proj ──► output [batch, seq_len, hidden_size]
"""
def __init__(self, hidden_size: int, intermediate_size: int, bias: bool = False):
super().__init__()
self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=bias)
self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=bias)
self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=bias)
def forward(self, x: torch.Tensor) -> torch.Tensor:
gate = self.gate_proj(x)
up = self.up_proj(x)
return self.down_proj(F.silu(gate) * up)
class Qwen3DecoderLayer(nn.Module):
"""Single Qwen3 Decoder Layer."""
def __init__(
self,
hidden_size: int,
intermediate_size: int,
num_attention_heads: int,
num_key_value_heads: int,
head_dim: int,
max_position_embeddings: int = 32768,
rope_theta: float = 10000.0,
rms_norm_eps: float = 1e-6,
attention_bias: bool = False,
mlp_bias: bool = False,
layer_idx: int = 0,
):
super().__init__()
self.layer_idx = layer_idx
# Pre-attention LayerNorm
self.input_layernorm = Qwen3RMSNorm(hidden_size, eps=rms_norm_eps)
# Self-attention
self.self_attn = Qwen3Attention(
hidden_size=hidden_size,
num_attention_heads=num_attention_heads,
num_key_value_heads=num_key_value_heads,
head_dim=head_dim,
max_position_embeddings=max_position_embeddings,
rope_theta=rope_theta,
attention_bias=attention_bias,
rms_norm_eps=rms_norm_eps,
layer_idx=layer_idx,
)
# Post-attention LayerNorm
self.post_attention_layernorm = Qwen3RMSNorm(hidden_size, eps=rms_norm_eps)
# MLP
self.mlp = Qwen3MLP(hidden_size, intermediate_size, bias=mlp_bias)
def forward(
self,
hidden_states: torch.Tensor,
position_ids: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
use_cache: bool = False,
output_qkv: bool = False,
) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor, torch.Tensor]], Optional[dict]]:
"""
Args:
hidden_states: [batch, seq_len, hidden_size]
position_ids: [batch, seq_len]
attention_mask: Causal attention mask
past_key_value: KV cache for this layer
use_cache: Whether to return updated cache
output_qkv: Whether to output Q, K, V for debugging
Returns:
hidden_states: [batch, seq_len, hidden_size]
past_key_value: Updated cache
qkv_dict: QKV tensors (if output_qkv=True)
"""
# === Self Attention Block ===
residual = hidden_states
hidden_states = self.input_layernorm(hidden_states)
attn_output, new_past_key_value, qkv_dict = self.self_attn(
hidden_states=hidden_states,
position_ids=position_ids,
attention_mask=attention_mask,
past_key_value=past_key_value,
use_cache=use_cache,
output_qkv=output_qkv,
)
hidden_states = residual + attn_output
# === MLP Block ===
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
return hidden_states, new_past_key_value, qkv_dict
class Qwen3Model(nn.Module):
"""Qwen3 Transformer Model (without LM head)."""
def __init__(
self,
vocab_size: int,
hidden_size: int,
intermediate_size: int,
num_hidden_layers: int,
num_attention_heads: int,
num_key_value_heads: int,
head_dim: int,
max_position_embeddings: int = 32768,
rope_theta: float = 10000.0,
rms_norm_eps: float = 1e-6,
attention_bias: bool = False,
mlp_bias: bool = False,
):
super().__init__()
self.vocab_size = vocab_size
self.num_hidden_layers = num_hidden_layers
# Token embeddings
self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
# Decoder layers
self.layers = nn.ModuleList([
Qwen3DecoderLayer(
hidden_size=hidden_size,
intermediate_size=intermediate_size,
num_attention_heads=num_attention_heads,
num_key_value_heads=num_key_value_heads,
head_dim=head_dim,
max_position_embeddings=max_position_embeddings,
rope_theta=rope_theta,
rms_norm_eps=rms_norm_eps,
attention_bias=attention_bias,
mlp_bias=mlp_bias,
layer_idx=i,
)
for i in range(num_hidden_layers)
])
# Final LayerNorm
self.norm = Qwen3RMSNorm(hidden_size, eps=rms_norm_eps)
def forward(
self,
input_ids: torch.Tensor,
position_ids: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
past_key_values: Optional[List[Tuple[torch.Tensor, torch.Tensor]]] = None,
use_cache: bool = False,
output_qkv_layers: Optional[List[int]] = None,
) -> Tuple[torch.Tensor, Optional[List], Optional[dict]]:
"""
Args:
input_ids: [batch, seq_len]
position_ids: [batch, seq_len]
attention_mask: [batch, seq_len] or pre-computed 4D mask
past_key_values: List of (k, v) tuples for each layer
use_cache: Whether to return new cache
output_qkv_layers: List of layer indices to output QKV for
Returns:
hidden_states: [batch, seq_len, hidden_size]
new_past_key_values: Updated cache
qkv_outputs: {layer_idx: qkv_dict}
"""
batch_size, seq_len = input_ids.shape
# Embedding
hidden_states = self.embed_tokens(input_ids)
# Position IDs
if position_ids is None:
past_len = past_key_values[0][0].shape[2] if past_key_values else 0
position_ids = torch.arange(past_len, past_len + seq_len, device=input_ids.device)
position_ids = position_ids.unsqueeze(0).expand(batch_size, -1)
# Attention mask (create causal mask if not provided)
if attention_mask is None or attention_mask.dim() == 2:
kv_seq_len = seq_len + (past_key_values[0][0].shape[2] if past_key_values else 0)
causal_mask = torch.triu(
torch.full((seq_len, kv_seq_len), float("-inf"), device=input_ids.device),
diagonal=kv_seq_len - seq_len + 1,
)
attention_mask = causal_mask.unsqueeze(0).unsqueeze(0) # [1, 1, seq_len, kv_seq_len]
# Initialize cache list
new_past_key_values = [] if use_cache else None
qkv_outputs = {} if output_qkv_layers else None
# Decoder layers
for i, layer in enumerate(self.layers):
past_kv = past_key_values[i] if past_key_values else None
output_qkv = output_qkv_layers is not None and i in output_qkv_layers
hidden_states, new_kv, qkv_dict = layer(
hidden_states=hidden_states,
position_ids=position_ids,
attention_mask=attention_mask,
past_key_value=past_kv,
use_cache=use_cache,
output_qkv=output_qkv,
)
if use_cache:
new_past_key_values.append(new_kv)
if qkv_dict is not None:
qkv_outputs[i] = qkv_dict
# Final norm
hidden_states = self.norm(hidden_states)
return hidden_states, new_past_key_values, qkv_outputs
class Qwen3ForCausalLM(nn.Module):
"""Qwen3 Model with Language Modeling head."""
def __init__(
self,
vocab_size: int,
hidden_size: int,
intermediate_size: int,
num_hidden_layers: int,
num_attention_heads: int,
num_key_value_heads: int,
head_dim: int,
max_position_embeddings: int = 32768,
rope_theta: float = 10000.0,
rms_norm_eps: float = 1e-6,
attention_bias: bool = False,
mlp_bias: bool = False,
tie_word_embeddings: bool = True,
):
super().__init__()
self.vocab_size = vocab_size
self.tie_word_embeddings = tie_word_embeddings
# Transformer model
self.model = Qwen3Model(
vocab_size=vocab_size,
hidden_size=hidden_size,
intermediate_size=intermediate_size,
num_hidden_layers=num_hidden_layers,
num_attention_heads=num_attention_heads,
num_key_value_heads=num_key_value_heads,
head_dim=head_dim,
max_position_embeddings=max_position_embeddings,
rope_theta=rope_theta,
rms_norm_eps=rms_norm_eps,
attention_bias=attention_bias,
mlp_bias=mlp_bias,
)
# LM head
self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
def forward(
self,
input_ids: torch.Tensor,
position_ids: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
past_key_values: Optional[List[Tuple[torch.Tensor, torch.Tensor]]] = None,
use_cache: bool = False,
output_qkv_layers: Optional[List[int]] = None,
) -> Tuple[torch.Tensor, Optional[List], Optional[dict]]:
"""
Args:
input_ids: [batch, seq_len]
... (same as Qwen3Model)
Returns:
logits: [batch, seq_len, vocab_size]
past_key_values: Updated KV cache
qkv_outputs: QKV tensors for specified layers
"""
hidden_states, new_past_key_values, qkv_outputs = self.model(
input_ids=input_ids,
position_ids=position_ids,
attention_mask=attention_mask,
past_key_values=past_key_values,
use_cache=use_cache,
output_qkv_layers=output_qkv_layers,
)
logits = self.lm_head(hidden_states)
return logits, new_past_key_values, qkv_outputs
@classmethod
def from_pretrained(cls, model_path: str, dtype: torch.dtype = torch.float16) -> "Qwen3ForCausalLM":
"""
Load weights from a pretrained Qwen3 model.
Args:
model_path: Path to model directory containing config.json and model weights
dtype: Data type for model weights
Returns:
Initialized Qwen3ForCausalLM model
"""
import json
import os
from safetensors.torch import load_file
# Load config
config_path = os.path.join(model_path, "config.json")
with open(config_path) as f:
config = json.load(f)
# Create model
model = cls(
vocab_size=config["vocab_size"],
hidden_size=config["hidden_size"],
intermediate_size=config["intermediate_size"],
num_hidden_layers=config["num_hidden_layers"],
num_attention_heads=config["num_attention_heads"],
num_key_value_heads=config.get("num_key_value_heads", config["num_attention_heads"]),
head_dim=config.get("head_dim", config["hidden_size"] // config["num_attention_heads"]),
max_position_embeddings=config.get("max_position_embeddings", 32768),
rope_theta=config.get("rope_theta", 10000.0),
rms_norm_eps=config.get("rms_norm_eps", 1e-6),
attention_bias=config.get("attention_bias", False),
mlp_bias=config.get("mlp_bias", False),
tie_word_embeddings=config.get("tie_word_embeddings", True),
)
# Load weights
weight_files = sorted([
f for f in os.listdir(model_path)
if f.endswith(".safetensors")
])
state_dict = {}
for wf in weight_files:
state_dict.update(load_file(os.path.join(model_path, wf)))
# Load into model
model.load_state_dict(state_dict, strict=False)
# Tie lm_head weights to embed_tokens if configured
if model.tie_word_embeddings:
model.lm_head.weight = model.model.embed_tokens.weight
model = model.to(dtype)
return model
@torch.no_grad()
def generate(
self,
input_ids: torch.Tensor,
max_new_tokens: int = 32,
temperature: float = 1.0,
do_sample: bool = True,
pad_token_id: Optional[int] = None,
eos_token_id: Optional[int] = None,
) -> torch.Tensor:
"""Simple autoregressive generation."""
device = input_ids.device
batch_size, seq_len = input_ids.shape
past_key_values = None
generated = input_ids.clone()
for _ in range(max_new_tokens):
if past_key_values is None:
current_input = generated
else:
current_input = generated[:, -1:]
logits, past_key_values, _ = self(
input_ids=current_input,
past_key_values=past_key_values,
use_cache=True,
)
next_token_logits = logits[:, -1, :]
if temperature > 0 and do_sample:
next_token_logits = next_token_logits / temperature
probs = torch.softmax(next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
else:
next_token = next_token_logits.argmax(dim=-1, keepdim=True)
generated = torch.cat([generated, next_token], dim=1)
if eos_token_id is not None and (next_token == eos_token_id).all():
break
return generated
def print_computation_graph():
"""Print the computation graph for reference."""
print(__doc__)
if __name__ == "__main__":
print_computation_graph()

View File

@@ -1,151 +0,0 @@
#!/usr/bin/env python3
"""
Test: Pre-allocated chunk pair graphs for block sparse attention.
Each (Q_chunk, K_chunk) pair has its own captured CUDA graph.
Zero copy_() during replay - all data pre-filled.
Usage:
CUDA_VISIBLE_DEVICES=0 python tests/test_chunk_attention_graph.py
"""
from dataclasses import dataclass
from typing import List, Optional
import torch
from nanovllm.ops.chunked_attention import flash_attn_with_lse, merge_attention_outputs
@dataclass
class ChunkAttentionGraph:
"""Container for a captured chunk attention graph."""
graph: torch.cuda.CUDAGraph
static_q: torch.Tensor
static_k: torch.Tensor
static_v: torch.Tensor
static_output: torch.Tensor
static_lse: torch.Tensor
causal: bool
def capture_chunk_attention_graph(
chunk_size: int,
num_heads: int,
num_kv_heads: int,
head_dim: int,
scale: float,
device: torch.device,
dtype: torch.dtype,
causal: bool = False,
) -> ChunkAttentionGraph:
"""Capture a CUDA graph for single chunk attention."""
static_q = torch.zeros(1, chunk_size, num_heads, head_dim, dtype=dtype, device=device)
static_k = torch.zeros(1, chunk_size, num_kv_heads, head_dim, dtype=dtype, device=device)
static_v = torch.zeros(1, chunk_size, num_kv_heads, head_dim, dtype=dtype, device=device)
static_q.normal_()
static_k.normal_()
static_v.normal_()
# Warmup
with torch.inference_mode():
for _ in range(3):
_ = flash_attn_with_lse(static_q, static_k, static_v, scale, causal)
torch.cuda.synchronize()
# Capture
graph = torch.cuda.CUDAGraph()
with torch.inference_mode():
with torch.cuda.graph(graph):
static_output, static_lse = flash_attn_with_lse(static_q, static_k, static_v, scale, causal)
torch.cuda.synchronize()
return ChunkAttentionGraph(
graph=graph,
static_q=static_q,
static_k=static_k,
static_v=static_v,
static_output=static_output,
static_lse=static_lse,
causal=causal,
)
def main():
device = torch.device("cuda")
dtype = torch.bfloat16
chunk_size = 64
num_chunks = 4
num_heads = 8
num_kv_heads = 8
head_dim = 64
scale = 1.0 / (head_dim ** 0.5)
seq_len = chunk_size * num_chunks
print(f"Device: {torch.cuda.get_device_name()}")
print(f"Chunk size: {chunk_size}, Num chunks: {num_chunks}")
print(f"Total graphs: {num_chunks * (num_chunks + 1) // 2}")
# Test data
full_q = torch.randn(1, seq_len, num_heads, head_dim, dtype=dtype, device=device)
full_k = torch.randn(1, seq_len, num_kv_heads, head_dim, dtype=dtype, device=device)
full_v = torch.randn(1, seq_len, num_kv_heads, head_dim, dtype=dtype, device=device)
# Reference
with torch.inference_mode():
full_output, _ = flash_attn_with_lse(full_q, full_k, full_v, scale, causal=True)
# Capture all graphs
graphs: List[List[Optional[ChunkAttentionGraph]]] = [[None] * num_chunks for _ in range(num_chunks)]
for q_idx in range(num_chunks):
for k_idx in range(q_idx + 1):
graphs[q_idx][k_idx] = capture_chunk_attention_graph(
chunk_size, num_heads, num_kv_heads, head_dim, scale, device, dtype,
causal=(k_idx == q_idx)
)
print("All graphs captured")
# Pre-fill static tensors
for q_idx in range(num_chunks):
for k_idx in range(q_idx + 1):
g = graphs[q_idx][k_idx]
g.static_q.copy_(full_q[:, q_idx*chunk_size:(q_idx+1)*chunk_size])
g.static_k.copy_(full_k[:, k_idx*chunk_size:(k_idx+1)*chunk_size])
g.static_v.copy_(full_v[:, k_idx*chunk_size:(k_idx+1)*chunk_size])
print("Static tensors pre-filled")
# Replay and merge
chunked_output = torch.zeros_like(full_output)
for q_idx in range(num_chunks):
acc_out, acc_lse = None, None
for k_idx in range(q_idx + 1):
g = graphs[q_idx][k_idx]
g.graph.replay()
out, lse = g.static_output.clone(), g.static_lse.clone()
if acc_out is None:
acc_out, acc_lse = out, lse
else:
with torch.inference_mode():
acc_out, acc_lse = merge_attention_outputs(acc_out, acc_lse, out, lse)
chunked_output[:, q_idx*chunk_size:(q_idx+1)*chunk_size] = acc_out
torch.cuda.synchronize()
# Compare
all_pass = True
for q_idx in range(num_chunks):
s, e = q_idx * chunk_size, (q_idx + 1) * chunk_size
diff = (full_output[:, s:e] - chunked_output[:, s:e]).abs().max().item()
status = "" if diff < 1e-2 else ""
print(f"Q[{q_idx}]: max_diff={diff:.2e} {status}")
if diff >= 1e-2:
all_pass = False
print("✅ PASSED" if all_pass else "❌ FAILED")
if __name__ == "__main__":
main()

View File

@@ -1,156 +0,0 @@
#!/usr/bin/env python3
"""
Test: Reuse a single CUDA Graph across all layers and all chunk pairs.
Key insight: LLM layers have identical computation structure.
We only need 2 graphs (causal + non-causal), reused for all (layer, Q_i, K_j) combinations.
Usage:
CUDA_VISIBLE_DEVICES=0 python tests/test_chunk_attention_graph_reuse.py
"""
from dataclasses import dataclass
import torch
from nanovllm.ops.chunked_attention import flash_attn_with_lse, merge_attention_outputs
@dataclass
class ReusableChunkGraph:
"""A single graph that can be reused with copy_() updates."""
graph: torch.cuda.CUDAGraph
static_q: torch.Tensor
static_k: torch.Tensor
static_v: torch.Tensor
static_output: torch.Tensor
static_lse: torch.Tensor
def capture_reusable_graph(
chunk_size: int,
num_heads: int,
num_kv_heads: int,
head_dim: int,
scale: float,
device: torch.device,
dtype: torch.dtype,
causal: bool,
) -> ReusableChunkGraph:
"""Capture ONE graph to be reused for all chunk pairs."""
static_q = torch.zeros(1, chunk_size, num_heads, head_dim, dtype=dtype, device=device)
static_k = torch.zeros(1, chunk_size, num_kv_heads, head_dim, dtype=dtype, device=device)
static_v = torch.zeros(1, chunk_size, num_kv_heads, head_dim, dtype=dtype, device=device)
static_q.normal_()
static_k.normal_()
static_v.normal_()
# Warmup
with torch.inference_mode():
for _ in range(3):
_ = flash_attn_with_lse(static_q, static_k, static_v, scale, causal)
torch.cuda.synchronize()
# Capture
graph = torch.cuda.CUDAGraph()
with torch.inference_mode():
with torch.cuda.graph(graph):
static_output, static_lse = flash_attn_with_lse(static_q, static_k, static_v, scale, causal)
torch.cuda.synchronize()
return ReusableChunkGraph(
graph=graph,
static_q=static_q,
static_k=static_k,
static_v=static_v,
static_output=static_output,
static_lse=static_lse,
)
def replay_with_copy(graph: ReusableChunkGraph, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
"""Replay graph after updating static tensors with copy_()."""
graph.static_q.copy_(q)
graph.static_k.copy_(k)
graph.static_v.copy_(v)
graph.graph.replay()
return graph.static_output.clone(), graph.static_lse.clone()
def main():
device = torch.device("cuda")
dtype = torch.bfloat16
chunk_size = 64
num_chunks = 4
num_layers = 3 # Simulate multiple layers
num_heads = 8
num_kv_heads = 8
head_dim = 64
scale = 1.0 / (head_dim ** 0.5)
seq_len = chunk_size * num_chunks
print(f"Device: {torch.cuda.get_device_name()}")
print(f"Chunk size: {chunk_size}, Num chunks: {num_chunks}, Num layers: {num_layers}")
print(f"Only 2 graphs (causal + non-causal) for ALL layer × chunk combinations")
# Capture only 2 graphs
graph_causal = capture_reusable_graph(
chunk_size, num_heads, num_kv_heads, head_dim, scale, device, dtype, causal=True
)
graph_non_causal = capture_reusable_graph(
chunk_size, num_heads, num_kv_heads, head_dim, scale, device, dtype, causal=False
)
print("2 graphs captured (causal + non-causal)")
all_pass = True
for layer_id in range(num_layers):
# Different Q/K/V for each layer (simulating different layer outputs)
full_q = torch.randn(1, seq_len, num_heads, head_dim, dtype=dtype, device=device)
full_k = torch.randn(1, seq_len, num_kv_heads, head_dim, dtype=dtype, device=device)
full_v = torch.randn(1, seq_len, num_kv_heads, head_dim, dtype=dtype, device=device)
# Reference: full causal attention
with torch.inference_mode():
full_output, _ = flash_attn_with_lse(full_q, full_k, full_v, scale, causal=True)
# Chunked with graph reuse
chunked_output = torch.zeros_like(full_output)
for q_idx in range(num_chunks):
q_chunk = full_q[:, q_idx*chunk_size:(q_idx+1)*chunk_size]
acc_out, acc_lse = None, None
for k_idx in range(q_idx + 1):
k_chunk = full_k[:, k_idx*chunk_size:(k_idx+1)*chunk_size]
v_chunk = full_v[:, k_idx*chunk_size:(k_idx+1)*chunk_size]
# Reuse graph with copy_()
graph = graph_causal if k_idx == q_idx else graph_non_causal
out, lse = replay_with_copy(graph, q_chunk, k_chunk, v_chunk)
if acc_out is None:
acc_out, acc_lse = out, lse
else:
with torch.inference_mode():
acc_out, acc_lse = merge_attention_outputs(acc_out, acc_lse, out, lse)
chunked_output[:, q_idx*chunk_size:(q_idx+1)*chunk_size] = acc_out
torch.cuda.synchronize()
# Compare
max_diff = (full_output - chunked_output).abs().max().item()
status = "" if max_diff < 1e-2 else ""
print(f"Layer {layer_id}: max_diff={max_diff:.2e} {status}")
if max_diff >= 1e-2:
all_pass = False
print("✅ PASSED - Single graph reuse across layers works!" if all_pass else "❌ FAILED")
if __name__ == "__main__":
main()

View File

@@ -1,357 +0,0 @@
#!/usr/bin/env python3
"""
CUDA Graph Memory Analysis Test
This script analyzes the memory overhead of CUDA Graph at each stage:
1. Model loading
2. StaticCache allocation
3. Warmup runs
4. Graph capture
5. Graph replay
Usage:
CUDA_VISIBLE_DEVICES=4 python tests/test_cudagraph_memory.py
CUDA_VISIBLE_DEVICES=4 python tests/test_cudagraph_memory.py --model ~/models/Qwen3-0.6B
CUDA_VISIBLE_DEVICES=4 python tests/test_cudagraph_memory.py --max-cache-len 2048
"""
import argparse
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.cache_utils import StaticCache
def get_memory_mb():
"""Get current allocated memory in MB."""
return torch.cuda.memory_allocated() / 1024**2
def get_memory_gb():
"""Get current allocated memory in GB."""
return torch.cuda.memory_allocated() / 1024**3
def get_peak_memory_gb():
"""Get peak allocated memory in GB."""
return torch.cuda.max_memory_allocated() / 1024**3
def print_separator(title=None):
"""Print a separator line."""
if title:
print(f"\n{'=' * 70}")
print(f" {title}")
print(f"{'=' * 70}")
else:
print("-" * 70)
def test_memory_stages(model_path: str, max_cache_len: int, batch_size: int = 1):
"""
Test memory usage at each stage of CUDA Graph setup.
Args:
model_path: Path to the model
max_cache_len: Maximum cache length for StaticCache
batch_size: Batch size for inference
"""
print_separator("CUDA Graph Memory Analysis")
print(f"Model: {model_path}")
print(f"Max cache length: {max_cache_len}")
print(f"Batch size: {batch_size}")
results = {}
# Stage 0: Initial
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
results["initial"] = get_memory_mb()
# Stage 1: Load model
print_separator("Stage 1: Model Loading")
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="cuda",
trust_remote_code=True,
)
model.eval()
results["after_model"] = get_memory_mb()
model_size = results["after_model"] - results["initial"]
print(f" Memory: {results['after_model']:.0f} MB")
print(f" Model size: {model_size:.0f} MB ({model_size/1024:.2f} GB)")
config = model.config
device = next(model.parameters()).device
dtype = next(model.parameters()).dtype
# Stage 2: Allocate StaticCache
print_separator("Stage 2: StaticCache Allocation")
torch.cuda.reset_peak_memory_stats()
before = get_memory_mb()
static_cache = StaticCache(
config=config,
max_batch_size=batch_size,
max_cache_len=max_cache_len,
device=device,
dtype=dtype,
)
results["after_cache"] = get_memory_mb()
cache_size = results["after_cache"] - before
print(f" Memory: {results['after_cache']:.0f} MB")
print(f" StaticCache size: {cache_size:.0f} MB")
# Calculate theoretical cache size
num_layers = config.num_hidden_layers
num_kv_heads = getattr(config, "num_key_value_heads", config.num_attention_heads)
head_dim = config.hidden_size // config.num_attention_heads
dtype_size = 2 # bfloat16
theoretical_cache = (
num_layers * 2 * batch_size * num_kv_heads * max_cache_len * head_dim * dtype_size
) / (1024**2)
print(f" Theoretical: {theoretical_cache:.0f} MB")
print(f" Overhead: {cache_size - theoretical_cache:.0f} MB ({(cache_size/theoretical_cache - 1)*100:.1f}%)")
# Stage 3: Prepare static tensors
print_separator("Stage 3: Static Tensor Allocation")
before = get_memory_mb()
static_input_ids = torch.zeros(batch_size, 1, dtype=torch.long, device=device)
static_position_ids = torch.zeros(batch_size, 1, dtype=torch.long, device=device)
static_cache_position = torch.tensor([0], dtype=torch.long, device=device)
results["after_tensors"] = get_memory_mb()
tensor_size = results["after_tensors"] - before
print(f" Memory: {results['after_tensors']:.0f} MB")
print(f" Static tensors: {tensor_size:.2f} MB (negligible)")
# Stage 4: Warmup runs
print_separator("Stage 4: Warmup Runs (3 iterations)")
torch.cuda.reset_peak_memory_stats()
before = get_memory_mb()
with torch.inference_mode():
for i in range(3):
_ = model(
input_ids=static_input_ids,
position_ids=static_position_ids,
past_key_values=static_cache,
cache_position=static_cache_position,
use_cache=True,
)
torch.cuda.synchronize()
results["after_warmup"] = get_memory_mb()
results["warmup_peak"] = get_peak_memory_gb() * 1024
warmup_size = results["after_warmup"] - before
print(f" Memory: {results['after_warmup']:.0f} MB")
print(f" Peak: {results['warmup_peak']:.0f} MB")
print(f" Warmup overhead: {warmup_size:.0f} MB")
# Stage 5: CUDA Graph capture
print_separator("Stage 5: CUDA Graph Capture")
torch.cuda.reset_peak_memory_stats()
before = get_memory_mb()
graph = torch.cuda.CUDAGraph()
with torch.inference_mode():
with torch.cuda.graph(graph):
outputs = model(
input_ids=static_input_ids,
position_ids=static_position_ids,
past_key_values=static_cache,
cache_position=static_cache_position,
use_cache=True,
)
static_logits = outputs.logits
torch.cuda.synchronize()
results["after_capture"] = get_memory_mb()
results["capture_peak"] = get_peak_memory_gb() * 1024
capture_size = results["after_capture"] - before
print(f" Memory: {results['after_capture']:.0f} MB")
print(f" Peak: {results['capture_peak']:.0f} MB")
print(f" Graph capture overhead: {capture_size:.0f} MB")
# Stage 6: Graph replay
print_separator("Stage 6: Graph Replay (10 iterations)")
torch.cuda.reset_peak_memory_stats()
before = get_memory_mb()
with torch.inference_mode():
for _ in range(10):
static_input_ids.fill_(1)
static_cache_position.fill_(0)
graph.replay()
torch.cuda.synchronize()
results["after_replay"] = get_memory_mb()
results["replay_peak"] = get_peak_memory_gb() * 1024
replay_change = results["after_replay"] - before
print(f" Memory: {results['after_replay']:.0f} MB")
print(f" Peak: {results['replay_peak']:.0f} MB")
print(f" Replay memory change: {replay_change:.0f} MB (should be ~0)")
# Summary
print_separator("SUMMARY")
total_overhead = results["after_capture"] - results["after_model"]
print(f"{'Stage':<25} {'Memory (MB)':>12} {'Delta (MB)':>12}")
print("-" * 50)
print(f"{'Model loaded':<25} {results['after_model']:>12.0f} {model_size:>+12.0f}")
print(f"{'StaticCache allocated':<25} {results['after_cache']:>12.0f} {cache_size:>+12.0f}")
print(f"{'After warmup':<25} {results['after_warmup']:>12.0f} {warmup_size:>+12.0f}")
print(f"{'After graph capture':<25} {results['after_capture']:>12.0f} {capture_size:>+12.0f}")
print(f"{'After graph replay':<25} {results['after_replay']:>12.0f} {replay_change:>+12.0f}")
print("-" * 50)
print(f"{'Total (excl. model)':<25} {'':<12} {total_overhead:>+12.0f}")
print_separator("KEY FINDINGS")
print(f" 1. Model size: {model_size/1024:.2f} GB")
print(f" 2. StaticCache: {cache_size:.0f} MB (main overhead, scales with cache_len)")
print(f" 3. Graph capture: {capture_size:.0f} MB (small, stores kernel sequence)")
print(f" 4. Graph replay: {replay_change:.0f} MB (zero allocation, reuses memory)")
print(f" 5. Total CUDA Graph overhead: {total_overhead:.0f} MB")
return results
def test_cache_length_scaling(model_path: str, cache_lengths: list):
"""
Test how memory scales with different cache lengths.
Args:
model_path: Path to the model
cache_lengths: List of cache lengths to test
"""
print_separator("Cache Length Scaling Test")
print(f"Model: {model_path}")
print(f"Cache lengths: {cache_lengths}")
# Load model once
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="cuda",
trust_remote_code=True,
)
model.eval()
config = model.config
device = next(model.parameters()).device
dtype = next(model.parameters()).dtype
model_mem = get_memory_mb()
results = []
for cache_len in cache_lengths:
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
# Create cache and capture graph
static_cache = StaticCache(
config=config,
max_batch_size=1,
max_cache_len=cache_len,
device=device,
dtype=dtype,
)
static_input_ids = torch.zeros(1, 1, dtype=torch.long, device=device)
static_position_ids = torch.zeros(1, 1, dtype=torch.long, device=device)
static_cache_position = torch.tensor([0], dtype=torch.long, device=device)
with torch.inference_mode():
# Warmup
for _ in range(3):
_ = model(
input_ids=static_input_ids,
position_ids=static_position_ids,
past_key_values=static_cache,
cache_position=static_cache_position,
use_cache=True,
)
torch.cuda.synchronize()
# Capture
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
outputs = model(
input_ids=static_input_ids,
position_ids=static_position_ids,
past_key_values=static_cache,
cache_position=static_cache_position,
use_cache=True,
)
torch.cuda.synchronize()
total_mem = get_memory_mb()
overhead = total_mem - model_mem
results.append((cache_len, total_mem, overhead))
del static_cache, graph
torch.cuda.empty_cache()
# Print results
print()
print(f"{'Cache Length':>12} | {'Total (MB)':>12} | {'Overhead (MB)':>14} | {'Per 1K tokens':>14}")
print("-" * 60)
for cache_len, total, overhead in results:
per_1k = overhead / (cache_len / 1000)
print(f"{cache_len:>12} | {total:>12.0f} | {overhead:>14.0f} | {per_1k:>14.1f}")
return results
def main():
parser = argparse.ArgumentParser(description="CUDA Graph Memory Analysis")
parser.add_argument(
"--model",
type=str,
default="~/models/Qwen3-4B-Instruct-2507",
help="Model path",
)
parser.add_argument(
"--max-cache-len",
type=int,
default=1024,
help="Maximum cache length",
)
parser.add_argument(
"--batch-size",
type=int,
default=1,
help="Batch size",
)
parser.add_argument(
"--test-scaling",
action="store_true",
help="Test cache length scaling",
)
args = parser.parse_args()
model_path = os.path.expanduser(args.model)
if not torch.cuda.is_available():
print("CUDA is not available!")
return
print(f"Device: cuda:{torch.cuda.current_device()}")
print(f"GPU: {torch.cuda.get_device_name()}")
if args.test_scaling:
cache_lengths = [256, 512, 1024, 2048, 4096]
test_cache_length_scaling(model_path, cache_lengths)
else:
test_memory_stages(model_path, args.max_cache_len, args.batch_size)
print("\ntest_cudagraph_memory: PASSED")
if __name__ == "__main__":
main()

View File

@@ -1,254 +0,0 @@
"""
Needle-in-a-haystack test for LLM.
Tests: Long context retrieval capability with configurable sequence length.
NOTE: CPU offload mode has a known bug that causes incorrect outputs for
sequences longer than ~200 tokens. Use --no-offload for correctness testing.
"""
import os
os.environ["NANOVLLM_LOG_LEVEL"] = "INFO"
import argparse
from nanovllm import LLM, SamplingParams
from nanovllm.config import SparsePolicyType
from utils import generate_needle_prompt, check_needle_answer
# ============================================================
# Main Test
# ============================================================
def run_needle_test(
model_path: str,
max_model_len: int,
input_len: int,
num_gpu_blocks: int = 4,
block_size: int = 1024,
needle_position: float = 0.5,
needle_value: str = "7492",
max_new_tokens: int = 32,
enable_cpu_offload: bool = False,
enable_quest: bool = False,
enable_xattn_bsa: bool = False,
sparse_topk: int = 8,
sparse_threshold: int = 4,
sparse_samples: int = 128,
verbose: bool = True,
) -> bool:
"""
Run a needle-in-haystack test.
Args:
model_path: Path to model
max_model_len: Maximum model context length
input_len: Target input sequence length
num_gpu_blocks: Number of GPU blocks for offload
block_size: KV cache block size
needle_position: Where to place needle (0.0-1.0)
needle_value: The secret value to find
max_new_tokens: Maximum tokens to generate
enable_cpu_offload: Enable CPU offload mode
enable_quest: Enable Quest sparse attention (decode-only Top-K)
enable_xattn_bsa: Enable XAttention BSA sparse attention (prefill-only)
sparse_topk: Top-K blocks for Quest
sparse_threshold: Threshold for sparse selection (Quest/XAttention BSA)
sparse_samples: Samples per chunk for XAttention BSA estimation
verbose: Print detailed output
Returns:
True if test passed, False otherwise
"""
# Determine sparse policy
if enable_xattn_bsa:
sparse_policy = SparsePolicyType.XATTN_BSA
elif enable_quest:
sparse_policy = SparsePolicyType.QUEST
else:
sparse_policy = SparsePolicyType.FULL
if verbose:
print(f"\n{'='*60}")
print(f"Needle-in-Haystack Test")
print(f"{'='*60}")
print(f"Model: {model_path}")
print(f"Max model len: {max_model_len}")
print(f"Input length: {input_len}")
print(f"Block size: {block_size}")
print(f"Needle position: {needle_position:.0%}")
print(f"Needle value: {needle_value}")
print(f"CPU offload: {enable_cpu_offload}")
if enable_cpu_offload:
print(f"Sparse policy: {sparse_policy.name}")
if sparse_policy == SparsePolicyType.QUEST:
print(f" Quest: topk={sparse_topk}, threshold={sparse_threshold}")
elif sparse_policy == SparsePolicyType.XATTN_BSA:
print(f" XAttention BSA: threshold={sparse_threshold}, samples={sparse_samples}")
print(f"{'='*60}\n")
# 1. Initialize LLM
llm_kwargs = {
"enforce_eager": True,
"max_model_len": max_model_len,
"max_num_batched_tokens": max_model_len,
"enable_cpu_offload": enable_cpu_offload,
"kvcache_block_size": block_size,
}
if enable_cpu_offload:
llm_kwargs["num_gpu_blocks"] = num_gpu_blocks
llm_kwargs["sparse_policy"] = sparse_policy
if sparse_policy == SparsePolicyType.QUEST:
llm_kwargs["sparse_topk_blocks"] = sparse_topk
llm_kwargs["sparse_threshold_blocks"] = sparse_threshold
elif sparse_policy == SparsePolicyType.XATTN_BSA:
llm_kwargs["sparse_threshold"] = float(sparse_threshold) / 10.0 # Convert to 0.0-1.0 range
llm_kwargs["sparse_samples_per_chunk"] = sparse_samples
llm = LLM(model_path, **llm_kwargs)
# 2. Generate needle prompt
prompt, expected = generate_needle_prompt(
tokenizer=llm.tokenizer,
target_length=input_len,
needle_position=needle_position,
needle_value=needle_value,
)
# 3. Generate output
sampling_params = SamplingParams(
temperature=0.6, # Moderate temperature
max_tokens=max_new_tokens,
)
outputs = llm.generate([prompt], sampling_params, use_tqdm=True)
# 4. Check result
output_text = outputs[0]["text"]
output_token_ids = outputs[0]["token_ids"]
passed = check_needle_answer(output_text, expected)
if verbose:
print(f"\n{'='*60}")
print(f"Result")
print(f"{'='*60}")
print(f"Expected: {expected}")
print(f"Output tokens ({len(output_token_ids)}): {output_token_ids[:20]}")
print(f"Output: {output_text[:200]}...")
print(f"Status: {'PASSED' if passed else 'FAILED'}")
print(f"{'='*60}\n")
return passed
# ============================================================
# CLI Entry Point
# ============================================================
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Needle-in-haystack test for long context LLM")
parser.add_argument(
"--model", "-m",
type=str,
default=os.path.expanduser("~/models/Qwen3-4B-Instruct-2507/"),
help="Path to model"
)
parser.add_argument(
"--max-model-len",
type=int,
default=128 * 1024,
help="Maximum model context length"
)
parser.add_argument(
"--input-len",
type=int,
default=8 * 1024,
help="Target input sequence length"
)
parser.add_argument(
"--num-gpu-blocks",
type=int,
default=2,
help="Number of GPU blocks for CPU offload"
)
parser.add_argument(
"--block-size",
type=int,
default=1024,
help="KV cache block size"
)
parser.add_argument(
"--needle-position",
type=float,
default=0.5,
help="Needle position (0.0=start, 0.5=middle, 1.0=end)"
)
parser.add_argument(
"--needle-value",
type=str,
default="7492",
help="The secret value to hide"
)
parser.add_argument(
"--max-new-tokens",
type=int,
default=32,
help="Maximum tokens to generate"
)
parser.add_argument(
"--enable-offload",
action="store_true",
help="Enable CPU offload (has known bug for long sequences)"
)
parser.add_argument(
"--enable-quest",
action="store_true",
help="Enable Quest sparse attention (decode-only Top-K selection)"
)
parser.add_argument(
"--enable-xattn-bsa",
action="store_true",
help="Enable XAttention BSA sparse attention (prefill-only)"
)
parser.add_argument(
"--sparse-topk",
type=int,
default=8,
help="Top-K blocks for Quest sparse attention"
)
parser.add_argument(
"--sparse-threshold",
type=int,
default=4,
help="Apply sparse only when blocks > threshold (Quest) or attention threshold 0-9 (XAttention BSA)"
)
parser.add_argument(
"--sparse-samples",
type=int,
default=128,
help="Samples per chunk for XAttention BSA estimation"
)
args = parser.parse_args()
passed = run_needle_test(
model_path=args.model,
max_model_len=args.max_model_len,
input_len=args.input_len,
num_gpu_blocks=args.num_gpu_blocks,
block_size=args.block_size,
needle_position=args.needle_position,
needle_value=args.needle_value,
max_new_tokens=args.max_new_tokens,
enable_cpu_offload=args.enable_offload,
enable_quest=args.enable_quest,
enable_xattn_bsa=args.enable_xattn_bsa,
sparse_topk=args.sparse_topk,
sparse_threshold=args.sparse_threshold,
sparse_samples=args.sparse_samples,
verbose=True,
)
if passed:
print("test_needle: PASSED")
else:
print("test_needle: FAILED")
exit(1)

View File

@@ -1,176 +0,0 @@
"""
Needle-in-a-haystack reference test using pure torch + transformers.
This is a reference implementation for comparison with nanovllm.
Uses standard HuggingFace inference (no custom KV cache, no offload).
"""
import os
import argparse
import torch
from transformers import AutoTokenizer
from modeling_qwen3 import Qwen3ForCausalLM
from utils import generate_needle_prompt, check_needle_answer
# ============================================================
# Main Test
# ============================================================
def run_needle_test(
model_path: str,
input_len: int,
needle_position: float = 0.5,
needle_value: str = "7492",
max_new_tokens: int = 32,
dtype: str = "auto",
verbose: bool = True,
) -> bool:
"""
Run a needle-in-haystack test using standard transformers inference.
Args:
model_path: Path to model
input_len: Target input sequence length
needle_position: Where to place needle (0.0-1.0)
needle_value: The secret value to find
max_new_tokens: Maximum tokens to generate
dtype: Model dtype ("auto", "float16", "bfloat16")
verbose: Print detailed output
Returns:
True if test passed, False otherwise
"""
if verbose:
print(f"\n{'='*60}")
print(f"Needle-in-Haystack Reference Test (torch + transformers)")
print(f"{'='*60}")
print(f"Model: {model_path}")
print(f"Input length: {input_len}")
print(f"Needle position: {needle_position:.0%}")
print(f"Needle value: {needle_value}")
print(f"Dtype: {dtype}")
print(f"{'='*60}\n")
# 1. Load tokenizer
print("[1/4] Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# 2. Generate needle prompt
print("[2/4] Generating needle prompt...")
prompt, expected = generate_needle_prompt(
tokenizer=tokenizer,
target_length=input_len,
needle_position=needle_position,
needle_value=needle_value,
)
# 3. Load model
print("[3/4] Loading model...")
torch_dtype = {
"auto": torch.float16, # default to float16 for custom model
"float16": torch.float16,
"bfloat16": torch.bfloat16,
}.get(dtype, torch.float16)
model = Qwen3ForCausalLM.from_pretrained(model_path, dtype=torch_dtype)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
model.eval()
# 4. Generate output
print("[4/4] Running inference...")
device = next(model.parameters()).device
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
print(f" Input shape: {input_ids.shape}")
with torch.no_grad():
output_ids = model.generate(
input_ids,
max_new_tokens=max_new_tokens,
temperature=0.6,
do_sample=True,
pad_token_id=tokenizer.eos_token_id,
)
# Decode only the new tokens
new_token_ids = output_ids[0, input_ids.shape[1]:]
output_text = tokenizer.decode(new_token_ids, skip_special_tokens=False)
# 5. Check result
passed = check_needle_answer(output_text, expected)
if verbose:
print(f"\n{'='*60}")
print(f"Result")
print(f"{'='*60}")
print(f"Expected: {expected}")
print(f"Output tokens ({len(new_token_ids)}): {new_token_ids[:20].tolist()}")
print(f"Output: {output_text[:200]}...")
print(f"Status: {'PASSED' if passed else 'FAILED'}")
print(f"{'='*60}\n")
return passed
# ============================================================
# CLI Entry Point
# ============================================================
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Needle-in-haystack reference test (torch + transformers)"
)
parser.add_argument(
"--model", "-m",
type=str,
default=os.path.expanduser("~/models/Qwen3-4B-Instruct-2507/"),
help="Path to model"
)
parser.add_argument(
"--input-len",
type=int,
default=8 * 1024,
help="Target input sequence length"
)
parser.add_argument(
"--needle-position",
type=float,
default=0.5,
help="Needle position (0.0=start, 0.5=middle, 1.0=end)"
)
parser.add_argument(
"--needle-value",
type=str,
default="7492",
help="The secret value to hide"
)
parser.add_argument(
"--max-new-tokens",
type=int,
default=32,
help="Maximum tokens to generate"
)
parser.add_argument(
"--dtype",
type=str,
default="auto",
choices=["auto", "float16", "bfloat16"],
help="Model dtype"
)
args = parser.parse_args()
passed = run_needle_test(
model_path=args.model,
input_len=args.input_len,
needle_position=args.needle_position,
needle_value=args.needle_value,
max_new_tokens=args.max_new_tokens,
dtype=args.dtype,
verbose=True,
)
if passed:
print("test_needle_ref: PASSED")
else:
print("test_needle_ref: FAILED")
exit(1)

View File

@@ -1,136 +0,0 @@
"""
Test for QuestPolicy block selection with GQA (Grouped Query Attention).
Demonstrates the key limitation: scores are AVERAGED across heads,
so blocks strongly needed by one head but not others may be dropped.
This is the expected Quest behavior - not a bug.
"""
import torch
from nanovllm.kvcache.sparse import (
create_sparse_policy,
SparsePolicyType,
PolicyContext,
)
# ============================================================
# Test: Per-Head Score Averaging in GQA
# ============================================================
# Determine device (GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running test on device: {device}")
# Setup: 2 KV heads, 4 query heads (GQA group_size=2)
# topk=2 to make selection competitive
quest = create_sparse_policy(SparsePolicyType.QUEST, topk_blocks=2, threshold_blocks=0)
quest.initialize(
num_layers=1,
num_kv_heads=2,
head_dim=4,
num_cpu_blocks=6,
dtype=torch.float32,
device=device, # Metadata stored on GPU
)
metadata = quest.metadata
def set_key(block_id, head_id, values):
"""Set both key_min and key_max to same values for deterministic scoring."""
# Values need to be on the same device as metadata
tensor = torch.tensor(values, device=device)
metadata.key_min[block_id, 0, head_id, :] = tensor
metadata.key_max[block_id, 0, head_id, :] = tensor
# ============================================================
# Design: Different heads want different blocks
# ============================================================
#
# Query = [1,1,1,1] for all heads, so score = sum(key values)
#
# Block | Head 0 | Head 1 | Average | Result
# ------|--------|--------|---------|--------
# 0 | +4 | -4 | 0 | Head0 wants, Head1 doesn't → DROPPED
# 1 | -4 | +4 | 0 | Head1 wants, Head0 doesn't → DROPPED
# 2 | +4 | +4 | +4 | Both want → SELECTED (rank 1)
# 3 | +3 | +3 | +3 | Both want → SELECTED (rank 2)
# 4 | +4 | 0 | +2 | Head0 strongly wants, Head1 neutral → rank 3
# 5 | 0 | +4 | +2 | Head1 strongly wants, Head0 neutral → rank 3
# Block 0: Head 0 strongly wants, Head 1 strongly rejects
set_key(0, 0, [1.0, 1.0, 1.0, 1.0]) # head0: +4
set_key(0, 1, [-1.0, -1.0, -1.0, -1.0]) # head1: -4
# Block 1: Head 1 strongly wants, Head 0 strongly rejects
set_key(1, 0, [-1.0, -1.0, -1.0, -1.0]) # head0: -4
set_key(1, 1, [1.0, 1.0, 1.0, 1.0]) # head1: +4
# Block 2: Both heads want equally (highest average)
set_key(2, 0, [1.0, 1.0, 1.0, 1.0]) # head0: +4
set_key(2, 1, [1.0, 1.0, 1.0, 1.0]) # head1: +4
# Block 3: Both heads want moderately
set_key(3, 0, [0.75, 0.75, 0.75, 0.75]) # head0: +3
set_key(3, 1, [0.75, 0.75, 0.75, 0.75]) # head1: +3
# Block 4: Head 0 strongly wants, Head 1 neutral
set_key(4, 0, [1.0, 1.0, 1.0, 1.0]) # head0: +4
set_key(4, 1, [0.0, 0.0, 0.0, 0.0]) # head1: 0
# Block 5: Head 1 strongly wants, Head 0 neutral
set_key(5, 0, [0.0, 0.0, 0.0, 0.0]) # head0: 0
set_key(5, 1, [1.0, 1.0, 1.0, 1.0]) # head1: +4
# ============================================================
# Run selection
# ============================================================
# Query on same device as metadata
query = torch.ones(1, 4, 4, device=device) # GQA: 4 query heads → 2 KV heads
ctx = PolicyContext(
query_chunk_idx=0,
num_query_chunks=1,
layer_id=0,
query=query,
is_prefill=False,
block_size=1024,
total_kv_len=6144,
)
available = list(range(6))
selected = quest.select_blocks(available, ctx)
# ============================================================
# Verify: Averaging behavior
# ============================================================
# topk=2, so only blocks 2 (+4 avg) and 3 (+3 avg) should be selected
assert len(selected) == 2, f"Expected 2 blocks, got {len(selected)}"
assert selected == [2, 3], f"Expected [2, 3], got {selected}"
# Key insight: blocks 0 and 1 have score +4 for ONE head,
# but they cancel out due to averaging with the other head's -4
assert 0 not in selected, "Block 0 should NOT be selected (head scores cancel out)"
assert 1 not in selected, "Block 1 should NOT be selected (head scores cancel out)"
# Blocks 4 and 5 have +4 for one head, 0 for other → avg=+2
# But +2 < +3 (block 3), so they don't make the top-2
assert 4 not in selected, "Block 4 avg=+2 < block 3 avg=+3"
assert 5 not in selected, "Block 5 avg=+2 < block 3 avg=+3"
print("✓ Block 2 selected: both heads want it (+4, +4) → avg=+4")
print("✓ Block 3 selected: both heads want it (+3, +3) → avg=+3")
print("✓ Block 0 NOT selected: head0=+4, head1=-4 → avg=0 (cancel out)")
print("✓ Block 1 NOT selected: head0=-4, head1=+4 → avg=0 (cancel out)")
print("✓ Block 4 NOT selected: head0=+4, head1=0 → avg=+2 (lower rank)")
print("✓ Block 5 NOT selected: head0=0, head1=+4 → avg=+2 (lower rank)")
# Verify metadata is on correct device
assert metadata.key_min.device.type == device.type, f"key_min on wrong device: {metadata.key_min.device}"
assert metadata.key_max.device.type == device.type, f"key_max on wrong device: {metadata.key_max.device}"
print(f"✓ Metadata stored on {device.type.upper()}")
print("\ntest_quest_policy: PASSED")

View File

@@ -41,6 +41,7 @@ from pathlib import Path
from typing import List, Dict, Tuple, Optional
from nanovllm import LLM, SamplingParams
from nanovllm.utils.density_observer import DensityObserver
# ============================================================
@@ -48,11 +49,67 @@ from nanovllm import LLM, SamplingParams
# ============================================================
DEFAULT_DATA_DIR = Path(__file__).parent / "data/ruler_64k"
# ============================================================
# Chat Template Conversion
# ============================================================
def convert_llama_to_glm4_format(prompt: str) -> str:
"""
Convert Llama 3 chat template format to GLM-4 format.
Llama 3 format:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
{user_content}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{assistant_prefix}
GLM-4 format:
[gMASK]<sop><|user|>
{user_content}<|assistant|>
{assistant_prefix}
"""
# Split into user content and assistant prefix
parts = prompt.split("<|eot_id|><|start_header_id|>assistant<|end_header_id|>")
# Extract user content (remove Llama header tokens)
user_content = parts[0]
user_content = user_content.replace("<|begin_of_text|>", "")
user_content = user_content.replace("<|start_header_id|>user<|end_header_id|>", "")
user_content = user_content.strip()
# Extract assistant prefix (if exists)
assistant_prefix = ""
if len(parts) > 1:
assistant_prefix = parts[1].replace("<|eot_id|>", "").strip()
# Apply GLM-4 format
glm_prompt = f"[gMASK]<sop><|user|>\n{user_content}<|assistant|>"
if assistant_prefix:
glm_prompt += f"\n{assistant_prefix}"
return glm_prompt
def is_glm_model(model_path: str) -> bool:
"""Check if the model is a GLM model based on config."""
from transformers import AutoConfig
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
return getattr(config, 'model_type', '') == 'chatglm'
def convert_prompt_for_model(prompt: str, model_path: str) -> str:
"""Convert prompt format based on model type."""
if is_glm_model(model_path):
return convert_llama_to_glm4_format(prompt)
return prompt # Keep original format for Llama and other models
DEFAULT_MODEL = os.path.expanduser("~/models/Llama-3.1-8B-Instruct")
# Note: max_model_len must be > max_input_len to leave room for output tokens
# 64k benchmark has inputs up to 65536 tokens, so we need 65536 + 128 = 65664
DEFAULT_MAX_MODEL_LEN = 65664
DEFAULT_MAX_NEW_TOKENS = 128 # Larger for multi-value tasks
DEFAULT_MAX_NEW_TOKENS = 16 # Sufficient for NIAH single-value answers
# Task categories for evaluation
NIAH_TASKS = ["niah_single_1", "niah_single_2", "niah_single_3",
@@ -161,6 +218,7 @@ def run_task_test(
verbose: bool = True,
llm_factory: Optional[callable] = None,
fresh_llm: bool = False,
model_path: Optional[str] = None,
) -> Dict:
"""
Run test for a single RULER task.
@@ -198,6 +256,9 @@ def run_task_test(
for sample in samples:
idx = sample.get("index", sample["_local_idx"])
prompt = sample["input"]
# Convert prompt format for GLM models
if model_path:
prompt = convert_prompt_for_model(prompt, model_path)
expected = sample["outputs"]
# Fresh LLM mode: reinitialize for each sample
@@ -263,7 +324,7 @@ def run_ruler_benchmark(
max_new_tokens: int = DEFAULT_MAX_NEW_TOKENS,
enable_cpu_offload: bool = False,
num_gpu_blocks: int = 4,
block_size: int = 1024,
block_size: int = 4096,
num_kv_buffers: int = 4,
gpu_utilization: float = 0.9,
enforce_eager: bool = True,
@@ -275,6 +336,7 @@ def run_ruler_benchmark(
sparse_samples: int = 128,
sparse_block_size: int = 128,
sparse_stride: int = 8,
dtype: Optional[str] = None,
) -> Dict:
"""
Run RULER benchmark on multiple tasks.
@@ -320,6 +382,16 @@ def run_ruler_benchmark(
print(f"Fresh LLM mode: {fresh_llm}")
print(f"{'='*60}")
# Enable DensityObserver for XAttention BSA
if sparse_policy and sparse_policy.upper() == "XATTN_BSA":
DensityObserver.enable()
DensityObserver.complete_reset()
# Set mode for correct density interpretation
DensityObserver.set_mode("offload" if enable_cpu_offload else "gpu_only")
if not json_output:
mode_str = "offload" if enable_cpu_offload else "gpu_only"
print(f"[DensityObserver] Enabled for XAttention BSA (mode: {mode_str})")
# LLM initialization kwargs
llm_kwargs = {
"max_model_len": max_model_len,
@@ -329,6 +401,8 @@ def run_ruler_benchmark(
"kvcache_block_size": block_size,
"enable_cpu_offload": enable_cpu_offload,
}
if dtype:
llm_kwargs["dtype"] = dtype
if enable_cpu_offload:
llm_kwargs["num_gpu_blocks"] = num_gpu_blocks
llm_kwargs["num_kv_buffers"] = num_kv_buffers
@@ -367,6 +441,7 @@ def run_ruler_benchmark(
verbose=verbose and not json_output,
llm_factory=create_llm,
fresh_llm=fresh_llm,
model_path=model_path,
)
task_results.append(result)
@@ -407,6 +482,14 @@ def run_ruler_benchmark(
print(f"{'-'*54}")
print(f"{'TOTAL':<20} {total_correct}/{total_samples:<7} {overall_accuracy*100:>6.1f}% {avg_score:.3f}")
print(f"\nTime: {total_time:.1f}s")
# Print DensityObserver summary if enabled
if sparse_policy and sparse_policy.upper() == "XATTN_BSA" and DensityObserver.is_enabled():
print(f"\n{'='*60}")
print("Density Statistics (XAttention BSA)")
print(f"{'='*60}")
DensityObserver.print_summary()
print(f"{'='*60}\n")
results = {
@@ -464,8 +547,8 @@ if __name__ == "__main__":
help="Enable CPU offload mode")
parser.add_argument("--num-gpu-blocks", type=int, default=4,
help="Number of GPU blocks for CPU offload (default: 4)")
parser.add_argument("--block-size", type=int, default=1024,
help="KV cache block size (default: 1024)")
parser.add_argument("--block-size", type=int, default=4096,
help="KV cache block size (default: 4096)")
parser.add_argument("--num-kv-buffers", type=int, default=4,
help="Number of KV buffers for ring buffer (default: 4)")
parser.add_argument("--gpu-utilization", type=float, default=0.9,
@@ -489,6 +572,8 @@ if __name__ == "__main__":
help="XAttention BSA: block size for estimation")
parser.add_argument("--sparse-stride", type=int, default=8,
help="XAttention BSA: stride for Q/K downsampling")
parser.add_argument("--dtype", type=str, default=None,
help="Model dtype (bfloat16, float16). Required for models with float32 default.")
args = parser.parse_args()
@@ -526,6 +611,7 @@ if __name__ == "__main__":
sparse_samples=args.sparse_samples,
sparse_block_size=args.sparse_block_size,
sparse_stride=args.sparse_stride,
dtype=args.dtype,
)
# Exit code (skip for json output mode)

View File

@@ -1,199 +0,0 @@
"""
Sequential inference test for LLM.
Tests: After completing one prompt, the system can correctly handle
a second prompt with a clean state (first prompt's KV cache deallocated).
"""
import os
os.environ["NANOVLLM_LOG_LEVEL"] = "INFO"
import argparse
from nanovllm import LLM, SamplingParams
from utils import generate_needle_prompt, check_needle_answer
def run_sequential_test(
model_path: str,
max_model_len: int,
input_len: int,
num_gpu_blocks: int = 4,
block_size: int = 1024,
enable_cpu_offload: bool = False,
verbose: bool = True,
) -> bool:
"""
Run sequential inference test with two different prompts.
Each prompt has a different needle value. Both must be retrieved correctly.
"""
if verbose:
print(f"\n{'='*60}")
print(f"Sequential Inference Test")
print(f"{'='*60}")
print(f"Model: {model_path}")
print(f"Max model len: {max_model_len}")
print(f"Input length: {input_len}")
print(f"Block size: {block_size}")
print(f"CPU offload: {enable_cpu_offload}")
print(f"{'='*60}\n")
# Initialize LLM once
llm_kwargs = {
"enforce_eager": True,
"max_model_len": max_model_len,
"max_num_batched_tokens": max_model_len,
"enable_cpu_offload": enable_cpu_offload,
"kvcache_block_size": block_size,
}
if enable_cpu_offload:
llm_kwargs["num_gpu_blocks"] = num_gpu_blocks
llm = LLM(model_path, **llm_kwargs)
sampling_params = SamplingParams(
temperature=0.6,
max_tokens=32,
)
# ============================================================
# Test 1: First prompt with needle value "1234"
# ============================================================
needle_value_1 = "1234"
if verbose:
print(f"\n[Test 1] Generating prompt with needle value: {needle_value_1}")
prompt_1, expected_1 = generate_needle_prompt(
tokenizer=llm.tokenizer,
target_length=input_len,
needle_position=0.5,
needle_value=needle_value_1,
)
outputs_1 = llm.generate([prompt_1], sampling_params, use_tqdm=True)
output_text_1 = outputs_1[0]["text"]
passed_1 = check_needle_answer(output_text_1, expected_1)
if verbose:
print(f" Expected: {expected_1}")
print(f" Output: {output_text_1[:100]}...")
print(f" Status: {'PASSED' if passed_1 else 'FAILED'}")
# ============================================================
# Test 2: Second prompt with needle value "5678"
# ============================================================
needle_value_2 = "5678"
if verbose:
print(f"\n[Test 2] Generating prompt with needle value: {needle_value_2}")
prompt_2, expected_2 = generate_needle_prompt(
tokenizer=llm.tokenizer,
target_length=input_len,
needle_position=0.5,
needle_value=needle_value_2,
)
outputs_2 = llm.generate([prompt_2], sampling_params, use_tqdm=True)
output_text_2 = outputs_2[0]["text"]
passed_2 = check_needle_answer(output_text_2, expected_2)
if verbose:
print(f" Expected: {expected_2}")
print(f" Output: {output_text_2[:100]}...")
print(f" Status: {'PASSED' if passed_2 else 'FAILED'}")
# ============================================================
# Test 3: Third prompt - repeat first needle to ensure no cross-contamination
# ============================================================
needle_value_3 = "9999"
if verbose:
print(f"\n[Test 3] Generating prompt with needle value: {needle_value_3}")
prompt_3, expected_3 = generate_needle_prompt(
tokenizer=llm.tokenizer,
target_length=input_len,
needle_position=0.5,
needle_value=needle_value_3,
)
outputs_3 = llm.generate([prompt_3], sampling_params, use_tqdm=True)
output_text_3 = outputs_3[0]["text"]
passed_3 = check_needle_answer(output_text_3, expected_3)
if verbose:
print(f" Expected: {expected_3}")
print(f" Output: {output_text_3[:100]}...")
print(f" Status: {'PASSED' if passed_3 else 'FAILED'}")
# ============================================================
# Summary
# ============================================================
all_passed = passed_1 and passed_2 and passed_3
if verbose:
print(f"\n{'='*60}")
print(f"Summary")
print(f"{'='*60}")
print(f"Test 1 (needle={needle_value_1}): {'PASSED' if passed_1 else 'FAILED'}")
print(f"Test 2 (needle={needle_value_2}): {'PASSED' if passed_2 else 'FAILED'}")
print(f"Test 3 (needle={needle_value_3}): {'PASSED' if passed_3 else 'FAILED'}")
print(f"Overall: {'PASSED' if all_passed else 'FAILED'}")
print(f"{'='*60}\n")
return all_passed
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Sequential inference test")
parser.add_argument(
"--model", "-m",
type=str,
default=os.path.expanduser("~/models/Qwen3-0.6B/"),
help="Path to model"
)
parser.add_argument(
"--max-model-len",
type=int,
default=36 * 1024,
help="Maximum model context length"
)
parser.add_argument(
"--input-len",
type=int,
default=8 * 1024,
help="Target input sequence length"
)
parser.add_argument(
"--num-gpu-blocks",
type=int,
default=2,
help="Number of GPU blocks for CPU offload"
)
parser.add_argument(
"--block-size",
type=int,
default=1024,
help="KV cache block size"
)
parser.add_argument(
"--enable-offload",
action="store_true",
help="Enable CPU offload"
)
args = parser.parse_args()
passed = run_sequential_test(
model_path=args.model,
max_model_len=args.max_model_len,
input_len=args.input_len,
num_gpu_blocks=args.num_gpu_blocks,
block_size=args.block_size,
enable_cpu_offload=args.enable_offload,
verbose=True,
)
if passed:
print("test_sequential: PASSED")
else:
print("test_sequential: FAILED")
exit(1)

View File

@@ -1,334 +0,0 @@
"""
Test XAttention + BSA with RULER benchmark data.
Tests XAttention sparse attention correctness using RULER NIAH task.
Attention methods:
- Prefill: XAttention + BSA (sparse) or FlashAttention (dense)
- Decode: FlashAttention (always, since q_len=1)
Usage (in compass conda env with BSA available):
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH \
python tests/test_xattn_bsa.py --model ~/models/Llama-3.1-8B-Instruct
# Test with XAttention + BSA for prefill (default)
python tests/test_xattn_bsa.py --prefill-method xattn
# Test with FlashAttention for prefill (baseline)
python tests/test_xattn_bsa.py --prefill-method flash
# Test specific sample(s)
python tests/test_xattn_bsa.py --sample-id 0
python tests/test_xattn_bsa.py --sample-ids 0,1,2
Note: Compatible with transformers 4.53+ (handles both old `past_key_value`
and new `past_key_values` API).
"""
import argparse
import json
import sys
import torch
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.cache_utils import DynamicCache
from nanovllm.ops.xattn import xattn_estimate
from transformers.models.llama.modeling_llama import apply_rotary_pos_emb
# ============================================================
# XAttention + BSA Functions
# ============================================================
def expand_kv_for_gqa(key_states, value_states, num_heads):
"""Expand KV for Grouped Query Attention."""
num_kv_heads = key_states.shape[1]
if num_heads == num_kv_heads:
return key_states, value_states
num_groups = num_heads // num_kv_heads
return key_states.repeat_interleave(num_groups, dim=1), value_states.repeat_interleave(num_groups, dim=1)
def flash_attention_forward(query_states, key_states, value_states, is_causal=True):
"""Standard FlashAttention."""
from flash_attn import flash_attn_func
q = query_states.transpose(1, 2)
k = key_states.transpose(1, 2)
v = value_states.transpose(1, 2)
return flash_attn_func(q, k, v, causal=is_causal).transpose(1, 2)
def xattn_bsa_forward(query_states, key_states, value_states, threshold=0.9):
"""XAttention + BSA sparse attention."""
from block_sparse_attn import block_sparse_attn_func
batch_size, num_heads, q_len, head_dim = query_states.shape
k_len = key_states.shape[2]
_, mask = xattn_estimate(
query_states, key_states,
chunk_size=16384, block_size=128, threshold=threshold,
use_triton=True, causal=True,
)
q_block_num = (q_len + 127) // 128
k_block_num = (k_len + 127) // 128
q = query_states.transpose(1, 2).reshape(q_len, num_heads, head_dim)
k = key_states.transpose(1, 2).reshape(k_len, num_heads, head_dim)
v = value_states.transpose(1, 2).reshape(k_len, num_heads, head_dim)
__import__('pdb').set_trace()
output = block_sparse_attn_func(
q, k, v,
torch.tensor([0, q_len], dtype=torch.int32, device=q.device),
torch.tensor([0, k_len], dtype=torch.int32, device=k.device),
torch.ones(num_heads, dtype=torch.int32, device=q.device),
None,
mask[:, :, :q_block_num, :k_block_num].contiguous(),
q_len, k_len,
p_dropout=0.0, deterministic=True, is_causal=True,
)
return output.reshape(batch_size, q_len, num_heads, head_dim).transpose(1, 2)
DEBUG = False # Set to True to enable debugging
def create_patched_forward(prefill_method="xattn", threshold=0.9):
"""Create patched forward with configurable prefill method.
Args:
prefill_method: "xattn" for XAttention + BSA (sparse), "flash" for FlashAttention (dense)
threshold: XAttention threshold for block selection (only used when prefill_method="xattn")
Note:
- Prefill (q_len > 1): Uses specified prefill_method
- Decode (q_len = 1): Always uses FlashAttention (no sparse needed for single query)
"""
call_count = [0] # Mutable to track calls across layers
def patched_forward(
self,
hidden_states,
position_embeddings=None,
attention_mask=None,
past_key_value=None, # Old API (transformers < 4.57)
past_key_values=None, # New API (transformers >= 4.57)
cache_position=None,
**kwargs
):
# Handle both old and new transformers API
kv_cache = past_key_values if past_key_values is not None else past_key_value
bsz, q_len, _ = hidden_states.size()
num_heads = self.config.num_attention_heads
num_kv_heads = self.config.num_key_value_heads
head_dim = self.head_dim
# Compute Q, K, V projections
query_states = self.q_proj(hidden_states).view(bsz, q_len, num_heads, head_dim).transpose(1, 2)
key_states = self.k_proj(hidden_states).view(bsz, q_len, num_kv_heads, head_dim).transpose(1, 2)
value_states = self.v_proj(hidden_states).view(bsz, q_len, num_kv_heads, head_dim).transpose(1, 2)
# Apply rotary position embedding
cos, sin = position_embeddings
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
# Handle KV cache
if kv_cache is not None:
cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
key_states, value_states = kv_cache.update(
key_states, value_states, self.layer_idx, cache_kwargs
)
# Expand KV for GQA
key_states_exp, value_states_exp = expand_kv_for_gqa(key_states, value_states, num_heads)
# Debug output
if DEBUG and self.layer_idx == 0:
call_count[0] += 1
if call_count[0] <= 5:
phase = "prefill" if q_len > 1 else "decode"
print(f"\n[DEBUG] Layer {self.layer_idx}, call {call_count[0]} ({phase}): q_len={q_len}, k_len={key_states_exp.shape[2]}")
print(f" kv_cache is None: {kv_cache is None}")
# Choose attention method:
# - Prefill (q_len > 1): Use prefill_method (xattn or flash)
# - Decode (q_len = 1): Always use FlashAttention
is_prefill = q_len > 1
if is_prefill and prefill_method == "xattn":
# Prefill with XAttention + BSA (sparse)
attn_output = xattn_bsa_forward(query_states, key_states_exp, value_states_exp, threshold)
else:
# Prefill with FlashAttention (dense) OR Decode (always FlashAttention)
# Note: For decode (q_len=1), causal=False since single query attends to all KV
attn_output = flash_attention_forward(query_states, key_states_exp, value_states_exp, is_causal=is_prefill)
attn_output = self.o_proj(attn_output.transpose(1, 2).reshape(bsz, q_len, -1))
return attn_output, None
return patched_forward
# ============================================================
# Data & Evaluation
# ============================================================
def load_samples(filepath, indices=None):
"""Load samples from JSONL file."""
samples = []
with open(filepath) as f:
for i, line in enumerate(f):
if indices is None or i in indices:
sample = json.loads(line)
sample["_idx"] = i
samples.append(sample)
return samples
def string_match_all(output_text, expected_list):
"""RULER metric: fraction of expected values found in output."""
output_lower = output_text.lower().replace('\n', ' ')
if not expected_list:
return 1.0
return sum(1.0 if exp.strip().lower() in output_lower else 0.0 for exp in expected_list) / len(expected_list)
# ============================================================
# Test
# ============================================================
def test_with_ruler_data(model_path, data_file, sample_ids, prefill_method="xattn", threshold=0.9, max_new_tokens=50):
"""Test attention methods using RULER data.
Args:
prefill_method: "xattn" for XAttention + BSA, "flash" for FlashAttention
"""
prefill_desc = "XAttention + BSA (sparse)" if prefill_method == "xattn" else "FlashAttention (dense)"
print("=" * 60)
print("RULER NIAH Attention Test")
print("=" * 60)
print(f"Data: {data_file}")
print(f"Samples: {sample_ids}")
print(f"Prefill method: {prefill_desc}")
print(f"Decode method: FlashAttention (always)")
if prefill_method == "xattn":
print(f"XAttention threshold: {threshold}")
samples = load_samples(Path(data_file), set(sample_ids) if sample_ids else None)
if not samples:
print("No samples found!")
return False
print(f"Loaded {len(samples)} samples")
# Load model
print(f"\nLoading model: {model_path}")
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
model_path, torch_dtype=torch.float16, device_map="cuda",
attn_implementation="eager", # Will be patched
)
model.eval()
# Patch all layers
print(f"Patching attention layers...")
print(f" - Prefill: {prefill_desc}")
print(f" - Decode: FlashAttention")
for idx, layer in enumerate(model.model.layers):
layer.self_attn.layer_idx = idx # Ensure layer_idx is set
layer.self_attn.forward = create_patched_forward(prefill_method, threshold).__get__(
layer.self_attn, type(layer.self_attn)
)
total_score = 0.0
results = []
for sample in samples:
idx = sample["_idx"]
prompt = sample["input"]
expected = sample["outputs"]
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
num_tokens = inputs["input_ids"].shape[1]
print(f"\n--- Sample {idx} ({num_tokens} tokens) ---")
print(f"Expected: {expected}")
with torch.no_grad():
output = model.generate(
inputs["input_ids"],
max_new_tokens=max_new_tokens,
do_sample=False,
pad_token_id=tokenizer.eos_token_id,
)
output_text = tokenizer.decode(output[0][num_tokens:], skip_special_tokens=True)
score = string_match_all(output_text, expected)
total_score += score
status = "✓ PASS" if score >= 0.5 else "✗ FAIL"
print(f"Output: '{output_text[:100]}...'")
print(f"Result: {status} (score={score:.2f})")
results.append({"idx": idx, "score": score, "passed": score >= 0.5})
avg_score = total_score / len(samples)
passed = sum(1 for r in results if r["passed"])
print(f"\n{'='*60}")
print(f"Results: {passed}/{len(samples)} passed, avg_score={avg_score:.3f}")
print(f"{'='*60}")
return avg_score >= 0.5
def main():
parser = argparse.ArgumentParser(
description="Test XAttention + BSA vs FlashAttention for prefill using RULER NIAH benchmark"
)
parser.add_argument("--model", default="~/models/Llama-3.1-8B-Instruct")
parser.add_argument("--data-file", default="tests/data/ruler_32k/niah_single_1/validation.jsonl")
parser.add_argument("--sample-id", type=int, default=None, help="Test single sample by index")
parser.add_argument("--sample-ids", type=str, default="", help="Test multiple samples (comma-separated)")
parser.add_argument("--prefill-method", choices=["xattn", "flash"], default="xattn",
help="Prefill attention method: xattn (XAttention+BSA sparse) or flash (FlashAttention dense)")
parser.add_argument("--threshold", type=float, default=0.9, help="XAttention threshold (only for --prefill-method xattn)")
parser.add_argument("--max-new-tokens", type=int, default=50)
# Keep old option for backwards compatibility
parser.add_argument("--no-xattn", action="store_true", help="[Deprecated] Use --prefill-method flash instead")
args = parser.parse_args()
model_path = args.model.replace("~", "/home/zijie")
# Handle deprecated --no-xattn option
prefill_method = args.prefill_method
if args.no_xattn:
prefill_method = "flash"
print("Warning: --no-xattn is deprecated, use --prefill-method flash instead")
if args.sample_id is not None:
sample_ids = [args.sample_id]
elif args.sample_ids:
sample_ids = [int(x) for x in args.sample_ids.split(",")]
else:
sample_ids = [0]
# Check BSA availability if using xattn
if prefill_method == "xattn":
try:
from block_sparse_attn import block_sparse_attn_func
print("✓ BSA (Block Sparse Attention) available")
except ImportError:
print("✗ BSA not found. Install block_sparse_attn or use --prefill-method flash")
sys.exit(1)
if test_with_ruler_data(model_path, args.data_file, sample_ids, prefill_method, args.threshold, args.max_new_tokens):
print("\ntest_xattn_bsa: PASSED")
else:
print("\ntest_xattn_bsa: FAILED")
sys.exit(1)
if __name__ == "__main__":
main()

View File

@@ -1,259 +0,0 @@
"""
Test: Compare xattn_estimate vs xattn_estimate_chunked
Verify that chunked estimation with EXTERNAL chunking produces the same mask as standard estimation.
Uses real QKV data captured from model inference.
"""
import sys
import os
import torch
import warnings
from nanovllm.ops.xattn import xattn_estimate, xattn_estimate_chunked
# ============================================================
# Configuration
# ============================================================
BLOCK_SIZE = 64
STRIDE = 4
THRESHOLD = 0.9
CHUNK_SIZE = 4096
# Default QKV data directory (relative to project root)
DEFAULT_QKV_DIR = os.path.join(os.path.dirname(os.path.dirname(__file__)), "results", "kvcache")
# ============================================================
# Utility Functions
# ============================================================
def load_qkv(path):
"""Load saved QKV data."""
data = torch.load(path, map_location="cpu", weights_only=False)
print(f"Loaded: {path}")
print(f" Query shape: {data['query'].shape}")
print(f" Key shape: {data['key'].shape}")
print(f" Layer: {data['layer_id']}, Density: {data['density']:.2%}")
return data
def compare_masks(mask1, mask2, name1="standard", name2="chunked"):
"""Compare two masks and report differences."""
if mask1.shape != mask2.shape:
print(f"Shape mismatch: {name1}={mask1.shape}, {name2}={mask2.shape}")
return False
diff = (mask1 != mask2).sum().item()
total = mask1.numel()
match_rate = (total - diff) / total * 100
print(f" Match rate: {match_rate:.4f}% ({total - diff}/{total})")
if diff > 0:
diff_indices = torch.where(mask1 != mask2)
print(f" First 5 diff positions: {list(zip(*[idx[:5].tolist() for idx in diff_indices]))}")
return diff == 0
def run_chunked_externally(query, key, q_start_pos, block_size, stride, threshold, chunk_size):
"""
Run xattn_estimate_chunked with EXTERNAL chunking.
This simulates how chunked prefill should be used in practice.
"""
batch_size, num_heads, q_len, head_dim = query.shape
_, _, k_len, _ = key.shape
q_block_num = (q_len + block_size - 1) // block_size
k_block_num = (k_len + block_size - 1) // block_size
# If Q fits in one chunk, call directly
if q_len <= chunk_size:
with warnings.catch_warnings():
warnings.simplefilter("ignore")
return xattn_estimate_chunked(
query, key,
q_start_pos=q_start_pos,
block_size=block_size,
stride=stride,
threshold=threshold,
use_triton=True,
chunk_size=chunk_size,
)
# External chunking: split Q and call for each chunk
num_q_chunks = (q_len + chunk_size - 1) // chunk_size
print(f" External chunking: {num_q_chunks} chunks")
combined_attn_sum = torch.zeros(
batch_size, num_heads, q_block_num, k_block_num,
dtype=query.dtype, device=query.device
)
combined_mask = torch.zeros(
batch_size, num_heads, q_block_num, k_block_num,
dtype=torch.bool, device=query.device
)
q_block_offset = 0
for q_chunk_idx in range(num_q_chunks):
q_chunk_start = q_chunk_idx * chunk_size
q_chunk_end = min((q_chunk_idx + 1) * chunk_size, q_len)
q_chunk = query[:, :, q_chunk_start:q_chunk_end, :]
# For causal attention, K accumulates up to current Q position
k_end = q_start_pos + q_chunk_end
k_chunk = key[:, :, :k_end, :]
with warnings.catch_warnings():
warnings.simplefilter("ignore")
attn_sum_chunk, mask_chunk = xattn_estimate_chunked(
q_chunk, k_chunk,
q_start_pos=q_start_pos + q_chunk_start,
block_size=block_size,
stride=stride,
threshold=threshold,
use_triton=True,
chunk_size=chunk_size,
)
# Place chunk results into combined output
chunk_q_blocks = mask_chunk.shape[2]
chunk_k_blocks = mask_chunk.shape[3]
combined_attn_sum[:, :, q_block_offset:q_block_offset+chunk_q_blocks, :chunk_k_blocks] = attn_sum_chunk
combined_mask[:, :, q_block_offset:q_block_offset+chunk_q_blocks, :chunk_k_blocks] = mask_chunk
q_block_offset += chunk_q_blocks
return combined_attn_sum, combined_mask
def test_single_qkv(qkv_path):
"""Test a single QKV file."""
data = load_qkv(qkv_path)
query = data["query"].cuda().to(torch.bfloat16)
key = data["key"].cuda().to(torch.bfloat16)
seq_len = query.shape[2]
print(f"\nTesting with seq_len={seq_len}")
print("=" * 60)
# Run standard xattn_estimate
print("[1] Running standard xattn_estimate...")
try:
attn_sum_std, mask_std = xattn_estimate(
query, key,
block_size=BLOCK_SIZE,
stride=STRIDE,
threshold=THRESHOLD,
chunk_size=CHUNK_SIZE,
use_triton=True,
)
print(f" mask shape: {mask_std.shape}, density: {mask_std.float().mean().item():.4f}")
except Exception as e:
print(f" ERROR: {e}")
import traceback
traceback.print_exc()
return False
# Run chunked xattn_estimate with EXTERNAL chunking
print("[2] Running chunked xattn_estimate (external chunking)...")
try:
attn_sum_chunked, mask_chunked = run_chunked_externally(
query, key,
q_start_pos=0,
block_size=BLOCK_SIZE,
stride=STRIDE,
threshold=THRESHOLD,
chunk_size=CHUNK_SIZE,
)
print(f" mask shape: {mask_chunked.shape}, density: {mask_chunked.float().mean().item():.4f}")
except Exception as e:
print(f" ERROR: {e}")
import traceback
traceback.print_exc()
return False
# Compare results
print("[3] Comparing results...")
chunked_q_blocks = mask_chunked.shape[2]
chunked_k_blocks = mask_chunked.shape[3]
# Extract comparable region from standard mask
mask_std_comparable = mask_std[:, :, :chunked_q_blocks, :chunked_k_blocks]
# Compare masks
masks_match = compare_masks(mask_std_comparable, mask_chunked, "standard", "chunked")
# Compare attn_sums
attn_sum_std_comparable = attn_sum_std[:, :, :chunked_q_blocks, :chunked_k_blocks]
if attn_sum_std_comparable.shape == attn_sum_chunked.shape:
attn_diff = (attn_sum_std_comparable - attn_sum_chunked).abs().max().item()
print(f" Attn sum max diff: {attn_diff:.6f}")
else:
print(f" Attn sum shape mismatch")
# Clean up GPU memory
del query, key, attn_sum_std, mask_std, attn_sum_chunked, mask_chunked
torch.cuda.empty_cache()
return masks_match
# ============================================================
# Main Test
# ============================================================
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Test xattn_estimate vs xattn_estimate_chunked")
parser.add_argument("--qkv-dir", type=str, default=DEFAULT_QKV_DIR,
help="Directory containing QKV files")
args = parser.parse_args()
# QKV files to test
qkv_files = [
os.path.join(args.qkv_dir, "qkv_3688.pt"), # ~4K
os.path.join(args.qkv_dir, "qkv_7888.pt"), # ~8K
os.path.join(args.qkv_dir, "qkv_15685.pt"), # ~16K
os.path.join(args.qkv_dir, "qkv_32485.pt"), # ~32K
os.path.join(args.qkv_dir, "qkv_64891.pt"), # ~64K
]
available_files = [p for p in qkv_files if os.path.exists(p)]
if not available_files:
print(f"No QKV file found in {args.qkv_dir}.")
print(f"Expected files: qkv_3688.pt, qkv_7888.pt, qkv_15685.pt, qkv_32485.pt, qkv_64891.pt")
sys.exit(1)
print(f"Found {len(available_files)} QKV files to test")
print(f"Testing EXTERNAL chunking (chunk_size={CHUNK_SIZE})")
print(f"Using Triton kernels")
all_passed = True
results = []
for qkv_path in available_files:
passed = test_single_qkv(qkv_path)
seq_len = int(os.path.basename(qkv_path).replace("qkv_", "").replace(".pt", ""))
results.append((seq_len, passed))
if not passed:
all_passed = False
# Summary
print("\n" + "=" * 60)
print("SUMMARY")
print("=" * 60)
for seq_len, passed in results:
status = "PASSED" if passed else "FAILED"
chunks = (seq_len + CHUNK_SIZE - 1) // CHUNK_SIZE
print(f" seq_len={seq_len} ({chunks} chunk{'s' if chunks > 1 else ''}): {status}")
print("=" * 60)
if all_passed:
print("test_xattn_chunked: PASSED")
sys.exit(0)
else:
print("test_xattn_chunked: FAILED")
sys.exit(1)

View File

@@ -0,0 +1,365 @@
"""
Test: 验证 xattn_estimate 与 KV chunking kernels 的一致性
使用真实 KV cache 数据,对比:
1. xattn_estimate (高层 API)
2. 三阶段 KV chunking (softmax_compute_partial_stats + merge + normalize)
三阶段 KV chunking 流程:
1. softmax_compute_partial_stats: 计算每个 KV chunk 的 (m, l)
2. merge_softmax_stats: Host 端合并所有 chunks 的 stats
3. softmax_normalize_and_block_sum: 使用全局 stats 归一化
支持两种数据格式:
1. offload 模式保存: {"query", "key", "stride", "threshold", "density", "layer_id"}
2. GPU-only 模式保存: {"Q", "K", "chunk_size", "block_size", "stride", "threshold", "mask", "attn_sums", ...}
Usage:
# 使用 offload 模式数据
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_xattn_estimate_alignment.py
# 使用 GPU-only 模式数据
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_xattn_estimate_alignment.py --gpuonly
"""
import sys
sys.path.insert(0, "/home/zijie/Code/nano-vllm")
import argparse
import torch
import math
from nanovllm.ops.xattn import (
xattn_estimate,
flat_group_gemm_fuse_reshape,
softmax_compute_partial_stats,
softmax_normalize_and_block_sum,
merge_softmax_stats,
find_blocks_chunked,
)
# ============================================================
# 命令行参数
# ============================================================
parser = argparse.ArgumentParser()
parser.add_argument("--gpuonly", action="store_true", help="使用 GPU-only 模式保存的数据")
parser.add_argument("--data-file", type=str, default=None, help="数据文件路径")
parser.add_argument("--chunk-size", type=int, default=None, help="覆盖 CHUNK_SIZE (用于测试不同分块大小)")
args = parser.parse_args()
# ============================================================
# 参数配置
# ============================================================
if args.gpuonly:
DATA_FILE = args.data_file or "/home/zijie/Code/nano-vllm/results/mask_alignment/gpuonly_layer0.pt"
else:
DATA_FILE = args.data_file or "/home/zijie/Code/nano-vllm/results/kvcache/qkv_32485.pt"
device = "cuda"
# ============================================================
# Step 1: 加载真实数据
# ============================================================
print("=" * 60)
print("Step 1: 加载真实 KV cache 数据")
print("=" * 60)
data = torch.load(DATA_FILE, map_location="cpu")
# 检测数据格式并加载
if "Q" in data:
# GPU-only 模式保存的格式
print(f"[INFO] 检测到 GPU-only 模式数据格式")
Q = data["Q"].to(device)
K = data["K"].to(device)
BSA_BLOCK_SIZE = data.get("block_size", 128)
CHUNK_SIZE = data.get("chunk_size", 4096)
STRIDE = data.get("stride", 8)
THRESHOLD = data.get("threshold", 0.9)
if isinstance(THRESHOLD, torch.Tensor):
THRESHOLD = THRESHOLD.item()
# GPU-only 模式保存了 mask 和 attn_sums可以用于验证
saved_mask = data.get("mask", None)
saved_attn_sums = data.get("attn_sums", None)
saved_density = None # GPU-only 模式没有保存 density
layer_id = 0 # GPU-only 只保存 layer 0
else:
# offload 模式保存的格式
print(f"[INFO] 检测到 offload 模式数据格式")
Q = data["query"].to(device)
K = data["key"].to(device)
BSA_BLOCK_SIZE = 128
CHUNK_SIZE = 4096
STRIDE = data["stride"]
THRESHOLD = data["threshold"]
if isinstance(THRESHOLD, torch.Tensor):
THRESHOLD = THRESHOLD[0].item()
saved_mask = None
saved_attn_sums = None
saved_density = data.get("density", None)
layer_id = data.get("layer_id", 0)
batch_size, num_heads, seq_len, head_dim = Q.shape
# 命令行覆盖 CHUNK_SIZE
if args.chunk_size is not None:
CHUNK_SIZE = args.chunk_size
print(f"[INFO] 使用命令行指定的 CHUNK_SIZE={CHUNK_SIZE}")
print(f"Q shape: {Q.shape}")
print(f"K shape: {K.shape}")
if saved_density is not None:
print(f"Data layer_id: {layer_id}, saved density: {saved_density:.4f}")
else:
print(f"Data layer_id: {layer_id}")
print(f"使用参数: STRIDE={STRIDE}, THRESHOLD={THRESHOLD}, CHUNK_SIZE={CHUNK_SIZE}, BSA_BLOCK_SIZE={BSA_BLOCK_SIZE}")
print()
# ============================================================
# Step 2: 使用 xattn_estimate 高层 API
# ============================================================
print("=" * 60)
print("Step 2: 调用 xattn_estimate (高层 API)")
print("=" * 60)
attn_sums_api, mask_api = xattn_estimate(
Q, K,
block_size=BSA_BLOCK_SIZE,
stride=STRIDE,
threshold=THRESHOLD,
chunk_size=CHUNK_SIZE,
causal=True,
)
# 裁剪到有效区域
q_blocks = (seq_len + BSA_BLOCK_SIZE - 1) // BSA_BLOCK_SIZE
k_blocks = (seq_len + BSA_BLOCK_SIZE - 1) // BSA_BLOCK_SIZE
mask_api_valid = mask_api[:, :, :q_blocks, :k_blocks]
# 计算 density (causal)
causal_mask = torch.tril(torch.ones(q_blocks, k_blocks, device=device, dtype=torch.bool))
total_api = causal_mask.sum().item() * batch_size * num_heads
selected_api = (mask_api_valid & causal_mask.unsqueeze(0).unsqueeze(0)).sum().item()
density_api = selected_api / total_api
print(f"mask_api shape (padded): {mask_api.shape}")
print(f"mask_api_valid shape: {mask_api_valid.shape}")
print(f"[xattn_estimate] density: {density_api:.6f} (selected={selected_api}, total={total_api})")
print()
# ============================================================
# Step 3: 三阶段 KV Chunking
# ============================================================
print("=" * 60)
print("Step 3: 三阶段 KV Chunking")
print("=" * 60)
print(" 1) 每个 KV chunk 计算 partial stats")
print(" 2) Host 端合并 stats")
print(" 3) 使用全局 stats 归一化并计算 block sums")
print()
# 计算 padding 参数
k_num_to_pad = ((seq_len + CHUNK_SIZE - 1) // CHUNK_SIZE) * CHUNK_SIZE - seq_len
q_num_to_pad = ((seq_len + CHUNK_SIZE - 1) // CHUNK_SIZE) * CHUNK_SIZE - seq_len
q_chunk_num = (seq_len + q_num_to_pad) // CHUNK_SIZE
kv_chunk_num = (seq_len + k_num_to_pad) // CHUNK_SIZE
k_block_num = (seq_len + k_num_to_pad) // BSA_BLOCK_SIZE
q_block_num = (seq_len + q_num_to_pad) // BSA_BLOCK_SIZE
reshaped_chunk_size = CHUNK_SIZE // STRIDE
reshaped_block_size = BSA_BLOCK_SIZE // STRIDE
k_reshaped_seq_len = (seq_len + k_num_to_pad) // STRIDE
k_reshaped_num_to_pad = k_num_to_pad // STRIDE
num_blocks_per_chunk = reshaped_chunk_size // reshaped_block_size
kv_reshaped_chunk_size = CHUNK_SIZE // STRIDE
print(f"seq_len: {seq_len}, q_chunk_num: {q_chunk_num}, kv_chunk_num: {kv_chunk_num}")
print()
# Padding
if k_num_to_pad > 0:
K_padded = torch.nn.functional.pad(K, (0, 0, 0, k_num_to_pad), value=0)
else:
K_padded = K
if q_num_to_pad > 0:
Q_padded = torch.nn.functional.pad(Q, (0, 0, 0, q_num_to_pad), value=0)
else:
Q_padded = Q
# Softmax scale
norm = 1.0
scale = 1.4426950408889634 / math.sqrt(head_dim) / STRIDE / norm
simple_mask_list = []
for q_chunk_idx in range(q_chunk_num):
q_start = q_chunk_idx * reshaped_chunk_size * STRIDE
q_end = q_start + reshaped_chunk_size * STRIDE
Q_chunk = Q_padded[:, :, q_start:q_end, :]
chunk_start = (k_block_num - q_block_num) * reshaped_block_size + q_chunk_idx * reshaped_chunk_size
chunk_end = chunk_start + reshaped_chunk_size
# 阶段 1: 每个 KV chunk 计算 partial stats 和 raw scores
m_chunks = []
l_chunks = []
attn_weights_chunks = []
for kv_chunk_idx in range(kv_chunk_num):
kv_start = kv_chunk_idx * CHUNK_SIZE
kv_end = kv_start + CHUNK_SIZE
K_chunk = K_padded[:, :, kv_start:kv_end, :]
# KV offset in reshaped space
kv_offset_reshaped = kv_chunk_idx * kv_reshaped_chunk_size
# 计算 raw attention scores
attn_weights_kv = flat_group_gemm_fuse_reshape(
Q_chunk, K_chunk, STRIDE,
chunk_start=chunk_start,
chunk_end=chunk_end,
is_causal=False, # K 不完整,不能在这里用 causal
)
attn_weights_chunks.append(attn_weights_kv)
# 计算 partial stats (带 causal mask)
m_partial, l_partial = softmax_compute_partial_stats(
attn_weights_kv,
reshaped_block_size,
min(4096, reshaped_block_size),
scale,
chunk_start=chunk_start,
kv_offset=kv_offset_reshaped,
is_causal=True,
)
m_chunks.append(m_partial)
l_chunks.append(l_partial)
# 阶段 2: Host 端合并 stats
m_global, l_global = merge_softmax_stats(m_chunks, l_chunks)
# 阶段 3: 使用全局 stats 归一化并计算 block sums
attn_sum_per_kv = []
for kv_chunk_idx, attn_weights_kv in enumerate(attn_weights_chunks):
kv_offset_reshaped = kv_chunk_idx * kv_reshaped_chunk_size
attn_sum_kv = softmax_normalize_and_block_sum(
attn_weights_kv,
m_global,
l_global,
reshaped_block_size,
min(4096, reshaped_block_size),
chunk_start=chunk_start,
real_q_len=k_reshaped_seq_len - k_reshaped_num_to_pad,
scale=scale,
kv_offset=kv_offset_reshaped,
is_causal=True,
)
attn_sum_per_kv.append(attn_sum_kv)
# 拼接各 KV chunk 的 block sums
attn_sum_concat = torch.cat(attn_sum_per_kv, dim=-1)
# 选择 blocks
simple_mask = find_blocks_chunked(
attn_sum_concat,
current_index=k_block_num - q_block_num + q_chunk_idx * num_blocks_per_chunk,
threshold=THRESHOLD,
num_to_choose=None,
decoding=False,
mode="prefill",
causal=True,
)
simple_mask_list.append(simple_mask)
print(f" Q chunk {q_chunk_idx}: merged {kv_chunk_num} KV chunks, attn_sum shape={attn_sum_concat.shape}")
mask_kv_chunking = torch.cat(simple_mask_list, dim=2)
# 应用与 xattn_estimate 相同的 causal mask 后处理 (xattn.py 第 1300-1306 行)
mask_kv_chunking[:, :, -q_block_num:, -q_block_num:] = torch.where(
torch.tril(torch.ones(q_block_num, q_block_num, dtype=bool, device=device), diagonal=0),
mask_kv_chunking[:, :, -q_block_num:, -q_block_num:],
False,
)
mask_kv_chunking_valid = mask_kv_chunking[:, :, :q_blocks, :k_blocks]
selected_kv = (mask_kv_chunking_valid & causal_mask.unsqueeze(0).unsqueeze(0)).sum().item()
density_kv = selected_kv / total_api
print()
print(f"[KV chunking] density: {density_kv:.6f} (selected={selected_kv}, total={total_api})")
print()
# ============================================================
# Step 4: 对比结果
# ============================================================
print("=" * 60)
print("Step 4: 对比结果")
print("=" * 60)
print()
mask_total = mask_api_valid.numel()
mask_diff = (mask_api_valid != mask_kv_chunking_valid).sum().item()
print("| 方法 | density | 与 API 差异 | Mask 差异 |")
print("|------|---------|-------------|-----------|")
print(f"| xattn_estimate API | {density_api:.6f} | - | - |")
print(f"| KV chunking | {density_kv:.6f} | {abs(density_api - density_kv):.6f} | {100*mask_diff/mask_total:.4f}% |")
print()
passed = abs(density_api - density_kv) < 1e-6 and mask_diff / mask_total < 0.001
# ============================================================
# Step 5: 与 GPU-only 保存的数据对比 (如果有)
# ============================================================
if saved_mask is not None or saved_attn_sums is not None:
print("=" * 60)
print("Step 5: 与 GPU-only 保存的数据对比")
print("=" * 60)
print()
if saved_mask is not None:
saved_mask_gpu = saved_mask.to(device)
# 比较 mask
mask_saved_diff = (mask_api_valid != saved_mask_gpu).sum().item()
mask_saved_total = saved_mask_gpu.numel()
print(f"| xattn_estimate vs GPU-only saved mask | 差异 blocks: {mask_saved_diff} / {mask_saved_total} ({100*mask_saved_diff/mask_saved_total:.4f}%) |")
if mask_saved_diff == 0:
print("✅ mask 与 GPU-only 保存完全一致")
else:
print("❌ mask 与 GPU-only 保存存在差异")
passed = False
if saved_attn_sums is not None:
saved_attn_sums_gpu = saved_attn_sums.to(device)
# 需要从 xattn_estimate 获取 attn_sums
# 重新调用一次获取 attn_sums
attn_sums_check, _ = xattn_estimate(
Q, K,
block_size=BSA_BLOCK_SIZE,
stride=STRIDE,
threshold=THRESHOLD,
chunk_size=CHUNK_SIZE,
causal=True,
)
attn_sums_check_valid = attn_sums_check[:, :, :q_blocks, :k_blocks]
max_diff = (attn_sums_check_valid - saved_attn_sums_gpu).abs().max().item()
mean_diff = (attn_sums_check_valid - saved_attn_sums_gpu).abs().mean().item()
print(f"| xattn_estimate vs GPU-only saved attn_sums | max diff: {max_diff:.6e}, mean diff: {mean_diff:.6e} |")
if max_diff < 1e-5:
print("✅ attn_sums 与 GPU-only 保存一致")
else:
print("❌ attn_sums 与 GPU-only 保存存在差异")
passed = False
print()
if passed:
print("test_xattn_estimate_alignment: PASSED")
else:
print("test_xattn_estimate_alignment: FAILED")

View File

@@ -1,244 +0,0 @@
"""
Test: Compare xattn_estimate vs xattn_estimate_chunked
Verify that chunked estimation with EXTERNAL chunking produces the same mask
as standard estimation. This ensures the chunked version can be used in
chunked prefill scenarios without accuracy loss.
Usage:
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
python tests/test_xattn_estimate_chunked.py
"""
import sys
import traceback
import torch
from nanovllm.ops.xattn import xattn_estimate, xattn_estimate_chunked
# ============================================================
# Configuration
# ============================================================
# Configuration for xattn_estimate_chunked consistency test.
# Key requirements for 100% match:
# 1. Use matching chunk_size for both standard and chunked versions
# 2. Use same random seed for reproducibility
# Note: Tiny differences (~0.000001) may occur at boundary cases due to
# floating point precision in cumulative sum calculations.
BLOCK_SIZE = 64
STRIDE = 4
THRESHOLD = 0.9
CHUNK_SIZE = 4096 # External chunking size
# Test sequence lengths
TEST_SEQ_LENS = [4096, 8192, 16384, 32768]
# ============================================================
# Utility Functions
# ============================================================
def compare_masks(mask1, mask2, name1="standard", name2="chunked"):
"""Compare two masks and report differences."""
if mask1.shape != mask2.shape:
print(f" Shape mismatch: {name1}={mask1.shape}, {name2}={mask2.shape}")
return False
diff = (mask1 != mask2).sum().item()
total = mask1.numel()
match_rate = (total - diff) / total * 100
print(f" Match rate: {match_rate:.4f}% ({total - diff}/{total})")
if diff > 0:
diff_indices = torch.where(mask1 != mask2)
print(f" First 5 diff positions: {list(zip(*[idx[:5].tolist() for idx in diff_indices]))}")
return diff == 0
def run_chunked_externally(query, key, block_size, stride, threshold, chunk_size):
"""
Run xattn_estimate_chunked with EXTERNAL chunking.
This simulates how chunked prefill should be used in practice.
"""
batch_size, num_heads, q_len, head_dim = query.shape
_, _, k_len, _ = key.shape
q_block_num = (q_len + block_size - 1) // block_size
k_block_num = (k_len + block_size - 1) // block_size
# If Q fits in one chunk, call directly
if q_len <= chunk_size:
return xattn_estimate_chunked(
query, key,
q_start_pos=0,
block_size=block_size,
stride=stride,
threshold=threshold,
use_triton=True,
chunk_size=chunk_size,
)
# External chunking: split Q and call for each chunk
num_q_chunks = (q_len + chunk_size - 1) // chunk_size
print(f" External chunking: {num_q_chunks} chunks")
combined_attn_sum = torch.zeros(
batch_size, num_heads, q_block_num, k_block_num,
dtype=query.dtype, device=query.device
)
combined_mask = torch.zeros(
batch_size, num_heads, q_block_num, k_block_num,
dtype=torch.bool, device=query.device
)
q_block_offset = 0
for q_chunk_idx in range(num_q_chunks):
q_chunk_start = q_chunk_idx * chunk_size
q_chunk_end = min((q_chunk_idx + 1) * chunk_size, q_len)
q_chunk = query[:, :, q_chunk_start:q_chunk_end, :]
# For causal attention, K accumulates up to current Q position
# q_start_pos=0 means Q starts at position 0 in the full sequence
# K is [0, q_chunk_end) for causal attention
k_end = q_chunk_end
k_chunk = key[:, :, :k_end, :]
attn_sum_chunk, mask_chunk = xattn_estimate_chunked(
q_chunk, k_chunk,
q_start_pos=q_chunk_start,
block_size=block_size,
stride=stride,
threshold=threshold,
use_triton=True,
chunk_size=chunk_size,
)
# Place chunk results into combined output
chunk_q_blocks = mask_chunk.shape[2]
chunk_k_blocks = mask_chunk.shape[3]
combined_attn_sum[:, :, q_block_offset:q_block_offset+chunk_q_blocks, :chunk_k_blocks] = attn_sum_chunk
combined_mask[:, :, q_block_offset:q_block_offset+chunk_q_blocks, :chunk_k_blocks] = mask_chunk
q_block_offset += chunk_q_blocks
return combined_attn_sum, combined_mask
def test_single_seq_len(seq_len, num_heads=32, head_dim=128):
"""Test a single sequence length."""
print(f"\nTesting seq_len={seq_len}")
print("=" * 60)
# Generate random Q/K
query = torch.randn(1, num_heads, seq_len, head_dim, device="cuda", dtype=torch.bfloat16)
key = torch.randn(1, num_heads, seq_len, head_dim, device="cuda", dtype=torch.bfloat16)
# Run standard xattn_estimate
print("[1] Running standard xattn_estimate...")
try:
attn_sum_std, mask_std = xattn_estimate(
query, key,
block_size=BLOCK_SIZE,
stride=STRIDE,
threshold=THRESHOLD,
chunk_size=CHUNK_SIZE,
use_triton=True,
causal=True,
)
density_std = mask_std.float().mean().item()
print(f" mask shape: {mask_std.shape}, density: {density_std:.4f}")
except Exception as e:
print(f" ERROR: {e}")
traceback.print_exc()
return False
# Run chunked xattn_estimate with EXTERNAL chunking
print("[2] Running chunked xattn_estimate (external chunking)...")
try:
attn_sum_chunked, mask_chunked = run_chunked_externally(
query, key,
block_size=BLOCK_SIZE,
stride=STRIDE,
threshold=THRESHOLD,
chunk_size=CHUNK_SIZE,
)
density_chunked = mask_chunked.float().mean().item()
print(f" mask shape: {mask_chunked.shape}, density: {density_chunked:.4f}")
except Exception as e:
print(f" ERROR: {e}")
traceback.print_exc()
return False
# Compare results
print("[3] Comparing results...")
chunked_q_blocks = mask_chunked.shape[2]
chunked_k_blocks = mask_chunked.shape[3]
# Extract comparable region from standard mask
mask_std_comparable = mask_std[:, :, :chunked_q_blocks, :chunked_k_blocks]
# Compare masks
masks_match = compare_masks(mask_std_comparable, mask_chunked, "standard", "chunked")
# Compare attn_sums
attn_sum_std_comparable = attn_sum_std[:, :, :chunked_q_blocks, :chunked_k_blocks]
if attn_sum_std_comparable.shape == attn_sum_chunked.shape:
attn_diff = (attn_sum_std_comparable - attn_sum_chunked).abs().max().item()
print(f" Attn sum max diff: {attn_diff:.6f}")
else:
print(f" Attn sum shape mismatch: std={attn_sum_std_comparable.shape}, chunked={attn_sum_chunked.shape}")
# Clean up GPU memory
del query, key, attn_sum_std, mask_std, attn_sum_chunked, mask_chunked
torch.cuda.empty_cache()
return masks_match
# ============================================================
# Main Test
# ============================================================
if __name__ == "__main__":
print("XAttention Chunked vs Standard Test")
print("=" * 60)
print(f"Config: block_size={BLOCK_SIZE}, stride={STRIDE}, threshold={THRESHOLD}")
print(f"External chunk_size={CHUNK_SIZE}")
print()
# Check CUDA availability
if not torch.cuda.is_available():
print("CUDA not available!")
sys.exit(1)
print(f"Using GPU: {torch.cuda.get_device_name(0)}")
print("✓ xattn_estimate imported")
print("✓ xattn_estimate_chunked imported")
# Run tests
all_passed = True
results = []
for seq_len in TEST_SEQ_LENS:
passed = test_single_seq_len(seq_len)
chunks = (seq_len + CHUNK_SIZE - 1) // CHUNK_SIZE
results.append((seq_len, chunks, passed))
if not passed:
all_passed = False
# Summary
print("\n" + "=" * 60)
print("SUMMARY")
print("=" * 60)
for seq_len, chunks, passed in results:
status = "PASSED" if passed else "FAILED"
print(f" seq_len={seq_len:5d} ({chunks} chunk{'s' if chunks > 1 else ' '}): {status}")
print("=" * 60)
if all_passed:
print("ALL TESTS PASSED!")
sys.exit(0)
else:
print("SOME TESTS FAILED!")
sys.exit(1)

View File

@@ -1,129 +0,0 @@
"""
Test: XAttention Triton kernels
演示 XAttention 的两个核心 Triton kernel:
1. flat_group_gemm_fuse_reshape: 计算 stride reshape 后的 attention scores (反对角线求和)
2. softmax_fuse_block_sum: 对 attention scores 做 softmax 后按 block 求和
数据流:
Q [batch, heads, q_len, head_dim]
K [batch, heads, kv_len, head_dim]
↓ flat_group_gemm_fuse_reshape
attn_scores [batch, heads, q_len/stride, kv_len/stride]
↓ softmax_fuse_block_sum
block_sums [batch, heads, q_blocks, k_blocks]
"""
import torch
import sys
sys.path.insert(0, "/home/zijie/Code/nano-vllm")
from nanovllm.ops.xattn import flat_group_gemm_fuse_reshape, softmax_fuse_block_sum
# ============================================================
# 参数配置
# ============================================================
# Triton 约束: q_len >= stride * BLOCK_M, kv_len >= stride * BLOCK_N
# A100: BLOCK_M = BLOCK_N = 128, 所以 min = 4 * 128 = 512
# RTX 3090: BLOCK_M = BLOCK_N = 64, 所以 min = 4 * 64 = 256
q_len = 512
kv_len = 2048
head_dim = 128
stride = 4
block_size = 128 # softmax block size (in reshaped space)
segment_size = 128 # Triton kernel 要求 segment_size >= block_size
# ============================================================
# 构造输入: 偶数位置=1, 奇数位置=2
# ============================================================
Q = torch.zeros(1, 1, q_len, head_dim, dtype=torch.bfloat16).cuda()
K = torch.zeros(1, 1, kv_len, head_dim, dtype=torch.bfloat16).cuda()
for i in range(q_len):
if i % 2 == 0:
Q[0, 0, i, :] = 1
else:
Q[0, 0, i, :] = 2
for i in range(kv_len):
if i % 2 == 0:
K[0, 0, i, :] = 1
else:
K[0, 0, i, :] = 2
# ============================================================
# Step 1: flat_group_gemm_fuse_reshape (chunked along K)
# ============================================================
q_reshaped_len = q_len // stride # 128
kv_reshaped_len = kv_len // stride # 512
# 将 K 沿着长度维度分成多个 chunk
k_chunk_size = 512 # 每个 chunk 512 tokens
num_k_chunks = kv_len // k_chunk_size # 4 chunks
attn_scores_list = []
for k_chunk_idx in range(num_k_chunks):
k_start = k_chunk_idx * k_chunk_size
k_end = k_start + k_chunk_size
K_chunk = K[:, :, k_start:k_end, :] # [1, 1, k_chunk_size, head_dim]
# 对每个 K chunk 调用 flat_group_gemm_fuse_reshape
# 输出: [batch, heads, q_len/stride, k_chunk_size/stride]
attn_chunk = flat_group_gemm_fuse_reshape(
Q, K_chunk, stride,
chunk_start=0,
chunk_end=q_reshaped_len,
is_causal=False
)
attn_scores_list.append(attn_chunk)
# 拼接所有 K chunks 的结果
# 每个 chunk: [1, 1, q_reshaped_len, k_chunk_size/stride]
# 拼接后: [1, 1, q_reshaped_len, kv_reshaped_len]
attn_scores = torch.cat(attn_scores_list, dim=-1)
# 验证 shape: [batch, heads, q_len/stride, kv_len/stride]
assert attn_scores.shape == (1, 1, q_reshaped_len, kv_reshaped_len), \
f"shape mismatch: {attn_scores.shape} != (1, 1, {q_reshaped_len}, {kv_reshaped_len})"
# 验证: 反对角线求和
# 每个 stride x stride 块的反对角线: Q[奇]*K[偶] + Q[偶]*K[奇] = 2*1 + 1*2 = 4
# 反对角线有 stride/2 对,再乘以 head_dim
expected_gemm = (2*1 + 1*2) * (stride // 2) * head_dim
actual_gemm = attn_scores[0, 0, 0, 0].item()
assert actual_gemm == expected_gemm, f"flat_group_gemm: {actual_gemm} != {expected_gemm}"
# ============================================================
# Step 2: softmax_fuse_block_sum
# ============================================================
scale = 1.4426950408889634 # log2(e) for exp2
block_sums = softmax_fuse_block_sum(
attn_scores,
block_size,
segment_size,
chunk_start=0,
chunk_end=q_reshaped_len,
real_q_len=q_reshaped_len,
scale=scale,
is_causal=False
)
# 验证 shape: [batch, heads, q_blocks, k_blocks]
q_blocks = q_reshaped_len // block_size # 128 / 128 = 1
k_blocks = kv_reshaped_len // block_size # 512 / 128 = 4
assert block_sums.shape == (1, 1, q_blocks, k_blocks), \
f"shape mismatch: {block_sums.shape} != (1, 1, {q_blocks}, {k_blocks})"
# 验证: 每个 block 的 softmax 结果求和
# 所有 attn_scores 相同 → softmax 均匀分布
# 每行对一个 K block 的贡献 = block_size / kv_reshaped_len
# 每个 Q block 有 block_size 行
# block_sum = block_size * (block_size / kv_reshaped_len)
expected_sum = block_size * block_size / kv_reshaped_len
actual_sum = block_sums[0, 0, 0, 0].item()
assert actual_sum == expected_sum, f"softmax_fuse_block_sum: {actual_sum} != {expected_sum}"
print("test_xattn_kernels: PASSED")