Commit Graph

14 Commits

Author SHA1 Message Date
Zijie Tian
3da9b8aef2 ️ perf: optimize XAttention estimate phase with K-only loading
Add load_k_only_to_slot_layer() to OffloadEngine for estimate phase:
- Only load K (not K+V) during block selection in select_blocks()
- Reduces H2D transfer by 50% in estimate phase
- 64K context: XAttn/Full ratio drops from 1.48x to 0.99x
- 32K context: XAttn/Full ratio drops from 1.67x to 1.20x

The estimate phase uses flat_group_gemm_fuse_reshape(Q, K) which
only requires K for attention score computation. V is unused.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 06:24:20 +08:00
Zijie Tian
7b5d3b34eb 📈 feat: add NVTX markers to XAttention for profiling
Add NVTX range markers to track XAttention performance:
- GPU-only: xattn_estimate, xattn_bsa_compute
- Offload: xattn_estimate_gemm, xattn_estimate_softmax,
  xattn_estimate_find_blocks, xattn_compute_historical,
  xattn_compute_current, xattn_compute_merge

These markers enable detailed nsys profiling to identify
performance bottlenecks in estimate vs compute phases.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 00:57:11 +08:00
Zijie Tian
a504bd873d perf: pre-allocate GQA buffers in XAttention policy
Add alloc_policy_metadata() method to SparsePolicy base class for
pre-allocating GPU buffers during initialization. This avoids
dynamic memory allocation during forward pass.

Changes:
- Add alloc_policy_metadata() to SparsePolicy base class
- Implement GQA buffer pre-allocation in XAttentionBSAPolicy
- Call alloc_policy_metadata() in model_runner for GPU-only mode
- Modify compute_prefill() to reuse pre-allocated buffers
- Add --gpu-util parameter to bench.py

Memory savings:
- Previously: 2x GQA expansion (~2GB for 64K)
- Now: 1x pre-allocated buffer (~1GB for 64K, reused)

Tested:
- GPU-only 32K: 5602 tok/s (512MB pre-allocated)
- GPU-only 64K: 4821 tok/s (1GB pre-allocated, gpu_util=0.7)
- Offload Full: PASSED (no changes to offload path)
- Offload XAttention: PASSED (uses compute_chunked_prefill)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 05:49:23 +08:00
Zijie Tian
076656c9c2 feat: add GPU-only XAttention BSA sparse attention support
- Implement compute_prefill() in XAttentionBSAPolicy for GPU-only mode
  - Uses xattn_estimate to compute sparse block mask
  - Uses block_sparse_attn_func for efficient sparse attention
  - Handles GQA by expanding K/V heads
  - Falls back to flash_attn for paged KV cache (prefix cache)
- Implement compute_decode() by delegating to FullAttentionPolicy
- Add --policy xattn option to bench.py

Verified: RULER 32k niah_single_1 5/5 samples passed (100%)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 05:19:24 +08:00
Zijie Tian
aea3812230 ♻️ refactor: unify KV cache operations through OffloadEngine
- Add write_to_prefill_buffer() and write_to_decode_buffer() methods
- Add chunk_idx parameter to load_to_slot_layer() for NVTX labeling
- Replace direct copy_() calls with OffloadEngine methods in attention.py
- Update all load_to_slot_layer() calls to pass chunk_idx
- NVTX markers now show chunk info: "H2D: L{layer} Chunk{chunk} CPU[{block}]->Slot[{slot}]"

All KV cache data transfers in chunked offload mode now go through
OffloadEngine, enabling better profiling and consistent management.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 02:20:59 +08:00
Zijie Tian
ed3c8bb4b8 🐛 fix: memory leak in XAttentionBSAPolicy select_blocks
Fix severe memory leak (64GB -> 4GB growth) by:
- Remove unused sparse_metadata storage (was accumulating attn_scores)
- Delete intermediate tensor list (attn_scores_list) after use
- Explicitly delete intermediate tensors before return

Before: 16GB -> 80GB during 128K prefill
After:  16GB -> 19.8GB during 128K prefill

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 09:30:18 +08:00
Zijie Tian
5eb35982bf 🔧 feat: add density statistics tracking to sparse policies
Add statistics tracking to compare block selection between policies:
- XAttentionBSAPolicy: track available/selected blocks per chunk
- FullAttentionPolicy: track total blocks (always 100% density)
- Add reset_stats(), get_density_stats(), print_density_stats() methods
- Use logger.debug for per-chunk density logging

Results on 32K niah_single_1:
- Full: 100% density across all chunks
- XAttn BSA: 90% -> 73% density (saves ~25-30% blocks in later chunks)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 08:53:22 +08:00
Zijie Tian
4d1e40152d feat(xattn): implement compute_chunked_prefill with ring buffer pipeline
- Copy compute_chunked_prefill implementation from FullAttentionPolicy
- Set default threshold to 0.95 for accuracy testing
- Remove debug code (sys.exit, verbose prints)
- Use ring buffer pipeline for historical block loading
- Merge with current chunk attention using flash_attn_with_lse

RULER NIAH test passed with 5/5 samples (100% accuracy).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 08:27:40 +08:00
Zijie Tian
832b352afa feat(xattn): implement select_blocks with majority voting aggregation
Implement XAttention-based block selection for sparse attention:
- Use flat_group_gemm_fuse_reshape to compute Q@K^T attention scores
- Apply softmax_fuse_block_sum to aggregate into block-level attention
- Use find_blocks_chunked for threshold-based block selection
- Handle GQA by aggregating within KV head groups first
- Use majority voting (>50%) across heads instead of any() for better sparsity
- Align block_size with CPU offload block size (1024 tokens / stride = 128)

Test results show ~45% density at chunk 40 (down from 100% with any() aggregation).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 08:19:05 +08:00
Zijie Tian
a50b4c2ac2 ♻️ refactor: move select_blocks from policy to attention layer
Move block selection logic from compute_chunked_prefill/decode methods
to attention.py caller. This improves separation of concerns:

- attention.py now calls select_blocks() before compute_chunked_*()
- Policy methods receive pre-selected blocks via selected_blocks parameter
- Enables sparse policies to implement custom block selection without
  modifying the compute path

Changes:
- policy.py: Add selected_blocks parameter to abstract methods
- full_policy.py: Remove internal select_blocks calls, use passed blocks
- xattn_bsa.py: Sync signatures for prefill/decode methods
- attention.py: Add select_blocks calls before policy delegation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 05:21:28 +08:00
Zijie Tian
ca32ea6f93 [WIP] Before refactor the compute)_chunked_prefill. 2026-01-23 03:36:12 +08:00
Zijie Tian
999858e82f feat: add xattn kernels test and update testing rules
- Add test_xattn_kernels.py demonstrating flat_group_gemm_fuse_reshape
  and softmax_fuse_block_sum Triton kernels with structured data
- Update testing.md with new test code style guidelines
- Update xattn.py and xattn_bsa.py with improvements

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 03:01:25 +08:00
Zijie Tian
b97b0b96a0 [WIP] Before refactor the nanovllm sparse policy. 2026-01-19 22:34:44 +08:00
Zijie Tian
b5da802dff [WIP] Before integrate the xattn operator. 2026-01-19 21:19:21 +08:00