Commit Graph

248 Commits

Author SHA1 Message Date
Zijie Tian
52b12a89e3 📋 docs: add changelog for 2026-02-05
Document today's changes:
- GQA buffer OOM fix (saves 16GB for 1M seq in offload mode)
- Tests directory cleanup (removed 16 files, -4306 lines)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 03:16:39 +08:00
Zijie Tian
d35dd76e09 🗑️ chore: clean up tests directory to essential files only
Keep only core test files:
- test_ruler.py - main RULER benchmark
- test_xattn_estimate_alignment.py - XAttn kernel validation
- utils.py - shared utilities

Remove 8 files (recoverable from git history):
- bench_estimate_block_size.py
- modeling_qwen3.py
- test_chunk_attention_graph_reuse.py
- test_cudagraph_memory.py
- test_gpuonly_density_alignment.py
- test_hierarchical_estimate.py
- test_quest_policy.py
- test_sequential.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 03:13:50 +08:00
Zijie Tian
2b61c5ab57 🗑️ chore: remove test_needle* files
Remove needle tests (validation now covered by test_ruler.py):
- test_needle.py - basic needle-in-haystack test
- test_needle_ref.py - HuggingFace reference implementation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 03:11:28 +08:00
Zijie Tian
a709551072 🗑️ chore: remove redundant XAttention test files
Remove 6 obsolete test files:
- test_xattn_bsa.py - XAttn+BSA integration (covered by test_ruler)
- test_xattn_chunked.py - duplicate of test_xattn_estimate_chunked
- test_xattn_estimate_chunked.py - chunked prefill validation
- test_xattn_kernels.py - Triton kernel unit tests
- test_xattn_kv_chunking_batch.py - batch KV chunking validation
- test_chunk_attention_graph.py - superseded by graph_reuse version

Retained: test_xattn_estimate_alignment.py (critical kernel validation)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 03:11:21 +08:00
Zijie Tian
11a867f6fb 🐛 fix: skip GQA buffer allocation in XAttention offload mode
In offload mode, GQA expansion buffers (_k_expanded, _v_expanded) are not
needed since compute_chunked_prefill() handles GQA inline. Previously,
these buffers were always allocated based on max_model_len, causing OOM
on 24GB GPUs (e.g., RTX 3090) when max_model_len=1M (16GB buffer).

Changes:
- Add enable_cpu_offload parameter to alloc_policy_metadata() in base class
- Skip GQA buffer allocation when enable_cpu_offload=True in XAttentionBSAPolicy
- Pass enable_cpu_offload from model_runner to policy

Memory savings: ~16GB for 1M seq, ~1.1GB for 72K seq
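
The savings figures can be reproduced with simple arithmetic. The sketch below is only an illustration: it assumes 32 query heads, head_dim=128, and a 2-byte dtype (fp16/bf16), none of which is stated in this commit, and the function name is hypothetical.

```python
GIB = 1024 ** 3

def gqa_expanded_buffer_bytes(max_model_len, num_heads=32, head_dim=128, dtype_bytes=2):
    """Bytes for the two expanded buffers (_k_expanded and _v_expanded).

    Assumed shape per buffer: [max_model_len, num_heads, head_dim].
    The default head count, head size, and dtype width are assumptions.
    """
    per_buffer = max_model_len * num_heads * head_dim * dtype_bytes
    return 2 * per_buffer

print(gqa_expanded_buffer_bytes(1024 * 1024) / GIB)  # 1M tokens  -> 16.0
print(gqa_expanded_buffer_bytes(72 * 1024) / GIB)    # 72K tokens -> 1.125
```

Under these assumptions each buffer alone is 8 GiB at 1M tokens, which is why skipping the allocation matters on a 24GB card.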

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:57:18 +08:00
Zijie Tian
af4da454ba 📊 docs: add XAttention offload profiling analysis for 32K context
- Profile XAttn vs Full attention using nsys NVTX markers
- Key finding: estimate (41%) + find_blocks (37%) dominate, compute only 21%
- Chunk7 comparison: XAttn (38ms) vs Full (35ms) - XAttn slightly slower
- Identify optimization opportunities: reduce find_blocks overhead, merge estimate passes

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-05 02:49:59 +08:00
Zijie Tian
ef37d4f1a8 🐛 docs: document XAttention offload GQA buffer OOM issue
Document OOM issue when using XAttention BSA + CPU offload
with large models (GLM-4-9B) on 24GB GPUs.

Issue: 8GB allocation for k_expanded buffer fails due to
using num_heads instead of num_kv_heads in GQA models.

Root cause analysis and proposed fix included.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:46:50 +08:00
Zijie Tian
c8a5ef04c0 📝 docs: add test_ruler.py usage guide and rule
- Add comprehensive test_ruler.py usage guide with verified commands
- Add .claude/rules/test-ruler.md to enforce documentation-first approach
- Update CLAUDE.md documentation index

Tested commands on RTX 3090 (GPU 4):
- 32K/64K offload + XAttn BSA
- Multi-dataset, JSON output, quiet mode
- GLM-4 model support

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:46:44 +08:00
Zijie Tian
1c36d53570 🙈 chore: add ralph-tui session file to gitignore
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 02:00:44 +08:00
Zijie Tian
54fd302fa8 📝 docs: add XAttention density alignment verification results
- Add verification doc comparing GPU-only vs Offload mode density
- Test results: 32K (0.37% diff), 64K (0.09% diff) - alignment successful
- Both modes achieve 100% accuracy on RULER niah_single_1

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-05 01:59:11 +08:00
Zijie Tian
1eb7521994 📝 docs: add XAttention density types documentation
Document the difference between compute density (BSA block level)
and communication density (CPU block level).

Key finding: Even with 37% compute density, comm density can be 100%
due to any() aggregation across heads/Q-positions spreading sparse
blocks across all CPU blocks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:44:11 +08:00
Zijie Tian
51bd678335 📊 feat: distinguish compute density and communication density in DensityObserver
- Add record_comm_density() call in select_blocks to track CPU block selection
- Add get_per_layer_comm_density() method for detailed analysis
- Update print_summary() to show both densities and H2D savings ratio
- Set DensityObserver mode (offload/gpu_only) in test_ruler.py
- Update get_summary() to return both density types

Key insight: Comm density can be 100% even when compute density is ~37%
because sparse BSA blocks are distributed across all CPU blocks.
Since CPU block granularity is 32x coarser (4096 vs 128 tokens),
any() aggregation across heads/Q-blocks results in all CPU blocks being needed.
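
A toy illustration of this effect, assuming a made-up sparsity pattern (every third K block kept, shifted per head and Q block) and no causal mask. The function names are hypothetical and are not the DensityObserver API; only the 128-token and 4096-token block sizes come from the message above.

```python
BSA_BLOCK = 128                  # tokens per BSA block (from the commit message)
CPU_BLOCK = 4096                 # tokens per CPU block (from the commit message)
RATIO = CPU_BLOCK // BSA_BLOCK   # 32 BSA blocks per CPU block

def make_mask(num_heads, num_q_blocks, num_k_blocks):
    """Hypothetical pattern: each (head, q_block) keeps every third K block."""
    return [[[(k + h + q) % 3 == 0 for k in range(num_k_blocks)]
             for q in range(num_q_blocks)]
            for h in range(num_heads)]

def compute_density(mask):
    """Fraction of selected BSA blocks over all (head, q_block, k_block) cells."""
    cells = [cell for head in mask for row in head for cell in row]
    return sum(cells) / len(cells)

def comm_density(mask, ratio=RATIO):
    """Fraction of CPU blocks needed by at least one head or Q block (any())."""
    num_cpu_blocks = len(mask[0][0]) // ratio
    needed = [
        any(row[k] for head in mask for row in head
            for k in range(c * ratio, (c + 1) * ratio))
        for c in range(num_cpu_blocks)
    ]
    return sum(needed) / len(needed)

mask = make_mask(num_heads=4, num_q_blocks=4, num_k_blocks=64)
print(f"compute density: {compute_density(mask):.2f}")  # about one third
print(f"comm density:    {comm_density(mask):.2f}")     # every CPU block is needed
```

Because a CPU block must be transferred if any head or Q block touches any of its 32 BSA blocks, even a sparse mask can require every CPU block.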

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:43:17 +08:00
Zijie Tian
1ea5afd886 📝 docs: add XAttention offload stream sync fix documentation
- Document the CUDA stream synchronization bug in XAttention BSA
- Include root cause analysis with stream timing diagrams
- Add test commands and verification results (100% accuracy)
- Update CLAUDE.md documentation index

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:32:50 +08:00
Zijie Tian
829b311c02 🐛 fix: stream synchronization for XAttention estimate kernels in offload mode
- Wrap all compute kernels in select_blocks with compute_stream context
  (Pass 1 historical blocks, Pass 1 current chunk, Step 2 merge,
   Pass 2 historical blocks, Pass 2 current chunk, Step 4 block selection)
- Fix K data mismatch between Pass 1 and Pass 2 by ensuring wait_slot_layer
  syncs with compute_stream where kernels actually run
- Remove STRONG SYNC code from offload_engine.py (now handled by events)
- Remove debug print statements and torch.save code
- Consolidate fallback conditions in compute_with_xattn
- Change default chunk_size from 16384 to 4096 for density alignment

The bug caused Pass 1 and Pass 2 to see different K data from the same
CPU block because compute kernels ran on default stream while
wait_slot_layer only synced compute_stream.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 01:30:23 +08:00
Zijie Tian
dd0472aea8 [plugin] Added ralph-tui setup.
2026-02-05 01:27:53 +08:00
Zijie Tian
a1c68a733e 📊 docs: add XAttention memory benchmark for 24GB GPUs
- Add memory analysis for Qwen3-0.6B @ 32K context
- Document 24GB VRAM feasibility (RTX 3090/4090)
- Recommend gpu-utilization=0.28 for 24GB GPUs
- Include KV cache breakdown and model estimations
- Update CLAUDE.md index

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-02 14:38:27 +08:00
Zijie Tian
dc51972777 📝 docs: update density alignment test with Offload mode results
- Rename doc to "Density Alignment Test Results" (covers both modes)
- Add Offload mode test results (3.7K-64.9K tokens, all passed)
- Add Layer 5 GPU-only test results (threshold=0.9, density=6.24%)
- Enhance test script to support both GPU-only and Offload data formats
- Add batch testing commands for all data files
- Update CLAUDE.md index

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-02 14:22:40 +08:00
Zijie Tian
232fcf043e 📝 docs: add GPU-only density alignment test results
Document test results verifying XAttention density calculation in
GPU-only mode matches independent xattn_estimate calls.

Test results (Llama-3.1-8B-Instruct, threshold=0.9):
- 4k:  Layer 0 density 63.8%, verified
- 8k:  Layer 0 density 65.0%, verified
- 16k: Layer 0 density 61.6%, verified
- 32k: Layer 0 density 50.2%, verified
- 64k: Layer 0 density 37.0%, verified

All tests show exact match (attn_sums diff=0, mask exact match).

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-02 11:22:34 +08:00
Zijie Tian
aeed6ccdfb test: add GPU-only density alignment verification test
Add test to verify XAttention density calculation in GPU-only mode
matches independent xattn_estimate calls.

Changes:
- Add tests/test_gpuonly_density_alignment.py: loads saved Q/K from
  xattn_bsa.py, calls xattn_estimate independently, compares results
- Enhance debug save in xattn_bsa.py: now saves Q, K tensors and
  xattn_estimate parameters for external verification
- Set _DEBUG_SAVE_MASK = False by default

Usage:
1. Set _DEBUG_SAVE_MASK = True in xattn_bsa.py
2. Run GPU-only inference with XAttention (e.g., test_ruler.py)
3. Run tests/test_gpuonly_density_alignment.py to verify alignment

Verified on 4k/8k/16k/32k/64k contexts - all pass with exact match.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-02 11:14:46 +08:00
Zijie Tian
6c55c4d2a3 ♻️ refactor: rewrite select_blocks with 3-stage KV chunking algorithm
Implement correct 3-stage KV chunking for XAttention offload mode:
- Stage 1: Compute partial softmax stats (m, l) for each KV chunk
- Stage 2: Merge all partial stats to get global normalization factors
- Stage 3: Normalize with global stats and compute block sums

Key fixes:
- Add wait_all_prefill_offloads() before loading CPU blocks to ensure
  async offload completion (fixes stale data bug)
- Pre-allocate m/l partial buffers and block_sums buffer

This produces identical density to GPU-only xattn_estimate while using
O(S×C) peak memory instead of O(S²).
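
The three stages can be sketched in pure Python as follows. This is a simplified illustration, not the Triton kernels: it works per query row, ignores the causal mask, and all function names are hypothetical.

```python
import math

def partial_stats(chunk):
    """Stage 1: per query row, the chunk-local max m and l = sum(exp(s - m))."""
    stats = []
    for row in chunk:
        m = max(row)
        stats.append((m, sum(math.exp(s - m) for s in row)))
    return stats

def merge_stats(per_chunk):
    """Stage 2: fold the per-chunk (m, l) pairs into one global pair per row."""
    merged = []
    for row_stats in zip(*per_chunk):
        m_global = max(m for m, _ in row_stats)
        l_global = sum(l * math.exp(m - m_global) for m, l in row_stats)
        merged.append((m_global, l_global))
    return merged

def normalized_block_sums(chunk, global_stats, block_size):
    """Stage 3: normalize with the global stats, then sum per KV block."""
    sums = []
    for row, (m, l) in zip(chunk, global_stats):
        probs = [math.exp(s - m) / l for s in row]
        sums.append([sum(probs[i:i + block_size])
                     for i in range(0, len(probs), block_size)])
    return sums

scores = [[0.5, 2.0, -1.0, 0.3, 1.2, 0.0, -0.7, 0.9],
          [1.5, -0.2, 0.4, 2.2, 0.1, -1.3, 0.8, 0.6]]
chunk_len, block_size = 4, 2

# Split the KV axis into chunks; only one chunk of scores is live at a time.
chunks = [[row[i:i + chunk_len] for row in scores]
          for i in range(0, len(scores[0]), chunk_len)]
stats = merge_stats([partial_stats(c) for c in chunks])

chunked = [[] for _ in scores]
for c in chunks:
    for r, sums in enumerate(normalized_block_sums(c, stats, block_size)):
        chunked[r].extend(sums)

# Reference: a single pass over the full score rows.
reference = normalized_block_sums(scores, partial_stats(scores), block_size)
```

Only the small (m, l) pairs and the block sums persist across chunks, which is where the O(S×C) peak comes from in this simplified view.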

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-02 10:10:10 +08:00
Zijie Tian
6e34efd58a 📝 docs: add storage overhead analysis and batch tests for KV chunking
- Update xattn_kv_chunking_kernels.md with:
  - Detailed storage overhead analysis (O(S) vs O(S²))
  - Peak memory optimization (8x reduction)
  - Support for independent Q/KV chunk sizes
  - Batch verification results (3K-64K seqlen)
  - ASCII pipeline diagram

- Add test_xattn_kv_chunking_batch.py for batch validation
- Fix causal mask post-processing in alignment test
- Update CLAUDE.md documentation index

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-01 19:22:36 +08:00
Zijie Tian
5acd5558d6 feat: add KV chunking support for XAttention softmax kernels
Implement three-phase KV chunking for sparse attention estimation:
1. softmax_compute_partial_stats: compute (m, l) per KV chunk
2. merge_softmax_stats: merge partial stats on host
3. softmax_normalize_and_block_sum: normalize with global stats

This allows computing sparse attention masks without storing full
raw attention scores in GPU memory, reducing peak memory usage
from O(q_len * k_full_len) to O(q_len * k_chunk_len).

Key changes:
- Add softmax_partial_stats_kernel with causal mask support
- Add softmax_normalize_block_sum_kernel with kv_offset parameter
- Add Python wrappers for new kernels
- Update test script to validate KV chunking alignment
- Add documentation for the new kernels

Test results show near-exact alignment with the xattn_estimate API:
- Density difference: 0.000000
- Mask difference: 0.0044%

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-01 18:53:26 +08:00
Zijie Tian
193ef55d18 ♻️ refactor: use Q-chunked processing in xattn alignment test
Match xattn_estimate internal logic by processing Q in chunks:
- Reduces peak memory for attn_scores tensor
- Enables testing 64K sequences without OOM
- All 5 test files pass (3.6K to 64K)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-01 18:08:15 +08:00
Zijie Tian
f173a3f7f5 test: add xattn_estimate vs low-level kernels alignment test
Test that xattn_estimate produces the same results as manually calling:
- flat_group_gemm_fuse_reshape
- softmax_fuse_block_sum
- find_blocks_chunked

Uses real KV cache data from results/kvcache/ directory.
Verifies density calculation matches between high-level API and kernels.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-01 17:49:37 +08:00
Zijie Tian
8035e4db3d 📝 docs: add XAttention KV chunking density test results
Document the verification test for XAttention Triton kernel KV chunking:
- 32K and 64K test results with threshold 0.9/0.95/1.0
- Key finding: threshold=1.0 achieves alignment (~0% diff)
- threshold<1.0 shows 10-13% difference due to per-chunk threshold application
- Conclusion: softmax normalization is correct, issue is threshold accumulation

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-01 17:36:19 +08:00
Zijie Tian
8ab53e7331 🚧 WIP: add DEBUG code for XAttention KV chunking density verification
Add instrumentation to compare GPU-only vs Offload mode density:
- Layer 0 DEBUG output for both modes
- Accumulate selected/total counts across chunks
- Proper causal mask with Q offset handling
- Skip normal offload logic for isolated testing

Test results (threshold=1.0 achieves alignment):
- 32K: GPU-only 0.9999, Offload 0.9999 (diff ~0%)
- 64K: GPU-only 0.9995, Offload 0.9995 (diff ~0%)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-02-01 17:33:23 +08:00
Zijie Tian
2e96d1d97d WIP: Enhance sparse attention with density tracking and block selection improvements
- Added analysis documentation for xattn density alignment.
- Refactored ModelRunner to pre-allocate policy metadata buffers regardless of CPU offload configuration.
- Updated FullAttentionPolicy and SparsePolicy to accept query and key tensors for block selection.
- Enhanced QuestPolicy to utilize query tensor for block selection and improved handling of selected blocks.
- Expanded XAttentionBSAPolicy to support chunked prefill and improved attention score computation with historical and current chunk handling.
- Introduced DensityObserver to track compute and communication density for sparse attention layers.
- Updated attention layer to ensure block selection is always called, improving robustness in first chunk scenarios.
- Added tests for attention kernel behavior with enhanced input patterns.
2026-01-31 14:48:23 +08:00
Zijie Tian
f6ac4ccdde feat: add DensityObserver for XAttention sparse attention density tracking
- Add DensityObserver class to track per-layer density statistics
- Integrate DensityObserver into compute_prefill for GPU-only mode
- Fix stride parameter not being passed to xattn_estimate
- Add density statistics output to test_ruler.py for XATTN_BSA
- Add comprehensive density benchmark documentation

Key changes:
- nanovllm/utils/density_observer.py: New Observer for density tracking
- xattn_bsa.py: Add stride param to xattn_estimate, integrate DensityObserver
- test_ruler.py: Enable DensityObserver and print summary for XATTN_BSA
- docs/xattn_density_benchmark.md: Benchmark results for 4K-32K contexts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 16:26:56 +08:00
Zijie Tian
4484a1482c [refactor] Refactor profile_offload.sh
2026-01-29 08:39:34 +08:00
Zijie Tian
e436ec861f ⚙️ config: update test_ruler.py defaults
- max_new_tokens: 128 → 16 (sufficient for NIAH answers)
- block_size: 1024 → 4096 (better performance)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 14:21:23 +08:00
Zijie Tian
45efcf0db1 feat: add --dtype parameter to test_ruler.py
Support models with float32 default dtype (e.g., Nemotron).
FlashAttention requires fp16/bf16, so dtype must be specified.

Usage: --dtype bfloat16

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 13:56:15 +08:00
Zijie Tian
e09a2a5b10 feat: add Qwen2/2.5 model support
Separate Qwen2 from Qwen3 implementation:
- Qwen2: Uses QKV bias, no QK norm
- Qwen3: Optional QK norm when QKV bias is absent

Tested with Qwen2.5-7B-Instruct-1M, RULER niah_single_1 passed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 13:44:32 +08:00
Zijie Tian
a239bfb40d 📚 docs: add new model integration guide
Summarizes lessons learned from GLM-4 integration:
- Config field mapping (multi_query_group_num, kv_channels, etc.)
- RoPE variants (interleaved vs half, partial vs full rotation)
- EOS token handling for multi-EOS models
- Weight name conversion patterns
- Verification checklist

Also updates CLAUDE.md to reflect GLM-4 support.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 13:36:24 +08:00
Zijie Tian
29e102720b 🐛 fix: support multiple EOS tokens for GLM-4
GLM-4 uses multiple EOS tokens [151329, 151336, 151338] where 151336
(<|user|>) should also stop generation. Previously only the first EOS
from tokenizer was used, causing generation to always hit max_tokens.

Changes:
- config.py: Change eos type to int | list[int]
- llm_engine.py: Read eos_token_id from hf_config (contains full list)
- scheduler.py: Use set for efficient multi-EOS lookup
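
A minimal sketch of the multi-EOS check, assuming hypothetical helper names (`normalize_eos`, `should_stop`) that are not taken from scheduler.py:

```python
from typing import FrozenSet, List, Union

def normalize_eos(eos: Union[int, List[int]]) -> FrozenSet[int]:
    """Accept a single EOS id or a list of ids; return a set for O(1) lookup."""
    if isinstance(eos, int):
        return frozenset({eos})
    return frozenset(eos)

def should_stop(token_id: int, eos_ids: FrozenSet[int],
                num_generated: int, max_tokens: int) -> bool:
    """Stop on any EOS token, or when the generation budget is used up."""
    return token_id in eos_ids or num_generated >= max_tokens

eos_ids = normalize_eos([151329, 151336, 151338])  # GLM-4 ids from the message above
print(should_stop(151336, eos_ids, num_generated=5, max_tokens=16))  # True
print(should_stop(42, eos_ids, num_generated=5, max_tokens=16))      # False
```

With only the first id in the set, token 151336 would not match and generation would run to max_tokens, which is the behavior this commit fixes.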

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 13:23:53 +08:00
Zijie Tian
726e4b58cf feat: add GLM-4-9B-Chat-1M model support
Add support for GLM-4 model architecture with the following changes:

- Add glm4.py with ChatGLMForCausalLM, GLM4Model, GLM4Attention, GLM4MLP
- Add GLM4RotaryEmbedding with interleaved partial rotation (rotary_dim = head_dim // 2)
- Add apply_rotary_emb_interleaved function for GLM-4 style RoPE
- Add GLM-4 weight name conversion and loading in loader.py
- Add GLM-4 chat template conversion in test_ruler.py
- Add trust_remote_code=True for GLM-4 config loading

Key GLM-4 specific adaptations:
- QKV bias enabled (add_qkv_bias: true)
- RoPE with rope_ratio scaling (base = 10000 * rope_ratio)
- Interleaved RoPE (pairs adjacent elements, not first/second half)
- Partial rotation (only half of head_dim is rotated)
- Uses multi_query_group_num instead of num_key_value_heads
- Uses kv_channels instead of head_dim
- Uses ffn_hidden_size instead of intermediate_size
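
The interleaved, partial rotation can be sketched for a single head vector as follows. This is an illustration under assumptions, not the code in glm4.py: it assumes the rotated half is the first half of head_dim and that frequencies follow the usual base^(-2i/rotary_dim) schedule. The function name is hypothetical.

```python
import math

def rotate_interleaved_partial(x, position, rope_ratio=1.0):
    """Apply interleaved RoPE to the first half of one head vector.

    x: list of floats of length head_dim (head_dim divisible by 4).
    """
    head_dim = len(x)
    rotary_dim = head_dim // 2        # partial rotation: half of head_dim
    base = 10000.0 * rope_ratio       # rope_ratio scaling, as in the message above
    out = list(x)                     # the un-rotated tail passes through unchanged
    for i in range(rotary_dim // 2):
        theta = position * base ** (-2.0 * i / rotary_dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]  # interleaved: adjacent pair (2i, 2i+1)
        out[2 * i] = a * cos_t - b * sin_t
        out[2 * i + 1] = a * sin_t + b * cos_t
    return out

vec = [0.1 * (i + 1) for i in range(8)]  # head_dim = 8, rotary_dim = 4
rotated = rotate_interleaved_partial(vec, position=3)
print(rotated[4:] == vec[4:])  # True: the second half is not rotated
```

A half-split RoPE would instead pair element i with element i + rotary_dim/2, which is why weights trained with one layout cannot be used with the other without permutation.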

Tested with RULER niah_single_1 (5 samples): 100% accuracy
Both GPU-only and CPU offload modes verified

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 13:15:57 +08:00
Zijie Tian
8d19e61446 perf: replace Triton merge with FlashInfer merge_state
Use FlashInfer's optimized merge_state kernel for attention output merging
in chunked prefill. End-to-end improvement: +0.8% (32K) to +2.4% (64K).

Key changes:
- Add merge_attention_outputs_flashinfer() with LSE format conversion
- FlashInfer uses log2, flash_attn uses ln: convert via LOG2_E/LN_2
- Keep original Triton kernel for fallback
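
The LSE base conversion and the merge itself can be sketched in pure Python. This is a scalar toy (one query, scalar keys and values) with hypothetical function names; it does not call FlashInfer or flash_attn, and the log2-versus-ln convention is taken from the message above.

```python
import math

LOG2_E = math.log2(math.e)  # multiply a natural-log LSE by this to get base 2
LN_2 = math.log(2.0)        # multiply a base-2 LSE by this to get natural log

def ln_to_log2(lse_ln):
    return lse_ln * LOG2_E

def log2_to_ln(lse_log2):
    return lse_log2 * LN_2

def attend(q, keys, values):
    """Toy attention for one scalar query; returns (output, natural-log LSE)."""
    scores = [q * k for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = sum(w * v for w, v in zip(weights, values)) / total
    return out, m + math.log(total)

def merge(o1, lse1, o2, lse2):
    """Merge two partial attention results given natural-log LSE values."""
    m = max(lse1, lse2)
    w1, w2 = math.exp(lse1 - m), math.exp(lse2 - m)
    return (o1 * w1 + o2 * w2) / (w1 + w2), m + math.log(w1 + w2)

q, keys, values = 1.3, [0.2, -0.5, 1.0, 0.7], [1.0, 2.0, 3.0, 4.0]
full_out, full_lse = attend(q, keys, values)
o1, lse1 = attend(q, keys[:2], values[:2])   # first KV chunk
o2, lse2 = attend(q, keys[2:], values[2:])   # second KV chunk
merged_out, merged_lse = merge(o1, lse1, o2, lse2)
```

Merging the two chunk results reproduces the full-sequence output, and a round trip through the base-2 form leaves the LSE unchanged up to floating-point error.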

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 10:04:38 +08:00
Zijie Tian
4484ebbb77 📚 docs: add 1M+ context length models reference list
- Add comprehensive list of 1M+ context models from Hugging Face
- Categorize by type: text-only LLM vs vision-language models
- Separate ≤10B (practical) from >10B (resource-intensive) models
- Include Qwen, GLM, InternLM, Llama, MiniMax, Gradient AI series
- Add VRAM requirements and technical comparison table
- Update CLAUDE.md documentation index

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 09:04:55 +08:00
Zijie Tian
2c2383c786 perf: optimize XAttention estimate with hierarchical block sum
Replace slow softmax_fuse_block_sum (block_size=4096) with optimized
hierarchical approach (estimate_block_size=1024):

- Add estimate_block_size parameter to XAttentionBSAPolicy (default 1024)
- Rewrite select_blocks to use hierarchical aggregation:
  1. Fine-grained softmax with small block size (15x faster kernel)
  2. Aggregate to CPU block level via reshape + sum
  3. Score + threshold selection (replaces mask + voting)
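
The aggregation step can be sketched on a single row of attention probabilities. The block sizes are scaled down (2 and 8 instead of 1024 and 4096), and the selection rule shown (keep the highest-scoring blocks until their cumulative share reaches the threshold) is an assumption about what "score + threshold" means here. All names are hypothetical.

```python
def block_sums(probs, block_size):
    """Sum probabilities over consecutive blocks of block_size tokens."""
    return [sum(probs[i:i + block_size]) for i in range(0, len(probs), block_size)]

def aggregate(fine_sums, factor):
    """Collapse `factor` fine blocks into one coarse block (reshape + sum)."""
    return [sum(fine_sums[i:i + factor]) for i in range(0, len(fine_sums), factor)]

def select_by_threshold(scores, threshold):
    """Assumed rule: take blocks by descending score until coverage >= threshold."""
    total = sum(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected, covered = [], 0.0
    for i in order:
        if covered >= threshold * total:
            break
        selected.append(i)
        covered += scores[i]
    return sorted(selected)

raw = [float(i + 1) for i in range(16)]
probs = [v / sum(raw) for v in raw]          # one normalized attention row

fine = block_sums(probs, block_size=2)       # stands in for estimate_block_size
coarse = aggregate(fine, factor=4)           # stands in for the CPU block level
direct = block_sums(probs, block_size=8)     # direct coarse-level block sum

print(select_by_threshold(coarse, threshold=0.7))  # [1]
print(select_by_threshold(coarse, threshold=0.9))  # [0, 1]
```

The hierarchical and direct sums agree up to floating-point rounding, so the finer kernel block size changes only the cost of the estimate, not its result.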

Performance improvement (CPU Offload mode):
- softmax_fuse_block_sum: 48% → 1% of total time (44x faster)
- 128K: XAttention now 2.4% faster than Full (was 59% slower)
- 64K: 3.8% slower than Full (was 21% slower)
- 32K: 6.0% slower than Full (was 14% slower)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 06:47:13 +08:00
Zijie Tian
f049971f84 test: add hierarchical block sum estimation validation
Validate the hierarchical estimation approach for XAttention:
- Test 1: Math equivalence (diff = 0.0) between hierarchical and direct
- Test 2: Score + threshold selection strategy (replaces mask + voting)
- Test 3: Performance benchmark (41x speedup)

Uses pure torch + xattn kernels, independent of nanovllm framework.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 06:24:35 +08:00
Zijie Tian
c90dc196b2 📝 docs: add estimate block_size performance analysis
Document the performance impact of block_size on softmax_fuse_block_sum:
- Current 4096 (reshaped 512) is the WORST point: 95ms
- Optimal 1024 (reshaped 128): 6ms - 15x faster
- Performance follows U-shaped curve

Add tests/bench_estimate_block_size.py for benchmarking and propose
hierarchical block sum approach for optimization.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 06:24:28 +08:00
Zijie Tian
3da9b8aef2 perf: optimize XAttention estimate phase with K-only loading
Add load_k_only_to_slot_layer() to OffloadEngine for estimate phase:
- Only load K (not K+V) during block selection in select_blocks()
- Reduces H2D transfer by 50% in estimate phase
- 64K context: XAttn/Full ratio drops from 1.48x to 0.99x
- 32K context: XAttn/Full ratio drops from 1.67x to 1.20x

The estimate phase uses flat_group_gemm_fuse_reshape(Q, K) which
only requires K for attention score computation. V is unused.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 06:24:20 +08:00
Zijie Tian
a832d127b6 feat: add nsys-profiler agent for kernel performance analysis
Add a specialized agent for NVIDIA Nsys profiling that handles:
- Profile data collection using framework scripts
- Statistical analysis of kernel timing and memory transfers
- Timeline analysis for GPU-CPU overlap efficiency
- Comparative analysis between different configurations

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 06:24:09 +08:00
Zijie Tian
39d12a0416 📈 feat: add MemoryObserver for GPU-CPU communication tracking
Implement MemoryObserver to track memory transfers between GPU and CPU:
- H2D (Host to Device): CPU → GPU transfers
- D2H (Device to Host): GPU → CPU transfers
- D2D (Device to Device): GPU buffer copies
- Supports prefill/decode phase separation

Integration points in offload_engine.py:
- load_to_slot_layer: H2D with is_prefill parameter
- offload_slot_layer_to_cpu, offload_prefill_buffer_async: D2H
- write_to_prefill_buffer, write_to_decode_buffer: D2D
- load_block_sample_from_cpu, load_block_full_from_cpu: H2D
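
A minimal sketch of how such per-direction, per-phase accounting might look. `TransferCounter` and its methods are hypothetical stand-ins, not the MemoryObserver API.

```python
from collections import defaultdict

class TransferCounter:
    """Hypothetical byte counter keyed by transfer direction and phase."""
    DIRECTIONS = ("H2D", "D2H", "D2D")
    PHASES = ("prefill", "decode")

    def __init__(self):
        self.enabled = True
        self._bytes = defaultdict(int)  # (direction, phase) -> bytes

    def record(self, direction, num_bytes, is_prefill):
        """Add one transfer; ignored while the counter is disabled."""
        if not self.enabled:
            return
        if direction not in self.DIRECTIONS:
            raise ValueError(f"unknown direction: {direction}")
        phase = "prefill" if is_prefill else "decode"
        self._bytes[(direction, phase)] += num_bytes

    def get_summary(self):
        """Return totals for every (phase, direction) pair, zero if unseen."""
        return {f"{phase}_{d}": self._bytes.get((d, phase), 0)
                for phase in self.PHASES for d in self.DIRECTIONS}

counter = TransferCounter()
counter.record("H2D", 4096, is_prefill=True)   # e.g. a load into a GPU slot
counter.record("H2D", 1024, is_prefill=True)
counter.record("D2H", 2048, is_prefill=False)  # e.g. an offload back to CPU
print(counter.get_summary())
```

Each integration point listed above would call something like `record(...)` with the size of the tensor it moves.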

Add bench_offload.py integration for memory stats printing.

Benchmark results (Llama-3.1-8B, 64K context):
- Full Policy: Prefill H2D 262.13 GB
- XAttention: Prefill H2D 386.62 GB (1.48x)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 04:06:45 +08:00
Zijie Tian
c16bfcf40f ♻️ refactor: restructure Observer as base class with InferenceObserver
- Refactor Observer into base class with common enable/disable/reset interface
- Create InferenceObserver subclass for TTFT/TPOT metrics
- Fix TTFT calculation timing: compute after prefill completes instead of
  at decode start (fixes max_tokens=1 returning TTFT=0)
- Integrate InferenceObserver into bench.py and bench_offload.py for
  accurate internal timing metrics vs external wall-clock time
- Add get_summary() and print_summary() methods for structured output
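
The base-class split and the TTFT timing fix can be sketched as below. The event method names and the TPOT formula (decode time divided by decode steps) are assumptions for illustration; only the class names and get_summary() come from the message above. Timestamps are passed in explicitly so the example is deterministic.

```python
class Observer:
    """Base class with a shared enable/disable/reset interface."""
    def __init__(self):
        self.enabled = False
        self.reset()

    def enable(self):
        self.enabled = True

    def disable(self):
        self.enabled = False

    def reset(self):
        pass

class InferenceObserver(Observer):
    """Tracks TTFT and TPOT from explicit timestamps (in seconds)."""
    def reset(self):
        self.start = None
        self.prefill_end = None
        self.last_decode = None
        self.decode_steps = 0

    def on_request_start(self, now):
        if self.enabled:
            self.start = now

    def on_prefill_end(self, now):
        # TTFT is fixed when prefill completes, so a max_tokens=1 run
        # (which never reaches a decode step) still gets a non-zero value.
        if self.enabled:
            self.prefill_end = now

    def on_decode_step(self, now):
        if self.enabled:
            self.last_decode = now
            self.decode_steps += 1

    def get_summary(self):
        ttft = None
        if self.start is not None and self.prefill_end is not None:
            ttft = self.prefill_end - self.start
        tpot = None
        if self.decode_steps > 0:
            tpot = (self.last_decode - self.prefill_end) / self.decode_steps
        return {"ttft": ttft, "tpot": tpot}

obs = InferenceObserver()
obs.enable()
obs.on_request_start(0.0)
obs.on_prefill_end(0.25)
print(obs.get_summary())  # TTFT is available before any decode step runs
```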

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 03:15:33 +08:00
Zijie Tian
f3e4611e3b 📝 docs: add XAttention performance analysis documentation
Add comprehensive performance analysis for XAttention:
- NVTX marker locations and usage
- Block size impact on offload mode (4096 vs 1024)
- Detailed timing breakdown for estimate vs compute phases
- softmax_fuse_block_sum_kernel analysis
- Optimization recommendations

Key findings:
- block_size=4096 is 2x faster than 1024 for 64K context
- find_blocks_chunked is bottleneck (40%) at block_size=4096
- estimate_gemm becomes bottleneck (24%) at block_size=1024

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 00:57:20 +08:00
Zijie Tian
7b5d3b34eb 📈 feat: add NVTX markers to XAttention for profiling
Add NVTX range markers to track XAttention performance:
- GPU-only: xattn_estimate, xattn_bsa_compute
- Offload: xattn_estimate_gemm, xattn_estimate_softmax,
  xattn_estimate_find_blocks, xattn_compute_historical,
  xattn_compute_current, xattn_compute_merge

These markers enable detailed nsys profiling to identify
performance bottlenecks in estimate vs compute phases.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 00:57:11 +08:00
Zijie Tian
b760de84c5 feat: add context length and error handling to profile_offload.sh
- Add --ctx-len parameter (32k/64k/128k) for context length selection
- Auto-configure max-model-len and data-dir based on context length
- Add error handling to delete .nsys-rep file on test failure
- Remove set -e to allow proper error handling
- Update output filename format to include context length

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 00:28:37 +08:00
Zijie Tian
f81b5ae8a9 feat: enhance profile_offload.sh with policy, block-size parameters
- Add --policy parameter for sparse attention policy selection (full/xattn)
- Add --block-size parameter (default 4096) for KV cache block size
- Add --gpu-util parameter for GPU memory utilization control
- Improve output filename format: <policy>_<gpuonly|offload>_blk<size>_<timestamp>
- Map user-friendly policy names to internal enum (xattn -> XATTN_BSA)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 23:23:20 +08:00
Zijie Tian
e874229adc 📝 docs: add comprehensive GPU-only vs Offload benchmark results
- Add --block-size argument to bench.py for configurable KV cache block size
- Update bench_offload_results.md with complete benchmark analysis:
  - GPU-only: XAttention shows +15% to +41% speedup
  - CPU Offload: XAttention shows -14% to -59% slowdown
  - Block size 4096 recommended for best performance
  - Document why XAttention hurts Offload mode (transfer bottleneck)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 22:32:07 +08:00
Zijie Tian
4fe7dfb239 🔀 merge: integrate tzj/minference-exp (GPU-only sparse attention)
Merge GPU-only sparse attention support from tzj/minference-exp branch:

**GPU-only mode additions:**
- Add compute_prefill/compute_decode methods to SparsePolicy base class
- Add GPU-only attention routing in attention.py
- Add alloc_policy_metadata() for pre-allocating GQA buffers
- Add XAttention + BSA sparse attention for GPU-only prefill
- Add kvcache_manager to set_context() for policy access

**bench.py enhancements:**
- Add --model argument for configurable model path
- Add --policy argument (full, xattn) for sparse policy selection
- Add --enable-policy flag for FullAttentionPolicy routing
- Add --enforce-eager option to disable CUDA graphs
- Add --gpu-util option for GPU memory utilization

**Documentation:**
- Add gpu_only_xattn_guide.md with performance analysis
- Add gpu_only_sparse_integration.md baseline document
- Add gpu-vram-requirement.md rule for GPU-only mode

Both CPU offload and GPU-only paths are preserved and functional.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27 09:25:36 +08:00