Commit Graph

230 Commits

Author SHA1 Message Date
Zijie Tian
18bc433f09 perf: improve NVTX profiling with colored ranges and configurable slots
- Switch from torch.cuda.nvtx to nvtx package for colored range support
- Add color coding: blue for H2D, green for D2H decode, orange for D2H prefill
- Add --num-gpu-blocks parameter to profile_offload.sh
- Include slot count in output filename for easier comparison

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 03:42:05 +08:00
Zijie Tian
aea3812230 ♻️ refactor: unify KV cache operations through OffloadEngine
- Add write_to_prefill_buffer() and write_to_decode_buffer() methods
- Add chunk_idx parameter to load_to_slot_layer() for NVTX labeling
- Replace direct copy_() calls with OffloadEngine methods in attention.py
- Update all load_to_slot_layer() calls to pass chunk_idx
- NVTX markers now show chunk info: "H2D: L{layer} Chunk{chunk} CPU[{block}]->Slot[{slot}]"

All KV cache data transfers in chunked offload mode now go through
OffloadEngine, enabling better profiling and consistent management.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 02:20:59 +08:00
Zijie Tian
3100724666 📝 docs: add nsys wrong event order bug investigation
- Document ring buffer pipeline triggering nsys timestamp bug
- Update profile_offload.sh to use test_ruler.py with options
- Add reference to new doc in CLAUDE.md

Root cause: 4-slot ring buffer pipeline (4 transfer streams +
1 compute stream) triggers event ordering bug in nsys < 2024.2

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 04:32:05 +08:00
Zijie Tian
78a44f3536 📝 docs: add GPU memory monitoring rule
- Add .claude/rules/gpu-monitor.md requiring gpu-monitor agent for all GPU memory monitoring tasks
- Update CLAUDE.md rules index with reference to new rule

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 01:41:25 +08:00
Zijie Tian
7c41032a2e feat: add configurable stride and chunk_size for XAttention BSA
- Add sparse_chunk_size config option (default: 16384)
- Pass stride, chunk_size, use_triton through factory function
- Add --sparse-stride CLI option to test_ruler.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 10:37:04 +08:00
Zijie Tian
f28b500120 🙈 chore: uncomment planning files in gitignore
These files are session-level temporary and should not be tracked.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 09:43:46 +08:00
Zijie Tian
be67fa8060 🗑️ chore: remove temporary planning files
These files are session-level temporary files and should not be tracked.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 09:43:22 +08:00
Zijie Tian
4f35526457 🔀 merge: integrate remote changes (exec-plan command, CUDA graph plan)
Resolve task_plan.md conflict by keeping remote version (CUDA Graph optimization plan).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 09:43:06 +08:00
Zijie Tian
da5e13e2bb 📝 docs: update XAttention BSA Policy with benchmarks and memory management
Add new sections to xattn_bsa_policy_design.md:
- Performance benchmarks: 128K context comparison (Full vs XAttn BSA)
- Density trend analysis across chunks
- Memory leak issue and fix (leaked memory reduced from 64GB to 4GB)
- Memory monitoring guide with gpu-monitor agent
- Density statistics API documentation
- Known issues and optimization directions

Update CLAUDE.md description to reflect new content.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 09:35:18 +08:00
Zijie Tian
dd31033732 🔧 chore: add gpu-monitor agent for memory leak debugging
Add a custom agent for continuous GPU monitoring during benchmarks:
- Track GPU utilization, memory usage, and temperature
- Support multi-GPU and configurable sampling intervals
- Generate summary statistics when stopped

Useful for debugging memory leaks and profiling long-running tasks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 09:33:15 +08:00
Zijie Tian
ed3c8bb4b8 🐛 fix: memory leak in XAttentionBSAPolicy select_blocks
Fix a severe memory leak (memory growth during 128K prefill reduced from 64GB to ~4GB):
- Remove unused sparse_metadata storage (was accumulating attn_scores)
- Delete intermediate tensor list (attn_scores_list) after use
- Explicitly delete intermediate tensors before return

Before: 16GB -> 80GB during 128K prefill
After:  16GB -> 19.8GB during 128K prefill

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 09:30:18 +08:00
Zijie Tian
5eb35982bf 🔧 feat: add density statistics tracking to sparse policies
Add statistics tracking to compare block selection between policies:
- XAttentionBSAPolicy: track available/selected blocks per chunk
- FullAttentionPolicy: track total blocks (always 100% density)
- Add reset_stats(), get_density_stats(), print_density_stats() methods
- Use logger.debug for per-chunk density logging

Results on 32K niah_single_1:
- Full: 100% density across all chunks
- XAttn BSA: 90% -> 73% density (saves ~25-30% blocks in later chunks)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 08:53:22 +08:00
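The density figures reported in commit 5eb35982bf are ratios of selected to available blocks per chunk. A minimal sketch of that bookkeeping follows. The `DensityTracker` class and its `record` method are hypothetical names chosen for illustration; only `reset_stats` and `get_density_stats` are named in the commit message, and their real signatures may differ.

```python
from dataclasses import dataclass, field


@dataclass
class DensityTracker:
    """Hypothetical per-chunk tracker; not the project's actual class."""
    available: list = field(default_factory=list)
    selected: list = field(default_factory=list)

    def record(self, num_available: int, num_selected: int) -> None:
        # One entry per processed chunk.
        self.available.append(num_available)
        self.selected.append(num_selected)

    def reset_stats(self) -> None:
        self.available.clear()
        self.selected.clear()

    def get_density_stats(self) -> dict:
        # Density = selected blocks / available blocks.
        per_chunk = [s / a for s, a in zip(self.selected, self.available) if a > 0]
        total_available = sum(self.available)
        overall = sum(self.selected) / total_available if total_available else 0.0
        return {"per_chunk": per_chunk, "overall": overall}


tracker = DensityTracker()
tracker.record(num_available=10, num_selected=9)   # early chunk (illustrative numbers)
tracker.record(num_available=40, num_selected=29)  # later chunk (illustrative numbers)
stats = tracker.get_density_stats()
```

A full-attention policy would record `num_selected == num_available` for every chunk, which yields the constant 100% density mentioned in the commit.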
Zijie Tian
ad361c2c3b 📝 docs: add XAttention BSA Policy design documentation
- Create docs/xattn_bsa_policy_design.md with:
  - Algorithm overview and data flow diagram
  - select_blocks implementation details
  - GQA-aware aggregation and majority voting
  - compute_chunked_prefill ring buffer pipeline
  - Parameter configuration and usage examples
  - Performance characteristics and limitations
- Update CLAUDE.md documentation index

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 08:36:56 +08:00
Zijie Tian
4d1e40152d feat(xattn): implement compute_chunked_prefill with ring buffer pipeline
- Copy compute_chunked_prefill implementation from FullAttentionPolicy
- Set default threshold to 0.95 for accuracy testing
- Remove debug code (sys.exit, verbose prints)
- Use ring buffer pipeline for historical block loading
- Merge with current chunk attention using flash_attn_with_lse

RULER NIAH test passed with 5/5 samples (100% accuracy).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 08:27:40 +08:00
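Commit 4d1e40152d merges attention over historical blocks with attention over the current chunk using log-sum-exp (LSE) values. The sketch below shows the general merge rule in pure Python for a single query with scalar values. The helper names `attend` and `merge` are illustrative assumptions and are not the signatures of `flash_attn_with_lse` or `merge_attention_outputs`.

```python
import math


def attend(scores, values):
    """Softmax attention for one query over scalar values; returns (output, lse)."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    out = sum(math.exp(s - lse) * v for s, v in zip(scores, values))
    return out, lse


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results computed over disjoint key sets."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    # Each partial output is reweighted by its share of the total softmax mass.
    out = math.exp(lse_a - lse) * out_a + math.exp(lse_b - lse) * out_b
    return out, lse


scores = [0.5, -1.0, 2.0, 0.1]
values = [1.0, 2.0, 3.0, 4.0]

full_out, full_lse = attend(scores, values)
hist = attend(scores[:2], values[:2])   # stands in for historical blocks
curr = attend(scores[2:], values[2:])   # stands in for the current chunk
merged_out, merged_lse = merge(*hist, *curr)
```

Because the merge is exact up to floating-point rounding, historical blocks can be processed one ring-buffer slot at a time without changing the attention result.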
Zijie Tian
832b352afa feat(xattn): implement select_blocks with majority voting aggregation
Implement XAttention-based block selection for sparse attention:
- Use flat_group_gemm_fuse_reshape to compute Q@K^T attention scores
- Apply softmax_fuse_block_sum to aggregate into block-level attention
- Use find_blocks_chunked for threshold-based block selection
- Handle GQA by aggregating within KV head groups first
- Use majority voting (>50%) across heads instead of any() for better sparsity
- Align block_size with CPU offload block size (1024 tokens / stride = 128)

Test results show ~45% density at chunk 40 (down from 100% with any() aggregation).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 08:19:05 +08:00
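The difference between `any()` aggregation and majority voting in commit 832b352afa can be illustrated with plain boolean masks. This is a simplified sketch: the function names are hypothetical, the mask values are made up, and the GQA step (aggregating within KV head groups first) is omitted.

```python
def aggregate_any(head_masks):
    """Select a block if at least one head selects it."""
    return [any(col) for col in zip(*head_masks)]


def aggregate_majority(head_masks):
    """Select a block only if more than 50% of heads select it."""
    num_heads = len(head_masks)
    return [sum(col) * 2 > num_heads for col in zip(*head_masks)]


# Rows are heads, columns are KV blocks (illustrative values).
head_masks = [
    [True,  True,  False, False],
    [True,  False, True,  False],
    [True,  False, False, False],
    [True,  True,  False, False],
]

any_sel = aggregate_any(head_masks)       # keeps blocks 0, 1, 2
maj_sel = aggregate_majority(head_masks)  # keeps block 0 only
```

A block chosen by exactly half of the heads is dropped under the strict ">50%" rule, which is why majority voting gives lower density than `any()`.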
Zijie Tian
a50b4c2ac2 ♻️ refactor: move select_blocks from policy to attention layer
Move block selection logic from compute_chunked_prefill/decode methods
to attention.py caller. This improves separation of concerns:

- attention.py now calls select_blocks() before compute_chunked_*()
- Policy methods receive pre-selected blocks via selected_blocks parameter
- Enables sparse policies to implement custom block selection without
  modifying the compute path

Changes:
- policy.py: Add selected_blocks parameter to abstract methods
- full_policy.py: Remove internal select_blocks calls, use passed blocks
- xattn_bsa.py: Sync signatures for prefill/decode methods
- attention.py: Add select_blocks calls before policy delegation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 05:21:28 +08:00
Zijie Tian
ca32ea6f93 [WIP] Before refactoring compute_chunked_prefill. 2026-01-23 03:36:12 +08:00
Zijie Tian
edc006463b docs: add XAttention kernels guide
- Document flat_group_gemm_fuse_reshape and softmax_fuse_block_sum kernels
- Explain anti-diagonal sum principle and stride sampling
- Add GPU-specific BLOCK_M/N constraints (RTX 3090 vs A100)
- Show Q/K can have different lengths (chunked prefill support)
- Update CLAUDE.md with doc reference

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 03:22:25 +08:00
Zijie Tian
999858e82f feat: add xattn kernels test and update testing rules
- Add test_xattn_kernels.py demonstrating flat_group_gemm_fuse_reshape
  and softmax_fuse_block_sum Triton kernels with structured data
- Update testing.md with new test code style guidelines
- Update xattn.py and xattn_bsa.py with improvements

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 03:01:25 +08:00
Zijie Tian
47d237bb7e feat: add exec-plan command for automated task plan execution
Add a new Claude command that executes task_plan.md refactoring with:
- GPU isolation via --gpu <id> parameter (required)
- Optional --no-interrupt mode for autonomous execution
- Progress tracking via progress.md and findings.md
- Strict CUDA_VISIBLE_DEVICES enforcement

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 02:23:12 +08:00
Zijie Tian
a5307fb124 📝 docs: add CUDA Graph optimization plan for offload mode decode
- Update task_plan.md with 6-phase segmented graph implementation plan
- Add findings.md documenting 7 key discoveries about current implementation
- Add progress.md for tracking implementation progress
- Add test_chunk_attention_graph_reuse.py validating 2-graph reuse strategy

Key architecture decision: Split transformer layer into 3 segments:
- PRE-ATTENTION GRAPH: norm → qkv_proj → rotary (1 graph, reused)
- CHUNKED ATTENTION: H2D (eager) + flash_attn (2 graphs) + merge (eager)
- POST-ATTENTION GRAPH: o_proj → norm → FFN (1 graph, reused)

Total: 4 graphs serving all layers via copy_() tensor updates.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 02:12:24 +08:00
Zijie Tian
d808970f2f [WIP] Before implementing the plan. 2026-01-22 01:35:13 +08:00
Zijie Tian
bc92c1fdb8 feat: add xattn_estimate_chunked for chunked prefill support
- Add xattn_estimate_chunked function ported from COMPASS
- Support chunked prefill with q_start_pos parameter
- Ensure 100% consistency with standard xattn_estimate when
  using matching chunk_size parameter
- Add test and documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 01:13:17 +08:00
Zijie Tian
2866d4fd88 feat: add chunk attention CUDA graph test for block sparse attention
Validates that pre-allocated CUDA graphs work for chunk-wise attention:
- Each (Q_chunk, K_chunk) pair has its own captured graph
- Zero copy_() during replay - all data pre-filled
- Uses nanovllm's flash_attn_with_lse and merge_attention_outputs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 00:57:05 +08:00
Zijie Tian
5d722968ff [docs] Added cuda_graph_guide.md 2026-01-21 21:56:24 +08:00
Zijie Tian
d21b40f48f [test] Added test_cudagraph_memory.py. 2026-01-21 03:30:36 +08:00
Zijie Tian
42cf124343 📝 docs: add CUDA Graph memory mechanism guide
Document CUDA Graph memory behavior based on actual testing:
- Memory overhead at each stage (model, cache, warmup, capture, replay)
- StaticCache is the main overhead (~144MB for 1K tokens)
- Graph capture adds minimal overhead (~8MB)
- Graph replay requires zero additional allocation
- Performance improvement: ~2.8x decode throughput

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 02:59:21 +08:00
Zijie Tian
78050aef9f 🐛 fix: resolve CPU KV cache state leakage between requests
Root Cause:
- OffloadEngine.reset() cleared GPU buffers but NOT CPU cache
- Previous request's KV cache data persisted in CPU memory, contaminating subsequent requests

Fixes:
- Add k_cache_cpu.zero_() and v_cache_cpu.zero_() to OffloadEngine.reset()
- Add clear_decode_tracking(seq) call in HybridKVCacheManager.deallocate()

Results:
- niah_single_1 accuracy improved from ~80% to 94% (+14 percentage points)
- Remaining ~6% errors are model limitations, not state leakage

Also:
- Update docs/ruler_32k_chunked_offload_issue.md with fix details
- Remove debug planning files (findings.md, progress.md, task_plan.md)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 01:12:21 +08:00
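The state leakage fixed in commit 78050aef9f comes down to a reset that cleared one side of the cache but not the other. The toy class below illustrates the corrected behaviour with plain lists in place of tensors. `ToyOffloadEngine`, `gpu_slots`, and `store` are hypothetical; only `reset`, `k_cache_cpu`, and `v_cache_cpu` are names taken from the commit message.

```python
class ToyOffloadEngine:
    """Toy stand-in for an offload engine; plain lists replace tensors."""

    def __init__(self, num_blocks: int):
        self.gpu_slots = [0.0] * 4
        self.k_cache_cpu = [0.0] * num_blocks
        self.v_cache_cpu = [0.0] * num_blocks

    def store(self, block: int, k: float, v: float) -> None:
        self.k_cache_cpu[block] = k
        self.v_cache_cpu[block] = v

    def reset(self) -> None:
        # Clearing only the GPU side would let the previous request's
        # CPU-resident KV data leak into the next request.
        self.gpu_slots = [0.0] * len(self.gpu_slots)
        self.k_cache_cpu = [0.0] * len(self.k_cache_cpu)
        self.v_cache_cpu = [0.0] * len(self.v_cache_cpu)


engine = ToyOffloadEngine(num_blocks=3)
engine.store(block=1, k=0.7, v=-0.2)  # request A writes KV data
engine.reset()                         # boundary between requests
```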
Zijie Tian
4d8ae951c3 [WIP] Before debug plan. 2026-01-21 00:01:10 +08:00
Zijie Tian
1ab4676396 ♻️ refactor: consolidate RULER test files and document root cause
- test_ruler.py: add --fresh-llm, --sample-indices, --json-output options
- test_ruler.py: consolidate test_ruler_single_sample.py, test_ruler_sequential.py, test_ruler_samples.py
- docs: update chunked offload issue with root cause (state leakage confirmed)
- docs: add single-sample test results showing 100% accuracy for niah_single_1

Deleted redundant test files:
- tests/test_ruler_single_sample.py
- tests/test_ruler_sequential.py
- tests/test_ruler_samples.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 23:41:17 +08:00
Zijie Tian
512e1e5401 🔧 chore: add Claude rules for agent result format and multi-GPU debugging
- Add agent-result-format.md: standardize output formats for background agents
- Add multi-gpu-debugging.md: guidelines for parallel GPU testing workflows
- Update CLAUDE.md: add documentation index entry for chunked offload issue

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 23:41:08 +08:00
Zijie Tian
6180055ed8 📝 docs: add chunked attention solutions guide and update doc index
Add comprehensive documentation analyzing the 32K chunked offload
accuracy issues with proposed solutions covering LSE precision,
ring buffer state management, and position encoding validation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 04:48:20 +08:00
Zijie Tian
4cbd451af7 📝 docs: add BSA interface documentation and cleanup temp files
- Add docs/block_sparse_attn_interface.md with BSA function signatures
- Update CLAUDE.md documentation index
- Remove obsolete DEBUG_SUMMARY.md and test_report_sparse_policy_refactor.md
- Add notes.md to .gitignore

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 04:27:19 +08:00
Zijie Tian
3aef6fc3a2 feat: add XAttention Triton operators for sparse attention estimation
Port XAttention operators from COMPASS project:
- flat_group_gemm_fuse_reshape: stride reshape GEMM kernel
- softmax_fuse_block_sum: fused softmax with block-level summation
- xattn_estimate: main estimation function for block sparse attention
- find_blocks_chunked: cumulative threshold-based block selection

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 04:27:07 +08:00
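Commit 3aef6fc3a2 describes `find_blocks_chunked` as cumulative threshold-based block selection. A minimal sketch of that general idea follows, assuming blocks are ranked by score and kept until their cumulative share of the total reaches the threshold. The function name `select_blocks_by_threshold` is hypothetical, and the real kernel's inputs, tie-breaking, and any always-kept blocks may differ.

```python
def select_blocks_by_threshold(block_scores, threshold):
    """Return sorted indices of the highest-scoring blocks whose cumulative
    share of the total score first reaches `threshold`."""
    total = sum(block_scores)
    if total <= 0:
        return list(range(len(block_scores)))  # no signal: keep everything
    order = sorted(range(len(block_scores)),
                   key=lambda i: block_scores[i], reverse=True)
    selected, cumulative = [], 0.0
    for idx in order:
        selected.append(idx)
        cumulative += block_scores[idx] / total
        if cumulative >= threshold:
            break
    return sorted(selected)


scores = [0.05, 0.50, 0.10, 0.30, 0.05]  # illustrative block-level attention mass
chosen = select_blocks_by_threshold(scores, threshold=0.85)
```

Raising the threshold toward 1.0 keeps more blocks and approaches full attention; lowering it increases sparsity at the cost of dropped attention mass.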
Zijie Tian
690456dbf9 ♻️ refactor: create ops module and move chunked_attention
- Create nanovllm/ops/ module for low-level attention operators
- Move chunked_attention.py from kvcache/ to ops/
- Update imports in full_policy.py (3 locations)
- Fix: remove dead code in OffloadEngine.reset() referencing
  non-existent layer_k/v_buffer_a/b attributes

Verified with needle test (32K offload): PASSED

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 02:50:14 +08:00
Zijie Tian
e440c45e73 📝 docs: add XAttention algorithm guide based on COMPASS implementation
- Create docs/xattention_algorithm_guide.md with detailed algorithm explanation
  - Stride reshape (inverse mode) for Q/K interleaved sampling
  - Triton kernels: flat_group_gemm_fuse_reshape, softmax_fuse_block_sum
  - Block selection via find_blocks_chunked with cumulative threshold
  - BSA (block_sparse_attn) dependency for sparse computation
- Update docs/sparse_attention_guide.md XAttention section with accurate description
- Add documentation index entry in CLAUDE.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 02:50:03 +08:00
Zijie Tian
07f5220f40 Merge branch 'tzj/minference' of ssh://git.zijie-tian.site:2222/zijie-tian/nano-vllm into tzj/minference 2026-01-20 02:27:10 +08:00
Zijie Tian
37aecd4d52 📝 docs: add SparsePolicy implementation guide and update rules
- Create docs/sparse_policy_implementation_guide.md with comprehensive guide
- Rewrite .claude/rules/sparse-policy.md with mandatory base class requirements
- Add new doc reference to CLAUDE.md documentation index

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 02:25:46 +08:00
Zijie Tian
b1f292cf22 Merge branch 'tzj/minference' of ssh://git.zijie-tian.site:2222/zijie-tian/nano-vllm into tzj/minference 2026-01-20 02:16:39 +08:00
Zijie Tian
16fbcf9e4c docs: add RULER 32K chunked offload issue documentation
- Document accuracy degradation issue in 32K context with chunked offload
- Add detailed hypothesis analysis and debugging approach
- Include 4-slot ring buffer experiment results

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 02:16:21 +08:00
Zijie Tian
fa7601f4b8 ♻️ refactor: remove cross-layer pipeline and rename compute_chunked_prefill
- Remove cross-layer pipeline from OffloadEngine (saves ~1GB GPU memory for long sequences)
  - Delete layer_k/v_buffer_a/b double buffers
  - Remove start_decode_pipeline, get_decode_layer_kv, end_decode_pipeline methods
  - Remove pipeline state tracking variables
- Simplify decode to use ring buffer pipeline only (more efficient for long sequences)
- Rename compute_chunked_attention → compute_chunked_prefill for clarity
- Add mandatory needle test requirements: --enable-offload --input-len 32768

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 02:10:40 +08:00
Zijie Tian
6080bf7554 🙈 chore: exclude planning-with-files from git tracking
- Add planning files (task_plan.md, findings.md, progress.md) to .gitignore
- Remove existing planning files from git index (keep local)
- Update planning-with-files rule with git management policy

These temporary session files should not be version controlled.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 02:06:28 +08:00
Zijie Tian
e5a17c832c 📝 docs: add SparsePolicy architecture documentation
Add comprehensive documentation for the SparsePolicy abstraction:
- SparsePolicy base class and abstract methods
- FullAttentionPolicy prefill/decode flow
- Ring buffer and cross-layer pipeline modes
- Code conventions and testing guidelines

Update CLAUDE.md documentation index with reference.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 01:36:09 +08:00
Zijie Tian
4593f42ec3 ♻️ refactor: migrate chunked decode attention to SparsePolicy
Move decode attention computation from attention.py to SparsePolicy:
- Add compute_chunked_decode abstract method to SparsePolicy base class
- Implement compute_chunked_decode in FullAttentionPolicy with:
  - Ring buffer pipeline (_decode_ring_buffer_pipeline)
  - Cross-layer pipeline (_decode_with_layer_pipeline)
  - Decode buffer handling
- Simplify _chunked_decode_attention to only validate and delegate
- Remove _decode_ring_buffer_pipeline and _decode_with_layer_pipeline from attention.py
- Add supports_decode check for policy validation

This completes the SparsePolicy v5 refactoring where both prefill and
decode paths now delegate all computation to the sparse policy.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 01:32:17 +08:00
Zijie Tian
a36f8569fc [WIP] Before refactoring. 2026-01-20 01:25:46 +08:00
Zijie Tian
d3b41b2f64 🔧 chore: clean up claude-flow configuration
Remove unused claude-flow hooks, permissions, and daemon settings.
Add disabled MCP servers list for claude-flow related servers.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 00:58:52 +08:00
Zijie Tian
baa4be7e2e ♻️ refactor: migrate chunked prefill attention to SparsePolicy
Move all chunked prefill attention computation from attention.py to
SparsePolicy.compute_chunked_attention(). This is the v4 architecture
refactoring for sparse attention policies.

Changes:
- Add compute_chunked_attention abstract method to SparsePolicy base
- Add offload_engine parameter to select_blocks for policies needing
  KV access during block selection
- Implement compute_chunked_attention in FullAttentionPolicy with
  complete ring buffer pipeline logic
- Simplify attention.py to delegate all chunked prefill to policy
- Remove redundant _sync_load_previous_chunks and
  _ring_buffer_pipeline_load methods from Attention class

Test: test_needle.py --enable-offload PASSED

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 00:58:46 +08:00
Zijie Tian
6783a45e6f 🚧 wip: update sparse policy refactoring plan to v4
Add clear acceptance criteria and verification methods:
- Define 3 acceptance criteria (needle test, zero calc in attention.py, KV via offload_engine)
- Document violations to fix (direct flash_attn/copy calls)
- Add offload_engine.write_prefill_buffer encapsulation plan
- Add LSP-based verification method using cclsp tools

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 23:23:16 +08:00
Zijie Tian
16b269d897 🚧 wip: update sparse policy refactoring plan to v4
Simplified scope to FullPolicy only. Added debug validation phase.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 23:10:49 +08:00
Zijie Tian
b97b0b96a0 [WIP] Before refactoring the nanovllm sparse policy. 2026-01-19 22:34:44 +08:00