Commit Graph

248 Commits

Author SHA1 Message Date
Zijie Tian
4cbd451af7 📝 docs: add BSA interface documentation and cleanup temp files
- Add docs/block_sparse_attn_interface.md with BSA function signatures
- Update CLAUDE.md documentation index
- Remove obsolete DEBUG_SUMMARY.md and test_report_sparse_policy_refactor.md
- Add notes.md to .gitignore

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 04:27:19 +08:00
Zijie Tian
3aef6fc3a2 feat: add XAttention Triton operators for sparse attention estimation
Port XAttention operators from COMPASS project:
- flat_group_gemm_fuse_reshape: stride reshape GEMM kernel
- softmax_fuse_block_sum: fused softmax with block-level summation
- xattn_estimate: main estimation function for block sparse attention
- find_blocks_chunked: cumulative threshold-based block selection

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 04:27:07 +08:00
Zijie Tian
690456dbf9 ♻️ refactor: create ops module and move chunked_attention
- Create nanovllm/ops/ module for low-level attention operators
- Move chunked_attention.py from kvcache/ to ops/
- Update imports in full_policy.py (3 locations)
- Fix: remove dead code in OffloadEngine.reset() referencing
  non-existent layer_k/v_buffer_a/b attributes

Verified with needle test (32K offload): PASSED

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 02:50:14 +08:00
Zijie Tian
e440c45e73 📝 docs: add XAttention algorithm guide based on COMPASS implementation
- Create docs/xattention_algorithm_guide.md with detailed algorithm explanation
  - Stride reshape (inverse mode) for Q/K interleaved sampling
  - Triton kernels: flat_group_gemm_fuse_reshape, softmax_fuse_block_sum
  - Block selection via find_blocks_chunked with cumulative threshold
  - BSA (block_sparse_attn) dependency for sparse computation
- Update docs/sparse_attention_guide.md XAttention section with accurate description
- Add documentation index entry in CLAUDE.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 02:50:03 +08:00
Zijie Tian
07f5220f40 Merge branch 'tzj/minference' of ssh://git.zijie-tian.site:2222/zijie-tian/nano-vllm into tzj/minference 2026-01-20 02:27:10 +08:00
Zijie Tian
37aecd4d52 📝 docs: add SparsePolicy implementation guide and update rules
- Create docs/sparse_policy_implementation_guide.md with comprehensive guide
- Rewrite .claude/rules/sparse-policy.md with mandatory base class requirements
- Add new doc reference to CLAUDE.md documentation index

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 02:25:46 +08:00
Zijie Tian
b1f292cf22 Merge branch 'tzj/minference' of ssh://git.zijie-tian.site:2222/zijie-tian/nano-vllm into tzj/minference 2026-01-20 02:16:39 +08:00
Zijie Tian
16fbcf9e4c docs: add RULER 32K chunked offload issue documentation
- Document accuracy degradation issue in 32K context with chunked offload
- Add detailed hypothesis analysis and debugging approach
- Include 4-slot ring buffer experiment results

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 02:16:21 +08:00
Zijie Tian
fa7601f4b8 ♻️ refactor: remove cross-layer pipeline and rename compute_chunked_prefill
- Remove cross-layer pipeline from OffloadEngine (saves ~1GB GPU memory for long sequences)
  - Delete layer_k/v_buffer_a/b double buffers
  - Remove start_decode_pipeline, get_decode_layer_kv, end_decode_pipeline methods
  - Remove pipeline state tracking variables
- Simplify decode to use ring buffer pipeline only (more efficient for long sequences)
- Rename compute_chunked_attention → compute_chunked_prefill for clarity
- Add mandatory needle test requirements: --enable-offload --input-len 32768

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 02:10:40 +08:00
Zijie Tian
6080bf7554 🙈 chore: exclude planning-with-files from git tracking
- Add planning files (task_plan.md, findings.md, progress.md) to .gitignore
- Remove existing planning files from git index (keep local)
- Update planning-with-files rule with git management policy

These temporary session files should not be version controlled.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 02:06:28 +08:00
Zijie Tian
e5a17c832c 📝 docs: add SparsePolicy architecture documentation
Add comprehensive documentation for the SparsePolicy abstraction:
- SparsePolicy base class and abstract methods
- FullAttentionPolicy prefill/decode flow
- Ring buffer and cross-layer pipeline modes
- Code conventions and testing guidelines

Update CLAUDE.md documentation index with reference.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 01:36:09 +08:00
Zijie Tian
4593f42ec3 ♻️ refactor: migrate chunked decode attention to SparsePolicy
Move decode attention computation from attention.py to SparsePolicy:
- Add compute_chunked_decode abstract method to SparsePolicy base class
- Implement compute_chunked_decode in FullAttentionPolicy with:
  - Ring buffer pipeline (_decode_ring_buffer_pipeline)
  - Cross-layer pipeline (_decode_with_layer_pipeline)
  - Decode buffer handling
- Simplify _chunked_decode_attention to only validate and delegate
- Remove _decode_ring_buffer_pipeline and _decode_with_layer_pipeline from attention.py
- Add supports_decode check for policy validation

This completes the SparsePolicy v5 refactoring where both prefill and
decode paths now delegate all computation to the sparse policy.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 01:32:17 +08:00
Zijie Tian
a36f8569fc [WIP] Before refactor. 2026-01-20 01:25:46 +08:00
Zijie Tian
d3b41b2f64 🔧 chore: clean up claude-flow configuration
Remove unused claude-flow hooks, permissions, and daemon settings.
Add disabled MCP servers list for claude-flow related servers.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 00:58:52 +08:00
Zijie Tian
baa4be7e2e ♻️ refactor: migrate chunked prefill attention to SparsePolicy
Move all chunked prefill attention computation from attention.py to
SparsePolicy.compute_chunked_attention(). This is the v4 architecture
refactoring for sparse attention policies.

Changes:
- Add compute_chunked_attention abstract method to SparsePolicy base
- Add offload_engine parameter to select_blocks for policies needing
  KV access during block selection
- Implement compute_chunked_attention in FullAttentionPolicy with
  complete ring buffer pipeline logic
- Simplify attention.py to delegate all chunked prefill to policy
- Remove redundant _sync_load_previous_chunks and
  _ring_buffer_pipeline_load methods from Attention class

Test: test_needle.py --enable-offload PASSED

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 00:58:46 +08:00
Zijie Tian
6783a45e6f 🚧 wip: update sparse policy refactoring plan to v4
Add clear acceptance criteria and verification methods:
- Define 3 acceptance criteria (needle test, zero calc in attention.py, KV via offload_engine)
- Document violations to fix (direct flash_attn/copy calls)
- Add offload_engine.write_prefill_buffer encapsulation plan
- Add LSP-based verification method using cclsp tools

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 23:23:16 +08:00
Zijie Tian
16b269d897 🚧 wip: update sparse policy refactoring plan to v4
Simplified scope to FullPolicy only. Added debug validation phase.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 23:10:49 +08:00
Zijie Tian
b97b0b96a0 [WIP] Before refactor the nanovllm sparse policy. 2026-01-19 22:34:44 +08:00
Zijie Tian
b5da802dff [WIP] Before integrate the xattn operator. 2026-01-19 21:19:21 +08:00
Zijie Tian
9e6fdc0650 [WIP] Before plan execute. 2026-01-19 03:30:44 +08:00
Zijie Tian
50520a6c3c [fix] fixed request to request error. 2026-01-19 00:55:26 +08:00
Zijie Tian
e6e0dc5d7d feat: add comprehensive RULER benchmark testing
- Add test_ruler.py from tzj/vs_offload branch with 13 RULER tasks
- Add comprehensive documentation for RULER benchmark results
- Update CLAUDE.md with new documentation index entry
- Add architecture, debugging, optimization, and known issues guides
- Test 32K context with CPU offload: 92.3% accuracy across all tasks
- Parallel execution on 4 GPUs with detailed performance metrics

Benchmark results:
- 13 RULER tasks total (niah_single, multikey, multiquery, multivalue, qa, cwe, fwe, vt)
- 26 samples tested with 92.3% overall accuracy
- CPU offload stable at 32K context length
- Parallel GPU execution achieving 4x speedup

Key findings:
- Single needle tasks: 100% accuracy
- Multi-value and recall tasks: 100% accuracy
- Multi-query tasks: 50% accuracy (most challenging)
- QA tasks: 100% accuracy
- Total execution time: ~220 seconds (parallel)
2026-01-18 20:34:06 +08:00
Zijie Tian
0550a64339 feat: add dynamic port allocation from tzj/vs_offload
- Import os and socket modules
- Add _find_free_port() function for automatic port detection
- Use NANOVLLM_DIST_PORT env var if set, otherwise auto-assign
- Enables running multiple model instances without port conflicts

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-18 19:51:56 +08:00
Zijie Tian
d9890aa2cd chore: add Block-SparseAttention submodule from tzj/vs_offload 2026-01-18 19:22:40 +08:00
Zijie Tian
5a837c8c83 chore: update .gitignore with tzj/vs_offload configuration
- Add Claude Flow generated files ignore patterns
- Add test data directory ignore
- Add Serena MCP tool config ignore
- Add Windows wrapper files ignore

These configurations improve development workflow by excluding temporary
and generated files from version control.
2026-01-18 18:59:17 +08:00
Zijie Tian
d1bbb7efe2 chore: update claude configuration and rules from tzj/vs_offload
- Add /sc:git command with smart commit functionality
- Add /sc:ultra-think command for deep thinking
- Update .claude/rules/ with improved documentation:
  - commands.md: command usage guidelines
  - doc-management.md: documentation policy
  - no-extra-docs.md: documentation creation policy
  - gpu-testing.md: GPU type detection and testing rules
- Update .claude/settings.json with claude-flow MCP configuration

这些改进提供了更好的开发体验和工具支持。
2026-01-18 18:56:49 +08:00
Zijie Tian
1a78ae74d5 feat: add claude-flow MCP configuration
Add .claude/settings.json to enable claude-flow MCP in all worktrees.

This configuration includes:
- SessionStart hook to auto-start claude-flow daemon
- Auto-approval for claude-flow MCP tools and CLI commands
- Basic claude-flow settings

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-18 18:55:56 +08:00
Zijie Tian
c254c8c330 chore: add planning-with-files rule configuration 2026-01-18 18:55:55 +08:00
Zijie Tian
03a8c033cb [claudesquad] update from 'add-llama-1' on 10 Jan 26 21:03 CST 2026-01-10 21:03:45 +08:00
Zijie Tian
6575099a06 [refactor] Cleanup unused code after perf_opt merge
Removed ~460 lines of unused/redundant code from offload_engine.py:
- CUDA gather methods (gathered_h2d_*, update_gather_indices)
- Legacy async transfer methods (prefetch_block_async, offload_block_async)
- Legacy sync/wait methods (wait_for_block, wait_all_transfers, sync_indices)
- Legacy compatibility methods (load_to_compute_layer, wait_compute_layer)
- Unused gather_indices tensors and memory calculations

Updated class docstring to reflect current architecture.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 06:25:21 +08:00
Zijie Tian
8fd25d72d7 Merge perf_opt-1 and perf_opt-2 branches
Combines two performance optimization features:
- perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache)
- perf_opt-2: Per-layer prefill buffer for async offload

Both features are complementary and improve CPU offload performance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 06:03:44 +08:00
Zijie Tian
ccf27d3a74 [claudesquad] update from 'perf_opt-1' on 07 Jan 26 05:58 CST 2026-01-07 05:58:23 +08:00
Zijie Tian
0ad86eb449 [claudesquad] update from 'perf_opt-2' on 07 Jan 26 05:58 CST 2026-01-07 05:58:10 +08:00
Zijie Tian
aa953ecb59 [refactor] Aligned the bench. 2026-01-07 04:25:06 +08:00
Zijie Tian
362f5e575f [fix] Fixed .gitignores . 2026-01-07 03:32:14 +08:00
Zijie Tian
58a06501c1 Merge branch 'zijie/debug_chunk-2' into tzj/minference 2026-01-07 03:30:38 +08:00
Zijie Tian
2a6e0a2c02 [feat] Added Quest Sparsity Policy. 2026-01-07 03:29:21 +08:00
Zijie Tian
2fe50bab50 [claudesquad] update from 'debug_chunk-2' on 07 Jan 26 03:27 CST 2026-01-07 03:27:27 +08:00
Zijie Tian
c99a6f3d3f [WIP] Before add Quest policy. 2026-01-07 02:32:30 +08:00
Zijie Tian
f240903013 [docs] Add GPU mutex instructions for multi-instance debugging
Add instructions for Claude instances to check GPU availability before
running CUDA operations, preventing conflicts when multiple instances
debug in parallel on a single GPU.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 01:42:59 +08:00
Zijie Tian
0e691f2d85 [WIP] move metadata to GPU. 2026-01-06 23:32:32 +08:00
Zijie Tian
edb5273e34 [WIP] Added basic test for quest. 2026-01-06 22:30:31 +08:00
Zijie Tian
690492e074 [WIP] Before refactor policies. 2026-01-06 20:47:55 +08:00
Zijie Tian
7cc8a394a5 [fix] Fixed bench_offload.py, BUT performance DEGRAD. 2026-01-06 18:46:48 +08:00
Zijie Tian
535f2037ab [WIP] Before fix bench_offload.py. 2026-01-06 18:41:08 +08:00
Zijie Tian
c7ac39dfbd [refactor] Before add sprae policy. 2026-01-05 21:19:24 +08:00
Zijie Tian
e554d5482b [refactor] Delete unnesscessory test, and refacrtor the offload prefix cache. 2026-01-05 20:31:42 +08:00
Zijie Tian
247c5312d9 [fix] Fixed decode misalign. 2026-01-05 19:00:44 +08:00
Zijie Tian
054aaff403 [fix] Fixed needle test bug. 2026-01-05 18:34:09 +08:00
Zijie Tian
d623043a3c [WIP] FIXED decode and prefill NEEDLE test. 2026-01-05 01:51:46 +08:00