Commit Graph

9 Commits

Author SHA1 Message Date
Zijie Tian
07f5220f40 Merge branch 'tzj/minference' of ssh://git.zijie-tian.site:2222/zijie-tian/nano-vllm into tzj/minference 2026-01-20 02:27:10 +08:00
Zijie Tian
37aecd4d52 📝 docs: add SparsePolicy implementation guide and update rules
- Create docs/sparse_policy_implementation_guide.md with comprehensive guide
- Rewrite .claude/rules/sparse-policy.md with mandatory base class requirements
- Add new doc reference to CLAUDE.md documentation index

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 02:25:46 +08:00
Zijie Tian
b1f292cf22 Merge branch 'tzj/minference' of ssh://git.zijie-tian.site:2222/zijie-tian/nano-vllm into tzj/minference 2026-01-20 02:16:39 +08:00
Zijie Tian
16fbcf9e4c docs: add RULER 32K chunked offload issue documentation
- Document accuracy degradation issue in 32K context with chunked offload
- Add detailed hypothesis analysis and debugging approach
- Include 4-slot ring buffer experiment results

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 02:16:21 +08:00
Zijie Tian
fa7601f4b8 ♻️ refactor: remove cross-layer pipeline and rename compute_chunked_prefill
- Remove cross-layer pipeline from OffloadEngine (saves ~1GB GPU memory for long sequences)
  - Delete layer_k/v_buffer_a/b double buffers
  - Remove start_decode_pipeline, get_decode_layer_kv, end_decode_pipeline methods
  - Remove pipeline state tracking variables
- Simplify decode to use ring buffer pipeline only (more efficient for long sequences)
- Rename compute_chunked_attention → compute_chunked_prefill for clarity
- Add mandatory needle test requirements: --enable-offload --input-len 32768

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 02:10:40 +08:00
Zijie Tian
e5a17c832c 📝 docs: add SparsePolicy architecture documentation
Add comprehensive documentation for the SparsePolicy abstraction:
- SparsePolicy base class and abstract methods
- FullAttentionPolicy prefill/decode flow
- Ring buffer and cross-layer pipeline modes
- Code conventions and testing guidelines

Update CLAUDE.md documentation index with reference.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 01:36:09 +08:00
Zijie Tian
b5da802dff [WIP] Before integrate the xattn operator. 2026-01-19 21:19:21 +08:00
Zijie Tian
e6e0dc5d7d feat: add comprehensive RULER benchmark testing
- Add test_ruler.py from tzj/vs_offload branch with 13 RULER tasks
- Add comprehensive documentation for RULER benchmark results
- Update CLAUDE.md with new documentation index entry
- Add architecture, debugging, optimization, and known issues guides
- Test 32K context with CPU offload: 92.3% accuracy across all tasks
- Parallel execution on 4 GPUs with detailed performance metrics

Benchmark results:
- 13 RULER tasks total (niah_single, multikey, multiquery, multivalue, qa, cwe, fwe, vt)
- 26 samples tested with 92.3% overall accuracy
- CPU offload stable at 32K context length
- Parallel GPU execution achieving 4x speedup

Key findings:
- Single needle tasks: 100% accuracy
- Multi-value and recall tasks: 100% accuracy
- Multi-query tasks: 50% accuracy (most challenging)
- QA tasks: 100% accuracy
- Total execution time: ~220 seconds (parallel)
2026-01-18 20:34:06 +08:00
Zijie Tian
bf4c63c7ec [docs] Added Sparse Attn. 2025-12-29 19:56:54 +08:00