Commit Graph

39 Commits

Author SHA1 Message Date
Zijie Tian
cfb188c34a docs: add chunked prefill analysis for ultra-long sequences
Add comprehensive analysis document covering:
- MLP activation memory bottlenecks with SwiGLU architecture
- Chunked MLP strategy (98% memory reduction)
- Chunked prefill for single layers (78% memory reduction)
- Streaming Chunked Prefill (optimal approach): GPU memory becomes constant
- Memory formulas and implementation guidance
- Theoretical maximum: 4M tokens on 24GB GPU (128× improvement)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-16 10:38:02 +08:00
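The "98% memory reduction" claimed for the chunked MLP strategy can be sanity-checked with rough arithmetic. The sketch below assumes Llama-3.1-8B-style dimensions (SwiGLU intermediate size 14336, fp16, a 128k-token prefill, 2048-token chunks) — these constants are my assumptions, not taken from the commit:

```python
# Rough activation-memory estimate for a SwiGLU MLP during prefill.
# Dimensions are illustrative (Llama-3.1-8B-style), not from the repo.
BYTES = 2            # fp16
INTER = 14336        # SwiGLU intermediate size
SEQ = 128 * 1024     # 128k-token prefill
CHUNK = 2048         # chunked-MLP chunk size

def mlp_act_bytes(tokens: int) -> int:
    # gate_proj output + up_proj output + activated product,
    # each of shape [tokens, INTER] in fp16.
    return 3 * tokens * INTER * BYTES

full = mlp_act_bytes(SEQ)
chunked = mlp_act_bytes(CHUNK)
print(f"full prefill : {full / 2**30:.1f} GiB")
print(f"chunked      : {chunked / 2**30:.3f} GiB")
print(f"reduction    : {1 - chunked / full:.1%}")
```

With a 2048-token chunk out of 128k tokens, peak MLP activation memory drops by 1 − 2048/131072 ≈ 98.4%, consistent with the figure quoted above.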
Zijie Tian
2826a649de docs: add XAttention integration guide
Comprehensive documentation for XAttention sparse policy integration:
- Algorithm principles (chunked estimation + block sparse attention)
- COMPASS source code analysis
- Design decisions for CPU offload mode
- Implementation details (utils.py, kernels.py, xattn.py)
- Problem-solving (OOM, GQA, abstract method)
- Test validation results (RULER 32k benchmark)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 10:16:21 +08:00
Zijie Tian
57f4e9c6e6 docs: reorganize documentation files
- Move notes.md to docs/development_notes.md
- Move Xattention_analysis.md to docs/xattention_analysis.md
- Delete DEBUG_SUMMARY.md (no longer needed)
- Update CLAUDE.md with documentation index entries

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 10:08:41 +08:00
Zijie Tian
8d6fde3b23 docs: add Block-Sparse-Attention library reference
Add comprehensive documentation for the MIT-Han-Lab Block-Sparse-Attention
library (3rdparty submodule, branch: tzj/minference).

The new document covers:
- Four sparse attention modes (dense, token/block streaming, block sparse)
- Hybrid mask support (different patterns per head)
- Complete API reference for all three functions
- Performance benchmarks (up to 3-4x speedup on A100)
- Integration considerations for nano-vllm

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 08:39:03 +08:00
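The four sparse modes all operate at block granularity. As a toy illustration of what "block sparse" means here — downsampling a token-level mask to a block-level mask — consider the following sketch. This is NOT the Block-Sparse-Attention library's API, just the general idea its kernels exploit:

```python
# Illustration of block-granularity sparsity: downsample a dense
# token-level attention mask to a [q_blocks, k_blocks] block mask.
BLOCK = 4  # toy block size (real kernels typically use 64 or 128)

def to_block_mask(dense, block=BLOCK):
    n_q, n_k = len(dense), len(dense[0])
    qb, kb = n_q // block, n_k // block
    # A block is kept if ANY token pair inside it attends.
    return [
        [
            any(dense[i][j]
                for i in range(qi * block, (qi + 1) * block)
                for j in range(ki * block, (ki + 1) * block))
            for ki in range(kb)
        ]
        for qi in range(qb)
    ]

# Causal 8x8 mask -> 2x2 block mask: the upper-right block is skipped.
causal = [[j <= i for j in range(8)] for i in range(8)]
print(to_block_mask(causal))  # [[True, False], [True, True]]
```

Skipping whole blocks (rather than individual tokens) is what lets the kernel stay coalesced and reach the 3-4x speedups mentioned above.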
Zijie Tian
86633004ca 📝 docs: add 64k memory analysis and test configuration updates
Add comprehensive memory analysis for 64k inference on Llama 3.1 8B:

New documentation:
- docs/64k_memory_analysis.md: GPU-only vs offload memory analysis,
  OOM root cause (memory fragmentation), RTX 3090 limitations,
  theoretical vs actual memory usage breakdown

Test configuration updates:
- tests/test_ruler.py: Add --num-kv-buffers parameter for ring buffer
  size tuning (default 4, can reduce to 1 for lower memory)
- Update default data_dir to ruler_64k
- Update default max_model_len to 65664 for 64k support

CLAUDE.md updates:
- Add 64k_memory_analysis.md to documentation index
- Document num_kv_buffers parameter in Configuration section
- Add 64k hardware requirements note to Model Limits

Key findings: 64k inference requires ~26GB (GPU-only) or ~23GB (offload)
due to memory fragmentation on 24GB GPUs, making A100 (40GB+) the
recommended hardware for 64k workloads.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 07:02:09 +08:00
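The ~26GB finding can be cross-checked with back-of-envelope arithmetic. The model constants below are the published Llama-3.1-8B configuration (32 layers, 8 KV heads, head_dim 128); the breakdown itself is my illustrative estimate, not the document's:

```python
# Back-of-envelope GPU memory for 64k inference on Llama-3.1-8B (fp16).
GiB = 2**30
params = 8.03e9                  # ~8B parameters
weights = params * 2 / GiB       # fp16 weights

layers, kv_heads, head_dim, seq = 32, 8, 128, 64 * 1024
# K and V, 2 bytes per element, per layer per head per token.
kv_cache = 2 * 2 * layers * kv_heads * head_dim * seq / GiB

print(f"weights : {weights:.1f} GiB")
print(f"kv cache: {kv_cache:.1f} GiB")
print(f"total   : {weights + kv_cache:.1f} GiB (before activations)")
```

Weights plus KV cache alone land near 23 GiB; adding activation buffers and allocator fragmentation pushes the requirement past a 24GB card, consistent with the ~26GB GPU-only figure and the A100 recommendation.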
Zijie Tian
64971c8e8a Merge branch 'zijie/fix-dist-3': Fix distributed port conflict
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-12 16:27:25 +08:00
Zijie Tian
de6f36bdb2 [docs] Added dist port issue. 2026-01-12 15:16:39 +08:00
Zijie Tian
8e0888c20c [docs] Added offload_acc issue. 2026-01-12 15:05:55 +08:00
Zijie Tian
2771312565 [docs] Add sparse prefill integration plan from int-minference analysis
Consolidated analysis from int-minference-1/2/3 branches into a unified
integration plan for MInference, XAttention, and FlexPrefill strategies.

Key design decisions:
- Backward compatible: Keep existing SparsePolicy interface
- Unified BlockMask intermediate representation for new strategies
- XAttention/FlexPrefill use block_sparse_attn_func kernel
- MInference can optionally use block_sparse_attn (Phase 4)

Five-phase implementation plan:
1. BlockMask + block_sparse_attn wrapper
2. XAttention implementation
3. FlexPrefill implementation
4. Optional MInference refactoring
5. Integration and testing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 23:33:09 +08:00
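The commit names a "unified BlockMask intermediate representation" but does not define its shape. A plausible minimal form (field and method names here are hypothetical, for illustration only) might be:

```python
from dataclasses import dataclass

@dataclass
class BlockMask:
    """Hypothetical sketch of a block-level sparsity mask that
    XAttention / FlexPrefill could emit before dispatching to a
    block-sparse kernel. Field names are illustrative, not the repo's."""
    block_size: int
    num_q_blocks: int
    num_k_blocks: int
    # active[h][qb] -> list of key-block indices head h attends to
    active: list

    def density(self) -> float:
        kept = sum(len(kbs) for head in self.active for kbs in head)
        total = len(self.active) * self.num_q_blocks * self.num_k_blocks
        return kept / total

# One head, 2x2 causal block pattern (3 of 4 blocks kept).
m = BlockMask(64, 2, 2, [[[0], [0, 1]]])
print(m.density())  # 0.75
```

A shared representation like this is what lets multiple estimation strategies feed one kernel wrapper, matching the backward-compatible design decision above.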
Zijie Tian
de6eae472d [docs] Update CLAUDE.md with multi-model support documentation
- Update overview to reflect Qwen3/Qwen2/Llama support
- Add docs/multi_model_support.md to documentation index
- Add Llama-3.1-8B-Instruct to model limits

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 21:29:39 +08:00
Zijie Tian
067e36f4a2 [claudesquad] update from 'fix-bug-2' on 09 Jan 26 16:10 CST 2026-01-09 16:10:28 +08:00
Zijie Tian
79c4df4a27 [claudesquad] update from 'int-minference-1' on 08 Jan 26 23:42 CST 2026-01-08 23:42:30 +08:00
Zijie Tian
ea4e904de0 [claudesquad] update from 'int-minference-1' on 08 Jan 26 23:22 CST 2026-01-08 23:22:38 +08:00
Zijie Tian
0bfe1984ef [docs] Refine GPU mutex: exclusive for benchmarks, port check for tests
Benchmarks (bench*.py) still require exclusive GPU access for accurate
measurements. Other scripts (tests, examples) now only check for
distributed port 29500 conflicts, allowing parallel GPU sharing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-08 21:35:08 +08:00
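The port-29500 conflict check described above can be sketched as "try to bind the default torch.distributed port and treat a bind failure as in use" — an assumption on my part; the repository's actual check may work differently:

```python
import socket

def port_in_use(port: int = 29500) -> bool:
    """Return True if another process already holds the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("127.0.0.1", port))
            return False            # bind succeeded: port was free
        except OSError:
            return True             # bind failed: port is taken

print("port 29500 busy" if port_in_use() else "port 29500 free")
```

Unlike the exclusive-GPU mutex for benchmarks, this check only serializes runs that would collide on the distributed rendezvous port, so tests and examples can share the GPU.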
Zijie Tian
105201b902 [claudesquad] update from 'lw-offload-2' on 08 Jan 26 21:19 CST 2026-01-08 21:19:38 +08:00
Zijie Tian
b5c0ef3b7a [docs] Replace chunked prefill docs with layer-wise offload strategy
Remove all chunked prefill related documentation (ring buffer, sgDMA,
Triton merge kernels, known issues) and replace with layer-wise offload
system documentation including:
- Design philosophy and benefits
- Memory layout and per-layer KV size table
- Prefill and decode flow pseudocode
- Critical implementation details (sync offload, causal=False for decode)
- Helper methods in HybridKVCacheManager

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-08 05:39:26 +08:00
Zijie Tian
bbbfd1e7da [docs] Simplify multi-instance development with direct PYTHONPATH
Replace pip install -e . --prefix=./.local approach with simpler PYTHONPATH method:
- No pip install required
- Code changes take effect immediately
- Each worktree is completely isolated

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-08 04:51:55 +08:00
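The mechanism this workflow relies on — a directory on PYTHONPATH is importable with no pip install, so each worktree resolves its own copy of the code — can be demonstrated end to end (`fakepkg` below is a stand-in module, not part of the repository):

```python
# Demonstration of the PYTHONPATH mechanism: a directory placed on
# PYTHONPATH is importable by a child interpreter without pip install.
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as worktree:
    # Stand-in for a worktree containing the package source.
    with open(os.path.join(worktree, "fakepkg.py"), "w") as f:
        f.write("TAG = 'from-worktree'\n")

    env = dict(os.environ, PYTHONPATH=worktree)
    out = subprocess.run(
        [sys.executable, "-c", "import fakepkg; print(fakepkg.TAG)"],
        env=env, capture_output=True, text=True, check=True,
    )
    print(out.stdout.strip())
```

Because nothing is installed into site-packages, edits take effect on the next run and worktrees cannot shadow each other — the two properties the commit message calls out.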
Zijie Tian
8fd25d72d7 Merge perf_opt-1 and perf_opt-2 branches
Combines two performance optimization features:
- perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache)
- perf_opt-2: Per-layer prefill buffer for async offload

Both features are complementary and improve CPU offload performance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 06:03:44 +08:00
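The double-buffered layer cache from perf_opt-1 can be sketched as a two-slot ring: while layer i computes against slot i % 2, layer i+1's KV is fetched into the other slot. A pure-Python simulation of the scheduling (no CUDA; all names illustrative — the real implementation overlaps a copy stream with compute):

```python
# Pure-Python simulation of a double-buffered cross-layer pipeline.
NUM_LAYERS = 4
cpu_kv = {i: f"kv_layer_{i}" for i in range(NUM_LAYERS)}

buffers = [None, None]           # two GPU-side slots
log = []

def prefetch(layer: int):
    # Copy layer's KV from CPU into its slot (layer % 2).
    buffers[layer % 2] = cpu_kv[layer]
    log.append(f"prefetch L{layer} -> buf{layer % 2}")

prefetch(0)                      # warm up the pipeline
for layer in range(NUM_LAYERS):
    if layer + 1 < NUM_LAYERS:
        prefetch(layer + 1)      # overlaps with compute in the real system
    kv = buffers[layer % 2]      # next layer went to the OTHER slot
    log.append(f"compute L{layer} with {kv}")

print("\n".join(log))
```

Alternating slots is what makes the two merged features complementary: the decode pipeline hides fetch latency behind compute, while the per-layer prefill buffer hides offload latency the same way in the other direction.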
Zijie Tian
ccf27d3a74 [claudesquad] update from 'perf_opt-1' on 07 Jan 26 05:58 CST 2026-01-07 05:58:23 +08:00
Zijie Tian
0ad86eb449 [claudesquad] update from 'perf_opt-2' on 07 Jan 26 05:58 CST 2026-01-07 05:58:10 +08:00
Zijie Tian
2fe50bab50 [claudesquad] update from 'debug_chunk-2' on 07 Jan 26 03:27 CST 2026-01-07 03:27:27 +08:00
Zijie Tian
f240903013 [docs] Add GPU mutex instructions for multi-instance debugging
Add instructions for Claude instances to check GPU availability before
running CUDA operations, preventing conflicts when multiple instances
debug in parallel on a single GPU.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 01:42:59 +08:00
Zijie Tian
edb5273e34 [WIP] Added basic test for quest. 2026-01-06 22:30:31 +08:00
Zijie Tian
e554d5482b [refactor] Delete unnecessary test, and refactor the offload prefix cache. 2026-01-05 20:31:42 +08:00
Zijie Tian
054aaff403 [fix] Fixed needle test bug. 2026-01-05 18:34:09 +08:00
Zijie Tian
9b52d25866 [docs] Update CLAUDE.md. 2026-01-03 20:46:00 +08:00
Zijie Tian
bf4c63c7ec [docs] Added Sparse Attn. 2025-12-29 19:56:54 +08:00
Zijie Tian
82ed34fc2d [opt] optimize nano-vllm performance to be comparable with vllm. 2025-12-25 03:47:07 +08:00
Zijie Tian
16fcf8350b [WIP] replace merge attention with Triton kernel. 2025-12-25 01:07:05 +08:00
Zijie Tian
6ec1b23982 [WIP] NEED to modify communication. 2025-12-24 21:57:51 +08:00
Zijie Tian
782437c486 [WIP] remove num_prefetch_blocks variable. 2025-12-24 18:22:26 +08:00
Zijie Tian
1907b625b6 [refactor] Remove legacy mode path. 2025-12-22 20:17:56 +08:00
Zijie Tian
08d83185ce [fix] fix bench*.py. 2025-12-22 19:53:50 +08:00
Zijie Tian
8df0c7517b [docs] refactor CLAUDE.md. 2025-12-15 21:43:33 +08:00
Zijie Tian
b8b6478506 [feat] Needs to be optimized with async prefetch. 2025-12-15 06:58:40 +08:00
Zijie Tian
1081ab51ea [refactor] Refactor offload code to multi-chunk. 2025-12-15 01:13:58 +08:00
Zijie Tian
5949537faf [docs] Start using CLAUDE rules. 2025-12-15 00:20:54 +08:00
Zijie Tian
a37f07943c [docs] Update the CLAUDE.md. 2025-12-15 00:13:27 +08:00
Zijie Tian
761929390e [bench] Added vllm vs nano-vllm bench. 2025-12-10 00:44:57 +08:00