Commit Graph

168 Commits

Author SHA1 Message Date
Zijie Tian
b8c00399af chore: sync submodule URL with tzj/minference (use HTTPS) 2026-01-18 19:32:18 +08:00
Zijie Tian
13586e689b docs: add chunked prefill integration plan
Analyzes the memory-layout differences between the two branches and explains
why the block-based design is critical for supporting arbitrary-length inference.

Key findings:
- The max_seq_len design on tzj/vs_offload makes GPU memory grow with sequence length
- The block-based design on tzj/minference keeps GPU memory constant (~1.6 GB)
- Enables 4M+ token inference on a 24GB RTX 3090

Plan: port the chunked prefill mechanism from tzj/minference to the tzj/vs_offload branch:
- Block-based GPU cache (no layer dimension)
- Per-layer prefill buffer (fully parallelized offload)
- Cross-layer pipeline buffers (double-buffering)
- Chunked prefill flow with online LSE merging

Sparse policy: keep the architecture; only the FULL policy is implemented at this stage

Related files:
- docs/chunked_prefill_integration_plan.md (new)
2026-01-18 18:49:19 +08:00
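
The "online LSE merging" item above refers to combining partial attention outputs computed over different KV chunks into the result of attending over all of them. A minimal PyTorch sketch of the standard merge formula; names and shapes are illustrative, not taken from this repo:

```python
import torch

def merge_attention_chunks(o1, lse1, o2, lse2):
    """Merge two partial attention results computed over disjoint KV chunks.

    o1, o2:     [num_heads, q_len, head_dim] partial outputs (softmax over each chunk only)
    lse1, lse2: [num_heads, q_len] log-sum-exp of the attention logits per chunk
    Returns the output and LSE equivalent to attending over both chunks at once."""
    lse = torch.logaddexp(lse1, lse2)          # combined normalizer
    w1 = torch.exp(lse1 - lse).unsqueeze(-1)   # rescale factor for chunk 1
    w2 = torch.exp(lse2 - lse).unsqueeze(-1)   # rescale factor for chunk 2
    return w1 * o1 + w2 * o2, lse

if __name__ == "__main__":
    # quick check against full attention over the concatenated KV
    h, q, d, n = 2, 4, 8, 16
    Q, K, V = torch.randn(h, q, d), torch.randn(h, n, d), torch.randn(h, n, d)

    def attn(Q, K, V):
        s = Q @ K.transpose(-1, -2) / d ** 0.5
        return torch.softmax(s, -1) @ V, torch.logsumexp(s, -1)

    o1, l1 = attn(Q, K[:, :n // 2], V[:, :n // 2])
    o2, l2 = attn(Q, K[:, n // 2:], V[:, n // 2:])
    merged, _ = merge_attention_chunks(o1, l1, o2, l2)
    full, _ = attn(Q, K, V)
    assert torch.allclose(merged, full, atol=1e-5)
```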
Zijie Tian
e72725c12b test: add OffloadedTensor unified test suite
Add comprehensive test suite for OffloadedTensor implementation,
including basic functionality, chunked GEMM, and sync analysis.

Components:
- OffloadedTensor: Virtual GPU tensor with transparent CPU/GPU data movement
- OffloadManager: LRU cache management with performance stats
- ChunkedOffloadLinear: Chunked GEMM along seqlen dimension

Tests (10 total):
- Basic functionality, MLP integration, LRU eviction, correctness
- Memory analysis, 128K sequence, performance comparison, transformers layer
- Sync behavior analysis, profiler analysis

Key findings:
- 93.9% memory savings for 128K sequences (3156MB → 191MB)
- Constant memory footprint regardless of sequence length
- Only 8% performance overhead from chunked processing

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-18 10:41:40 +08:00
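
As a rough illustration of the ChunkedOffloadLinear idea listed above (pushing a long sequence through a linear layer one slice at a time so only that slice occupies GPU memory), here is a hedged sketch; the function and argument names are illustrative, not the repo's actual API:

```python
import torch

@torch.no_grad()
def chunked_linear(linear: torch.nn.Linear, x_cpu: torch.Tensor,
                   chunk_size: int = 4096) -> torch.Tensor:
    """Apply a GPU-resident nn.Linear to a CPU-resident [seq_len, in_features]
    activation, one sequence chunk at a time, writing results back to CPU.

    Peak GPU activation memory is O(chunk_size) instead of O(seq_len)."""
    seq_len = x_cpu.shape[0]
    out_cpu = torch.empty(seq_len, linear.out_features,
                          dtype=x_cpu.dtype, device="cpu", pin_memory=True)
    device = linear.weight.device
    for start in range(0, seq_len, chunk_size):
        end = min(start + chunk_size, seq_len)
        x_gpu = x_cpu[start:end].to(device, non_blocking=True)   # H2D copy of one chunk
        y_gpu = linear(x_gpu)                                     # chunked GEMM
        out_cpu[start:end].copy_(y_gpu, non_blocking=True)        # D2H copy of the result
    torch.cuda.synchronize()
    return out_cpu
```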
Zijie Tian
cfb188c34a docs: add chunked prefill analysis for ultra-long sequences
Add comprehensive analysis document covering:
- MLP activation memory bottlenecks with SwiGLU architecture
- Chunked MLP strategy (98% memory reduction)
- Chunked prefill for single layers (78% memory reduction)
- Streaming Chunked Prefill (最优方案): GPU memory becomes constant
- Memory formulas and implementation guidance
- Theoretical maximum: 4M tokens on 24GB GPU (128× improvement)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-16 10:38:02 +08:00
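
To make the "memory formulas" item above concrete, a back-of-the-envelope script for per-layer SwiGLU activation memory, full-sequence versus chunked. The dimensions are Llama-3.1-8B-style placeholders and the printed percentage is whatever the arithmetic gives, not a figure quoted from the document:

```python
def swiglu_activation_bytes(tokens: int, hidden: int = 4096,
                            intermediate: int = 14336, dtype_bytes: int = 2) -> int:
    """Rough peak activation for one SwiGLU MLP: gate_proj + up_proj outputs
    ([tokens, intermediate] each) plus the down_proj output ([tokens, hidden])."""
    return tokens * (2 * intermediate + hidden) * dtype_bytes

full = swiglu_activation_bytes(128 * 1024)   # whole 128k prefill at once
chunked = swiglu_activation_bytes(128)       # one 128-token chunk at a time
print(f"full:    {full / 2**30:.2f} GiB")
print(f"chunked: {chunked / 2**20:.2f} MiB  ({100 * (1 - chunked / full):.1f}% reduction)")
```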
Zijie Tian
2826a649de docs: add XAttention integration guide
Comprehensive documentation for XAttention sparse policy integration:
- Algorithm principles (chunked estimation + block sparse attention)
- COMPASS source code analysis
- Design decisions for CPU offload mode
- Implementation details (utils.py, kernels.py, xattn.py)
- Problem-solving (OOM, GQA, abstract method)
- Test validation results (RULER 32k benchmark)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 10:16:21 +08:00
Zijie Tian
24baeb6d5a chore: add planning-with-files rule configuration 2026-01-14 10:09:52 +08:00
Zijie Tian
57f4e9c6e6 docs: reorganize documentation files
- Move notes.md to docs/development_notes.md
- Move Xattention_analysis.md to docs/xattention_analysis.md
- Delete DEBUG_SUMMARY.md (no longer needed)
- Update CLAUDE.md with documentation index entries

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 10:08:41 +08:00
Zijie Tian
ac1ccbceaa feat: add XAttention sparse policy integration
Integrate COMPASS XAttention algorithm into nano-vllm's CPU offload
execution path. Uses FlashAttention with native GQA support for
offload mode.

New files:
- nanovllm/kvcache/sparse/utils.py: find_blocks_chunked() utility
- nanovllm/kvcache/sparse/kernels.py: Triton kernels for XAttention
- nanovllm/kvcache/sparse/xattn.py: XAttentionPolicy implementation

Modified:
- nanovllm/config.py: Add XATTN configuration parameters
- nanovllm/engine/model_runner.py: Support XATTN policy
- nanovllm/kvcache/sparse/__init__.py: Register XAttentionPolicy
- tests/test_ruler.py: Add --sparse-policy parameter

Test results (32k ruler):
- NIAH tasks: 12/12 (100%)
- QA/Recall tasks: 11/15 (73%)
- Overall: 23/27 (85%)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 10:04:46 +08:00
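
A heavily simplified sketch of what a find_blocks_chunked-style selection can look like: estimate per-block importance from a cheap pooled score, softmax it, and keep the smallest set of KV blocks whose cumulative probability exceeds a threshold. This only illustrates the general idea; the repo's Triton kernels and XAttention's actual antidiagonal scoring are not reproduced here:

```python
import torch

def pooled_block_scores(q, k, block: int = 64) -> torch.Tensor:
    """Crude importance estimate: mean-pool Q and K over blocks and take dot products.
    Assumes seq_len is divisible by block. (XAttention proper uses strided
    antidiagonal sums; this pooling is just a placeholder.)"""
    qb = q.unflatten(0, (-1, block)).mean(dim=1)   # [num_q_blocks, head_dim]
    kb = k.unflatten(0, (-1, block)).mean(dim=1)   # [num_kv_blocks, head_dim]
    return qb @ kb.T

def select_blocks(block_scores: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """block_scores: [num_q_blocks, num_kv_blocks]. Returns a boolean mask of the
    minimal set of KV blocks per query block reaching the probability threshold."""
    probs = torch.softmax(block_scores, dim=-1)
    sorted_probs, order = probs.sort(dim=-1, descending=True)
    prefix = sorted_probs.cumsum(dim=-1) - sorted_probs      # mass before each block
    keep_sorted = prefix < threshold                          # keep until threshold reached
    return torch.zeros_like(probs).scatter_(-1, order, keep_sorted.float()) > 0
```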
Zijie Tian
029894118d feat: add claude-flow MCP configuration
Add .claude/settings.json to enable claude-flow MCP in all worktrees.

This configuration includes:
- SessionStart hook to auto-start claude-flow daemon
- Auto-approval for claude-flow MCP tools and CLI commands
- Basic claude-flow settings

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 09:18:09 +08:00
Zijie Tian
8d6fde3b23 docs: add Block-Sparse-Attention library reference
Add comprehensive documentation for the MIT-Han-Lab Block-Sparse-Attention
library (3rdparty submodule, branch: tzj/minference).

The new document covers:
- Four sparse attention modes (dense, token/block streaming, block sparse)
- Hybrid mask support (different patterns per head)
- Complete API reference for all three functions
- Performance benchmarks (up to 3-4x speedup on A100)
- Integration considerations for nano-vllm

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 08:39:03 +08:00
Zijie Tian
6a6bd75685 feat: add Block-Sparse-Attention submodule (tzj/minference branch)
Add 3rdparty/Block-Sparse-Attention as a git submodule from the
tzj/minference branch of Zijie-Tian/Block-Sparse-Attention repository.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 08:07:07 +08:00
Zijie Tian
86633004ca 📝 docs: add 64k memory analysis and test configuration updates
Add comprehensive memory analysis for 64k inference on Llama 3.1 8B:

New documentation:
- docs/64k_memory_analysis.md: GPU-only vs offload memory analysis,
  OOM root cause (memory fragmentation), RTX 3090 limitations,
  theoretical vs actual memory usage breakdown

Test configuration updates:
- tests/test_ruler.py: Add --num-kv-buffers parameter for ring buffer
  size tuning (default 4, can reduce to 1 for lower memory)
- Update default data_dir to ruler_64k
- Update default max_model_len to 65664 for 64k support

CLAUDE.md updates:
- Add 64k_memory_analysis.md to documentation index
- Document num_kv_buffers parameter in Configuration section
- Add 64k hardware requirements note to Model Limits

Key findings: 64k inference requires ~26GB (GPU-only) or ~23GB (offload)
due to memory fragmentation on 24GB GPUs, making A100 (40GB+) the
recommended hardware for 64k workloads.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 07:02:09 +08:00
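
The ~26 GB figure above is roughly what the standard KV-cache formula predicts once model weights are added. A quick sanity-check script using Llama-3.1-8B's published shapes (32 layers, 8 KV heads, head_dim 128, fp16); the weight figure and the omitted activation/fragmentation overheads are approximations:

```python
def kv_cache_gib(seq_len: int, layers: int = 32, kv_heads: int = 8,
                 head_dim: int = 128, dtype_bytes: int = 2) -> float:
    # factor of 2 for K and V
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes / 2**30

weights_gib = 8_000_000_000 * 2 / 2**30   # ~8B params in fp16/bf16
kv_gib = kv_cache_gib(64 * 1024)          # 64k context
print(f"weights ~= {weights_gib:.1f} GiB, KV cache ~= {kv_gib:.1f} GiB, "
      f"total ~= {weights_gib + kv_gib:.1f} GiB before activations/fragmentation")
```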
Zijie Tian
c51a640a29 🐛 fix: remove torch.compile from add_rms_forward to avoid recompilation
The add_rms_forward method processes two input tensors (x and residual),
which causes torch.compile recompilation issues. Keep @torch.compile only
on rms_forward which processes a single input.

This prevents unnecessary recompilation overhead during inference.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 07:02:02 +08:00
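
A minimal sketch of the pattern this fix describes: keep @torch.compile on the single-tensor rms_forward path and leave the residual-add path eager, so the two-input signature cannot trigger recompilation. The class below is illustrative, not the repo's actual layer code:

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    @torch.compile
    def rms_forward(self, x: torch.Tensor) -> torch.Tensor:
        # single-input path: compiled, one stable input signature
        var = x.float().pow(2).mean(dim=-1, keepdim=True)
        return (x.float() * torch.rsqrt(var + self.eps)).to(x.dtype) * self.weight

    def add_rms_forward(self, x: torch.Tensor, residual: torch.Tensor):
        # two-input path: kept eager to avoid torch.compile recompilation
        residual = residual + x
        return self.rms_forward(residual), residual
```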
Zijie Tian
dce6ad6b74 ♻️ refactor: chunked LayerNorm/QKV/MLP for 64k memory optimization
Implement chunked processing for LayerNorm, QKV projection, and MLP
layers to reduce peak activation memory for 64k sequence inference.

Changes:
- Chunked input_layernorm and post_attention_layernorm (chunk_size=128)
- Chunked QKV projection (chunk_size=128)
- Chunked MLP processing (chunk_size=128) with memory cleanup
- Added torch.cuda.empty_cache() calls after each chunk

This reduces peak activation from ~2 GB to ~50 MB per layer,
making 64k inference theoretically possible on 24GB GPUs
(though still limited by memory fragmentation).

Related: docs/64k_memory_analysis.md

Co-Authored-By: Claude <noreply@anthropic.com>
2026-01-14 07:01:57 +08:00
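
A rough sketch of the chunked MLP part of this refactor (the LayerNorm and QKV chunking follow the same slicing pattern): process the sequence in chunk_size-token slices and free temporaries between slices. Function and module names are illustrative:

```python
import torch

@torch.no_grad()
def chunked_swiglu_mlp(gate_proj, up_proj, down_proj, x: torch.Tensor,
                       chunk_size: int = 128) -> torch.Tensor:
    """x: [seq_len, hidden]. Only chunk_size tokens' worth of the [*, intermediate]
    activations exist at any time, instead of the full sequence's."""
    out = torch.empty_like(x)
    for start in range(0, x.shape[0], chunk_size):
        end = min(start + chunk_size, x.shape[0])
        xc = x[start:end]
        h = torch.nn.functional.silu(gate_proj(xc)) * up_proj(xc)   # SwiGLU
        out[start:end] = down_proj(h)
        del h
        torch.cuda.empty_cache()   # release the per-chunk intermediate immediately
    return out
```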
Zijie Tian
cf168fd9b9 test: add comprehensive RULER benchmark test suite
- Add test_ruler.py supporting all 13 RULER tasks (NIAH, QA, CWE, FWE, VT)
- Implement RULER official evaluation metrics (string_match_all/part)
- Fix max_model_len to 32896 to prevent decode OOM on long inputs
- Add ruler_benchmark_report.md with full test results (92.1% accuracy)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-14 00:51:30 +08:00
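
The two evaluation metrics named above are simple substring checks. The functions below paraphrase the RULER definitions as understood here; treat the exact normalization details as an assumption rather than the benchmark's verbatim code:

```python
def string_match_all(preds: list[str], refs: list[list[str]]) -> float:
    """Average fraction of reference strings found in each prediction (multi-needle NIAH)."""
    scores = [sum(r.lower() in p.lower() for r in ref) / len(ref)
              for p, ref in zip(preds, refs)]
    return 100 * sum(scores) / len(scores)

def string_match_part(preds: list[str], refs: list[list[str]]) -> float:
    """Full credit if any one reference string appears in the prediction (QA-style tasks)."""
    scores = [float(any(r.lower() in p.lower() for r in ref))
              for p, ref in zip(preds, refs)]
    return 100 * sum(scores) / len(scores)
```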
Zijie Tian
76af506956 [claudesquad] update from 'multi-request-2' on 13 Jan 26 02:01 CST 2026-01-13 02:01:07 +08:00
Zijie Tian
49519c7ce7 📝 docs: update offload accuracy issue with independent testing results
Document key finding: single request inference works correctly (100% accuracy).
The 66% accuracy issue in batch mode is due to state accumulation between
sequential requests in the same process.

- Add comparison table: independent (100%) vs batch (66%) testing modes
- Document root cause analysis: state cleanup issue between requests
- Add workaround using test_ruler_niah.sh for independent testing
- Update next steps to focus on OffloadEngine reset/cleanup logic

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-12 21:08:35 +08:00
Zijie Tian
1424e665e7 test: add parallel multi-GPU RULER NIAH test script
Add test_ruler_niah.sh for independent sample testing across multiple GPUs.
Each sample runs in a separate Python process to avoid state accumulation issues.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-12 21:08:27 +08:00
Zijie Tian
64971c8e8a Merge branch 'zijie/fix-dist-3': Fix distributed port conflict
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() and context manager support in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore the tests/data/ directory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-12 16:27:25 +08:00
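
The auto port allocation mentioned above usually boils down to binding port 0 and letting the OS pick. A minimal sketch of what a _find_free_port() helper generally looks like; the repo's exact implementation may differ:

```python
import socket

def _find_free_port() -> int:
    """Ask the OS for an unused TCP port, e.g. to use as the torch.distributed master port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("127.0.0.1", 0))   # port 0 = let the kernel choose
        return s.getsockname()[1]
```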
Zijie Tian
de6f36bdb2 [docs] Added dist port issue. 2026-01-12 15:16:39 +08:00
Zijie Tian
8e0888c20c [docs] Added offload_acc issue. 2026-01-12 15:05:55 +08:00
Zijie Tian
a6cc703d73 [tests] Added test_niah_standalone.py. 2026-01-12 00:16:37 +08:00
Zijie Tian
5895de0c97 [docs] Added transformers error description. 2026-01-11 18:48:50 +08:00
Zijie Tian
2771312565 [docs] Add sparse prefill integration plan from int-minference analysis
Consolidated analysis from int-minference-1/2/3 branches into a unified
integration plan for MInference, XAttention, and FlexPrefill strategies.

Key design decisions:
- Backward compatible: Keep existing SparsePolicy interface
- Unified BlockMask intermediate representation for new strategies
- XAttention/FlexPrefill use block_sparse_attn_func kernel
- MInference can optionally use block_sparse_attn (Phase 4)

Five-phase implementation plan:
1. BlockMask + block_sparse_attn wrapper
2. XAttention implementation
3. FlexPrefill implementation
4. Optional MInference refactoring
5. Integration and testing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 23:33:09 +08:00
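
Purely as an illustration of the "unified BlockMask intermediate representation" decision above, a sketch of what such a structure could look like; the field names are hypothetical:

```python
from dataclasses import dataclass
import torch

@dataclass
class BlockMask:
    """Which KV blocks each query block attends to, shared by XAttention/FlexPrefill
    before being handed to a block-sparse attention kernel."""
    block_size: int      # tokens per block (e.g. 64 or 128)
    mask: torch.Tensor   # bool [num_heads, num_q_blocks, num_kv_blocks]

    def density(self) -> float:
        """Fraction of (q_block, kv_block) pairs actually computed."""
        return self.mask.float().mean().item()
```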
Zijie Tian
de6eae472d [docs] Update CLAUDE.md with multi-model support documentation
- Update overview to reflect Qwen3/Qwen2/Llama support
- Add docs/multi_model_support.md to documentation index
- Add Llama-3.1-8B-Instruct to model limits

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 21:29:39 +08:00
Zijie Tian
e23be2e844 Merge branch 'zijie/add-llama-1': Add multi-model support
- Add model registry system for dynamic model loading
- Implement LlamaForCausalLM with Llama3 RoPE scaling
- Register Qwen3ForCausalLM and Qwen2ForCausalLM
- Update ModelRunner to use get_model_class() for dynamic model selection

Tested: needle 32k test PASSED

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 21:20:53 +08:00
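
A minimal sketch of the registry pattern this merge describes: map architecture names from the HF config to model classes and resolve them at runtime. The module layout and names below are illustrative rather than the repo's exact code:

```python
# hypothetical registry sketch (not the repo's actual module)
_MODEL_REGISTRY: dict[str, type] = {}

def register_model(architecture: str):
    def wrap(cls):
        _MODEL_REGISTRY[architecture] = cls
        return cls
    return wrap

def get_model_class(architecture: str) -> type:
    try:
        return _MODEL_REGISTRY[architecture]
    except KeyError:
        raise ValueError(f"Unsupported architecture: {architecture}") from None

# usage: model classes register themselves, ModelRunner resolves from the HF config
# @register_model("LlamaForCausalLM")
# class LlamaForCausalLM(nn.Module): ...
# model_cls = get_model_class(hf_config.architectures[0])
```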
Zijie Tian
24f5ae5fc3 [claudesquad] update from 'add-llama-1' on 10 Jan 26 21:14 CST 2026-01-10 21:14:32 +08:00
Zijie Tian
03a8c033cb [claudesquad] update from 'add-llama-1' on 10 Jan 26 21:03 CST 2026-01-10 21:03:45 +08:00
Zijie Tian
9377ff63fe Merge remote-tracking branch 'origin/zijie/fix-bug-2' into tzj/vs_offload 2026-01-09 16:13:38 +08:00
Zijie Tian
067e36f4a2 [claudesquad] update from 'fix-bug-2' on 09 Jan 26 16:10 CST 2026-01-09 16:10:28 +08:00
Zijie Tian
1425510a2e [claudesquad] update from 'fix-bug-2' on 09 Jan 26 16:05 CST 2026-01-09 16:05:36 +08:00
Zijie Tian
335117bfca Merge remote-tracking branch 'origin/zijie/fix-bug-2' into tzj/vs_offload 2026-01-09 15:21:48 +08:00
Zijie Tian
5012b11291 [bench] Modify bench_vllm.py 2026-01-09 15:20:37 +08:00
Zijie Tian
ccf04d3917 [claudesquad] update from 'fix-bug-2' on 09 Jan 26 15:16 CST 2026-01-09 15:16:55 +08:00
Zijie Tian
59f8970ed3 [claudesquad] update from 'fix-bug-2' on 09 Jan 26 15:12 CST 2026-01-09 15:12:42 +08:00
Zijie Tian
6378cb4c17 Merge remote-tracking branch 'origin/zijie/fix-ga-perf-2' into tzj/vs_offload 2026-01-09 14:21:00 +08:00
Zijie Tian
47e3e465f0 [claudesquad] update from 'fix-ga-perf-2' on 09 Jan 26 14:08 CST 2026-01-09 14:08:12 +08:00
Zijie Tian
aac94c9481 [claude] Added some commands. 2026-01-09 13:16:23 +08:00
Zijie Tian
79c4df4a27 [claudesquad] update from 'int-minference-1' on 08 Jan 26 23:42 CST 2026-01-08 23:42:30 +08:00
Zijie Tian
ea4e904de0 [claudesquad] update from 'int-minference-1' on 08 Jan 26 23:22 CST 2026-01-08 23:22:38 +08:00
Zijie Tian
0bfe1984ef [docs] Refine GPU mutex: exclusive for benchmarks, port check for tests
Benchmarks (bench*.py) still require exclusive GPU access for accurate
measurements. Other scripts (tests, examples) now only check for
distributed port 29500 conflicts, allowing parallel GPU sharing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-08 21:35:08 +08:00
Zijie Tian
105201b902 [claudesquad] update from 'lw-offload-2' on 08 Jan 26 21:19 CST 2026-01-08 21:19:38 +08:00
Zijie Tian
a8c9f0d837 [claudesquad] update from 'lw-offload-2' on 08 Jan 26 20:53 CST 2026-01-08 20:53:08 +08:00
Zijie Tian
85bcca3d17 [claudesquad] update from 'int-offload-1' on 08 Jan 26 19:44 CST 2026-01-08 19:44:29 +08:00
Zijie Tian
b5c0ef3b7a [docs] Replace chunked prefill docs with layer-wise offload strategy
Remove all chunked prefill related documentation (ring buffer, sgDMA,
Triton merge kernels, known issues) and replace with layer-wise offload
system documentation including:
- Design philosophy and benefits
- Memory layout and per-layer KV size table
- Prefill and decode flow pseudocode
- Critical implementation details (sync offload, causal=False for decode)
- Helper methods in HybridKVCacheManager

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-08 05:39:26 +08:00
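
To make the prefill/decode flow description above concrete, a compressed sketch of a layer-wise offload loop: only one layer's KV is resident on the GPU at a time, and the next layer's KV is prefetched while the current layer computes. This is schematic code under assumed interfaces, not the repo's HybridKVCacheManager API:

```python
def decode_step(layers, kv_manager, hidden):
    """One decode step with layer-wise KV offload (schematic, assumed interfaces)."""
    kv_manager.prefetch_layer(0)                   # H2D copy of layer 0's KV
    for i, layer in enumerate(layers):
        kv_manager.wait_layer(i)                   # ensure layer i's KV is on the GPU
        if i + 1 < len(layers):
            kv_manager.prefetch_layer(i + 1)       # overlap the next copy with compute
        hidden = layer(hidden, kv_cache=kv_manager.gpu_view(i))
        kv_manager.release_layer(i)                # GPU slot can be reused
    return hidden
```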
Zijie Tian
bbbfd1e7da [docs] Simplify multi-instance development with direct PYTHONPATH
Replace pip install -e . --prefix=./.local approach with simpler PYTHONPATH method:
- No pip install required
- Code changes take effect immediately
- Each worktree is completely isolated

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-08 04:51:55 +08:00
Zijie Tian
c1ddb44e5d Merge branch 'zijie/layer-prefill-1' into tzj/vs_offload
Adds MInference sparse attention support:
- New MInference sparse policy implementation
- A-shape, vertical-slash, and block-sparse patterns
- Updated bench.py with sparse attention options
- test_minference_gpu.py validation test

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-08 03:40:53 +08:00
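
A compressed sketch of how a vertical-slash pattern is typically estimated: score the last few queries against all keys, keep the heaviest key columns as "vertical" lines and the heaviest diagonal offsets as "slash" lines. Index names, thresholds, and the omitted causal mask are illustrative, not the repo's MInference policy:

```python
import torch

def vertical_slash_indices(q, k, last_q: int = 64, top_v: int = 1024, top_s: int = 64):
    """q, k: [seq_len, head_dim] for one head. Returns heavy key-column indices
    (vertical) and heavy diagonal offsets (slash). Causal masking omitted for brevity."""
    seq, dim = k.shape
    scores = torch.softmax(q[-last_q:] @ k.T / dim ** 0.5, dim=-1)   # [last_q, seq]
    vertical = scores.sum(dim=0).topk(min(top_v, seq)).indices       # heavy columns
    # accumulate scores by distance back from each query to find heavy diagonals
    dist = torch.arange(seq - last_q, seq).unsqueeze(1) - torch.arange(seq).unsqueeze(0)
    valid = dist >= 0
    diag = torch.zeros(seq, dtype=scores.dtype)
    diag.scatter_add_(0, dist[valid], scores[valid])
    slash = diag.topk(min(top_s, seq)).indices                       # heavy diagonal offsets
    return vertical, slash
```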
Zijie Tian
d8a87da1c3 [claudesquad] update from 'layer-prefill-1' on 08 Jan 26 03:36 CST 2026-01-08 03:36:39 +08:00
Zijie Tian
ecd9ae0271 [WIP] changed to layerwise offload. 2026-01-08 00:28:27 +08:00
Zijie Tian
6575099a06 [refactor] Cleanup unused code after perf_opt merge
Removed ~460 lines of unused/redundant code from offload_engine.py:
- CUDA gather methods (gathered_h2d_*, update_gather_indices)
- Legacy async transfer methods (prefetch_block_async, offload_block_async)
- Legacy sync/wait methods (wait_for_block, wait_all_transfers, sync_indices)
- Legacy compatibility methods (load_to_compute_layer, wait_compute_layer)
- Unused gather_indices tensors and memory calculations

Updated class docstring to reflect current architecture.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 06:25:21 +08:00