Add a new Claude command that executes task_plan.md refactoring with:
- GPU isolation via --gpu <id> parameter (required)
- Optional --no-interrupt mode for autonomous execution
- Progress tracking via progress.md and findings.md
- Strict CUDA_VISIBLE_DEVICES enforcement
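The enforcement pattern can be sketched as follows (a minimal sketch; `enforce_gpu_isolation` is a hypothetical helper name, not necessarily the command's actual implementation). The key point is that the variable must be pinned before any CUDA-aware library is imported:

```python
import os

def enforce_gpu_isolation(gpu_id: int) -> None:
    """Pin CUDA_VISIBLE_DEVICES to a single GPU.

    Must run before `import torch`: CUDA reads the variable once at
    initialization, so setting it later has no effect.
    """
    existing = os.environ.get("CUDA_VISIBLE_DEVICES")
    if existing is not None and existing != str(gpu_id):
        raise RuntimeError(
            f"CUDA_VISIBLE_DEVICES already set to {existing!r}; "
            f"refusing to override with --gpu {gpu_id}"
        )
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
```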
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add ops module ported from tzj/minference branch containing:
- xattn.py: XAttention block importance estimation with Triton kernels
- xattn_estimate(): standard estimation for sparse attention mask
- xattn_estimate_chunked(): chunked prefill compatible version
- flat_group_gemm_fuse_reshape(): fused stride reshape + GEMM kernel
- softmax_fuse_block_sum(): online softmax + block-wise sum kernel
- chunked_attention.py: Flash attention with LSE output for chunk merging
- test_xattn_estimate_chunked.py: verification test (all seq_lens pass)
This prepares the foundation for AttentionPolicy refactoring where
XAttentionPolicy.estimate() will call these ops.
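The LSE-based chunk merging that chunked_attention.py enables can be illustrated with a small numpy sketch (illustrative only; the real kernels operate on GPU tensors and fused layouts):

```python
import numpy as np

def merge_attention_chunks(o1, lse1, o2, lse2):
    """Merge two partial attention outputs over disjoint KV chunks.

    o*:   [..., q_len, head_dim] partial outputs
    lse*: [..., q_len] log-sum-exp of the attention logits per chunk

    Each partial output is normalized by its own chunk's softmax sum;
    rescaling by exp(lse_i - lse) renormalizes against the combined sum.
    """
    lse = np.logaddexp(lse1, lse2)            # combined normalizer
    w1 = np.exp(lse1 - lse)[..., None]        # rescale factor for chunk 1
    w2 = np.exp(lse2 - lse)[..., None]        # rescale factor for chunk 2
    return o1 * w1 + o2 * w2, lse
```

Merging in this form is associative, so any number of prefill chunks can be folded in one at a time.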
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move notes.md to docs/development_notes.md
- Move Xattention_analysis.md to docs/xattention_analysis.md
- Delete DEBUG_SUMMARY.md (no longer needed)
- Update CLAUDE.md with documentation index entries
Co-Authored-By: Claude <noreply@anthropic.com>
Add .claude/settings.json to enable claude-flow MCP in all worktrees.
This configuration includes:
- SessionStart hook to auto-start claude-flow daemon
- Auto-approval for claude-flow MCP tools and CLI commands
- Basic claude-flow settings
Co-Authored-By: Claude <noreply@anthropic.com>
Add comprehensive documentation for the MIT-Han-Lab Block-Sparse-Attention
library (3rdparty submodule, branch: tzj/minference).
The new document covers:
- Four sparse attention modes (dense, token/block streaming, block sparse)
- Hybrid mask support (different patterns per head)
- Complete API reference for all three functions
- Performance benchmarks (up to 3-4x speedup on A100)
- Integration considerations for nano-vllm
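For intuition, the block-level layout of a streaming pattern can be sketched as follows (illustrative only; `streaming_block_mask` is not the library's API, and the real masks are hybrid per-head structures):

```python
import numpy as np

def streaming_block_mask(num_blocks: int, sink: int = 1, window: int = 2) -> np.ndarray:
    """Boolean [q_blocks, kv_blocks] mask: global sink blocks + causal local window."""
    mask = np.zeros((num_blocks, num_blocks), dtype=bool)
    for q in range(num_blocks):
        mask[q, :sink] = True                 # always attend to the sink blocks
        lo = max(0, q - window + 1)
        mask[q, lo:q + 1] = True              # causal sliding window of blocks
    return mask
```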
Co-Authored-By: Claude <noreply@anthropic.com>
Add 3rdparty/Block-Sparse-Attention as a git submodule from the
tzj/minference branch of Zijie-Tian/Block-Sparse-Attention repository.
Co-Authored-By: Claude <noreply@anthropic.com>
Add comprehensive memory analysis for 64k inference on Llama 3.1 8B:
New documentation:
- docs/64k_memory_analysis.md: GPU-only vs offload memory analysis,
OOM root cause (memory fragmentation), RTX 3090 limitations,
theoretical vs actual memory usage breakdown
Test configuration updates:
- tests/test_ruler.py: Add --num-kv-buffers parameter for ring buffer
size tuning (default 4, can reduce to 1 for lower memory)
- Update default data_dir to ruler_64k
- Update default max_model_len to 65664 for 64k support
CLAUDE.md updates:
- Add 64k_memory_analysis.md to documentation index
- Document num_kv_buffers parameter in Configuration section
- Add 64k hardware requirements note to Model Limits
Key finding: 64k inference needs ~26GB (GPU-only) or ~23GB (offload);
memory fragmentation pushes both past what 24GB GPUs can serve, making
A100 (40GB+) the recommended hardware for 64k workloads.
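The headline numbers are consistent with a back-of-envelope calculation, assuming Llama 3.1 8B's published config (32 layers, 8 KV heads via GQA, head_dim 128) in fp16:

```python
# Back-of-envelope check of the ~26GB figure (assumed config, fp16 throughout).
GiB = 1024 ** 3

seq_len  = 65536
layers   = 32
kv_heads = 8        # grouped-query attention
head_dim = 128
dtype_b  = 2        # bytes per fp16/bf16 element

kv_cache = seq_len * layers * 2 * kv_heads * head_dim * dtype_b   # K and V
weights  = 8.0e9 * dtype_b                                        # ~8B params

print(f"KV cache: {kv_cache / GiB:.1f} GiB")   # 8.0 GiB
print(f"weights:  {weights / GiB:.1f} GiB")    # 14.9 GiB
```

KV cache (~8 GiB) plus weights (~15 GiB) already approach 23 GiB before activations, so fragmentation overhead readily pushes a 24GB card over the edge.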
Co-Authored-By: Claude <noreply@anthropic.com>
The add_rms_forward method processes two input tensors (x and residual),
which triggers repeated torch.compile recompilations. Keep @torch.compile
only on rms_forward, which processes a single input.
This prevents unnecessary recompilation overhead during inference.
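The resulting pattern looks roughly like this (a sketch, not the exact nano-vllm code):

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    """Sketch: compile only the single-tensor path."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(hidden_size))

    @torch.compile
    def rms_forward(self, x: torch.Tensor) -> torch.Tensor:
        # Single input -> stable compile guards, compiled once.
        var = x.float().pow(2).mean(-1, keepdim=True)
        return (x.float() * torch.rsqrt(var + self.eps)).to(x.dtype) * self.weight

    def add_rms_forward(self, x: torch.Tensor, residual: torch.Tensor):
        # Two inputs (x, residual) -> left eager to avoid recompilation churn.
        x = x + residual
        var = x.float().pow(2).mean(-1, keepdim=True)
        out = (x.float() * torch.rsqrt(var + self.eps)).to(x.dtype) * self.weight
        return out, x
```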
Co-Authored-By: Claude <noreply@anthropic.com>
Implement chunked processing for LayerNorm, QKV projection, and MLP
layers to reduce peak activation memory for 64k sequence inference.
Changes:
- Chunked input_layernorm and post_attention_layernorm (chunk_size=128)
- Chunked QKV projection (chunk_size=128)
- Chunked MLP processing (chunk_size=128) with memory cleanup
- Added torch.cuda.empty_cache() calls after each chunk
This reduces peak activation from ~2 GB to ~50 MB per layer,
making 64k inference theoretically possible on 24GB GPUs
(though still limited by memory fragmentation).
Related: docs/64k_memory_analysis.md
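The chunking pattern is the same for all three sublayers and can be sketched generically (a sketch; `chunked_forward` is a hypothetical helper, while the actual code chunks each sublayer inline):

```python
import torch

def chunked_forward(module, x: torch.Tensor, chunk_size: int = 128) -> torch.Tensor:
    """Apply a per-token module (norm / QKV proj / MLP) chunk by chunk.

    Peak activation memory scales with chunk_size instead of seq_len, at
    the cost of extra kernel launches and cache flushes.
    """
    outputs = []
    for start in range(0, x.size(0), chunk_size):
        outputs.append(module(x[start:start + chunk_size]))
        # Return freed intermediate activations to the allocator pool.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    return torch.cat(outputs, dim=0)
```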
Co-Authored-By: Claude <noreply@anthropic.com>
- Add test_ruler.py supporting all 13 RULER tasks (NIAH, QA, CWE, FWE, VT)
- Implement RULER official evaluation metrics (string_match_all/part)
- Pin max_model_len to 32896 to prevent decode OOM on long inputs
- Add ruler_benchmark_report.md with full test results (92.1% accuracy)
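For reference, the two metrics behave roughly as follows (a hedged sketch of my reading of RULER's definitions, not a verbatim port of its evaluation code):

```python
def string_match_all(pred: str, refs: list[str]) -> float:
    """Fraction of reference strings found in the prediction (all needles count)."""
    pred = pred.lower()
    return sum(r.lower() in pred for r in refs) / len(refs)

def string_match_part(pred: str, refs: list[str]) -> float:
    """Full credit if any reference string appears in the prediction."""
    pred = pred.lower()
    return float(any(r.lower() in pred for r in refs))
```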
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document key finding: single-request inference works correctly (100% accuracy).
The 66% accuracy in batch mode is caused by state accumulating between
sequential requests in the same process.
- Add comparison table: independent (100%) vs batch (66%) testing modes
- Document root cause analysis: state cleanup issue between requests
- Add workaround using test_ruler_niah.sh for independent testing
- Update next steps to focus on OffloadEngine reset/cleanup logic
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add test_ruler_niah.sh for independent sample testing across multiple GPUs.
Each sample runs in a separate Python process to avoid state accumulation issues.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update overview to reflect Qwen3/Qwen2/Llama support
- Add docs/multi_model_support.md to documentation index
- Add Llama-3.1-8B-Instruct to model limits
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add model registry system for dynamic model loading
- Implement LlamaForCausalLM with Llama3 RoPE scaling
- Register Qwen3ForCausalLM and Qwen2ForCausalLM
- Update ModelRunner to use get_model_class() for dynamic model selection
Tested: needle 32k test PASSED
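The registry pattern is roughly the following (names illustrative, not the exact nano-vllm code):

```python
# Minimal sketch of a model registry keyed by HF architecture string.
_MODEL_REGISTRY: dict[str, type] = {}

def register_model(architecture: str):
    """Class decorator that records a model class under its architecture name."""
    def wrap(cls):
        _MODEL_REGISTRY[architecture] = cls
        return cls
    return wrap

def get_model_class(architecture: str) -> type:
    try:
        return _MODEL_REGISTRY[architecture]
    except KeyError:
        raise ValueError(f"Unsupported architecture: {architecture}") from None

@register_model("LlamaForCausalLM")
class LlamaForCausalLM:
    ...
```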
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Benchmarks (bench*.py) still require exclusive GPU access for accurate
measurements. Other scripts (tests, examples) now only check for
distributed port 29500 conflicts, allowing parallel GPU sharing.
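The conflict check amounts to probing the default torch.distributed port (a sketch; `port_in_use` is a hypothetical name):

```python
import socket

def port_in_use(port: int = 29500) -> bool:
    """Return True if the default torch.distributed port is already bound locally."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(("127.0.0.1", port)) == 0
```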
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove all chunked-prefill documentation (ring buffer, sgDMA,
Triton merge kernels, known issues) and replace with layer-wise offload
system documentation including:
- Design philosophy and benefits
- Memory layout and per-layer KV size table
- Prefill and decode flow pseudocode
- Critical implementation details (sync offload, causal=False for decode)
- Helper methods in HybridKVCacheManager
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace the pip install -e . --prefix=./.local approach with a simpler PYTHONPATH method:
- No pip install required
- Code changes take effect immediately
- Each worktree is completely isolated
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>