Benchmarks (bench*.py) still require exclusive GPU access for accurate
measurements. Other scripts (tests, examples) now only check for conflicts
on the distributed port 29500, so they can share a GPU and run in parallel.
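The port-conflict check can be sketched as a simple TCP bind probe (the helper name is illustrative; 29500 is the default torch.distributed master port):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return False  # bind succeeded, port is free
        except OSError:
            return True   # EADDRINUSE (or similar): port is taken

if __name__ == "__main__":
    print(port_in_use(29500))
```

A script that passes this check can start its own distributed run without clobbering a neighbor's rendezvous port, while still sharing the GPU itself.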
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove all chunked-prefill-related documentation (ring buffer, sgDMA,
Triton merge kernels, known issues) and replace it with documentation for
the layer-wise offload system, including:
- Design philosophy and benefits
- Memory layout and per-layer KV size table
- Prefill and decode flow pseudocode
- Critical implementation details (sync offload, causal=False for decode)
- Helper methods in HybridKVCacheManager
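The per-layer KV size table reduces to a small formula; a hypothetical helper (the name and default dtype width are assumptions, not part of HybridKVCacheManager's documented API):

```python
def per_layer_kv_bytes(seq_len: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one transformer layer holds for one sequence.

    Factor of 2 accounts for the separate K and V tensors;
    dtype_bytes=2 assumes fp16/bf16 storage.
    """
    return 2 * seq_len * num_kv_heads * head_dim * dtype_bytes

# Example: 4096 tokens, 8 KV heads, head_dim 128, fp16
# -> 16 MiB per layer, so offloading is decided layer by layer.
print(per_layer_kv_bytes(4096, 8, 128))  # 16777216
```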
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Replace the `pip install -e . --prefix=./.local` approach with a simpler PYTHONPATH method:
- No pip install required
- Code changes take effect immediately
- Each worktree is completely isolated
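A minimal demonstration of the mechanism: a package directory placed on PYTHONPATH shadows any installed copy, so edits in a worktree take effect immediately. The paths and the module name `wtmod` below are illustrative, not the project's real layout:

```shell
# Create a stand-in for a worktree's top-level module.
mkdir -p /tmp/worktree_demo
echo 'X = "from-worktree"' > /tmp/worktree_demo/wtmod.py

# Prepend the worktree to PYTHONPATH instead of pip-installing it.
export PYTHONPATH="/tmp/worktree_demo:${PYTHONPATH:-}"

# Python now resolves the module from the worktree directory.
python3 -c 'import wtmod; print(wtmod.X)'   # prints: from-worktree
```

Because each worktree exports only its own path, two worktrees never share state the way a single `pip install -e` site-packages link would.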
Combines two performance optimization features:
- perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache)
- perf_opt-2: Per-layer prefill buffer for async offload
Both features are complementary and improve CPU offload performance.
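The double-buffered decode pipeline (perf_opt-1) follows a standard ping-pong pattern: while attention runs on layer i's KV cache in one resident buffer, layer i+1's KV is fetched into the other. A minimal sketch, with illustrative names and the CUDA-stream overlap reduced to a plain function call:

```python
def run_decode_step(num_layers, fetch_layer, attend):
    """Ping-pong over two resident buffers across layers.

    fetch_layer(i) stands in for the CPU->GPU copy of layer i's KV;
    attend(i, kv) stands in for attention compute on that layer.
    In the real system the fetch runs on a side stream, overlapping
    with compute on the current layer.
    """
    buffers = [None, None]          # the two "GPU-resident" slots
    buffers[0] = fetch_layer(0)     # prime the pipeline
    outputs = []
    for i in range(num_layers):
        if i + 1 < num_layers:
            # Prefetch the next layer into the buffer not in use.
            buffers[(i + 1) % 2] = fetch_layer(i + 1)
        outputs.append(attend(i, buffers[i % 2]))
    return outputs
```

The per-layer prefill buffer (perf_opt-2) is the mirror image on the write side: each layer's freshly computed KV lands in a staging buffer and is offloaded asynchronously while the next layer computes, which is why the two features compose.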
Add instructions for Claude instances to check GPU availability before
running CUDA operations, preventing conflicts when multiple instances
debug in parallel on a single GPU.
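One way such an availability check could look, querying free memory via nvidia-smi before launching CUDA work (the helper name is illustrative; the query flags are standard nvidia-smi options):

```python
import shutil
import subprocess

def gpu_memory_free_mib(gpu_id: int = 0):
    """Free memory on the given GPU in MiB, or None if no GPU/driver."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver on this machine
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits", f"--id={gpu_id}"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None
    return int(result.stdout.strip().splitlines()[0])

if __name__ == "__main__":
    free = gpu_memory_free_mib()
    # An instance would only proceed if enough memory is free;
    # the 4 GiB threshold here is an arbitrary illustrative value.
    print("ok to run" if free is not None and free > 4096 else "skip/wait")
```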