Add comprehensive memory analysis for 64k inference on Llama 3.1 8B:
New documentation:
- docs/64k_memory_analysis.md: GPU-only vs offload memory analysis,
OOM root cause (memory fragmentation), RTX 3090 limitations,
theoretical vs actual memory usage breakdown
Test configuration updates:
- tests/test_ruler.py: Add --num-kv-buffers parameter for ring buffer
size tuning (default 4; can be reduced to 1 to lower memory use)
- Update default data_dir to ruler_64k
- Update default max_model_len to 65664 for 64k support
CLAUDE.md updates:
- Add 64k_memory_analysis.md to documentation index
- Document num_kv_buffers parameter in Configuration section
- Add 64k hardware requirements note to Model Limits
Key findings: 64k inference requires ~26GB (GPU-only) or ~23GB (offload).
Combined with memory fragmentation, this does not fit reliably on 24GB
GPUs, so A100 (40GB+) is the recommended hardware for 64k workloads.
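As a rough cross-check of the KV cache share of these figures, the sketch below computes the theoretical KV cache size from the publicly documented Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) with 16-bit storage. It covers the KV cache only; weights, activations, and allocator overhead come on top, so it does not reproduce the ~26GB and ~23GB totals above. The function names are illustrative and not taken from the repository.

```python
# Llama 3.1 8B attention shape (public model config); 16-bit KV storage assumed.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # fp16 / bf16


def kv_bytes_per_token():
    """Bytes of K and V stored per token across all layers."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def kv_cache_gib(num_tokens):
    """Theoretical KV cache size in GiB for a given context length."""
    return kv_bytes_per_token() * num_tokens / 2**30


def per_layer_kv_mib(num_tokens):
    """Theoretical K+V size of a single layer in MiB."""
    return 2 * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * num_tokens / 2**20


if __name__ == "__main__":
    print(kv_bytes_per_token())      # bytes per token
    print(kv_cache_gib(65536))       # full KV cache at 64k tokens
    print(per_layer_kv_mib(65536))   # one layer at 64k tokens
```

Under these assumptions the full 64k KV cache is 8 GiB and a single layer's share is 256 MiB.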
Co-Authored-By: Claude <noreply@anthropic.com>
- Auto port allocation with _find_free_port() in model_runner.py
- Resource management refactor with close() + context manager in llm_engine.py
- Add tests/test_port_conflict.py and tests/run_parallel_niah.sh
- Remove docs/torch_distributed_port_issue.md (issue fixed)
- Ignore tests/data/ directory
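A minimal sketch of the two ideas above, assuming the common approach of binding to port 0 so the OS picks a free port, and a close() method wired into a context manager. The Engine class and its attributes are hypothetical stand-ins and not the actual code in model_runner.py or llm_engine.py.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


class Engine:
    """Hypothetical engine that owns a port and releases resources on close()."""

    def __init__(self):
        self.port = find_free_port()
        self.closed = False

    def close(self):
        # Idempotent: calling close() twice is harmless.
        if self.closed:
            return
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions


if __name__ == "__main__":
    with Engine() as engine:
        print("using port", engine.port)
```

Note that the port is only known to be free at the moment of the probe; another process could still claim it before it is used.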
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update overview to reflect Qwen3/Qwen2/Llama support
- Add docs/multi_model_support.md to documentation index
- Add Llama-3.1-8B-Instruct to model limits
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Benchmarks (bench*.py) still require exclusive GPU access for accurate
measurements. Other scripts (tests, examples) now only check for
distributed port 29500 conflicts, allowing parallel GPU sharing.
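One way such a port conflict check could look, as a sketch: try to bind the port and treat a failure as "in use". The function name and the choice of a bind probe are assumptions for illustration, not the check the scripts actually perform.

```python
import socket

DIST_PORT = 29500  # port named in the commit message


def port_in_use(port=DIST_PORT, host="127.0.0.1"):
    """Return True if binding (host, port) fails, i.e. something holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return True
    return False


if __name__ == "__main__":
    if port_in_use():
        print(f"port {DIST_PORT} is busy; another run may be active")
    else:
        print(f"port {DIST_PORT} is free")
```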
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove all chunked-prefill documentation (ring buffer, sgDMA,
Triton merge kernels, known issues) and replace it with layer-wise
offload system documentation, including:
- Design philosophy and benefits
- Memory layout and per-layer KV size table
- Prefill and decode flow pseudocode
- Critical implementation details (sync offload, causal=False for decode)
- Helper methods in HybridKVCacheManager
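One plausible reading of the causal=False detail, shown as a toy sketch in pure Python: during decode there is a single new query, and it may attend to every cached key, so unmasked attention over the full cache matches the last row of full-sequence causal attention. The helper names and toy values are illustrative; this is not the repository's attention kernel.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def attend(q, keys, values):
    """Single-query attention over all given keys/values, with no mask."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def causal_attention(queries, keys, values):
    """Full-sequence causal attention: query i sees keys 0..i only."""
    return [attend(q, keys[: i + 1], values[: i + 1])
            for i, q in enumerate(queries)]


# Toy 3-token sequence with 2-dimensional heads.
qs = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4]]
ks = [[0.2, 0.1], [-0.3, 0.6], [0.7, 0.0]]
vs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

prefill_out = causal_attention(qs, ks, vs)
# Decode step: only the newest query, attending to the whole KV cache unmasked.
decode_out = attend(qs[-1], ks, vs)
```

A causal mask aligned to the top-left corner would instead restrict a length-1 query to the first cached key, which is why the flag matters for kernels that use that alignment.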
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace pip install -e . --prefix=./.local approach with simpler PYTHONPATH method:
- No pip install required
- Code changes take effect immediately
- Each worktree is completely isolated
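A small sketch of why the PYTHONPATH method gives per-worktree isolation: the interpreter resolves imports from whichever directory is placed first on PYTHONPATH, with no install step. The helper name run_in_worktree is hypothetical and only illustrates the mechanism.

```python
import os
import subprocess
import sys


def run_in_worktree(worktree, code):
    """Run `code` in a fresh interpreter with `worktree` first on PYTHONPATH."""
    env = dict(os.environ)
    parts = [worktree, env.get("PYTHONPATH", "")]
    env["PYTHONPATH"] = os.pathsep.join(p for p in parts if p)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Prints the first entries of sys.path as seen by the child interpreter.
    print(run_in_worktree(os.getcwd(), "import sys; print(sys.path[:3])"))
```

Because nothing is installed, edits to source files in the worktree are picked up on the next run.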
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Combines two performance optimization features:
- perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache)
- perf_opt-2: Per-layer prefill buffer for async offload
Both features are complementary and improve CPU offload performance.
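The double-buffering idea behind the cross-layer decode pipeline can be sketched as follows: while layer i is being computed from one buffer slot, the KV for layer i+1 is loaded into the other. The functions load_layer_kv and compute_layer are hypothetical placeholders for the real host-to-device copy and attention call, and a thread pool stands in for CUDA streams.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer_kv(layer_id):
    # Hypothetical stand-in for copying one layer's KV cache from CPU to GPU.
    return f"kv[{layer_id}]"


def compute_layer(layer_id, kv):
    # Hypothetical stand-in for running one layer's attention on its KV.
    return (layer_id, kv)


def decode_step(num_layers):
    """Run all layers, prefetching the next layer's KV during compute."""
    buffers = [None, None]  # two slots, used alternately
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer_kv, 0)
        for layer_id in range(num_layers):
            slot = layer_id % 2
            buffers[slot] = pending.result()  # wait for this layer's KV
            if layer_id + 1 < num_layers:
                # Start loading the next layer into the other slot.
                pending = pool.submit(load_layer_kv, layer_id + 1)
            outputs.append(compute_layer(layer_id, buffers[slot]))
    return outputs


if __name__ == "__main__":
    print(decode_step(4))
```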
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add instructions for Claude instances to check GPU availability before
running CUDA operations, preventing conflicts when multiple instances
debug in parallel on a single GPU.
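A hedged sketch of what such an availability check might look like, assuming nvidia-smi is on PATH and using its --query-compute-apps CSV output to list processes holding the GPU. The function names and the "idle means no compute processes" rule are assumptions, not the instructions actually added.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(text):
    """Parse 'pid, used_memory' CSV rows into (pid, MiB or None) tuples."""
    apps = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 2 and parts[0].isdigit():
            mem = int(parts[1]) if parts[1].isdigit() else None
            apps.append((int(parts[0]), mem))
    return apps


def gpu_is_idle():
    """Return True/False, or None when nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if out.returncode != 0:
        return None
    return not parse_compute_apps(out.stdout)


if __name__ == "__main__":
    print("GPU idle:", gpu_is_idle())
```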
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>