Merge perf_opt-1 and perf_opt-2 branches

Combines two performance optimization features:
- perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache)
- perf_opt-2: Per-layer prefill buffer for async offload

Both features are complementary and improve CPU offload performance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 06:03:44 +08:00
parent ccf27d3a74 0ad86eb449
commit 8fd25d72d7
4 changed files with 175 additions and 68 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -77,6 +77,12 @@ PYTHONPATH=./.local/lib/python3.10/site-packages:$PYTHONPATH python <script.py>

 **Note**: The Python version in the path (python3.10) should match your environment.

+**CRITICAL**: After making code changes to `nanovllm/` source files, you MUST reinstall the package for changes to take effect:
+```bash
+pip install -e . --prefix=./.local --no-deps
+```
+Without reinstallation, Python will use the old cached version and your changes will NOT be reflected!
+
 ## Sparse Attention

 For sparse attention related content (block sparse attention, MInference, FlexPrefill, XAttention, AvgPool, etc.), refer to [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md).