📝 docs: add CPU offload optimization strategies guide
- Document chunk size optimization (simplest, most effective) - Analyze CUDA Graph limitations for offload scenarios - Cover CUDA Graph applicability for MLP/Proj layers - Survey frontier research: InfiniGen, ShadowKV, L2 Prefetch, KVPR - Add optimization priority recommendations Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>
This commit is contained in:
@@ -28,6 +28,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
|
||||
| [`docs/nsys_wrong_event_order_bug.md`](docs/nsys_wrong_event_order_bug.md) | 🐛 NSYS BUG: Ring buffer pipeline 触发 nsys 时间戳乱序问题的调试记录 |
|
||||
| [`docs/cpu_scheduling_latency_analysis.md`](docs/cpu_scheduling_latency_analysis.md) | ⚡ PERF: CPU 调度延迟分析,kernel 间隙来源,GPU 利用率优化方向 |
|
||||
| [`docs/bench_offload_results.md`](docs/bench_offload_results.md) | 📊 BENCH: CPU offload 性能测试结果,Full vs XAttention 对比 (32K/128K) |
|
||||
| [`docs/cpu_offload_optimization_strategies.md`](docs/cpu_offload_optimization_strategies.md) | 🚀 OPT: CPU offload 优化策略:chunk size、CUDA Graph、前沿研究(InfiniGen/ShadowKV) |
|
||||
|
||||
## Rules Index
|
||||
|
||||
|
||||
Reference in New Issue
Block a user