📝 docs: add CPU scheduling latency analysis for chunked attention

- Document kernel gap analysis showing 77-81% CPU scheduling overhead
- Identify GPU utilization at 12.8% with potential to reach 39.5%
- Outline optimization directions: CUDA Graph, Triton fusion, C++ extension
- Add documentation index entry in CLAUDE.md

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
This commit is contained in:
Zijie Tian
2026-01-27 03:42:12 +08:00
parent 18bc433f09
commit 0619accd1c
2 changed files with 178 additions and 0 deletions

View File

@@ -26,6 +26,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
| [`docs/ruler_32k_chunked_offload_issue.md`](docs/ruler_32k_chunked_offload_issue.md) | ⚠️ OPEN ISSUE: 32K chunked offload accuracy problem (20% error rate in RULER) |
| [`docs/chunked_attention_solutions.md`](docs/chunked_attention_solutions.md) | 🔧 SOLUTIONS: Chunked attention 准确性问题的代码分析和解决方案 |
| [`docs/nsys_wrong_event_order_bug.md`](docs/nsys_wrong_event_order_bug.md) | 🐛 NSYS BUG: Ring buffer pipeline 触发 nsys 时间戳乱序问题的调试记录 |
| [`docs/cpu_scheduling_latency_analysis.md`](docs/cpu_scheduling_latency_analysis.md) | ⚡ PERF: CPU 调度延迟分析kernel 间隙来源GPU 利用率优化方向 |
## Rules Index