📝 docs: add XAttention performance analysis documentation
Add comprehensive performance analysis for XAttention:

- NVTX marker locations and usage
- Block size impact on offload mode (4096 vs 1024)
- Detailed timing breakdown for estimate vs compute phases
- softmax_fuse_block_sum_kernel analysis
- Optimization recommendations

Key findings:

- block_size=4096 is 2x faster than 1024 for 64K context
- find_blocks_chunked is the bottleneck (40%) at block_size=4096
- estimate_gemm becomes the bottleneck (24%) at block_size=1024

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
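The NVTX bullet above refers to markers placed around XAttention's estimate and compute phases. As a hedged sketch (not this repo's actual code), the snippet below shows the usual `torch.cuda.nvtx.range_push`/`range_pop` pattern that makes the two phases appear as named ranges in Nsight Systems; `estimate_gemm`, `find_blocks_chunked`, and the sparse kernel are echoed only in comments, since their real signatures are not part of this commit.

```python
import torch

def profiled_xattn_step(q, k, v):
    # Estimate phase: a cheap pass that scores KV blocks for importance.
    # In the real code this covers estimate_gemm + find_blocks_chunked.
    torch.cuda.nvtx.range_push("xattn_estimate")
    scores = q @ k.transpose(-1, -2)                          # stand-in for estimate_gemm
    block_mask = scores.softmax(dim=-1).amax(dim=-2) > 0.01   # stand-in for find_blocks_chunked
    torch.cuda.nvtx.range_pop()

    # Compute phase: attention restricted to the selected blocks.
    torch.cuda.nvtx.range_push("xattn_compute")
    out = scores.softmax(dim=-1) @ v                          # stand-in for the sparse attention kernel
    torch.cuda.nvtx.range_pop()
    return out, block_mask
```

On the block-size finding: at a 64K context, block_size=4096 gives 65536 / 4096 = 16 blocks per sequence dimension versus 64 at block_size=1024, so find_blocks_chunked has four times as many blocks to rank, which is consistent with the reported 2x end-to-end gap; the exact breakdown lives in the linked analysis doc, not in this commit message.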
@@ -30,6 +30,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
 | [`docs/bench_offload_results.md`](docs/bench_offload_results.md) | 📊 BENCH: CPU offload performance test results, Full vs XAttention comparison (32K/128K) |
 | [`docs/cpu_offload_optimization_strategies.md`](docs/cpu_offload_optimization_strategies.md) | 🚀 OPT: CPU offload optimization strategies: chunk size, CUDA Graph, frontier research (InfiniGen/ShadowKV) |
 | [`docs/gpu_only_xattn_guide.md`](docs/gpu_only_xattn_guide.md) | 🚀 GPU-Only XAttention: memory preallocation, performance analysis (32K +15%, 64K +41%), CUDA Graph limitations |
+| [`docs/xattn_performance_analysis.md`](docs/xattn_performance_analysis.md) | 📊 XAttention performance analysis: NVTX markers, block size impact, estimate vs compute timing comparison |

 ## Rules Index