diff --git a/CLAUDE.md b/CLAUDE.md
index b7e3647..9ffc834 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -28,6 +28,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
| [`docs/nsys_wrong_event_order_bug.md`](docs/nsys_wrong_event_order_bug.md) | 🐛 NSYS BUG: debugging notes on out-of-order nsys timestamps triggered by the ring-buffer pipeline |
| [`docs/cpu_scheduling_latency_analysis.md`](docs/cpu_scheduling_latency_analysis.md) | ⚡ PERF: CPU scheduling latency analysis, sources of inter-kernel gaps, directions for improving GPU utilization |
| [`docs/bench_offload_results.md`](docs/bench_offload_results.md) | 📊 BENCH: CPU offload benchmark results, Full vs XAttention comparison (32K/128K) |
+| [`docs/cpu_offload_optimization_strategies.md`](docs/cpu_offload_optimization_strategies.md) | 🚀 OPT: CPU offload optimization strategies: chunk size, CUDA Graph, recent research (InfiniGen/ShadowKV) |

## Rules Index

diff --git a/docs/cpu_offload_optimization_strategies.md b/docs/cpu_offload_optimization_strategies.md
new file mode 100644
index 0000000..686d905
--- /dev/null
+++ b/docs/cpu_offload_optimization_strategies.md
@@ -0,0 +1,300 @@
# CPU Offload Optimization Strategies

This document records an analysis of performance-optimization strategies for the CPU offload scenario, covering both practically viable plans and recent research directions.

## Problem Recap

According to the [CPU scheduling latency analysis](cpu_scheduling_latency_analysis.md), the main issues with the current chunked attention pipeline are:

| Metric | Current | Theoretical |
|--------|---------|-------------|
| Flash kernel execution time | ~138 μs | - |
| Gap between Flash kernels | ~942 μs | ~211 μs (H2D + merge only) |
| GPU utilization | **12.8%** | **39.5%** (upper bound) |
| Idle time due to CPU scheduling | **77-81%** | 0% |

**Root cause**: every block goes through a full Python loop iteration, which adds substantial CPU scheduling latency.

---

## Optimization Plan 1: Increase the Chunk Size (Recommended)

### Core Insight

**Merging several small chunks is equivalent to using one large chunk directly**:

```
Plan A: merge 4 small chunks
[H2D 2K][H2D 2K][H2D 2K][H2D 2K] → concat → [Flash 8K] → merge

Plan B: use one large chunk directly
[H2D 8K] → [Flash 8K] → merge

The computed results are exactly equivalent!
```

### Benefit Analysis

| Metric | Small chunk (2K) × 4 | Large chunk (8K) × 1 |
|--------|----------------------|----------------------|
| H2D transfers | 4 | 1 |
| Flash kernel launches | 4 | 1 |
| Merge calls | 4 | 1 |
| Python loop iterations | 4 | 1 |
| CPU scheduling overhead | 4 × ~300 μs = 1200 μs | 1 × ~300 μs = 300 μs |

**The essence**: the CPU scheduling latency problem is rooted in too many loop iterations; a larger chunk size directly reduces the iteration count.

### Trade-offs

1. **Higher GPU memory use**
   - 2K chunk: ~4 MB per slot (K+V)
   - 8K chunk: ~16 MB per slot (K+V)
   - 4 slots = 64 MB, negligible on an 80 GB A100

2. **Longer individual H2D copies**
   - H2D 8K ≈ 350 μs
   - Flash 8K ≈ 550 μs
   - Since Flash > H2D, the pipeline still overlaps effectively

### How to Configure

```bash
# Test different block sizes
python bench_offload.py --kvcache-block-size 2048  # baseline
python bench_offload.py --kvcache-block-size 4096  # 2x
python bench_offload.py --kvcache-block-size 8192  # 4x
```

---

## Optimization Plan 2: CUDA Graph (for the Non-Attention Parts)

### CUDA Graph Limitations in the Offload Scenario

CUDA Graph's precondition: every operation is fixed at capture time, with fixed data addresses.

**The reality of the offload scenario**:
1. **Dynamic H2D source addresses** - each load comes from a different CPU block
2. **Load decisions are made at runtime** - which blocks need loading is only known dynamically
3. **CPU must stay in the loop** - synchronizing H2D with compute requires CPU participation

```
Offload scenario:
┌───────────────────────────────────────────┐
│ Data lives on the CPU, loaded dynamically │
│ [H2D_i] → [Compute] → [H2D_{i+n}] → ...   │
│     ↑ dynamic; the CPU must schedule it   │
└───────────────────────────────────────────┘

Even with a graph:
Python: [wait_h2d] [replay] [launch_h2d] [wait_h2d] [replay] ...
             ↑ CPU involved    ↑ CPU involved   ↑ CPU involved

The CPU scheduling overhead remains; the graph only optimizes
the compute portion in between.
```

**Conclusion**: CUDA Graph is not a silver bullet for the offload scenario.

### Where It Does Apply: MLP and Projection Layers

The per-layer compute flow of an LLM:

```
┌───────────────────────────────────────────────────────────┐
│ [LayerNorm] → [QKV Proj] → [Attention] → [O Proj] → [Add] │
│                                 ↑                         │
│                            KV offload                     │
│ [LayerNorm] → [MLP: gate + up + down] → [Add]             │
└───────────────────────────────────────────────────────────┘
```

| Component | Involves offload | Works with CUDA Graph |
|-----------|------------------|-----------------------|
| LayerNorm | ❌ | ✅ |
| QKV projection | ❌ | ✅ |
| **Attention** | ✅ | ❌ |
| Output projection | ❌ | ✅ |
| MLP (FFN) | ❌ | ✅ |

**Only attention involves dynamic KV cache loading; everything else is "pure compute" and can run under a CUDA Graph.**

### Implementation Sketch

```python
# Pseudocode: capture() stands in for recording the listed ops into a CUDA
# graph (e.g. torch.cuda.CUDAGraph with static input/output buffers).
class OptimizedLayer:
    def __init__(self, layer):
        # Graph 1: everything before attention
        self.graph_pre_attn = capture([
            layer.input_layernorm,
            layer.self_attn.q_proj,
            layer.self_attn.k_proj,
            layer.self_attn.v_proj,
        ])

        # Graph 2: everything after attention, plus the MLP
        self.graph_post_attn = capture([
            layer.self_attn.o_proj,
            # residual add
            layer.post_attention_layernorm,
            layer.mlp.gate_proj,
            layer.mlp.up_proj,
            layer.mlp.down_proj,
            # residual add
        ])

    def forward(self, hidden_states, kv_cache):
        # Pre-attention projections (CUDA Graph replay)
        self.graph_pre_attn.replay()

        # Attention with offload (dynamic addresses; cannot be captured).
        # q is read from the pre-attention graph's static output buffer.
        attn_output = chunked_attention_with_offload(q, kv_cache)

        # Post-attention projection + MLP (CUDA Graph replay)
        self.graph_post_attn.replay()
```

### Estimated Benefit

Typical per-layer MLP launch overhead:
- `gate_proj`, `up_proj`, `act_fn`, `gate * up`, `down_proj`, residual add
- ~30-50 μs of launch overhead per op, totaling ~200 μs/layer
- With CUDA Graph: ~30 μs/layer

**32 layers × ~170 μs saved ≈ 5.4 ms**

---

## Optimization Plan 3: Recent Research Directions

### 1. InfiniGen - Speculative Prefetch (OSDI'24)

**Core idea**: instead of loading every KV entry, prefetch only the "important" tokens.

```
Key insight: attention patterns of adjacent layers are highly similar
    ↓
Use layer L's attention scores to predict which tokens layer L+1 will need
    ↓
Prefetch only the top-k important KV entries (not all of them)
```

**How it works**:
- "Rehearse" attention using the current layer's Q and part of the next layer's K
- Predict the next layer's attention distribution
- Asynchronously prefetch the predicted important tokens
- **Cuts wasted PCIe bandwidth, rather than making transfers faster**

**Result**: up to **3x speedup**

**Reference**: [InfiniGen (OSDI'24)](https://www.usenix.org/conference/osdi24/presentation/lee)
### 2. ShadowKV - Low-Rank Compression + Sparse Offload (ICML'25 Spotlight)

**Core idea**: keep a compressed copy of the Keys on the GPU, offload the Values to the CPU, and load only 1.56% of the KV.

```
Pre-filling:
┌──────────────────────────────────────────────┐
│ Key cache   → SVD low-rank compression → GPU │
│ Value cache → offloaded to CPU               │
│ Compute a landmark (mean) for each chunk     │
│ Identify outlier tokens → kept on GPU        │
└──────────────────────────────────────────────┘

Decoding:
┌──────────────────────────────────────────────┐
│ Use landmarks to cheaply estimate scores     │
│ Load only the top-k Values (1.56% sparse)    │
│ Combine with on-GPU outliers for the result  │
└──────────────────────────────────────────────┘
```

**Result**: 6x larger batch sizes, **3.04x throughput gain**

**Reference**: [ShadowKV (ByteDance)](https://github.com/ByteDance-Seed/ShadowKV)

### 3. Asynchronous L2 Cache Prefetch (2025)

**Core idea**: use the GPU L2 cache for prefetching; fetch the next batch of KV while the current one is being computed.

```
Traditional:
Compute: [Flash_i]            [Flash_{i+1}]
H2D:               [H2D_{i+1}]
                   ↑ waiting

L2 prefetch:
Compute: [Flash_i + prefetch_{i+1} into L2] [Flash_{i+1}: L2 hit]
          ↑ uses idle memory bandwidth during compute
```

**Technique**:
- Issue prefetch instructions from inside the Flash Attention kernel
- Exploit memory bandwidth left idle during compute
- The next access then hits directly in L2

**Result**: **2.15x attention kernel efficiency**, 1.97x end-to-end throughput

**Reference**: [Asynchronous KV Cache Prefetching (2025)](https://arxiv.org/abs/2504.06319)
### 4. KVPR - I/O-Aware Scheduling (ACL'25)

**Core idea**: compute the optimal split between recomputing KV and offloading it.

```
Trade-off:
- Recompute: regenerate the KV on the GPU (trade compute for memory)
- Offload:   load it from the CPU (trade PCIe bandwidth for compute)

KVPR: dynamically picks the optimal split for the current load,
      and overlaps data transfer with computation via prefetching
```

**Reference**: [KVPR (ACL'25)](https://aclanthology.org/2025.findings-acl.997.pdf)

---

## Summary of Optimization Strategies

### Recommended Priorities

| Priority | Plan | Core optimization | Effort | Expected gain |
|----------|------|-------------------|--------|---------------|
| **P0** | Larger chunk size | Fewer loop iterations | Minimal (config change) | 2-4x |
| **P1** | MLP CUDA Graph | Less launch overhead | Medium | ~5 ms/request |
| **P2** | InfiniGen-style prefetch | Load only important tokens | Medium-high | 2-3x |
| **P3** | ShadowKV-style compression | Key compression + sparsity | High | 3x |
| **P3** | C++ extension | Eliminate Python overhead | High | 2-3x |

### Separation of Concerns

```
┌──────────────────────────────────────────────────────────┐
│ Attention + offload part:                                │
│ - Bottleneck: H2D transfers + CPU scheduling             │
│ - Fixes: larger chunks / speculative prefetch / sparsity │
│                                                          │
│ MLP + projections + norms part:                          │
│ - Bottleneck: kernel launch overhead                     │
│ - Fix: CUDA Graph                                        │
└──────────────────────────────────────────────────────────┘

The two groups of optimizations are fully orthogonal and can be combined.
```

---

## Related Files

- `nanovllm/kvcache/sparse/full_policy.py`: chunked attention pipeline
- `nanovllm/kvcache/offload_engine.py`: H2D/D2H transfer management
- `docs/cpu_scheduling_latency_analysis.md`: analysis of the underlying problem

## References
1. [InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management](https://www.usenix.org/conference/osdi24/presentation/lee) - OSDI'24
2. [ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference](https://github.com/ByteDance-Seed/ShadowKV) - ICML'25 Spotlight
3. [Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching](https://arxiv.org/abs/2504.06319) - 2025
4. [KVPR: Efficient LLM Inference with I/O-Aware KV Cache](https://aclanthology.org/2025.findings-acl.997.pdf) - ACL'25
5. [LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference](https://lmcache.ai/tech_report.pdf) - 2025
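
## Sanity Check: Chunk-Merge Equivalence

The "computed results are exactly equivalent" claim behind Plan 1 can be checked numerically: chunked attention keeps a per-chunk (output, running max, normalizer) triple and combines triples with the standard log-sum-exp merge. The sketch below is a toy single-query, scalar-score model written for this document (the names `chunk_attn` and `merge_partials` are illustrative, not from the codebase); it shows that merging four 2-token chunks produces the same output as one 8-token chunk.

```python
import math

def full_attention(q, ks, vs):
    """Reference: ordinary softmax attention for one scalar query."""
    scores = [q * k for k in ks]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return sum(w * v for w, v in zip(weights, vs)) / z

def chunk_attn(q, ks, vs):
    """Attention over one chunk: returns (output, max_score, normalizer)."""
    scores = [q * k for k in ks]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    out = sum(w * v for w, v in zip(weights, vs)) / z
    return out, m, z

def merge_partials(a, b):
    """Log-sum-exp merge of two chunk results (the Flash-style merge rule)."""
    (oa, ma, za), (ob, mb, zb) = a, b
    m = max(ma, mb)
    za, zb = za * math.exp(ma - m), zb * math.exp(mb - m)
    return (oa * za + ob * zb) / (za + zb), m, za + zb

q, ks, vs = 0.7, [0.1 * i for i in range(8)], [float(i) for i in range(8)]
reference = full_attention(q, ks, vs)

# Plan A: four 2-token chunks, merged one by one
partial = chunk_attn(q, ks[:2], vs[:2])
for i in (2, 4, 6):
    partial = merge_partials(partial, chunk_attn(q, ks[i:i + 2], vs[i:i + 2]))

# Plan B: one 8-token chunk
big = chunk_attn(q, ks, vs)

assert math.isclose(partial[0], reference)
assert math.isclose(big[0], reference)
```

Because the merge is associative in this sense, chunk size is purely a scheduling knob: any chunking of the same KV entries yields the same attention output.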
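
## Sanity Check: Chunk-Size Speedup Model

Plan 1's 2-4x expectation can also be cross-checked with a back-of-envelope model built from the numbers measured above (~300 μs of CPU scheduling per loop iteration, H2D ≈ 350 μs and Flash ≈ 550 μs per 8K chunk). `pipeline_time_us` is a hypothetical helper written for this document, not a benchmark; it assumes H2D fully overlaps compute, so each iteration costs the fixed scheduling overhead plus max(H2D, Flash), with copy and kernel time scaling linearly in chunk size.

```python
def pipeline_time_us(context_tokens, chunk_tokens,
                     h2d_us_per_8k=350.0, flash_us_per_8k=550.0,
                     cpu_overhead_us=300.0):
    """Estimated per-layer attention pipeline time in microseconds.

    Assumes perfect H2D/compute overlap: each of the n loop iterations
    pays the fixed CPU scheduling cost plus max(copy, compute).
    """
    n = context_tokens // chunk_tokens              # Python loop iterations
    h2d = h2d_us_per_8k * chunk_tokens / 8192       # copy time per chunk
    flash = flash_us_per_8k * chunk_tokens / 8192   # kernel time per chunk
    return n * (cpu_overhead_us + max(h2d, flash))

small = pipeline_time_us(32 * 1024, 2048)   # 16 iterations -> 7000.0 us
large = pipeline_time_us(32 * 1024, 8192)   # 4 iterations  -> 3400.0 us
print(small, large, small / large)          # speedup ~2.06x
```

With these inputs the model predicts roughly a 2x win from 2K to 8K chunks, at the low end of the 2-4x range; the gap widens as per-iteration CPU overhead grows relative to kernel time.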