From 6da116de984d6bf600a396aa2fc40eaaf32fbbcd Mon Sep 17 00:00:00 2001
From: Zijie Tian
Date: Tue, 27 Jan 2026 07:21:46 +0800
Subject: [PATCH] =?UTF-8?q?=F0=9F=93=9D=20docs:=20add=20GPU-Only=20XAttent?=
 =?UTF-8?q?ion=20guide=20with=20performance=20analysis?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add comprehensive documentation for GPU-only XAttention BSA mode:
- Architecture design and SparsePolicy interface
- Memory pre-allocation mechanism (alloc_policy_metadata)
- Performance analysis: 32K +15%, 64K +41% vs baseline
- CUDA Graph limitations explanation (variable seq_len in prefill)
- nsys profiling tools usage guide

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude
Co-Authored-By: Happy
---
 CLAUDE.md                    |   1 +
 docs/gpu_only_xattn_guide.md | 296 +++++++++++++++++++++++++++++++++++
 2 files changed, 297 insertions(+)
 create mode 100644 docs/gpu_only_xattn_guide.md

diff --git a/CLAUDE.md b/CLAUDE.md
index b7e3647..0328beb 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -28,6 +28,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
 | [`docs/nsys_wrong_event_order_bug.md`](docs/nsys_wrong_event_order_bug.md) | 🐛 NSYS BUG: debugging notes on out-of-order nsys timestamps triggered by the ring buffer pipeline |
 | [`docs/cpu_scheduling_latency_analysis.md`](docs/cpu_scheduling_latency_analysis.md) | ⚡ PERF: CPU scheduling latency analysis, sources of inter-kernel gaps, directions for improving GPU utilization |
 | [`docs/bench_offload_results.md`](docs/bench_offload_results.md) | 📊 BENCH: CPU offload benchmark results, Full vs XAttention comparison (32K/128K) |
+| [`docs/gpu_only_xattn_guide.md`](docs/gpu_only_xattn_guide.md) | 🚀 GPU-Only XAttention: memory preallocation, performance analysis (32K +15%, 64K +41%), CUDA Graph limitations |
 
 ## Rules Index

diff --git a/docs/gpu_only_xattn_guide.md b/docs/gpu_only_xattn_guide.md
new file mode 100644
index 0000000..18311e9
--- /dev/null
+++ b/docs/gpu_only_xattn_guide.md
@@ -0,0 +1,296 @@
+# GPU-Only XAttention Guide
+
+This document covers the implementation, memory optimization, and performance characteristics of XAttention BSA in GPU-only mode.
+
+## Overview
+
+In GPU-only mode the entire KV cache lives on the GPU and no CPU offload is involved. XAttention accelerates the prefill phase through sparse attention.
+
+### Execution Path Comparison
+
+| Mode | Prefill method | Decode method | KV storage |
+|------|----------------|---------------|------------|
+| GPU-only Full | `compute_prefill()` | `compute_decode()` | GPU |
+| GPU-only XAttn | `compute_prefill()` | `compute_decode()` | GPU |
+| CPU Offload | `compute_chunked_prefill()` | `compute_chunked_decode()` | CPU + GPU |
+
+## Architecture
+
+### SparsePolicy Interface
+
+```python
+class SparsePolicy:
+    # GPU-only methods
+    def compute_prefill(self, q, k, v, ...) -> Tensor
+    def compute_decode(self, q, k_cache, v_cache, ...) -> Tensor
+
+    # CPU offload methods
+    def compute_chunked_prefill(self, q, k, v, ...) -> Tensor
+    def compute_chunked_decode(self, q, ...) -> Tensor
+
+    # Initialization methods
+    def initialize(self, num_layers, ...) -> None            # CPU offload metadata
+    def alloc_policy_metadata(self, num_heads, ...) -> None  # GPU-only buffers
+```
+
+### XAttentionBSAPolicy Implementation
+
+```
+GPU-only prefill flow:
+┌─────────────────────────────────────────────────────────────┐
+│ 1. GQA expansion (uses the preallocated buffer)             │
+│    K: [seq, kv_heads, dim] → K_exp: [1, heads, seq, dim]    │
+│                                                             │
+│ 2. XAttention estimation                                    │
+│    flat_group_gemm_fuse_reshape_kernel (Q@K^T)              │
+│    softmax_fuse_block_sum_kernel (block importance)         │
+│    → sparse mask                                            │
+│                                                             │
+│ 3. BSA sparse attention                                     │
+│    flash_fwd_block_kernel (computes only selected blocks)   │
+│    → output                                                 │
+└─────────────────────────────────────────────────────────────┘
+```
+
+## Memory Preallocation
+
+### Background
+
+XAttention's `compute_prefill()` needs a GQA expansion:
+
+```python
+# Before: dynamic allocation (~2GB for 64K)
+K_exp = K.repeat_interleave(num_groups, dim=1)  # allocation 1
+k_bsa = k.repeat_interleave(num_groups, dim=1)  # allocation 2 (duplicate!)
+```
+
+Allocating dynamically on every prefill causes:
+- memory fragmentation
+- allocation latency
+- potential OOMs
+
+### Solution: alloc_policy_metadata()
+
+Preallocate the buffers when the framework initializes:
+
+```python
+class XAttentionBSAPolicy(SparsePolicy):
+    def alloc_policy_metadata(self, num_heads, num_kv_heads, head_dim,
+                              max_seq_len, dtype, device):
+        # Preallocate the GQA expansion buffers
+        shape = (1, num_heads, max_seq_len, head_dim)
+        self._k_expanded = torch.empty(shape, dtype=dtype, device=device)
+        self._v_expanded = torch.empty(shape, dtype=dtype, device=device)
+
+    def compute_prefill(self, q, k, v, ...):
+        seq_len = k.shape[0]
+        # Slice into the preallocated buffer
+        K_exp = self._k_expanded[:, :, :seq_len, :]
+        # In-place GQA expansion
+        K_exp.view(...).copy_(K.unsqueeze(2).expand(...))
+        # Reuse the same buffer for BSA
+        k_bsa = K_exp.squeeze(0).transpose(0, 1)
+```
+
+### Memory Usage
+
+| Sequence length | Preallocated size | Calculation |
+|-----------------|-------------------|-------------|
+| 32K | 512 MB | `2 * 32 * 32768 * 128 * 2 bytes` |
+| 64K | 1024 MB | `2 * 32 * 65536 * 128 * 2 bytes` |
+
+Effect of the optimization:
+- Before: ~2GB allocated dynamically (once in xattn_estimate, once more in BSA)
+- After: ~1GB preallocated (one buffer, reused)
+
+### Framework Integration
+
+```python
+# model_runner.py - allocate_kv_cache()
+def allocate_kv_cache(self):
+    # ... KV cache allocation ...
+
+    # GPU-only mode: preallocate the policy buffers
+    if not config.enable_cpu_offload:
+        self.kvcache_manager.sparse_policy.alloc_policy_metadata(
+            num_heads=num_heads,
+            num_kv_heads=num_kv_heads,
+            head_dim=head_dim,
+            max_seq_len=config.max_model_len,
+            dtype=dtype,
+            device=torch.device("cuda"),
+        )
+```
+
+## Performance Analysis
+
+### 32K Prefill Performance
+
+| Policy | Throughput | Relative gain |
+|--------|------------|---------------|
+| Baseline | 4880 tok/s | - |
+| Full | 4892 tok/s | +0.2% |
+| **XAttention** | **5602 tok/s** | **+15%** |
+
+### 64K Prefill Performance
+
+| Policy | Throughput | Relative gain |
+|--------|------------|---------------|
+| Baseline | 3386 tok/s | - |
+| Full | 3355 tok/s | -0.9% |
+| **XAttention** | **4775 tok/s** | **+41%** |
+
+### Kernel Time Breakdown (32K)
+
+**XAttention:**
+```
+FFN GEMM:          3219 ms  (54%)
+BSA Attention:     1231 ms  (21%)
+XAttn Estimation:   415 ms   (7%)
+Other:             1020 ms  (18%)
+─────────────────────────────────
+Total:             5885 ms
+```
+
+**Full:**
+```
+FFN GEMM:          3244 ms  (48%)
+Dense Attention:   2861 ms  (43%)
+Other:              595 ms   (9%)
+─────────────────────────────────
+Total:             6700 ms
+```
+
+### Where the Speedup Comes From
+
+```
+Dense Attention:   2861 ms
+BSA Attention:     1231 ms  (saves 1630 ms, -57%)
+XAttn Estimation:   415 ms  (added overhead)
+─────────────────────────────────
+Net saving:        1215 ms  (42% of attention time)
+```
+
+## CUDA Graph Limitations
+
+### Why Prefill Cannot Use CUDA Graphs
+
+CUDA Graphs require every operation to be fixed at capture time:
+
+| Must be fixed | Situation in prefill |
+|---------------|----------------------|
+| Tensor shapes | seq_len varies (1 ~ max_model_len) |
+| Kernel grids | depend on seq_len |
+| Memory addresses | intermediate tensor sizes change |
+
+```python
+# seq_len differs from request to request
+request_1: prefill(seq_len=1024)   # grid=(8, 32, 1)
+request_2: prefill(seq_len=32768)  # grid=(256, 32, 1)
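+# (Assumed for illustration: a query-tile size BLOCK_M = 128, so that
+# grid dim 0 = ceil(seq_len / 128), i.e. 1024/128 = 8 vs 32768/128 = 256.
+# A captured graph replays the exact grid recorded at capture time, so a
+# graph captured for request_1 cannot serve request_2.)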
+```
+
+### Decode Can Use CUDA Graphs
+
+```python
+# Decode processes exactly 1 token per step
+q: [batch_size, 1, heads, dim]  # fixed shape
+```
+
+nanovllm captures one graph per batch size ahead of time:
+
+```python
+def capture_cudagraph(self):
+    for batch_size in [1, 2, 4, 8, ...]:
+        g = torch.cuda.CUDAGraph()
+        with torch.cuda.graph(g):
+            self.run_model(dummy_input, is_prefill=False)
+        self.graphs[batch_size] = g
+```
+
+### Nsys Profile Results
+
+```
+XAttention 32K Prefill:
+  Total kernels: 41,904
+  Non-graph:     41,904 (100%)
+  Graph:         0
+
+Full 32K Prefill:
+  Total kernels: 35,308
+  Non-graph:     35,308 (100%)
+  Graph:         0
+```
+
+**Both runs are 100% NON-GRAPH**; this is an inherent property of prefill.
+
+## Profiling Tools
+
+### Using profile.sh
+
+```bash
+# XAttention 32K
+bash scripts/profile.sh --max-len 32768 --policy xattn
+
+# Full 32K
+bash scripts/profile.sh --max-len 32768 --policy full
+
+# 64K (requires a lower gpu-util)
+bash scripts/profile.sh --max-len 65536 --policy xattn --gpu-util 0.7
+```
+
+### Analyzing nsys Results
+
+```bash
+# Kernel statistics
+nsys stats --report cuda_gpu_kern_sum results/nsys/.nsys-rep
+
+# Query the details via sqlite
+sqlite3 results/nsys/.sqlite "
+SELECT
+    (SELECT value FROM StringIds WHERE id = shortName) as kernel,
+    COUNT(*) as count,
+    SUM(end-start)/1e6 as total_ms
+FROM CUPTI_ACTIVITY_KIND_KERNEL
+GROUP BY shortName
+ORDER BY total_ms DESC
+LIMIT 10
+"
+```
+
+## Usage Guide
+
+### Enabling GPU-only XAttention
+
+```python
+from nanovllm import LLM
+from nanovllm.config import SparsePolicyType
+
+llm = LLM(
+    model_path,
+    max_model_len=32768,
+    sparse_policy=SparsePolicyType.XATTN_BSA,
+    gpu_memory_utilization=0.9,  # may need to be lowered for 64K
+)
+```
+
+### Command-Line Testing
+
+```bash
+# bench.py
+python bench.py --max-len 32768 --policy xattn
+
+# 64K needs a lower gpu-util
+python bench.py --max-len 65536 --policy xattn --gpu-util 0.7
+```
+
+### Best Practices
+
+1. **32K and below**: use the default `gpu_memory_utilization=0.9`
+2. **64K**: lower it to `gpu_memory_utilization=0.7`
+3. **Decode**: XAttention automatically falls back to FullAttentionPolicy
+4. **Paged KV cache**: automatically falls back to flash_attn when `block_tables` is present
+
+## Related Documents
+
+- [Sparse Policy Architecture](sparse_policy_architecture.md)
+- [XAttention Algorithm Guide](xattention_algorithm_guide.md)
+- [BSA Interface Documentation](block_sparse_attn_interface.md)
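The preallocation sizes in the memory table of the guide can be re-derived from the buffer shape `(1, num_heads, max_seq_len, head_dim)` for the K and V expansion buffers. A minimal sketch, assuming the same configuration as the table (32 query heads, head_dim 128, 2 bytes per fp16/bf16 element; `policy_buffer_bytes` is a hypothetical helper, not part of the codebase):

```python
def policy_buffer_bytes(num_heads: int, max_seq_len: int, head_dim: int,
                        bytes_per_elem: int = 2, num_buffers: int = 2) -> int:
    """Total size of the preallocated GQA-expansion buffers (K and V => 2 buffers)."""
    return num_buffers * num_heads * max_seq_len * head_dim * bytes_per_elem

print(policy_buffer_bytes(32, 32 * 1024, 128) // 2**20)  # 512  (MiB, 32K context)
print(policy_buffer_bytes(32, 64 * 1024, 128) // 2**20)  # 1024 (MiB, 64K context)
```

The sizes scale linearly with `max_seq_len`, which is why the guide suggests lowering `gpu_memory_utilization` at 64K.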