Zijie Tian 39d12a0416 📈 feat: add MemoryObserver for GPU-CPU communication tracking

Implement MemoryObserver to track memory transfers between GPU and CPU:
- H2D (Host to Device): CPU → GPU transfers
- D2H (Device to Host): GPU → CPU transfers
- D2D (Device to Device): GPU buffer copies
- Supports prefill/decode phase separation

Integration points in offload_engine.py:
- load_to_slot_layer: H2D with is_prefill parameter
- offload_slot_layer_to_cpu, offload_prefill_buffer_async: D2H
- write_to_prefill_buffer, write_to_decode_buffer: D2D
- load_block_sample_from_cpu, load_block_full_from_cpu: H2D

Add bench_offload.py integration for memory stats printing.

Benchmark results (Llama-3.1-8B, 64K context):
- Full Policy: Prefill H2D 262.13 GB
- XAttention: Prefill H2D 386.62 GB (1.48x)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>

2026-01-28 04:06:45 +08:00

Memory Communication Benchmark

GPU-CPU communication volume results, comparing the Full Policy with the XAttention BSA Policy.

Test Environment

Model: Llama-3.1-8B-Instruct
GPU: RTX 3090 (24GB)
Config: num_gpu_blocks=4, block_size=1024, enable_cpu_offload=True
XAttention parameters: threshold=0.95, stride=8

32K Context Results

Metric	Full Policy	XAttention	Ratio (XAttn/Full)
Prefill H2D	66.57 GB	111.12 GB	1.67x
Prefill D2H	4.29 GB	4.29 GB	1.00x
TTFT	8473 ms	10367 ms	1.22x

XAttention Block Selection (32K)

Metric	Value
Available blocks	465
Selected blocks	374
Selection density	80.4%

64K Context Results

Metric	Full Policy	XAttention	Ratio (XAttn/Full)
Prefill H2D	262.13 GB	386.62 GB	1.48x
Prefill D2H	8.46 GB	8.46 GB	1.00x
Decode H2D (32 tokens)	262.13 GB	262.13 GB	1.00x
TTFT	27081 ms	33634 ms	1.24x
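
The Full Policy figures in the 32K and 64K tables can be reproduced with simple block arithmetic. The sketch below is one accounting that matches the reported numbers, not a description of what offload_engine.py actually does. It assumes a 2-byte KV cache dtype (fp16/bf16), 1 GB = 1e9 bytes, whole-block transfers (the input is padded up to a multiple of block_size), a chunked prefill in which each chunk reloads every earlier block, and a decode phase in which the first output token comes from prefill, so 32 output tokens mean 31 decode steps. The layer and head counts are the public Llama-3.1-8B configuration; all function names are hypothetical.

```python
import math

# Llama-3.1-8B attention geometry (public model config).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2   # assumption: fp16/bf16 KV cache
BLOCK_SIZE = 1024    # block_size from the test configuration

# K and V for one token, summed over all layers.
BYTES_PER_TOKEN = 2 * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * NUM_LAYERS
BYTES_PER_BLOCK = BYTES_PER_TOKEN * BLOCK_SIZE


def num_blocks(input_len):
    """Number of KV blocks needed for input_len tokens (last block padded)."""
    return math.ceil(input_len / BLOCK_SIZE)


def prefill_d2h_gb(input_len):
    """Assumption: every block is offloaded to the CPU exactly once."""
    return num_blocks(input_len) * BYTES_PER_BLOCK / 1e9


def full_prefill_h2d_gb(input_len):
    """Assumption: chunk i reloads all i earlier blocks, so 0 + 1 + ... + (n - 1)."""
    n = num_blocks(input_len)
    return n * (n - 1) // 2 * BYTES_PER_BLOCK / 1e9


def full_decode_h2d_gb(input_len, output_len):
    """Assumption: each of the output_len - 1 decode steps reloads every prefill block."""
    return (output_len - 1) * num_blocks(input_len) * BYTES_PER_BLOCK / 1e9


if __name__ == "__main__":
    for input_len in (32000, 64000):
        print(
            input_len,
            round(prefill_d2h_gb(input_len), 2),
            round(full_prefill_h2d_gb(input_len), 2),
        )
    print(round(full_decode_h2d_gb(64000, 32), 2))
```

Under these assumptions, the 64K Decode H2D total equals the Prefill H2D total only by coincidence: 63 blocks reloaded over 31 steps is the same block count as 0 + 1 + ... + 62.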

Communication Volume Ratio Comparison

Context length	XAttn/Full Prefill H2D ratio
32K	1.67x
64K	1.48x

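Analysis

Why XAttention increases communication volume:
- Estimate phase: loads 100% of the historical blocks (to estimate attention scores)
- Compute phase: loads only the selected blocks (about 70-80%)
- Theoretical ratio: 1 + selection_density
Why the ratio is lower at 64K:
- With longer contexts, the attention distribution is sparser
- XAttention's block selection is more effective (a smaller fraction of blocks is selected)
- The forced inclusion of the first and last blocks has a relatively smaller effect
Why decode-phase communication volume is identical:
- XAttention supports only the prefill phase
- The decode phase falls back to the Full Policy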
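
As a quick check of the 1 + selection_density model against the measurements, the sketch below plugs in the block counts and H2D volumes reported in the tables. The function names are hypothetical. The model is only a first-order estimate: with the 32K block counts it predicts a ratio of about 1.80, against a measured ratio of about 1.67. No block counts are reported for 64K, so the code only inverts the model there to get the density the measured ratio would imply.

```python
def selection_density(selected_blocks, available_blocks):
    """Fraction of available blocks that XAttention selected."""
    return selected_blocks / available_blocks


def theoretical_ratio(density):
    """Estimate pass loads every block once; compute pass loads the selected fraction."""
    return 1.0 + density


def implied_density(measured_ratio):
    """Invert the model: density that would produce the measured ratio."""
    return measured_ratio - 1.0


# Values copied from the 32K and 64K result tables.
density_32k = selection_density(374, 465)
measured_32k = 111.12 / 66.57
measured_64k = 386.62 / 262.13

if __name__ == "__main__":
    print(f"32K density:          {density_32k:.3f}")
    print(f"32K theoretical:      {theoretical_ratio(density_32k):.2f}")
    print(f"32K measured:         {measured_32k:.2f}")
    print(f"64K implied density:  {implied_density(measured_64k):.2f}")
```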

Test Commands

# 32K Full Policy
python bench_offload.py --max-len 32768 --input-len 32000

# 32K XAttention
python bench_offload.py --max-len 32768 --input-len 32000 --enable-xattn

# 64K Full Policy
python bench_offload.py --max-len 65536 --input-len 64000

# 64K XAttention
python bench_offload.py --max-len 65536 --input-len 64000 --enable-xattn

# Include the decode benchmark
python bench_offload.py --max-len 65536 --input-len 64000 --bench-decode --output-len 32
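
To run the whole matrix without retyping each line, the commands above can be generated programmatically. This is a minimal sketch: `bench_command` and its arguments are hypothetical helpers, and it uses only the flags that appear in the commands above. It builds the argument lists and prints them; passing a list to `subprocess.run` would execute it.

```python
import shlex

# (max_len, input_len) pairs taken from the commands above.
CONTEXTS = {"32K": (32768, 32000), "64K": (65536, 64000)}


def bench_command(context, xattn=False, decode_tokens=None):
    """Build the argv list for one bench_offload.py run."""
    max_len, input_len = CONTEXTS[context]
    cmd = [
        "python", "bench_offload.py",
        "--max-len", str(max_len),
        "--input-len", str(input_len),
    ]
    if xattn:
        cmd.append("--enable-xattn")
    if decode_tokens is not None:
        cmd += ["--bench-decode", "--output-len", str(decode_tokens)]
    return cmd


if __name__ == "__main__":
    for context in CONTEXTS:
        for xattn in (False, True):
            print(shlex.join(bench_command(context, xattn=xattn)))
    print(shlex.join(bench_command("64K", decode_tokens=32)))
```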
