Zijie Tian
39d12a0416
📈 feat: add MemoryObserver for GPU-CPU communication tracking
Implement MemoryObserver to track memory transfers between GPU and CPU:
- H2D (Host to Device): CPU → GPU transfers
- D2H (Device to Host): GPU → CPU transfers
- D2D (Device to Device): GPU buffer copies
- Supports prefill/decode phase separation
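The tracking described above can be pictured as a byte counter keyed by phase and transfer direction. The following is only a minimal sketch under that assumption: `MemoryObserverSketch`, `Direction`, `record`, and `total` are hypothetical names for illustration, not the actual MemoryObserver API.

```python
from collections import defaultdict
from enum import Enum


class Direction(Enum):
    H2D = "h2d"  # Host to Device: CPU -> GPU
    D2H = "d2h"  # Device to Host: GPU -> CPU
    D2D = "d2d"  # Device to Device: GPU buffer copy


class MemoryObserverSketch:
    """Hypothetical byte counter keyed by (phase, direction)."""

    def __init__(self):
        self._bytes = defaultdict(int)

    @staticmethod
    def _phase(is_prefill):
        return "prefill" if is_prefill else "decode"

    def record(self, direction, num_bytes, is_prefill):
        # Accumulate the transferred bytes under the current phase.
        self._bytes[(self._phase(is_prefill), direction)] += num_bytes

    def total(self, direction, is_prefill):
        # Read without creating an entry for unseen keys.
        return self._bytes.get((self._phase(is_prefill), direction), 0)

    def reset(self):
        self._bytes.clear()


obs = MemoryObserverSketch()
obs.record(Direction.H2D, 4096, is_prefill=True)
obs.record(Direction.H2D, 1024, is_prefill=False)
obs.record(Direction.D2H, 2048, is_prefill=True)
```

Keeping prefill and decode under separate keys is what allows a per-phase figure such as "Prefill H2D" to be reported on its own.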
Integration points in offload_engine.py:
- load_to_slot_layer: H2D with is_prefill parameter
- offload_slot_layer_to_cpu, offload_prefill_buffer_async: D2H
- write_to_prefill_buffer, write_to_decode_buffer: D2D
- load_block_sample_from_cpu, load_block_full_from_cpu: H2D
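One way to picture these integration points is a copy helper that records the transfer size right where the copy happens. The sketch below is an assumption-laden illustration, not the code in offload_engine.py: `record` and `copy_block` are hypothetical helpers, and `bytearray` objects stand in for CPU and GPU tensors so the example runs without a GPU.

```python
from collections import Counter

# Hypothetical tally of transferred bytes keyed by (phase, direction).
transfer_bytes = Counter()


def record(direction, num_bytes, is_prefill):
    phase = "prefill" if is_prefill else "decode"
    transfer_bytes[(phase, direction)] += num_bytes


def copy_block(dst, src, direction, is_prefill):
    """Copy src into dst and record how many bytes moved."""
    dst[:len(src)] = src
    record(direction, len(src), is_prefill)


cpu_block = bytearray(b"\x01" * 256)
gpu_slot = bytearray(256)

# Analogous to an H2D load such as load_to_slot_layer.
copy_block(gpu_slot, cpu_block, "H2D", is_prefill=True)
# Analogous to a D2H offload such as offload_slot_layer_to_cpu.
copy_block(cpu_block, gpu_slot, "D2H", is_prefill=True)
```

With real tensors the byte count would typically come from `tensor.numel() * tensor.element_size()` rather than `len(src)`.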
Add bench_offload.py integration for memory stats printing.
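A stats printout of this kind could look like the sketch below. The `format_size` and `format_report` helpers are hypothetical, the input values are made up for illustration, and the sketch assumes sizes are divided by 1024**3; the actual bench_offload.py output format and unit convention may differ.

```python
def format_size(num_bytes):
    # Assumption: "GB" here means bytes / 1024**3.
    return f"{num_bytes / 1024**3:.2f} GB"


def format_report(stats):
    """Render {(phase, direction): bytes} as one line per entry."""
    lines = []
    for (phase, direction), num_bytes in sorted(stats.items()):
        lines.append(f"{phase.capitalize()} {direction}: {format_size(num_bytes)}")
    return "\n".join(lines)


# Illustrative values only, not measured results.
example_stats = {
    ("prefill", "H2D"): 2 * 1024**3,
    ("decode", "D2H"): 512 * 1024**2,
}
report = format_report(example_stats)
print(report)
```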
Benchmark results (Llama-3.1-8B, 64K context):
- Full Policy: Prefill H2D 262.13 GB
- XAttention: Prefill H2D 386.62 GB (1.47x Full Policy)
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 04:06:45 +08:00