Zijie Tian
39d12a0416
📈 feat: add MemoryObserver for GPU-CPU communication tracking
...
Implement MemoryObserver to track memory transfers between GPU and CPU:
- H2D (Host to Device): CPU → GPU transfers
- D2H (Device to Host): GPU → CPU transfers
- D2D (Device to Device): GPU buffer copies
- Supports prefill/decode phase separation
Integration points in offload_engine.py:
- load_to_slot_layer: H2D with is_prefill parameter
- offload_slot_layer_to_cpu, offload_prefill_buffer_async: D2H
- write_to_prefill_buffer, write_to_decode_buffer: D2D
- load_block_sample_from_cpu, load_block_full_from_cpu: H2D
Add bench_offload.py integration for memory stats printing.
Benchmark results (Llama-3.1-8B, 64K context):
- Full Policy: Prefill H2D 262.13 GB
- XAttention: Prefill H2D 386.62 GB (1.47x)
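The transfer accounting described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the class name `MemoryObserverSketch` and its `record`, `total`, and `reset` methods are hypothetical, and the only assumption taken from the message is that each transfer is tagged with a direction (H2D, D2H, D2D) and a prefill/decode phase.

```python
from collections import defaultdict


class MemoryObserverSketch:
    """Hypothetical byte counter keyed by (phase, direction).

    Not the real MemoryObserver; names and signatures are assumptions.
    """

    DIRECTIONS = ("h2d", "d2h", "d2d")

    def __init__(self):
        self.enabled = True
        self._bytes = defaultdict(int)

    def record(self, direction, num_bytes, is_prefill):
        """Add num_bytes to the counter for the given direction and phase."""
        if not self.enabled:
            return
        if direction not in self.DIRECTIONS:
            raise ValueError(f"unknown direction: {direction}")
        phase = "prefill" if is_prefill else "decode"
        self._bytes[(phase, direction)] += num_bytes

    def total(self, phase, direction):
        """Return accumulated bytes, or 0 if nothing was recorded."""
        return self._bytes.get((phase, direction), 0)

    def reset(self):
        """Drop all counters."""
        self._bytes.clear()


obs = MemoryObserverSketch()
obs.record("h2d", 4096, is_prefill=True)   # e.g. a CPU -> GPU block load
obs.record("h2d", 1024, is_prefill=True)
obs.record("d2h", 2048, is_prefill=False)  # e.g. a GPU -> CPU offload
print(obs.total("prefill", "h2d"))
```

In a layout like this, each integration point would call `record` once per copy with the tensor's size in bytes, so the per-phase totals can be printed at the end of a benchmark run.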
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 04:06:45 +08:00
Zijie Tian
c16bfcf40f
♻️ refactor: restructure Observer as base class with InferenceObserver
...
- Refactor Observer into base class with common enable/disable/reset interface
- Create InferenceObserver subclass for TTFT/TPOT metrics
- Fix TTFT calculation timing: compute after prefill completes instead of
at decode start (fixes max_tokens=1 returning TTFT=0)
- Integrate InferenceObserver into bench.py and bench_offload.py for
accurate internal timing metrics vs external wall-clock time
- Add get_summary() and print_summary() methods for structured output
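The base-class split and the TTFT timing fix can be sketched as below. This is an illustration under stated assumptions, not the code from this commit: the `ObserverSketch` and `InferenceObserverSketch` classes and the `on_*` hooks are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and TPOT is assumed to mean decode time divided by the number of decoded tokens. Only the `enable`/`disable`/`reset` interface and the `get_summary()` name come from the message.

```python
class ObserverSketch:
    """Hypothetical base class with the common enable/disable/reset interface."""

    def __init__(self):
        self.enabled = False

    def enable(self):
        self.enabled = True

    def disable(self):
        self.enabled = False

    def reset(self):
        raise NotImplementedError


class InferenceObserverSketch(ObserverSketch):
    """Hypothetical TTFT/TPOT tracker; timestamps are in seconds."""

    def __init__(self):
        super().__init__()
        self.reset()

    def reset(self):
        self._start = None
        self._prefill_end = None
        self.ttft = 0.0
        self.tpot = 0.0

    def on_request_start(self, now):
        if self.enabled:
            self._start = now

    def on_prefill_end(self, now):
        # TTFT is fixed as soon as prefill completes, so a request that
        # never enters decode (max_tokens=1) still reports a nonzero value.
        if self.enabled and self._start is not None:
            self._prefill_end = now
            self.ttft = now - self._start

    def on_decode_end(self, now, num_decode_tokens):
        if self.enabled and self._prefill_end is not None and num_decode_tokens > 0:
            self.tpot = (now - self._prefill_end) / num_decode_tokens

    def get_summary(self):
        return {"ttft": self.ttft, "tpot": self.tpot}


obs = InferenceObserverSketch()
obs.enable()
obs.on_request_start(10.0)
obs.on_prefill_end(12.5)
summary_after_prefill = obs.get_summary()  # TTFT already available here
obs.on_decode_end(14.5, num_decode_tokens=4)
summary_after_decode = obs.get_summary()
```

Computing TTFT in the prefill hook rather than at the first decode step is what keeps the metric meaningful when no decode step ever runs.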
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 03:15:33 +08:00
Zijie Tian
b8b6478506
[feat] Needs to be optimized with async prefetch.
2025-12-15 06:58:40 +08:00
Zijie Tian
1081ab51ea
[refactor] Refactor offload code to multi-chunk.
2025-12-15 01:13:58 +08:00
Zijie Tian
9b8165af5a
[fix] Fixed kvcache offload problem.
2025-12-12 01:35:30 +08:00
Zijie Tian
babfa17354
[refactor] Translate into English; avoid Chinese due to Claude.
2025-12-11 00:30:24 +08:00
Zijie Tian
e85c2b4776
[fix] Fixed kvcache offload bugs.
2025-12-10 22:34:00 +08:00
Zijie Tian
0a247ccb1b
[feat] Added num_gpu_blocks to limit GPU blocks.
2025-12-10 20:17:42 +08:00
Zijie Tian
01f19ee4a6
[feat] Added logger to nanovllm.
2025-12-10 19:53:38 +08:00
Zijie Tian
0b6f19242d
[feat] Added chunked prefill and kvcache offload mechanism.
2025-12-10 03:47:37 +08:00
Zijie Tian
204fe2b38f
[feat] Added metrics to tqdm bar.
2025-12-10 00:52:13 +08:00
GeeeekExplorer
cde3fc22c2
simplify
2025-06-21 17:19:15 +08:00
GeeeekExplorer
bc0ad5a116
better
2025-06-17 23:33:38 +08:00
GeeeekExplorer
fc778a4da9
better
2025-06-15 10:36:45 +08:00
GeeeekExplorer
98a1551a7d
support CUDA_VISIBLE_DEVICES
2025-06-12 23:14:01 +08:00
GeeeekExplorer
fee58d44e4
fix
2025-06-12 01:00:31 +08:00
GeeeekExplorer
08c84ec08d
multi file loader
2025-06-12 01:00:09 +08:00
GeeeekExplorer
a5a4909e6a
init commit
2025-06-10 00:27:01 +08:00