nano-vllm

Files

Zijie Tian aea3812230 ♻️ refactor: unify KV cache operations through OffloadEngine

- Add write_to_prefill_buffer() and write_to_decode_buffer() methods
- Add chunk_idx parameter to load_to_slot_layer() for NVTX labeling
- Replace direct copy_() calls with OffloadEngine methods in attention.py
- Update all load_to_slot_layer() calls to pass chunk_idx
- NVTX markers now show chunk info: "H2D: L{layer} Chunk{chunk} CPU[{block}]->Slot[{slot}]"

All KV cache data transfers in chunked offload mode now go through
OffloadEngine, enabling better profiling and consistent management.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>

2026-01-27 02:20:59 +08:00
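
The NVTX label format quoted in the commit message above can be illustrated with a small sketch. This is not the repository's code: `h2d_label`, `labeled_range`, and the `push`/`pop` hooks are hypothetical names introduced here, and only the label string itself is taken from the commit message. In a CUDA build one might pass `torch.cuda.nvtx.range_push` and `torch.cuda.nvtx.range_pop` as the hooks; the sketch defaults to no-ops so it runs without a GPU.

```python
from contextlib import contextmanager


def h2d_label(layer: int, chunk: int, block: int, slot: int) -> str:
    """Build the host-to-device marker text in the format quoted in the commit message."""
    return f"H2D: L{layer} Chunk{chunk} CPU[{block}]->Slot[{slot}]"


@contextmanager
def labeled_range(label: str, push=None, pop=None):
    """Wrap a transfer in a profiling range.

    `push` and `pop` are hypothetical hooks (assumption): callables that open
    and close a profiler range. When omitted, the range is a no-op.
    """
    if push is not None:
        push(label)
    try:
        yield label
    finally:
        if pop is not None:
            pop()


# Record the events in a list instead of calling a real profiler.
events = []
with labeled_range(h2d_label(3, 1, 17, 2), push=events.append, pop=lambda: events.append("pop")):
    pass  # the actual CPU-to-GPU copy would happen here
```

Routing every transfer through one helper like this is what makes consistent labels possible, which matches the commit's stated goal of sending all KV cache transfers through a single engine.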

activation.py

fix

2025-06-15 13:28:29 +08:00

attention.py

♻️ refactor: unify KV cache operations through OffloadEngine

2026-01-27 02:20:59 +08:00

embed_head.py

simplify

2025-08-31 20:02:51 +08:00

layernorm.py

[refactor] Translate into English, avoid Chinese due to Claude.