nano-vllm/nanovllm/kvcache/offload_engine.py at fa7601f4b879bb1cd140d3e716b81073d0fe34d3

Zijie Tian fa7601f4b8 ♻️ refactor: remove cross-layer pipeline and rename compute_chunked_prefill

- Remove cross-layer pipeline from OffloadEngine (saves ~1GB GPU memory for long sequences)
  - Delete layer_k/v_buffer_a/b double buffers
  - Remove start_decode_pipeline, get_decode_layer_kv, end_decode_pipeline methods
  - Remove pipeline state tracking variables
- Simplify decode to use ring buffer pipeline only (more efficient for long sequences)
- Rename compute_chunked_attention → compute_chunked_prefill for clarity
- Add mandatory needle test requirements: --enable-offload --input-len 32768

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-20 02:10:40 +08:00

30 KiB
