nano-vllm/nanovllm/kvcache/sparse/full_policy.py at e5a17c832c80cfc5051e2b477e9f60017e918eb9

Zijie Tian 4593f42ec3 ♻️ refactor: migrate chunked decode attention to SparsePolicy

Move decode attention computation from attention.py to SparsePolicy:
- Add compute_chunked_decode abstract method to SparsePolicy base class
- Implement compute_chunked_decode in FullAttentionPolicy with:
  - Ring buffer pipeline (_decode_ring_buffer_pipeline)
  - Cross-layer pipeline (_decode_with_layer_pipeline)
  - Decode buffer handling
- Simplify _chunked_decode_attention to only validate and delegate
- Remove _decode_ring_buffer_pipeline and _decode_with_layer_pipeline from attention.py
- Add supports_decode check for policy validation

This completes the SparsePolicy v5 refactoring where both prefill and
decode paths now delegate all computation to the sparse policy.
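
The validate-and-delegate split described above can be sketched roughly as follows. This is a minimal illustration in pure Python, not the repository's code. The method and attribute names `compute_chunked_decode` and `supports_decode` come from the commit message, but their signatures here are assumptions, and `chunked_decode_attention` is a hypothetical stand-in for `_chunked_decode_attention`. The ring buffer and cross-layer pipelines are not modelled; an online-softmax merge over KV chunks is swapped in so the example stays self-contained.

```python
import math
from abc import ABC, abstractmethod


class SparsePolicy(ABC):
    """Hypothetical stand-in for the policy base class."""

    # Assumed to be a class-level flag; the commit only says a check exists.
    supports_decode = False

    @abstractmethod
    def compute_chunked_decode(self, q, kv_chunks):
        """Return the attention output for a single decode-step query."""


class FullAttentionPolicy(SparsePolicy):
    """Attends to every KV chunk (no sparsity)."""

    supports_decode = True

    def compute_chunked_decode(self, q, kv_chunks):
        # q: list[float]; kv_chunks: iterable of (keys, values) pairs,
        # where keys and values are lists of float vectors.
        scale = 1.0 / math.sqrt(len(q))
        running_max = -math.inf  # largest score seen so far
        denom = 0.0              # softmax denominator, rescaled as we go
        acc = []                 # weighted sum of values, rescaled as we go
        for keys, values in kv_chunks:
            for k, v in zip(keys, values):
                if not acc:
                    acc = [0.0] * len(v)
                score = scale * sum(qi * ki for qi, ki in zip(q, k))
                new_max = max(running_max, score)
                # Rescale earlier partial sums to the new reference maximum.
                correction = math.exp(running_max - new_max)
                weight = math.exp(score - new_max)
                denom = denom * correction + weight
                acc = [a * correction + weight * vi for a, vi in zip(acc, v)]
                running_max = new_max
        if denom == 0.0:
            raise ValueError("no KV entries to attend to")
        return [a / denom for a in acc]


def chunked_decode_attention(policy, q, kv_chunks):
    """Validate the policy, then delegate all computation to it."""
    if not policy.supports_decode:
        raise ValueError(f"{type(policy).__name__} does not support decode")
    return policy.compute_chunked_decode(q, kv_chunks)
```

Because the merge is order-preserving and numerically rescaled, splitting the same keys and values into one chunk or several should give the same output up to floating-point error, which is the property that lets the caller stay unaware of how the policy pipelines its chunks.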

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-20 01:32:17 +08:00

18 KiB
