nano-vllm/nanovllm/layers/attention.py at 4593f42ec39cded5d86fc5b09ce19dc003bb4554

Files

Zijie Tian 4593f42ec3 ♻️ refactor: migrate chunked decode attention to SparsePolicy

Move decode attention computation from attention.py to SparsePolicy:
- Add compute_chunked_decode abstract method to SparsePolicy base class
- Implement compute_chunked_decode in FullAttentionPolicy with:
  - Ring buffer pipeline (_decode_ring_buffer_pipeline)
  - Cross-layer pipeline (_decode_with_layer_pipeline)
  - Decode buffer handling
- Simplify _chunked_decode_attention to only validate and delegate
- Remove _decode_ring_buffer_pipeline and _decode_with_layer_pipeline from attention.py
- Add supports_decode check for policy validation

This completes the SparsePolicy v5 refactoring where both prefill and
decode paths now delegate all computation to the sparse policy.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-20 01:32:17 +08:00

12 KiB

Raw Blame History

View Raw

12 KiB Raw Blame History

12 KiB

Raw Blame History