nano-vllm

Files

Zijie Tian 4593f42ec3 ♻️ refactor: migrate chunked decode attention to SparsePolicy

Move decode attention computation from attention.py to SparsePolicy:
- Add compute_chunked_decode abstract method to SparsePolicy base class
- Implement compute_chunked_decode in FullAttentionPolicy with:
  - Ring buffer pipeline (_decode_ring_buffer_pipeline)
  - Cross-layer pipeline (_decode_with_layer_pipeline)
  - Decode buffer handling
- Simplify _chunked_decode_attention to only validate and delegate
- Remove _decode_ring_buffer_pipeline and _decode_with_layer_pipeline from attention.py
- Add supports_decode check for policy validation

This completes the SparsePolicy v5 refactoring where both prefill and
decode paths now delegate all computation to the sparse policy.
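
The delegation described above can be sketched roughly as follows. The names `SparsePolicy`, `FullAttentionPolicy`, `compute_chunked_decode`, `supports_decode` and `_chunked_decode_attention` come from this commit message. Everything else is an assumption made for illustration: the method signature, the plain-list data types, and the single-query online-softmax merge. The ring buffer pipeline, the cross-layer pipeline and the decode buffer are not modelled here, and the actual code in the repository may look quite different.

```python
import math
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple

Vector = List[float]
Chunk = Tuple[List[Vector], List[Vector]]  # (keys, values) of one KV chunk; hypothetical layout


class SparsePolicy(ABC):
    """Illustrative base class; the signature below is assumed, not taken from the repo."""

    supports_decode: bool = False

    @abstractmethod
    def compute_chunked_decode(self, q: Vector, kv_chunks: List[Chunk]) -> Vector:
        """Return the attention output for one decode query over all KV chunks."""


class FullAttentionPolicy(SparsePolicy):
    supports_decode = True

    def compute_chunked_decode(self, q: Vector, kv_chunks: List[Chunk]) -> Vector:
        scale = 1.0 / math.sqrt(len(q))
        running_max = -math.inf       # largest score seen so far
        denom = 0.0                   # softmax denominator, relative to running_max
        acc: Optional[Vector] = None  # weighted sum of values, relative to running_max
        for keys, values in kv_chunks:  # one chunk at a time, as if loaded from offload
            for k, v in zip(keys, values):
                score = scale * sum(qi * ki for qi, ki in zip(q, k))
                new_max = max(running_max, score)
                # Rescale earlier partial results to the new maximum (online softmax).
                correction = math.exp(running_max - new_max)
                weight = math.exp(score - new_max)
                if acc is None:
                    acc = [0.0] * len(v)
                acc = [a * correction + weight * vi for a, vi in zip(acc, v)]
                denom = denom * correction + weight
                running_max = new_max
        if acc is None:
            raise ValueError("no KV entries to attend over")
        return [a / denom for a in acc]


def chunked_decode_attention(policy: SparsePolicy, q: Vector, kv_chunks: List[Chunk]) -> Vector:
    """Stand-in for _chunked_decode_attention: validate the policy, then delegate."""
    if not policy.supports_decode:
        raise ValueError(f"{type(policy).__name__} does not support decode")
    return policy.compute_chunked_decode(q, kv_chunks)
```

In this sketch the attention layer holds no decode logic of its own, so a different policy can change how chunks are selected or merged without touching the caller.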

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-20 01:32:17 +08:00

policies

[feat] Added chunked prefill and kvcache offload mechanism.

2025-12-10 03:47:37 +08:00

sparse

♻️ refactor: migrate chunked decode attention to SparsePolicy

2026-01-20 01:32:17 +08:00

__init__.py

[WIP] Before integrating the xattn operator.

2026-01-19 21:19:21 +08:00

base_manager.py

[feat] Added chunked prefill and kvcache offload mechanism.