nano-vllm/nanovllm/kvcache/sparse/full_policy.py at a36f8569fc91c6f90d645186819c8d678f927ee4

Zijie Tian baa4be7e2e ♻️ refactor: migrate chunked prefill attention to SparsePolicy

Move all chunked prefill attention computation from attention.py to
SparsePolicy.compute_chunked_attention(). This is the v4 architecture
refactoring for sparse attention policies.

Changes:
- Add compute_chunked_attention abstract method to SparsePolicy base
- Add offload_engine parameter to select_blocks for policies needing
  KV access during block selection
- Implement compute_chunked_attention in FullAttentionPolicy with
  complete ring buffer pipeline logic
- Simplify attention.py to delegate all chunked prefill to policy
- Remove redundant _sync_load_previous_chunks and
  _ring_buffer_pipeline_load methods from Attention class
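
To make the delegation described above concrete, here is a minimal sketch of how a policy might own both block selection and the chunked attention computation. It is not the repository's code: the method signatures, the `DictOffloadEngine` class, and its `load_block` method are hypothetical stand-ins. Blocks are loaded synchronously here rather than through the ring buffer pipeline the commit mentions, and plain Python lists stand in for tensors. The per-block results are merged with a running (online) softmax, one common way to combine attention across chunks.

```python
import math
from abc import ABC, abstractmethod


class SparsePolicy(ABC):
    """Hypothetical base class; the real nano-vllm signatures may differ."""

    @abstractmethod
    def select_blocks(self, block_ids, offload_engine):
        """Return the ids of the KV blocks the query should attend to."""

    @abstractmethod
    def compute_chunked_attention(self, q, block_ids, offload_engine):
        """Return the attention output for query vector q."""


class DictOffloadEngine:
    """Hypothetical stand-in: maps block id -> (keys, values) held in memory."""

    def __init__(self, blocks):
        self._blocks = blocks

    def load_block(self, block_id):
        return self._blocks[block_id]


class FullAttentionPolicy(SparsePolicy):
    def select_blocks(self, block_ids, offload_engine):
        # Full attention keeps every block; the engine is unused here but is
        # available to policies that need to inspect KV data while selecting.
        return list(block_ids)

    def compute_chunked_attention(self, q, block_ids, offload_engine):
        scale = 1.0 / math.sqrt(len(q))
        running_max = float("-inf")  # largest score seen so far
        denom = 0.0                  # softmax denominator, rescaled as we go
        acc = None                   # weighted sum of values, rescaled as we go
        for block_id in self.select_blocks(block_ids, offload_engine):
            keys, values = offload_engine.load_block(block_id)  # synchronous load
            for k, v in zip(keys, values):
                score = scale * sum(a * b for a, b in zip(q, k))
                new_max = max(running_max, score)
                correction = math.exp(running_max - new_max)  # 0.0 on first key
                weight = math.exp(score - new_max)
                if acc is None:
                    acc = [0.0] * len(v)
                denom = denom * correction + weight
                acc = [a * correction + weight * x for a, x in zip(acc, v)]
                running_max = new_max
        if acc is None:
            raise ValueError("no KV blocks to attend to")
        return [a / denom for a in acc]
```

With this shape, the attention layer only needs to call `policy.compute_chunked_attention(...)`; a sparse policy would override `select_blocks` to return a subset of block ids while reusing the same merge logic.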

Test: test_needle.py --enable-offload PASSED

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-20 00:58:46 +08:00

7.7 KiB
