Add statistics tracking to compare block selection between policies:
- XAttentionBSAPolicy: track available/selected blocks per chunk
- FullAttentionPolicy: track total blocks (always 100% density)
- Add reset_stats(), get_density_stats(), print_density_stats() methods
- Use logger.debug for per-chunk density logging
Results on 32K niah_single_1:
- Full: 100% density across all chunks
- XAttn BSA: 90% -> 73% density (saves ~25-30% blocks in later chunks)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create nanovllm/ops/ module for low-level attention operators
- Move chunked_attention.py from kvcache/ to ops/
- Update imports in full_policy.py (3 locations)
- Fix: remove dead code in OffloadEngine.reset() referencing
non-existent layer_k/v_buffer_a/b attributes
Verified with needle test (32K offload): PASSED
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove cross-layer pipeline from OffloadEngine (saves ~1GB GPU memory for long sequences)
- Delete layer_k/v_buffer_a/b double buffers
- Remove start_decode_pipeline, get_decode_layer_kv, end_decode_pipeline methods
- Remove pipeline state tracking variables
- Simplify decode to use ring buffer pipeline only (more efficient for long sequences)
- Rename compute_chunked_attention → compute_chunked_prefill for clarity
- Add mandatory needle test requirements: --enable-offload --input-len 32768
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move decode attention computation from attention.py to SparsePolicy:
- Add compute_chunked_decode abstract method to SparsePolicy base class
- Implement compute_chunked_decode in FullAttentionPolicy with:
- Ring buffer pipeline (_decode_ring_buffer_pipeline)
- Cross-layer pipeline (_decode_with_layer_pipeline)
- Decode buffer handling
- Simplify _chunked_decode_attention to only validate and delegate
- Remove _decode_ring_buffer_pipeline and _decode_with_layer_pipeline from attention.py
- Add supports_decode check for policy validation
This completes the SparsePolicy v5 refactoring where both prefill and
decode paths now delegate all computation to the sparse policy.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move all chunked prefill attention computation from attention.py to
SparsePolicy.compute_chunked_attention(). This is the v4 architecture
refactoring for sparse attention policies.
Changes:
- Add compute_chunked_attention abstract method to SparsePolicy base
- Add offload_engine parameter to select_blocks for policies needing
KV access during block selection
- Implement compute_chunked_attention in FullAttentionPolicy with
complete ring buffer pipeline logic
- Simplify attention.py to delegate all chunked prefill to policy
- Remove redundant _sync_load_previous_chunks and
_ring_buffer_pipeline_load methods from Attention class
Test: test_needle.py --enable-offload PASSED
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>