Commit Graph

8 Commits

Author SHA1 Message Date
Zijie Tian
a50b4c2ac2 ♻️ refactor: move select_blocks from policy to attention layer
Move block selection logic from compute_chunked_prefill/decode methods
to attention.py caller. This improves separation of concerns:

- attention.py now calls select_blocks() before compute_chunked_*()
- Policy methods receive pre-selected blocks via selected_blocks parameter
- Enables sparse policies to implement custom block selection without
  modifying the compute path

Changes:
- policy.py: Add selected_blocks parameter to abstract methods
- full_policy.py: Remove internal select_blocks calls, use passed blocks
- xattn_bsa.py: Sync signatures for prefill/decode methods
- attention.py: Add select_blocks calls before policy delegation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 05:21:28 +08:00
Zijie Tian
690456dbf9 ♻️ refactor: create ops module and move chunked_attention
- Create nanovllm/ops/ module for low-level attention operators
- Move chunked_attention.py from kvcache/ to ops/
- Update imports in full_policy.py (3 locations)
- Fix: remove dead code in OffloadEngine.reset() referencing
  non-existent layer_k/v_buffer_a/b attributes

Verified with needle test (32K offload): PASSED

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 02:50:14 +08:00
Zijie Tian
fa7601f4b8 ♻️ refactor: remove cross-layer pipeline and rename compute_chunked_prefill
- Remove cross-layer pipeline from OffloadEngine (saves ~1GB GPU memory for long sequences)
  - Delete layer_k/v_buffer_a/b double buffers
  - Remove start_decode_pipeline, get_decode_layer_kv, end_decode_pipeline methods
  - Remove pipeline state tracking variables
- Simplify decode to use ring buffer pipeline only (more efficient for long sequences)
- Rename compute_chunked_attention → compute_chunked_prefill for clarity
- Add mandatory needle test requirements: --enable-offload --input-len 32768

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 02:10:40 +08:00
Zijie Tian
4593f42ec3 ♻️ refactor: migrate chunked decode attention to SparsePolicy
Move decode attention computation from attention.py to SparsePolicy:
- Add compute_chunked_decode abstract method to SparsePolicy base class
- Implement compute_chunked_decode in FullAttentionPolicy with:
  - Ring buffer pipeline (_decode_ring_buffer_pipeline)
  - Cross-layer pipeline (_decode_with_layer_pipeline)
  - Decode buffer handling
- Simplify _chunked_decode_attention to only validate and delegate
- Remove _decode_ring_buffer_pipeline and _decode_with_layer_pipeline from attention.py
- Add supports_decode check for policy validation

This completes the SparsePolicy v5 refactoring where both prefill and
decode paths now delegate all computation to the sparse policy.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 01:32:17 +08:00
Zijie Tian
baa4be7e2e ♻️ refactor: migrate chunked prefill attention to SparsePolicy
Move all chunked prefill attention computation from attention.py to
SparsePolicy.compute_chunked_attention(). This is the v4 architecture
refactoring for sparse attention policies.

Changes:
- Add compute_chunked_attention abstract method to SparsePolicy base
- Add offload_engine parameter to select_blocks for policies needing
  KV access during block selection
- Implement compute_chunked_attention in FullAttentionPolicy with
  complete ring buffer pipeline logic
- Simplify attention.py to delegate all chunked prefill to policy
- Remove redundant _sync_load_previous_chunks and
  _ring_buffer_pipeline_load methods from Attention class

Test: test_needle.py --enable-offload PASSED

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 00:58:46 +08:00
Zijie Tian
b97b0b96a0 [WIP] Before refactor the nanovllm sparse policy. 2026-01-19 22:34:44 +08:00
Zijie Tian
690492e074 [WIP] Before refactor policies. 2026-01-06 20:47:55 +08:00
Zijie Tian
051f2295c9 [feat] Added sparse KVcache feature, NEED VERIFY. 2025-12-22 08:51:02 +08:00