♻️ refactor: remove cross-layer pipeline and rename compute_chunked_prefill

- Remove cross-layer pipeline from OffloadEngine (saves ~1GB GPU memory for long sequences)
  - Delete layer_k/v_buffer_a/b double buffers
  - Remove start_decode_pipeline, get_decode_layer_kv, end_decode_pipeline methods
  - Remove pipeline state tracking variables
- Simplify decode to use ring buffer pipeline only (more efficient for long sequences)
- Rename compute_chunked_attention → compute_chunked_prefill for clarity
- Add mandatory needle test requirements: --enable-offload --input-len 32768
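The ring-buffer pipeline the decode path now relies on can be sketched roughly as follows. This is a hypothetical illustration, not the OffloadEngine's actual interface: the class name, pool size, and `load_fn` callback are all assumptions. The point is that a small fixed pool of staging buffers is cycled across layers, instead of keeping dedicated A/B double buffers per layer, so GPU memory no longer grows with the number of in-flight layers.

```python
from collections import deque

class RingBufferPipeline:
    """Hypothetical sketch: cycle a fixed pool of GPU staging buffers
    across layers instead of per-layer A/B double buffers."""

    def __init__(self, num_buffers, buffer_factory):
        # A small fixed pool replaces the per-layer double buffers.
        self.free = deque(buffer_factory() for _ in range(num_buffers))
        self.inflight = deque()  # (layer_id, buffer) pairs being filled

    def prefetch(self, layer_id, load_fn):
        # Start an async host-to-device copy into the next free slot
        # (load_fn would wrap something like cudaMemcpyAsync).
        buf = self.free.popleft()
        load_fn(layer_id, buf)
        self.inflight.append((layer_id, buf))

    def acquire(self, layer_id):
        # Block on the oldest in-flight transfer; layers must be
        # consumed in the same order they were prefetched.
        lid, buf = self.inflight.popleft()
        assert lid == layer_id, "layers must be consumed in prefetch order"
        return buf

    def release(self, buf):
        # Return the buffer to the pool for reuse by a later layer.
        self.free.append(buf)
```

With, say, two buffers, layer N's KV can be consumed while layer N+1's transfer is in flight, which is the overlap the removed cross-layer double buffers provided, at a fraction of the memory.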

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Author: Zijie Tian
Date: 2026-01-20 02:10:40 +08:00
Parent: 6080bf7554
Commit: fa7601f4b8
9 changed files with 67 additions and 299 deletions


@@ -174,7 +174,7 @@ class Attention(nn.Module):
         Compute attention with per-layer prefill buffer for async offload.
 
         Simplified design:
-        - All computation logic is delegated to sparse_policy.compute_chunked_attention()
+        - All computation logic is delegated to sparse_policy.compute_chunked_prefill()
         - This method only handles async offload after computation
 
         The policy handles:
@@ -198,11 +198,11 @@ class Attention(nn.Module):
             raise RuntimeError("sparse_policy is required for chunked prefill")
 
         # [DEBUG] Verify execution path
-        logger.debug(f"[DEBUG] Calling sparse_policy.compute_chunked_attention, "
+        logger.debug(f"[DEBUG] Calling sparse_policy.compute_chunked_prefill, "
                      f"policy={sparse_policy}, layer={self.layer_id}, chunk={current_chunk_idx}")
 
         # Delegate all computation to policy (no flash_attn or merge calls here!)
-        final_o = sparse_policy.compute_chunked_attention(
+        final_o = sparse_policy.compute_chunked_prefill(
             q, k, v,
             self.layer_id,
             self.scale,
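For context, the policy-side method that the call above delegates to might look like the following. This is a minimal NumPy stand-in, not the project's implementation: the class name is invented, and the real `compute_chunked_prefill` uses flash-attn kernels and chunk merging rather than dense softmax attention. Only the signature (q, k, v, layer_id, scale) is taken from the call site in the diff.

```python
import numpy as np

class SimpleSparsePolicy:
    """Hypothetical policy sketch; dense attention stands in for the
    real flash-attn + merge logic."""

    def compute_chunked_prefill(self, q, k, v, layer_id, scale):
        # q, k, v: [num_tokens, num_heads, head_dim] for the current chunk.
        # All attention computation lives in the policy, so the caller
        # (Attention) only handles async offload after this returns.
        scores = np.einsum("qhd,khd->hqk", q, k) * scale
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        return np.einsum("hqk,khd->qhd", probs, v)
```

Keeping the kernel choice inside the policy is what makes the rename meaningful: the `Attention` module no longer knows whether prefill attention is dense, chunked, or sparse.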