[WIP] need refactor.

📝 docs: add layer offload planning notes and task plan
Add planning documents for layer-wise offload implementation:
- notes.md: Implementation notes and findings
- task_plan.md: Detailed task breakdown and progress tracking

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 22:20:34 +08:00 · 2026-01-22 06:04:36 +08:00 · 2026-01-22 06:03:42 +08:00 · 2026-01-22 06:00:42 +08:00
18 changed files with 3424 additions and 548 deletions
--- a/.claude/commands/exec-plan.md
+++ b/.claude/commands/exec-plan.md
@@ -0,0 +1,158 @@
 ---
 allowed-tools: Bash(CUDA_VISIBLE_DEVICES=*), Bash(PYTHONPATH=*), Bash(python*), Bash(git*), Bash(rm*), Bash(ls*), Bash(cat*), Bash(nvidia-smi*), Read, Edit, Write, Glob, Grep, TodoWrite, Task
 argument-hint: --gpu <id> [--no-interrupt]
 description: Execute task_plan.md refactoring with specified GPU, optionally without user interruption
 ---
 # Execute Task Plan (exec-plan)
 Refactor the code as specified in `task_plan.md`, making sure the plan's final goals are fully met.
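 ## Parameters
 Command format: `/exec-plan --gpu <id> [--no-interrupt]`
 | Parameter | Description | Example |
 |------|------|------|
 | `--gpu <id>` | **Required**. ID of the GPU to use; all debugging must run on this GPU only | `--gpu 0`, `--gpu 2` |
 | `--no-interrupt` | Optional. Run without interruption: do not prompt the user when problems occur; resolve them automatically or skip them | `--no-interrupt` |
 ## Current Arguments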
 ```
 $ARGUMENTS
 ```
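 ## Pre-Execution Setup
 ### 1. Parse arguments
 Parse the following from `$ARGUMENTS`:
 - `GPU_ID`: taken from `--gpu <id>` or `-g <id>`
 - `NO_INTERRUPT`: whether the `--no-interrupt` or `-n` flag is present
 ### 2. Validate arguments
 **Required checks**:
 - GPU_ID must be a valid number
 - Run `nvidia-smi -i <GPU_ID>` to confirm the GPU exists
 ### 3. Read task_plan.md
 Read `task_plan.md` in the project root and understand:
 - Overall goal
 - Phased plan (Phase 1, 2, 3...)
 - List of files to modify
 - Risks and caveats
 - Test plan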
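 ## Execution Flow
 ### Step 1: Create the execution plan
 Use the TodoWrite tool to create a detailed execution plan that includes:
 - Every Phase extracted from task_plan.md
 - The subtasks of each Phase
 - Test and verification steps
 ### Step 2: Refactor phase by phase
 For each Phase in task_plan.md:
 1. **Read the current code**: use Read/Grep to understand the existing implementation
 2. **Apply changes**: use Edit/Write to modify the code
 3. **Verify changes**: run the relevant tests
 ### Step 3: Run tests
 Run the test plan defined in task_plan.md to verify that the refactoring succeeded.
 ## GPU Restriction Rules
 **Strict limit**: use only the specified GPU. Every GPU-related command must be prefixed with `CUDA_VISIBLE_DEVICES`: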
 ```bash
 # Correct
 CUDA_VISIBLE_DEVICES=$GPU_ID PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH python test.py
 # Wrong - using any other GPU is forbidden
 python test.py  # may fall back to the default GPU 0
 CUDA_VISIBLE_DEVICES=0,1 python test.py  # uses multiple GPUs
 ```
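 ## Interrupt Mode Rules
 ### When `--no-interrupt` is set
 In the following situations, **do not stop to ask the user**; instead:
 | Situation | Handling |
 |------|----------|
 | Test failure | Record the cause, attempt an automatic fix, continue to the next step |
 | Code conflict | Attempt a reasonable resolution and record it |
 | Uncertain implementation detail | Choose the most reasonable option and continue |
 | Execution error | Analyze the error, attempt a fix, record the problem |
 **Automatic decision principles**:
 1. Prioritize functional correctness
 2. Follow the existing code style
 3. Prefer simple, direct implementations
 4. Record every automatic decision in `progress.md`
 ### When `--no-interrupt` is not set
 In the following situations, **you may ask the user**:
 - A choice between multiple implementation options is needed
 - Tests keep failing and cannot be fixed automatically
 - Problems or contradictions are found in task_plan.md
 ## Execution Records
 ### Progress file: progress.md
 Update `progress.md` in real time with: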
 ```markdown
 ## Execution Progress
 ### Phase X: [name]
 - Status: [in progress/done/failed]
 - Start time: [time]
 - End time: [time]
 - Files modified: [file list]
 - Automatic decisions: [if any]
 - Issues: [if any]
 ```
 ### Findings file: findings.md
 Record important findings from the run in `findings.md`.
 ## Example Usage
 ```bash
 # Use GPU 2, interruptions allowed
 /exec-plan --gpu 2
 # Use GPU 0, run without interruption
 /exec-plan --gpu 0 --no-interrupt
 # Short form
 /exec-plan -g 1 -n
 ```
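 ## Completion Criteria
 When execution finishes, make sure that:
 1. **All Phases are complete**: every Phase in task_plan.md has been implemented
 2. **Tests pass**: the entire test plan in task_plan.md passes
 3. **Code quality**: changes follow the project's coding conventions
 4. **Documentation is updated**: progress.md contains the full execution record
 ## Important Constraints
 1. **GPU isolation**: never use any device other than the specified GPU
 2. **Follow the plan**: execute task_plan.md strictly; make no changes outside the plan
 3. **Incremental changes**: verify after each Phase rather than all at once at the end
 4. **Rollback readiness**: before major changes, consider whether a git commit is needed as a save point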
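The argument parsing, validation, and GPU pinning rules in this command file can be sketched in Python as follows. This is an illustration only and is not part of the commit: `parse_exec_plan_args`, `gpu_env`, and `gpu_exists` are hypothetical helper names, and the sketch assumes that `nvidia-smi -i <id>` exits with a non-zero status when the GPU ID is unknown.

```python
import argparse
import os
import shlex
import subprocess


def parse_exec_plan_args(arguments: str) -> tuple[int, bool]:
    """Parse the raw $ARGUMENTS string into (gpu_id, no_interrupt)."""
    parser = argparse.ArgumentParser(prog="/exec-plan")
    # type=int rejects non-numeric GPU IDs, matching the validation rule.
    parser.add_argument("--gpu", "-g", type=int, required=True)
    parser.add_argument("--no-interrupt", "-n", action="store_true")
    ns = parser.parse_args(shlex.split(arguments))
    return ns.gpu, ns.no_interrupt


def gpu_env(gpu_id: int) -> dict[str, str]:
    """Copy the current environment and pin it to exactly one GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def gpu_exists(gpu_id: int) -> bool:
    """Assumption: nvidia-smi exits non-zero when the GPU ID is unknown."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-i", str(gpu_id)],
            capture_output=True,
            check=False,
        )
    except FileNotFoundError:
        # nvidia-smi is not installed on this machine.
        return False
    return result.returncode == 0
```

Passing `env=gpu_env(gpu_id)` to each `subprocess.run` call would keep every child process on the single permitted GPU, which is one way to enforce the isolation constraint without relying on a shell prefix.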
--- a/nanovllm/config.py
+++ b/nanovllm/config.py
@@ -62,6 +62,7 @@ class Config:
    xattn_keep_sink: bool = False  # Always keep first block (sink tokens)
    xattn_keep_recent: bool = False  # Always keep recent diagonal blocks
    xattn_norm: float = 1.0  # Normalization factor for attention scores
    xattn_use_bsa: bool = True  # Use Block Sparse Attention library (requires installation)
    def __post_init__(self):
        assert os.path.isdir(self.model)
--- a/nanovllm/engine/model_runner.py
+++ b/nanovllm/engine/model_runner.py
@@ -57,8 +57,8 @@ class ModelRunner:
        load_model(self.model, config.model)
        self.sampler = GreedySampler()
-        # Initialize sparse_prefill_policy before warmup (will be configured in allocate_kv_cache)
+        # Initialize attention_policy before warmup (will be configured in allocate_kv_cache)
-        self.sparse_prefill_policy = None
+        self.attention_policy = None
        #> Disable warmup for debugging
        self.warmup_model()
@@ -178,11 +178,9 @@ class ModelRunner:
        # Create KV cache manager using factory
        self.kvcache_manager: KVCacheManager = create_kvcache_manager(config)
-        # Create sparse prefill policy
+        # Create attention policy (always, including FULL)
-        # This is used for both GPU-only and CPU offload modes when policy supports prefill
+        # In layerwise offload mode, all attention goes through the policy
-        self.sparse_prefill_policy = None
+        from nanovllm.kvcache.sparse import create_attention_policy
        if config.sparse_policy != SparsePolicyType.FULL:
            from nanovllm.kvcache.sparse import create_sparse_policy
        # Get policy-specific parameters based on type
        if config.sparse_policy == SparsePolicyType.XATTN:
@@ -194,8 +192,9 @@ class ModelRunner:
                "keep_sink": config.xattn_keep_sink,
                "keep_recent": config.xattn_keep_recent,
                "norm": config.xattn_norm,
                "use_bsa": config.xattn_use_bsa,
            }
-            else:  # MINFERENCE or others
+        elif config.sparse_policy == SparsePolicyType.MINFERENCE:
            policy_kwargs = {
                "vertical_size": config.minference_vertical_size,
                "slash_size": config.minference_slash_size,
@@ -203,13 +202,11 @@ class ModelRunner:
                "num_sink_tokens": config.minference_num_sink_tokens,
                "num_recent_diags": config.minference_num_recent_diags,
            }
        else:  # FULL or QUEST
            policy_kwargs = {}
-            policy = create_sparse_policy(config.sparse_policy, **policy_kwargs)
+        self.attention_policy = create_attention_policy(config.sparse_policy, **policy_kwargs)
-
+        logger.info(f"Attention policy: {self.attention_policy}")
            # Only use if policy supports sparse prefill
            if policy.supports_prefill:
                self.sparse_prefill_policy = policy
                logger.info(f"Sparse prefill policy enabled: {self.sparse_prefill_policy}")
        # Allocate cache through manager
        self.kvcache_manager.allocate_cache(
@@ -395,7 +392,7 @@ class ModelRunner:
        set_context(True, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k,
                    slot_mapping, None, block_tables,
-                    sparse_prefill_policy=self.sparse_prefill_policy)
+                    attention_policy=self.attention_policy)
        return input_ids, positions
    def prepare_decode(self, seqs: list[Sequence]):
@@ -592,20 +589,10 @@ class ModelRunner:
            # RoPE
            q, k = layer.self_attn.rotary_emb(positions, q, k)
-            # Sparse or Full attention (uses k, v directly - before store!)
+            # Compute attention using policy (uses k, v directly - before store!)
-            if self.sparse_prefill_policy is not None:
+            attn_output = self.attention_policy.compute_prefill(
-                attn_output = self.sparse_prefill_policy.sparse_prefill_attention(
+                q, k, v, layer_id,
                    q, k, v, layer_id
                )
            else:
                attn_output = flash_attn_varlen_func(
                    q, k, v,
                    cu_seqlens_q=cu_seqlens,
                    cu_seqlens_k=cu_seqlens,
                    max_seqlen_q=total_tokens,
                    max_seqlen_k=total_tokens,
                softmax_scale=layer.self_attn.attn.scale,
                    causal=True,
            )
            # O projection
@@ -872,22 +859,10 @@ class ModelRunner:
                # RoPE
                q, k = layer.self_attn.rotary_emb(positions, q, k)
-                # Sparse or Full attention
+                # Compute attention using policy
-                if self.sparse_prefill_policy is not None:
+                attn_output = self.attention_policy.compute_prefill(
-                    # MInference or other sparse prefill policy
+                    q, k, v, layer_id,
                    attn_output = self.sparse_prefill_policy.sparse_prefill_attention(
                        q, k, v, layer_id
                    )
                else:
                    # Full attention using FlashAttention
                    attn_output = flash_attn_varlen_func(
                        q, k, v,
                        cu_seqlens_q=cu_seqlens,
                        cu_seqlens_k=cu_seqlens,
                        max_seqlen_q=total_tokens,
                        max_seqlen_k=total_tokens,
                    softmax_scale=layer.self_attn.attn.scale,
                        causal=True,
                )
                # O projection
--- a/nanovllm/kvcache/sparse/__init__.py
+++ b/nanovllm/kvcache/sparse/__init__.py
@@ -1,49 +1,56 @@
 """
-Sparse Attention Policy module.
+Attention Policy module for layerwise offload mode.
-Provides pluggable policies for selecting which KV blocks to load
+Provides pluggable policies for attention computation:
-during chunked attention with CPU offload.
+- FullAttentionPolicy: Standard FlashAttention (no sparsity)
 - XAttentionPolicy: Sparse prefill using XAttention algorithm
 - MInferencePolicy: MInference sparse attention
 - QuestPolicy: Quest block selection (for chunked offload)
 Usage:
-    from nanovllm.kvcache.sparse import create_sparse_policy, SparsePolicyType
+    from nanovllm.kvcache.sparse import create_attention_policy, SparsePolicyType
    # Create policy using factory function
-    policy = create_sparse_policy(SparsePolicyType.QUEST, topk_blocks=8)
+    policy = create_attention_policy(SparsePolicyType.XATTN, threshold=0.9)
    # Use policy for attention
    attn_output = policy.compute_prefill(q, k, v, layer_id, softmax_scale)
    # Or create custom policy
-    class MyPolicy(SparsePolicy):
+    class MyPolicy(AttentionPolicy):
        supports_prefill = True
        supports_decode = True
-        def select_blocks(self, available_blocks, ctx):
+        def compute_prefill(self, q, k, v, layer_id, softmax_scale):
-            return available_blocks[:5]  # Just first 5 blocks
+            # Custom attention computation
            ...
 """
 from nanovllm.config import SparsePolicyType
-from nanovllm.kvcache.sparse.policy import SparsePolicy, PolicyContext
+from nanovllm.kvcache.sparse.policy import AttentionPolicy, SparsePolicy, PolicyContext
 from nanovllm.kvcache.sparse.full_policy import FullAttentionPolicy
 from nanovllm.kvcache.sparse.quest import QuestPolicy, QuestConfig, BlockMetadataManager
 from nanovllm.kvcache.sparse.minference import MInferencePolicy
 from nanovllm.kvcache.sparse.xattn import XAttentionPolicy
-def create_sparse_policy(policy_type: SparsePolicyType, **kwargs) -> SparsePolicy:
+def create_attention_policy(policy_type: SparsePolicyType, **kwargs) -> AttentionPolicy:
    """
-    Create a sparse policy instance from an enum type.
+    Create an attention policy instance from an enum type.
-    The returned policy is not yet initialized. Call policy.initialize()
+    All attention (including full attention) goes through a policy in layerwise
-    or let the framework call it during KV cache allocation.
+    offload mode. The policy is responsible for computing prefill/decode attention.
    Args:
-        policy_type: SparsePolicyType enum value
+        policy_type: SparsePolicyType enum value (FULL, XATTN, MINFERENCE, QUEST)
        **kwargs: Policy-specific configuration options
    Returns:
-        SparsePolicy instance (not initialized)
+        AttentionPolicy instance
    Example:
-        policy = create_sparse_policy(SparsePolicyType.QUEST, topk_blocks=4)
+        policy = create_attention_policy(SparsePolicyType.XATTN, threshold=0.9)
-        policy.initialize(num_layers=28, num_kv_heads=8, ...)
+        attn_out = policy.compute_prefill(q, k, v, layer_id, softmax_scale)
    """
    if policy_type == SparsePolicyType.FULL:
        return FullAttentionPolicy()
@@ -75,21 +82,32 @@ def create_sparse_policy(policy_type: SparsePolicyType, **kwargs) -> SparsePolic
            keep_sink=kwargs.get("keep_sink", False),
            keep_recent=kwargs.get("keep_recent", False),
            norm=kwargs.get("norm", 1.0),
            use_bsa=kwargs.get("use_bsa", True),
        )
    else:
        raise ValueError(f"Unknown policy type: {policy_type}")
 # Backward compatibility alias
 create_sparse_policy = create_attention_policy
 __all__ = [
    # New interface
    "AttentionPolicy",
    "create_attention_policy",
    # Backward compatibility
    "SparsePolicy",
    "create_sparse_policy",
    # Common types
    "PolicyContext",
    "SparsePolicyType",
    # Policy implementations
    "FullAttentionPolicy",
    "QuestPolicy",
    "QuestConfig",
    "BlockMetadataManager",
    "MInferencePolicy",
    "XAttentionPolicy",
    "create_sparse_policy",
 ]
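The renamed factory keeps the old entry point alive through a plain alias. A minimal, self-contained sketch of that pattern is shown here; `PolicyType`, `FullPolicy`, and `XAttnPolicy` are hypothetical stand-ins for the real `SparsePolicyType`, `FullAttentionPolicy`, and `XAttentionPolicy`, so only the dispatch and alias mechanics mirror the diff.

```python
from enum import Enum, auto


class PolicyType(Enum):  # hypothetical stand-in for SparsePolicyType
    FULL = auto()
    XATTN = auto()


class FullPolicy:  # stand-in for FullAttentionPolicy
    def __repr__(self) -> str:
        return "FullPolicy()"


class XAttnPolicy:  # stand-in for XAttentionPolicy
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold

    def __repr__(self) -> str:
        return f"XAttnPolicy(threshold={self.threshold})"


def create_attention_policy(policy_type: PolicyType, **kwargs):
    """Dispatch on the enum; kwargs that are not supplied use defaults."""
    if policy_type == PolicyType.FULL:
        return FullPolicy()
    if policy_type == PolicyType.XATTN:
        return XAttnPolicy(threshold=kwargs.get("threshold", 0.9))
    raise ValueError(f"Unknown policy type: {policy_type}")


# Old name kept as an alias so existing call sites keep working.
create_sparse_policy = create_attention_policy
```

Because the alias binds the same function object, callers of the old name get the new behavior, including a policy instance for `FULL`.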
--- a/nanovllm/kvcache/sparse/full_policy.py
+++ b/nanovllm/kvcache/sparse/full_policy.py
@@ -1,20 +1,21 @@
 """
-Full attention policy - loads all blocks (no sparsity).
+Full attention policy - standard FlashAttention without sparsity.
 This serves as a baseline and default policy when sparse
 attention is not needed.
 """
-from typing import List
+from typing import Optional
-from .policy import SparsePolicy, PolicyContext
+import torch
 from .policy import AttentionPolicy
-class FullAttentionPolicy(SparsePolicy):
+class FullAttentionPolicy(AttentionPolicy):
    """
-    Full attention policy that loads all available blocks.
+    Full attention policy using FlashAttention (no sparsity).
-    This is the default behavior with no sparsity - all previous
+    This is the default behavior with standard causal attention.
-    KV cache blocks are loaded for each query chunk.
+    All tokens attend to all previous tokens.
    Use this as:
    - A baseline for comparing sparse policies
@@ -25,15 +26,55 @@ class FullAttentionPolicy(SparsePolicy):
    # Full attention supports both prefill and decode
    supports_prefill = True
    supports_decode = True
    requires_block_selection = False  # Load all blocks, no selective loading
-    def select_blocks(
+    def estimate(
        self,
-        available_blocks: List[int],
+        q: torch.Tensor,
-        ctx: PolicyContext,
+        k: torch.Tensor,
-    ) -> List[int]:
+        layer_id: int,
-        """Return all blocks - no sparsity."""
+    ) -> Optional[torch.Tensor]:
-        return available_blocks
+        """
        Full attention - no sparse mask needed.
        Returns None to indicate full attention should be used.
        """
        return None
    def compute_prefill(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        layer_id: int,
        softmax_scale: float,
    ) -> torch.Tensor:
        """
        Compute full causal attention using FlashAttention.
        Args:
            q: Query tensor [seq_len, num_heads, head_dim]
            k: Key tensor [seq_len, num_kv_heads, head_dim]
            v: Value tensor [seq_len, num_kv_heads, head_dim]
            layer_id: Transformer layer index
            softmax_scale: Softmax scaling factor (1/sqrt(head_dim))
        Returns:
            Attention output [seq_len, num_heads, head_dim]
        """
        from flash_attn.flash_attn_interface import flash_attn_varlen_func
        seq_len = q.shape[0]
        cu_seqlens = torch.tensor([0, seq_len], dtype=torch.int32, device=q.device)
        return flash_attn_varlen_func(
            q, k, v,
            cu_seqlens_q=cu_seqlens,
            cu_seqlens_k=cu_seqlens,
            max_seqlen_q=seq_len,
            max_seqlen_k=seq_len,
            softmax_scale=softmax_scale,
            causal=True,
        )
    def __repr__(self) -> str:
        return "FullAttentionPolicy()"
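`FullAttentionPolicy.compute_prefill` builds `cu_seqlens` as `[0, seq_len]`, so as written it appears to treat its input as a single packed sequence. The variable-length FlashAttention interface expects cumulative sequence lengths with one more entry than there are sequences. The sketch below shows how those offsets are formed for a packed batch; `build_cu_seqlens` is a hypothetical helper, and real code would convert the result to an `int32` tensor on the same device as `q`.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens: list[int]) -> list[int]:
    """Return cumulative offsets [0, l0, l0+l1, ...] for packed sequences."""
    return [0, *accumulate(seq_lens)]
```

For a single sequence of length 5 this yields `[0, 5]`, which matches the form used in the diff; for lengths 3, 4, and 2 it yields `[0, 3, 7, 9]`.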
--- a/nanovllm/kvcache/sparse/minference.py
+++ b/nanovllm/kvcache/sparse/minference.py
@@ -10,10 +10,10 @@ from typing import List, Tuple, Optional
 import torch
 import torch.nn.functional as F
-from nanovllm.kvcache.sparse.policy import SparsePolicy, PolicyContext
+from nanovllm.kvcache.sparse.policy import AttentionPolicy, PolicyContext
-class MInferencePolicy(SparsePolicy):
+class MInferencePolicy(AttentionPolicy):
    """
    MInference sparse prefill policy using vertical + slash pattern.
@@ -347,6 +347,33 @@ class MInferencePolicy(SparsePolicy):
        return o
    def compute_prefill(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        layer_id: int,
        softmax_scale: float,
    ) -> torch.Tensor:
        """
        Compute MInference sparse prefill attention.
        This is the new unified interface for attention policies.
        Delegates to sparse_prefill_attention (ignores softmax_scale as MInference
        computes it internally from head_dim).
        Args:
            q: Query tensor [seq_len, num_heads, head_dim]
            k: Key tensor [seq_len, num_kv_heads, head_dim]
            v: Value tensor [seq_len, num_kv_heads, head_dim]
            layer_id: Transformer layer index
            softmax_scale: Softmax scaling factor (unused, computed internally)
        Returns:
            Attention output [seq_len, num_heads, head_dim]
        """
        return self.sparse_prefill_attention(q, k, v, layer_id)
    def __repr__(self) -> str:
        return (f"MInferencePolicy("
                f"adaptive_budget={self.adaptive_budget}, "
--- a/nanovllm/kvcache/sparse/policy.py
+++ b/nanovllm/kvcache/sparse/policy.py
@@ -1,13 +1,18 @@
 """
-Base class for sparse attention policies.
+Base class for attention policies in layerwise offload mode.
-Sparse attention policies determine which KV cache blocks to load
+AttentionPolicy defines the interface for all attention computation,
-from CPU for each query chunk during chunked attention computation.
+including full attention and sparse attention methods like XAttention.
 Key methods:
 - estimate(): Compute sparse attention mask (optional, returns None for full attention)
 - compute_prefill(): Compute prefill attention
 - compute_decode(): Compute decode attention (default implementation provided)
 """
 from abc import ABC, abstractmethod
 from dataclasses import dataclass
-from typing import List, Optional, Any
+from typing import List, Optional, Tuple
 import torch
 # Import SparsePolicyType from config to avoid circular imports
@@ -17,10 +22,10 @@ from nanovllm.config import SparsePolicyType
@dataclass
 class PolicyContext:
    """
-    Context passed to sparse policy for block selection.
+    Context passed to attention policy for block selection.
-    This dataclass contains all information needed by a sparse policy
+    This dataclass contains all information needed by an attention policy
-    to decide which blocks to load for the current query chunk.
+    for sparse estimation and attention computation.
    """
    query_chunk_idx: int
@@ -49,40 +54,41 @@ class PolicyContext:
    """Total KV sequence length so far (for reference)."""
-class SparsePolicy(ABC):
+class AttentionPolicy(ABC):
    """
-    Abstract base class for sparse attention policies.
+    Base class for attention policies in layerwise offload mode.
-    Subclass this and implement select_blocks() to create custom
+    All attention computation goes through a policy, including both
-    sparse attention patterns. The policy receives context about
+    full attention and sparse attention methods.
-    the current query chunk and returns which KV blocks to load.
+
    The policy interface is designed for layerwise offload where:
    - The entire KV cache for a layer is on GPU during computation
    - No need for block loading from CPU during attention
    - estimate() returns a sparse mask (or None for full attention)
    - compute_prefill()/compute_decode() perform the actual attention
    Attributes:
        supports_prefill: Whether this policy can be used for prefill phase.
        supports_decode: Whether this policy can be used for decode phase.
    Example:
-        class MySparsePolicy(SparsePolicy):
+        class MyPolicy(AttentionPolicy):
-            supports_prefill = False  # decode-only policy
+            supports_prefill = True
            supports_decode = True
-            def select_blocks(self, available_blocks, ctx):
+            def estimate(self, q, k, layer_id):
-                # Load first block and last 2 blocks
+                # Return sparse mask or None
-                if len(available_blocks) <= 3:
+                return None
-                    return available_blocks
+
-                return [available_blocks[0]] + available_blocks[-2:]
+            def compute_prefill(self, q, k, v, layer_id, softmax_scale):
                # Compute attention
                return flash_attn_varlen_func(q, k, v, ...)
    """
    # Compatibility flags - override in subclasses
    supports_prefill: bool = True
    supports_decode: bool = True
    # Whether this policy requires selective block loading during decode
    # If True: OffloadEngine will call select_blocks() before loading KV from CPU
    # If False: OffloadEngine will load all blocks (select_blocks ignored for load)
    # Example: MInference=False (only affects attention), Quest=True (affects load)
    requires_block_selection: bool = False
    def initialize(
        self,
        num_layers: int,
@@ -96,7 +102,7 @@ class SparsePolicy(ABC):
        Initialize policy resources.
        Called by the framework after KV cache is allocated. Override this
-        to create metadata structures (e.g., BlockMetadataManager for Quest).
+        to create metadata structures or pre-allocate buffers.
        Default implementation does nothing.
        Args:
@@ -109,76 +115,98 @@ class SparsePolicy(ABC):
        """
        pass
-    @abstractmethod
+    def estimate(
    def select_blocks(
        self,
-        available_blocks: List[int],
+        q: torch.Tensor,
-        ctx: PolicyContext,
+        k: torch.Tensor,
-    ) -> List[int]:
+        layer_id: int,
    ) -> Optional[torch.Tensor]:
        """
-        Select which KV blocks to load for the current query chunk.
+        Estimate sparse attention mask.
-        This is the core method that defines the sparse attention pattern.
+        For sparse policies (e.g., XAttention), computes block-level importance
-        The returned blocks will be loaded from CPU to GPU for attention
+        and returns a boolean mask indicating which blocks to attend.
-        computation against the current query chunk.
+        For full attention policy, returns None.
        This corresponds to xattn_estimate() in COMPASS.
        Args:
-            available_blocks: List of CPU block IDs that contain KV cache
+            q: Query tensor [seq_len, num_heads, head_dim]
-                             from previous chunks. These are ordered by
+            k: Key tensor [seq_len, num_kv_heads, head_dim]
-                             their position in the sequence.
+            layer_id: Transformer layer index
            ctx: PolicyContext with information about the current query
                 chunk, layer, phase (prefill/decode), etc.
        Returns:
-            List of block IDs to load (must be a subset of available_blocks).
+            sparse_mask: [batch, num_heads, q_blocks, k_blocks] boolean mask,
-            The order may affect performance (sequential access is faster).
+                        or None for full attention
            Returning [] means no previous blocks will be loaded.
        """
-        pass
+        return None
-    def on_prefill_offload(
+    @abstractmethod
    def compute_prefill(
        self,
-        cpu_block_id: int,
+        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        layer_id: int,
-        k_cache: torch.Tensor,
+        softmax_scale: float,
-        num_valid_tokens: int,
+    ) -> torch.Tensor:
    ) -> None:
        """
-        Hook called when a block is offloaded during prefill phase.
+        Compute prefill attention.
-        Called BEFORE GPU→CPU copy, while k_cache is still on GPU.
+        The entire KV cache for this layer is on GPU. Compute attention
-        Override this to collect metadata about blocks (e.g., min/max keys
+        between Q and K/V, optionally using sparse mask from estimate().
        for Quest-style selection). Default implementation does nothing.
        Args:
-            cpu_block_id: The CPU block ID that will be written
+            q: Query tensor [seq_len, num_heads, head_dim]
            k: Key tensor [seq_len, num_kv_heads, head_dim]
            v: Value tensor [seq_len, num_kv_heads, head_dim]
            layer_id: Transformer layer index
-            k_cache: Key cache tensor [block_size, num_kv_heads, head_dim] (on GPU)
+            softmax_scale: Softmax scaling factor (1/sqrt(head_dim))
-            num_valid_tokens: Number of valid tokens in this block
+
        Returns:
            Attention output [seq_len, num_heads, head_dim]
        """
        pass
-    def on_decode_offload(
+    def compute_decode(
        self,
-        cpu_block_id: int,
+        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        layer_id: int,
-        k_cache: torch.Tensor,
+        softmax_scale: float,
-        num_valid_tokens: int,
+    ) -> torch.Tensor:
    ) -> None:
        """
-        Hook called when a block is offloaded during decode phase.
+        Compute decode attention.
-        Called BEFORE GPU→CPU copy, while k_cache is still on GPU.
+        KV is provided from ring buffer, containing prefill tokens + decoded tokens.
-        Override this to update metadata about blocks. Default implementation
+        Default implementation uses FlashAttention.
        does nothing.
        Args:
-            cpu_block_id: The CPU block ID that will be written
+            q: Query tensor [1, num_heads, head_dim]
            k: Key tensor [context_len+1, num_kv_heads, head_dim]
            v: Value tensor [context_len+1, num_kv_heads, head_dim]
            layer_id: Transformer layer index
-            k_cache: Key cache tensor [block_size, num_kv_heads, head_dim] (on GPU)
+            softmax_scale: Softmax scaling factor
-            num_valid_tokens: Number of valid tokens in this block
+
        Returns:
            Attention output [1, num_heads, head_dim]
        """
-        pass
+        from flash_attn.flash_attn_interface import flash_attn_varlen_func
        context_len = k.shape[0]
        cu_seqlens_q = torch.tensor([0, 1], dtype=torch.int32, device=q.device)
        cu_seqlens_k = torch.tensor([0, context_len], dtype=torch.int32, device=q.device)
        return flash_attn_varlen_func(
            q, k, v,
            cu_seqlens_q=cu_seqlens_q,
            cu_seqlens_k=cu_seqlens_k,
            max_seqlen_q=1,
            max_seqlen_k=context_len,
            softmax_scale=softmax_scale,
            causal=False,
        )
    def reset(self) -> None:
        """
@@ -189,32 +217,9 @@ class SparsePolicy(ABC):
        """
        pass
    def sparse_prefill_attention(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        layer_id: int,
    ) -> torch.Tensor:
        """
        Compute sparse attention for prefill phase.
        This method is called when supports_prefill=True and the policy
        is used for GPU-only sparse prefill (no CPU offload).
        Args:
            q: Query tensor [seq_len, num_heads, head_dim]
            k: Key tensor [seq_len, num_kv_heads, head_dim]
            v: Value tensor [seq_len, num_kv_heads, head_dim]
            layer_id: Current transformer layer index
        Returns:
            Attention output [seq_len, num_heads, head_dim]
        """
        raise NotImplementedError(
            f"{self.__class__.__name__} does not implement sparse_prefill_attention. "
            "Set supports_prefill=False or implement this method."
        )
    def __repr__(self) -> str:
        return f"{self.__class__.__name__}()"
 # Backward compatibility alias
 SparsePolicy = AttentionPolicy
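The shape of the new interface (optional `estimate()`, mandatory `compute_prefill()`, and the `SparsePolicy` alias) can be illustrated without PyTorch. The sketch below is a pure-Python, single-head reference and is not the project's implementation: `ReferenceFullPolicy` and the `Matrix` type are hypothetical, and the real policies operate on `[seq_len, num_heads, head_dim]` tensors through FlashAttention.

```python
import math
from abc import ABC, abstractmethod
from typing import Optional

Matrix = list[list[float]]  # [seq_len, head_dim], single head for brevity


class AttentionPolicy(ABC):
    def estimate(self, q: Matrix, k: Matrix, layer_id: int) -> Optional[Matrix]:
        """Default: no sparse mask, meaning full attention."""
        return None

    @abstractmethod
    def compute_prefill(self, q: Matrix, k: Matrix, v: Matrix,
                        layer_id: int, softmax_scale: float) -> Matrix:
        ...


class ReferenceFullPolicy(AttentionPolicy):
    """Hypothetical pure-Python causal attention, for illustration only."""

    def compute_prefill(self, q, k, v, layer_id, softmax_scale):
        out = []
        for i, q_i in enumerate(q):
            # Causal mask: token i attends to tokens 0..i only.
            scores = [softmax_scale * sum(a * b for a, b in zip(q_i, k[j]))
                      for j in range(i + 1)]
            peak = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)
            out.append([
                sum(w * v[j][d] for j, w in enumerate(weights)) / total
                for d in range(len(v[0]))
            ])
        return out


# Backward-compatible alias, as in the diff.
SparsePolicy = AttentionPolicy
```

Instantiating `AttentionPolicy` directly raises `TypeError` because `compute_prefill` is abstract, while a subclass that overrides only `compute_prefill` inherits the `estimate()` default of `None`.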
--- a/nanovllm/kvcache/sparse/quest.py
+++ b/nanovllm/kvcache/sparse/quest.py
@@ -11,7 +11,7 @@ import logging
 import torch
 from dataclasses import dataclass
 from typing import List, Tuple, Optional
-from .policy import SparsePolicy, PolicyContext
+from .policy import AttentionPolicy, PolicyContext
 logger = logging.getLogger(__name__)
@@ -137,7 +137,7 @@ class QuestConfig:
    """Always include this many recent blocks (last N blocks), in addition to Top-K."""
-class QuestPolicy(SparsePolicy):
+class QuestPolicy(AttentionPolicy):
    """
    Quest-style Top-K block selection using min/max key bounds.
@@ -317,6 +317,25 @@ class QuestPolicy(SparsePolicy):
        if self.metadata is not None:
            self.metadata.reset()
    def compute_prefill(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        layer_id: int,
        softmax_scale: float,
    ) -> torch.Tensor:
        """
        Quest does not support prefill - raises error.
        Quest is a decode-only policy for selective block loading.
        For prefill, use FullAttentionPolicy or XAttentionPolicy.
        """
        raise NotImplementedError(
            "QuestPolicy does not support prefill. "
            "Use FullAttentionPolicy or XAttentionPolicy for prefill."
        )
    def __repr__(self) -> str:
        return (
            f"QuestPolicy(topk={self.config.topk_blocks}, "
--- a/nanovllm/kvcache/sparse/xattn.py
+++ b/nanovllm/kvcache/sparse/xattn.py
@@ -4,48 +4,56 @@ XAttention sparse attention policy for nano-vllm.
 Implements the XAttention algorithm from COMPASS, using chunked estimation
 and block sparse attention for efficient long-context inference.
 Architecture:
    XAttention = Estimate (Triton) + Compute (BSA)
    - Estimate: xattn_estimate() computes block-level importance scores
    - Compute: block_sparse_attn_func() executes sparse attention
 Reference: COMPASS/compass/src/Xattention.py
 """
 import math
-from typing import List, Optional
+from typing import Optional
 import torch
 import torch.nn.functional as F
-from nanovllm.kvcache.sparse.policy import SparsePolicy, PolicyContext
+from nanovllm.kvcache.sparse.policy import AttentionPolicy
-from nanovllm.kvcache.sparse.kernels import (
+
-    flat_group_gemm_fuse_reshape,
+# BSA block size is fixed at 128 (hardcoded in block_sparse_attn)
-    softmax_fuse_block_sum,
+BSA_BLOCK_SIZE = 128
 )
 from nanovllm.kvcache.sparse.utils import find_blocks_chunked
-class XAttentionPolicy(SparsePolicy):
+class XAttentionPolicy(AttentionPolicy):
    """
    XAttention sparse prefill policy using chunked estimation + block sparse attention.
    This policy estimates sparse attention patterns by:
-    1. Chunked QK computation using Triton kernels
+    1. Chunked QK computation using Triton kernels (via nanovllm.ops.xattn)
    2. Block-wise softmax with importance scores
    3. Block selection based on threshold
-    4. Block sparse attention computation
+    4. Block sparse attention computation using MIT-HAN-LAB BSA library
    The key method is estimate() which calls xattn_estimate() from nanovllm.ops
    to compute the sparse attention mask.
    Note: Requires Triton >= 2.1.0 and CUDA SM 80+ (RTX 3090, A100, H100, etc.)
    BSA library: https://github.com/mit-han-lab/Block-Sparse-Attention
    """
    supports_prefill = True
-    supports_decode = False  # XAttention is prefill-only
+    supports_decode = True  # Uses default FlashAttention for decode
    requires_block_selection = False  # Only affects attention computation
    def __init__(
        self,
        stride: int = 8,
        threshold: float = 0.9,
-        chunk_size: Optional[int] = None,
+        block_size: int = 128,
        chunk_size: int = 16384,
        use_triton: bool = True,
        keep_sink: bool = False,
        keep_recent: bool = False,
        norm: float = 1.0,
        use_bsa: bool = True,
    ):
        """
        Initialize XAttention policy.
@@ -53,19 +61,28 @@ class XAttentionPolicy(SparsePolicy):
        Args:
            stride: Stride for reorganizing Q/K (default: 8)
            threshold: Block selection threshold, 0-1 (default: 0.9)
-            chunk_size: Chunk size for estimation (auto if None)
+            block_size: Block size for sparse attention (default: 128, must match BSA)
            chunk_size: Chunk size for estimation (default: 16384)
            use_triton: Use Triton kernels (requires SM 80+)
            keep_sink: Always keep first block (sink tokens)
            keep_recent: Always keep recent diagonal blocks
            norm: Normalization factor for attention scores
            use_bsa: Use Block Sparse Attention library (default: True)
        """
        self.stride = stride
        self.threshold = threshold
        self.block_size = block_size
        self.chunk_size = chunk_size
        self.use_triton = use_triton
        self.keep_sink = keep_sink
        self.keep_recent = keep_recent
        self.norm = norm
        self.use_bsa = use_bsa
        # BSA requires block_size = 128
        if self.use_bsa and self.block_size != BSA_BLOCK_SIZE:
            print(f"XAttention: BSA requires block_size=128, adjusting from {self.block_size}")
            self.block_size = BSA_BLOCK_SIZE
        # Check Triton availability
        if self.use_triton:
@@ -79,380 +96,207 @@ class XAttentionPolicy(SparsePolicy):
                self.use_triton = False
                print("XAttention: Triton not available. Falling back to PyTorch.")
-    def select_blocks(
+        # Check BSA availability
        if self.use_bsa:
            try:
                from block_sparse_attn import block_sparse_attn_func
            except ImportError:
                self.use_bsa = False
                print("XAttention: block_sparse_attn not available. Falling back to FlashAttention.")
    def estimate(
        self,
-        available_blocks: List[int],
+        q: torch.Tensor,
-        ctx: PolicyContext,
+        k: torch.Tensor,
-    ) -> List[int]:
+        layer_id: int,
    ) -> Optional[torch.Tensor]:
        """
-        Select blocks for decode phase.
+        Estimate sparse attention mask using XAttention algorithm.
-        XAttention is prefill-only, so this method is only used as a fallback.
+        Calls xattn_estimate() from nanovllm.ops.xattn to compute block-level
-        Returns all available blocks by default.
+        importance scores and generate a sparse boolean mask.
        Args:
            q: Query tensor [seq_len, num_heads, head_dim]
            k: Key tensor [seq_len, num_kv_heads, head_dim]
            layer_id: Transformer layer index
        Returns:
            sparse_mask: [batch, num_heads, q_blocks, k_blocks] boolean mask,
                        or None if estimation fails (fallback to full attention)
        """
-        # XAttention is prefill-only, but we need to implement this abstract method
+        try:
-        # Since requires_block_selection=False, this won't be called for loading
+            from nanovllm.ops.xattn import xattn_estimate
        return available_blocks
-    def sparse_prefill_attention(
+            seq_len, num_heads, head_dim = q.shape
            num_kv_heads = k.shape[1]
            # Convert to [batch, heads, seq, dim] format expected by xattn_estimate
            q_bhsd = q.unsqueeze(0).transpose(1, 2)  # [1, num_heads, seq_len, head_dim]
            k_bhsd = k.unsqueeze(0).transpose(1, 2)  # [1, num_kv_heads, seq_len, head_dim]
            # Handle GQA: expand k to match q heads for estimation
            if num_kv_heads != num_heads:
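                # GQA: query head i uses KV head i // repeat_factor, so each KV head
                # is repeated consecutively (same ordering as _block_sparse_attention)
                repeat_factor = num_heads // num_kv_heads
                k_bhsd = k_bhsd.repeat_interleave(repeat_factor, dim=1)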
            # Call xattn_estimate
            attn_sums, sparse_mask = xattn_estimate(
                q_bhsd, k_bhsd,
                block_size=self.block_size,
                stride=self.stride,
                norm=self.norm,
                threshold=self.threshold,
                chunk_size=self.chunk_size,
                use_triton=self.use_triton,
                causal=True,
                keep_sink=self.keep_sink,
                keep_recent=self.keep_recent,
            )
            return sparse_mask
        except Exception as e:
            # If estimation fails, return None to use full attention
            print(f"XAttention estimate failed: {e}, falling back to full attention")
            return None
    def compute_prefill(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        layer_id: int,
        softmax_scale: float,
    ) -> torch.Tensor:
        """
-        Compute XAttention sparse attention for prefill.
+        Compute XAttention sparse prefill attention.
        Flow:
        1. Call estimate() to get sparse mask
        2. If mask is None or BSA unavailable, use full FlashAttention
        3. Otherwise, use block_sparse_attn_func with mask
        Args:
            q: Query tensor [seq_len, num_heads, head_dim]
            k: Key tensor [seq_len, num_kv_heads, head_dim]
            v: Value tensor [seq_len, num_kv_heads, head_dim]
-            layer_id: Current transformer layer index
+            layer_id: Transformer layer index
            softmax_scale: Softmax scaling factor
        Returns:
            Attention output [seq_len, num_heads, head_dim]
        """
-        seq_len = q.shape[0]
+        # If BSA is disabled, use full attention directly (skip estimation)
-        num_heads = q.shape[1]
+        if not self.use_bsa:
-        head_dim = q.shape[2]
+            return self._full_attention(q, k, v, softmax_scale)
        # Step 1: Estimate sparse mask
        sparse_mask = self.estimate(q, k, layer_id)
        # Step 2: Compute attention
        if sparse_mask is None:
            # Estimation failed, fallback to full FlashAttention
            return self._full_attention(q, k, v, softmax_scale)
        # Use block sparse attention with mask
        return self._block_sparse_attention(q, k, v, sparse_mask, softmax_scale)
    def _block_sparse_attention(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        sparse_mask: torch.Tensor,
        softmax_scale: float,
    ) -> torch.Tensor:
        """
        Compute block sparse attention using MIT-HAN-LAB BSA library.
        Args:
            q: Query tensor [seq_len, num_heads, head_dim]
            k: Key tensor [seq_len, num_kv_heads, head_dim]
            v: Value tensor [seq_len, num_kv_heads, head_dim]
            sparse_mask: Block mask [batch, num_heads, q_blocks, k_blocks]
            softmax_scale: Softmax scaling factor
        Returns:
            Attention output [seq_len, num_heads, head_dim]
        """
        from block_sparse_attn import block_sparse_attn_func
        seq_len, num_heads, head_dim = q.shape
        num_kv_heads = k.shape[1]
-        # Use FlashAttention directly for CPU offload mode
+        # Handle GQA: expand K/V to match Q heads
-        # FlashAttention supports GQA natively
+        if num_kv_heads != num_heads:
-        try:
+            repeat_factor = num_heads // num_kv_heads
            k = k.repeat_interleave(repeat_factor, dim=1)
            v = v.repeat_interleave(repeat_factor, dim=1)
        # Cumulative sequence lengths (batch=1)
        cu_seqlens_q = torch.tensor([0, seq_len], dtype=torch.int32, device=q.device)
        cu_seqlens_k = torch.tensor([0, seq_len], dtype=torch.int32, device=q.device)
        # Head mask type: 1 for all heads using block sparse
        head_mask_type = torch.ones(num_heads, dtype=torch.int32, device=q.device)
        # Trim sparse_mask to actual block counts
        q_blocks = (seq_len + BSA_BLOCK_SIZE - 1) // BSA_BLOCK_SIZE
        k_blocks = (seq_len + BSA_BLOCK_SIZE - 1) // BSA_BLOCK_SIZE
        block_mask = sparse_mask[:, :, :q_blocks, :k_blocks].contiguous()
        # Call BSA
        attn_output = block_sparse_attn_func(
            q, k, v,
            cu_seqlens_q, cu_seqlens_k,
            head_mask_type,
            None,  # streaming_info (left_mask)
            block_mask,
            seq_len, seq_len,
            p_dropout=0.0,
            deterministic=True,
            softmax_scale=softmax_scale,
            is_causal=True,
        )
        return attn_output
    def _full_attention(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        softmax_scale: float,
    ) -> torch.Tensor:
        """
        Compute full causal attention using FlashAttention.
        Args:
            q: Query tensor [seq_len, num_heads, head_dim]
            k: Key tensor [seq_len, num_kv_heads, head_dim]
            v: Value tensor [seq_len, num_kv_heads, head_dim]
            softmax_scale: Softmax scaling factor
        Returns:
            Attention output [seq_len, num_heads, head_dim]
        """
        from flash_attn.flash_attn_interface import flash_attn_varlen_func
        seq_len = q.shape[0]
        cu_seqlens = torch.tensor([0, seq_len], dtype=torch.int32, device=q.device)
-            attn_output = flash_attn_varlen_func(
+        return flash_attn_varlen_func(
            q, k, v,
            cu_seqlens_q=cu_seqlens,
            cu_seqlens_k=cu_seqlens,
            max_seqlen_q=seq_len,
            max_seqlen_k=seq_len,
-                softmax_scale=1.0 / math.sqrt(head_dim),
+            softmax_scale=softmax_scale,
            causal=True,
        )
            return attn_output
        except Exception as e:
            # Fallback: PyTorch SDPA (supports GQA natively)
            print(f"XAttention: FlashAttention fallback failed ({e}), using PyTorch SDPA")
            attn_output = F.scaled_dot_product_attention(
                q, k, v,
                attn_mask=None,
                is_causal=True,
                scale=1.0 / math.sqrt(head_dim)
            )
            return attn_output
    def _xattn_offload_prefill(
        self,
        query_states: torch.Tensor,
        key_states: torch.Tensor,
        value_states: torch.Tensor,
        causal: bool = True,
    ) -> torch.Tensor:
        """
        Simplified XAttention prefill for CPU offload mode.
        Uses FlashAttention with full context since chunked estimation
        with full key_states requires special handling.
        """
        batch_size, num_heads, q_len, head_dim = query_states.shape
        _, _, k_len, _ = key_states.shape
        # Use FlashAttention with full context
        # In offload mode, keys are already on CPU and loaded as needed
        try:
            from flash_attn.flash_attn_interface import flash_attn_varlen_func
            # Convert to [seq, heads, dim] format
            q = query_states.squeeze(0).transpose(0, 1)  # [q_len, num_heads, head_dim]
            k = key_states.squeeze(0).transpose(0, 1)  # [k_len, num_heads, head_dim]
            v = value_states.squeeze(0).transpose(0, 1)  # [k_len, num_heads, head_dim]
            cu_seqlens_q = torch.tensor([0, q_len], dtype=torch.int32, device=q.device)
            cu_seqlens_k = torch.tensor([0, k_len], dtype=torch.int32, device=q.device)
            attn_output = flash_attn_varlen_func(
                q, k, v,
                cu_seqlens_q=cu_seqlens_q,
                cu_seqlens_k=cu_seqlens_k,
                max_seqlen_q=q_len,
                max_seqlen_k=k_len,
                softmax_scale=1.0 / math.sqrt(head_dim),
                causal=causal,
            )
            # Convert back to [batch, seq, heads, dim]
            attn_output = attn_output.unsqueeze(0).transpose(1, 2)  # [1, q_len, num_heads, head_dim]
            return attn_output
        except Exception as e:
            # Final fallback: PyTorch SDPA
            print(f"XAttention: FlashAttention fallback failed ({e}), using PyTorch SDPA")
            with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
                attn_output = F.scaled_dot_product_attention(
                    query_states, key_states, value_states,
                    attn_mask=None,
                    is_causal=causal,
                    scale=1.0 / math.sqrt(head_dim)
                )
            return attn_output
    def _xattn_prefill(
        self,
        query_states: torch.Tensor,
        key_states: torch.Tensor,
        value_states: torch.Tensor,
        stride: int,
        norm: float,
        threshold: float,
        block_size: int = 128,
        use_triton: bool = True,
        causal: bool = True,
        chunk_size: Optional[int] = None,
        keep_sink: bool = False,
        keep_recent: bool = False,
    ) -> torch.Tensor:
        """
        XAttention prefill implementation.
        Args:
            query_states: [batch, num_heads, q_len, head_dim]
            key_states: [batch, num_heads, k_len, head_dim]
            value_states: [batch, num_heads, k_len, head_dim]
            ... other params
        Returns:
            Attention output [batch, q_len, num_heads, head_dim]
        """
        batch_size, num_heads, k_len, head_dim = key_states.shape
        _, _, q_len, _ = query_states.shape
        # Auto-compute chunk_size if not specified
        if chunk_size is None:
            chunk_size = int(
                max(
                    min(
                        max(2048, 1 << (k_len - 1).bit_length()),
                        128 * 1024 * 2048 // (1 << (k_len - 1).bit_length()),
                    ),
                    2048,
                )
            )
        # Phase 1: Estimate sparse pattern
        attn_sums, approx_simple_mask = self._xattn_estimate(
            query_states,
            key_states,
            block_size=block_size,
            stride=stride,
            norm=norm,
            threshold=threshold,
            chunk_size=chunk_size,
            use_triton=use_triton,
            causal=causal,
            keep_sink=keep_sink,
            keep_recent=keep_recent,
        )
        # Phase 2: Block sparse attention
        # For now, use FlashAttention as fallback since block_sparse_attn_func may not be available
        attn_output = self._block_sparse_attention_fallback(
            query_states, key_states, value_states,
            approx_simple_mask, block_size, q_len, k_len
        )
        return attn_output
    def _xattn_estimate(
        self,
        query_states: torch.Tensor,
        key_states: torch.Tensor,
        block_size: int,
        stride: int,
        norm: float = 1,
        softmax: bool = True,
        threshold: float = 0.9,
        chunk_size: int = 16384,
        use_triton: bool = True,
        causal: bool = True,
        keep_sink: bool = False,
        keep_recent: bool = False,
    ) -> torch.Tensor:
        """
        Estimate sparse attention pattern using chunked computation.
        Returns:
            attn_sums: [batch, heads, q_blocks, k_blocks] - importance scores
            simple_masks: [batch, heads, q_blocks, k_blocks] - boolean masks
        """
        batch_size, num_kv_head, k_len, head_dim = key_states.shape
        batch_size, num_q_head, q_len, head_dim = query_states.shape
        k_num_to_pad = ((k_len + chunk_size - 1) // chunk_size) * chunk_size - k_len
        q_num_to_pad = ((q_len + chunk_size - 1) // chunk_size) * chunk_size - q_len
        k_chunk_num = (k_len + k_num_to_pad) // chunk_size
        k_block_num = (k_len + k_num_to_pad) // block_size
        q_chunk_num = (q_len + q_num_to_pad) // chunk_size
        q_block_num = (q_len + q_num_to_pad) // block_size
        # Pad inputs
        if k_num_to_pad > 0:
            pad_key_states = F.pad(key_states, (0, 0, 0, k_num_to_pad), value=0)
        else:
            pad_key_states = key_states
        if q_num_to_pad > 0:
            pad_query_states = F.pad(query_states, (0, 0, 0, q_num_to_pad), value=0)
        else:
            pad_query_states = query_states
        reshaped_chunk_size = chunk_size // stride
        reshaped_block_size = block_size // stride
        k_reshaped_seq_len = (k_len + k_num_to_pad) // stride
        attn_sum_list = []
        simple_mask_list = []
        for chunk_idx in range(q_chunk_num):
            if use_triton:
                # Triton GEMM + Softmax
                attn_weights_slice = flat_group_gemm_fuse_reshape(
                    pad_query_states[:, :, (chunk_idx * reshaped_chunk_size) * stride : (chunk_idx * reshaped_chunk_size + reshaped_chunk_size) * stride, :],
                    pad_key_states,
                    stride,
                    (k_block_num - q_block_num) * reshaped_block_size + chunk_idx * reshaped_chunk_size,
                    (k_block_num - q_block_num) * reshaped_block_size + chunk_idx * reshaped_chunk_size + reshaped_chunk_size,
                    is_causal=causal,
                )
                attn_sum = softmax_fuse_block_sum(
                    attn_weights_slice,
                    reshaped_block_size,
                    min(4096, reshaped_block_size),
                    (k_block_num - q_block_num) * reshaped_block_size + chunk_idx * reshaped_chunk_size,
                    (k_block_num - q_block_num) * reshaped_block_size + chunk_idx * reshaped_chunk_size + reshaped_chunk_size,
                    k_reshaped_seq_len - (k_num_to_pad // stride),
                    1.4426950408889634 / math.sqrt(head_dim) / stride / norm,
                    is_causal=causal,
                )
            else:
                # PyTorch fallback
                chunk_size_actual = reshaped_chunk_size
                chunk_start = chunk_idx * chunk_size_actual
                chunk_end = chunk_start + chunk_size_actual
                chunked_query = pad_query_states[:, :, chunk_start * stride:chunk_end * stride:stride, :]
                attn_weights_slice = torch.matmul(chunked_query, pad_key_states.transpose(2, 3))
                attn_weights_slice = attn_weights_slice / math.sqrt(head_dim) / stride / norm
                if causal:
                    causal_mask = torch.zeros((batch_size, num_q_head, chunk_size_actual, chunk_size_actual * k_chunk_num), device=key_states.device)
                    causal_mask[:, :, :, -(k_num_to_pad // stride):] = float("-inf")
                    # ... more causal mask logic ...
                    attn_weights_slice = attn_weights_slice + causal_mask
                attn_weights_slice = F.softmax(attn_weights_slice, dim=-1, dtype=torch.float32)
                attn_sum = attn_weights_slice.view(batch_size, num_q_head, chunk_size_actual // reshaped_block_size, reshaped_block_size, -1).sum(dim=-1).sum(dim=-2)
            # Find blocks based on threshold
            simple_mask = find_blocks_chunked(
                attn_sum,
                k_block_num - q_block_num + chunk_idx * (reshaped_chunk_size // reshaped_block_size),
                threshold,
                None,
                decoding=False,
                mode="prefill",
                causal=causal,
            )
            attn_sum_list.append(attn_sum)
            simple_mask_list.append(simple_mask)
        attn_sums = torch.cat(attn_sum_list, dim=-2)
        simple_masks = torch.cat(simple_mask_list, dim=-2)
        # Apply causal mask to block masks
        if causal:
            simple_masks[:, :, -q_block_num:, -q_block_num:] = torch.where(
                torch.tril(torch.ones(q_block_num, q_block_num, dtype=bool, device=key_states.device), diagonal=0),
                simple_masks[:, :, -q_block_num:, -q_block_num:],
                False,
            )
        if keep_sink:
            simple_masks[:, :, 0, :] = True
        if keep_recent:
            eye_matrix = torch.eye(q_block_num, device=simple_masks.device, dtype=bool)
            eye_matrix_expanded = eye_matrix.unsqueeze(0).unsqueeze(0).expand(1, num_q_head, q_block_num, q_block_num)
            simple_masks[:, :, -q_block_num:, -q_block_num:] = torch.where(
                eye_matrix_expanded, True, simple_masks[:, :, -q_block_num:, -q_block_num:]
            )
        return attn_sums, simple_masks
    def _block_sparse_attention_fallback(
        self,
        query_states: torch.Tensor,
        key_states: torch.Tensor,
        value_states: torch.Tensor,
        mask: torch.Tensor,
        block_size: int,
        q_len: int,
        k_len: int,
    ) -> torch.Tensor:
        """
        Fallback implementation using FlashAttention.
        Since block_sparse_attn_func may not be available in all environments,
        this uses standard FlashAttention with full attention.
        """
        try:
            from flash_attn.flash_attn_interface import flash_attn_varlen_func
            batch_size, num_heads, _, head_dim = query_states.shape
            # Convert to [seq, heads, dim] format
            q = query_states.squeeze(0).transpose(0, 1)  # [q_len, num_heads, head_dim]
            k = key_states.squeeze(0).transpose(0, 1)  # [k_len, num_heads, head_dim]
            v = value_states.squeeze(0).transpose(0, 1)  # [k_len, num_heads, head_dim]
            cu_seqlens_q = torch.tensor([0, q_len], dtype=torch.int32, device=q.device)
            cu_seqlens_k = torch.tensor([0, k_len], dtype=torch.int32, device=q.device)
            attn_output = flash_attn_varlen_func(
                q, k, v,
                cu_seqlens_q=cu_seqlens_q,
                cu_seqlens_k=cu_seqlens_k,
                max_seqlen_q=q_len,
                max_seqlen_k=k_len,
                softmax_scale=1.0 / math.sqrt(head_dim),
                causal=True,
            )
            # Convert back to [batch, seq, heads, dim]
            attn_output = attn_output.unsqueeze(0).transpose(1, 2)
            return attn_output
        except Exception as e:
            # Final fallback: PyTorch SDPA
            print(f"XAttention: FlashAttention fallback failed ({e}), using PyTorch SDPA")
            with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
                attn_output = F.scaled_dot_product_attention(
                    query_states, key_states, value_states,
                    attn_mask=None,
                    is_causal=True,
                    scale=1.0 / math.sqrt(query_states.shape[-1])
                )
            return attn_output
    def reset(self) -> None:
        """Reset policy state (no state to reset for XAttention)."""
        pass
@@ -461,4 +305,6 @@ class XAttentionPolicy(SparsePolicy):
        return (f"XAttentionPolicy("
                f"stride={self.stride}, "
                f"threshold={self.threshold}, "
-                f"use_triton={self.use_triton})")
+                f"block_size={self.block_size}, "
                f"use_triton={self.use_triton}, "
                f"use_bsa={self.use_bsa})")
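Review note on the GQA expansion used in `estimate()` and `_block_sparse_attention()`: under the usual GQA convention, query head `i` reads KV head `i // repeat_factor`, so consecutive query heads share one KV head. The pure-Python sketch below uses head labels instead of tensors, and its helper names are hypothetical. It contrasts the tiled order that `Tensor.repeat` yields along the head dimension with the interleaved order that `repeat_interleave` yields.

```python
def expand_tiled(kv_heads, factor):
    """Head order produced by tiling the whole head axis `factor` times."""
    return kv_heads * factor


def expand_interleaved(kv_heads, factor):
    """Head order produced by repeating each head `factor` times in place."""
    return [head for head in kv_heads for _ in range(factor)]


kv_heads = ["kv0", "kv1"]   # 2 KV heads
factor = 2                  # 4 query heads / 2 KV heads

# Expected pairing under GQA: query head i -> KV head i // factor
expected = [kv_heads[i // factor] for i in range(len(kv_heads) * factor)]

tiled = expand_tiled(kv_heads, factor)
interleaved = expand_interleaved(kv_heads, factor)
```

Only the interleaved order lines up with the `i // factor` pairing. A mask estimated against tiled keys would describe a different query/KV pairing than the one the attention kernel uses.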
--- a/nanovllm/layers/attention.py
+++ b/nanovllm/layers/attention.py
@@ -98,10 +98,10 @@ class Attention(nn.Module):
                                           max_seqlen_q=context.max_seqlen_q, cu_seqlens_q=context.cu_seqlens_q,
                                           max_seqlen_k=context.max_seqlen_k, cu_seqlens_k=context.cu_seqlens_k,
                                           softmax_scale=self.scale, causal=True, block_table=context.block_tables)
-            elif context.sparse_prefill_policy is not None:
+            elif context.attention_policy is not None:
-                # Sparse prefill (GPU-only) - delegate to policy
+                # Attention via policy (GPU-only) - delegate to policy
-                o = context.sparse_prefill_policy.sparse_prefill_attention(
+                o = context.attention_policy.compute_prefill(
-                    q, k, v, self.layer_id
+                    q, k, v, self.layer_id, softmax_scale=self.scale
                )
            else:
                o = flash_attn_varlen_func(q, k, v,
--- a/nanovllm/ops/__init__.py
+++ b/nanovllm/ops/__init__.py
@@ -0,0 +1,38 @@
 """
 Operators module for nano-vLLM.
 This module contains low-level attention operators and kernels.
 """
 from nanovllm.ops.chunked_attention import (
    flash_attn_with_lse,
    merge_attention_outputs,
    chunked_attention_varlen,
    ChunkedPrefillState,
 )
 from nanovllm.ops.xattn import (
    xattn_estimate,
    xattn_estimate_chunked,
    flat_group_gemm_fuse_reshape,
    softmax_fuse_block_sum,
    find_blocks_chunked,
    create_causal_mask,
    compute_sparsity,
 )
 __all__ = [
    # chunked_attention
    "flash_attn_with_lse",
    "merge_attention_outputs",
    "chunked_attention_varlen",
    "ChunkedPrefillState",
    # xattn
    "xattn_estimate",
    "xattn_estimate_chunked",
    "flat_group_gemm_fuse_reshape",
    "softmax_fuse_block_sum",
    "find_blocks_chunked",
    "create_causal_mask",
    "compute_sparsity",
 ]
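Review note: the exports above include `find_blocks_chunked`, whose body is not part of this section. As a rough illustration of what "block selection based on threshold" can mean, the sketch below keeps the highest-scoring key blocks for one query block until their cumulative share of the score mass reaches `threshold`. This is an assumption about the selection rule, and `select_blocks_by_mass` is a hypothetical name, not the library function.

```python
def select_blocks_by_mass(scores, threshold):
    """Return a keep/drop flag per key block for a single query block.

    Blocks are visited from highest to lowest score and kept until the
    kept scores account for at least `threshold` of the total.
    """
    total = sum(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = [False] * len(scores)
    running = 0.0
    for idx in order:
        if running >= threshold * total:
            break
        keep[idx] = True
        running += scores[idx]
    return keep


# Block-level importance scores for one query block (already softmax-summed)
block_scores = [0.5, 0.05, 0.3, 0.15]
mask = select_blocks_by_mass(block_scores, threshold=0.9)
kept_fraction = sum(mask) / len(mask)
```

With these scores the three largest blocks are kept and the smallest is dropped, so the sparse kernel would skip one of four key blocks for this query block.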
--- a/nanovllm/ops/chunked_attention.py
+++ b/nanovllm/ops/chunked_attention.py
@@ -0,0 +1,624 @@
 """
 Chunked attention implementation for CPU KV cache offloading.
 This module implements flash attention with LSE (log-sum-exp) output,
 enabling proper online softmax merging for chunked prefill.
 Key functions:
 - flash_attn_with_lse: Flash attention that returns output and LSE
 - merge_attention_outputs: Merge outputs from multiple KV chunks
 - chunked_prefill_attention: High-level interface for chunked attention
 """
 import math
 import torch
 import triton
 import triton.language as tl
 from typing import Tuple, List, Optional
@triton.heuristics(
    {
        "EVEN_M": lambda args: args["seqlen_q"] % args["BLOCK_M"] == 0,
        "EVEN_N": lambda args: args["seqlen_k"] % args["BLOCK_N"] == 0,
        "EVEN_HEADDIM": lambda args: args["headdim"] == args["BLOCK_HEADDIM"],
    }
 )
@triton.jit
 def _fwd_kernel_with_lse(
    Q,
    K,
    V,
    Out,
    Lse,
    softmax_scale,
    stride_qb,
    stride_qh,
    stride_qm,
    stride_kb,
    stride_kh,
    stride_kn,
    stride_vb,
    stride_vh,
    stride_vn,
    stride_ob,
    stride_oh,
    stride_om,
    nheads,
    seqlen_q,
    seqlen_k,
    seqlen_q_rounded,
    headdim,
    CACHE_KEY_SEQLEN_Q,
    CACHE_KEY_SEQLEN_K,
    IS_CAUSAL: tl.constexpr,
    BLOCK_HEADDIM: tl.constexpr,
    EVEN_M: tl.constexpr,
    EVEN_N: tl.constexpr,
    EVEN_HEADDIM: tl.constexpr,
    BLOCK_M: tl.constexpr,
    BLOCK_N: tl.constexpr,
 ):
    """
    Flash attention forward kernel with LSE output.
    Implements standard Flash Attention online softmax algorithm:
    - m_i: running max of attention scores
    - l_i: running sum of exp(scores - m_i)
    - acc_o: running sum of exp(scores - m_i) @ V (unnormalized)
    Final output: acc_o / l_i
    Final LSE: m_i + log(l_i)
    """
    start_m = tl.program_id(0)
    off_hb = tl.program_id(1)
    off_b = off_hb // nheads
    off_h = off_hb % nheads
    offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_d = tl.arange(0, BLOCK_HEADDIM)
    # Pointers
    q_ptrs = (
        Q + off_b * stride_qb + off_h * stride_qh + (offs_m[:, None] * stride_qm + offs_d[None, :])
    )
    k_ptrs = (
        K + off_b * stride_kb + off_h * stride_kh + (offs_n[:, None] * stride_kn + offs_d[None, :])
    )
    v_ptrs = (
        V + off_b * stride_vb + off_h * stride_vh + (offs_n[:, None] * stride_vn + offs_d[None, :])
    )
    # Initialize running statistics
    m_i = tl.zeros([BLOCK_M], dtype=tl.float32) - float("inf")  # running max
    l_i = tl.zeros([BLOCK_M], dtype=tl.float32)  # running sum of exp
    acc_o = tl.zeros([BLOCK_M, BLOCK_HEADDIM], dtype=tl.float32)  # running output (unnormalized)
    # Load Q (once per block)
    if EVEN_M & EVEN_N:
        if EVEN_HEADDIM:
            q = tl.load(q_ptrs)
        else:
            q = tl.load(q_ptrs, mask=offs_d[None, :] < headdim, other=0.0)
    else:
        if EVEN_HEADDIM:
            q = tl.load(q_ptrs, mask=offs_m[:, None] < seqlen_q, other=0.0)
        else:
            q = tl.load(
                q_ptrs, mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0
            )
    # Loop over K, V blocks
    end_n = seqlen_k if not IS_CAUSAL else tl.minimum((start_m + 1) * BLOCK_M, seqlen_k)
    for start_n in range(0, end_n, BLOCK_N):
        start_n = tl.multiple_of(start_n, BLOCK_N)
        # Load K
        if EVEN_N & EVEN_M:
            if EVEN_HEADDIM:
                k = tl.load(k_ptrs + start_n * stride_kn)
            else:
                k = tl.load(k_ptrs + start_n * stride_kn, mask=offs_d[None, :] < headdim, other=0.0)
        else:
            if EVEN_HEADDIM:
                k = tl.load(
                    k_ptrs + start_n * stride_kn,
                    mask=(start_n + offs_n)[:, None] < seqlen_k,
                    other=0.0,
                )
            else:
                k = tl.load(
                    k_ptrs + start_n * stride_kn,
                    mask=((start_n + offs_n)[:, None] < seqlen_k) & (offs_d[None, :] < headdim),
                    other=0.0,
                )
        # Compute QK^T * scale
        qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
        qk += tl.dot(q, tl.trans(k))
        qk *= softmax_scale
        # Apply masks
        if not EVEN_N:
            qk += tl.where((start_n + offs_n)[None, :] < seqlen_k, 0, float("-inf"))
        if IS_CAUSAL:
            qk += tl.where(offs_m[:, None] >= (start_n + offs_n)[None, :], 0, float("-inf"))
        # Online softmax: compute block max
        m_ij = tl.max(qk, 1)  # [BLOCK_M]
        # New running max
        m_new = tl.maximum(m_i, m_ij)  # [BLOCK_M]
        # Rescale factor for previous accumulator
        alpha = tl.exp(m_i - m_new)  # [BLOCK_M]
        # Compute P = exp(qk - m_new)
        p = tl.exp(qk - m_new[:, None])  # [BLOCK_M, BLOCK_N]
        # Sum of current block
        l_ij = tl.sum(p, 1)  # [BLOCK_M]
        # Update running sum: l_new = l_i * alpha + l_ij
        l_new = l_i * alpha + l_ij
        # Rescale previous output and add new contribution
        acc_o = acc_o * alpha[:, None]
        # Load V
        if EVEN_N & EVEN_M:
            if EVEN_HEADDIM:
                v = tl.load(v_ptrs + start_n * stride_vn)
            else:
                v = tl.load(v_ptrs + start_n * stride_vn, mask=offs_d[None, :] < headdim, other=0.0)
        else:
            if EVEN_HEADDIM:
                v = tl.load(
                    v_ptrs + start_n * stride_vn,
                    mask=(start_n + offs_n)[:, None] < seqlen_k,
                    other=0.0,
                )
            else:
                v = tl.load(
                    v_ptrs + start_n * stride_vn,
                    mask=((start_n + offs_n)[:, None] < seqlen_k) & (offs_d[None, :] < headdim),
                    other=0.0,
                )
        # acc_o += P @ V
        p = p.to(v.dtype)
        acc_o += tl.dot(p, v)
        # Update running statistics
        m_i = m_new
        l_i = l_new
    # Final normalization: output = acc_o / l_i
    acc_o = acc_o / l_i[:, None]
    # Compute LSE = m_i + log(l_i)
    lse_i = m_i + tl.log(l_i)
    # Store LSE
    lse_ptrs = Lse + off_hb * seqlen_q_rounded + offs_m
    if EVEN_M:
        tl.store(lse_ptrs, lse_i)
    else:
        tl.store(lse_ptrs, lse_i, mask=offs_m < seqlen_q)
    # Store output
    out_ptrs = (
        Out
        + off_b * stride_ob
        + off_h * stride_oh
        + (offs_m[:, None] * stride_om + offs_d[None, :])
    )
    if EVEN_M:
        if EVEN_HEADDIM:
            tl.store(out_ptrs, acc_o)
        else:
            tl.store(out_ptrs, acc_o, mask=offs_d[None, :] < headdim)
    else:
        if EVEN_HEADDIM:
            tl.store(out_ptrs, acc_o, mask=offs_m[:, None] < seqlen_q)
        else:
            tl.store(
                out_ptrs, acc_o, mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim)
            )
 def flash_attn_with_lse(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    softmax_scale: Optional[float] = None,
    causal: bool = False,
 ) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Flash attention forward pass that returns both output and LSE.
    Uses flash_attn library which natively supports GQA without memory overhead.
    Args:
        q: Query tensor [batch, seqlen_q, nheads_q, headdim]
        k: Key tensor [batch, seqlen_k, nheads_kv, headdim]
        v: Value tensor [batch, seqlen_k, nheads_kv, headdim]
        softmax_scale: Scaling factor (default: 1/sqrt(headdim))
        causal: Whether to apply causal masking
    Returns:
        out: Output tensor [batch, seqlen_q, nheads_q, headdim]
        lse: Log-sum-exp tensor [batch, nheads_q, seqlen_q]
    """
    from flash_attn.flash_attn_interface import flash_attn_func
    batch, seqlen_q, nheads_q, headdim = q.shape
    _, seqlen_k, nheads_kv, _ = k.shape
    if softmax_scale is None:
        softmax_scale = 1.0 / math.sqrt(headdim)
    # Use flash_attn_func, which natively supports GQA (no memory overhead).
    # By default it returns only the output; passing return_attn_probs=True
    # makes it return (out, softmax_lse, S_dmask), which is how we obtain the LSE.
    out, lse, _ = flash_attn_func(
        q, k, v,
        softmax_scale=softmax_scale,
        causal=causal,
        return_attn_probs=True,  # This makes it return (out, softmax_lse, S_dmask)
    )
    # lse shape from flash_attn: [batch, nheads_q, seqlen_q_rounded]
    # Trim to actual seqlen_q
    lse = lse[:, :, :seqlen_q]
    return out, lse
@triton.jit
 def _merge_lse_kernel(
    lse1_ptr, lse2_ptr, lse_out_ptr,
    num_elements: tl.constexpr,
    BLOCK_SIZE: tl.constexpr,
 ):
    """Fused kernel for merging LSE values.
    IMPORTANT: Uses fp32 for exp/log operations to avoid precision loss.
    bf16 has only 7 bits of mantissa, causing significant errors in exp/log.
    """
    # Each program handles BLOCK_SIZE elements
    pid = tl.program_id(0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < num_elements
    # Load lse values and convert to fp32 for precision
    lse1 = tl.load(lse1_ptr + offsets, mask=mask).to(tl.float32)
    lse2 = tl.load(lse2_ptr + offsets, mask=mask).to(tl.float32)
    # Compute max for numerical stability (in fp32)
    max_lse = tl.maximum(lse1, lse2)
    # Compute exp(lse - max_lse) in fp32
    exp1 = tl.exp(lse1 - max_lse)
    exp2 = tl.exp(lse2 - max_lse)
    # Compute merged LSE: max_lse + log(exp1 + exp2) in fp32
    lse_merged = max_lse + tl.log(exp1 + exp2)
    # Store result (convert back to original dtype)
    tl.store(lse_out_ptr + offsets, lse_merged, mask=mask)
@triton.jit
 def _merge_output_kernel(
    o1_ptr, o2_ptr, lse1_ptr, lse2_ptr, o_out_ptr,
    batch: tl.constexpr, seqlen_q: tl.constexpr, nheads: tl.constexpr, headdim: tl.constexpr,
    BLOCK_SIZE: tl.constexpr,
 ):
    """Fused kernel for merging attention outputs.
    IMPORTANT: Uses fp32 for exp operations and weighted sum to avoid precision loss.
    This is critical for numerical accuracy in chunked attention.
    """
    # Each program handles BLOCK_SIZE elements along headdim for one (batch, seqlen_q, nheads) position
    pid_batch = tl.program_id(0)
    pid_seq = tl.program_id(1)
    pid_head = tl.program_id(2)
    # Compute LSE index: [batch, nheads, seqlen_q]
    lse_idx = pid_batch * nheads * seqlen_q + pid_head * seqlen_q + pid_seq
    # Load LSE values and convert to fp32 for precision
    lse1 = tl.load(lse1_ptr + lse_idx).to(tl.float32)
    lse2 = tl.load(lse2_ptr + lse_idx).to(tl.float32)
    # Compute max and scaling factors in fp32
    max_lse = tl.maximum(lse1, lse2)
    exp1 = tl.exp(lse1 - max_lse)
    exp2 = tl.exp(lse2 - max_lse)
    sum_exp = exp1 + exp2
    # Process headdim in chunks
    for d_offset in range(0, headdim, BLOCK_SIZE):
        d_idx = d_offset + tl.arange(0, BLOCK_SIZE)
        mask = d_idx < headdim
        # Compute output index: [batch, seqlen_q, nheads, headdim]
        base_idx = (pid_batch * seqlen_q * nheads * headdim +
                    pid_seq * nheads * headdim +
                    pid_head * headdim)
        o_idx = base_idx + d_idx
        # Load o1, o2 and convert to fp32 for weighted sum
        o1_val = tl.load(o1_ptr + o_idx, mask=mask, other=0.0).to(tl.float32)
        o2_val = tl.load(o2_ptr + o_idx, mask=mask, other=0.0).to(tl.float32)
        # Compute merged output in fp32: (o1 * exp1 + o2 * exp2) / sum_exp
        o_merged = (o1_val * exp1 + o2_val * exp2) / sum_exp
        # Store result (Triton will convert back to original dtype)
        tl.store(o_out_ptr + o_idx, o_merged, mask=mask)
 def merge_attention_outputs(
    o1: torch.Tensor,
    lse1: torch.Tensor,
    o2: torch.Tensor,
    lse2: torch.Tensor,
 ) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Merge two attention outputs using online softmax (Triton fused kernel).
    This implements the online softmax merging formula:
    - m_new = max(lse1, lse2)
    - o_new = (exp(lse1 - m_new) * o1 + exp(lse2 - m_new) * o2) / (exp(lse1 - m_new) + exp(lse2 - m_new))
    - lse_new = m_new + log(exp(lse1 - m_new) + exp(lse2 - m_new))
    Args:
        o1: First output [batch, seqlen_q, nheads, headdim]
        lse1: First LSE [batch, nheads, seqlen_q]
        o2: Second output [batch, seqlen_q, nheads, headdim]
        lse2: Second LSE [batch, nheads, seqlen_q]
    Returns:
        o_merged: Merged output [batch, seqlen_q, nheads, headdim]
        lse_merged: Merged LSE [batch, nheads, seqlen_q]
    """
    batch, seqlen_q, nheads, headdim = o1.shape
    # Allocate output tensors
    o_merged = torch.empty_like(o1)
    lse_merged = torch.empty_like(lse1)
    # Launch LSE merge kernel
    num_lse_elements = batch * nheads * seqlen_q
    BLOCK_SIZE_LSE = 256
    grid_lse = (triton.cdiv(num_lse_elements, BLOCK_SIZE_LSE),)
    _merge_lse_kernel[grid_lse](
        lse1, lse2, lse_merged,
        num_lse_elements,
        BLOCK_SIZE=BLOCK_SIZE_LSE,
    )
    # Launch output merge kernel
    BLOCK_SIZE = 128
    grid_output = (batch, seqlen_q, nheads)
    _merge_output_kernel[grid_output](
        o1, o2, lse1, lse2, o_merged,
        batch, seqlen_q, nheads, headdim,
        BLOCK_SIZE=BLOCK_SIZE,
    )
    return o_merged, lse_merged
 def chunked_attention_varlen(
    q: torch.Tensor,
    kv_chunks: List[Tuple[torch.Tensor, torch.Tensor]],
    cu_seqlens_q: torch.Tensor,
    cu_seqlens_k_list: List[torch.Tensor],
    max_seqlen_q: int,
    max_seqlen_k_list: List[int],
    softmax_scale: Optional[float] = None,
    causal_mask_per_chunk: Optional[List[bool]] = None,
 ) -> torch.Tensor:
    """
    Compute attention with KV split across multiple chunks.
    This is the core function for chunked prefill. It computes attention
    against each KV chunk and merges results using online softmax.
    For causal attention with chunked KV (this matches the default below):
    - Last chunk (current tokens): Apply causal mask
    - Earlier chunks: No causal mask (all previous tokens are valid context)
    Args:
        q: Query tensor [total_q_tokens, nheads, headdim]
        kv_chunks: List of (K, V) tuples, each [batch, seqlen_k_i, nheads, headdim]
        cu_seqlens_q: Cumulative sequence lengths for Q [batch+1]
        cu_seqlens_k_list: List of cumulative sequence lengths for each KV chunk
        max_seqlen_q: Maximum query sequence length
        max_seqlen_k_list: List of maximum key sequence lengths for each chunk
        softmax_scale: Scaling factor
        causal_mask_per_chunk: Whether to apply causal mask for each chunk
    Returns:
        out: Output tensor [total_q_tokens, nheads, headdim]
    """
    if len(kv_chunks) == 0:
        raise ValueError("Need at least one KV chunk")
    nheads = q.shape[1]
    headdim = q.shape[2]
    batch = cu_seqlens_q.shape[0] - 1
    if softmax_scale is None:
        softmax_scale = 1.0 / math.sqrt(headdim)
    if causal_mask_per_chunk is None:
        # Default: causal for last chunk only
        causal_mask_per_chunk = [False] * (len(kv_chunks) - 1) + [True]
    # Initialize accumulated output and LSE
    accumulated_o = None
    accumulated_lse = None
    for chunk_idx, (k_chunk, v_chunk) in enumerate(kv_chunks):
        is_causal = causal_mask_per_chunk[chunk_idx]
        # Reshape Q for batch processing
        # For varlen, we need to handle each sequence separately
        # For simplicity, assume single sequence (batch=1) for now
        q_batched = q.unsqueeze(0)  # [1, total_q, nheads, headdim]
        # Compute attention for this chunk
        chunk_o, chunk_lse = flash_attn_with_lse(
            q_batched,
            k_chunk,
            v_chunk,
            softmax_scale=softmax_scale,
            causal=is_causal,
        )
        # Merge with accumulated
        if accumulated_o is None:
            accumulated_o = chunk_o
            accumulated_lse = chunk_lse
        else:
            accumulated_o, accumulated_lse = merge_attention_outputs(
                accumulated_o, accumulated_lse,
                chunk_o, chunk_lse,
            )
    # Remove batch dimension
    return accumulated_o.squeeze(0)
 class ChunkedPrefillState:
    """
    State for tracking chunked prefill progress.
    This class maintains the accumulated attention output and LSE
    across multiple prefill chunks.
    """
    def __init__(self, num_layers: int, dtype: torch.dtype, device: torch.device):
        self.num_layers = num_layers
        self.dtype = dtype
        self.device = device
        # Per-layer accumulated outputs
        # Each entry: (accumulated_output, accumulated_lse) or None
        self.layer_states: List[Optional[Tuple[torch.Tensor, torch.Tensor]]] = [
            None for _ in range(num_layers)
        ]
        # Track which chunks have been processed
        self.processed_chunks: int = 0
    def update_layer(
        self,
        layer_id: int,
        chunk_output: torch.Tensor,
        chunk_lse: torch.Tensor,
    ):
        """Update accumulated state for a layer with a new chunk's output."""
        if self.layer_states[layer_id] is None:
            self.layer_states[layer_id] = (chunk_output, chunk_lse)
        else:
            acc_o, acc_lse = self.layer_states[layer_id]
            merged_o, merged_lse = merge_attention_outputs(
                acc_o, acc_lse,
                chunk_output, chunk_lse,
            )
            self.layer_states[layer_id] = (merged_o, merged_lse)
    def get_layer_output(self, layer_id: int) -> Optional[torch.Tensor]:
        """Get the final accumulated output for a layer."""
        if self.layer_states[layer_id] is None:
            return None
        return self.layer_states[layer_id][0]
    def clear(self):
        """Clear all accumulated state."""
        self.layer_states = [None for _ in range(self.num_layers)]
        self.processed_chunks = 0
 # Test function
 def _test_chunked_attention():
    """Test chunked attention using flash_attn_with_lse and merge_attention_outputs."""
    from flash_attn.flash_attn_interface import flash_attn_func
    torch.manual_seed(42)
    print("=" * 70)
    print("Test: Chunked attention vs flash_attn_func (non-causal)")
    print("=" * 70)
    print("Splitting K,V into chunks, computing attention per chunk, then merging")
    print()
    for dtype in [torch.float16, torch.bfloat16]:
        for num_chunks in [64, 128, 256]:
            for batch, seqlen, nheads, headdim in [
                (1, 1024, 32, 128),
                (1, 2048, 32, 128),
                (1, 4096, 32, 128),
                (1, 8192, 32, 128),
            ]:
                # Generate random Q, K, V
                q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=dtype)
                k = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=dtype)
                v = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=dtype)
                # Reference: full attention (non-causal)
                out_ref = flash_attn_func(q, k, v, causal=False)
                # Chunked attention: split K, V into chunks
                chunk_size = seqlen // num_chunks
                accumulated_o = None
                accumulated_lse = None
                for i in range(num_chunks):
                    start = i * chunk_size
                    end = (i + 1) * chunk_size
                    k_chunk = k[:, start:end, :, :]
                    v_chunk = v[:, start:end, :, :]
                    # Q attends to this K,V chunk (non-causal)
                    chunk_o, chunk_lse = flash_attn_with_lse(
                        q, k_chunk, v_chunk, causal=False
                    )
                    if accumulated_o is None:
                        accumulated_o = chunk_o
                        accumulated_lse = chunk_lse
                    else:
                        # Merge with previous chunks
                        accumulated_o, accumulated_lse = merge_attention_outputs(
                            accumulated_o, accumulated_lse,
                            chunk_o, chunk_lse
                        )
                # Compare
                out_diff = (out_ref - accumulated_o).abs()
                out_max_diff = out_diff.max().item()
                out_mean_diff = out_diff.mean().item()
                status = "PASS" if out_max_diff < 1e-2 else "FAIL"
                print(
                    f"[{status}] dtype={str(dtype):14s} chunks={num_chunks} "
                    f"shape=({batch}, {seqlen:4d}, {nheads:2d}, {headdim:3d}) "
                    f"max_diff={out_max_diff:.6f} mean_diff={out_mean_diff:.6f}"
                )
    print()
    print("=" * 70)
    print("Test completed!")
 if __name__ == "__main__":
    _test_chunked_attention()
--- a/nanovllm/ops/xattn.py
+++ b/nanovllm/ops/xattn.py
--- a/nanovllm/utils/context.py
+++ b/nanovllm/utils/context.py
@@ -14,9 +14,9 @@ class Context:
    context_lens: torch.Tensor | None = None
    block_tables: torch.Tensor | None = None
-    # Sparse prefill attention support (GPU-only path)
+    # Attention policy support (GPU-only path)
-    # When set, uses policy.sparse_prefill_attention() instead of FlashAttention
+    # When set, uses policy.compute_prefill() instead of FlashAttention
-    sparse_prefill_policy: Any = None  # SparsePolicy instance with supports_prefill=True
+    attention_policy: Any = None  # AttentionPolicy instance
 _CONTEXT = Context()
@@ -35,7 +35,7 @@ def set_context(
    slot_mapping=None,
    context_lens=None,
    block_tables=None,
-    sparse_prefill_policy=None,
+    attention_policy=None,
 ):
    global _CONTEXT
    _CONTEXT = Context(
@@ -47,7 +47,7 @@ def set_context(
        slot_mapping=slot_mapping,
        context_lens=context_lens,
        block_tables=block_tables,
-        sparse_prefill_policy=sparse_prefill_policy,
+        attention_policy=attention_policy,
    )
--- a/notes.md
+++ b/notes.md
@@ -0,0 +1,130 @@
 # Notes: SparsePolicy Refactoring Research
 ## Sources
 ### Source 1: tzj/minference branch - policy.py
 - Path: `nanovllm/kvcache/sparse/policy.py`
 - Key design points:
  - The `PolicyContext` dataclass carries query_chunk_idx, num_query_chunks, layer_id, query, is_prefill, and related fields
  - `select_blocks()` requires an offload_engine argument
  - `compute_chunked_prefill()` and `compute_chunked_decode()` implement the complete attention flow
  - The `on_prefill_offload()` / `on_decode_offload()` hooks collect metadata
 ### Source 2: tzj/minference branch - full_policy.py
 - Path: `nanovllm/kvcache/sparse/full_policy.py`
 - Key implementation points:
  - `compute_chunked_prefill()` loads blocks internally through a ring buffer pipeline
  - Uses `flash_attn_with_lse` and `merge_attention_outputs` to merge attention across multiple chunks
  - `compute_chunked_decode()` handles prefilled blocks + decode buffer
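The merge step mentioned above can be illustrated with a small, dependency-free sketch. It is not the Triton kernel behind `merge_attention_outputs`: it uses plain Python floats, a single query, and scalar values, and the helper names `attend` and `merge` are hypothetical. It checks that attending to two KV chunks separately and merging through their log-sum-exp values gives the same result as attending to all keys at once.

```python
import math

def attend(scores, values):
    """Softmax attention for one query over scalar values.

    Returns (output, lse), where lse = log(sum(exp(score))).
    """
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = sum(w * v for w, v in zip(weights, values)) / total
    return out, m + math.log(total)

def merge(o1, lse1, o2, lse2):
    """Merge two partial attention results using their LSE values."""
    m = max(lse1, lse2)
    w1 = math.exp(lse1 - m)
    w2 = math.exp(lse2 - m)
    out = (w1 * o1 + w2 * o2) / (w1 + w2)
    return out, m + math.log(w1 + w2)

scores = [0.3, -1.2, 2.5, 0.0, 1.1, -0.7]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Reference: attend to all six keys at once.
full_out, full_lse = attend(scores, values)

# Chunked: attend to each half separately, then merge.
o1, lse1 = attend(scores[:3], values[:3])
o2, lse2 = attend(scores[3:], values[3:])
merged_out, merged_lse = merge(o1, lse1, o2, lse2)

print(abs(full_out - merged_out) < 1e-9, abs(full_lse - merged_lse) < 1e-9)
```

Because the merge is associative, the same `merge` call can be applied repeatedly to fold in any number of chunks, which is the pattern the chunked prefill path relies on.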
 ### Source 3: tzj/layer-offload branch - model_runner.py
 - Path: `nanovllm/engine/model_runner.py`
 - Key design points:
  - `run_layerwise_offload_prefill()` processes layer by layer, computing full attention in each layer
  - `sparse_prefill_policy.sparse_prefill_attention(q, k, v, layer_id)` is a simple interface
  - The FULL policy is represented by `sparse_prefill_policy is None` and takes the else branch
 ### Source 4: tzj/layer-offload branch - xattn.py
 - Path: `nanovllm/kvcache/sparse/xattn.py`
 - Key implementation points:
  - `sparse_prefill_attention()` calls FlashAttention directly (because of limitations in the chunked prefill architecture)
  - The Triton kernels are kept for a future GPU-only mode
 ## Synthesized Findings
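 ### Summary of architectural differences
 | Aspect | Chunked Offload | Layerwise Offload |
 |------|-----------------|-------------------|
 | **Prefill flow** | chunk-by-chunk, across layers | layer-by-layer, full sequence |
 | **KV storage** | offloaded immediately after each chunk | offloaded after each layer is computed |
 | **Attention computation** | computed in several passes, then merged | computed in a single pass |
 | **Block loading** | history must be loaded from CPU | not needed, already on GPU |
 | **Policy responsibility** | complete attention flow | attention kernel selection only |
 ### Simplifications in Layerwise Offload
 1. **No block selection needed**: the whole layer's KV is on the GPU, so there is nothing to select
 2. **No offload_engine argument needed**: the policy is not responsible for loading KV
 3. **No merge_attention_outputs needed**: the full attention is computed in one pass
 4. **No offload hooks needed**: offload is handled centrally in model_runner
 ### Design recommendations
 1. **Keep the interface simple**: only `compute_prefill_attention()` and `compute_decode_attention()` are needed
 2. **FULL implements the methods too**: no more `is None` checks; every policy is called the same way
 3. **Remove unnecessary parameters**: offload_engine, kvcache_manager, seq, etc. are not needed
 4. **Unify naming**: use `compute_*_attention` rather than `sparse_prefill_attention`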
 ## Code Examples
 ### Current call pattern (model_runner.py:876-891)
 ```python
 # Sparse or Full attention
 if self.sparse_prefill_policy is not None:
    # MInference or other sparse prefill policy
    attn_output = self.sparse_prefill_policy.sparse_prefill_attention(
        q, k, v, layer_id
    )
 else:
    # Full attention using FlashAttention
    attn_output = flash_attn_varlen_func(
        q, k, v, ...
    )
 ```
 ### Proposed new call pattern
 ```python
 # All policies are called the same way
 attn_output = self.attention_policy.compute_prefill_attention(
    q, k, v, layer_id, softmax_scale=layer.self_attn.attn.scale
 )
 ```
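A hedged sketch of what the unified call relies on: a factory that always returns a policy object, so the caller never branches on `None`. The classes below are stand-ins with no real attention math. The registry `_POLICIES`, the `"full"` / `"sparse"` keys, the stub class names, and the tuple return values are all assumptions made for illustration; only the names `create_attention_policy` and `compute_prefill_attention` appear in this change set.

```python
from abc import ABC, abstractmethod

class AttentionPolicy(ABC):
    @abstractmethod
    def compute_prefill_attention(self, q, k, v, layer_id, softmax_scale):
        ...

class FullPolicyStub(AttentionPolicy):
    def compute_prefill_attention(self, q, k, v, layer_id, softmax_scale):
        # Stand-in: a real implementation would call FlashAttention here.
        return ("full", layer_id)

class SparsePolicyStub(AttentionPolicy):
    def compute_prefill_attention(self, q, k, v, layer_id, softmax_scale):
        # Stand-in: a real implementation would run a sparse kernel here.
        return ("sparse", layer_id)

# Hypothetical registry; the keys are illustrative only.
_POLICIES = {"full": FullPolicyStub, "sparse": SparsePolicyStub}

def create_attention_policy(name="full", **kwargs):
    """Always return a policy instance, including for full attention."""
    policy_cls = _POLICIES.get(name)
    if policy_cls is None:
        raise ValueError(f"unknown attention policy: {name!r}")
    return policy_cls(**kwargs)

policy = create_attention_policy("full")
result = policy.compute_prefill_attention(
    None, None, None, layer_id=0, softmax_scale=1.0
)
print(result)  # ('full', 0)
```

With this shape, the call site stays a single method call regardless of which policy is configured, and an unknown policy name fails early at construction time rather than at the first prefill.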
 ## Questions Resolved
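 - Q: Is PolicyContext needed?
 - A: It can be simplified, because layerwise mode does not need chunk information
 - Q: How is the decode stage handled?
 - A: **Decode does not need a policy!** The current `run_layerwise_offload_decode()` uses the standard `layer(positions, hidden_states, residual)` call, which goes through the Attention.forward() path
 - Q: Why does decode not need sparsity?
 - A: Decode has only 1 query token per step, so sparsification brings no benefit. Once the KV is loaded from the ring buffer, flash_attn_with_kvcache is used directly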
 ## Key Insight
 **The policy design for Layerwise Offload should focus only on prefill**:
 ```
 Prefill: needs a policy
 - Attention over the whole sequence is computed in one pass
 - Sparse attention methods can be used (e.g. MInference's vertical+slash pattern)
 - The policy receives q, k, v, layer_id, softmax_scale
 Decode: does not need a policy
 - Only 1 query token per step
 - KV is loaded from the ring buffer
 - Uses the standard flash_attn_with_kvcache
 ```
 ## Interface Comparison Summary
 | Aspect | tzj/minference | tzj/layer-offload (new design) |
 |------|----------------|---------------------------|
 | Class name | SparsePolicy | AttentionPolicy |
 | Prefill method | compute_chunked_prefill() | compute_attention() |
 | Decode method | compute_chunked_decode() | not needed (uses the standard path) |
 | Needs offload_engine | yes | no |
 | Needs kvcache_manager | yes | no |
 | Needs seq | yes | no |
 | Supports FULL | yes | yes |
 ## Migration Path
 1. Keep `SparsePolicy` as an alias for `AttentionPolicy`
 2. Keep `PolicyContext` for future extension
 3. Keep the `select_blocks()` method signature (even though it is unused)
 4. Remove the `requires_block_selection` attribute (not needed)
--- a/task_plan.md
+++ b/task_plan.md
@@ -0,0 +1,549 @@
 # Task Plan: Refactor SparsePolicy for Layerwise Offload
 ## Goal
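 Refactor the SparsePolicy interface, following the design pattern of the tzj/minference branch, so that every attention variant can be abstracted as a policy written to one common specification. Adapt it to the characteristics of the current layerwise offload architecture (the whole layer's KV resides on the GPU).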
 ## Background
 ### Comparison of the two offload architectures
 | Property | tzj/minference (Chunked Offload) | tzj/layer-offload (Layerwise Offload) |
 |------|----------------------------------|---------------------------------------|
 | Processing granularity | one chunk at a time (block_size tokens) | one whole layer at a time (all tokens) |
 | KV location | historical chunks are on CPU and must be loaded | the whole layer's KV is on the GPU |
 | Policy entry point | `compute_chunked_prefill()/decode()` | `compute_prefill()/decode()` |
 | Needs offload_engine | yes (to load blocks) | no (KV already on GPU) |
 | Mask computation | `select_blocks()` returns block IDs | `estimate()` returns a sparse mask |
 ### Policy interface in tzj/minference
 ```python
 class SparsePolicy(ABC):
    supports_prefill: bool
    supports_decode: bool
    @abstractmethod
    def select_blocks(self, available_blocks, offload_engine, ctx) -> List[int]
    @abstractmethod
    def compute_chunked_prefill(self, q, k, v, layer_id, ..., offload_engine, ...) -> Tensor
    @abstractmethod
    def compute_chunked_decode(self, q, layer_id, ..., offload_engine, ...) -> Tensor
 ```
 ### Policy interface on the current branch (before refactoring)
 ```python
 class SparsePolicy(ABC):
    supports_prefill: bool
    supports_decode: bool
    @abstractmethod
    def select_blocks(self, available_blocks, ctx) -> List[int]
    def sparse_prefill_attention(self, q, k, v, layer_id) -> Tensor
 ```
 ## Phases
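 - [x] Phase 1: Analyze the differences and design the new interface
 - [x] **Phase 0: Create the nanovllm.ops module** ✅ tests pass
 - [ ] Phase 2: Refactor the AttentionPolicy base class
 - [ ] Phase 3: Refactor FullAttentionPolicy
 - [ ] Phase 4: Refactor XAttentionPolicy (including the estimate method)
 - [ ] Phase 5: Update how model_runner calls the policy
 - [ ] Phase 6: Test and verify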
 ---
 ## Phase 0: Create the nanovllm.ops module
 ### Goal
 Extract the ops module from the tzj/minference branch to provide the low-level operators behind XAttention estimate.
 ### Steps
 1. **Create the directory structure**
   ```
   nanovllm/ops/
   ├── __init__.py
   ├── xattn.py           # xattn_estimate, xattn_estimate_chunked, Triton kernels
   └── chunked_attention.py  # flash_attn_with_lse, merge_attention_outputs (kept as a fallback)
   ```
 2. **Extract the files from tzj/minference**
   ```bash
   git show tzj/minference:nanovllm/ops/__init__.py > nanovllm/ops/__init__.py
   git show tzj/minference:nanovllm/ops/xattn.py > nanovllm/ops/xattn.py
   git show tzj/minference:nanovllm/ops/chunked_attention.py > nanovllm/ops/chunked_attention.py
   ```
 3. **Cherry-pick the test file**
   ```bash
   git show tzj/minference:tests/test_xattn_estimate_chunked.py > tests/test_xattn_estimate_chunked.py
   ```
 4. **Run the test to verify**
   ```bash
   CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/home/zijie/Code/Worktree/nano-vllm:$PYTHONPATH \
       python tests/test_xattn_estimate_chunked.py
   ```
 ### Contents of the nanovllm/ops module
 | File | Core function | Purpose |
 |------|----------|------|
 | `xattn.py` | `xattn_estimate()` | Standard XAttention estimation |
 | `xattn.py` | `xattn_estimate_chunked()` | Chunked prefill version |
 | `xattn.py` | `flat_group_gemm_fuse_reshape()` | Triton kernel: fused reshape + GEMM |
 | `xattn.py` | `softmax_fuse_block_sum()` | Triton kernel: softmax + block sum |
 | `xattn.py` | `find_blocks_chunked()` | Block selection based on threshold |
 | `chunked_attention.py` | `flash_attn_with_lse()` | Flash attention with LSE output |
 | `chunked_attention.py` | `merge_attention_outputs()` | Merge multiple attention chunks |
 ### Relationship to the policy
 ```
 XAttentionPolicy.estimate()
    └── calls nanovllm.ops.xattn.xattn_estimate()
            ├── flat_group_gemm_fuse_reshape() (Triton)
            ├── softmax_fuse_block_sum() (Triton)
            └── find_blocks_chunked()
 ```
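To make the last step of that call chain more concrete, here is a simplified, hypothetical sketch of threshold-based block selection. It is not the actual `find_blocks_chunked` implementation, whose body is not shown in this plan; the function name `select_blocks_by_threshold` and the greedy rule are assumptions. Given the attention mass of each key block for one query block, it keeps the highest-mass blocks until their cumulative share reaches `threshold`.

```python
def select_blocks_by_threshold(block_mass, threshold=0.9):
    """Return a boolean mask over key blocks.

    Blocks are taken in order of decreasing mass until the selected
    blocks cover at least `threshold` of the total mass.
    """
    total = sum(block_mass)
    mask = [False] * len(block_mass)
    if total <= 0:
        return mask  # nothing to select
    # Indices sorted by mass, largest first (ties keep original order).
    order = sorted(range(len(block_mass)), key=lambda i: block_mass[i], reverse=True)
    covered = 0.0
    for i in order:
        mask[i] = True
        covered += block_mass[i]
        if covered / total >= threshold:
            break
    return mask

# Four key blocks; the second and fourth carry 95% of the mass.
mask = select_blocks_by_threshold([2, 50, 3, 45], threshold=0.9)
print(mask)  # [False, True, False, True]
```

A mask like this, computed per head and per query block, would have the `[num_heads, q_blocks, k_blocks]` shape that `estimate()` is planned to return.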
 ---
 ## Key Questions
 1. **What does `select_blocks` become?**
   - Renamed to `estimate()`: computes the sparse mask
   - For XAttention, it corresponds to COMPASS's `xattn_estimate()` function
   - `estimate()` on FullAttentionPolicy returns None (meaning full attention)
 2. **How should the policy interface be designed?**
   - Prefill: `compute_prefill(q, k, v, layer_id, softmax_scale)`
   - Decode: `compute_decode(q, k, v, layer_id, softmax_scale)`
   - Estimate: `estimate(q, k, layer_id)` - computes the sparse mask
 3. **How is the FULL policy handled?**
   - FULL also implements `compute_prefill/decode`, using FlashAttention
   - `estimate()` returns None (meaning no sparsification)
 ## Proposed New Interface
 ```python
 from abc import ABC, abstractmethod
 from typing import Optional
 import torch
 class AttentionPolicy(ABC):
    """Attention policy for Layerwise Offload mode.
    All attention computation goes through a policy, both Full and Sparse.
    Supports both the prefill and decode stages.
    """
    supports_prefill: bool = True
    supports_decode: bool = True
    def estimate(
        self,
        q: torch.Tensor,      # [seq_len, num_heads, head_dim]
        k: torch.Tensor,      # [seq_len, num_kv_heads, head_dim]
        layer_id: int,
    ) -> Optional[torch.Tensor]:
        """
        Estimate the sparse attention mask.
        For a sparse policy (e.g. XAttention), computes which blocks to attend to.
        For a full policy, returns None to indicate full attention.
        Corresponds to COMPASS's xattn_estimate() function.
        Args:
            q: Query tensor [seq_len, num_heads, head_dim]
            k: Key tensor [seq_len, num_kv_heads, head_dim]
            layer_id: Transformer layer index
        Returns:
            sparse_mask: [num_heads, q_blocks, k_blocks] boolean mask, or None
        """
        return None  # defaults to full attention
    @abstractmethod
    def compute_prefill(
        self,
        q: torch.Tensor,      # [seq_len, num_heads, head_dim]
        k: torch.Tensor,      # [seq_len, num_kv_heads, head_dim]
        v: torch.Tensor,      # [seq_len, num_kv_heads, head_dim]
        layer_id: int,
        softmax_scale: float,
    ) -> torch.Tensor:
        """
        Compute prefill attention.
        The whole layer's KV is on the GPU, so the full attention is computed in one pass.
        An implementation may first call estimate() to get a sparse mask and then apply block sparse attention.
        Args:
            q: Query tensor [seq_len, num_heads, head_dim]
            k: Key tensor [seq_len, num_kv_heads, head_dim]
            v: Value tensor [seq_len, num_kv_heads, head_dim]
            layer_id: Transformer layer index
            softmax_scale: Softmax scaling factor (1/sqrt(head_dim))
        Returns:
            Attention output [seq_len, num_heads, head_dim]
        """
        pass
    def compute_decode(
        self,
        q: torch.Tensor,      # [1, num_heads, head_dim]
        k: torch.Tensor,      # [context_len+1, num_kv_heads, head_dim]
        v: torch.Tensor,      # [context_len+1, num_kv_heads, head_dim]
        layer_id: int,
        softmax_scale: float,
    ) -> torch.Tensor:
        """
        Compute decode attention.
        KV is supplied from the ring buffer and contains the prefill tokens plus the tokens decoded so far.
        Args:
            q: Query tensor [1, num_heads, head_dim]
            k: Key tensor [context_len+1, num_kv_heads, head_dim]
            v: Value tensor [context_len+1, num_kv_heads, head_dim]
            layer_id: Transformer layer index
            softmax_scale: Softmax scaling factor
        Returns:
            Attention output [1, num_heads, head_dim]
        """
        # Default implementation: use FlashAttention
        from flash_attn.flash_attn_interface import flash_attn_varlen_func
        context_len = k.shape[0]
        cu_seqlens_q = torch.tensor([0, 1], dtype=torch.int32, device=q.device)
        cu_seqlens_k = torch.tensor([0, context_len], dtype=torch.int32, device=q.device)
        return flash_attn_varlen_func(
            q, k, v,
            cu_seqlens_q=cu_seqlens_q,
            cu_seqlens_k=cu_seqlens_k,
            max_seqlen_q=1,
            max_seqlen_k=context_len,
            softmax_scale=softmax_scale,
            causal=False,
        )
    def reset(self) -> None:
        """Reset policy state between sequences."""
        pass
    def __repr__(self) -> str:
        return f"{self.__class__.__name__}()"
 # Keep the old name as an alias
 SparsePolicy = AttentionPolicy
 ```
 ## Implementation Plan
 ### Phase 2: 重构 policy.py
 ```python
 # nanovllm/kvcache/sparse/policy.py
 from abc import ABC, abstractmethod
 from typing import Optional
 import torch
 class AttentionPolicy(ABC):
    """Base class for attention policies in layerwise offload mode."""
    supports_prefill: bool = True
    supports_decode: bool = True
    def estimate(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        layer_id: int,
    ) -> Optional[torch.Tensor]:
        """
        Estimate sparse attention mask.
        For sparse policies (e.g., XAttention), computes block-level importance.
        For full policy, returns None.
        Corresponds to xattn_estimate() in COMPASS.
        Returns:
            sparse_mask: [num_heads, q_blocks, k_blocks] or None
        """
        return None
    @abstractmethod
    def compute_prefill(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        layer_id: int,
        softmax_scale: float,
    ) -> torch.Tensor:
        """Compute prefill attention."""
        pass
    def compute_decode(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        layer_id: int,
        softmax_scale: float,
    ) -> torch.Tensor:
        """Compute decode attention (default: FlashAttention)."""
        from flash_attn.flash_attn_interface import flash_attn_varlen_func
        context_len = k.shape[0]
        cu_seqlens_q = torch.tensor([0, 1], dtype=torch.int32, device=q.device)
        cu_seqlens_k = torch.tensor([0, context_len], dtype=torch.int32, device=q.device)
        return flash_attn_varlen_func(
            q, k, v,
            cu_seqlens_q=cu_seqlens_q,
            cu_seqlens_k=cu_seqlens_k,
            max_seqlen_q=1,
            max_seqlen_k=context_len,
            softmax_scale=softmax_scale,
            causal=False,
        )
    def reset(self) -> None:
        pass
    def __repr__(self) -> str:
        return f"{self.__class__.__name__}()"
 # Backward compatibility alias
 SparsePolicy = AttentionPolicy
 ```
 ### Phase 3: 重构 FullAttentionPolicy
 ```python
 # nanovllm/kvcache/sparse/full_policy.py
 import torch
 from .policy import AttentionPolicy
 class FullAttentionPolicy(AttentionPolicy):
    """Full attention using FlashAttention (no sparsity)."""
    supports_prefill = True
    supports_decode = True
    def estimate(self, q, k, layer_id):
        """Full attention - no sparse mask needed."""
        return None
    def compute_prefill(self, q, k, v, layer_id, softmax_scale):
        from flash_attn.flash_attn_interface import flash_attn_varlen_func
        seq_len = q.shape[0]
        cu_seqlens = torch.tensor([0, seq_len], dtype=torch.int32, device=q.device)
        return flash_attn_varlen_func(
            q, k, v,
            cu_seqlens_q=cu_seqlens,
            cu_seqlens_k=cu_seqlens,
            max_seqlen_q=seq_len,
            max_seqlen_k=seq_len,
            softmax_scale=softmax_scale,
            causal=True,
        )
    def __repr__(self):
        return "FullAttentionPolicy()"
 ```
 ### Phase 4: 重构 XAttentionPolicy
 ```python
 # nanovllm/kvcache/sparse/xattn.py
 import torch
 from typing import Optional
 from .policy import AttentionPolicy
 class XAttentionPolicy(AttentionPolicy):
    """
    XAttention sparse prefill policy.
    Uses chunked estimation to compute sparse attention mask,
    then applies block sparse attention.
    """
    supports_prefill = True
    supports_decode = True
    def __init__(
        self,
        stride: int = 8,
        threshold: float = 0.9,
        block_size: int = 128,
        chunk_size: int = 16384,
        use_triton: bool = True,
    ):
        self.stride = stride
        self.threshold = threshold
        self.block_size = block_size
        self.chunk_size = chunk_size
        self.use_triton = use_triton
    def estimate(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        layer_id: int,
    ) -> Optional[torch.Tensor]:
        """
        XAttention estimation (xattn_estimate).
        Uses chunked GEMM + softmax to estimate block-level importance,
        then selects important blocks based on threshold.
        Corresponds to COMPASS's xattn_estimate() function:
        1. Pad inputs to chunk_size multiples
        2. Reshape with stride
        3. Compute QK^T in chunks (Triton)
        4. Block-wise softmax + aggregation
        5. Threshold-based selection
        Args:
            q: [seq_len, num_heads, head_dim]
            k: [seq_len, num_kv_heads, head_dim]
            layer_id: transformer layer index
        Returns:
            sparse_mask: [num_heads, q_blocks, k_blocks] boolean mask
                        or None (fallback to full attention)
        """
        # TODO: implement the real xattn_estimate
        # For now, return None so that full attention is used
        return None
    def compute_prefill(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        layer_id: int,
        softmax_scale: float,
    ) -> torch.Tensor:
        """
        Compute XAttention sparse prefill.
        Flow:
        1. Call estimate() to get sparse mask
        2. If mask is None, use full attention
        3. Otherwise, apply block sparse attention with mask
        """
        # Step 1: Estimate sparse mask
        sparse_mask = self.estimate(q, k, layer_id)
        # Step 2: Compute attention
        if sparse_mask is None:
            # Fallback to full attention
            from flash_attn.flash_attn_interface import flash_attn_varlen_func
            seq_len = q.shape[0]
            cu_seqlens = torch.tensor([0, seq_len], dtype=torch.int32, device=q.device)
            return flash_attn_varlen_func(
                q, k, v,
                cu_seqlens_q=cu_seqlens,
                cu_seqlens_k=cu_seqlens,
                max_seqlen_q=seq_len,
                max_seqlen_k=seq_len,
                softmax_scale=softmax_scale,
                causal=True,
            )
        else:
            # Apply block sparse attention with mask
            # Use block_sparse_attn_func(q, k, v, sparse_mask, block_size)
            raise NotImplementedError("Block sparse attention not yet implemented")
    def __repr__(self):
        return (f"XAttentionPolicy("
                f"stride={self.stride}, "
                f"threshold={self.threshold}, "
                f"block_size={self.block_size})")
 ```
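Step 5 of the `estimate()` docstring (threshold-based selection) can be illustrated in isolation. The block below is a minimal sketch, not the COMPASS `xattn_estimate()` implementation: it assumes the block-level scores have already been aggregated into a non-negative `[num_heads, q_blocks, k_blocks]` tensor, and it interprets the threshold as "keep the highest-scoring key blocks until their normalized cumulative score reaches `threshold`". The function name `select_blocks_by_threshold` is hypothetical, and causal masking is deliberately left out.

```python
import torch


def select_blocks_by_threshold(block_scores: torch.Tensor, threshold: float) -> torch.Tensor:
    """Hypothetical helper: build a boolean block mask from block-level scores.

    block_scores: [num_heads, q_blocks, k_blocks], non-negative importance scores.
    Returns a bool mask of the same shape. For each (head, q_block) row, the
    highest-scoring key blocks are kept until their cumulative normalized score
    reaches `threshold`.
    """
    # Normalize each row so the scores sum to 1 (guard against all-zero rows).
    row_sum = block_scores.sum(dim=-1, keepdim=True).clamp_min(1e-12)
    probs = block_scores / row_sum

    # Sort key blocks by importance, most important first.
    sorted_probs, order = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)

    # Keep a block if the mass accumulated *before* it is still below the threshold.
    keep_sorted = (cumulative - sorted_probs) < threshold

    # Scatter the decisions back to the original key-block order.
    mask = torch.zeros_like(keep_sorted)
    mask.scatter_(-1, order, keep_sorted)
    return mask


if __name__ == "__main__":
    scores = torch.tensor([[[0.05, 0.5, 0.15, 0.3]]])  # 1 head, 1 query block, 4 key blocks
    print(select_blocks_by_threshold(scores, threshold=0.9))
```

With this interpretation a higher threshold keeps more blocks, which matches the `--xattn-threshold` help text in `tests/test_needle.py`.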
 ### Phase 5: Update model_runner.py
 ```python
 # model_runner.py - allocate_kv_cache()
 # Always create a policy (including FULL)
 from nanovllm.kvcache.sparse import create_attention_policy
 self.attention_policy = create_attention_policy(config.attention_policy, **policy_kwargs)
 logger.info(f"Attention policy: {self.attention_policy}")
 # run_layerwise_offload_prefill() and run_gpu_only_prefill()
 # Old code:
 if self.sparse_prefill_policy is not None:
    attn_output = self.sparse_prefill_policy.sparse_prefill_attention(q, k, v, layer_id)
 else:
    attn_output = flash_attn_varlen_func(...)
 # New code:
 attn_output = self.attention_policy.compute_prefill(
    q, k, v, layer_id, softmax_scale=layer.self_attn.attn.scale
 )
 ```
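The snippet above relies on `create_attention_policy()` always returning a policy object, including for the FULL case. A minimal sketch of such a factory follows. Everything in it is an assumption made for illustration: the `PolicyType` enum, the `_POLICY_CLASSES` registry, and the stub policy classes are stand-ins, not the actual contents of `nanovllm/kvcache/sparse/__init__.py`.

```python
from enum import Enum, auto


class PolicyType(Enum):
    """Hypothetical stand-in for the project's policy enum."""
    FULL = auto()
    XATTN = auto()


class FullAttentionPolicy:
    """Stub: the real class implements compute_prefill() and friends."""

    def __repr__(self):
        return "FullAttentionPolicy()"


class XAttentionPolicy:
    """Stub that keeps only the constructor arguments used for illustration."""

    def __init__(self, stride=8, threshold=0.9, block_size=128):
        self.stride = stride
        self.threshold = threshold
        self.block_size = block_size

    def __repr__(self):
        return (f"XAttentionPolicy(stride={self.stride}, "
                f"threshold={self.threshold}, block_size={self.block_size})")


# Hypothetical registry mapping each policy type to its class.
_POLICY_CLASSES = {
    PolicyType.FULL: FullAttentionPolicy,
    PolicyType.XATTN: XAttentionPolicy,
}


def create_attention_policy(policy_type, **kwargs):
    """Return a policy instance for every known type, so callers never need a None check."""
    try:
        policy_cls = _POLICY_CLASSES[policy_type]
    except KeyError:
        raise ValueError(f"Unknown attention policy: {policy_type!r}") from None
    return policy_cls(**kwargs)


if __name__ == "__main__":
    print(create_attention_policy(PolicyType.FULL))
    print(create_attention_policy(PolicyType.XATTN, threshold=0.8))
```

Because the factory never returns `None`, the `if self.sparse_prefill_policy is not None` branch in the old code has no counterpart in the new call site.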
 ## Method Mapping
 | Old method | New method | Notes |
 |--------|--------|------|
 | `select_blocks()` | `estimate()` | Computes the sparse mask (corresponds to xattn_estimate) |
 | `sparse_prefill_attention()` | `compute_prefill()` | Prefill attention |
 | (none) | `compute_decode()` | Decode attention (default implementation) |
 | `on_prefill_offload()` | (removed) | Offload is handled in model_runner |
 ## Files to Modify
 | File | Changes |
 |------|---------|
 | `nanovllm/kvcache/sparse/policy.py` | New interface: estimate, compute_prefill, compute_decode |
 | `nanovllm/kvcache/sparse/full_policy.py` | Implement compute_prefill(); estimate() returns None |
 | `nanovllm/kvcache/sparse/xattn.py` | estimate() corresponds to xattn_estimate; compute_prefill() |
 | `nanovllm/kvcache/sparse/__init__.py` | Update the factory function |
 | `nanovllm/engine/model_runner.py` | Call attention_policy.compute_prefill() uniformly |
 | `nanovllm/config.py` | Optional: rename config options |
 ## Decisions Made
 1. **Method naming**: `compute_prefill` / `compute_decode` follow the naming style of the chunked version
 2. **estimate method**: replaces `select_blocks`; returns a sparse mask instead of block IDs
 3. **XAttention**: `estimate()` corresponds to COMPASS's `xattn_estimate()`
 4. **Full Policy**: `estimate()` returns None, meaning full attention is used
 5. **Default decode implementation**: the base class provides a default FlashAttention implementation
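Decision 2 changes the return type from block IDs to a boolean mask. Callers that still expect block IDs can recover them from the mask; the sketch below shows one way to do that, assuming the `[num_heads, q_blocks, k_blocks]` mask layout described in the `estimate()` docstring. The helper name `mask_to_block_ids` is hypothetical and not part of the planned interface.

```python
import torch


def mask_to_block_ids(sparse_mask: torch.Tensor) -> list:
    """Hypothetical helper: convert a boolean block mask into per-row block ID lists.

    sparse_mask: [num_heads, q_blocks, k_blocks] boolean mask.
    Returns ids such that ids[h][q] lists the selected key-block indices
    for head h and query block q, in ascending order.
    """
    num_heads, q_blocks, _ = sparse_mask.shape
    return [
        [torch.nonzero(sparse_mask[h, q]).flatten().tolist() for q in range(q_blocks)]
        for h in range(num_heads)
    ]


if __name__ == "__main__":
    # A causal (lower-triangular) mask over 3 blocks, shared by 2 heads.
    causal = torch.tril(torch.ones(3, 3, dtype=torch.bool)).expand(2, 3, 3)
    print(mask_to_block_ids(causal)[0])
```

The mask form is the more general of the two: it carries a separate selection per head and per query block, whereas a flat list of block IDs needs extra structure to express the same thing.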
 ## Errors Encountered
 - (none)
 ## Status
 **Currently in Phase 1** - analysis and interface design are complete; waiting for user confirmation before entering Phase 2
--- a/tests/test_needle.py
+++ b/tests/test_needle.py
@@ -32,11 +32,14 @@ def run_needle_test(
    enable_cpu_offload: bool = False,
    enable_quest: bool = False,
    enable_minference: bool = False,
    enable_xattn: bool = False,
    sparse_topk: int = 8,
    sparse_threshold: int = 4,
    minference_budget: float = 0.3,
    minference_vertical: int = 1000,
    minference_slash: int = 6096,
    xattn_threshold: float = 0.9,
    xattn_use_bsa: bool = True,
    gpu_utilization: float = 0.9,
    enforce_eager: bool = True,
    verbose: bool = True,
@@ -56,11 +59,14 @@ def run_needle_test(
        enable_cpu_offload: Enable CPU offload mode
        enable_quest: Enable Quest sparse attention (decode-only Top-K)
        enable_minference: Enable MInference sparse prefill (GPU-only)
        enable_xattn: Enable XAttention sparse prefill with BSA
        sparse_topk: Top-K blocks for Quest
        sparse_threshold: Apply sparse only when blocks > threshold
        minference_budget: MInference adaptive budget (fraction of seq_len, None=fixed mode)
        minference_vertical: Fixed vertical_size (only used when budget=None)
        minference_slash: Fixed slash_size (only used when budget=None)
        xattn_threshold: XAttention block selection threshold (0-1)
        xattn_use_bsa: Use Block Sparse Attention library
        gpu_utilization: GPU memory utilization fraction
        verbose: Print detailed output
@@ -68,7 +74,9 @@ def run_needle_test(
        True if test passed, False otherwise
    """
    # Determine sparse policy
-    if enable_minference:
+    if enable_xattn:
        sparse_policy = SparsePolicyType.XATTN
    elif enable_minference:
        sparse_policy = SparsePolicyType.MINFERENCE
    elif enable_quest:
        sparse_policy = SparsePolicyType.QUEST
@@ -94,6 +102,8 @@ def run_needle_test(
                print(f"  MInference: adaptive (budget={minference_budget})")
            else:
                print(f"  MInference: fixed (vertical={minference_vertical}, slash={minference_slash})")
        if enable_xattn:
            print(f"  XAttention: threshold={xattn_threshold}, use_bsa={xattn_use_bsa}")
        print(f"{'='*60}\n")
    # 1. Initialize LLM
@@ -111,7 +121,7 @@ def run_needle_test(
        llm_kwargs["sparse_threshold_blocks"] = sparse_threshold
    # Set sparse policy (can be used with or without offload)
-    if enable_minference or enable_quest:
+    if enable_minference or enable_quest or enable_xattn:
        llm_kwargs["sparse_policy"] = sparse_policy
    # MInference params (works with both GPU-only and offload mode)
@@ -120,6 +130,11 @@ def run_needle_test(
        llm_kwargs["minference_vertical_size"] = minference_vertical
        llm_kwargs["minference_slash_size"] = minference_slash
    # XAttention params
    if enable_xattn:
        llm_kwargs["xattn_threshold"] = xattn_threshold
        llm_kwargs["xattn_use_bsa"] = xattn_use_bsa
    llm = LLM(model_path, **llm_kwargs)
    # 2. Generate needle prompt
@@ -224,6 +239,11 @@ if __name__ == "__main__":
        action="store_true",
        help="Enable MInference sparse prefill (GPU-only, vertical+slash pattern)"
    )
    parser.add_argument(
        "--enable-xattn",
        action="store_true",
        help="Enable XAttention sparse prefill with Block Sparse Attention"
    )
    parser.add_argument(
        "--sparse-topk",
        type=int,
@@ -254,6 +274,17 @@ if __name__ == "__main__":
        default=6096,
        help="Fixed slash_size (only used when budget=0)"
    )
    parser.add_argument(
        "--xattn-threshold",
        type=float,
        default=0.9,
        help="XAttention block selection threshold (0-1, higher=more blocks)"
    )
    parser.add_argument(
        "--xattn-no-bsa",
        action="store_true",
        help="Disable Block Sparse Attention (use FlashAttention fallback)"
    )
    parser.add_argument(
        "--gpu-utilization",
        type=float,
@@ -291,11 +322,14 @@ if __name__ == "__main__":
        enable_cpu_offload=args.enable_offload,
        enable_quest=args.enable_quest,
        enable_minference=args.enable_minference,
        enable_xattn=args.enable_xattn,
        sparse_topk=args.sparse_topk,
        sparse_threshold=args.sparse_threshold,
        minference_budget=minference_budget,
        minference_vertical=args.minference_vertical,
        minference_slash=args.minference_slash,
        xattn_threshold=args.xattn_threshold,
        xattn_use_bsa=not args.xattn_no_bsa,
        gpu_utilization=args.gpu_utilization,
        enforce_eager=enforce_eager,
        verbose=True,
--- a/tests/test_xattn_estimate_chunked.py
+++ b/tests/test_xattn_estimate_chunked.py
@@ -0,0 +1,244 @@
 """
 Test: Compare xattn_estimate vs xattn_estimate_chunked
 Verify that chunked estimation with EXTERNAL chunking produces the same mask
 as standard estimation. This ensures the chunked version can be used in
 chunked prefill scenarios without accuracy loss.
 Usage:
    CUDA_VISIBLE_DEVICES=0 PYTHONPATH=/home/zijie/Code/nano-vllm:$PYTHONPATH \
        python tests/test_xattn_estimate_chunked.py
 """
 import sys
 import traceback
 import torch
 from nanovllm.ops.xattn import xattn_estimate, xattn_estimate_chunked
 # ============================================================
 # Configuration
 # ============================================================
 # Configuration for xattn_estimate_chunked consistency test.
 # Key requirements for 100% match:
 # 1. Use matching chunk_size for both standard and chunked versions
 # 2. Use same random seed for reproducibility
 # Note: Tiny differences (~0.000001) may occur at boundary cases due to
 # floating point precision in cumulative sum calculations.
 BLOCK_SIZE = 64
 STRIDE = 4
 THRESHOLD = 0.9
 CHUNK_SIZE = 4096  # External chunking size
 # Test sequence lengths
 TEST_SEQ_LENS = [4096, 8192, 16384, 32768]
 # ============================================================
 # Utility Functions
 # ============================================================
 def compare_masks(mask1, mask2, name1="standard", name2="chunked"):
    """Compare two masks and report differences."""
    if mask1.shape != mask2.shape:
        print(f"  Shape mismatch: {name1}={mask1.shape}, {name2}={mask2.shape}")
        return False
    diff = (mask1 != mask2).sum().item()
    total = mask1.numel()
    match_rate = (total - diff) / total * 100
    print(f"  Match rate: {match_rate:.4f}% ({total - diff}/{total})")
    if diff > 0:
        diff_indices = torch.where(mask1 != mask2)
        print(f"  First 5 diff positions: {list(zip(*[idx[:5].tolist() for idx in diff_indices]))}")
    return diff == 0
 def run_chunked_externally(query, key, block_size, stride, threshold, chunk_size):
    """
    Run xattn_estimate_chunked with EXTERNAL chunking.
    This simulates how chunked prefill should be used in practice.
    """
    batch_size, num_heads, q_len, head_dim = query.shape
    _, _, k_len, _ = key.shape
    q_block_num = (q_len + block_size - 1) // block_size
    k_block_num = (k_len + block_size - 1) // block_size
    # If Q fits in one chunk, call directly
    if q_len <= chunk_size:
        return xattn_estimate_chunked(
            query, key,
            q_start_pos=0,
            block_size=block_size,
            stride=stride,
            threshold=threshold,
            use_triton=True,
            chunk_size=chunk_size,
        )
    # External chunking: split Q and call for each chunk
    num_q_chunks = (q_len + chunk_size - 1) // chunk_size
    print(f"    External chunking: {num_q_chunks} chunks")
    combined_attn_sum = torch.zeros(
        batch_size, num_heads, q_block_num, k_block_num,
        dtype=query.dtype, device=query.device
    )
    combined_mask = torch.zeros(
        batch_size, num_heads, q_block_num, k_block_num,
        dtype=torch.bool, device=query.device
    )
    q_block_offset = 0
    for q_chunk_idx in range(num_q_chunks):
        q_chunk_start = q_chunk_idx * chunk_size
        q_chunk_end = min((q_chunk_idx + 1) * chunk_size, q_len)
        q_chunk = query[:, :, q_chunk_start:q_chunk_end, :]
        # For causal attention, K accumulates up to current Q position
        # q_start_pos=0 means Q starts at position 0 in the full sequence
        # K is [0, q_chunk_end) for causal attention
        k_end = q_chunk_end
        k_chunk = key[:, :, :k_end, :]
        attn_sum_chunk, mask_chunk = xattn_estimate_chunked(
            q_chunk, k_chunk,
            q_start_pos=q_chunk_start,
            block_size=block_size,
            stride=stride,
            threshold=threshold,
            use_triton=True,
            chunk_size=chunk_size,
        )
        # Place chunk results into combined output
        chunk_q_blocks = mask_chunk.shape[2]
        chunk_k_blocks = mask_chunk.shape[3]
        combined_attn_sum[:, :, q_block_offset:q_block_offset+chunk_q_blocks, :chunk_k_blocks] = attn_sum_chunk
        combined_mask[:, :, q_block_offset:q_block_offset+chunk_q_blocks, :chunk_k_blocks] = mask_chunk
        q_block_offset += chunk_q_blocks
    return combined_attn_sum, combined_mask
 def test_single_seq_len(seq_len, num_heads=32, head_dim=128):
    """Test a single sequence length."""
    print(f"\nTesting seq_len={seq_len}")
    print("=" * 60)
    # Generate random Q/K
    query = torch.randn(1, num_heads, seq_len, head_dim, device="cuda", dtype=torch.bfloat16)
    key = torch.randn(1, num_heads, seq_len, head_dim, device="cuda", dtype=torch.bfloat16)
    # Run standard xattn_estimate
    print("[1] Running standard xattn_estimate...")
    try:
        attn_sum_std, mask_std = xattn_estimate(
            query, key,
            block_size=BLOCK_SIZE,
            stride=STRIDE,
            threshold=THRESHOLD,
            chunk_size=CHUNK_SIZE,
            use_triton=True,
            causal=True,
        )
        density_std = mask_std.float().mean().item()
        print(f"  mask shape: {mask_std.shape}, density: {density_std:.4f}")
    except Exception as e:
        print(f"  ERROR: {e}")
        traceback.print_exc()
        return False
    # Run chunked xattn_estimate with EXTERNAL chunking
    print("[2] Running chunked xattn_estimate (external chunking)...")
    try:
        attn_sum_chunked, mask_chunked = run_chunked_externally(
            query, key,
            block_size=BLOCK_SIZE,
            stride=STRIDE,
            threshold=THRESHOLD,
            chunk_size=CHUNK_SIZE,
        )
        density_chunked = mask_chunked.float().mean().item()
        print(f"  mask shape: {mask_chunked.shape}, density: {density_chunked:.4f}")
    except Exception as e:
        print(f"  ERROR: {e}")
        traceback.print_exc()
        return False
    # Compare results
    print("[3] Comparing results...")
    chunked_q_blocks = mask_chunked.shape[2]
    chunked_k_blocks = mask_chunked.shape[3]
    # Extract comparable region from standard mask
    mask_std_comparable = mask_std[:, :, :chunked_q_blocks, :chunked_k_blocks]
    # Compare masks
    masks_match = compare_masks(mask_std_comparable, mask_chunked, "standard", "chunked")
    # Compare attn_sums
    attn_sum_std_comparable = attn_sum_std[:, :, :chunked_q_blocks, :chunked_k_blocks]
    if attn_sum_std_comparable.shape == attn_sum_chunked.shape:
        attn_diff = (attn_sum_std_comparable - attn_sum_chunked).abs().max().item()
        print(f"  Attn sum max diff: {attn_diff:.6f}")
    else:
        print(f"  Attn sum shape mismatch: std={attn_sum_std_comparable.shape}, chunked={attn_sum_chunked.shape}")
    # Clean up GPU memory
    del query, key, attn_sum_std, mask_std, attn_sum_chunked, mask_chunked
    torch.cuda.empty_cache()
    return masks_match
 # ============================================================
 # Main Test
 # ============================================================
 if __name__ == "__main__":
    print("XAttention Chunked vs Standard Test")
    print("=" * 60)
    print(f"Config: block_size={BLOCK_SIZE}, stride={STRIDE}, threshold={THRESHOLD}")
    print(f"External chunk_size={CHUNK_SIZE}")
    print()
    # Check CUDA availability
    if not torch.cuda.is_available():
        print("CUDA not available!")
        sys.exit(1)
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
    print("✓ xattn_estimate imported")
    print("✓ xattn_estimate_chunked imported")
    # Run tests
    all_passed = True
    results = []
    for seq_len in TEST_SEQ_LENS:
        passed = test_single_seq_len(seq_len)
        chunks = (seq_len + CHUNK_SIZE - 1) // CHUNK_SIZE
        results.append((seq_len, chunks, passed))
        if not passed:
            all_passed = False
    # Summary
    print("\n" + "=" * 60)
    print("SUMMARY")
    print("=" * 60)
    for seq_len, chunks, passed in results:
        status = "PASSED" if passed else "FAILED"
        print(f"  seq_len={seq_len:5d} ({chunks} chunk{'s' if chunks > 1 else ' '}): {status}")
    print("=" * 60)
    if all_passed:
        print("ALL TESTS PASSED!")
        sys.exit(0)
    else:
        print("SOME TESTS FAILED!")
        sys.exit(1)
Author	SHA1	Message	Date
Zijie Tian	5fb0f67295	[WIP] need refactor.	2026-01-22 22:20:34 +08:00
Zijie Tian	69b779e252	📝 docs: add layer offload planning notes and task plan Add planning documents for layer-wise offload implementation: - notes.md: Implementation notes and findings - task_plan.md: Detailed task breakdown and progress tracking Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 06:04:36 +08:00
Zijie Tian	e313dd795a	✨ feat: add exec-plan command for automated task plan execution Add a new Claude command that executes task_plan.md refactoring with: - GPU isolation via --gpu <id> parameter (required) - Optional --no-interrupt mode for autonomous execution - Progress tracking via progress.md and findings.md - Strict CUDA_VISIBLE_DEVICES enforcement Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 06:03:42 +08:00
Zijie Tian	9f3ee9279e	✨ feat: add nanovllm.ops module with XAttention estimation kernels Add ops module ported from tzj/minference branch containing: - xattn.py: XAttention block importance estimation with Triton kernels - xattn_estimate(): standard estimation for sparse attention mask - xattn_estimate_chunked(): chunked prefill compatible version - flat_group_gemm_fuse_reshape(): fused stride reshape + GEMM kernel - softmax_fuse_block_sum(): online softmax + block-wise sum kernel - chunked_attention.py: Flash attention with LSE output for chunk merging - test_xattn_estimate_chunked.py: verification test (all seq_lens pass) This prepares the foundation for AttentionPolicy refactoring where XAttentionPolicy.estimate() will call these ops. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 06:00:42 +08:00