📊 feat: distinguish compute density and communication density in DensityObserver

- Add record_comm_density() call in select_blocks to track CPU block selection
- Add get_per_layer_comm_density() method for detailed analysis
- Update print_summary() to show both densities and H2D savings ratio
- Set DensityObserver mode (offload/gpu_only) in test_ruler.py
- Update get_summary() to return both density types

Key insight: Comm density can be 100% even when compute density is ~37% because sparse BSA blocks are distributed across all CPU blocks. Since CPU block granularity is 32x coarser (4096 vs 128 tokens), any() aggregation across heads/Q-blocks results in all CPU blocks being needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
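The key insight in the commit message can be reproduced with a small standalone sketch. It is not the code in xattn_bsa.py: the mask layout `mask[head][q_block][k_block]`, the helper names `densities` and `random_mask`, and the uniformly random block selection are all assumptions made for illustration. Only the block sizes (128 and 4096 tokens) and the ~37% compute density come from the commit message.

```python
import random

BSA_BLOCK = 128                  # tokens per BSA block (from the commit message)
CPU_BLOCK = 4096                 # tokens per CPU block (from the commit message)
RATIO = CPU_BLOCK // BSA_BLOCK   # 32 BSA blocks fit in one CPU block


def densities(mask, ratio=RATIO):
    """Return (compute_density, comm_density) for a hypothetical mask.

    mask[h][q][k] is True when head h and Q-block q select BSA K-block k.
    """
    num_k = len(mask[0][0])
    total = sum(len(row) for head in mask for row in head)
    selected = sum(sum(row) for head in mask for row in head)
    compute_density = selected / total

    # A BSA K-block is needed if any head or any Q-block selects it.
    k_needed = [any(row[k] for head in mask for row in head) for k in range(num_k)]
    # A CPU block is needed if any of the BSA blocks it covers is needed.
    cpu_needed = [any(k_needed[i:i + ratio]) for i in range(0, num_k, ratio)]
    comm_density = sum(cpu_needed) / len(cpu_needed)
    return compute_density, comm_density


def random_mask(num_heads, num_q, num_k, p, seed=0):
    """Hypothetical mask: each (head, q, k) entry is selected with probability p."""
    rng = random.Random(seed)
    return [[[rng.random() < p for _ in range(num_k)]
             for _ in range(num_q)]
            for _ in range(num_heads)]


# 8 heads, 32 Q-blocks, 256 BSA K-blocks (i.e. 8 CPU blocks), ~37% selected.
mask = random_mask(num_heads=8, num_q=32, num_k=256, p=0.37)
compute_density, comm_density = densities(mask)
print(f"compute density: {compute_density:.2%}, comm density: {comm_density:.2%}")
```

Under these assumptions every CPU block ends up containing at least one selected BSA block, so the communication density saturates while the compute density stays near the selection probability. A real mask (for example a causal one, or one whose selections cluster in a few regions) could behave differently.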
2026-02-05 01:43:17 +08:00
parent 1ea5afd886
commit 51bd678335
3 changed files with 35 additions and 4 deletions
--- a/nanovllm/kvcache/sparse/xattn_bsa.py
+++ b/nanovllm/kvcache/sparse/xattn_bsa.py
@@ -905,6 +905,15 @@ class XAttentionBSAPolicy(SparsePolicy):
            self._stats_total_selected_blocks += len(selected_block_ids)
            self._stats_num_chunks += 1

+            # Record communication density to DensityObserver
+            # Comm density = selected_cpu_blocks / available_cpu_blocks
+            # This is different from compute density (BSA block granularity)
+            DensityObserver.record_comm_density(
+                layer_id=layer_id,
+                selected_cpu_blocks=len(selected_block_ids),
+                total_cpu_blocks=len(available_blocks),
+            )
+
            # Log per-chunk density
            chunk_density = len(selected_block_ids) / len(available_blocks)
            logger.debug(f"[XAttn] chunk={ctx.query_chunk_idx}, available={len(available_blocks)}, "