📊 feat: distinguish compute density and communication density in DensityObserver
- Add record_comm_density() call in select_blocks to track CPU block selection
- Add get_per_layer_comm_density() method for detailed analysis
- Update print_summary() to show both densities and H2D savings ratio
- Set DensityObserver mode (offload/gpu_only) in test_ruler.py
- Update get_summary() to return both density types

Key insight: Comm density can be 100% even when compute density is ~37%, because sparse BSA blocks are distributed across all CPU blocks. Since CPU block granularity is 32x coarser (4096 vs 128 tokens), any() aggregation across heads/Q-blocks results in all CPU blocks being needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
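The key insight above can be sketched numerically. This is a hypothetical illustration, not the repo's actual DensityObserver code: block sizes (128-token BSA blocks, 4096-token CPU blocks) come from the commit message, while the mask shapes, the ~37% selection rate, and all variable names are assumptions for demonstration.

```python
import numpy as np

# Assumed granularities from the commit message: BSA selects 128-token
# blocks; CPU offload transfers 4096-token blocks (32x coarser).
FINE = 128
COARSE = 4096
RATIO = COARSE // FINE  # 32 fine blocks per CPU block

rng = np.random.default_rng(0)
n_fine = 256  # 256 fine blocks = 32768 tokens = 8 CPU blocks (illustrative)

# Sparse BSA mask: ~37% of fine blocks selected, scattered uniformly.
fine_mask = rng.random(n_fine) < 0.37
compute_density = fine_mask.mean()

# A CPU block must be transferred (H2D) if ANY of its 32 fine blocks is
# needed by any head / Q-block -- the any() aggregation from the commit.
coarse_mask = fine_mask.reshape(-1, RATIO).any(axis=1)
comm_density = coarse_mask.mean()

print(f"compute density ~ {compute_density:.0%}")
print(f"comm density    = {comm_density:.0%}")
```

With a scattered ~37% selection rate, the chance that all 32 fine blocks inside a given CPU block are unselected is about 0.63^32 (vanishingly small), so nearly every CPU block is touched and comm density sits at or near 100% even though compute density stays near 37%.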
@@ -386,8 +386,11 @@ def run_ruler_benchmark(
     if sparse_policy and sparse_policy.upper() == "XATTN_BSA":
         DensityObserver.enable()
         DensityObserver.complete_reset()
+        # Set mode for correct density interpretation
+        DensityObserver.set_mode("offload" if enable_cpu_offload else "gpu_only")
         if not json_output:
-            print("[DensityObserver] Enabled for XAttention BSA")
+            mode_str = "offload" if enable_cpu_offload else "gpu_only"
+            print(f"[DensityObserver] Enabled for XAttention BSA (mode: {mode_str})")
 
     # LLM initialization kwargs
     llm_kwargs = {