nano-vllm

Files

Zijie Tian 51bd678335 📊 feat: distinguish compute density and communication density in DensityObserver

- Add record_comm_density() call in select_blocks to track CPU block selection
- Add get_per_layer_comm_density() method for detailed analysis
- Update print_summary() to show both densities and H2D savings ratio
- Set DensityObserver mode (offload/gpu_only) in test_ruler.py
- Update get_summary() to return both density types

Key insight: Comm density can be 100% even when compute density is ~37%
because sparse BSA blocks are distributed across all CPU blocks.
Since CPU block granularity is 32x coarser (4096 vs 128 tokens),
any() aggregation across heads/Q-blocks results in all CPU blocks being needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-05 01:43:17 +08:00

comm

[WIP] Added sgDMA operator for scatter kvcache communication.

2025-12-24 23:48:52 +08:00

debug

[refactor] Refactor the kvcache offload.

2026-01-04 19:37:03 +08:00

engine

WIP: Enhance sparse attention with density tracking and block selection improvements

2026-01-31 14:48:23 +08:00

kvcache

📊 feat: distinguish compute density and communication density in DensityObserver