nano-vllm/nanovllm/engine/model_runner.py at f049971f84064fa2da0429f3210d7e1ca9ed194c

Files

Zijie Tian 4fe7dfb239 🔀 merge: integrate tzj/minference-exp (GPU-only sparse attention)

Merge GPU-only sparse attention support from tzj/minference-exp branch:

**GPU-only mode additions:**
- Add compute_prefill/compute_decode methods to SparsePolicy base class
- Add GPU-only attention routing in attention.py
- Add alloc_policy_metadata() for pre-allocating GQA buffers
- Add XAttention + BSA sparse attention for GPU-only prefill
- Add kvcache_manager to set_context() for policy access

**bench.py enhancements:**
- Add --model argument for configurable model path
- Add --policy argument (full, xattn) for sparse policy selection
- Add --enable-policy flag for FullAttentionPolicy routing
- Add --enforce-eager option to disable CUDA graphs
- Add --gpu-util option for GPU memory utilization

**Documentation:**
- Add gpu_only_xattn_guide.md with performance analysis
- Add gpu_only_sparse_integration.md baseline document
- Add gpu-vram-requirement.md rule for GPU-only mode

Both CPU offload and GPU-only paths are preserved and functional.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-27 09:25:36 +08:00

40 KiB

Raw Blame History

View Raw

40 KiB Raw Blame History

40 KiB

Raw Blame History