Validates that pre-allocated CUDA graphs work for chunk-wise attention:

- Each (Q_chunk, K_chunk) pair has its own captured graph
- Zero copy_() calls during replay; all buffers pre-filled
- Uses nanovllm's flash_attn_with_lse and merge_attention_outputs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
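The LSE-based merge that the per-chunk graphs rely on can be sketched in a toy
1-D form. This is an illustrative reimplementation of the merging math, not the
actual nanovllm flash_attn_with_lse / merge_attention_outputs code; the function
names are chosen to mirror those helpers:

```python
import math

def chunk_attention(q, k_chunk, v_chunk):
    """Toy 1-D attention over one K/V chunk; returns (output, lse).

    Mimics a flash_attn_with_lse-style kernel: each chunk produces a
    partial softmax-weighted output plus its log-sum-exp, so partials
    from different chunks can be merged later without re-reading K/V.
    """
    scores = [q * k for k in k_chunk]          # toy scalar "dot products"
    m = max(scores)                            # max-shift for stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    out = sum(e * v for e, v in zip(exps, v_chunk)) / denom
    lse = m + math.log(denom)                  # log-sum-exp of this chunk
    return out, lse

def merge_attention_outputs(parts):
    """Merge (output, lse) partials into the exact full-softmax result."""
    m = max(lse for _, lse in parts)
    weights = [math.exp(lse - m) for _, lse in parts]
    merged = sum(w * out for w, (out, _) in zip(weights, parts))
    return merged / sum(weights)

if __name__ == "__main__":
    q = 1.0
    ks = [0.1, 0.2, 0.3, 0.4]
    vs = [1.0, 2.0, 3.0, 4.0]
    # Chunk-wise: attend to K/V in two halves, then merge via LSE weights.
    parts = [chunk_attention(q, ks[:2], vs[:2]),
             chunk_attention(q, ks[2:], vs[2:])]
    merged = merge_attention_outputs(parts)
    # Reference: one softmax over all keys at once.
    full, _ = chunk_attention(q, ks, vs)
    print(abs(merged - full) < 1e-12)
```

The merge is exact (not an approximation): scaling each partial output by
exp(lse_c - max_lse) and renormalizing recovers the single softmax over the
concatenated chunks, which is why replaying one captured graph per
(Q_chunk, K_chunk) pair and merging afterwards is safe.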