Zijie Tian
a50b4c2ac2
♻️ refactor: move select_blocks from policy to attention layer
Move block selection logic from compute_chunked_prefill/decode methods
to attention.py caller. This improves separation of concerns:
- attention.py now calls select_blocks() before compute_chunked_*()
- Policy methods receive pre-selected blocks via selected_blocks parameter
- Enables sparse policies to implement custom block selection without
modifying the compute path
Changes:
- policy.py: Add selected_blocks parameter to abstract methods
- full_policy.py: Remove internal select_blocks calls, use passed blocks
- xattn_bsa.py: Sync signatures for prefill/decode methods
- attention.py: Add select_blocks calls before policy delegation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 05:21:28 +08:00
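The call order described in a50b4c2ac2 can be illustrated with a minimal sketch. It is not the repository's code: only the names select_blocks, compute_chunked_prefill and selected_blocks come from the commit message. The class names, the attention_forward helper, the argument lists and the return values are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class SparsePolicy(ABC):
    """Hypothetical stand-in for the abstract policy in policy.py."""

    def select_blocks(self, available_blocks: List[int]) -> List[int]:
        # Assumed default: keep every block (dense attention).
        return list(available_blocks)

    @abstractmethod
    def compute_chunked_prefill(
        self, query: str, selected_blocks: List[int]
    ) -> Tuple[str, Tuple[int, ...]]:
        """Compute attention over blocks that the caller already selected."""


class FullPolicy(SparsePolicy):
    def compute_chunked_prefill(self, query, selected_blocks):
        # No select_blocks call in here any more: the blocks arrive
        # pre-selected. A tuple stands in for the real attention output.
        return (query, tuple(selected_blocks))


class RecentBlocksPolicy(FullPolicy):
    """Hypothetical sparse policy: keep only the most recent blocks."""

    def __init__(self, keep: int) -> None:
        self.keep = keep

    def select_blocks(self, available_blocks: List[int]) -> List[int]:
        # Only selection is overridden; the compute path is inherited as-is.
        start = max(0, len(available_blocks) - self.keep)
        return list(available_blocks)[start:]


def attention_forward(policy: SparsePolicy, query: str, available_blocks: List[int]):
    # Stand-in for the caller in attention.py: select first, then delegate.
    selected = policy.select_blocks(available_blocks)
    return policy.compute_chunked_prefill(query, selected)


dense = attention_forward(FullPolicy(), "q", [0, 1, 2, 3])
sparse = attention_forward(RecentBlocksPolicy(keep=2), "q", [0, 1, 2, 3])
```

In this sketch RecentBlocksPolicy changes which blocks are attended to without touching compute_chunked_prefill, which is the separation of concerns the commit message describes.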
Zijie Tian
ca32ea6f93
[WIP] Before refactoring compute_chunked_prefill.
2026-01-23 03:36:12 +08:00
Zijie Tian
999858e82f
feat: add xattn kernels test and update testing rules
- Add test_xattn_kernels.py demonstrating flat_group_gemm_fuse_reshape
and softmax_fuse_block_sum Triton kernels with structured data
- Update testing.md with new test code style guidelines
- Update xattn.py and xattn_bsa.py with improvements
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 03:01:25 +08:00
Zijie Tian
b97b0b96a0
[WIP] Before refactoring the nanovllm sparse policy.
2026-01-19 22:34:44 +08:00
Zijie Tian
b5da802dff
[WIP] Before integrating the xattn operator.
2026-01-19 21:19:21 +08:00