📝 docs: add SparsePolicy architecture documentation

Add comprehensive documentation for the SparsePolicy abstraction:

- SparsePolicy base class and abstract methods
- FullAttentionPolicy prefill/decode flow
- Ring buffer and cross-layer pipeline modes
- Code conventions and testing guidelines

Update CLAUDE.md documentation index with reference.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 01:36:09 +08:00
parent 4593f42ec3
commit e5a17c832c
2 changed files with 325 additions and 0 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -11,6 +11,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
 | Document | Purpose |
 |----------|---------|
 | [`docs/architecture_guide.md`](docs/architecture_guide.md) | Core components, CPU offload system design, ring buffer architecture, stream configuration |
+| [`docs/sparse_policy_architecture.md`](docs/sparse_policy_architecture.md) | SparsePolicy abstraction: prefill/decode delegation, pipeline modes, policy implementations |
 | [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md) | Block sparse attention methods (XAttention, FlexPrefill, MInference, AvgPool, Quest), computation flow, algorithms |
 | [`docs/debugging_guide.md`](docs/debugging_guide.md) | PyTorch hooks for debugging, hook positions, tensor comparison, memory profiling |
 | [`docs/optimization_guide.md`](docs/optimization_guide.md) | Performance optimizations: sgDMA (15x), Triton merge (4.3x), N-way pipeline (2x) |