- Add compute_prefill() and compute_decode() GPU-only methods to SparsePolicy base class
- Implement GPU-only methods in FullAttentionPolicy using flash_attn
- Add sparse_policy parameter to GPUOnlyManager
- Update create_kvcache_manager() to create FullAttentionPolicy for GPU-only mode
- Route GPU-only attention through sparse_policy in attention.py
- Pass kvcache_manager to context for policy access
- Add --enable-policy flag to bench.py for testing
- Handle warmup phase when kvcache_manager is not yet allocated
This allows GPU-only mode to use the same policy architecture as CPU offload mode,
enabling future sparse attention implementations (Quest, XAttention) in GPU-only mode.
Performance verified: ~4890 tok/s (unchanged from baseline)
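
The policy routing described above can be illustrated with a minimal, CPU-only sketch. This is not the repository's code: the real FullAttentionPolicy calls flash_attn on GPU tensors, and the method signatures, the `_attend` helper, and the `run_attention` dispatcher below are assumptions made for illustration only.

```python
import math
from abc import ABC, abstractmethod
from typing import List

Vector = List[float]


class SparsePolicy(ABC):
    """Hypothetical stand-in for the policy base class (signatures assumed)."""

    @abstractmethod
    def compute_prefill(self, q: List[Vector], k: List[Vector], v: List[Vector]) -> List[Vector]:
        """Attention for all prompt tokens at once."""

    @abstractmethod
    def compute_decode(self, q: Vector, k: List[Vector], v: List[Vector]) -> Vector:
        """Attention for a single new token against the cached keys/values."""


def _attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    # Plain scaled dot-product attention; stands in for the flash_attn kernel.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(a * b for a, b in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * val[d] for w, val in zip(weights, values)) / total for d in range(dim)]


class FullAttentionPolicy(SparsePolicy):
    """Dense policy: every token attends to all tokens it is allowed to see."""

    def compute_prefill(self, q, k, v):
        # Causal mask: token i attends to positions 0..i.
        return [_attend(q[i], k[: i + 1], v[: i + 1]) for i in range(len(q))]

    def compute_decode(self, q, k, v):
        return _attend(q, k, v)


def run_attention(policy: SparsePolicy, q, k, v, is_prefill: bool):
    # Hypothetical dispatcher mirroring "route attention through sparse_policy".
    if is_prefill:
        return policy.compute_prefill(q, k, v)
    return policy.compute_decode(q[-1], k, v)
```

Under this shape, a sparse policy such as Quest or XAttention would only need to override the two methods; the dispatcher stays unchanged.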
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
- Remove cross-layer pipeline from OffloadEngine (saves ~1GB GPU memory for long sequences)
- Delete layer_k/v_buffer_a/b double buffers
- Remove start_decode_pipeline, get_decode_layer_kv, end_decode_pipeline methods
- Remove pipeline state tracking variables
- Simplify decode to use ring buffer pipeline only (more efficient for long sequences)
- Rename compute_chunked_attention → compute_chunked_prefill for clarity
- Add mandatory needle test requirements: --enable-offload --input-len 32768
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Import os and socket modules
- Add _find_free_port() function for automatic port detection
- Use NANOVLLM_DIST_PORT env var if set, otherwise auto-assign
- Allow multiple model instances to run side by side without port conflicts
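
A minimal sketch of the port selection described above, assuming the common "bind to port 0" technique. Only `_find_free_port()` and `NANOVLLM_DIST_PORT` are named in this commit; the `resolve_dist_port()` wrapper and the loopback bind address are hypothetical and may differ from the actual implementation.

```python
import os
import socket


def _find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]


def resolve_dist_port() -> int:
    """Hypothetical helper: prefer NANOVLLM_DIST_PORT, else auto-assign."""
    env_port = os.environ.get("NANOVLLM_DIST_PORT")
    if env_port:
        return int(env_port)
    return _find_free_port()
```

The socket is closed before the port is reused, so another process could in principle claim it in between; setting NANOVLLM_DIST_PORT explicitly avoids that race.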
Co-Authored-By: Claude <noreply@anthropic.com>
Combines two performance optimization features:
- perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache)
- perf_opt-2: Per-layer prefill buffer for async offload
The two features are complementary, and both improve CPU offload performance.
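
The double-buffering idea behind perf_opt-1 can be sketched as follows. This is a sequential simulation under stated assumptions, not the actual pipeline: the function name and the `load_layer` / `compute_layer` callbacks are hypothetical, and a real implementation would issue the next layer's load asynchronously (for example on a separate CUDA stream) so that it overlaps with the current layer's compute.

```python
def decode_with_double_buffer(num_layers, load_layer, compute_layer):
    """Alternate between two buffers: compute on one while the other is filled."""
    if num_layers <= 0:
        return []
    buffers = [None, None]
    buffers[0] = load_layer(0)  # prime the first buffer before the loop
    outputs = []
    for layer in range(num_layers):
        current = layer % 2
        if layer + 1 < num_layers:
            # Prefetch the next layer's KV into the idle buffer.
            buffers[1 - current] = load_layer(layer + 1)
        outputs.append(compute_layer(layer, buffers[current]))
    return outputs
```

Only two layer-sized buffers are live at any time, regardless of how many layers the model has.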
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>