nano-vllm

Author	SHA1	Message	Date
Zijie Tian	4cbd451af7	📝 docs: add BSA interface documentation and cleanup temp files - Add docs/block_sparse_attn_interface.md with BSA function signatures - Update CLAUDE.md documentation index - Remove obsolete DEBUG_SUMMARY.md and test_report_sparse_policy_refactor.md - Add notes.md to .gitignore Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 04:27:19 +08:00
Zijie Tian	3aef6fc3a2	✨ feat: add XAttention Triton operators for sparse attention estimation Port XAttention operators from COMPASS project: - flat_group_gemm_fuse_reshape: stride reshape GEMM kernel - softmax_fuse_block_sum: fused softmax with block-level summation - xattn_estimate: main estimation function for block sparse attention - find_blocks_chunked: cumulative threshold-based block selection Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 04:27:07 +08:00
Zijie Tian	690456dbf9	♻️ refactor: create ops module and move chunked_attention - Create nanovllm/ops/ module for low-level attention operators - Move chunked_attention.py from kvcache/ to ops/ - Update imports in full_policy.py (3 locations) - Fix: remove dead code in OffloadEngine.reset() referencing non-existent layer_k/v_buffer_a/b attributes Verified with needle test (32K offload): PASSED Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 02:50:14 +08:00
Zijie Tian	e440c45e73	📝 docs: add XAttention algorithm guide based on COMPASS implementation - Create docs/xattention_algorithm_guide.md with detailed algorithm explanation - Stride reshape (inverse mode) for Q/K interleaved sampling - Triton kernels: flat_group_gemm_fuse_reshape, softmax_fuse_block_sum - Block selection via find_blocks_chunked with cumulative threshold - BSA (block_sparse_attn) dependency for sparse computation - Update docs/sparse_attention_guide.md XAttention section with accurate description - Add documentation index entry in CLAUDE.md Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 02:50:03 +08:00
Zijie Tian	07f5220f40	Merge branch 'tzj/minference' of ssh://git.zijie-tian.site:2222/zijie-tian/nano-vllm into tzj/minference	2026-01-20 02:27:10 +08:00
Zijie Tian	37aecd4d52	📝 docs: add SparsePolicy implementation guide and update rules - Create docs/sparse_policy_implementation_guide.md with comprehensive guide - Rewrite .claude/rules/sparse-policy.md with mandatory base class requirements - Add new doc reference to CLAUDE.md documentation index Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 02:25:46 +08:00
Zijie Tian	b1f292cf22	Merge branch 'tzj/minference' of ssh://git.zijie-tian.site:2222/zijie-tian/nano-vllm into tzj/minference	2026-01-20 02:16:39 +08:00
Zijie Tian	16fbcf9e4c	docs: add RULER 32K chunked offload issue documentation - Document accuracy degradation issue in 32K context with chunked offload - Add detailed hypothesis analysis and debugging approach - Include 4-slot ring buffer experiment results Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 02:16:21 +08:00
Zijie Tian	fa7601f4b8	♻️ refactor: remove cross-layer pipeline and rename compute_chunked_prefill - Remove cross-layer pipeline from OffloadEngine (saves ~1GB GPU memory for long sequences) - Delete layer_k/v_buffer_a/b double buffers - Remove start_decode_pipeline, get_decode_layer_kv, end_decode_pipeline methods - Remove pipeline state tracking variables - Simplify decode to use ring buffer pipeline only (more efficient for long sequences) - Rename compute_chunked_attention → compute_chunked_prefill for clarity - Add mandatory needle test requirements: --enable-offload --input-len 32768 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 02:10:40 +08:00
Zijie Tian	6080bf7554	🙈 chore: exclude planning-with-files from git tracking - Add planning files (task_plan.md, findings.md, progress.md) to .gitignore - Remove existing planning files from git index (keep local) - Update planning-with-files rule with git management policy These temporary session files should not be version controlled. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 02:06:28 +08:00
Zijie Tian	e5a17c832c	📝 docs: add SparsePolicy architecture documentation Add comprehensive documentation for the SparsePolicy abstraction: - SparsePolicy base class and abstract methods - FullAttentionPolicy prefill/decode flow - Ring buffer and cross-layer pipeline modes - Code conventions and testing guidelines Update CLAUDE.md documentation index with reference. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 01:36:09 +08:00
Zijie Tian	4593f42ec3	♻️ refactor: migrate chunked decode attention to SparsePolicy Move decode attention computation from attention.py to SparsePolicy: - Add compute_chunked_decode abstract method to SparsePolicy base class - Implement compute_chunked_decode in FullAttentionPolicy with: - Ring buffer pipeline (_decode_ring_buffer_pipeline) - Cross-layer pipeline (_decode_with_layer_pipeline) - Decode buffer handling - Simplify _chunked_decode_attention to only validate and delegate - Remove _decode_ring_buffer_pipeline and _decode_with_layer_pipeline from attention.py - Add supports_decode check for policy validation This completes the SparsePolicy v5 refactoring where both prefill and decode paths now delegate all computation to the sparse policy. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 01:32:17 +08:00
Zijie Tian	a36f8569fc	[WIP] Before refactor.	2026-01-20 01:25:46 +08:00
Zijie Tian	d3b41b2f64	🔧 chore: clean up claude-flow configuration Remove unused claude-flow hooks, permissions, and daemon settings. Add disabled MCP servers list for claude-flow related servers. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 00:58:52 +08:00
Zijie Tian	baa4be7e2e	♻️ refactor: migrate chunked prefill attention to SparsePolicy Move all chunked prefill attention computation from attention.py to SparsePolicy.compute_chunked_attention(). This is the v4 architecture refactoring for sparse attention policies. Changes: - Add compute_chunked_attention abstract method to SparsePolicy base - Add offload_engine parameter to select_blocks for policies needing KV access during block selection - Implement compute_chunked_attention in FullAttentionPolicy with complete ring buffer pipeline logic - Simplify attention.py to delegate all chunked prefill to policy - Remove redundant _sync_load_previous_chunks and _ring_buffer_pipeline_load methods from Attention class Test: test_needle.py --enable-offload PASSED Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 00:58:46 +08:00
Zijie Tian	6783a45e6f	🚧 wip: update sparse policy refactoring plan to v4 Add clear acceptance criteria and verification methods: - Define 3 acceptance criteria (needle test, zero calc in attention.py, KV via offload_engine) - Document violations to fix (direct flash_attn/copy calls) - Add offload_engine.write_prefill_buffer encapsulation plan - Add LSP-based verification method using cclsp tools Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 23:23:16 +08:00
Zijie Tian	16b269d897	🚧 wip: update sparse policy refactoring plan to v4 Simplified scope to FullPolicy only. Added debug validation phase. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 23:10:49 +08:00
Zijie Tian	b97b0b96a0	[WIP] Before refactor the nanovllm sparse policy.	2026-01-19 22:34:44 +08:00
Zijie Tian	b5da802dff	[WIP] Before integrate the xattn operator.	2026-01-19 21:19:21 +08:00
Zijie Tian	9e6fdc0650	[WIP] Before plan execute.	2026-01-19 03:30:44 +08:00
Zijie Tian	50520a6c3c	[fix] fixed request to request error.	2026-01-19 00:55:26 +08:00
Zijie Tian	e6e0dc5d7d	✨ feat: add comprehensive RULER benchmark testing - Add test_ruler.py from tzj/vs_offload branch with 13 RULER tasks - Add comprehensive documentation for RULER benchmark results - Update CLAUDE.md with new documentation index entry - Add architecture, debugging, optimization, and known issues guides - Test 32K context with CPU offload: 92.3% accuracy across all tasks - Parallel execution on 4 GPUs with detailed performance metrics Benchmark results: - 13 RULER tasks total (niah_single, multikey, multiquery, multivalue, qa, cwe, fwe, vt) - 26 samples tested with 92.3% overall accuracy - CPU offload stable at 32K context length - Parallel GPU execution achieving 4x speedup Key findings: - Single needle tasks: 100% accuracy - Multi-value and recall tasks: 100% accuracy - Multi-query tasks: 50% accuracy (most challenging) - QA tasks: 100% accuracy - Total execution time: ~220 seconds (parallel)	2026-01-18 20:34:06 +08:00
Zijie Tian	0550a64339	feat: add dynamic port allocation from tzj/vs_offload - Import os and socket modules - Add _find_free_port() function for automatic port detection - Use NANOVLLM_DIST_PORT env var if set, otherwise auto-assign - Enables running multiple model instances without port conflicts Co-Authored-By: Claude <noreply@anthropic.com>	2026-01-18 19:51:56 +08:00
Zijie Tian	d9890aa2cd	chore: add Block-SparseAttention submodule from tzj/vs_offload	2026-01-18 19:22:40 +08:00
Zijie Tian	5a837c8c83	chore: update .gitignore with tzj/vs_offload configuration - Add Claude Flow generated files ignore patterns - Add test data directory ignore - Add Serena MCP tool config ignore - Add Windows wrapper files ignore These configurations improve development workflow by excluding temporary and generated files from version control.	2026-01-18 18:59:17 +08:00
Zijie Tian	d1bbb7efe2	chore: update claude configuration and rules from tzj/vs_offload - Add /sc:git command with smart commit functionality - Add /sc:ultra-think command for deep thinking - Update .claude/rules/ with improved documentation: - commands.md: command usage guidelines - doc-management.md: documentation policy - no-extra-docs.md: documentation creation policy - gpu-testing.md: GPU type detection and testing rules - Update .claude/settings.json with claude-flow MCP configuration These improvements provide a better development experience and tool support.	2026-01-18 18:56:49 +08:00
Zijie Tian	1a78ae74d5	feat: add claude-flow MCP configuration Add .claude/settings.json to enable claude-flow MCP in all worktrees. This configuration includes: - SessionStart hook to auto-start claude-flow daemon - Auto-approval for claude-flow MCP tools and CLI commands - Basic claude-flow settings Co-Authored-By: Claude <noreply@anthropic.com>	2026-01-18 18:55:56 +08:00
Zijie Tian	c254c8c330	chore: add planning-with-files rule configuration	2026-01-18 18:55:55 +08:00
Zijie Tian	03a8c033cb	[claudesquad] update from 'add-llama-1' on 10 Jan 26 21:03 CST	2026-01-10 21:03:45 +08:00
Zijie Tian	6575099a06	[refactor] Cleanup unused code after perf_opt merge Removed ~460 lines of unused/redundant code from offload_engine.py: - CUDA gather methods (gathered_h2d_*, update_gather_indices) - Legacy async transfer methods (prefetch_block_async, offload_block_async) - Legacy sync/wait methods (wait_for_block, wait_all_transfers, sync_indices) - Legacy compatibility methods (load_to_compute_layer, wait_compute_layer) - Unused gather_indices tensors and memory calculations Updated class docstring to reflect current architecture. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-07 06:25:21 +08:00
Zijie Tian	8fd25d72d7	Merge perf_opt-1 and perf_opt-2 branches Combines two performance optimization features: - perf_opt-1: Cross-layer pipeline for decode (double-buffered layer cache) - perf_opt-2: Per-layer prefill buffer for async offload Both features are complementary and improve CPU offload performance. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-07 06:03:44 +08:00
Zijie Tian	ccf27d3a74	[claudesquad] update from 'perf_opt-1' on 07 Jan 26 05:58 CST	2026-01-07 05:58:23 +08:00
Zijie Tian	0ad86eb449	[claudesquad] update from 'perf_opt-2' on 07 Jan 26 05:58 CST	2026-01-07 05:58:10 +08:00
Zijie Tian	aa953ecb59	[refactor] Aligned the bench.	2026-01-07 04:25:06 +08:00
Zijie Tian	362f5e575f	[fix] Fixed .gitignore files.	2026-01-07 03:32:14 +08:00
Zijie Tian	58a06501c1	Merge branch 'zijie/debug_chunk-2' into tzj/minference	2026-01-07 03:30:38 +08:00
Zijie Tian	2a6e0a2c02	[feat] Added Quest Sparsity Policy.	2026-01-07 03:29:21 +08:00
Zijie Tian	2fe50bab50	[claudesquad] update from 'debug_chunk-2' on 07 Jan 26 03:27 CST	2026-01-07 03:27:27 +08:00
Zijie Tian	c99a6f3d3f	[WIP] Before add Quest policy.	2026-01-07 02:32:30 +08:00
Zijie Tian	f240903013	[docs] Add GPU mutex instructions for multi-instance debugging Add instructions for Claude instances to check GPU availability before running CUDA operations, preventing conflicts when multiple instances debug in parallel on a single GPU. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-07 01:42:59 +08:00
Zijie Tian	0e691f2d85	[WIP] move metadata to GPU.	2026-01-06 23:32:32 +08:00
Zijie Tian	edb5273e34	[WIP] Added basic test for quest.	2026-01-06 22:30:31 +08:00
Zijie Tian	690492e074	[WIP] Before refactor policies.	2026-01-06 20:47:55 +08:00
Zijie Tian	7cc8a394a5	[fix] Fixed bench_offload.py, BUT performance DEGRADED.	2026-01-06 18:46:48 +08:00
Zijie Tian	535f2037ab	[WIP] Before fix bench_offload.py.	2026-01-06 18:41:08 +08:00
Zijie Tian	c7ac39dfbd	[refactor] Before adding sparse policy.	2026-01-05 21:19:24 +08:00
Zijie Tian	e554d5482b	[refactor] Delete unnecessary test, and refactor the offload prefix cache.	2026-01-05 20:31:42 +08:00
Zijie Tian	247c5312d9	[fix] Fixed decode misalign.	2026-01-05 19:00:44 +08:00
Zijie Tian	054aaff403	[fix] Fixed needle test bug.	2026-01-05 18:34:09 +08:00
Zijie Tian	d623043a3c	[WIP] FIXED decode and prefill NEEDLE test.	2026-01-05 01:51:46 +08:00

248 Commits