nano-vllm

zijie-tian/nano-vllm

52b12a89e3 📋 docs: add changelog for 2026-02-05 tzj/minference Zijie Tian 2026-02-05 03:16:39 +08:00
d35dd76e09 🗑️ chore: clean up tests directory to essential files only Zijie Tian 2026-02-05 03:13:50 +08:00
2b61c5ab57 🗑️ chore: remove test_needle* files Zijie Tian 2026-02-05 03:11:28 +08:00
a709551072 🗑️ chore: remove redundant XAttention test files Zijie Tian 2026-02-05 03:11:21 +08:00
11a867f6fb 🐛 fix: skip GQA buffer allocation in XAttention offload mode Zijie Tian 2026-02-05 02:57:18 +08:00
af4da454ba 📊 docs: add XAttention offload profiling analysis for 32K context Zijie Tian 2026-02-05 02:37:00 +08:00
ef37d4f1a8 🐛 docs: document XAttention offload GQA buffer OOM issue Zijie Tian 2026-02-05 02:46:50 +08:00
c8a5ef04c0 📝 docs: add test_ruler.py usage guide and rule Zijie Tian 2026-02-05 02:46:44 +08:00
1c36d53570 🙈 chore: add ralph-tui session file to gitignore Zijie Tian 2026-02-05 02:00:44 +08:00
54fd302fa8 📝 docs: add XAttention density alignment verification results Zijie Tian 2026-02-05 01:59:11 +08:00
1eb7521994 📝 docs: add XAttention density types documentation Zijie Tian 2026-02-05 01:44:11 +08:00
51bd678335 📊 feat: distinguish compute density and communication density in DensityObserver Zijie Tian 2026-02-05 01:43:17 +08:00
1ea5afd886 📝 docs: add XAttention offload stream sync fix documentation Zijie Tian 2026-02-05 01:32:50 +08:00
829b311c02 🐛 fix: stream synchronization for XAttention estimate kernels in offload mode Zijie Tian 2026-02-05 01:30:23 +08:00
dd0472aea8 [plugin] Added ralph-tui setup. Zijie Tian 2026-02-05 01:27:53 +08:00
a1c68a733e 📊 docs: add XAttention memory benchmark for 24GB GPUs Zijie Tian 2026-02-02 14:38:27 +08:00
dc51972777 📝 docs: update density alignment test with Offload mode results Zijie Tian 2026-02-02 14:22:40 +08:00
232fcf043e 📝 docs: add GPU-only density alignment test results Zijie Tian 2026-02-02 11:22:34 +08:00
aeed6ccdfb ✅ test: add GPU-only density alignment verification test Zijie Tian 2026-02-02 11:14:46 +08:00
6c55c4d2a3 ♻️ refactor: rewrite select_blocks with 3-stage KV chunking algorithm Zijie Tian 2026-02-02 10:10:10 +08:00
6e34efd58a 📝 docs: add storage overhead analysis and batch tests for KV chunking Zijie Tian 2026-02-01 19:22:36 +08:00
5acd5558d6 feat: add KV chunking support for XAttention softmax kernels Zijie Tian 2026-02-01 18:53:26 +08:00
193ef55d18 ♻️ refactor: use Q-chunked processing in xattn alignment test Zijie Tian 2026-02-01 18:08:15 +08:00
f173a3f7f5 ✅ test: add xattn_estimate vs low-level kernels alignment test Zijie Tian 2026-02-01 17:49:37 +08:00
8035e4db3d 📝 docs: add XAttention KV chunking density test results Zijie Tian 2026-02-01 17:36:19 +08:00
8ab53e7331 🚧 WIP: add DEBUG code for XAttention KV chunking density verification Zijie Tian 2026-02-01 17:33:23 +08:00
2e96d1d97d WIP: Enhance sparse attention with density tracking and block selection improvements Zijie Tian 2026-01-31 14:48:23 +08:00
f6ac4ccdde ✨ feat: add DensityObserver for XAttention sparse attention density tracking Zijie Tian 2026-01-30 16:26:56 +08:00
4484a1482c [refactor] Refactor the profile_offload.sh Zijie Tian 2026-01-29 08:39:34 +08:00
e436ec861f ⚙️ config: update test_ruler.py defaults Zijie Tian 2026-01-28 14:21:23 +08:00
45efcf0db1 ✨ feat: add --dtype parameter to test_ruler.py Zijie Tian 2026-01-28 13:56:15 +08:00
e09a2a5b10 ✨ feat: add Qwen2/2.5 model support Zijie Tian 2026-01-28 13:44:32 +08:00
a239bfb40d 📚 docs: add new model integration guide Zijie Tian 2026-01-28 13:36:24 +08:00
29e102720b 🐛 fix: support multiple EOS tokens for GLM-4 Zijie Tian 2026-01-28 13:23:53 +08:00
726e4b58cf ✨ feat: add GLM-4-9B-Chat-1M model support Zijie Tian 2026-01-28 13:15:57 +08:00
8d19e61446 ⚡️ perf: replace Triton merge with FlashInfer merge_state Zijie Tian 2026-01-28 10:04:38 +08:00
4484ebbb77 📚 docs: add 1M+ context length models reference list Zijie Tian 2026-01-28 09:04:55 +08:00
2c2383c786 ⚡️ perf: optimize XAttention estimate with hierarchical block sum Zijie Tian 2026-01-28 06:47:13 +08:00
f049971f84 ✅ test: add hierarchical block sum estimation validation Zijie Tian 2026-01-28 06:24:35 +08:00
c90dc196b2 📝 docs: add estimate block_size performance analysis Zijie Tian 2026-01-28 06:24:28 +08:00
3da9b8aef2 ⚡️ perf: optimize XAttention estimate phase with K-only loading Zijie Tian 2026-01-28 06:24:20 +08:00
a832d127b6 ✨ feat: add nsys-profiler agent for kernel performance analysis Zijie Tian 2026-01-28 06:24:09 +08:00
39d12a0416 📈 feat: add MemoryObserver for GPU-CPU communication tracking Zijie Tian 2026-01-28 04:06:45 +08:00
c16bfcf40f ♻️ refactor: restructure Observer as base class with InferenceObserver Zijie Tian 2026-01-28 03:15:33 +08:00
f3e4611e3b 📝 docs: add XAttention performance analysis documentation Zijie Tian 2026-01-28 00:57:20 +08:00
7b5d3b34eb 📈 feat: add NVTX markers to XAttention for profiling Zijie Tian 2026-01-28 00:57:11 +08:00
b760de84c5 ✨ feat: add context length and error handling to profile_offload.sh Zijie Tian 2026-01-28 00:28:37 +08:00
f81b5ae8a9 ✨ feat: enhance profile_offload.sh with policy, block-size parameters Zijie Tian 2026-01-27 23:23:20 +08:00
e874229adc 📝 docs: add comprehensive GPU-only vs Offload benchmark results Zijie Tian 2026-01-27 22:32:07 +08:00
4fe7dfb239 🔀 merge: integrate tzj/minference-exp (GPU-only sparse attention) Zijie Tian 2026-01-27 09:25:36 +08:00
9177b62d7f ✨ feat: add --enforce-eager option to bench.py Zijie Tian 2026-01-27 09:19:53 +08:00
3956a30b14 🔧 chore: add --use-v1 flag to bench_vllm.py Zijie Tian 2026-01-27 09:14:55 +08:00
59473fa432 🔧 chore: add configurable arguments to bench_vllm.py Zijie Tian 2026-01-27 09:07:49 +08:00
4467e1f654 🔧 chore: add --block-size argument to bench_offload.py Zijie Tian 2026-01-27 09:07:44 +08:00
0437311068 ⚡ feat: add Phase 5 CUDA Graph optimization for chunked prefill Zijie Tian 2026-01-27 07:38:40 +08:00
6da116de98 📝 docs: add GPU-Only XAttention guide with performance analysis Zijie Tian 2026-01-27 07:21:46 +08:00
f5682ca4a7 🔧 chore: add GPU-only profiling script Zijie Tian 2026-01-27 05:55:31 +08:00
a504bd873d ⚡ perf: pre-allocate GQA buffers in XAttention policy Zijie Tian 2026-01-27 05:49:23 +08:00
076656c9c2 ✨ feat: add GPU-only XAttention BSA sparse attention support Zijie Tian 2026-01-27 05:19:24 +08:00
b6b59b50ed 📝 docs: add sparse policy None constraint rule Zijie Tian 2026-01-27 05:08:08 +08:00
09b2136e9f ✨ feat: integrate sparse policy architecture into GPU-only mode Zijie Tian 2026-01-27 05:08:02 +08:00
0d31b3f71f 📝 docs: add CPU offload optimization strategies guide Zijie Tian 2026-01-27 04:44:36 +08:00
05ce57ee8e 📝 docs: add GPU-only sparse policy integration baseline Zijie Tian 2026-01-27 04:36:31 +08:00
94a6e06d79 📝 docs: add GPU VRAM requirement rule for GPU-only mode Zijie Tian 2026-01-27 04:36:24 +08:00
c717072f31 ✨ feat: add --model argument to bench.py for configurable model path Zijie Tian 2026-01-27 04:36:17 +08:00
73c9dc46ff ✨ feat: add XAttention BSA support to bench_offload.py Zijie Tian 2026-01-27 04:20:16 +08:00
924a0d2bfa 🔧 chore: add nsys profiling rule and update gitignore Zijie Tian 2026-01-27 03:42:17 +08:00
0619accd1c 📝 docs: add CPU scheduling latency analysis for chunked attention Zijie Tian 2026-01-27 03:42:12 +08:00
18bc433f09 ⚡ perf: improve NVTX profiling with colored ranges and configurable slots Zijie Tian 2026-01-27 03:42:05 +08:00
aea3812230 ♻️ refactor: unify KV cache operations through OffloadEngine Zijie Tian 2026-01-27 02:20:59 +08:00
3100724666 📝 docs: add nsys wrong event order bug investigation Zijie Tian 2026-01-24 04:32:05 +08:00
78a44f3536 📝 docs: add GPU memory monitoring rule Zijie Tian 2026-01-24 01:41:25 +08:00
7c41032a2e ✨ feat: add configurable stride and chunk_size for XAttention BSA Zijie Tian 2026-01-23 10:37:04 +08:00
f28b500120 🙈 chore: uncomment planning files in gitignore Zijie Tian 2026-01-23 09:43:46 +08:00
be67fa8060 🗑️ chore: remove temporary planning files Zijie Tian 2026-01-23 09:43:22 +08:00
4f35526457 🔀 merge: integrate remote changes (exec-plan command, CUDA graph plan) Zijie Tian 2026-01-23 09:43:06 +08:00
da5e13e2bb 📝 docs: update XAttention BSA Policy with benchmarks and memory management Zijie Tian 2026-01-23 09:35:18 +08:00
dd31033732 🔧 chore: add gpu-monitor agent for memory leak debugging Zijie Tian 2026-01-23 09:33:15 +08:00
ed3c8bb4b8 🐛 fix: memory leak in XAttentionBSAPolicy select_blocks Zijie Tian 2026-01-23 09:30:18 +08:00
5eb35982bf 🔧 feat: add density statistics tracking to sparse policies Zijie Tian 2026-01-23 08:53:22 +08:00
ad361c2c3b 📝 docs: add XAttention BSA Policy design documentation Zijie Tian 2026-01-23 08:36:56 +08:00
4d1e40152d ✨ feat(xattn): implement compute_chunked_prefill with ring buffer pipeline Zijie Tian 2026-01-23 08:27:40 +08:00
832b352afa ✨ feat(xattn): implement select_blocks with majority voting aggregation Zijie Tian 2026-01-23 08:19:05 +08:00
a50b4c2ac2 ♻️ refactor: move select_blocks from policy to attention layer Zijie Tian 2026-01-23 05:21:28 +08:00
ca32ea6f93 [WIP] Before refactor the compute_chunked_prefill. Zijie Tian 2026-01-23 03:36:12 +08:00
edc006463b docs: add XAttention kernels guide Zijie Tian 2026-01-23 03:22:25 +08:00
999858e82f feat: add xattn kernels test and update testing rules Zijie Tian 2026-01-23 03:01:25 +08:00
5fb0f67295 [WIP] need refactor. tzj/layer-offload Zijie Tian 2026-01-22 22:20:34 +08:00
69b779e252 📝 docs: add layer offload planning notes and task plan Zijie Tian 2026-01-22 06:04:36 +08:00
e313dd795a ✨ feat: add exec-plan command for automated task plan execution Zijie Tian 2026-01-22 02:23:12 +08:00
9f3ee9279e ✨ feat: add nanovllm.ops module with XAttention estimation kernels Zijie Tian 2026-01-22 06:00:42 +08:00
47d237bb7e ✨ feat: add exec-plan command for automated task plan execution Zijie Tian 2026-01-22 02:23:12 +08:00
a5307fb124 📝 docs: add CUDA Graph optimization plan for offload mode decode Zijie Tian 2026-01-22 02:12:24 +08:00
d808970f2f [WIP] Before implement the plan. Zijie Tian 2026-01-22 01:35:13 +08:00
bc92c1fdb8 feat: add xattn_estimate_chunked for chunked prefill support Zijie Tian 2026-01-22 01:13:17 +08:00
2866d4fd88 ✨ feat: add chunk attention CUDA graph test for block sparse attention Zijie Tian 2026-01-22 00:57:05 +08:00
5d722968ff [docs] Added cuda_graph_guide.md Zijie Tian 2026-01-21 21:56:24 +08:00
d21b40f48f [test] Added test_cudagraph_memory.py. Zijie Tian 2026-01-21 03:30:36 +08:00
42cf124343 📝 docs: add CUDA Graph memory mechanism guide Zijie Tian 2026-01-21 02:59:21 +08:00
78050aef9f 🐛 fix: resolve CPU KV cache state leakage between requests Zijie Tian 2026-01-21 01:12:21 +08:00
