Commit Graph

  • 52b12a89e3 📋 docs: add changelog for 2026-02-05 tzj/minference Zijie Tian 2026-02-05 03:16:39 +08:00
  • d35dd76e09 🗑️ chore: clean up tests directory to essential files only Zijie Tian 2026-02-05 03:13:50 +08:00
  • 2b61c5ab57 🗑️ chore: remove test_needle* files Zijie Tian 2026-02-05 03:11:28 +08:00
  • a709551072 🗑️ chore: remove redundant XAttention test files Zijie Tian 2026-02-05 03:11:21 +08:00
  • 11a867f6fb 🐛 fix: skip GQA buffer allocation in XAttention offload mode Zijie Tian 2026-02-05 02:57:18 +08:00
  • af4da454ba 📊 docs: add XAttention offload profiling analysis for 32K context Zijie Tian 2026-02-05 02:37:00 +08:00
  • ef37d4f1a8 🐛 docs: document XAttention offload GQA buffer OOM issue Zijie Tian 2026-02-05 02:46:50 +08:00
  • c8a5ef04c0 📝 docs: add test_ruler.py usage guide and rule Zijie Tian 2026-02-05 02:46:44 +08:00
  • 1c36d53570 🙈 chore: add ralph-tui session file to gitignore Zijie Tian 2026-02-05 02:00:44 +08:00
  • 54fd302fa8 📝 docs: add XAttention density alignment verification results Zijie Tian 2026-02-05 01:59:11 +08:00
  • 1eb7521994 📝 docs: add XAttention density types documentation Zijie Tian 2026-02-05 01:44:11 +08:00
  • 51bd678335 📊 feat: distinguish compute density and communication density in DensityObserver Zijie Tian 2026-02-05 01:43:17 +08:00
  • 1ea5afd886 📝 docs: add XAttention offload stream sync fix documentation Zijie Tian 2026-02-05 01:32:50 +08:00
  • 829b311c02 🐛 fix: stream synchronization for XAttention estimate kernels in offload mode Zijie Tian 2026-02-05 01:30:23 +08:00
  • dd0472aea8 [plugin] Added ralph-tui setup. Zijie Tian 2026-02-05 01:27:53 +08:00
  • a1c68a733e 📊 docs: add XAttention memory benchmark for 24GB GPUs Zijie Tian 2026-02-02 14:38:27 +08:00
  • dc51972777 📝 docs: update density alignment test with Offload mode results Zijie Tian 2026-02-02 14:22:40 +08:00
  • 232fcf043e 📝 docs: add GPU-only density alignment test results Zijie Tian 2026-02-02 11:22:34 +08:00
  • aeed6ccdfb test: add GPU-only density alignment verification test Zijie Tian 2026-02-02 11:14:46 +08:00
  • 6c55c4d2a3 ♻️ refactor: rewrite select_blocks with 3-stage KV chunking algorithm Zijie Tian 2026-02-02 10:10:10 +08:00
  • 6e34efd58a 📝 docs: add storage overhead analysis and batch tests for KV chunking Zijie Tian 2026-02-01 19:22:36 +08:00
  • 5acd5558d6 feat: add KV chunking support for XAttention softmax kernels Zijie Tian 2026-02-01 18:53:26 +08:00
  • 193ef55d18 ♻️ refactor: use Q-chunked processing in xattn alignment test Zijie Tian 2026-02-01 18:08:15 +08:00
  • f173a3f7f5 test: add xattn_estimate vs low-level kernels alignment test Zijie Tian 2026-02-01 17:49:37 +08:00
  • 8035e4db3d 📝 docs: add XAttention KV chunking density test results Zijie Tian 2026-02-01 17:36:19 +08:00
  • 8ab53e7331 🚧 WIP: add DEBUG code for XAttention KV chunking density verification Zijie Tian 2026-02-01 17:33:23 +08:00
  • 2e96d1d97d WIP: Enhance sparse attention with density tracking and block selection improvements Zijie Tian 2026-01-31 14:48:23 +08:00
  • f6ac4ccdde feat: add DensityObserver for XAttention sparse attention density tracking Zijie Tian 2026-01-30 16:26:56 +08:00
  • 4484a1482c [refactor] Refactor the profile_offload.sh Zijie Tian 2026-01-29 08:39:34 +08:00
  • e436ec861f ⚙️ config: update test_ruler.py defaults Zijie Tian 2026-01-28 14:21:23 +08:00
  • 45efcf0db1 feat: add --dtype parameter to test_ruler.py Zijie Tian 2026-01-28 13:56:15 +08:00
  • e09a2a5b10 feat: add Qwen2/2.5 model support Zijie Tian 2026-01-28 13:44:32 +08:00
  • a239bfb40d 📚 docs: add new model integration guide Zijie Tian 2026-01-28 13:36:24 +08:00
  • 29e102720b 🐛 fix: support multiple EOS tokens for GLM-4 Zijie Tian 2026-01-28 13:23:53 +08:00
  • 726e4b58cf feat: add GLM-4-9B-Chat-1M model support Zijie Tian 2026-01-28 13:15:57 +08:00
  • 8d19e61446 ⚡ perf: replace Triton merge with FlashInfer merge_state Zijie Tian 2026-01-28 10:04:38 +08:00
  • 4484ebbb77 📚 docs: add 1M+ context length models reference list Zijie Tian 2026-01-28 09:04:55 +08:00
  • 2c2383c786 ⚡ perf: optimize XAttention estimate with hierarchical block sum Zijie Tian 2026-01-28 06:47:13 +08:00
  • f049971f84 test: add hierarchical block sum estimation validation Zijie Tian 2026-01-28 06:24:35 +08:00
  • c90dc196b2 📝 docs: add estimate block_size performance analysis Zijie Tian 2026-01-28 06:24:28 +08:00
  • 3da9b8aef2 ⚡ perf: optimize XAttention estimate phase with K-only loading Zijie Tian 2026-01-28 06:24:20 +08:00
  • a832d127b6 feat: add nsys-profiler agent for kernel performance analysis Zijie Tian 2026-01-28 06:24:09 +08:00
  • 39d12a0416 📈 feat: add MemoryObserver for GPU-CPU communication tracking Zijie Tian 2026-01-28 04:06:45 +08:00
  • c16bfcf40f ♻️ refactor: restructure Observer as base class with InferenceObserver Zijie Tian 2026-01-28 03:15:33 +08:00
  • f3e4611e3b 📝 docs: add XAttention performance analysis documentation Zijie Tian 2026-01-28 00:57:20 +08:00
  • 7b5d3b34eb 📈 feat: add NVTX markers to XAttention for profiling Zijie Tian 2026-01-28 00:57:11 +08:00
  • b760de84c5 feat: add context length and error handling to profile_offload.sh Zijie Tian 2026-01-28 00:28:37 +08:00
  • f81b5ae8a9 feat: enhance profile_offload.sh with policy, block-size parameters Zijie Tian 2026-01-27 23:23:20 +08:00
  • e874229adc 📝 docs: add comprehensive GPU-only vs Offload benchmark results Zijie Tian 2026-01-27 22:32:07 +08:00
  • 4fe7dfb239 🔀 merge: integrate tzj/minference-exp (GPU-only sparse attention) Zijie Tian 2026-01-27 09:25:36 +08:00
  • 9177b62d7f feat: add --enforce-eager option to bench.py Zijie Tian 2026-01-27 09:19:53 +08:00
  • 3956a30b14 🔧 chore: add --use-v1 flag to bench_vllm.py Zijie Tian 2026-01-27 09:14:55 +08:00
  • 59473fa432 🔧 chore: add configurable arguments to bench_vllm.py Zijie Tian 2026-01-27 09:07:49 +08:00
  • 4467e1f654 🔧 chore: add --block-size argument to bench_offload.py Zijie Tian 2026-01-27 09:07:44 +08:00
  • 0437311068 feat: add Phase 5 CUDA Graph optimization for chunked prefill Zijie Tian 2026-01-27 07:38:40 +08:00
  • 6da116de98 📝 docs: add GPU-Only XAttention guide with performance analysis Zijie Tian 2026-01-27 07:21:46 +08:00
  • f5682ca4a7 🔧 chore: add GPU-only profiling script Zijie Tian 2026-01-27 05:55:31 +08:00
  • a504bd873d perf: pre-allocate GQA buffers in XAttention policy Zijie Tian 2026-01-27 05:49:23 +08:00
  • 076656c9c2 feat: add GPU-only XAttention BSA sparse attention support Zijie Tian 2026-01-27 05:19:24 +08:00
  • b6b59b50ed 📝 docs: add sparse policy None constraint rule Zijie Tian 2026-01-27 05:08:08 +08:00
  • 09b2136e9f feat: integrate sparse policy architecture into GPU-only mode Zijie Tian 2026-01-27 05:08:02 +08:00
  • 0d31b3f71f 📝 docs: add CPU offload optimization strategies guide Zijie Tian 2026-01-27 04:44:36 +08:00
  • 05ce57ee8e 📝 docs: add GPU-only sparse policy integration baseline Zijie Tian 2026-01-27 04:36:31 +08:00
  • 94a6e06d79 📝 docs: add GPU VRAM requirement rule for GPU-only mode Zijie Tian 2026-01-27 04:36:24 +08:00
  • c717072f31 feat: add --model argument to bench.py for configurable model path Zijie Tian 2026-01-27 04:36:17 +08:00
  • 73c9dc46ff feat: add XAttention BSA support to bench_offload.py Zijie Tian 2026-01-27 04:20:16 +08:00
  • 924a0d2bfa 🔧 chore: add nsys profiling rule and update gitignore Zijie Tian 2026-01-27 03:42:17 +08:00
  • 0619accd1c 📝 docs: add CPU scheduling latency analysis for chunked attention Zijie Tian 2026-01-27 03:42:12 +08:00
  • 18bc433f09 perf: improve NVTX profiling with colored ranges and configurable slots Zijie Tian 2026-01-27 03:42:05 +08:00
  • aea3812230 ♻️ refactor: unify KV cache operations through OffloadEngine Zijie Tian 2026-01-27 02:20:59 +08:00
  • 3100724666 📝 docs: add nsys wrong event order bug investigation Zijie Tian 2026-01-24 04:32:05 +08:00
  • 78a44f3536 📝 docs: add GPU memory monitoring rule Zijie Tian 2026-01-24 01:41:25 +08:00
  • 7c41032a2e feat: add configurable stride and chunk_size for XAttention BSA Zijie Tian 2026-01-23 10:37:04 +08:00
  • f28b500120 🙈 chore: uncomment planning files in gitignore Zijie Tian 2026-01-23 09:43:46 +08:00
  • be67fa8060 🗑️ chore: remove temporary planning files Zijie Tian 2026-01-23 09:43:22 +08:00
  • 4f35526457 🔀 merge: integrate remote changes (exec-plan command, CUDA graph plan) Zijie Tian 2026-01-23 09:43:06 +08:00
  • da5e13e2bb 📝 docs: update XAttention BSA Policy with benchmarks and memory management Zijie Tian 2026-01-23 09:35:18 +08:00
  • dd31033732 🔧 chore: add gpu-monitor agent for memory leak debugging Zijie Tian 2026-01-23 09:33:15 +08:00
  • ed3c8bb4b8 🐛 fix: memory leak in XAttentionBSAPolicy select_blocks Zijie Tian 2026-01-23 09:30:18 +08:00
  • 5eb35982bf 🔧 feat: add density statistics tracking to sparse policies Zijie Tian 2026-01-23 08:53:22 +08:00
  • ad361c2c3b 📝 docs: add XAttention BSA Policy design documentation Zijie Tian 2026-01-23 08:36:56 +08:00
  • 4d1e40152d feat(xattn): implement compute_chunked_prefill with ring buffer pipeline Zijie Tian 2026-01-23 08:27:40 +08:00
  • 832b352afa feat(xattn): implement select_blocks with majority voting aggregation Zijie Tian 2026-01-23 08:19:05 +08:00
  • a50b4c2ac2 ♻️ refactor: move select_blocks from policy to attention layer Zijie Tian 2026-01-23 05:21:28 +08:00
  • ca32ea6f93 [WIP] Before refactor the compute)_chunked_prefill. Zijie Tian 2026-01-23 03:36:12 +08:00
  • edc006463b docs: add XAttention kernels guide Zijie Tian 2026-01-23 03:22:25 +08:00
  • 999858e82f feat: add xattn kernels test and update testing rules Zijie Tian 2026-01-23 03:01:25 +08:00
  • 5fb0f67295 [WIP] need refactor. tzj/layer-offload Zijie Tian 2026-01-22 22:20:34 +08:00
  • 69b779e252 📝 docs: add layer offload planning notes and task plan Zijie Tian 2026-01-22 06:04:36 +08:00
  • e313dd795a feat: add exec-plan command for automated task plan execution Zijie Tian 2026-01-22 02:23:12 +08:00
  • 9f3ee9279e feat: add nanovllm.ops module with XAttention estimation kernels Zijie Tian 2026-01-22 06:00:42 +08:00
  • 47d237bb7e feat: add exec-plan command for automated task plan execution Zijie Tian 2026-01-22 02:23:12 +08:00
  • a5307fb124 📝 docs: add CUDA Graph optimization plan for offload mode decode Zijie Tian 2026-01-22 02:12:24 +08:00
  • d808970f2f [WIP] Before implement the plan. Zijie Tian 2026-01-22 01:35:13 +08:00
  • bc92c1fdb8 feat: add xattn_estimate_chunked for chunked prefill support Zijie Tian 2026-01-22 01:13:17 +08:00
  • 2866d4fd88 feat: add chunk attention CUDA graph test for block sparse attention Zijie Tian 2026-01-22 00:57:05 +08:00
  • 5d722968ff [docs] Added cuda_graph_guide.md Zijie Tian 2026-01-21 21:56:24 +08:00
  • d21b40f48f [test] Added test_cudagraph_memory.py. Zijie Tian 2026-01-21 03:30:36 +08:00
  • 42cf124343 📝 docs: add CUDA Graph memory mechanism guide Zijie Tian 2026-01-21 02:59:21 +08:00
  • 78050aef9f 🐛 fix: resolve CPU KV cache state leakage between requests Zijie Tian 2026-01-21 01:12:21 +08:00