zijie-tian

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-28 04:05:09 +08:00

39d12a0416 📈 feat: add MemoryObserver for GPU-CPU communication tracking

c16bfcf40f ♻️ refactor: restructure Observer as base class with InferenceObserver

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-28 00:57:04 +08:00

f3e4611e3b 📝 docs: add XAttention performance analysis documentation

7b5d3b34eb 📈 feat: add NVTX markers to XAttention for profiling

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-28 00:32:50 +08:00

b760de84c5 ✨ feat: add context length and error handling to profile_offload.sh

f81b5ae8a9 ✨ feat: enhance profile_offload.sh with policy, block-size parameters

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-27 22:38:34 +08:00

e874229adc 📝 docs: add comprehensive GPU-only vs Offload benchmark results

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-27 09:24:17 +08:00

4fe7dfb239 🔀 merge: integrate tzj/minference-exp (GPU-only sparse attention)

9177b62d7f ✨ feat: add --enforce-eager option to bench.py

3956a30b14 🔧 chore: add --use-v1 flag to bench_vllm.py

59473fa432 🔧 chore: add configurable arguments to bench_vllm.py

4467e1f654 🔧 chore: add --block-size argument to bench_offload.py

14 commits in this push.

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-27 07:53:04 +08:00

0437311068 ⚡ feat: add Phase 5 CUDA Graph optimization for chunked prefill

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-27 07:36:36 +08:00

0d31b3f71f 📝 docs: add CPU offload optimization strategies guide

73c9dc46ff ✨ feat: add XAttention BSA support to bench_offload.py

924a0d2bfa 🔧 chore: add nsys profiling rule and update gitignore

0619accd1c 📝 docs: add CPU scheduling latency analysis for chunked attention

18bc433f09 ⚡ perf: improve NVTX profiling with colored ranges and configurable slots

6 commits in this push.

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-24 04:32:52 +08:00

3100724666 📝 docs: add nsys wrong event order bug investigation

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-24 01:44:49 +08:00

78a44f3536 📝 docs: add GPU memory monitoring rule

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-23 10:35:57 +08:00

7c41032a2e ✨ feat: add configurable stride and chunk_size for XAttention BSA

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-23 09:45:36 +08:00

f28b500120 🙈 chore: uncomment planning files in gitignore

be67fa8060 🗑️ chore: remove temporary planning files

4f35526457 🔀 merge: integrate remote changes (exec-plan command, CUDA graph plan)

da5e13e2bb 📝 docs: update XAttention BSA Policy with benchmarks and memory management

dd31033732 🔧 chore: add gpu-monitor agent for memory leak debugging

14 commits in this push.

zijie-tian pushed to tzj/layer-offload at zijie-tian/nano-vllm

2026-01-22 22:19:56 +08:00

5fb0f67295 [WIP] need refactor.

69b779e252 📝 docs: add layer offload planning notes and task plan

e313dd795a ✨ feat: add exec-plan command for automated task plan execution

9f3ee9279e ✨ feat: add nanovllm.ops module with XAttention estimation kernels

zijie-tian created branch tzj/layer-offload in zijie-tian/nano-vllm

2026-01-22 22:19:56 +08:00

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-22 03:15:15 +08:00

47d237bb7e ✨ feat: add exec-plan command for automated task plan execution

a5307fb124 📝 docs: add CUDA Graph optimization plan for offload mode decode

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-22 01:34:12 +08:00

d808970f2f [WIP] Before implement the plan.

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-22 01:33:25 +08:00

bc92c1fdb8 feat: add xattn_estimate_chunked for chunked prefill support

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-22 01:00:07 +08:00

2866d4fd88 ✨ feat: add chunk attention CUDA graph test for block sparse attention

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-21 21:56:27 +08:00

5d722968ff [docs] Added cuda_graph_guide.md

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-21 03:30:40 +08:00

d21b40f48f [test] Added test_cudagraph_memory.py.

zijie-tian pushed to tzj/minference at zijie-tian/nano-vllm

2026-01-21 03:27:41 +08:00

42cf124343 📝 docs: add CUDA Graph memory mechanism guide