Commit Graph

7 Commits

Author SHA1 Message Date
Zijie Tian
4484a1482c [refactor] Refactor the profile_offload.sh 2026-01-29 08:39:34 +08:00
Zijie Tian
b760de84c5 feat: add context length and error handling to profile_offload.sh
- Add --ctx-len parameter (32k/64k/128k) for context length selection
- Auto-configure max-model-len and data-dir based on context length
- Add error handling to delete .nsys-rep file on test failure
- Remove set -e to allow proper error handling
- Update output filename format to include context length

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-28 00:28:37 +08:00
Zijie Tian
f81b5ae8a9 feat: enhance profile_offload.sh with policy, block-size parameters
- Add --policy parameter for sparse attention policy selection (full/xattn)
- Add --block-size parameter (default 4096) for KV cache block size
- Add --gpu-util parameter for GPU memory utilization control
- Improve output filename format: <policy>_<gpuonly|offload>_blk<size>_<timestamp>
- Map user-friendly policy names to internal enum (xattn -> XATTN_BSA)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 23:23:20 +08:00
Zijie Tian
f5682ca4a7 🔧 chore: add GPU-only profiling script
Add scripts/profile.sh for nsys profiling of GPU-only mode benchmarks.

Usage:
  bash scripts/profile.sh                    # Default: 32K xattn prefill
  bash scripts/profile.sh --max-len 65536 --gpu-util 0.7
  bash scripts/profile.sh --policy full
  bash scripts/profile.sh --bench-decode

Output: results/nsys/bench_<policy>_<len>_<mode>_<timestamp>.nsys-rep

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 05:55:31 +08:00
Zijie Tian
18bc433f09 perf: improve NVTX profiling with colored ranges and configurable slots
- Switch from torch.cuda.nvtx to nvtx package for colored range support
- Add color coding: blue for H2D, green for D2H decode, orange for D2H prefill
- Add --num-gpu-blocks parameter to profile_offload.sh
- Include slot count in output filename for easier comparison

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
2026-01-27 03:42:05 +08:00
Zijie Tian
3100724666 📝 docs: add nsys wrong event order bug investigation
- Document ring buffer pipeline triggering nsys timestamp bug
- Update profile_offload.sh to use test_ruler.py with options
- Add reference to new doc in CLAUDE.md

Root cause: 4-slot ring buffer pipeline (4 transfer streams +
1 compute stream) triggers event ordering bug in nsys < 2024.2

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 04:32:05 +08:00
Zijie Tian
6ec1b23982 [WIP] NEED to modify communication. 2025-12-24 21:57:51 +08:00