📝 docs: add nsys wrong event order bug investigation

- Document ring buffer pipeline triggering nsys timestamp bug
- Update profile_offload.sh to use test_ruler.py with options
- Add reference to new doc in CLAUDE.md

Root cause: 4-slot ring buffer pipeline (4 transfer streams +
1 compute stream) triggers event ordering bug in nsys < 2024.2

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Author: Zijie Tian
Date: 2026-01-24 04:32:05 +08:00
Parent: 78a44f3536
Commit: 3100724666
3 changed files with 284 additions and 12 deletions

@@ -25,6 +25,7 @@ Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline L
| [`docs/ruler_benchmark_results_32k.md`](docs/ruler_benchmark_results_32k.md) | RULER benchmark results (32K context): 13 tasks, 92.3% accuracy, CPU offload performance |
| [`docs/ruler_32k_chunked_offload_issue.md`](docs/ruler_32k_chunked_offload_issue.md) | ⚠️ OPEN ISSUE: 32K chunked offload accuracy problem (20% error rate in RULER) |
| [`docs/chunked_attention_solutions.md`](docs/chunked_attention_solutions.md) | 🔧 SOLUTIONS: Code analysis and fixes for the chunked attention accuracy issue |
| [`docs/nsys_wrong_event_order_bug.md`](docs/nsys_wrong_event_order_bug.md) | 🐛 NSYS BUG: Debugging notes on the nsys out-of-order timestamp issue triggered by the ring buffer pipeline |
## Rules Index

@@ -0,0 +1,210 @@
# Nsys "Wrong Event Order" Bug Investigation Notes
## Problem Description
When profiling nanovllm's CPU offload mode with `nsys profile`, no `.nsys-rep` file is produced; the import fails with:
```
Importer error status: Importation failed.
Wrong event order has been detected when adding events to the collection:
new event ={ StartNs=21569539222 StopNs=21569672388 ... Type=48 }
last event ={ StartNs=22046804077 StopNs=22046805343 ... Type=48 }
```
## Environment
- **nsys version**: 2023.4.4.54-234433681190v0
- **CUDA**: 12.4
- **Status**: known nsys bug, fixed in 2024.2 and later
## Debugging Process
### Stage 1: Identify the Trigger
Step through the stages with the bisect script (`tests/test_nsys_bisect.py`):
| Stage | Description | Result |
|-------|-------------|--------|
| 1 | CUDA init | ✅ |
| 2 | Import nanovllm | ✅ |
| 3 | Create LLM (offload) | ✅ |
| 4 | Short prompt generation | ✅ |
| **5** | **Long prompt (~64K) prefill** | ❌ |
**Conclusion**: the problem lies in the chunked prefill path for long prompts.
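The staged-bisect pattern can be sketched as follows. This is a minimal illustration only: the stage bodies are placeholders, and `run_up_to` is a hypothetical helper, not the real contents of `tests/test_nsys_bisect.py`, which executes actual CUDA work at each stage.

```python
# Minimal sketch of a staged bisect driver (illustrative; the real script
# runs actual CUDA workloads so each stage adds one more component
# under the nsys profile).

STAGES = {
    1: "CUDA init",
    2: "Import nanovllm",
    3: "Create LLM (offload)",
    4: "Short prompt generation",
    5: "Long prompt (~64K) prefill",
}

def run_up_to(stage: int) -> list[str]:
    """Run every stage up to and including `stage`, in order, so the first
    failing stage pinpoints the component that breaks the nsys import."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    executed = []
    for s in range(1, stage + 1):
        executed.append(STAGES[s])  # real code: execute the stage's workload
    return executed
```

Running `--stage 4` imports cleanly under nsys while `--stage 5` fails, which isolates the long-prompt prefill as the trigger.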
### Stage 2: Narrow Down the Component
Comment out code step by step inside the `_chunked_prefill_attention` method:
| Component | Location | Result |
|-----------|----------|--------|
| Entire method (return zeros) | `attention.py:167` | ✅ |
| `select_blocks()` | `attention.py:217` | ✅ |
| `offload_prefill_buffer_async()` | `attention.py:241-248` | ✅ |
| `compute_chunked_prefill()` | `attention.py:225-235` | ❌ |
**Conclusion**: the problem is inside `compute_chunked_prefill`.
### Stage 3: Isolate the Ring Buffer Pipeline
Narrow down further within `full_policy.py`:
| Component | Lines | Result |
|-----------|-------|--------|
| Current chunk attention | 191-198 | ✅ |
| **Historical block loading (ring buffer)** | 133-189 | ❌ |
**Root cause confirmed**: the multi-stream operation of the ring buffer pipeline triggers the nsys bug.
## Root Cause
### Triggering Code
```python
# nanovllm/kvcache/sparse/full_policy.py:133-189
# Multi-slot pipeline mode
for block_idx in range(num_blocks):
    current_slot = load_slots[block_idx % num_slots]
    # Wait for this slot's transfer stream to finish loading
    offload_engine.wait_slot_layer(current_slot)
    # Run attention on compute_stream
    with torch.cuda.stream(compute_stream):
        prev_k, prev_v = offload_engine.get_kv_for_slot(current_slot)
        prev_o, prev_lse = flash_attn_with_lse(...)
        offload_engine.record_slot_compute_done(current_slot)
    # Asynchronously kick off the load for the next block
    if next_block_idx < num_blocks:
        offload_engine.load_to_slot_layer(next_slot, layer_id, next_cpu_block_id)
```
### Stream Structure
```
slot_transfer_streams[0] ─┐
slot_transfer_streams[1] ─┼─ 4 transfer streams
slot_transfer_streams[2] ─┤
slot_transfer_streams[3] ─┘
            ▼ wait/record sync
compute_stream ───────────┘
```
This 4+1 stream pattern, with its cross-stream synchronization, breaks the event timestamp ordering algorithm in nsys 2023.4.4.
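The synchronization shape can be illustrated with a CPU-only analogue: each transfer "stream" becomes a worker thread, and the `wait_slot_layer` / `record_slot_compute_done` handshakes become pairs of `threading.Event`s. The names mirror the nanovllm code, but this is a simplified sketch, not the real implementation.

```python
# CPU-only analogue of the 4-slot ring-buffer pipeline (illustration only).
import threading

NUM_SLOTS = 4

class Slot:
    def __init__(self):
        self.loaded = threading.Event()  # transfer done -> compute may read
        self.freed = threading.Event()   # compute done -> slot is reusable
        self.freed.set()                 # all slots start out free
        self.data = None

def transfer(slot, block_id):
    """Stand-in for the async H2D copy on a slot's transfer stream."""
    slot.freed.wait()    # wait until compute has released this slot
    slot.freed.clear()
    slot.data = block_id  # pretend to copy the CPU block into the slot
    slot.loaded.set()

def pipeline(num_blocks):
    slots = [Slot() for _ in range(NUM_SLOTS)]
    threads, results = [], []
    # Pre-issue loads for the first NUM_SLOTS blocks
    for b in range(min(NUM_SLOTS, num_blocks)):
        t = threading.Thread(target=transfer, args=(slots[b % NUM_SLOTS], b))
        t.start()
        threads.append(t)
    for b in range(num_blocks):
        slot = slots[b % NUM_SLOTS]
        slot.loaded.wait()          # ~ wait_slot_layer(current_slot)
        slot.loaded.clear()
        results.append(slot.data)   # ~ attention on compute_stream
        slot.freed.set()            # ~ record_slot_compute_done(current_slot)
        nxt = b + NUM_SLOTS
        if nxt < num_blocks:        # ~ async load_to_slot_layer for next block
            t = threading.Thread(target=transfer, args=(slots[nxt % NUM_SLOTS], nxt))
            t.start()
            threads.append(t)
    for t in threads:
        t.join()
    return results
```

`pipeline(10)` processes blocks `0..9` in order while loads for upcoming blocks overlap with "compute" on the current one, which is the overlap the real pipeline achieves with CUDA streams and events.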
### Why Simple Multi-Stream Tests Cannot Reproduce It
We tried to reproduce the problem with simple test code (`tests/test_multistream_nsys.py`):
- 4-8 streams, 2000+ iterations: ✅ imports fine
- 32 threads + multi-stream: ✅ imports fine
- >64k CUDA operations: ✅ imports fine
None of these triggered the bug. The stream synchronization pattern in the real code is more complex:
1. Cross-stream event wait/record
2. Interaction with FlashAttention kernels
3. A long run (~50 s) accumulating a large number of events
## Solutions
### Option 1: Upgrade nsys (Recommended)
```bash
# Download nsys 2024.2 or newer:
# https://developer.nvidia.com/nsight-systems
```
According to the [NVIDIA forums](https://forums.developer.nvidia.com/t/nsys-profiler-wrong-event-order/264881), this bug was fixed in the 2024.2 release.
### Option 2: Use the .qdstrm File
Even when the import fails, the `.qdstrm` file is still generated:
```bash
# Generated file
results/nsys/ruler_niah_single_1_sample0_offload_*.qdstrm
# Try opening it directly in the GUI
nsight-sys <file>.qdstrm
```
The GUI may be more tolerant of out-of-order events.
### Option 3: Use the PyTorch Profiler
```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_model()  # your code here

prof.export_chrome_trace("trace.json")  # view in chrome://tracing
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```
### Option 4: Temporarily Disable the Ring Buffer Pipeline
Force the single-slot synchronous mode in `full_policy.py` (for debugging only):
```python
# Force single-slot mode
if len(load_slots) == 1 or True:  # add "or True"
    # Synchronous mode; does not trigger the nsys bug
    ...
```
## Reproduction Steps
### Environment Setup
```bash
cd /home/zijie/Code/nano-vllm
```
### Run the Bisect Script
```bash
# Stage 5 triggers the bug
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$PWD:$PYTHONPATH \
nsys profile --trace=cuda,nvtx,osrt --force-overwrite=true \
-o /tmp/bisect python tests/test_nsys_bisect.py --stage 5
```
### Verify the Fix
```bash
# Temporarily skip historical block loading in full_policy.py:
# change line 133 to: if False and cpu_block_table:
# Re-run; the profile should now import successfully
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$PWD:$PYTHONPATH \
nsys profile --trace=cuda,nvtx,osrt --force-overwrite=true \
-o /tmp/bisect_fixed python tests/test_nsys_bisect.py --stage 5
# Check that the .nsys-rep file was generated
ls -la /tmp/bisect_fixed.nsys-rep
```
## Related Files
| File | Purpose |
|------|---------|
| `tests/test_nsys_bisect.py` | Bisect debugging script |
| `tests/test_multistream_nsys.py` | Simple multi-stream test |
| `scripts/profile_offload.sh` | nsys profiling script |
| `nanovllm/layers/attention.py` | Attention layer |
| `nanovllm/kvcache/sparse/full_policy.py` | Ring buffer pipeline |
## References
- [Nsys Profiler- Wrong event order - NVIDIA Forums](https://forums.developer.nvidia.com/t/nsys-profiler-wrong-event-order/264881)
- [Nsight Systems 2025.3 Release Notes](https://docs.nvidia.com/nsight-systems/2025.3/ReleaseNotes/index.html)
- [Nsight Systems User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html)
## Debug Date
2026-01-24

@@ -1,35 +1,93 @@
#!/bin/bash
# Profile test_attention_offload.py using NVIDIA Nsight Systems
# Profile test_ruler.py using NVIDIA Nsight Systems
#
# Usage:
# bash scripts/profile_offload.sh
# bash scripts/profile_offload.sh [options]
#
# Options:
# --dataset DATASET Task name (default: niah_single_1)
# --sample INDEX Sample index (default: 0)
# --gpu GPU_ID GPU to use (default: 0)
# --no-offload Disable CPU offload
#
# Output:
# results/nsys/attention_offload_<timestamp>.nsys-rep
# results/nsys/ruler_<dataset>_sample<index>_<timestamp>.nsys-rep
#
# View results:
# nsight-sys results/nsys/attention_offload_<timestamp>.nsys-rep
# Examples:
# bash scripts/profile_offload.sh
# bash scripts/profile_offload.sh --dataset niah_single_1 --sample 5
# bash scripts/profile_offload.sh --gpu 1 --no-offload
set -e
# Configuration
# Default configuration
DATASET="niah_single_1"
SAMPLE_INDEX="0"
GPU_ID="0"
ENABLE_OFFLOAD="--enable-offload"
# Parse arguments
while [[ $# -gt 0 ]]; do
case $1 in
--dataset)
DATASET="$2"
shift 2
;;
--sample)
SAMPLE_INDEX="$2"
shift 2
;;
--gpu)
GPU_ID="$2"
shift 2
;;
--no-offload)
ENABLE_OFFLOAD=""
shift
;;
-h|--help)
echo "Usage: $0 [options]"
echo ""
echo "Options:"
echo " --dataset DATASET Task name (default: niah_single_1)"
echo " --sample INDEX Sample index (default: 0)"
echo " --gpu GPU_ID GPU to use (default: 0)"
echo " --no-offload Disable CPU offload"
exit 0
;;
*)
echo "Unknown option: $1"
exit 1
;;
esac
done
# Path configuration
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"
OUTPUT_DIR="$PROJECT_ROOT/results/nsys"
TEST_SCRIPT="$PROJECT_ROOT/tests/test_attention_offload.py"
TEST_SCRIPT="$PROJECT_ROOT/tests/test_ruler.py"
# Create output directory if needed
mkdir -p "$OUTPUT_DIR"
# Generate timestamp for unique filename
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
OUTPUT_FILE="$OUTPUT_DIR/attention_offload_$TIMESTAMP"
OFFLOAD_SUFFIX=""
if [ -n "$ENABLE_OFFLOAD" ]; then
OFFLOAD_SUFFIX="_offload"
fi
OUTPUT_FILE="$OUTPUT_DIR/ruler_${DATASET}_sample${SAMPLE_INDEX}${OFFLOAD_SUFFIX}_${TIMESTAMP}"
echo "============================================================"
echo "NVIDIA Nsight Systems Profiling"
echo "============================================================"
echo "Test script: $TEST_SCRIPT"
echo "Dataset: $DATASET"
echo "Sample: $SAMPLE_INDEX"
echo "GPU: $GPU_ID"
echo "Offload: ${ENABLE_OFFLOAD:-disabled}"
echo "Output file: $OUTPUT_FILE.nsys-rep"
echo ""
@@ -43,13 +101,16 @@ echo ""
echo "Running nsys profile..."
echo ""
CUDA_VISIBLE_DEVICES=$GPU_ID PYTHONPATH="$PROJECT_ROOT:$PYTHONPATH" \
nsys profile \
--trace=cuda,nvtx,osrt,cudnn,cublas \
--cuda-memory-usage=true \
--stats=true \
--trace=cuda,nvtx \
--force-overwrite=true \
--output="$OUTPUT_FILE" \
python "$TEST_SCRIPT"
python "$TEST_SCRIPT" \
--datasets "$DATASET" \
--sample-indices "$SAMPLE_INDEX" \
$ENABLE_OFFLOAD \
--quiet
echo ""
echo "============================================================"