From f240903013c7611af26bfbedd9255b13f5bf6ac1 Mon Sep 17 00:00:00 2001
From: Zijie Tian
Date: Wed, 7 Jan 2026 01:42:59 +0800
Subject: [PATCH] [docs] Add GPU mutex instructions for multi-instance debugging
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add instructions for Claude instances to check GPU availability before
running CUDA operations, preventing conflicts when multiple instances
debug in parallel on a single GPU.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5
---
 CLAUDE.md | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/CLAUDE.md b/CLAUDE.md
index 16c2b37..2ed1058 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -6,6 +6,44 @@ This file provides guidance to Claude Code when working with this repository.
 
 Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Supports Qwen3 models with CPU offload for long-context inference.
 
+## GPU Mutex for Multi-Instance Debugging
+
+**IMPORTANT**: When running multiple Claude instances for parallel debugging, only one GPU (cuda:0) is available. Before executing ANY command that uses the GPU (python scripts, benchmarks, tests), Claude MUST:
+
+1. **Check GPU availability** by running:
+   ```bash
+   nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader
+   ```
+
+2. **If processes are running on the GPU**:
+   - Wait and retry every 10 seconds until the GPU is free
+   - Use this polling loop:
+     ```bash
+     while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
+       echo "GPU busy, waiting 10s..."
+       sleep 10
+     done
+     ```
+
+3. **Only proceed** when `nvidia-smi --query-compute-apps=pid --format=csv,noheader` returns empty output.
+
+**Example workflow**:
+```bash
+# First check whether the GPU is in use
+nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv,noheader
+
+# If the output is empty, proceed with your command
+python bench_offload.py
+
+# If the output shows processes, wait until they finish
+```
+
+**Note**: This applies to ALL GPU operations, including:
+- Running tests (`python tests/test_*.py`)
+- Running benchmarks (`python bench*.py`)
+- Running examples (`python example.py`)
+- Any script that imports torch/cuda
+
 ## Sparse Attention
 
 For sparse attention related content (block sparse attention, MInference, FlexPrefill, XAttention, AvgPool, etc.), refer to [`docs/sparse_attention_guide.md`](docs/sparse_attention_guide.md).
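
As a possible companion to this patch, the check/poll/run steps it documents could be wrapped in a single helper so every instance launches GPU work the same way. Below is a minimal bash sketch under stated assumptions: the `wait_for_gpu.sh` name is hypothetical and not part of this patch, and the check-then-run sequence is still not atomic, so two instances that poll at the same instant can race.

```bash
#!/usr/bin/env bash
# wait_for_gpu.sh (hypothetical helper, not part of this patch):
# block until the GPU is free, then run the given command.
set -euo pipefail

[ $# -ge 1 ] || { echo "usage: $0 <command> [args...]" >&2; exit 1; }

# Steps 1-2 above: poll nvidia-smi until no compute processes remain.
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
  echo "GPU busy, waiting 10s..."
  sleep 10
done

# Step 3: empty output means the GPU is free; hand off to the command.
exec "$@"
```

Example invocation: `./wait_for_gpu.sh python bench_offload.py` starts the benchmark only once `nvidia-smi` reports no compute processes on the GPU.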