[docs] Start using CLAUDE rules.
.claude/rules/commands.md (new file)
@@ -0,0 +1,26 @@
# Commands

## Installation

```bash
pip install -e .
```

## Running

```bash
# Run example
python example.py

# Run benchmarks
python bench.py          # Standard benchmark
python bench_offload.py  # CPU offload benchmark
```

## Config Defaults

- `max_num_batched_tokens`: 16384
- `max_num_seqs`: 512
- `kvcache_block_size`: 256
- `gpu_memory_utilization`: 0.9
- `enforce_eager`: False (enables CUDA graphs)
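For reference, the defaults above correspond to engine keyword arguments. A minimal sketch of overriding them, assuming the `nanovllm` package's `LLM` constructor forwards these names to its config as in upstream nano-vllm (check `nanovllm/config.py` for the actual signature):

```python
# Illustrative only: parameter names are assumed to match nanovllm/config.py;
# adjust if this fork renames or adds fields (e.g. for CPU offload).
from nanovllm import LLM, SamplingParams

llm = LLM(
    "Qwen/Qwen3-0.6B",             # any supported Qwen3 checkpoint or local path
    max_num_batched_tokens=16384,  # defaults listed above, spelled out explicitly
    max_num_seqs=512,
    kvcache_block_size=256,
    gpu_memory_utilization=0.9,
    enforce_eager=False,           # False keeps CUDA graphs enabled
)

params = SamplingParams(temperature=0.6, max_tokens=128)
outputs = llm.generate(["Summarize paged KV caching in one sentence."], params)
print(outputs[0]["text"])
```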
.claude/rules/testing.md (new file)
@@ -0,0 +1,20 @@
# Testing

## Chunked Attention Test

```bash
CUDA_VISIBLE_DEVICES=4,5 python tests/test_chunked_attention.py 6 2048 64 2
# Args: num_gpu_blocks input_len output_len num_prefetch_blocks
```

## CPU Offload Testing

```bash
# Basic test with limited GPU blocks to trigger offload
CUDA_VISIBLE_DEVICES=4,5 python tests/test_chunked_attention.py 6 2048 64 2

# Verify consistency (run multiple times, output should be identical)
for i in 1 2 3; do
  CUDA_VISIBLE_DEVICES=4,5 python tests/test_chunked_attention.py 6 2048 32 2 2>&1 | tail -3
done
```
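The bash loop above leaves the comparison to the eye; a small standard-library helper (illustrative only, reusing the same test script and arguments) can assert that the tails match:

```python
# Illustrative consistency check: runs the chunked-attention test three times
# and asserts the last three lines of combined stdout/stderr are identical.
import os
import subprocess

CMD = [
    "python", "tests/test_chunked_attention.py",
    "6", "2048", "32", "2",  # num_gpu_blocks input_len output_len num_prefetch_blocks
]

def run_tail(n_lines: int = 3) -> list[str]:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES="4,5")
    result = subprocess.run(
        CMD, env=env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        text=True, check=True,
    )
    return result.stdout.strip().splitlines()[-n_lines:]

runs = [run_tail() for _ in range(3)]
assert all(r == runs[0] for r in runs), "outputs differ across runs"
print("consistent across 3 runs")
```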
CLAUDE.md
@@ -6,24 +6,6 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Nano-vLLM is a lightweight vLLM implementation (~1,200 lines) for fast offline LLM inference. Currently supports Qwen3 models.
 
-## Commands
-
-```bash
-# Install
-pip install -e .
-
-# Run example
-python example.py
-
-# Run benchmarks
-python bench.py          # Standard benchmark
-python bench_offload.py  # CPU offload benchmark
-
-# Test chunked attention
-CUDA_VISIBLE_DEVICES=4,5 python tests/test_chunked_attention.py 6 2048 64 2
-# Args: num_gpu_blocks input_len output_len num_prefetch_blocks
-```
-
 ## Architecture
 
 ### Core Components
@@ -157,23 +139,3 @@ Current double-buffering limits pipeline depth. Planned improvement:
 - Unified ring buffer using all GPU slots (except decode)
 - Per-slot per-layer CUDA events for fine-grained sync
 - Deeper pipeline: prefetch N-1 blocks ahead (vs 1 chunk)
-
-## Config Defaults
-
-- `max_num_batched_tokens`: 16384
-- `max_num_seqs`: 512
-- `kvcache_block_size`: 256
-- `gpu_memory_utilization`: 0.9
-- `enforce_eager`: False (enables CUDA graphs)
-
-## Testing CPU Offload
-
-```bash
-# Basic test with limited GPU blocks to trigger offload
-CUDA_VISIBLE_DEVICES=4,5 python tests/test_chunked_attention.py 6 2048 64 2
-
-# Verify consistency (run multiple times, output should be identical)
-for i in 1 2 3; do
-  CUDA_VISIBLE_DEVICES=4,5 python tests/test_chunked_attention.py 6 2048 32 2 2>&1 | tail -3
-done
-```
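The planned-improvement bullets retained above (unified ring buffer, per-slot CUDA events, deeper prefetch) describe a standard event-per-slot prefetch pattern; a minimal PyTorch sketch of that pattern, using hypothetical names rather than the repository's actual offload classes:

```python
# Hypothetical sketch of per-slot-event prefetching; names and shapes are
# invented for illustration and do not reflect this repository's code.
import torch

class RingPrefetcher:
    def __init__(self, num_slots: int, block_shape: tuple, device: str = "cuda"):
        self.slots = [torch.empty(block_shape, device=device) for _ in range(num_slots)]
        self.ready = [torch.cuda.Event() for _ in range(num_slots)]  # one event per slot
        self.copy_stream = torch.cuda.Stream()

    def prefetch(self, slot: int, cpu_block: torch.Tensor) -> None:
        # Issue the H2D copy on a side stream; cpu_block should be pinned for
        # the copy to overlap with compute. Record a per-slot readiness event.
        with torch.cuda.stream(self.copy_stream):
            self.slots[slot].copy_(cpu_block, non_blocking=True)
            self.ready[slot].record(self.copy_stream)

    def consume(self, slot: int) -> torch.Tensor:
        # The compute stream waits only on this slot's event, so other
        # in-flight copies keep running (deeper pipeline than double-buffering).
        torch.cuda.current_stream().wait_event(self.ready[slot])
        return self.slots[slot]
```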