Debugging Guide
This document provides debugging techniques for nano-vLLM, including PyTorch hooks for capturing intermediate tensors.
PyTorch Hooks for Debugging
Hook Positions in Qwen3
decoder_layer
├── input_layernorm (RMSNorm)
├── self_attn (Qwen3Attention) ← Hook here for attention I/O after o_proj
│   ├── q_proj → q_norm → RoPE
│   ├── k_proj → k_norm → RoPE
│   ├── v_proj
│   ├── attn (Attention) ← Hook here for Q/K/V tensors
│   │   └── FlashAttention / SDPA
│   └── o_proj
├── post_attention_layernorm (RMSNorm)
└── mlp (Qwen3MLP)
Hook Types & Data Shapes
| Hook Position | Type | Captured Data |
|---|---|---|
| `self_attn` | post | [batch, seq_len, hidden_size] - after o_proj |
| `self_attn.attn` | pre | Q,K,V: [seq_len, num_heads, head_dim] - after RoPE |
| `self_attn.attn` | post | [seq_len, num_heads, head_dim] - before o_proj |
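The table lists a pre hook on `self_attn.attn`, which PyTorch exposes through `register_forward_pre_hook`. The following is a minimal sketch of capturing Q/K/V that way. It assumes the attention module receives `q, k, v` as its first three positional arguments; `ToyAttention` is a hypothetical stand-in so the snippet runs without loading a model, and is not the nanovllm implementation.

```python
import torch
from torch import nn


class ToyAttention(nn.Module):
    """Hypothetical stand-in for self_attn.attn; takes q, k, v positionally."""

    def forward(self, q, k, v):
        # Plain scaled dot-product attention over [seq_len, num_heads, head_dim].
        scores = torch.einsum("qhd,khd->hqk", q, k) / q.shape[-1] ** 0.5
        return torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), v)


def make_qkv_pre_hook(layer_id: int, storage: dict):
    def pre_hook(module, args):
        # A forward pre-hook receives only the positional arguments of forward().
        q, k, v = args[:3]
        storage[layer_id] = tuple(t.detach().clone() for t in (q, k, v))
    return pre_hook


qkv_storage = {}
attn = ToyAttention()
handle = attn.register_forward_pre_hook(make_qkv_pre_hook(0, qkv_storage))

q, k, v = (torch.randn(5, 2, 4) for _ in range(3))
out = attn(q, k, v)
handle.remove()

print([tuple(t.shape) for t in qkv_storage[0]])
```

If the real module is called with keyword arguments, the hook would need `register_forward_pre_hook(..., with_kwargs=True)` and a `hook(module, args, kwargs)` signature instead.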
Example: Capture Attention Outputs
storage = {}

def make_hook(layer_id: int, storage: dict):
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            attn_output = output[0]
        else:
            attn_output = output
        # nanovllm shape: [num_tokens, hidden_size] -> add batch dim
        if attn_output.dim() == 2:
            attn_output = attn_output.unsqueeze(0)
        storage[layer_id] = attn_output.detach().clone()
    return hook

# Register hooks
hooks = []
for layer_idx, layer in enumerate(model.model.layers):
    hooks.append(layer.self_attn.register_forward_hook(make_hook(layer_idx, storage)))

# Run inference...

# Cleanup
for hook in hooks:
    hook.remove()
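If inference raises between registration and cleanup, the hooks stay attached and keep cloning tensors on later runs. One way to guard against that is to wrap registration in a context manager, sketched below. `capture_outputs` is a hypothetical helper name, and the `nn.Linear` layers are stand-ins for the real decoder layers.

```python
from contextlib import contextmanager

import torch
from torch import nn


@contextmanager
def capture_outputs(modules):
    """Hypothetical helper: record each module's output, then always remove hooks."""
    storage, handles = {}, []

    def make_hook(idx):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            storage[idx] = out.detach().clone()
        return hook

    try:
        for idx, module in enumerate(modules):
            handles.append(module.register_forward_hook(make_hook(idx)))
        yield storage
    finally:
        # Runs even if the body raises, so no hook is left behind.
        for handle in handles:
            handle.remove()


# Stand-in layers; with a real model, pass
# [layer.self_attn for layer in model.model.layers] instead.
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])
x = torch.randn(4, 8)
with capture_outputs(layers) as captured:
    for layer in layers:
        x = layer(x)

print(sorted(captured))
```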
Reference Implementation
Key files for comparison testing:
| File | Purpose |
|---|---|
tests/modeling_qwen3.py |
Reference Qwen3 implementation (torch + transformers only) |
tests/test_needle_ref.py |
Reference needle test using custom Qwen3 |
tests/test_needle.py |
Needle-in-haystack test for nanovllm |
Common Pitfalls
- Shape mismatch: nanovllm uses `[num_tokens, ...]` while torch uses `[batch, seq_len, ...]`
- Hook position: `self_attn` captures after o_proj, `self_attn.attn` captures before o_proj
- Output format: nanovllm returns tuple `(attn_output, None)`, handle with `output[0]`
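For the shape mismatch, adding a batch dimension is enough when a single sequence is in flight. With several sequences, a packed `[num_tokens, ...]` tensor has to be split before it can be compared against a `[batch, seq_len, ...]` reference. The sketch below assumes tokens are concatenated in sequence order and that the per-sequence lengths are known; `unpack_tokens` is a hypothetical helper, not part of nanovllm.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def unpack_tokens(packed, seq_lens):
    """Hypothetical helper: [num_tokens, ...] -> [batch, max_seq_len, ...], zero-padded."""
    if sum(seq_lens) != packed.shape[0]:
        raise ValueError("seq_lens must sum to num_tokens")
    per_sequence = list(torch.split(packed, seq_lens, dim=0))
    return pad_sequence(per_sequence, batch_first=True)


packed = torch.arange(10, dtype=torch.float32).reshape(5, 2)  # 5 tokens, hidden_size 2
batched = unpack_tokens(packed, seq_lens=[3, 2])
print(batched.shape)  # torch.Size([2, 3, 2])
```

Padded positions are filled with zeros, so mask them out (or slice each row to its true length) before computing differences.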
Memory Debugging
Track Peak GPU Memory
import torch
# Reset stats before operation
torch.cuda.reset_peak_memory_stats()
torch.cuda.empty_cache()
# Run operation
outputs = llm.generate([prompt], sampling_params)
# Check peak
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory: {peak_gb:.2f} GB")
Monitor Memory During Execution
import torch

def memory_snapshot():
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")

# Add snapshots at key points in your code
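To attribute peak usage to a specific region rather than a single point, the reset-and-read pattern can be wrapped around a block of code. The following is a sketch only: `track_memory` is a hypothetical helper, and it records 0.0 when no CUDA device is available so the snippet still runs on CPU.

```python
from contextlib import contextmanager

import torch


@contextmanager
def track_memory(label: str, results: dict):
    """Hypothetical helper: record peak allocated GPU memory (GiB) for a code block."""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
    try:
        yield
    finally:
        if use_cuda:
            # Wait for queued kernels so their allocations are counted.
            torch.cuda.synchronize()
            results[label] = torch.cuda.max_memory_allocated() / 1024**3
        else:
            results[label] = 0.0  # no GPU: nothing to measure
        print(f"[{label}] peak allocated: {results[label]:.2f} GiB")


stats = {}
with track_memory("matmul", stats):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(256, 256, device=device)
    b = a @ a
```

Because the peak counter is reset on entry, nesting these blocks would make the outer measurement unreliable; use one level at a time.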
Comparing Outputs
Needle-in-Haystack Test
# Test with CPU offload
PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle.py --enable-offload --input-len 8192
# Test without CPU offload (GPU-only)
PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle.py --input-len 8192
# Compare with reference implementation
PYTHONPATH=/path/to/nano-vllm:$PYTHONPATH python tests/test_needle_ref.py --input-len 8192
Tensor Comparison
import torch

def compare_tensors(a, b, name, rtol=1e-3, atol=1e-5):
    if a.shape != b.shape:
        print(f"{name}: Shape mismatch {a.shape} vs {b.shape}")
        return False
    # Compare in float32 on CPU so mixed devices or dtypes (e.g. bf16 vs fp32) do not raise
    a, b = a.detach().float().cpu(), b.detach().float().cpu()
    diff = (a - b).abs()
    max_diff = diff.max().item()
    mean_diff = diff.mean().item()
    close = torch.allclose(a, b, rtol=rtol, atol=atol)
    print(f"{name}: max_diff={max_diff:.6f}, mean_diff={mean_diff:.6f}, close={close}")
    return close
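When per-layer outputs have been captured from both nanovllm and the reference model, the most useful number is usually the first layer at which they diverge, since errors propagate to every later layer. A minimal sketch, assuming both runs produced dicts keyed by layer index (as in the hook example); `first_divergence` is a hypothetical helper and the tensors here are synthetic.

```python
import torch


def first_divergence(ref: dict, test: dict, rtol=1e-3, atol=1e-5):
    """Hypothetical helper: return the first layer id whose tensors differ, else None."""
    for layer_id in sorted(ref.keys() & test.keys()):
        a = ref[layer_id].float().cpu()
        b = test[layer_id].float().cpu()
        if a.shape != b.shape or not torch.allclose(a, b, rtol=rtol, atol=atol):
            return layer_id
    return None


torch.manual_seed(0)
ref_storage = {i: torch.randn(1, 4, 8) for i in range(4)}
test_storage = {i: t.clone() for i, t in ref_storage.items()}
test_storage[2] = test_storage[2] + 0.1  # inject an error at layer 2

print(first_divergence(ref_storage, test_storage))  # 2
```

The default tolerances mirror `compare_tensors`; for bf16 or fp16 runs they will likely need to be loosened.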