nano-vllm

Author	SHA1	Message	Date
Zijie Tian	9177b62d7f	✨ feat: add --enforce-eager option to bench.py Allow disabling CUDA graphs for benchmarking comparison between eager mode and graph mode execution. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 09:19:53 +08:00
Zijie Tian	3956a30b14	🔧 chore: add --use-v1 flag to bench_vllm.py Allow switching between vLLM V0/V1 engines via command line flag. Default behavior now uses V0 (VLLM_USE_V1=0). Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 09:14:55 +08:00
Zijie Tian	59473fa432	🔧 chore: add configurable arguments to bench_vllm.py Add --model, --gpu-util, and --enforce-eager arguments for flexible vLLM benchmarking comparisons. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 09:07:49 +08:00
Zijie Tian	4467e1f654	🔧 chore: add --block-size argument to bench_offload.py Allow configuring KV cache block size for benchmarking different chunk sizes (default: 1024, can set to 4096 for larger chunks). Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 09:07:44 +08:00
Zijie Tian	0437311068	⚡ feat: add Phase 5 CUDA Graph optimization for chunked prefill Implement extended CUDA Graph coverage for CPU offload path: - Add graphed_layers.py with N+2 graph architecture (EmbedGraph, FirstGraph, InterGraphs, LastGraph) - Support both prefill (seq_len=chunk_size) and decode (seq_len=1) graph modes - Extend graph coverage to ~70-80% including qkv_proj, rotary, o_proj - Only attention core remains in eager mode for dynamic offload Performance: Prefill throughput improved ~5.6% (3782 -> 3995 tok/s at 32K) Also adds: - --enforce-eager flag to bench_offload.py for comparison - Offload mode constraint documentation in CLAUDE.md Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 07:38:40 +08:00
Zijie Tian	6da116de98	📝 docs: add GPU-Only XAttention guide with performance analysis Add comprehensive documentation for GPU-only XAttention BSA mode: - Architecture design and SparsePolicy interface - Memory pre-allocation mechanism (alloc_policy_metadata) - Performance analysis: 32K +15%, 64K +41% vs baseline - CUDA Graph limitations explanation (variable seq_len in prefill) - nsys profiling tools usage guide Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 07:21:46 +08:00
Zijie Tian	f5682ca4a7	🔧 chore: add GPU-only profiling script Add scripts/profile.sh for nsys profiling of GPU-only mode benchmarks. Usage: bash scripts/profile.sh # Default: 32K xattn prefill bash scripts/profile.sh --max-len 65536 --gpu-util 0.7 bash scripts/profile.sh --policy full bash scripts/profile.sh --bench-decode Output: results/nsys/bench_<policy>_<len>_<mode>_<timestamp>.nsys-rep Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 05:55:31 +08:00
Zijie Tian	a504bd873d	⚡ perf: pre-allocate GQA buffers in XAttention policy Add alloc_policy_metadata() method to SparsePolicy base class for pre-allocating GPU buffers during initialization. This avoids dynamic memory allocation during forward pass. Changes: - Add alloc_policy_metadata() to SparsePolicy base class - Implement GQA buffer pre-allocation in XAttentionBSAPolicy - Call alloc_policy_metadata() in model_runner for GPU-only mode - Modify compute_prefill() to reuse pre-allocated buffers - Add --gpu-util parameter to bench.py Memory savings: - Previously: 2x GQA expansion (~2GB for 64K) - Now: 1x pre-allocated buffer (~1GB for 64K, reused) Tested: - GPU-only 32K: 5602 tok/s (512MB pre-allocated) - GPU-only 64K: 4821 tok/s (1GB pre-allocated, gpu_util=0.7) - Offload Full: PASSED (no changes to offload path) - Offload XAttention: PASSED (uses compute_chunked_prefill) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 05:49:23 +08:00
Zijie Tian	076656c9c2	✨ feat: add GPU-only XAttention BSA sparse attention support - Implement compute_prefill() in XAttentionBSAPolicy for GPU-only mode - Uses xattn_estimate to compute sparse block mask - Uses block_sparse_attn_func for efficient sparse attention - Handles GQA by expanding K/V heads - Falls back to flash_attn for paged KV cache (prefix cache) - Implement compute_decode() by delegating to FullAttentionPolicy - Add --policy xattn option to bench.py Verified: RULER 32k niah_single_1 5/5 samples passed (100%) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 05:19:24 +08:00
Zijie Tian	b6b59b50ed	📝 docs: add sparse policy None constraint rule - Add "Policy must not be None (CRITICAL)" section - Document that sparse_policy must always be at least FullAttentionPolicy - Document warmup phase as the only exception where kvcache_manager can be None Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 05:08:08 +08:00
Zijie Tian	09b2136e9f	✨ feat: integrate sparse policy architecture into GPU-only mode - Add compute_prefill() and compute_decode() GPU-only methods to SparsePolicy base class - Implement GPU-only methods in FullAttentionPolicy using flash_attn - Add sparse_policy parameter to GPUOnlyManager - Update create_kvcache_manager() to create FullAttentionPolicy for GPU-only mode - Route GPU-only attention through sparse_policy in attention.py - Pass kvcache_manager to context for policy access - Add --enable-policy flag to bench.py for testing - Handle warmup phase when kvcache_manager is not yet allocated This allows GPU-only mode to use the same policy architecture as CPU offload mode, enabling future sparse attention implementations (Quest, XAttention) in GPU-only mode. Performance verified: ~4890 tok/s (unchanged from baseline) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 05:08:02 +08:00
Zijie Tian	0d31b3f71f	📝 docs: add CPU offload optimization strategies guide - Document chunk size optimization (simplest, most effective) - Analyze CUDA Graph limitations for offload scenarios - Cover CUDA Graph applicability for MLP/Proj layers - Survey frontier research: InfiniGen, ShadowKV, L2 Prefetch, KVPR - Add optimization priority recommendations Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 04:44:36 +08:00
Zijie Tian	05ce57ee8e	📝 docs: add GPU-only sparse policy integration baseline Document baseline performance before integrating sparse attention to GPU-only mode: - GPU-only Full Attention: 4869 tok/s (32K prefill) - CPU Offload Full Attention: 1500 tok/s (3.2x slower) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 04:36:31 +08:00
Zijie Tian	94a6e06d79	📝 docs: add GPU VRAM requirement rule for GPU-only mode GPU-only mode requires 40GB+ VRAM. This rule enforces checking GPU memory before running non-offload tests to prevent OOM errors on consumer GPUs (3090/4090). Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 04:36:24 +08:00
Zijie Tian	c717072f31	✨ feat: add --model argument to bench.py for configurable model path Previously bench.py had a hardcoded model path. Now it accepts --model argument (default: Llama-3.1-8B-Instruct) to align with bench_offload.py. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 04:36:17 +08:00
Zijie Tian	73c9dc46ff	✨ feat: add XAttention BSA support to bench_offload.py - Add --model parameter (default: Llama-3.1-8B-Instruct) - Add --enable-xattn flag for XAttention BSA sparse prefill - Add --xattn-threshold and --xattn-stride parameters - Change default num-gpu-blocks from 6 to 4 - Add benchmark results doc with Full vs XAttn comparison (32K/128K) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 04:20:16 +08:00
Zijie Tian	924a0d2bfa	🔧 chore: add nsys profiling rule and update gitignore - Add rule requiring profile_offload.sh for all nsys profiling - Document available parameters and typical workflows - Ignore Snipaste screenshot files Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 03:42:17 +08:00
Zijie Tian	0619accd1c	📝 docs: add CPU scheduling latency analysis for chunked attention - Document kernel gap analysis showing 77-81% CPU scheduling overhead - Identify GPU utilization at 12.8% with potential to reach 39.5% - Outline optimization directions: CUDA Graph, Triton fusion, C++ extension - Add documentation index entry in CLAUDE.md Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 03:42:12 +08:00
Zijie Tian	18bc433f09	⚡ perf: improve NVTX profiling with colored ranges and configurable slots - Switch from torch.cuda.nvtx to nvtx package for colored range support - Add color coding: blue for H2D, green for D2H decode, orange for D2H prefill - Add --num-gpu-blocks parameter to profile_offload.sh - Include slot count in output filename for easier comparison Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 03:42:05 +08:00
Zijie Tian	aea3812230	♻️ refactor: unify KV cache operations through OffloadEngine - Add write_to_prefill_buffer() and write_to_decode_buffer() methods - Add chunk_idx parameter to load_to_slot_layer() for NVTX labeling - Replace direct copy_() calls with OffloadEngine methods in attention.py - Update all load_to_slot_layer() calls to pass chunk_idx - NVTX markers now show chunk info: "H2D: L{layer} Chunk{chunk} CPU[{block}]->Slot[{slot}]" All KV cache data transfers in chunked offload mode now go through OffloadEngine, enabling better profiling and consistent management. Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>	2026-01-27 02:20:59 +08:00
Zijie Tian	3100724666	📝 docs: add nsys wrong event order bug investigation - Document ring buffer pipeline triggering nsys timestamp bug - Update profile_offload.sh to use test_ruler.py with options - Add reference to new doc in CLAUDE.md Root cause: 4-slot ring buffer pipeline (4 transfer streams + 1 compute stream) triggers event ordering bug in nsys < 2024.2 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 04:32:05 +08:00
Zijie Tian	78a44f3536	📝 docs: add GPU memory monitoring rule - Add .claude/rules/gpu-monitor.md requiring gpu-monitor agent for all GPU memory monitoring tasks - Update CLAUDE.md rules index with reference to new rule Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 01:41:25 +08:00
Zijie Tian	7c41032a2e	✨ feat: add configurable stride and chunk_size for XAttention BSA - Add sparse_chunk_size config option (default: 16384) - Pass stride, chunk_size, use_triton through factory function - Add --sparse-stride CLI option to test_ruler.py Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 10:37:04 +08:00
Zijie Tian	f28b500120	🙈 chore: uncomment planning files in gitignore These files are session-level temporary and should not be tracked. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 09:43:46 +08:00
Zijie Tian	be67fa8060	🗑️ chore: remove temporary planning files These files are session-level temporary files and should not be tracked. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 09:43:22 +08:00
Zijie Tian	4f35526457	🔀 merge: integrate remote changes (exec-plan command, CUDA graph plan) Resolve task_plan.md conflict by keeping remote version (CUDA Graph optimization plan). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 09:43:06 +08:00
Zijie Tian	da5e13e2bb	📝 docs: update XAttention BSA Policy with benchmarks and memory management Add new sections to xattn_bsa_policy_design.md: - Performance benchmarks: 128K context comparison (Full vs XAttn BSA) - Density trend analysis across chunks - Memory leak issue and fix (64GB -> 4GB reduction) - Memory monitoring guide with gpu-monitor agent - Density statistics API documentation - Known issues and optimization directions Update CLAUDE.md description to reflect new content. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 09:35:18 +08:00
Zijie Tian	dd31033732	🔧 chore: add gpu-monitor agent for memory leak debugging Add a custom agent for continuous GPU monitoring during benchmarks: - Track GPU utilization, memory usage, and temperature - Support multi-GPU and configurable sampling intervals - Generate summary statistics when stopped Useful for debugging memory leaks and profiling long-running tasks. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 09:33:15 +08:00
Zijie Tian	ed3c8bb4b8	🐛 fix: memory leak in XAttentionBSAPolicy select_blocks Fix severe memory leak (64GB -> 4GB growth) by: - Remove unused sparse_metadata storage (was accumulating attn_scores) - Delete intermediate tensor list (attn_scores_list) after use - Explicitly delete intermediate tensors before return Before: 16GB -> 80GB during 128K prefill After: 16GB -> 19.8GB during 128K prefill Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 09:30:18 +08:00
Zijie Tian	5eb35982bf	🔧 feat: add density statistics tracking to sparse policies Add statistics tracking to compare block selection between policies: - XAttentionBSAPolicy: track available/selected blocks per chunk - FullAttentionPolicy: track total blocks (always 100% density) - Add reset_stats(), get_density_stats(), print_density_stats() methods - Use logger.debug for per-chunk density logging Results on 32K niah_single_1: - Full: 100% density across all chunks - XAttn BSA: 90% -> 73% density (saves ~25-30% blocks in later chunks) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 08:53:22 +08:00
Zijie Tian	ad361c2c3b	📝 docs: add XAttention BSA Policy design documentation - Create docs/xattn_bsa_policy_design.md with: - Algorithm overview and data flow diagram - select_blocks implementation details - GQA-aware aggregation and majority voting - compute_chunked_prefill ring buffer pipeline - Parameter configuration and usage examples - Performance characteristics and limitations - Update CLAUDE.md documentation index Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 08:36:56 +08:00
Zijie Tian	4d1e40152d	✨ feat(xattn): implement compute_chunked_prefill with ring buffer pipeline - Copy compute_chunked_prefill implementation from FullAttentionPolicy - Set default threshold to 0.95 for accuracy testing - Remove debug code (sys.exit, verbose prints) - Use ring buffer pipeline for historical block loading - Merge with current chunk attention using flash_attn_with_lse RULER NIAH test passed with 5/5 samples (100% accuracy). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 08:27:40 +08:00
Zijie Tian	832b352afa	✨ feat(xattn): implement select_blocks with majority voting aggregation Implement XAttention-based block selection for sparse attention: - Use flat_group_gemm_fuse_reshape to compute Q@K^T attention scores - Apply softmax_fuse_block_sum to aggregate into block-level attention - Use find_blocks_chunked for threshold-based block selection - Handle GQA by aggregating within KV head groups first - Use majority voting (>50%) across heads instead of any() for better sparsity - Align block_size with CPU offload block size (1024 tokens / stride = 128) Test results show ~45% density at chunk 40 (down from 100% with any() aggregation). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 08:19:05 +08:00
Zijie Tian	a50b4c2ac2	♻️ refactor: move select_blocks from policy to attention layer Move block selection logic from compute_chunked_prefill/decode methods to attention.py caller. This improves separation of concerns: - attention.py now calls select_blocks() before compute_chunked_*() - Policy methods receive pre-selected blocks via selected_blocks parameter - Enables sparse policies to implement custom block selection without modifying the compute path Changes: - policy.py: Add selected_blocks parameter to abstract methods - full_policy.py: Remove internal select_blocks calls, use passed blocks - xattn_bsa.py: Sync signatures for prefill/decode methods - attention.py: Add select_blocks calls before policy delegation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 05:21:28 +08:00
Zijie Tian	ca32ea6f93	[WIP] Before refactoring compute_chunked_prefill.	2026-01-23 03:36:12 +08:00
Zijie Tian	edc006463b	docs: add XAttention kernels guide - Document flat_group_gemm_fuse_reshape and softmax_fuse_block_sum kernels - Explain anti-diagonal sum principle and stride sampling - Add GPU-specific BLOCK_M/N constraints (RTX 3090 vs A100) - Show Q/K can have different lengths (chunked prefill support) - Update CLAUDE.md with doc reference Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 03:22:25 +08:00
Zijie Tian	999858e82f	feat: add xattn kernels test and update testing rules - Add test_xattn_kernels.py demonstrating flat_group_gemm_fuse_reshape and softmax_fuse_block_sum Triton kernels with structured data - Update testing.md with new test code style guidelines - Update xattn.py and xattn_bsa.py with improvements Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 03:01:25 +08:00
Zijie Tian	47d237bb7e	✨ feat: add exec-plan command for automated task plan execution Add a new Claude command that executes task_plan.md refactoring with: - GPU isolation via --gpu <id> parameter (required) - Optional --no-interrupt mode for autonomous execution - Progress tracking via progress.md and findings.md - Strict CUDA_VISIBLE_DEVICES enforcement Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 02:23:12 +08:00
Zijie Tian	a5307fb124	📝 docs: add CUDA Graph optimization plan for offload mode decode - Update task_plan.md with 6-phase segmented graph implementation plan - Add findings.md documenting 7 key discoveries about current implementation - Add progress.md for tracking implementation progress - Add test_chunk_attention_graph_reuse.py validating 2-graph reuse strategy Key architecture decision: Split transformer layer into 3 segments: - PRE-ATTENTION GRAPH: norm → qkv_proj → rotary (1 graph, reused) - CHUNKED ATTENTION: H2D (eager) + flash_attn (2 graphs) + merge (eager) - POST-ATTENTION GRAPH: o_proj → norm → FFN (1 graph, reused) Total: 4 graphs serving all layers via copy_() tensor updates. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 02:12:24 +08:00
Zijie Tian	d808970f2f	[WIP] Before implementing the plan.	2026-01-22 01:35:13 +08:00
Zijie Tian	bc92c1fdb8	feat: add xattn_estimate_chunked for chunked prefill support - Add xattn_estimate_chunked function ported from COMPASS - Support chunked prefill with q_start_pos parameter - Ensure 100% consistency with standard xattn_estimate when using matching chunk_size parameter - Add test and documentation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 01:13:17 +08:00
Zijie Tian	2866d4fd88	✨ feat: add chunk attention CUDA graph test for block sparse attention Validates that pre-allocated CUDA graphs work for chunk-wise attention: - Each (Q_chunk, K_chunk) pair has its own captured graph - Zero copy_() during replay - all data pre-filled - Uses nanovllm's flash_attn_with_lse and merge_attention_outputs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 00:57:05 +08:00
Zijie Tian	5d722968ff	[docs] Added cuda_graph_guide.md	2026-01-21 21:56:24 +08:00
Zijie Tian	d21b40f48f	[test] Added test_cudagraph_memory.py.	2026-01-21 03:30:36 +08:00
Zijie Tian	42cf124343	📝 docs: add CUDA Graph memory mechanism guide Document CUDA Graph memory behavior based on actual testing: - Memory overhead at each stage (model, cache, warmup, capture, replay) - StaticCache is the main overhead (~144MB for 1K tokens) - Graph capture adds minimal overhead (~8MB) - Graph replay requires zero additional allocation - Performance improvement: ~2.8x decode throughput Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 02:59:21 +08:00
Zijie Tian	78050aef9f	🐛 fix: resolve CPU KV cache state leakage between requests Root Cause: - OffloadEngine.reset() cleared GPU buffers but NOT CPU cache - Previous request's KV cache data persisted in CPU memory, contaminating subsequent requests Fixes: - Add k_cache_cpu.zero_() and v_cache_cpu.zero_() to OffloadEngine.reset() - Add clear_decode_tracking(seq) call in HybridKVCacheManager.deallocate() Results: - niah_single_1 accuracy improved from ~80% to 94% (+14%) - Remaining ~6% errors are model limitations, not state leakage Also: - Update docs/ruler_32k_chunked_offload_issue.md with fix details - Remove debug planning files (findings.md, progress.md, task_plan.md) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 01:12:21 +08:00
Zijie Tian	4d8ae951c3	[WIP] Before the debug plan.	2026-01-21 00:01:10 +08:00
Zijie Tian	1ab4676396	♻️ refactor: consolidate RULER test files and document root cause - test_ruler.py: add --fresh-llm, --sample-indices, --json-output options - test_ruler.py: consolidate test_ruler_single_sample.py, test_ruler_sequential.py, test_ruler_samples.py - docs: update chunked offload issue with root cause (state leakage confirmed) - docs: add single-sample test results showing 100% accuracy for niah_single_1 Deleted redundant test files: - tests/test_ruler_single_sample.py - tests/test_ruler_sequential.py - tests/test_ruler_samples.py Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 23:41:17 +08:00
Zijie Tian	512e1e5401	🔧 chore: add Claude rules for agent result format and multi-GPU debugging - Add agent-result-format.md: standardize output formats for background agents - Add multi-gpu-debugging.md: guidelines for parallel GPU testing workflows - Update CLAUDE.md: add documentation index entry for chunked offload issue Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 23:41:08 +08:00
Zijie Tian	6180055ed8	📝 docs: add chunked attention solutions guide and update doc index Add comprehensive documentation analyzing the 32K chunked offload accuracy issues with proposed solutions covering LSE precision, ring buffer state management, and position encoding validation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-20 04:48:20 +08:00
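
Commit 832b352afa describes switching block selection from any() to majority voting (>50%) across heads to get better sparsity. The following is a minimal pure-Python sketch of that aggregation idea only, not the repository's code: the function name aggregate_block_masks, the mode argument, and the toy masks are hypothetical, and the GQA step mentioned in the commit (aggregating within KV head groups first) is left out.

```python
from typing import List


def aggregate_block_masks(head_masks: List[List[bool]], mode: str = "majority") -> List[bool]:
    """Combine per-head block selections into one per-block mask.

    head_masks[h][b] is True when head h wants KV block b.
    This helper is hypothetical and only illustrates the voting rule.
    """
    if not head_masks:
        return []
    num_heads = len(head_masks)
    num_blocks = len(head_masks[0])
    selected = []
    for b in range(num_blocks):
        votes = sum(1 for h in range(num_heads) if head_masks[h][b])
        if mode == "any":
            # Keep the block if at least one head selected it.
            selected.append(votes > 0)
        else:
            # Keep the block only if strictly more than 50% of heads selected it.
            selected.append(2 * votes > num_heads)
    return selected


def density(mask: List[bool]) -> float:
    """Fraction of blocks kept (1.0 means full attention)."""
    return sum(mask) / len(mask) if mask else 0.0


# Toy example: 4 heads, 4 blocks.
head_masks = [
    [True, True, False, False],
    [True, False, True, False],
    [True, False, False, False],
    [True, True, False, True],
]
any_mask = aggregate_block_masks(head_masks, mode="any")
majority_mask = aggregate_block_masks(head_masks, mode="majority")
print(density(any_mask), density(majority_mask))
```

In this toy input every block is wanted by at least one head, so any() keeps all four blocks, while the majority rule keeps only the block that more than half of the heads agree on. That mirrors the density drop reported in the commit, though the real numbers depend on the model and input.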


248 Commits
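
Several commits above (for example 4d1e40152d and 2866d4fd88) mention merging per-chunk attention results using flash_attn_with_lse and merge_attention_outputs. The log does not show those functions' signatures, so the sketch below does not reproduce them. It only illustrates, under simplifying assumptions (one query vector, unscaled dot-product scores, plain Python lists), the standard log-sum-exp identity that makes such a merge exact. The names attend and merge_partial are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def attend(q: Sequence[float], keys: Sequence[Sequence[float]],
           values: Sequence[Sequence[float]]) -> Tuple[List[float], float]:
    """Softmax attention for one query; returns (output, log-sum-exp of scores)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    lse = m + math.log(total)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, lse


def merge_partial(o1: Sequence[float], lse1: float,
                  o2: Sequence[float], lse2: float) -> Tuple[List[float], float]:
    """Merge two partial attention outputs computed over disjoint key chunks."""
    m = max(lse1, lse2)
    lse = m + math.log(math.exp(lse1 - m) + math.exp(lse2 - m))
    w1 = math.exp(lse1 - lse)  # share of the softmax mass held by chunk 1
    w2 = math.exp(lse2 - lse)  # share of the softmax mass held by chunk 2
    return [w1 * a + w2 * b for a, b in zip(o1, o2)], lse


# Toy data: 4 keys/values split into two chunks of 2.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full_out, full_lse = attend(q, keys, values)
o1, lse1 = attend(q, keys[:2], values[:2])
o2, lse2 = attend(q, keys[2:], values[2:])
merged_out, merged_lse = merge_partial(o1, lse1, o2, lse2)

max_err = max(abs(a - b) for a, b in zip(full_out, merged_out))
print(max_err)
```

Because each chunk's output is reweighted by its share of the total softmax mass, the merged result matches attention over all keys up to floating-point error. This is why historical blocks can be loaded and processed chunk by chunk without changing the result.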