Add comprehensive test suite for OffloadedTensor implementation,
including basic functionality, chunked GEMM, and sync analysis.

Components:
- OffloadedTensor: Virtual GPU tensor with transparent CPU/GPU data movement
- OffloadManager: LRU cache management with performance stats
- ChunkedOffloadLinear: Chunked GEMM along seqlen dimension

Tests (10 total):
- Basic functionality, MLP integration, LRU eviction, correctness
- Memory analysis, 128K sequence, performance comparison, transformers layer
- Sync behavior analysis, profiler analysis

Key findings:
- 93.9% memory savings for 128K sequences (3156MB → 191MB)
- Constant memory footprint regardless of sequence length
- Only 8% performance overhead from chunked processing

Co-Authored-By: Claude <noreply@anthropic.com>