test: add OffloadedTensor unified test suite
Add comprehensive test suite for OffloadedTensor implementation,
including basic functionality, chunked GEMM, and sync analysis.
Components:
- OffloadedTensor: Virtual GPU tensor with transparent CPU/GPU data movement
- OffloadManager: LRU cache management with performance stats
- ChunkedOffloadLinear: Chunked GEMM along seqlen dimension
Tests (10 total):
- Basic functionality, MLP integration, LRU eviction, correctness
- Memory analysis, 128K sequence, performance comparison, transformers layer
- Sync behavior analysis, profiler analysis
Key findings:
- 93.9% memory savings for 128K sequences (3156MB → 191MB)
- Constant memory footprint regardless of sequence length
- Only 8% performance overhead from chunked processing
Co-Authored-By: Claude <noreply@anthropic.com>