GLM-4 uses multiple EOS tokens ([151329, 151336, 151338]), and 151336
(<|user|>) must also stop generation. Previously only the tokenizer's
single eos_token_id was honored, so generation always ran until it hit
max_tokens.
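
The root cause is visible directly through the transformers API (a
minimal illustration; the model path here is an example, not one pinned
by this repo):

    from transformers import AutoConfig, AutoTokenizer

    path = "THUDM/glm-4-9b-chat"  # example path
    tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
    cfg = AutoConfig.from_pretrained(path, trust_remote_code=True)

    print(tok.eos_token_id)  # a single id
    print(cfg.eos_token_id)  # the full list [151329, 151336, 151338]
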
Changes:
- config.py: Change the eos field's type to int | list[int]
- llm_engine.py: Read eos_token_id from hf_config, which carries the full list
- scheduler.py: Use a set for O(1) multi-EOS lookup (sketched below)
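
A minimal sketch of the resulting stop check (the class and field names
here are illustrative, not the exact ones in scheduler.py):

    class StopChecker:
        def __init__(self, eos: int | list[int]):
            # Normalize to a set so any of GLM-4's three EOS ids stops decoding.
            self.eos_ids: set[int] = {eos} if isinstance(eos, int) else set(eos)

        def is_finished(self, last_token_id: int,
                        num_output_tokens: int, max_tokens: int) -> bool:
            return last_token_id in self.eos_ids or num_output_tokens >= max_tokens
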
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add support for the GLM-4 model architecture with the following changes:
- Add glm4.py with ChatGLMForCausalLM, GLM4Model, GLM4Attention, GLM4MLP
- Add GLM4RotaryEmbedding with interleaved partial rotation (rotary_dim = head_dim // 2)
- Add apply_rotary_emb_interleaved function for GLM-4 style RoPE (sketched after this list)
- Add GLM-4 weight name conversion and loading in loader.py
- Add GLM-4 chat template conversion in test_ruler.py
- Add trust_remote_code=True for GLM-4 config loading
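
A hedged sketch of the interleaved partial rotation in PyTorch (the
tensor layout is an assumption noted in the docstring; this is not the
exact code in glm4.py):

    import torch

    def apply_rotary_emb_interleaved(x: torch.Tensor,
                                     cos: torch.Tensor,
                                     sin: torch.Tensor) -> torch.Tensor:
        """Rotate only the first half of head_dim, pairing adjacent elements.

        Assumes x is [num_tokens, num_heads, head_dim] and cos/sin are
        [num_tokens, rotary_dim // 2], with rotary_dim = head_dim // 2.
        """
        rotary_dim = cos.shape[-1] * 2
        x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
        # Interleaved pairing: (x0, x1), (x2, x3), ... instead of the
        # LLaMA-style split into first and second halves.
        x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
        cos, sin = cos.unsqueeze(1), sin.unsqueeze(1)  # broadcast over heads
        out1 = x1 * cos - x2 * sin
        out2 = x2 * cos + x1 * sin
        # Re-interleave the rotated pairs, then append the unrotated half.
        rotated = torch.stack((out1, out2), dim=-1).flatten(-2)
        return torch.cat((rotated, x_pass), dim=-1)
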
Key GLM-4-specific adaptations:
- QKV bias enabled (add_qkv_bias: true)
- RoPE with rope_ratio scaling (base = 10000 * rope_ratio)
- Interleaved RoPE (pairs adjacent elements, not first/second half)
- Partial rotation (only half of head_dim is rotated)
- Uses multi_query_group_num instead of num_key_value_heads (config mapping sketched after this list)
- Uses kv_channels instead of head_dim
- Uses ffn_hidden_size instead of intermediate_size
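
The config-key mapping and rope_ratio scaling, sketched (the helper and
the returned field names are hypothetical; the HF attribute names come
from the GLM-4 config):

    def read_glm4_config(hf_config) -> dict:
        """Map GLM-4's HF config keys onto the generic names used elsewhere."""
        return {
            "num_key_value_heads": hf_config.multi_query_group_num,
            "head_dim": hf_config.kv_channels,
            "intermediate_size": hf_config.ffn_hidden_size,
            # rope_ratio scales the RoPE base: base = 10000 * rope_ratio.
            "rope_theta": 10000.0 * getattr(hf_config, "rope_ratio", 1.0),
        }
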
Tested with RULER niah_single_1 (5 samples): 100% accuracy.
Both GPU-only and CPU-offload modes were verified.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>