🐛 fix: support multiple EOS tokens for GLM-4

GLM-4 uses multiple EOS tokens [151329, 151336, 151338] where 151336 (<|user|>) should also stop generation. Previously only the first EOS from tokenizer was used, causing generation to always hit max_tokens.

Changes:
- config.py: Change eos type to int | list[int]
- llm_engine.py: Read eos_token_id from hf_config (contains full list)
- scheduler.py: Use set for efficient multi-EOS lookup

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 13:23:53 +08:00
parent 726e4b58cf
commit 29e102720b
3 changed files with 12 additions and 4 deletions
--- a/nanovllm/config.py
+++ b/nanovllm/config.py
@@ -22,7 +22,7 @@ class Config:
    tensor_parallel_size: int = 1
    enforce_eager: bool = False
    hf_config: AutoConfig | None = None
-    eos: int = -1
+    eos: int | list[int] = -1  # Single EOS token or list of EOS tokens (e.g., GLM-4)
    kvcache_block_size: int = 1024
    num_kvcache_blocks: int = -1
    dtype: str | None = None  # "float16", "bfloat16", or None (use model default)
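
For illustration only: the `int | list[int]` type above means a consumer has to handle both shapes. The sketch below shows one way that could be done, together with the set-based stop check described in the commit message. The helper names `normalize_eos` and `should_stop` are hypothetical and are not taken from the nanovllm code; the actual changes to llm_engine.py and scheduler.py may be structured differently.

```python
from __future__ import annotations


def normalize_eos(eos: int | list[int]) -> frozenset[int]:
    """Coerce a single EOS id or a list of ids into a set for O(1) lookup.

    Hypothetical helper, not part of nanovllm.
    """
    if isinstance(eos, int):
        return frozenset({eos})
    return frozenset(eos)


def should_stop(token_id: int, eos_ids: frozenset[int],
                num_generated: int, max_tokens: int) -> bool:
    """Stop on any EOS id, or once the token budget is exhausted.

    Hypothetical helper, not part of nanovllm.
    """
    return token_id in eos_ids or num_generated >= max_tokens


# GLM-4 EOS ids as quoted in the commit message.
glm4_eos = normalize_eos([151329, 151336, 151338])

# 151336 (<|user|>) is not the first id in the list but still ends generation.
print(should_stop(151336, glm4_eos, num_generated=5, max_tokens=64))  # True
# An ordinary token below the budget does not.
print(should_stop(42, glm4_eos, num_generated=5, max_tokens=64))      # False
```

Normalizing once at configuration time keeps the per-token check a single set membership test, and the default `eos = -1` simply becomes a one-element set that no real token id matches.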