GLM-4 uses multiple EOS tokens [151329, 151336, 151338] where 151336 (<|user|>) should also stop generation. Previously only the first EOS from tokenizer was used, causing generation to always hit max_tokens. Changes: - config.py: Change eos type to int | list[int] - llm_engine.py: Read eos_token_id from hf_config (contains full list) - scheduler.py: Use set for efficient multi-EOS lookup Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
3.5 KiB
3.5 KiB