Description
Hi,
It appears that lm-evaluation-harness does not add a `bos_token` by default:

```python
add_bos_token: bool = False,
```

```python
add_bos_token: bool | None = False,
```
except for a hard-coded special case for Gemma:

lm-evaluation-harness/lm_eval/models/huggingface.py, lines 252 to 256 in 4439847:

```python
if "gemma" in getattr(self.config, "model_type", ""):
    self.add_bos_token = True
    eval_logger.info(
        f"Model type is '{self.config.model_type}', part of the Gemma family--a BOS token will be used as Gemma underperforms without it."
    )
```
Why is that? Some models explicitly state in their `tokenizer_config.json` that a `bos_token` should be used: https://huggingface.co/unsloth/Llama-3.2-1B-Instruct/blob/main/tokenizer_config.json#L2
Is this intended?
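For illustration, here is a minimal sketch of reading that preference from a `tokenizer_config.json` rather than hard-coding model families. The JSON values below are illustrative, not the actual contents of the linked Llama config:

```python
import json

# Hypothetical sketch: a harness could honor the tokenizer's own declared
# default instead of special-casing model families. Llama-style
# tokenizer_config.json files often carry an "add_bos_token" field
# alongside "bos_token". (Values below are made up for illustration.)
config_text = '{"add_bos_token": true, "bos_token": "<|begin_of_text|>"}'

cfg = json.loads(config_text)
add_bos = cfg.get("add_bos_token", False)  # fall back to False if the field is absent
print(add_bos)  # True
```

This would reproduce the tokenizer author's intent automatically instead of relying on a `"gemma" in model_type` string check.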
Moreover, when `apply_chat_template` is used, the BOS token is added regardless of the `self.add_bos_token` option:
lm-evaluation-harness/lm_eval/models/huggingface.py, lines 1508 to 1515 in 4439847:

```python
try:
    chat_templated = self.tokenizer.apply_chat_template(
        chat_history,
        tokenize=False,
        add_generation_prompt=add_generation_prompt,
        continue_final_message=not add_generation_prompt,
        **self.chat_template_args,
    )
```
Thanks @baberabb