
[transformers] bos_token is not added by default #3295

@fxmarty-amd

Description


Hi,

It appears that lm-evaluation-harness does not add a bos_token by default:

```python
add_bos_token: bool = False,
```

```python
add_bos_token: bool | None = False,
```

except for a hard-coded gemma:

```python
if "gemma" in getattr(self.config, "model_type", ""):
    self.add_bos_token = True
    eval_logger.info(
        f"Model type is '{self.config.model_type}', part of the Gemma family--a BOS token will be used as Gemma underperforms without it."
    )
```

Why is that?

Some models explicitly state in their tokenizer_config.json that a bos_token should be used: https://huggingface.co/unsloth/Llama-3.2-1B-Instruct/blob/main/tokenizer_config.json#L2

Is this intended?
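For context, the setting in question is plain JSON in the model repo, so it can be inspected directly. A minimal sketch (the inline JSON below is a hypothetical excerpt mirroring what Llama-3-style configs declare, not the full file):

```python
import json

# Hypothetical excerpt of a tokenizer_config.json: many model repos declare
# add_bos_token at the top level, alongside the BOS token string itself.
config_text = '{"add_bos_token": true, "bos_token": "<|begin_of_text|>"}'

cfg = json.loads(config_text)

# The model author's stated preference, which the harness default ignores.
print(cfg.get("add_bos_token", False))  # True
```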

Moreover, when apply_chat_template is used, the BOS token is added regardless of the self.add_bos_token option:

```python
try:
    chat_templated = self.tokenizer.apply_chat_template(
        chat_history,
        tokenize=False,
        add_generation_prompt=add_generation_prompt,
        continue_final_message=not add_generation_prompt,
        **self.chat_template_args,
    )
```
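The hazard in that interaction can be sketched without transformers: if a chat template already bakes the BOS string into the rendered prompt, and the templated string is then tokenized with BOS-adding enabled, the prompt starts with two BOS tokens. A minimal, hypothetical simulation (names like `BOS`, `render_chat_template`, and `tokenize` are illustrative stand-ins, not harness code):

```python
# Illustrative sketch of the double-BOS hazard, under the assumption that
# the chat template itself prepends BOS (as Llama-3-style templates do).
BOS = "<|begin_of_text|>"  # hypothetical BOS string

def render_chat_template(messages):
    # Stand-in for apply_chat_template(..., tokenize=False): the template
    # emits BOS as part of the rendered string.
    return BOS + "".join(f"<{m['role']}>{m['content']}" for m in messages)

def tokenize(text, add_bos_token):
    # Stand-in for a tokenizer call: optionally prepend BOS once more.
    return ([BOS] if add_bos_token else []) + [text]

prompt = render_chat_template([{"role": "user", "content": "hi"}])

doubled = tokenize(prompt, add_bos_token=True)
clean = tokenize(prompt, add_bos_token=False)

# With add_bos_token=True the encoded prompt carries two BOS markers.
print(doubled[0] == BOS and doubled[1].startswith(BOS))  # True
# With add_bos_token=False only the template's own BOS remains.
print(clean[0].startswith(BOS) and len(clean) == 1)      # True
```

This is why chat-templated paths usually tokenize with special-token insertion disabled, or strip the template's BOS, rather than relying on a single global flag.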

Thanks @baberabb
