
[transformers] bos_token is not added by default #3295

@fxmarty-amd

Description


Hi,

It appears that lm-evaluation-harness does not add a bos_token by default:

```python
add_bos_token: bool = False,
```

```python
add_bos_token: bool | None = False,
```

except for a hard-coded gemma:

```python
if "gemma" in getattr(self.config, "model_type", ""):
    self.add_bos_token = True
    eval_logger.info(
        f"Model type is '{self.config.model_type}', part of the Gemma family--a BOS token will be used as Gemma underperforms without it."
    )
```

Why is that?

Some models explicitly state in their tokenizer_config.json that a bos_token should be used: https://huggingface.co/unsloth/Llama-3.2-1B-Instruct/blob/main/tokenizer_config.json#L2

Is this intended?
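For context, the setting in question is plain JSON in the model repo, so it can be inspected directly. A minimal sketch (the inline JSON below is a hypothetical excerpt mirroring what Llama-3-style configs declare, not the full file):

```python
import json

# Hypothetical excerpt of a tokenizer_config.json: many model repos declare
# add_bos_token at the top level, alongside the BOS token string itself.
config_text = '{"add_bos_token": true, "bos_token": "<|begin_of_text|>"}'

cfg = json.loads(config_text)

# The model author's stated preference, which the harness default ignores.
print(cfg.get("add_bos_token", False))  # True
```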

Moreover, when apply_chat_template is used, the BOS token is added regardless of the self.add_bos_token option:

```python
try:
    chat_templated = self.tokenizer.apply_chat_template(
        chat_history,
        tokenize=False,
        add_generation_prompt=add_generation_prompt,
        continue_final_message=not add_generation_prompt,
        **self.chat_template_args,
    )
```
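The hazard in that interaction can be sketched without transformers: if a chat template already bakes the BOS string into the rendered prompt, and the templated string is then tokenized with BOS-adding enabled, the prompt starts with two BOS tokens. A minimal, hypothetical simulation (names like `BOS`, `render_chat_template`, and `tokenize` are illustrative stand-ins, not harness code):

```python
# Illustrative sketch of the double-BOS hazard, under the assumption that
# the chat template itself prepends BOS (as Llama-3-style templates do).
BOS = "<|begin_of_text|>"  # hypothetical BOS string

def render_chat_template(messages):
    # Stand-in for apply_chat_template(..., tokenize=False): the template
    # emits BOS as part of the rendered string.
    return BOS + "".join(f"<{m['role']}>{m['content']}" for m in messages)

def tokenize(text, add_bos_token):
    # Stand-in for a tokenizer call: optionally prepend BOS once more.
    return ([BOS] if add_bos_token else []) + [text]

prompt = render_chat_template([{"role": "user", "content": "hi"}])

doubled = tokenize(prompt, add_bos_token=True)
clean = tokenize(prompt, add_bos_token=False)

# With add_bos_token=True the encoded prompt carries two BOS markers.
print(doubled[0] == BOS and doubled[1].startswith(BOS))  # True
# With add_bos_token=False only the template's own BOS remains.
print(clean[0].startswith(BOS) and len(clean) == 1)      # True
```

This is why chat-templated paths usually tokenize with special-token insertion disabled, or strip the template's BOS, rather than relying on a single global flag.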

Thanks @baberabb
