💬 Fix setup_chat_format and add clone_chat_template #3404

Merged
merged 23 commits into main from fix-setup-chat-format on Jun 15, 2025

Conversation

qgallouedec
Member

@qgallouedec qgallouedec commented May 2, 2025

This PR does two things:

  1. It fixes the resizing of the embedding dimension in setup_chat_format.
  2. It introduces a new function, clone_chat_template.

Both functions essentially serve the same purpose. The key difference is that clone_chat_template takes as input any tokenizer that already includes a chat template and copies that template over. I think this new function is more convenient and flexible in practice.

Also, if you look at the implementations, you'll notice that unlike setup_chat_format, clone_chat_template doesn't handle BOS or PAD tokens. Correct me if I’m wrong, but I don’t believe that’s necessary when setting up a chat template.

It also copies all the added tokens from the source tokenizer, ensuring tokens like <think>, <|eot|>, ... are properly set.
Note that practice varies a lot regarding these added tokens, in particular whether they are marked as "special" or not. The current implementation fully relies on the source tokenizer: if a token is special in the source tokenizer, it will be special in the target, and vice versa.

In the long term, I think we should deprecate setup_chat_format in favor of clone_chat_template, but this can be done in a future PR.
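
For illustration, here is a minimal usage sketch of the new helper, mirroring the call this PR adds to trl/scripts/sft.py; the base-model name below is just a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import clone_chat_template

# A base checkpoint whose tokenizer defines no chat template ("my-base-model" is a placeholder).
model = AutoModelForCausalLM.from_pretrained("my-base-model")
tokenizer = AutoTokenizer.from_pretrained("my-base-model")

# Copy the chat template and added tokens from a tokenizer that already ships one,
# resizing the embeddings to cover any newly added tokens.
model, tokenizer = clone_chat_template(model, tokenizer, "Qwen/Qwen3-0.6B")
```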

Comment on lines 119 to 124
-    len(tokenizer), pad_to_multiple_of=resize_to_multiple_of if resize_to_multiple_of is not None else None
+    new_num_tokens=tokenizer.vocab_size + len(tokenizer.added_tokens_encoder.keys()),
+    pad_to_multiple_of=resize_to_multiple_of if resize_to_multiple_of is not None else None,
Member Author

fix the new embedding size
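
For context, a rough reconstruction of what the corrected hunk amounts to inside setup_chat_format (a sketch, not the verbatim file; model, tokenizer, and resize_to_multiple_of are the function's arguments):

```python
# Reconstruction for context; the real code lives in trl/models/utils.py.
# Resize the embedding matrix so it covers the base vocabulary plus every added token,
# optionally padding the new size to a multiple (e.g. 64) for hardware efficiency.
model.resize_token_embeddings(
    new_num_tokens=tokenizer.vocab_size + len(tokenizer.added_tokens_encoder.keys()),
    pad_to_multiple_of=resize_to_multiple_of if resize_to_multiple_of is not None else None,
)
```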

@qgallouedec qgallouedec marked this pull request as ready for review May 31, 2025 19:02
@qgallouedec qgallouedec changed the title fix setup chat format Fix setup_chat_format and add setup_chat_template May 31, 2025
@qgallouedec qgallouedec changed the title Fix setup_chat_format and add setup_chat_template 💬 Fix setup_chat_format and add setup_chat_template May 31, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR fixes the embedding resize logic in setup_chat_format and introduces a new helper, setup_chat_template, which copies a chat template from an existing pretrained tokenizer.

  • Replaces calls to setup_chat_format with setup_chat_template in the SFT script.
  • Updates setup_chat_format to use explicit new_num_tokens and adds the setup_chat_template function.
  • Adjusts exports, tests, and documentation to reference and cover setup_chat_template.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

File | Description
trl/scripts/sft.py | Swap setup_chat_format for setup_chat_template (hardcoded source) and note passing the source as an arg
trl/models/utils.py | Fix resize args in setup_chat_format, import AutoTokenizer, and add setup_chat_template
trl/models/__init__.py | Export setup_chat_template
trl/__init__.py | Export setup_chat_template
tests/test_dataset_formatting.py | Import and test setup_chat_template; update resize modulus in old test
docs/source/sft_trainer.md | Change docs to use setup_chat_template and update references
docs/source/model_utils.md | Add autodoc entry for setup_chat_template
Comments suppressed due to low confidence (2)

docs/source/sft_trainer.md:63

  • Update the section heading to reflect setup_chat_template (e.g. "Add Special Tokens for Chat Template") to match the examples below.
### Add Special Tokens for Chat Format

trl/models/utils.py:82

  • [nitpick] Since this function is slated for deprecation, consider emitting a FutureWarning to notify users to migrate to setup_chat_template.
def setup_chat_format(
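
A minimal sketch of what that suggestion could look like (not part of this PR; shown with the helper's final name, clone_chat_template):

```python
import warnings

def setup_chat_format(model, tokenizer, format="chatml", resize_to_multiple_of=None):
    # Sketch of the suggested deprecation notice only; the rest of the function is elided.
    warnings.warn(
        "setup_chat_format is deprecated; use clone_chat_template instead.",
        FutureWarning,
    )
    ...
```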

@qgallouedec qgallouedec requested a review from shirinyamani May 31, 2025 21:46
@@ -132,6 +137,69 @@ def setup_chat_format(
return model, tokenizer


def setup_chat_template(
Collaborator

Would clone_chat_template be a more suitable name?

Member Author

sounds good

@qgallouedec qgallouedec changed the title 💬 Fix setup_chat_format and add setup_chat_template 💬 Fix setup_chat_format and add clone_chat_template Jun 7, 2025
@qgallouedec qgallouedec requested a review from edbeeching June 7, 2025 01:09
Member

@lewtun lewtun left a comment

Thanks for the improvements to chat template formatting! For context, setup_chat_format() was designed to let users apply a default chat template (ChatML) to base / pretrained models, where a chat template is usually not predefined (Qwen models are the main exception).

The intention was to support a few popular chat templates which could then (potentially) be set via the model config.

If I understand correctly, this PR is about scenarios where one would like to post-train a model that already has an existing chat template? In that case, I'm not sure I understand why we need to add special tokens, etc.

@lewtun
Member

lewtun commented Jun 11, 2025

> Also, if you look at the implementations, you'll notice that unlike setup_chat_format, clone_chat_template doesn't handle BOS or PAD tokens. Correct me if I'm wrong, but I don't believe that's necessary when setting up a chat template.

The reason we did this in setup_chat_format() is that some base models like Llama / Qwen don't always define a BOS or PAD token, and these need to be set explicitly if you want to run SFT.
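
As an aside for readers, a common fallback when a base tokenizer ships without a PAD token (illustration only, not this PR's code; the model name is a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-base-model")  # placeholder name

# Many SFT setups simply reuse the EOS token as PAD when none is defined;
# setup_chat_format goes further and registers its own special tokens, then resizes the embeddings.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```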

@qgallouedec
Member Author

qgallouedec commented Jun 11, 2025

For popular chat templates, there's usually at least one model on the Hugging Face Hub that already implements that template, right?
The goal is to simplify SFT setup by letting users specify "Use the same chat template as [reference_model]" instead of manually configuring the template in trl. In any case, we currently support only one, very simple chat template, so it's not very satisfying.

@qgallouedec qgallouedec merged commit 4126803 into main Jun 15, 2025
11 checks passed
@qgallouedec qgallouedec deleted the fix-setup-chat-format branch June 15, 2025 13:59
@@ -104,7 +104,8 @@ def main(script_args, training_args, model_args):

# Set default chat template if needed
if tokenizer.chat_template is None:
-model, tokenizer = setup_chat_format(model, tokenizer, format="chatml")
+# TODO: source should be passed as an argument
+model, tokenizer = clone_chat_template(model, tokenizer, "Qwen/Qwen3-0.6B")
Member

We have found it useful internally to expose a chat_template arg in SFTConfig which allows one to define a custom template or copy-paste one from an existing model. Perhaps we could expose this along with a chat_template_clone arg (or something similar), now that I better understand what your intent was in this PR?
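
For concreteness, a rough sketch of what that suggestion might look like; both chat_template and chat_template_clone are hypothetical argument names here and do not exist in SFTConfig as of this PR:

```python
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="sft-output",
    # Hypothetical arguments sketching the suggestion above (not real SFTConfig fields yet):
    # chat_template="{% for message in messages %}...{% endfor %}",  # paste a custom template
    # chat_template_clone="Qwen/Qwen3-0.6B",                         # or clone one from an existing model
)
```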

Member Author

sounds good!
