💬 Fix `setup_chat_format` and add `clone_chat_template` #3404
Conversation
trl/models/utils.py (Outdated)

```diff
-        len(tokenizer), pad_to_multiple_of=resize_to_multiple_of if resize_to_multiple_of is not None else None
+        new_num_tokens=tokenizer.vocab_size + len(tokenizer.added_tokens_encoder.keys()),
+        pad_to_multiple_of=resize_to_multiple_of if resize_to_multiple_of is not None else None,
```
fix the new embedding size
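For illustration, a minimal, self-contained sketch of the corrected resize call under the standard `transformers` API; the checkpoint and the `resize_to_multiple_of` value are placeholders, not the exact PR code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint and padding value; illustrative only.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
resize_to_multiple_of = 64  # or None to keep the exact size

# The new embedding size is the base vocabulary plus every added token,
# rather than len(tokenizer) alone, matching the fix shown in the diff above.
model.resize_token_embeddings(
    new_num_tokens=tokenizer.vocab_size + len(tokenizer.added_tokens_encoder.keys()),
    pad_to_multiple_of=resize_to_multiple_of if resize_to_multiple_of is not None else None,
)
```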
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Pull Request Overview

This PR fixes the embedding resize logic in `setup_chat_format` and introduces a new helper, `setup_chat_template`, which copies a chat template from an existing pretrained tokenizer.

- Replaces calls to `setup_chat_format` with `setup_chat_template` in the SFT script.
- Updates `setup_chat_format` to use an explicit `new_num_tokens` and adds the `setup_chat_template` function.
- Adjusts exports, tests, and documentation to reference and cover `setup_chat_template`.
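For illustration, a minimal usage sketch of the new helper under the `(model, tokenizer, source)` call shape that appears in the sft.py diff later in this thread; the helper is renamed `clone_chat_template` further down, so that final name is used here, and the base checkpoint is just a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import clone_chat_template  # introduced in this PR as setup_chat_template, later renamed

# A base model whose tokenizer ships without a chat template (placeholder checkpoint).
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

# Copy the chat template (and the added tokens it relies on) from a model
# that already defines one.
model, tokenizer = clone_chat_template(model, tokenizer, "Qwen/Qwen3-0.6B")
```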
Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| trl/scripts/sft.py | Swap `setup_chat_format` for `setup_chat_template` (hardcoded source) and note passing source arg |
| trl/models/utils.py | Fix resize args in `setup_chat_format`, import `AutoTokenizer`, and add `setup_chat_template` |
| trl/models/__init__.py | Export `setup_chat_template` |
| trl/__init__.py | Export `setup_chat_template` |
| tests/test_dataset_formatting.py | Import and test `setup_chat_template`; update resize modulus in old test |
| docs/source/sft_trainer.md | Change docs to use `setup_chat_template` and update references |
| docs/source/model_utils.md | Add autodoc entry for `setup_chat_template` |
Comments suppressed due to low confidence (2)

docs/source/sft_trainer.md:63

- Update the section heading to reflect `setup_chat_template` (e.g. "Add Special Tokens for Chat Template") to match the examples below.

  `### Add Special Tokens for Chat Format`

trl/models/utils.py:82

- [nitpick] Since this function is slated for deprecation, consider emitting a `FutureWarning` to notify users to migrate to `setup_chat_template`.

  `def setup_chat_format(`
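As a sketch of the suggested deprecation path (not code from this PR), the warning could be emitted at the top of the function; the signature below matches the current `setup_chat_format` API.

```python
import warnings


def setup_chat_format(model, tokenizer, format="chatml", resize_to_multiple_of=None):
    # Hypothetical deprecation notice, as suggested in the review comment above.
    warnings.warn(
        "setup_chat_format is deprecated; use clone_chat_template instead.",
        FutureWarning,
    )
    ...  # existing setup logic unchanged
```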
Co-authored-by: Copilot <[email protected]>
trl/models/utils.py (Outdated)

```diff
@@ -132,6 +137,69 @@ def setup_chat_format(
     return model, tokenizer


+def setup_chat_template(
```
Would `clone_chat_template` be a more suitable name?
sounds good
Co-authored-by: Edward Beeching <[email protected]>
Title changed: `setup_chat_format` and add `setup_chat_template` → `setup_chat_format` and add `clone_chat_template`
Thanks for the improvements to chat template formatting! For context, `setup_chat_format()` was designed to let one apply a default chat template (ChatML) to base / pretrained models, where a chat template is usually not predefined (Qwen models are the main exception).

The intention was to support a few popular chat templates which could then be set (potentially) via the model config.

If I understand correctly, this PR is about scenarios where one would like to post-train a model with an existing chat template? In that case, I'm not sure I understand why we need to add special tokens, etc.

The reason we did this in
For popular chat templates, there's usually at least one model on the Hugging Face Hub that already implements that template, right?
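For reference, a minimal sketch of the existing workflow described above, applying the default ChatML template to a base model with `setup_chat_format`; the checkpoint is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import setup_chat_format

# A base model without a predefined chat template (placeholder checkpoint).
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

# Sets the ChatML template, adds the <|im_start|>/<|im_end|> special tokens,
# and resizes the token embeddings to match.
model, tokenizer = setup_chat_format(model, tokenizer, format="chatml")
```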
Co-authored-by: lewtun <[email protected]>
```diff
@@ -104,7 +104,8 @@ def main(script_args, training_args, model_args):

     # Set default chat template if needed
     if tokenizer.chat_template is None:
-        model, tokenizer = setup_chat_format(model, tokenizer, format="chatml")
+        # TODO: source should be passed as an argument
+        model, tokenizer = clone_chat_template(model, tokenizer, "Qwen/Qwen3-0.6B")
```
We have found it useful internally to expose a `chat_template` arg in `SFTConfig` which allows one to define a custom template or copy-paste one from an existing model. Perhaps we could expose this along with a `chat_template_clone` arg (or something similar), now that I better understand what your intent was in this PR?
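A hypothetical sketch of what such config fields might look like; neither `chat_template` nor `chat_template_clone` exists in `SFTConfig` in this PR, and the names below are only the ones floated in the comment.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SFTConfigSketch:
    # Hypothetical fields mirroring the suggestion above; not part of SFTConfig.
    chat_template: Optional[str] = field(
        default=None,
        metadata={"help": "A Jinja chat template string to set on the tokenizer."},
    )
    chat_template_clone: Optional[str] = field(
        default=None,
        metadata={"help": "Hub ID of a model/tokenizer to copy the chat template from."},
    )
```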
sounds good!
This PR does two things:

1. Fixes `setup_chat_format`.
2. Adds `clone_chat_template`.

Both functions essentially serve the same purpose. The key difference is that `clone_chat_template` takes as input any tokenizer that already includes a chat template. I think this new function is more convenient and flexible in practice.

Also, if you look at the implementations, you'll notice that unlike `setup_chat_format`, `clone_chat_template` doesn't handle BOS or PAD tokens. Correct me if I'm wrong, but I don't believe that's necessary when setting up a chat template.

It also adds all the added tokens from the source tokenizer, ensuring tokens like `<think>`, `<|eot|>`, ... are properly set.

Note that practices vary a lot regarding these added tokens and whether they are "special" or not. The current implementation fully relies on the source tokenizer: if a token is special in the source tokenizer, it will be special in the target, and vice versa.

In the long term, I think we should deprecate `setup_chat_format` in favor of `clone_chat_template`, but this can be done in a future PR.
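To make the described behavior concrete, here is a minimal sketch of the cloning logic as described above (copy the template, mirror the source tokenizer's added tokens with their special/non-special status, then resize the embeddings); it is illustrative only and not the actual TRL implementation.

```python
from typing import Optional

from transformers import AutoTokenizer, PreTrainedModel, PreTrainedTokenizer


def clone_chat_template_sketch(
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizer,
    source: str,
    resize_to_multiple_of: Optional[int] = 64,
):
    """Illustrative sketch only; the real clone_chat_template may differ."""
    source_tokenizer = AutoTokenizer.from_pretrained(source)

    # 1. Copy the chat template verbatim from the source tokenizer.
    tokenizer.chat_template = source_tokenizer.chat_template

    # 2. Mirror the source tokenizer's added tokens (e.g. <think>, <|eot|>),
    #    preserving each token's "special" flag exactly as in the source.
    added = list(source_tokenizer.added_tokens_decoder.values())
    tokenizer.add_tokens([tok for tok in added if not tok.special])
    tokenizer.add_tokens([tok for tok in added if tok.special], special_tokens=True)

    # 3. Resize the model embeddings to cover the enlarged vocabulary,
    #    optionally padded to a multiple for efficiency.
    model.resize_token_embeddings(
        new_num_tokens=tokenizer.vocab_size + len(tokenizer.added_tokens_encoder),
        pad_to_multiple_of=resize_to_multiple_of,
    )
    return model, tokenizer
```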