Skip to content

Conversation

@seanswyi
Copy link
Contributor

What does this PR do?

Fixes #35609 (TL;DR best_model_checkpoint is being set when the checkpoint may not even exist)

  • Add a new attribute to TrainerState called best_global_step that keeps track of the step where we performed evaluation and achieved a new best metric.
  • Moved setting best_model_checkpoint from _determine_best_metric to _save_checkpoint.
  • Inside _save_checkpoint, check if the best_model_checkpoint actually exists, and set it if so.

This was a bit trickier than I thought it would be, since the Trainer's saving and evaluation logic are not tightly coupled as would be in the save_strategy == "best" scenario.

Before submitting

Who can review?

@muellerzr
@SunMarc
@tomaarsen

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@SunMarc SunMarc requested a review from tomaarsen January 27, 2025 15:05
@SunMarc
Copy link
Member

SunMarc commented Jan 27, 2025

tests/trainer/test_trainer.py::TrainerIntegrationTest::test_best_model_checkpoint_behavior test you created is failing, can you check ?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@seanswyi
Copy link
Contributor Author

tests/trainer/test_trainer.py::TrainerIntegrationTest::test_best_model_checkpoint_behavior test you created is failing, can you check ?

I think the error should be fixed now. Problem was I had hardcoded values and wasn't using the patch correctly. Thanks for pointing it out.

@SunMarc
Copy link
Member

SunMarc commented Jan 28, 2025

Make sure to run make style for the CI. There is also a failing test but that's unrelated to this PR

@seanswyi
Copy link
Contributor Author

Make sure to run make style for the CI. There is also a failing test but that's unrelated to this PR

Done. I'm not sure why but there was a bit of discrepancy between my local setting and the CI pipeline so I had to make the changes through a Docker image since make style wasn't working.

@SunMarc
Copy link
Member

SunMarc commented Jan 28, 2025

thanks ! i will wait for @tomaarsen to test the PR before reviewing it

@tomaarsen
Copy link
Member

I haven't reviewed the code in this PR, but I can confirm that my test passes with this branch again!
cc @SunMarc

  • Tom Aarsen

Copy link
Member

@SunMarc SunMarc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR and sorry for the wait ! Left a few comments !

@SunMarc SunMarc merged commit 691d1b5 into huggingface:main Mar 14, 2025
21 checks passed
@seanswyi seanswyi deleted the fix/best-model-checkpoint-fix branch March 14, 2025 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Trainer sets state.best_model_checkpoint even when it doesn't save there; leads to training crash

4 participants