[train] Simplify ray.train.xgboost/lightgbm (1/n): Align frequency-based and checkpoint_at_end checkpoint formats#42111
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
```python
booster: lightgbm.Booster,
*,
preprocessor: Optional["Preprocessor"] = None,
path: Optional[str] = None,
```
Do we still need these changes if we're centralizing on the Callbacks?
Nope, I can get rid of it. Though if anybody does use this, specifying your own temp dir might be useful if you want it to be cleaned up afterward.
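To illustrate the temp-dir point, here is a minimal stdlib-only sketch (the function name, `MODEL_FILENAME` constant, and `path` semantics are illustrative, not the real Ray API): if no `path` is given, a fresh temp dir is created and never cleaned up automatically; passing your own managed directory guarantees cleanup.

```python
import os
import tempfile

# Illustrative constant, in the spirit of XGBoostCheckpoint.MODEL_FILENAME.
MODEL_FILENAME = "model.ubj"

def save_model_as_checkpoint_dir(serialized_model: bytes, path=None):
    """Write a serialized model into a checkpoint directory.

    If `path` is None, a fresh temp dir is created and the caller is
    responsible for deleting it. Passing your own `path` (e.g. a
    TemporaryDirectory you manage) guarantees cleanup.
    """
    checkpoint_dir = path or tempfile.mkdtemp()
    with open(os.path.join(checkpoint_dir, MODEL_FILENAME), "wb") as f:
        f.write(serialized_model)
    return checkpoint_dir

with tempfile.TemporaryDirectory() as user_dir:
    out_dir = save_model_as_checkpoint_dir(b"fake-model-bytes", path=user_dir)
    assert os.path.exists(os.path.join(out_dir, MODEL_FILENAME))
# user_dir and its contents are removed automatically here.
```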
```python
from ray.train.lightgbm import RayTrainReportCallback

# Get a `Checkpoint` object that is saved by the callback during training.
result = trainer.fit()
```
nit: For consistency with this, should we update the training example to use the LightGBMTrainer? Same for xgboost.
I want to add the *Trainer examples once I add a v2 xgboost/lightgbm trainer, since then it'll actually show the callback usage in the training function. Right now the user doesn't need to create the callback themselves.
```rst
independent xgboost trials (without data parallelism within a trial).

.. testcode::
    :skipif: True
```
Are we going to add them back later?
This used to be a code-block that didn't run 😅 I just wanted to show a mock xgboost.train call with the callback inside, without needing to specify the dataset and everything.
```python
@PublicAPI(stability="beta")
class RayTrainReportCallback:
```
`TuneCallback` for lgbm was originally an empty class that wasn't referenced anywhere else, so I just removed it.
Why are these changes needed?

This PR fixes `XGBoostTrainer` and `LightGBMTrainer` checkpointing:

- Standardizes on `ray.train.xgboost/lightgbm.RayTrainReportCallback` as the standard utilities that define the checkpoint save/load format.
  - Previously, this logic lived in three places: (1) `XGBoostCheckpoint`, (2) `ray.tune.integration.xgboost.TuneReportCheckpointCallback`, and (3) `XGBoostTrainer._save_model`. These shared the `XGBoostCheckpoint.MODEL_FILENAME` constant in some places, but we re-implemented the `from_model` and `get_model` logic for some reason.
- Unifies frequency-based checkpointing (`CheckpointConfig(checkpoint_frequency)`) and `checkpoint_at_end` into a single callback (`ray.train.*.RayTrainReportCallback`) that handles both `checkpoint_frequency` and `checkpoint_at_end`. This codepath standardizes on the framework-specific checkpoint implementation of checkpoint saving.
- Removes the report-only callback (`TuneReportCallback`). The migration is simple: `TuneReportCallback() -> TuneReportCheckpointCallback(frequency=0)`.
- Untangles the circular dependency between `ray.tune` and `xgboost_ray`/`lightgbm_ray`.
  - The import cycle was `xgboost_ray -> ray.tune.* -> ray.train.* -> ray.train.xgboost -> xgboost_ray`, which raised an `ImportError` which `xgboost_ray` incorrectly used to determine whether Ray Train/Tune were installed.
- A follow-up removes the `xgboost_ray` and `lightgbm_ray` dependencies by re-implementing simple versions of these trainers as `DataParallelTrainer`s. See: [train] Simplify `ray.train.xgboost/lightgbm` (2/n): Re-implement `XGBoostTrainer` as a lightweight `DataParallelTrainer` #42767.

API Change Summary
- Introduces `ray.train.xgboost.RayTrainReportCallback`, mirroring `ray.train.lightning.RayTrainReportCallback`. This will be exposed to users if they have full control over the training loop in the new simplified `XGBoostTrainer`.
- Introduces `ray.train.xgboost.RayTrainReportCallback.get_model(filename)`, which can replace `XGBoostTrainer.get_model` in the future.
- Updates `ray.tune.integration.xgboost.TuneReportCheckpointCallback` accordingly.

The same APIs are introduced for the lightgbm counterparts.
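The value of having the callback own both the save and load side of the format can be shown with a small stdlib-only sketch (names and the JSON "model" are illustrative, not the real Ray or xgboost APIs):

```python
import json
import os
import tempfile

# Illustrative filename; the real callback defines its own constant.
MODEL_FILENAME = "model.json"

class ReportCheckpointCallback:
    """Sketch of a callback that defines both how a model is written into a
    checkpoint directory and how it is read back, so the two never drift."""

    @staticmethod
    def save_model(model: dict, checkpoint_dir: str) -> None:
        with open(os.path.join(checkpoint_dir, MODEL_FILENAME), "w") as f:
            json.dump(model, f)

    @staticmethod
    def get_model(checkpoint_dir: str) -> dict:
        with open(os.path.join(checkpoint_dir, MODEL_FILENAME)) as f:
            return json.load(f)

with tempfile.TemporaryDirectory() as ckpt_dir:
    ReportCheckpointCallback.save_model({"num_trees": 10}, ckpt_dir)
    model = ReportCheckpointCallback.get_model(ckpt_dir)
print(model)  # {'num_trees': 10}
```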
TODOs left for followups

- Some of this logic still lives in `xgboost_ray` right now.
- Clean up the `checkpoint_at_end` vs. `checkpoint_frequency` overlap logic for the test case with a TODO in `test_xgboost_trainer` after switching to the simplified xgboost trainer.

Related issue number
Closes #41608
Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.