[doc][train] Recommend tree_learner="data_parallel" in examples to enable distributed lightgbm training#58709
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Code Review
This pull request is a valuable improvement for users of LightGBM with Ray Train. It correctly identifies a common pitfall where users may not actually be training in a distributed fashion, and addresses it by updating the documentation, examples, and the legacy trainer implementation to set tree_learner="data_parallel" and pre_partition=True. The changes are consistent and well executed. I have one minor suggestion: add a comment to the implementation explaining why these default parameters are set, to improve long-term maintainability.
config.setdefault("tree_learner", "data_parallel")
config.setdefault("pre_partition", True)
It's great that you're setting sensible defaults for distributed training. To improve code clarity and maintainability, consider adding a comment explaining why these specific default values were chosen. This will help future developers understand the reasoning behind these settings.
# Set default parameters for distributed training.
# `tree_learner="data_parallel"` enables data-parallel training.
# `pre_partition=True` is needed since the data is sharded by Ray Data.
config.setdefault("tree_learner", "data_parallel")
config.setdefault("pre_partition", True)
Description
The default is tree_learner="serial", which trains a separate model per worker. Users should set tree_learner="data_parallel" to configure LightGBM to train a single model across all of the dataset shards.
pre_partition=True should also be set when using Ray Data to shard the dataset.
Additional information
See here: https://lightgbm.readthedocs.io/en/stable/Parallel-Learning-Guide.html
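The effect of the `setdefault` calls in this PR can be sketched in plain Python, with no Ray or LightGBM dependency. The helper name `apply_distributed_defaults` below is hypothetical; the two `setdefault` calls mirror the ones added by this PR, and the key property they rely on is that `dict.setdefault` only fills in a key the user has not set.

```python
def apply_distributed_defaults(config: dict) -> dict:
    # Hypothetical helper mirroring the PR's setdefault calls.
    # `tree_learner="data_parallel"` enables data-parallel training
    # (the LightGBM default, "serial", trains a separate model per worker).
    config.setdefault("tree_learner", "data_parallel")
    # `pre_partition=True` tells LightGBM the data is already sharded,
    # which is the case when Ray Data partitions the dataset.
    config.setdefault("pre_partition", True)
    return config

# User leaves both keys unset: the distributed defaults are applied.
print(apply_distributed_defaults({"objective": "binary"}))

# User explicitly chooses another learner: setdefault preserves it.
print(apply_distributed_defaults({"tree_learner": "voting_parallel"}))
```

Because `setdefault` never overwrites an existing key, users who deliberately pick a different `tree_learner` (for example, `"voting_parallel"`) are not affected by these defaults.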