[doc][train] Recommend tree_learner="data_parallel" in examples to enable distributed lightgbm training#58709

Merged
justinvyu merged 2 commits into ray-project:master from justinvyu:lightgbm_tree_learner
Nov 17, 2025

Conversation

@justinvyu
Contributor

Description

The default is `tree_learner="serial"`, which trains a separate model per worker. Users should set `tree_learner="data_parallel"` so that LightGBM trains a single model across all of the dataset shards. `pre_partition=True` should also be set when using Ray Data to shard the dataset.

Additional information

See here: https://lightgbm.readthedocs.io/en/stable/Parallel-Learning-Guide.html
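The defaulting behavior described above can be sketched as a small standalone helper. `apply_distributed_defaults` is a hypothetical name used here for illustration; it mirrors the `dict.setdefault` calls this PR adds to the trainer, which fill in missing keys without overriding anything the user set explicitly:

```python
# Hypothetical helper mirroring the defaults this PR sets for LightGBM
# training parameters. dict.setdefault only fills in missing keys, so
# user-provided values are never overridden.
def apply_distributed_defaults(config: dict) -> dict:
    # Train one model across all workers instead of one model per worker.
    config.setdefault("tree_learner", "data_parallel")
    # Ray Data already shards the dataset, so LightGBM should not
    # re-partition the data itself.
    config.setdefault("pre_partition", True)
    return config

# Missing keys are filled in; explicit user settings take precedence.
defaults_applied = apply_distributed_defaults({"objective": "binary"})
user_override = apply_distributed_defaults({"tree_learner": "serial"})
```

Using `setdefault` rather than direct assignment is what makes these safe as defaults: a user who deliberately wants single-worker behavior can still pass `tree_learner="serial"`.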

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu requested review from a team as code owners November 17, 2025 20:52
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is a valuable improvement for users of LightGBM with Ray Train. It correctly identifies a common pitfall where users might not be running in a distributed fashion and addresses it by updating documentation, examples, and the legacy trainer implementation to include tree_learner="data_parallel" and pre_partition=True. The changes are consistent and well-executed. I have one minor suggestion to add a comment to the implementation to improve long-term maintainability by explaining why these default parameters are being set.

Comment on lines +76 to +77
config.setdefault("tree_learner", "data_parallel")
config.setdefault("pre_partition", True)


Severity: medium

It's great that you're setting sensible defaults for distributed training. To improve code clarity and maintainability, consider adding a comment explaining why these specific default values are chosen. This will help future developers understand the reasoning behind these settings.

    # Set default parameters for distributed training.
    # `tree_learner="data_parallel"` enables data-parallel training.
    # `pre_partition=True` is needed since the data is sharded by Ray Data.
    config.setdefault("tree_learner", "data_parallel")
    config.setdefault("pre_partition", True)

@justinvyu justinvyu enabled auto-merge (squash) November 17, 2025 21:52
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 17, 2025
@justinvyu justinvyu merged commit 93dad3b into ray-project:master Nov 17, 2025
7 checks passed
@justinvyu justinvyu deleted the lightgbm_tree_learner branch November 18, 2025 02:35
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…enable distributed lightgbm training (ray-project#58709)

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
…enable distributed lightgbm training (ray-project#58709)

Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…enable distributed lightgbm training (ray-project#58709)

Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…enable distributed lightgbm training (ray-project#58709)

Signed-off-by: Future-Outlier <eric901201@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…enable distributed lightgbm training (ray-project#58709)

Signed-off-by: peterxcli <peterxcli@gmail.com>

Labels

go add ONLY when ready to merge, run all tests
