Closed
Labels
P2 (Important issue, but not time-critical) · docs (An issue or change related to documentation) · train (Ray Train Related Issue)
Description
What happened + What you expected to happen
tl;dr:
- I am using Ray Tune with xgboost_ray to train an XGBoost model on a Ray cluster.
- I want to checkpoint every N iterations and at the end of all iterations.
- I observe that a checkpoint is not saved at the end.
Expected behavior:
- When training for num_boost_round = 13 (iterations) with a checkpoint frequency of 5, I'd expect to see the following checkpoints:
- First checkpoint after 5 iterations
- Second checkpoint after 10 iterations
- Final checkpoint upon completing training, i.e. after 13 iterations
Actual behavior:
- I observe a checkpoint gets created for the following iterations:
- First checkpoint after 5 iterations
- Second checkpoint after 10 iterations
- No checkpoint gets created upon completing training, i.e. after 13 iterations
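The observed behavior is consistent with a purely frequency-based trigger: a checkpoint fires only when the iteration count is a multiple of `frequency`, so the final partial interval (iterations 11–13) never produces one. A minimal plain-Python sketch of that logic (an illustration, not Ray's actual implementation):

```python
def checkpoint_iterations(num_boost_round, frequency, checkpoint_at_end=False):
    """Return the iterations at which a frequency-based callback would checkpoint."""
    iters = [i for i in range(1, num_boost_round + 1) if i % frequency == 0]
    # A checkpoint at the final iteration only happens if it is added explicitly.
    if checkpoint_at_end and num_boost_round not in iters:
        iters.append(num_boost_round)
    return iters

# Matches the observed behavior: only iterations 5 and 10.
print(checkpoint_iterations(13, 5))        # → [5, 10]
# What was expected:
print(checkpoint_iterations(13, 5, True))  # → [5, 10, 13]
```

Since 13 % 5 != 0, a final checkpoint would require explicit end-of-training handling rather than the periodic trigger alone.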
Additional information:
- This runs in a distributed setting, using Ray 2.7.1, with remote checkpoint storage enabled (in cloud storage)
- This uses Ray Tune's tune.run(callable_function) API paradigm.
- This also uses the ray.tune.integration.xgboost.TuneReportCheckpointCallback
Versions / Dependencies
Versions:
- Ray 2.7.1
- xgboost-ray 0.1.19
- OS: Debian GNU/Linux 10 (buster)
Reproduction script
I call the following function via tune.run() (full invocation below):
```python
def tune_train_model(config, verbose=False):
    '''
    Outer function of what gets executed by Ray Tune
    '''
    files = config['files']  # Returns a list of Parquet files read from cloud storage
    from xgboost_ray import RayDMatrix, RayParams, train, RayFileType
    from xgboost_ray.main import _is_client_connected, is_session_enabled

    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        bst = train_xgboost_on_ray(config, files, checkpoint_dir=checkpoint.path)
    else:
        if verbose:
            _logger.info("No checkpoint yet")
        bst = train_xgboost_on_ray(config, files, checkpoint_dir=None)
    bst.save_model("model.xgb")  # Added because currently does not save final iteration. Would like to get rid of this.
```
The above function in turn calls the following inner function to do the actual training:
```python
def train_xgboost_on_ray(config, files, checkpoint_dir=None, verbose=False):
    '''
    Inner function of what gets executed by Ray Tune
    '''
    from xgboost_ray import RayDMatrix, RayParams, train, RayFileType, RayShardingMode
    from xgboost_ray.main import _is_client_connected, is_session_enabled
    from ray.tune.integration.xgboost import TuneReportCheckpointCallback
    import xgboost as xgb

    model_file = None
    rounds_left = None
    if checkpoint_dir:  # Ignore for this repro
        model_file = load_model_file()
        rounds_left = get_rounds_left(model_file, config['num_boost_round'])

    dmatrix = RayDMatrix(
        [files],
        label=config['ycol'],
        weight=config['wcol'],
        filetype=RayFileType.PARQUET,
        sharding=RayShardingMode.BATCH,
        lazy=True,
        num_actors=config['num_actors'],
        columns=config['xcols'] + [config['ycol']] + [config['wcol']],
        engine="pyarrow",
    )

    # Train XGBoost on Ray
    params = get_xgb_params()
    _logger.info('Begin training on XGBoost...')
    bst = train(
        params,
        dmatrix,
        num_boost_round=rounds_left if rounds_left else config['num_boost_round'],
        evals=[(dmatrix, "train")],
        ray_params=RayParams(
            num_actors=config['num_actors'],
            cpus_per_actor=config['cpus_per_actor'],
            verbose=True,
        ),
        callbacks=[TuneReportCheckpointCallback(
            filename='model.xgb',
            frequency=config["callback_frequency"],  # 5
        )],
        xgb_model=model_file,
    )
    return bst
```
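The resume branch above relies on a `get_rounds_left` helper that is not shown. Its arithmetic is straightforward; a hedged sketch, using an integer count of completed rounds for clarity (the repro passes the loaded model file instead, and the number of completed rounds would come from the checkpointed Booster, e.g. via xgboost's `num_boosted_rounds()`, which is an assumption here):

```python
def get_rounds_left(completed_rounds, num_boost_round):
    """Remaining boosting rounds after resuming from a checkpointed model.

    completed_rounds: rounds already trained in the checkpointed model.
    num_boost_round: total rounds the run should reach.
    """
    return max(num_boost_round - completed_rounds, 0)

# Resuming a 13-round run from the 10-iteration checkpoint leaves 3 rounds.
print(get_rounds_left(10, 13))  # → 3
```

Note that because the last checkpoint is from iteration 10 rather than 13, a resumed run always has to redo the final 3 rounds, which is exactly the cost of the missing end-of-training checkpoint.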
Lastly, I parameterize and kickstart my tune runs as follows:
```python
config = {
    'files': get_files(),
    'xcols': get_xcols(),
    'ycol': get_ycol(),
    'wcol': get_wcol(),
    'num_actors': 12,
    'use_gpu': False,
    'objective': OBJECTIVE,
    'eval_metric': METRIC,
    # XGBoost HPs
    'num_boost_round': tune.grid_search([13]),
    'max_depth': 5,
    # ...
    # Other standard XGBoost hyperparameters
    # ...
    'seed': tune.grid_search([1]),
    'cpus_per_actor': 29,
    'gpus_per_actor': 0,
    'experiment_name': f"{str(uuid.uuid4())}",
    'callback_frequency': 5,
}
```
```python
ray_params = RayParams(
    num_actors=config['num_actors'],
    gpus_per_actor=config['gpus_per_actor'] if config['use_gpu'] else 0,
    cpus_per_actor=config['cpus_per_actor'],
    verbose=True,
    max_actor_restarts=10,
)

if ray.is_initialized():
    ray.shutdown()
ray.init(
    address=HEAD_ADDR,
    log_to_driver=False,
)

from ray.air import RunConfig, CheckpointConfig, ScalingConfig
from ray.tune.progress_reporter import JupyterNotebookReporter

reporter = JupyterNotebookReporter(max_progress_rows=100, max_error_rows=100, infer_limit=10)
reporter.add_metric_column("time_this_iter_s")
reporter.add_metric_column(METRIC)

analysis = tune.run(
    tune_train_model,
    config=config,
    metric=METRIC,
    mode="min",
    resources_per_trial=ray_params.get_tune_resources(),
    resume='AUTO',
    name=config['experiment_name'],
    storage_path=STORAGE_PATH,
    storage_filesystem=FILESYSTEM,
    progress_reporter=reporter,
    sync_config=ray.train.SyncConfig(sync_period=30),
)
```
Issue Severity
High: It blocks me from completing my task.