
[doc][Tune] XGBoost checkpoint not saved at end of iterations when using TuneReportCheckpointCallback  #40705

@sjhermanek

Description

What happened + What you expected to happen

tl;dr:

  • I am using Ray Tune with xgboost_ray to train an XGBoost model on a Ray cluster.
  • I want to checkpoint every N iterations and at the end of all iterations.
  • I observe that a checkpoint is not saved at the end.

Expected behavior:

  • When training with num_boost_round = 13 and a checkpoint frequency of 5 iterations, I'd expect to see the following checkpoints:
    • First checkpoint after 5 iterations
    • Second checkpoint after 10 iterations
    • Final checkpoint upon completing training, i.e. after 13 iterations

Actual behavior:

  • I observe a checkpoint gets created for the following iterations:
    • First checkpoint after 5 iterations
    • Second checkpoint after 10 iterations
  • No checkpoint gets created upon completing training, i.e. after 13 iterations
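The observed behavior can be modeled with a minimal pure-Python sketch (checkpoint_iterations is an illustrative name of mine, not a Ray function): a frequency-based checkpoint fires only when the 1-based iteration is a multiple of frequency, so a final round like 13 never triggers one.

```python
# Illustrative sketch only: model the "every N iterations" rule used by a
# frequency-based checkpoint callback. Round 13 is not a multiple of 5,
# so no checkpoint fires there.
def checkpoint_iterations(num_boost_round, frequency):
    return [i for i in range(1, num_boost_round + 1) if i % frequency == 0]

print(checkpoint_iterations(13, 5))  # [5, 10] -- nothing at round 13
```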

Additional information:

  • This runs in a distributed setting on Ray 2.7.1, with remote checkpoint storage enabled (in cloud storage).
  • It uses Ray Tune's function-based tune.run() API.
  • It also uses ray.tune.integration.xgboost.TuneReportCheckpointCallback.

Versions / Dependencies

Versions:

  • Ray 2.7.1
  • xgboost-ray 0.1.19
  • OS: Debian GNU/Linux 10 (buster)

Reproduction script

I call the following function via tune.run() (see below):

def tune_train_model(config, verbose=False):
    '''
    Outer function executed by Ray Tune
    '''
    files = config['files']  # A list of Parquet files read from cloud storage

    from xgboost_ray import RayDMatrix, RayParams, train, RayFileType
    from xgboost_ray.main import _is_client_connected, is_session_enabled

    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        bst = train_xgboost_on_ray(config, files, checkpoint_dir=checkpoint.path)
    else:
        if verbose:
            _logger.info("No checkpoint yet")
        bst = train_xgboost_on_ray(config, files, checkpoint_dir=None)

    bst.save_model("model.xgb")  # Added because the final iteration is not checkpointed; I'd like to remove this.
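As a cleaner interim workaround than the bare bst.save_model("model.xgb") above, the final model could be reported to Tune as a real checkpoint so it also lands in remote storage. This is only a sketch: report_final_checkpoint is a hypothetical helper of mine, it assumes Ray 2.7's ray.train.report / ray.train.Checkpoint API, and metrics would need to include the metric that tune.run() monitors.

```python
# Hypothetical helper (not part of Ray's API): persist the finished booster
# as a proper Tune checkpoint instead of a bare local file.
import os
import tempfile

def report_final_checkpoint(bst, metrics):
    from ray import train  # deferred import, matching the repro script's style

    with tempfile.TemporaryDirectory() as tmp_dir:
        bst.save_model(os.path.join(tmp_dir, "model.xgb"))
        # Reporting with a checkpoint attached lets Tune sync it to storage_path.
        train.report(metrics, checkpoint=train.Checkpoint.from_directory(tmp_dir))
```

Under these assumptions, the last line of tune_train_model would become something like report_final_checkpoint(bst, {METRIC: last_eval_result}).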

The above function in turn calls the following inner function to do the actual training:

def train_xgboost_on_ray(config, files, checkpoint_dir=None, verbose=False):
    '''
    Inner function executed by Ray Tune
    '''
    from xgboost_ray import RayDMatrix, RayParams, RayShardingMode, train, RayFileType
    from xgboost_ray.main import _is_client_connected, is_session_enabled
    from ray.tune.integration.xgboost import TuneReportCheckpointCallback
    import xgboost as xgb
    
    model_file = None
    rounds_left = None
    
    if checkpoint_dir:  # Ignore for this repro
        model_file = load_model_file()
        rounds_left = get_rounds_left(model_file, config['num_boost_round'])
    
    dmatrix = RayDMatrix(
        [files],
        label=config['ycol'],
        weight=config['wcol'],
        filetype=RayFileType.PARQUET,
        sharding=RayShardingMode.BATCH,
        lazy=True,
        num_actors=config['num_actors'],
        columns=config['xcols'] + [config['ycol']] + [config['wcol']],
        engine="pyarrow",
    )

    # Train XGBoost on Ray
    params = get_xgb_params()

    _logger.info('Begin training on XGBoost...')
    bst = train(
        params,
        dmatrix,
        num_boost_round=rounds_left if rounds_left else config['num_boost_round'],
        evals=[(dmatrix, "train")],
        ray_params=RayParams(
            num_actors=config['num_actors'],
            cpus_per_actor=config['cpus_per_actor'],
            verbose=True,
        ),
        callbacks=[TuneReportCheckpointCallback(
            filename='model.xgb',
            frequency=config["callback_frequency"],  # 5
        )],
        xgb_model=model_file,
    )
        
    return bst
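Another interim option, if the Ray-side callback cannot be changed, would be a second, native XGBoost callback that writes the model once training completes. needs_end_save and make_end_save_callback are hypothetical names of mine; the callback relies only on XGBoost's standard TrainingCallback.after_training hook.

```python
# Hypothetical interim fix (my names, not Ray's): force a save of the final
# booster with a plain XGBoost callback, covering the case where the last
# round is not a multiple of the checkpoint frequency.
def needs_end_save(num_boost_round, frequency):
    # True when the frequency-based callback will skip the final round.
    return num_boost_round % frequency != 0

def make_end_save_callback(path):
    import xgboost as xgb  # deferred import, matching the repro script's style

    class EndSaveCallback(xgb.callback.TrainingCallback):
        # after_training is XGBoost's standard end-of-training hook.
        def after_training(self, model):
            model.save_model(path)
            return model

    return EndSaveCallback()
```

One could append make_end_save_callback("model_final.xgb") to the callbacks list passed to train() when needs_end_save(13, 5) is true. Caveat: in a distributed xgboost_ray run the callback executes on a worker, so the file is written to that worker's local disk and would still need to be uploaded or reported to remote storage separately.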

Lastly, I parameterize and kickstart my tune runs as follows:

    config = {
        'files': get_files(),
        'xcols': get_xcols(),
        'ycol': get_ycol(),
        'wcol': get_wcol(),
        'num_actors': 12,
        'use_gpu': False,
        
        'objective': OBJECTIVE,
        'eval_metric': METRIC,
        
        # XGBoost HPs
        'num_boost_round': tune.grid_search([13]),
        'max_depth': 5,
        # ...
        # Other standard XGBoost hyperparameters
        # ...
        'seed': tune.grid_search([1]),
        
        'cpus_per_actor': 29,
        'gpus_per_actor': 0,
        'experiment_name': f"{str(uuid.uuid4())}",

        'callback_frequency': 5,
    }

    ray_params = RayParams(
        num_actors=config['num_actors'],
        gpus_per_actor=config['gpus_per_actor'] if config['use_gpu'] else 0,
        cpus_per_actor=config['cpus_per_actor'],
        verbose=True,
        max_actor_restarts=10,
    )

    if ray.is_initialized():
        ray.shutdown()

    ray.init(
        address=HEAD_ADDR,
        log_to_driver=False,
    )

    from ray.air import RunConfig, CheckpointConfig, ScalingConfig
    from ray.tune.progress_reporter import JupyterNotebookReporter
    
    reporter = JupyterNotebookReporter(max_progress_rows=100, max_error_rows=100, infer_limit=10)
    reporter.add_metric_column("time_this_iter_s")
    reporter.add_metric_column(METRIC)

    analysis = tune.run(
        tune_train_model,
        config=config,
        metric=METRIC,
        mode="min",
        resources_per_trial=ray_params.get_tune_resources(),
        resume='AUTO',
        name=config['experiment_name'],
        storage_path=STORAGE_PATH,
        storage_filesystem=FILESYSTEM,
        progress_reporter=reporter,
        sync_config=ray.train.SyncConfig(sync_period=30),
    )

Issue Severity

High: It blocks me from completing my task.

Labels

  • P2: Important issue, but not time-critical
  • docs: An issue or change related to documentation
  • train: Ray Train related issue
