Do distributed erroring properly #60

@RobotSail

The current codebase has many places where we raise exceptions in the middle of distributed execution. When logic branches on specific ranks (e.g. only rank 0 logs to wandb), a failure can hit only a fraction of the process group while the rest continue executing, which leads to confusing errors. For example, if wandb fails to initialize on rank 0, rank 0 throws an error and exits, but the remaining ranks were never asked to initialize it, so they continue with the training process as if rank 0 were still present.
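One possible direction, sketched below under some assumptions: catch the exception on the rank that runs the rank-specific step, then have every rank take part in a collective that shares a "did anyone fail" flag, so all ranks abort together. The names `guarded_rank_step`, `RankFailure`, `ThreadAllReduce`, and `simulate` are hypothetical and not from this codebase. The collective is simulated with threads so the sketch runs without a process group; in a real run it would presumably be backed by something like `torch.distributed.all_reduce` with `ReduceOp.MAX`, assuming the project uses `torch.distributed`.

```python
import threading
from typing import Callable, Dict


class RankFailure(RuntimeError):
    """Raised on every rank once any rank has reported a failure (hypothetical name)."""


def guarded_rank_step(
    rank: int,
    target_rank: int,
    step: Callable[[], None],
    all_reduce_max: Callable[[int], int],
) -> None:
    """Run `step` only on `target_rank`, then agree across ranks on whether it failed."""
    failed = 0
    error = None
    if rank == target_rank:
        try:
            step()  # e.g. the wandb initialization from the example above
        except Exception as exc:
            failed = 1
            error = exc
    # Collective call: every rank must reach this line, including the one that failed.
    any_failed = all_reduce_max(failed)
    if any_failed:
        raise RankFailure(
            f"rank {target_rank} failed; aborting rank {rank}"
        ) from error


class ThreadAllReduce:
    """Single-use stand-in for a MAX all-reduce, with threads playing the ranks.

    Assumption: a real implementation would wrap the process group's collective.
    """

    def __init__(self, world_size: int) -> None:
        self._barrier = threading.Barrier(world_size)
        self._lock = threading.Lock()
        self._values = []

    def max(self, value: int) -> int:
        with self._lock:
            self._values.append(value)
        self._barrier.wait()  # after this, every rank has contributed its flag
        return max(self._values)


def simulate(world_size: int, step: Callable[[], None]) -> Dict[int, str]:
    """Run the guarded step on `world_size` simulated ranks and record each outcome."""
    reducer = ThreadAllReduce(world_size)
    outcomes: Dict[int, str] = {}

    def worker(rank: int) -> None:
        try:
            guarded_rank_step(rank, 0, step, reducer.max)
            outcomes[rank] = "ok"
        except RankFailure:
            outcomes[rank] = "aborted"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes
```

With this shape, a failure on rank 0 surfaces as the same `RankFailure` on every rank instead of leaving the other ranks training against a missing peer. The trade-off is one extra collective per guarded step, which seems acceptable for rare setup steps such as logger initialization.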
