Open
Description
The current codebase raises exceptions in many places during distributed execution. When logic is branched for specific ranks, e.g. only rank 0 logs to wandb, a failure may hit only a fraction of the process group while the rest continue executing, which leads to confusing errors. For example, if wandb fails to initialize on rank 0, rank 0 throws an error and exits, but the remaining ranks were never asked to initialize it, so they continue training as if rank 0 were still present.
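One possible way to avoid this, sketched below under stated assumptions: catch the exception on the rank that runs the branch, have every rank exchange a "did anything fail" flag, and raise on all ranks if any of them reports a failure. The sketch uses only the standard library and simulates ranks with threads; `FakeGroup`, `run_on_rank`, and `simulate` are hypothetical names for illustration, not part of this codebase. In a real job the flag exchange would presumably be a collective (for example an all-reduce) over the actual process group.

```python
import threading


class FakeGroup:
    """Hypothetical stand-in for a process group: each thread acts as one rank."""

    def __init__(self, world_size):
        self.world_size = world_size
        self._flags = [False] * world_size
        self._barrier = threading.Barrier(world_size)

    def any_failed(self, rank, failed):
        # Publish this rank's flag, then wait until every rank has published.
        self._flags[rank] = failed
        self._barrier.wait()
        result = any(self._flags)
        # Second wait keeps the flags stable until every rank has read them.
        self._barrier.wait()
        return result


def run_on_rank(group, rank, target_rank, fn):
    """Run fn on target_rank only, but make all ranks agree on the outcome."""
    error = None
    if rank == target_rank:
        try:
            fn()
        except Exception as exc:  # hold the error until all ranks are notified
            error = exc
    if group.any_failed(rank, error is not None):
        raise RuntimeError(
            f"rank {rank}: step failed on rank {target_rank}"
        ) from error


def simulate(world_size, fn):
    """Run the rank-0-only step on every simulated rank and collect outcomes."""
    group = FakeGroup(world_size)
    outcomes = [None] * world_size

    def worker(rank):
        try:
            run_on_rank(group, rank, 0, fn)
            outcomes[rank] = "ok"
        except RuntimeError:
            outcomes[rank] = "aborted"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes
```

With this pattern, a failure in the rank 0 branch (such as the wandb initialization in the example above) stops every rank at the same point instead of leaving the others training against a missing peer. The cost is one extra synchronization per guarded branch, so it probably fits setup steps better than hot loops.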