[data][train] Fix deadlocks caused by streaming_split#42601
raulchen merged 8 commits into ray-project:master
Conversation
Commits (all signed off by Hao Chen <chenh1024@gmail.com>):
- fix
- separate queues
- debug
- fix
- fix
- debug
- refine
- fix
- Revert "fix" (reverts commit c63f8b71f150b0dc0add60b2817ce2241abd41ac)
- Revert "refine" (reverts commit 225db8279d128e1d00a359b42a5b7b5b93c57cfb)
- fix
raulchen force-pushed from c63f8b7 to d9aeb87
Hmm, sorry, but I don't quite understand the deadlock situation in the PR description or the proposed fix. Doesn't SplitCoordinator explicitly require all the consumers to read at the same time? Is the deadlock situation in the PR description somehow different?
Update:

```python
for batch in it.iter_batches():
    all_reduce()
```

We suspect it's because
Fix a deadlock issue for training jobs. The issue happens in the following situation:

* The output blocks of `streaming_split` are assigned to multiple splits (`output_split_idx`).
* When one split has finished reading all blocks, it won't stop the iteration until all the other splits have also finished, because of [this](https://github.com/ray-project/ray/blob/fae8d2ff814377eb027d63d73a23d5c5bf3b02bd/python/ray/data/_internal/execution/streaming_executor_state.py#L288).
* This is usually fine. But when the unfinished splits are waiting for the finished splits (e.g., on a gradient synchronization), there will be a deadlock due to circular dependencies.

This PR allows the finished splits to finish iteration immediately, without waiting for the others.

---------

Signed-off-by: Hao Chen <chenh1024@gmail.com>
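The circular wait described above can be simulated with plain threads (a sketch with made-up names, not Ray code): a `threading.Barrier` stands in for the per-batch allreduce, and an `Event` models the old "keep iterating until every split is done" behavior.

```python
import threading

NUM_RANKS = 2
allreduce = threading.Barrier(NUM_RANKS)  # stands in for gradient sync
all_splits_done = threading.Event()       # old behavior: finished split waits on this
outcome = {}

def rank(rank_id, num_batches):
    for _ in range(num_batches):
        try:
            # Every batch ends with a gradient sync across all ranks.
            allreduce.wait(timeout=1.0)
        except threading.BrokenBarrierError:
            outcome[rank_id] = "stuck in allreduce"
            return
    # Old behavior: a finished split keeps blocking until every
    # other split has also finished.
    if not all_splits_done.wait(timeout=1.0):
        outcome[rank_id] = "stuck waiting for other splits"

# Rank 0 was assigned 2 batches, rank 1 got 3: uneven split assignment.
threads = [threading.Thread(target=rank, args=(0, 2)),
           threading.Thread(target=rank, args=(1, 3))]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Both ranks end up stuck: rank 0 in the "wait for other splits" step,
# rank 1 in the allreduce that rank 0 never joins.
print(outcome)
```

The timeouts are only there so the sketch terminates; without them, both threads would hang forever, which is the deadlock the PR fixes.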
Hi @raulchen, thanks for pushing this fix -- this actually fixed a NCCL timeout error that we were seeing when doing multi-node distributed training. The behavior there was that sometimes, randomly at the start of a train epoch, we would hit a NCCL timeout error because all of the ranks except one were trying to allreduce the gradients.

I'm also confused by the deadlock explanation though. Have you / the team thought more about how exactly this would have created a deadlock with gradient synchronization? We iterate over our data using the

If it's helpful, we only started seeing this issue when we scaled up the model size (probably because gradient synchronization took longer).
Why are these changes needed?
Fix a deadlock issue for training jobs. The issue happens in the following situation:

* The output blocks of `streaming_split` are assigned to multiple splits (`output_split_idx`).
* When one split has finished reading all blocks, it won't stop the iteration until all the other splits have also finished.
* This is usually fine, but when the unfinished splits are waiting for the finished splits (e.g., on a gradient synchronization), there is a deadlock due to circular dependencies.

This PR allows the finished splits to finish iteration immediately, without waiting for the others.
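A minimal stdlib sketch (hypothetical structure, not the actual SplitCoordinator implementation) of the fixed behavior: each split consumes its own queue and stops as soon as it sees its own end-of-stream sentinel, instead of blocking until every split is done.

```python
import queue

SENTINEL = object()  # hypothetical end-of-stream marker for one split

def iter_split(split_queue):
    """Yield blocks for a single split; stop as soon as its own stream ends."""
    while True:
        block = split_queue.get()
        if block is SENTINEL:
            return  # finish immediately; no waiting on the other splits
        yield block

# Two splits with uneven block assignments, as in the bug report.
q0, q1 = queue.Queue(), queue.Queue()
for b in ("block-0", "block-1"):
    q0.put(b)
q0.put(SENTINEL)
for b in ("block-2", "block-3", "block-4"):
    q1.put(b)
q1.put(SENTINEL)

out0 = list(iter_split(q0))  # returns right away with split 0's two blocks
out1 = list(iter_split(q1))
print(out0, out1)  # ['block-0', 'block-1'] ['block-2', 'block-3', 'block-4']
```

With this shape, a consumer that exhausts its split can proceed straight to whatever comes next (e.g., the final gradient sync), breaking the circular wait.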
Related issue number
Checks

- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If adding a new public method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.