
[data][train] Fix deadlocks caused by streaming_split#42601

Merged
raulchen merged 8 commits intoray-project:masterfrom
raulchen:streaming-split-output-queue
Jan 26, 2024

Conversation


@raulchen raulchen commented Jan 23, 2024

Why are these changes needed?

Fix a deadlock issue for training jobs. The issue happens in the following situation:

  • The output blocks of streaming_split are assigned to multiple splits (output_split_idx).
  • When one split has finished reading all of its blocks, it doesn't stop iterating until all the other splits have also finished, because of this.
  • This is usually fine. But when the unfinished splits are waiting on the finished splits (e.g., there is a gradient synchronization), a deadlock occurs due to the circular dependency.

This PR lets finished splits finish iteration immediately, without waiting for the others.
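The idea behind the fix (the branch is named streaming-split-output-queue, and one commit is titled "separate queues") can be sketched as follows: instead of one shared output queue whose iteration only ends when every split is done, each split gets its own queue with its own end-of-stream sentinel, so a consumer finishes as soon as its own queue is drained. This is a simplified, hypothetical model; the class and names (SplitOutputQueues, SENTINEL) are invented for illustration and are not Ray's actual implementation.

```python
import queue

# Hypothetical, simplified model of per-split output queues.
# Not Ray's actual implementation; names are invented for illustration.
SENTINEL = object()

class SplitOutputQueues:
    def __init__(self, num_splits):
        # One queue per output_split_idx instead of a single shared queue.
        self._queues = [queue.Queue() for _ in range(num_splits)]

    def put(self, output_split_idx, block):
        self._queues[output_split_idx].put(block)

    def finish(self):
        # Signal end-of-stream to every split independently.
        for q in self._queues:
            q.put(SENTINEL)

    def iter_split(self, output_split_idx):
        # A split's iteration ends as soon as *its own* queue is
        # drained, without waiting on the other splits' consumers.
        q = self._queues[output_split_idx]
        while True:
            item = q.get()
            if item is SENTINEL:
                return
            yield item

queues = SplitOutputQueues(num_splits=2)
queues.put(0, "block-a")
queues.put(0, "block-b")
queues.put(1, "block-c")
queues.finish()

# Split 0 can finish immediately, even if split 1 is never consumed.
print(list(queues.iter_split(0)))  # ['block-a', 'block-b']
```

With a single shared queue, `iter_split(0)` would instead have to keep polling until split 1's consumer also reached the end, which is exactly the dependency the PR removes.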

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Commits:

fix
separate queues
debug
fix
fix
debug
refine
fix
Revert "fix" (reverts commit c63f8b71f150b0dc0add60b2817ce2241abd41ac)
Revert "refine" (reverts commit 225db8279d128e1d00a359b42a5b7b5b93c57cfb)
fix

Signed-off-by: Hao Chen <chenh1024@gmail.com>
@raulchen raulchen force-pushed the streaming-split-output-queue branch from c63f8b7 to d9aeb87 Compare January 23, 2024 21:36
@raulchen raulchen changed the title [data] Use multiple queues to store output blocks with split index [data][train] Fix deadlocks caused by streaming_split Jan 25, 2024
@stephanie-wang (Contributor)

Hmm sorry but I don't quite understand the deadlock situation in the PR description and the proposed fix. Doesn't SplitCoordinator explicitly require all the consumers to read at the same time? Is the deadlock situation in the PR description somehow different?

@raulchen (Contributor, Author)

Update:
The issue happens when there is a gradient sync per batch, e.g.:

```python
for batch in it.iter_batches():
    all_reduce()
```

We suspect it's because streaming_split returns a different number of rows for each train worker, which misaligns the all_reduce calls.
We'll merge this PR first, as this dependency shouldn't exist in the first place.
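The suspected failure mode can be reproduced in miniature with a threading.Barrier standing in for the collective all_reduce: if workers see different numbers of batches, the workers still in the loop wait on a collective that the finished worker never joins, and the wait times out. This is an illustrative sketch only, not Ray or NCCL code; the timeout plays the role of the NCCL timeout reported below.

```python
import threading

# Illustrative sketch: a threading.Barrier stands in for an NCCL
# all_reduce. When workers iterate different numbers of batches, the
# remaining worker's last collective call times out.
NUM_WORKERS = 2
barrier = threading.Barrier(NUM_WORKERS)
errors = []

def worker(num_batches):
    for _ in range(num_batches):
        try:
            # "Gradient sync": every worker must arrive for this to pass.
            barrier.wait(timeout=0.5)
        except threading.BrokenBarrierError:
            errors.append("collective timed out")
            return

# Worker 0 sees 3 batches, worker 1 only 2: misaligned collectives.
t0 = threading.Thread(target=worker, args=(3,))
t1 = threading.Thread(target=worker, args=(2,))
t0.start(); t1.start()
t0.join(); t1.join()

print(errors)  # ['collective timed out']
```

The first two barrier waits succeed because both workers arrive; on the third, worker 1 has already exited its loop, so worker 0's wait expires, which is the analog of the finished split blocking the unfinished one.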

@raulchen raulchen merged commit df3dd96 into ray-project:master Jan 26, 2024
@raulchen raulchen deleted the streaming-split-output-queue branch January 26, 2024 23:03
raulchen added a commit to raulchen/ray that referenced this pull request Jan 26, 2024
@c21 (Contributor) left a comment:

late LGTM

architkulkarni pushed a commit that referenced this pull request Jan 29, 2024
Pick #42601 to 2.9.2. This PR fixes a potential deadlock issue for training jobs.

---------

Signed-off-by: Hao Chen <chenh1024@gmail.com>
@ghost commented Mar 7, 2024

Hi @raulchen, thanks for pushing this fix -- this actually fixed an NCCL timeout error that we were seeing when doing multi-node distributed training. The behavior there was that sometimes, randomly at the start of a train epoch, we would hit an NCCL timeout error because all of the ranks except one were trying to allreduce the gradients.

I'm also confused by the deadlock explanation though. Have you / the team thought more about how exactly this would have created a deadlock with gradient synchronization? We iterate over our data using the iter_torch_batches function, and I thought this would call into streaming_split with equal=True (so each worker gets an equal number of rows).

If it's helpful, we only started seeing this issue when we scaled up the model size (probably because gradient synchronization took longer).
