
[Train] Sort Local Train Workers by GPU id#40953

Merged
matthewdeng merged 12 commits into ray-project:master from
woshiyyya:train/sort_worker_by_device_id
Nov 17, 2023
Conversation


@woshiyyya woshiyyya commented Nov 5, 2023

Why are these changes needed?

Sort the local Ray Train workers according to their GPU device id. This ensures that the allocated GPU matches "cuda:{local_rank}". More details in #40803.
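The idea above can be illustrated with a minimal sketch, assuming a simplified worker representation (the `WorkerInfo` dataclass and `sort_local_workers` name here are illustrative, not Ray Train's actual API): workers are ordered by node IP and then by their lowest assigned GPU id, so the worker holding GPU 0 on each node receives local rank 0.

```python
# Illustrative sketch only: WorkerInfo and sort_local_workers are
# hypothetical names, not Ray Train internals.
from dataclasses import dataclass, field
from typing import List

@dataclass
class WorkerInfo:
    ip: str
    gpu_ids: List[int] = field(default_factory=list)

def sort_local_workers(workers: List[WorkerInfo]) -> List[WorkerInfo]:
    # Stable sort: group by node IP first, then order each node's
    # workers by their lowest assigned GPU id, so local_rank i
    # ends up on the worker that owns GPU i.
    return sorted(workers, key=lambda w: (w.ip, min(w.gpu_ids, default=0)))

workers = [
    WorkerInfo("10.0.0.1", [1]),
    WorkerInfo("10.0.0.1", [0]),
    WorkerInfo("10.0.0.2", [0]),
]
ordered = sort_local_workers(workers)
# The worker that owns GPU 0 on 10.0.0.1 now comes first,
# so its local_rank (0) matches its device ("cuda:0").
```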

Related issue number

Close #40803

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@woshiyyya woshiyyya marked this pull request as ready for review November 6, 2023 19:03
# More details: https://github.com/ray-project/ray/issues/40803
def get_lowest_gpu_id(worker) -> int:
    gpu_ids = worker.metadata.resource_ids.get("GPU", [])
    # Return the smallest assigned GPU id; default to 0 for CPU-only workers.
    return min(map(int, gpu_ids), default=0)
Contributor
Converting to int won't work in the future if we support UUIDs (e.g. for MIG). Maybe we can first try to convert to int and then fall back to string?

Member Author
@woshiyyya woshiyyya Nov 15, 2023

Got it, makes sense. I've made it take the str ID as a fallback.
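The fallback discussed here could look like the following sketch (`device_sort_key` is a hypothetical helper, not the code in this PR): integer ids compare numerically, and anything non-numeric, such as a MIG UUID, falls back to string comparison. The tuple key keeps the two kinds separate so Python never compares an int against a str.

```python
# Hypothetical sketch of an int-first, string-fallback sort key.
from typing import Tuple, Union

def device_sort_key(gpu_id: str) -> Tuple[int, Union[int, str]]:
    try:
        # Numeric ids sort first, in numeric order ("10" after "2").
        return (0, int(gpu_id))
    except ValueError:
        # Non-numeric ids (e.g. MIG UUIDs) sort after, lexicographically.
        return (1, gpu_id)

ids = ["3", "MIG-abc123", "1"]
print(sorted(ids, key=device_sort_key))  # ['1', '3', 'MIG-abc123']
```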

Contributor
Wondering if we should update the name of this method now that it's no longer just grouping but also sorting. At the very least we should update the docstring.

In the future we may want to start generalizing this more (e.g. have it be a generic sort function that takes in a comparator) and define the comparison logic upstream in the caller, since this IP/GPU sorting logic doesn't actually belong to the "worker group", but the "backend" layer on top of it.
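The generalization suggested here could be sketched roughly as follows, assuming a simplified `WorkerGroup` (the class and method names are illustrative, not Ray's actual worker-group API): the group exposes a generic sort hook, and the backend layer supplies the IP/GPU comparison key.

```python
# Illustrative sketch: WorkerGroup and sort_workers are hypothetical names.
from typing import Callable, List

class WorkerGroup:
    def __init__(self, workers: List[dict]):
        self.workers = workers

    def sort_workers(self, key: Callable[[dict], object]) -> None:
        # The worker group only knows how to reorder its workers;
        # the comparison logic is injected by the caller (the backend).
        self.workers.sort(key=key)

group = WorkerGroup([{"ip": "2", "gpu": 1}, {"ip": "1", "gpu": 0}])
# The backend layer defines what "sorted" means: by IP, then GPU id.
group.sort_workers(key=lambda w: (w["ip"], w["gpu"]))
```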

Member Author
@woshiyyya woshiyyya Nov 15, 2023

I agree. This is a temporary solution for now. We should eventually move the rank allocation logic into the backend executor (e.g. here:

def _create_rank_world_size_mappings(self) -> List[Dict]:

) and support different allocation strategies in the future. I'll update the name and docstring. What about naming it sort_workers_by_ip_and_gpu_id?

{
    "pids": [0, 1, 2, 3, 4, 5, 6, 7],
    "ips": ["2", "2", "1", "1", "2", "1", "1", "2"],
    "gpu_ids": [None] * 8,
    "expected_local_ranks": None,  # No expected ranks for CPU workers
}
Contributor
I think we should still test default sorting behavior when there are no GPU IDs?

Member Author
@woshiyyya woshiyyya Nov 15, 2023

When using CPU actors, the GPU ids will be empty, so the sorted order will be non-deterministic since all the workers have the same key 0.

Contributor
Oh I think sort will retain the ordering when the entries have the same sort value. But to your point I don't know if this is guaranteed or just based on the implementation.

entries = [5, 4, 3, 2, 1]
print(entries)

entries.sort(key=lambda x: 0)
print(entries)

entries.sort(key=lambda x: x)
print(entries)

Output:
[5, 4, 3, 2, 1]
[5, 4, 3, 2, 1]
[1, 2, 3, 4, 5]

Member Author
@woshiyyya woshiyyya Nov 16, 2023

Ah, I got you. It seems that Python's sort maintains the order when multiple entries have identical keys: https://docs.python.org/3.7/howto/sorting.html#sort-stability-and-complex-sorts

I'll add a test for it.
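A minimal sketch of such a test, under the assumptions of the CPU-worker case discussed above: with no GPU ids, every worker gets the same constant sort key, and since Python's sort is guaranteed stable, the original worker order (and thus the existing local ranks) must be preserved.

```python
# Sketch of a stability test for the CPU-only case (no GPU ids).
pids = [0, 1, 2, 3, 4, 5, 6, 7]

# With no GPU ids, every worker resolves to the same key (0).
sorted_pids = sorted(pids, key=lambda pid: 0)

# Python's sort is stable, so the original order is preserved.
assert sorted_pids == pids
```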

woshiyyya and others added 4 commits November 14, 2023 21:46
@matthewdeng matthewdeng merged commit 0e7a481 into ray-project:master Nov 17, 2023
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Nov 29, 2023


Successfully merging this pull request may close these issues.

[Train] Mismatched device error with HF Accelerate when local rank != device id
