[Train] Sort Local Train Workers by GPU id#40953
matthewdeng merged 12 commits into ray-project:master
Conversation
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
```python
# More details: https://github.com/ray-project/ray/issues/40803
def get_lowest_gpu_id(worker) -> int:
    gpu_ids = worker.metadata.resource_ids.get("GPU", [])
    return min(map(int, gpu_ids), default=0)
```
Converting to int won't work in the future if we support UUIDs (e.g. for MIG). Maybe we can first try to convert to int and then fall back to string?
Got it, makes sense. I've made it fall back to the string ID.
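A minimal sketch of that fallback idea (the helper name and key shape here are hypothetical, not the exact PR code): try numeric ordering first, and fall back to lexicographic string comparison for non-integer device IDs such as MIG UUIDs. A leading flag in the tuple keeps integer keys and string keys mutually comparable.

```python
def lowest_gpu_id_key(gpu_ids):
    """Hypothetical sort key for a worker's GPU IDs.

    Integer IDs sort numerically; non-integer IDs (e.g. MIG UUIDs)
    fall back to string comparison. The leading 0/1 flag orders all
    integer keys before string keys so the tuples always compare.
    """
    if not gpu_ids:
        return (0, 0, "")
    try:
        return (0, min(int(i) for i in gpu_ids), "")
    except ValueError:
        return (1, 0, min(str(i) for i in gpu_ids))
```

Workers could then be sorted with `workers.sort(key=lambda w: lowest_gpu_id_key(...))` without crashing on UUID-style IDs.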
Wondering if we should update the name of this method now that it's not just grouping but also sorting. At the very least we should update the docstring.
In the future we may want to start generalizing this more (e.g. have it be a generic sort function that takes in a comparator) and define the comparison logic upstream in the caller, since this IP/GPU sorting logic doesn't actually belong to the "worker group", but the "backend" layer on top of it.
I agree. This is a temporary solution for now. We should eventually move the rank allocation logic into the backend executor and support different allocation strategies in the future. I'll update the name and docstring. What about naming it sort_workers_by_ip_and_gpu_id?
```python
"pids": [0, 1, 2, 3, 4, 5, 6, 7],
"ips": ["2", "2", "1", "1", "2", "1", "1", "2"],
"gpu_ids": [None] * 8,
"expected_local_ranks": None,  # No expected ranks for CPU workers
```
I think we should still test default sorting behavior when there are no GPU IDs?
When using CPU actors, the GPU IDs will be empty, so the sorted order will be non-deterministic since all the workers have the same key 0.
Oh I think sort will retain the ordering when the entries have the same sort value. But to your point I don't know if this is guaranteed or just based on the implementation.
```python
entries = [5, 4, 3, 2, 1]
print(entries)
entries.sort(key=lambda x: 0)
print(entries)
entries.sort(key=lambda x: x)
print(entries)
```

Output:

```
[5, 4, 3, 2, 1]
[5, 4, 3, 2, 1]
[1, 2, 3, 4, 5]
```
Ah, I got you. It seems that Python's sort is guaranteed to maintain the original order when multiple entries have identical keys: https://docs.python.org/3.7/howto/sorting.html#sort-stability-and-complex-sorts
I'll add a test for it.
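A sketch of what such a test could check (simplified worker records, not the actual test fixtures): since Python's sort is guaranteed stable, CPU workers with no GPU IDs all map to the same sort key and keep their original order.

```python
# Hypothetical (pid, gpu_ids) records standing in for CPU-only workers.
workers = [("pid0", []), ("pid1", []), ("pid2", [])]

# Same key used in the PR snippet: lowest GPU id, defaulting to 0.
ordered = sorted(workers, key=lambda w: min(map(int, w[1]), default=0))

# All keys are 0, so the stable sort preserves the original order.
assert ordered == workers
```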
Why are these changes needed?
Sort the local Ray Train workers by GPU device id. This ensures that the GPU allocated to each worker matches `cuda:{local_rank}`. More details in #40803.
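To illustrate the intent with a simplified sketch (hypothetical `(ip, gpu_ids)` records, not the actual worker objects): sorting by IP first and lowest GPU id second groups workers by node, and within each node the worker at `local_rank` i holds GPU i, so its device is `cuda:{local_rank}`.

```python
# Hypothetical worker records: (ip, gpu_ids).
workers = [
    ("10.0.0.1", ["1"]),
    ("10.0.0.2", ["0"]),
    ("10.0.0.1", ["0"]),
    ("10.0.0.2", ["1"]),
]

# Group by IP, then order by the lowest assigned GPU id on that node.
workers.sort(key=lambda w: (w[0], min(map(int, w[1]), default=0)))

# Within each node, local_rank now increases with GPU id:
# [("10.0.0.1", ["0"]), ("10.0.0.1", ["1"]),
#  ("10.0.0.2", ["0"]), ("10.0.0.2", ["1"])]
```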
Related issue number
Close #40803
Checks

- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I have added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.