
[Core] Introduce AcceleratorManager interface#40286

Merged
jjyao merged 21 commits into ray-project:master from jjyao:jjyao/accelerator
Oct 17, 2023

Conversation


@jjyao jjyao commented Oct 12, 2023

Why are these changes needed?

Introduce the AcceleratorManager interface so that support for each accelerator can be implemented as a subclass.

Related issue number

#38504

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
jjyao added 3 commits October 11, 2023 23:24
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
jjyao added 7 commits October 12, 2023 16:24
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao changed the title [Core][WIP] Introduce Accelerator interface [Core] Introduce Accelerator interface Oct 13, 2023
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
The number of Neuron cores if any were detected, otherwise 0.
"""
nc_count: int = 0
neuron_path = "/opt/aws/neuron/bin/"
Contributor

Note: this path isn't guaranteed to exist in all environments (based on a recent slack conversation https://ray-distributed.slack.com/archives/C01DLHZHRBJ/p1696448156026509)

Expected: find the neuron-ls command if it exists, then run it to get the core count.

Also, we asked the AWS Neuron SDK team to expose the core information via IMDS (API driven), but there are no plans to support this.
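A PATH-based lookup with a try/except, roughly as requested here; a minimal sketch (the `--json-output` flag and the `nc_count` field in `neuron-ls` output are assumptions based on this thread, not verified against every Neuron SDK version):

```python
import json
import shutil
import subprocess


def get_neuron_core_count() -> int:
    """Return the number of Neuron cores, or 0 if none are detected.

    Looks up `neuron-ls` on PATH instead of assuming a fixed install
    directory, and swallows errors so detection never crashes startup.
    """
    neuron_ls = shutil.which("neuron-ls")
    if neuron_ls is None:
        return 0
    try:
        # Assumed output shape: a JSON list of devices, each carrying
        # an "nc_count" field with its core count.
        result = subprocess.run(
            [neuron_ls, "--json-output"], capture_output=True, check=True
        )
        devices = json.loads(result.stdout)
        return sum(device.get("nc_count", 0) for device in devices)
    except (OSError, subprocess.SubprocessError, ValueError):
        return 0
```

On a machine without the Neuron SDK this simply returns 0 rather than raising.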

Collaborator Author

I just copy-pasted this from the existing implementation. Are you planning to fix it for other environments?

Contributor

Can we add a try/catch and fix it in the current PR?
If not, it's ok to move it to an issue and I'll own it (tentative ETA: Q1 2024).

Contributor

Also add a TODO maybe?

Collaborator Author

Created #40405 to track this.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
jjyao added 5 commits October 13, 2023 15:33
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@rkooo567 rkooo567 (Contributor) left a comment

Generally lgtm.

from ray._private.accelerators.neuron import NeuronAccelerator


def get_all_accelerators() -> Set[Accelerator]:
Contributor

These are not DeveloperAPIs, right?

Collaborator Author

No, just implementation details.



@DeveloperAPI
class Accelerator(ABC):
Contributor

Suggested change
class Accelerator(ABC):
class AcceleratorUtil(ABC):

? Since it seems like it's all static methods anyway.

Collaborator Author

Renamed to AcceleratorManager per ray-project/enhancements#46 (comment)
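As merged, the interface is an ABC of static methods, one subclass per accelerator; a condensed sketch (the method set and the toy subclass here are illustrative, not the full final API):

```python
from abc import ABC, abstractmethod
from typing import List


class AcceleratorManager(ABC):
    """Per-accelerator detection and configuration logic.

    Each supported accelerator (Nvidia GPU, AWS Neuron, ...) implements
    this interface; methods are static since managers hold no state.
    """

    @staticmethod
    @abstractmethod
    def get_resource_name() -> str:
        """Ray resource name, e.g. "GPU" or "neuron_cores"."""

    @staticmethod
    @abstractmethod
    def get_current_node_num_accelerators() -> int:
        """Number of accelerators detected on this node."""

    @staticmethod
    @abstractmethod
    def set_current_process_visible_accelerator_ids(ids: List[str]) -> None:
        """Restrict this process to the given accelerator ids,
        typically via an env var like CUDA_VISIBLE_DEVICES."""


class FakeGPUAcceleratorManager(AcceleratorManager):
    """Toy subclass showing the shape of a concrete implementation."""

    @staticmethod
    def get_resource_name() -> str:
        return "GPU"

    @staticmethod
    def get_current_node_num_accelerators() -> int:
        return 0  # a real manager would query a driver library here

    @staticmethod
    def set_current_process_visible_accelerator_ids(ids: List[str]) -> None:
        import os

        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(ids)
```

Note the `@staticmethod` over `@abstractmethod` stacking: decorators apply bottom-up, so the method is both abstract and static.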

"""Get the mapping from accelerator resource name
to the visible ids."""

from ray._private.accelerators import (
Contributor

why do we import here?

Collaborator Author

circular dependency

constraint_name = f"{ray_constants.RESOURCE_CONSTRAINT_PREFIX}{pretty_name}"
return constraint_name
if last_set_visible_accelerator_ids.get(resource_name, None) == accelerator_ids:
continue # optimization: already set
Contributor

Is it new or did we have the same optimization before?

Contributor

also is it really necessary? Aren't they just setting an env var? Maybe remove this for now?

Collaborator Author

Same old code.

I actually tried to remove it since I also think it might be unnecessary but it uncovered a bug. I decided to fix the bug in a follow-up PR and then remove this optimization.
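For context, the optimization is a cache of the last ids written per resource, so repeated tasks on the same worker skip the redundant env-var write; a minimal sketch of the pattern (names are illustrative):

```python
import os
from typing import Dict, List

# Last visible-id list written per resource, e.g. {"GPU": ["0", "1"]}.
last_set_visible_accelerator_ids: Dict[str, List[str]] = {}


def set_visible_ids(resource_name: str, env_var: str, ids: List[str]) -> None:
    """Write the visible-id env var, skipping the write if unchanged."""
    if last_set_visible_accelerator_ids.get(resource_name) == ids:
        return  # optimization: already set
    os.environ[env_var] = ",".join(ids)
    last_set_visible_accelerator_ids[resource_name] = ids


set_visible_ids("GPU", "CUDA_VISIBLE_DEVICES", ["0", "2"])
```

The second call with the same ids is a no-op, which is exactly the behavior being questioned above.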

}


def get_all_accelerator_resource_names() -> Set[str]:
Contributor

Why don't we just use enums here instead of names directly?

Collaborator Author

What benefits do enums provide here?

Contributor

I think it makes sense for the top-level API to accept strings, but for internal functions it's cleaner to pass enums around (otherwise we either rely on the implicit assumption that the input is always valid, or we have to do validation everywhere).
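A sketch of that suggestion, with a hypothetical enum (the merged PR kept plain strings):

```python
from enum import Enum


class AcceleratorResource(Enum):
    """Hypothetical enum for internal plumbing; top-level APIs would
    still accept the plain string and convert at the boundary."""

    GPU = "GPU"
    NEURON_CORES = "neuron_cores"
    TPU = "TPU"


def to_resource(name: str) -> AcceleratorResource:
    # Validation happens once, at the API boundary; internal code can
    # then pass the enum around without re-checking the string.
    return AcceleratorResource(name)  # raises ValueError if invalid
```

With this shape, an invalid name fails fast at the boundary instead of surfacing as a confusing error deep in scheduling code.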

"memory": 12025908428,
"neuron_cores": 2,
"accelerator_type:aws-neuron-core": 2,
"accelerator_type:aws-neuron-core": 1,
Contributor

why is it changed?

Collaborator Author

It's a bug from the previous PR. Total quantity of the special accelerator_type resource should only be 1.

nc_f = ray.remote(resources={"neuron_cores": 2})(lambda: get_neuron_core_ids(2))
assert ray.get(nc_f.remote()) == 2

with pytest.raises(ValueError):
Contributor

why are we removing this?

Collaborator Author

Because we now allow specifying both GPU and neuron core resources for a single task.

jjyao added 3 commits October 17, 2023 01:16
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao requested a review from rkooo567 October 17, 2023 08:30
@jjyao jjyao changed the title [Core] Introduce Accelerator interface [Core] Introduce AcceleratorManager interface Oct 17, 2023
@jjyao jjyao merged commit 16da484 into ray-project:master Oct 17, 2023
@jjyao jjyao deleted the jjyao/accelerator branch October 17, 2023 15:25

Returns:
The resource name: e.g., the resource name for Nvidia GPUs is "GPU"
"""
Contributor

I think AcceleratorManager should also provide a get_resource_type() interface: get_resource_name would return the detailed resource name (such as NvidiaGPU or IntelGPU), while get_resource_type would return the resource type (e.g., both NvidiaGPU and IntelGPU are of type GPU).

Collaborator Author

get_resource_name returns the Ray resource name so it should be "GPU". We can add a get_accelerator_family that returns NvidiaGPU or IntelGPU in the future if needed.
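The distinction in code form; `get_accelerator_family` is the hypothetical future method described here, not part of this PR:

```python
class NvidiaGPUManagerSketch:
    """Illustrative only: shows resource name vs. accelerator family."""

    @staticmethod
    def get_resource_name() -> str:
        # The Ray resource name users schedule against
        # (num_gpus / resources={"GPU": ...}).
        return "GPU"

    @staticmethod
    def get_accelerator_family() -> str:
        # Hypothetical future method for distinguishing vendors that
        # share a resource name; not part of this PR.
        return "NvidiaGPU"
```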

from ray._private.accelerators.neuron import NeuronAcceleratorManager


def get_all_accelerator_managers() -> Set[AcceleratorManager]:
Contributor

Can we also support getting accelerator managers from env vars? Some users may need to add accelerator manager modules that are not maintained in the Ray repo.
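One way such an extension point could look, as a hypothetical sketch (the env var name and the `module:ClassName` format are invented for illustration; this is not a Ray feature):

```python
import importlib
import os
from typing import List


def get_external_accelerator_managers() -> List[type]:
    """Load manager classes named in a hypothetical env var, given as
    a comma-separated list of "module:ClassName" entries, e.g.
    RAY_EXTERNAL_ACCELERATOR_MANAGERS="my_pkg.mgr:MyAcceleratorManager".
    """
    managers = []
    spec = os.environ.get("RAY_EXTERNAL_ACCELERATOR_MANAGERS", "")
    for entry in filter(None, spec.split(",")):
        module_name, _, class_name = entry.partition(":")
        # Import the user's module and pull the named class off it.
        module = importlib.import_module(module_name)
        managers.append(getattr(module, class_name))
    return managers
```

The loaded classes could then be appended to the built-in set of managers at startup.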

fishbone pushed a commit that referenced this pull request Mar 5, 2024
…set (#43714)

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

#40286 accidentally changed ray.get_gpu_ids() to always return a list of ints, while it should return a list of strs when CUDA_VISIBLE_DEVICES is set before starting Ray.

This PR reverts back to the original behavior.
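A sketch of the restored contract (function name and signature are illustrative, not Ray's internal code):

```python
from typing import List, Union


def get_gpu_ids_sketch(
    assigned_indices: List[int], user_cuda_visible_devices: str
) -> List[Union[int, str]]:
    """Sketch of the restored ray.get_gpu_ids() behavior.

    When Ray assigns GPUs itself, ids come back as ints; when
    CUDA_VISIBLE_DEVICES was set by the user before starting Ray, the
    original string ids are returned, since they can be non-numeric
    (e.g. GPU UUIDs or MIG device names).
    """
    if user_cuda_visible_devices:
        visible = user_cuda_visible_devices.split(",")
        # Map Ray's logical indices back to the user's string ids.
        return [visible[i] for i in assigned_indices]
    return assigned_indices
```

Coercing everything to int (as #40286 briefly did) would break the UUID/MIG case, which is why the string form matters.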

6 participants