
Add support for Intel Gaudi Backend #40561

Merged: jjyao merged 17 commits into ray-project:master from jerome-habana:ray_hpu2 on Oct 31, 2023

Conversation

@jerome-habana (Contributor)

Added support for the Intel Gaudi backend, based on the new interfaces defined in #40286.

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jerome <janand@habana.ai>
@jjyao (Collaborator) left a comment:

Lg

Signed-off-by: Jerome <janand@habana.ai>
@jerome-habana requested a review from jjyao on October 25, 2023 at 05:26
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
@jjyao self-assigned this on Oct 25, 2023
Signed-off-by: Jerome <janand@habana.ai>
Signed-off-by: Jerome <janand@habana.ai>
@jerome-habana requested a review from jjyao on October 26, 2023 at 03:58
@jjyao (Collaborator) left a comment:
Have you tested this on the machine with Gaudi?

Signed-off-by: Jerome <janand@habana.ai>
@jerome-habana requested a review from jjyao on October 27, 2023 at 07:37
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
@jjyao (Collaborator) left a comment:

Last comment

@jjyao (Collaborator) commented on Oct 27, 2023:

Lint failure:

```
Fri Oct 27 12:06:51 UTC 2023 Flake8....
python/ray/_private/utils.py:338:89: E501 line too long (108 > 88 characters)
python/ray/tests/accelerators/test_hpu.py:3:1: F401 'subprocess' imported but unused
python/ray/tests/accelerators/test_hpu.py:110:74: E711 comparison to None should be 'if cond is None:'
```

jerome-habana and others added 2 commits October 30, 2023 08:48
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
Signed-off-by: Jerome <janand@habana.ai>
@jjyao (Collaborator) commented on Oct 30, 2023:

Lint failure:

```diff
 def test_get_current_process_visible_accelerator_ids():
     os.environ[hpu.HABANA_VISIBLE_DEVICES_ENV_VAR] = "0,1,2"
-    assert HPUAcceleratorManager.get_current_process_visible_accelerator_ids() == ["0", "1", "2"]  # noqa: E501
+    assert HPUAcceleratorManager.get_current_process_visible_accelerator_ids() == [
+        "0",
+        "1",
+        "2",
+    ]  # noqa: E501
```

jerome-habana and others added 3 commits October 30, 2023 07:15
Signed-off-by: Jerome <janand@habana.ai>
Signed-off-by: Jerome <janand@habana.ai>
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
jerome-habana and others added 2 commits October 31, 2023 07:42
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
* Add Intel gaudi to accelerator list
* Add check for backend initialization with updated test

Signed-off-by: Jerome <janand@habana.ai>
@jjyao (Collaborator) commented on Oct 31, 2023:

Lint failure:

```diff
     if HPUAcceleratorManager.is_initialized():
-        assert "Intel-GAUDI" in HPUAcceleratorManager.get_current_node_accelerator_type()
+        assert (
+            "Intel-GAUDI" in HPUAcceleratorManager.get_current_node_accelerator_type()
+        )
     else:
         assert HPUAcceleratorManager.get_current_node_accelerator_type() is None
```

Signed-off-by: Jerome <janand@habana.ai>
@jerome-habana (Contributor, Author), quoting the lint failure:

```diff
     if HPUAcceleratorManager.is_initialized():
-        assert "Intel-GAUDI" in HPUAcceleratorManager.get_current_node_accelerator_type()
+        assert (
+            "Intel-GAUDI" in HPUAcceleratorManager.get_current_node_accelerator_type()
+        )
     else:
         assert HPUAcceleratorManager.get_current_node_accelerator_type() is None
```
Might be nice to have an auto-corrector.
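On the auto-corrector wish: tools such as autopep8 (with `--aggressive`) and ruff (`--fix`) can rewrite E711 violations automatically. As a toy sketch of the idea only (real fixers work on the parsed code, not regexes, and the `fix_e711` helper and `manager` name below are hypothetical):

```python
import re


def fix_e711(source: str) -> str:
    """Toy E711 auto-corrector: rewrite '== None' / '!= None' into
    identity checks. Illustration only; real tools parse the AST."""
    source = re.sub(r"==\s*None\b", "is None", source)
    source = re.sub(r"!=\s*None\b", "is not None", source)
    return source


before = "if manager.get_accelerator_type() == None:\n    handle_missing()\n"
after = fix_e711(before)
print(after)
```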

```python
NVIDIA_TESLA_A10G = "A10G"
INTEL_MAX_1550 = "Intel-GPU-Max-1550"
INTEL_MAX_1100 = "Intel-GPU-Max-1100"
INTEL_GAUDI = "Intel-GAUDI"
```
@jjyao (Collaborator) commented on this diff:
Can you also add INTEL_GAUDI2 here?

@jerome-habana (Contributor, Author):
Sure. I've kept it generic for now. Let's update once the right instance usage is settled?

Another contributor:
Hi all, I'm working on LLM serving on Gaudi 2. Is Gaudi 2 not supported yet?
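For reference, a hypothetical version of the constants block quoted above with the requested Gaudi 2 entry might look as follows; the `INTEL_GAUDI2` name and its string value are assumptions, since the merged PR deliberately kept the single generic `Intel-GAUDI` label:

```python
# Accelerator-type constants as quoted from the PR diff, plus a
# hypothetical Gaudi 2 entry (NOT part of the merged PR).
NVIDIA_TESLA_A10G = "A10G"
INTEL_MAX_1550 = "Intel-GPU-Max-1550"
INTEL_MAX_1100 = "Intel-GPU-Max-1100"
INTEL_GAUDI = "Intel-GAUDI"
INTEL_GAUDI2 = "Intel-GAUDI2"  # hypothetical value, not from the source

print(INTEL_GAUDI, INTEL_GAUDI2)
```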

Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
@jjyao merged commit 04a8aa3 into ray-project:master on Oct 31, 2023
@jerome-habana mentioned this pull request on May 15, 2024