
Add support for Intel Gaudi Backend #40561

Merged: jjyao merged 17 commits into ray-project:master from jerome-habana:ray_hpu2 on Oct 31, 2023

Conversation

@jerome-habana (Contributor)

Added support for the Intel Gaudi backend, based on the new interfaces defined in #40286.

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jerome <janand@habana.ai>
@jjyao (Collaborator) left a comment:

Lg

Signed-off-by: Jerome <janand@habana.ai>
@jerome-habana requested a review from jjyao on October 25, 2023 at 05:26
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
@jjyao self-assigned this on Oct 25, 2023
Signed-off-by: Jerome <janand@habana.ai>
Signed-off-by: Jerome <janand@habana.ai>
@jerome-habana requested a review from jjyao on October 26, 2023 at 03:58
@jjyao (Collaborator) left a comment:
Have you tested this on the machine with Gaudi?

Signed-off-by: Jerome <janand@habana.ai>
@jerome-habana requested a review from jjyao on October 27, 2023 at 07:37
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
@jjyao (Collaborator) left a comment:

Last comment

@jjyao (Collaborator) commented on Oct 27, 2023:

Lint failure:

```
Fri Oct 27 12:06:51 UTC 2023 Flake8....
python/ray/_private/utils.py:338:89: E501 line too long (108 > 88 characters)
python/ray/tests/accelerators/test_hpu.py:3:1: F401 'subprocess' imported but unused
python/ray/tests/accelerators/test_hpu.py:110:74: E711 comparison to None should be 'if cond is None:'
```

jerome-habana and others added 2 commits October 30, 2023 08:48
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
Signed-off-by: Jerome <janand@habana.ai>
@jjyao (Collaborator) commented on Oct 30, 2023:

Lint failure:

```diff
 def test_get_current_process_visible_accelerator_ids():
     os.environ[hpu.HABANA_VISIBLE_DEVICES_ENV_VAR] = "0,1,2"
-    assert HPUAcceleratorManager.get_current_process_visible_accelerator_ids() == ["0", "1", "2"]  # noqa: E501
+    assert HPUAcceleratorManager.get_current_process_visible_accelerator_ids() == [
+        "0",
+        "1",
+        "2",
+    ]  # noqa: E501
```

jerome-habana and others added 3 commits October 30, 2023 07:15
Signed-off-by: Jerome <janand@habana.ai>
Signed-off-by: Jerome <janand@habana.ai>
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
jerome-habana and others added 2 commits October 31, 2023 07:42
Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
* Add Intel gaudi to accelerator list
* Add check for backend initialization with updated test

Signed-off-by: Jerome <janand@habana.ai>
@jjyao (Collaborator) commented on Oct 31, 2023:

Lint failure:

```diff
     if HPUAcceleratorManager.is_initialized():
-        assert "Intel-GAUDI" in HPUAcceleratorManager.get_current_node_accelerator_type()
+        assert (
+            "Intel-GAUDI" in HPUAcceleratorManager.get_current_node_accelerator_type()
+        )
     else:
         assert HPUAcceleratorManager.get_current_node_accelerator_type() is None
```

Signed-off-by: Jerome <janand@habana.ai>
@jerome-habana (Contributor, Author), quoting the lint failure:

```diff
     if HPUAcceleratorManager.is_initialized():
-        assert "Intel-GAUDI" in HPUAcceleratorManager.get_current_node_accelerator_type()
+        assert (
+            "Intel-GAUDI" in HPUAcceleratorManager.get_current_node_accelerator_type()
+        )
     else:
         assert HPUAcceleratorManager.get_current_node_accelerator_type() is None
```
Might be nice to have an auto-corrector.
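On the auto-corrector wish: tools such as autopep8 (with `--aggressive`) and ruff (`--fix`) can rewrite E711 violations automatically. As a toy sketch of the idea only (real fixers work on the parsed code, not regexes, and the `fix_e711` helper and `manager` name below are hypothetical):

```python
import re


def fix_e711(source: str) -> str:
    """Toy E711 auto-corrector: rewrite '== None' / '!= None' into
    identity checks. Illustration only; real tools parse the AST."""
    source = re.sub(r"==\s*None\b", "is None", source)
    source = re.sub(r"!=\s*None\b", "is not None", source)
    return source


before = "if manager.get_accelerator_type() == None:\n    handle_missing()\n"
after = fix_e711(before)
print(after)
```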

```python
NVIDIA_TESLA_A10G = "A10G"
INTEL_MAX_1550 = "Intel-GPU-Max-1550"
INTEL_MAX_1100 = "Intel-GPU-Max-1100"
INTEL_GAUDI = "Intel-GAUDI"
```
@jjyao (Collaborator) commented on this diff:
Can you also add INTEL_GAUDI2 here?

@jerome-habana (Contributor, Author):
Sure. I've kept it generic for now. Let's update once the right instance usage is settled?

Another contributor:
Hi all, I'm working on LLM serving on Gaudi 2. Is Gaudi 2 not supported yet?
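For reference, a hypothetical version of the constants block quoted above with the requested Gaudi 2 entry might look as follows; the `INTEL_GAUDI2` name and its string value are assumptions, since the merged PR deliberately kept the single generic `Intel-GAUDI` label:

```python
# Accelerator-type constants as quoted from the PR diff, plus a
# hypothetical Gaudi 2 entry (NOT part of the merged PR).
NVIDIA_TESLA_A10G = "A10G"
INTEL_MAX_1550 = "Intel-GPU-Max-1550"
INTEL_MAX_1100 = "Intel-GPU-Max-1100"
INTEL_GAUDI = "Intel-GAUDI"
INTEL_GAUDI2 = "Intel-GAUDI2"  # hypothetical value, not from the source

print(INTEL_GAUDI, INTEL_GAUDI2)
```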

Signed-off-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com>
@jjyao merged commit 04a8aa3 into ray-project:master on Oct 31, 2023
@jerome-habana mentioned this pull request on May 15, 2024