[train] Fix ScalingConfig(accelerator_type) to request a small fraction of the accelerator label #44225

Merged
justinvyu merged 2 commits into ray-project:master from justinvyu:fix_accelerator_type_amt2
Mar 22, 2024

@justinvyu
Contributor

Why are these changes needed?

accelerator_type is currently implemented as a custom resource with a quantity of 1 if an instance has an accelerator of that type. For example, both a machine with 1 A10G GPU and a machine with 4 A10G GPUs will have {"accelerator_type:A10G": 1.0}. This label is just an indicator of whether the machine contains the accelerator, rather than a count of the number of accelerators of that type.

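This PR makes Ray Train's accelerator type resource request match Ray Core by setting it to a fractional value (0.001). With a request of 1, each worker would consume a node's entire label, so a node with several GPUs could host only one worker. The fractional request is needed so that autoscaling requests the correct number of GPUs.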
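To make the effect concrete, here is a minimal sketch in plain Python (it does not use Ray's scheduler). The helper `max_workers_on_node` is hypothetical and only does the bin-packing arithmetic for a single node, assuming the label format shown above.

```python
import math


def max_workers_on_node(node_resources, per_worker_request):
    """Return how many workers fit on one node for the given per-worker request."""
    return min(
        math.floor(node_resources.get(name, 0.0) / amount)
        for name, amount in per_worker_request.items()
        if amount > 0
    )


# A node with 4 A10G GPUs still advertises the label with a quantity of 1.0.
node = {"GPU": 4.0, "accelerator_type:A10G": 1.0}

# Before this PR: each worker asks for a full unit of the label.
old_request = {"GPU": 1, "accelerator_type:A10G": 1}
# After this PR: each worker asks for a small fraction of the label.
new_request = {"GPU": 1, "accelerator_type:A10G": 0.001}

workers_old = max_workers_on_node(node, old_request)
workers_new = max_workers_on_node(node, new_request)
print(workers_old, workers_new)
```

Under the old request the label is the bottleneck, so only one worker fits on the 4-GPU node. Under the new request the GPU count is the limiting resource again.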

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
  if self.accelerator_type:
      accelerator = f"{RESOURCE_CONSTRAINT_PREFIX}{self.accelerator_type}"
-     resources_per_worker.setdefault(accelerator, 1)
+     resources_per_worker.setdefault(accelerator, 0.001)
Contributor

Can we make this a constant (or use an existing one if it already exists)?

Member

    if accelerator_type is not None:
        resources[
            f"{ray_constants.RESOURCE_CONSTRAINT_PREFIX}{accelerator_type}"
        ] = 0.001

It seems the core team uses 0.001 directly here.

Contributor

@jjyao cool if we extract this into a constant? Gives it some concrete meaning 🙂
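One possible shape for the suggested constant, as a standalone sketch rather than the actual Ray change. The constant name `ACCELERATOR_TYPE_LABEL_AMOUNT` and the helper `with_accelerator_constraint` are hypothetical, and the prefix value is assumed from the `accelerator_type:A10G` label in the PR description.

```python
# Assumed value, taken from the "accelerator_type:A10G" label format above.
RESOURCE_CONSTRAINT_PREFIX = "accelerator_type:"

# Hypothetical constant name. The label only indicates that a node has the
# accelerator, so each worker requests a small fraction of it, not a full unit.
ACCELERATOR_TYPE_LABEL_AMOUNT = 0.001


def with_accelerator_constraint(resources_per_worker, accelerator_type):
    """Return a copy of the per-worker resources with the accelerator label added."""
    resources = dict(resources_per_worker)
    if accelerator_type:
        key = f"{RESOURCE_CONSTRAINT_PREFIX}{accelerator_type}"
        # setdefault keeps any amount the user set explicitly for this label.
        resources.setdefault(key, ACCELERATOR_TYPE_LABEL_AMOUNT)
    return resources


print(with_accelerator_constraint({"GPU": 1}, "A10G"))
```

Naming the value keeps the Train and Core call sites in sync if the fraction ever changes, and the comment records why it is not 1.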

Member

woshiyyya left a comment

Thanks for the fix!

@justinvyu justinvyu merged commit 5923cb9 into ray-project:master Mar 22, 2024
@justinvyu justinvyu deleted the fix_accelerator_type_amt2 branch March 22, 2024 17:37
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Mar 27, 2024
…tion of the accelerator label (ray-project#44225)

Make Ray Train's accelerator type resource request match Ray Core by setting it to a fractional value (0.001). This is needed to fix autoscaling behavior to request the correct number of GPUs.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>