Skip to content

[Data] - Optimize memory usage for One Hot Encoder#56565

Merged
alexeykudinkin merged 4 commits intoray-project:masterfrom
goutamvenkat-anyscale:goutam/less_mem_ohe
Sep 16, 2025
Merged

[Data] - Optimize memory usage for One Hot Encoder#56565
alexeykudinkin merged 4 commits intoray-project:masterfrom
goutamvenkat-anyscale:goutam/less_mem_ohe

Conversation

@goutamvenkat-anyscale
Copy link
Contributor

Why are these changes needed?

Previously, the vector that was holding the values from OneHotEncoder was of type int64. We can reduce this to uint8, which should result in 8x lower memory usage

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Goutam V. <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner September 16, 2025 00:47
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to optimize memory usage in OneHotEncoder by changing the dtype of the one-hot encoded array from int to np.uint8. While this is a valuable optimization, the current implementation introduces a critical bug. Casting the category indices to np.uint8 can lead to silent data corruption if the number of categories exceeds 256, as the indices will wrap around. I've provided a specific comment with a suggested fix to dynamically select the correct integer type based on the number of categories, ensuring both memory efficiency and correctness.

one_hot[np.nonzero(valid_rows)[0], codes[valid_rows].astype(int)] = 1
# Dimension should be (num_rows, ) - 1D boolean array
non_zero_indices = np.nonzero(valid_rows)[0]
one_hot[non_zero_indices, codes[valid_rows].astype(np.uint8)] = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Using np.uint8 for the index type can lead to silent data corruption if num_categories is greater than 256. The category codes will wrap around (e.g., 256 becomes 0), causing incorrect one-hot encoding.

To fix this, you should dynamically select the smallest integer dtype that can accommodate num_categories - 1 using np.min_scalar_type.

Suggested change
one_hot[non_zero_indices, codes[valid_rows].astype(np.uint8)] = 1
one_hot[non_zero_indices, codes[valid_rows].astype(np.min_scalar_type(num_categories - 1))] = 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@goutamvenkat-anyscale it's actually right -- indexes can't be casted to uint8

Signed-off-by: Goutam V. <goutam@anyscale.com>
@ray-gardener ray-gardener bot added the data Ray Data-related issues label Sep 16, 2025
Signed-off-by: Goutam V. <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale added the go add ONLY when ready to merge, run all tests label Sep 16, 2025
Signed-off-by: Goutam V. <goutam@anyscale.com>
@alexeykudinkin alexeykudinkin merged commit 0c62bdb into ray-project:master Sep 16, 2025
5 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the goutam/less_mem_ohe branch September 18, 2025 01:28
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Previously, the vector that was holding the values from OneHotEncoder
was of type `int64`. We can reduce this to `uint8`, which should result
in 8x lower memory usage

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Goutam V. <goutam@anyscale.com>
Signed-off-by: zac <zac@anyscale.com>
marcostephan pushed a commit to marcostephan/ray that referenced this pull request Sep 24, 2025
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Previously, the vector that was holding the values from OneHotEncoder
was of type `int64`. We can reduce this to `uint8`, which should result
in 8x lower memory usage

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Goutam V. <goutam@anyscale.com>
Signed-off-by: Marco Stephan <marco@magic.dev>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Previously, the vector that was holding the values from OneHotEncoder
was of type `int64`. We can reduce this to `uint8`, which should result
in 8x lower memory usage

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Goutam V. <goutam@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Previously, the vector that was holding the values from OneHotEncoder
was of type `int64`. We can reduce this to `uint8`, which should result
in 8x lower memory usage

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Goutam V. <goutam@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Previously, the vector that was holding the values from OneHotEncoder
was of type `int64`. We can reduce this to `uint8`, which should result
in 8x lower memory usage

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Goutam V. <goutam@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants