[Data] - Optimize memory usage for One Hot Encoder by goutamvenkat-anyscale · Pull Request #56565 · ray-project/ray

goutamvenkat-anyscale · 2025-09-16T00:47:18Z

Why are these changes needed?

Previously, the array holding the one-hot values produced by OneHotEncoder was of type int64 (8 bytes per element). We can reduce this to uint8 (1 byte per element), which should result in 8x lower memory usage for that array.
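As a rough illustration of the 8x figure, the sketch below compares the byte size of two zero-filled arrays that differ only in dtype. The shape is hypothetical and not taken from the PR; it only shows the per-element size ratio between int64 and uint8.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
num_rows, num_categories = 1_000, 50

# Same shape, different element width: 8 bytes vs. 1 byte per element.
one_hot_int64 = np.zeros((num_rows, num_categories), dtype=np.int64)
one_hot_uint8 = np.zeros((num_rows, num_categories), dtype=np.uint8)

ratio = one_hot_int64.nbytes // one_hot_uint8.nbytes
print(one_hot_int64.nbytes, one_hot_uint8.nbytes, ratio)
```

The saving applies to the buffer that stores the 0/1 values; it says nothing about the dtype of the arrays used to index into that buffer.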

Related issue number

Checks

- [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- [x] I've run scripts/format.sh to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [x] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

Signed-off-by: Goutam V. <goutam@anyscale.com>

gemini-code-assist

Code Review

This pull request aims to optimize memory usage in OneHotEncoder by changing the dtype of the one-hot encoded array from int to np.uint8. While this is a valuable optimization, the current implementation introduces a critical bug. Casting the category indices to np.uint8 can lead to silent data corruption if the number of categories exceeds 256, as the indices will wrap around. I've provided a specific comment with a suggested fix to dynamically select the correct integer type based on the number of categories, ensuring both memory efficiency and correctness.

gemini-code-assist · 2025-09-16T00:48:33Z

python/ray/data/preprocessors/encoder.py

-            one_hot[np.nonzero(valid_rows)[0], codes[valid_rows].astype(int)] = 1
+            # Dimension should be (num_rows, ) - 1D boolean array
+            non_zero_indices = np.nonzero(valid_rows)[0]
+            one_hot[non_zero_indices, codes[valid_rows].astype(np.uint8)] = 1


Using np.uint8 for the index type can lead to silent data corruption if num_categories is greater than 256. The category codes will wrap around (e.g., 256 becomes 0), causing incorrect one-hot encoding.

To fix this, you should dynamically select the smallest integer dtype that can accommodate num_categories - 1 using np.min_scalar_type.

Suggested change

-            one_hot[non_zero_indices, codes[valid_rows].astype(np.uint8)] = 1
+            one_hot[non_zero_indices, codes[valid_rows].astype(np.min_scalar_type(num_categories - 1))] = 1
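The wrap-around described in this review comment can be reproduced in isolation. The sketch below uses a hypothetical array of category codes (not data from the PR) to show how `astype(np.uint8)` silently wraps values above 255, and which dtype `np.min_scalar_type` picks for a hypothetical category count of 300.

```python
import numpy as np

# Hypothetical category indices; 256 and 300 do not fit in uint8.
codes = np.array([0, 255, 256, 300])

# astype performs an unchecked cast, so out-of-range values wrap modulo 256.
wrapped = codes.astype(np.uint8)
print(wrapped.tolist())

# Smallest dtype able to hold the largest valid index, num_categories - 1.
num_categories = 300  # assumed value for illustration
index_dtype = np.min_scalar_type(num_categories - 1)
safe = codes.astype(index_dtype)
print(index_dtype, safe.tolist())
```

With wrapped indices, rows whose code is 256 would set column 0 instead of column 256, which is the silent corruption the comment warns about.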

@goutamvenkat-anyscale it's actually right -- indexes can't be cast to uint8
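One way to read this exchange is that the storage dtype and the index dtype are separate concerns: the one-hot buffer can be uint8 while the column indices keep a wide integer type. The sketch below is a minimal standalone illustration of that idea, not the merged Ray code. The function name `one_hot_encode`, the `-1` sentinel for unseen values, and the use of `np.intp` are all assumptions made for the example.

```python
import numpy as np

def one_hot_encode(codes, num_categories):
    """Hypothetical helper: uint8 storage, wide integer indices."""
    codes = np.asarray(codes)
    # The 0/1 values fit in one byte each, so the buffer can be uint8.
    one_hot = np.zeros((len(codes), num_categories), dtype=np.uint8)
    # Assumption: negative codes mark values with no known category.
    valid_rows = codes >= 0
    non_zero_indices = np.nonzero(valid_rows)[0]
    # Column indices stay in a wide integer type so they cannot wrap.
    one_hot[non_zero_indices, codes[valid_rows].astype(np.intp)] = 1
    return one_hot

encoded = one_hot_encode([0, 299, -1, 256], num_categories=300)
print(encoded.dtype, encoded.shape, int(encoded.sum()))
```

Here the row with code 256 sets column 256, and the row with the sentinel stays all zeros.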


## Why are these changes needed?

Previously, the vector that was holding the values from OneHotEncoder was of type `int64`. We can reduce this to `uint8`, which should result in 8x lower memory usage

## Related issue number

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [x] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

---------

Signed-off-by: Goutam V. <goutam@anyscale.com>
Signed-off-by: zac <zac@anyscale.com>


[Data] - Optimize memory usage for One Hot Encoder

75b1ffb

Signed-off-by: Goutam V. <goutam@anyscale.com>

goutamvenkat-anyscale requested a review from a team as a code owner September 16, 2025 00:47

gemini-code-assist bot reviewed Sep 16, 2025


Add clarifying comments

8128215

Signed-off-by: Goutam V. <goutam@anyscale.com>

ray-gardener bot added the "data" label (Ray Data-related issues) Sep 16, 2025

Gemini...

ac31762

Signed-off-by: Goutam V. <goutam@anyscale.com>

goutamvenkat-anyscale added the "go" label (add ONLY when ready to merge, run all tests) Sep 16, 2025

alexeykudinkin approved these changes Sep 16, 2025


Remove cast

6190c72

Signed-off-by: Goutam V. <goutam@anyscale.com>

alexeykudinkin merged commit 0c62bdb into ray-project:master Sep 16, 2025
5 checks passed

goutamvenkat-anyscale deleted the goutam/less_mem_ohe branch September 18, 2025 01:28
