[Data] - Optimize memory usage for One Hot Encoder#56565
[Data] - Optimize memory usage for One Hot Encoder#56565alexeykudinkin merged 4 commits intoray-project:masterfrom
Conversation
Signed-off-by: Goutam V. <goutam@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request aims to optimize memory usage in OneHotEncoder by changing the dtype of the one-hot encoded array from int to np.uint8. While this is a valuable optimization, the current implementation introduces a critical bug. Casting the category indices to np.uint8 can lead to silent data corruption if the number of categories exceeds 256, as the indices will wrap around. I've provided a specific comment with a suggested fix to dynamically select the correct integer type based on the number of categories, ensuring both memory efficiency and correctness.
| one_hot[np.nonzero(valid_rows)[0], codes[valid_rows].astype(int)] = 1 | ||
| # Dimension should be (num_rows, ) - 1D boolean array | ||
| non_zero_indices = np.nonzero(valid_rows)[0] | ||
| one_hot[non_zero_indices, codes[valid_rows].astype(np.uint8)] = 1 |
There was a problem hiding this comment.
Using np.uint8 for the index type can lead to silent data corruption if num_categories is greater than 256. The category codes will wrap around (e.g., 256 becomes 0), causing incorrect one-hot encoding.
To fix this, you should dynamically select the smallest integer dtype that can accommodate num_categories - 1 using np.min_scalar_type.
| one_hot[non_zero_indices, codes[valid_rows].astype(np.uint8)] = 1 | |
| one_hot[non_zero_indices, codes[valid_rows].astype(np.min_scalar_type(num_categories - 1))] = 1 |
There was a problem hiding this comment.
@goutamvenkat-anyscale it's actually right -- indexes can't be casted to uint8
Signed-off-by: Goutam V. <goutam@anyscale.com>
Signed-off-by: Goutam V. <goutam@anyscale.com>
<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? Previously, the vector that was holding the values from OneHotEncoder was of type `int64`. We can reduce this to `uint8`, which should result in 8x lower memory usage ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Goutam V. <goutam@anyscale.com> Signed-off-by: zac <zac@anyscale.com>
<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? Previously, the vector that was holding the values from OneHotEncoder was of type `int64`. We can reduce this to `uint8`, which should result in 8x lower memory usage ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Goutam V. <goutam@anyscale.com> Signed-off-by: Marco Stephan <marco@magic.dev>
<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? Previously, the vector that was holding the values from OneHotEncoder was of type `int64`. We can reduce this to `uint8`, which should result in 8x lower memory usage ## Related issue number <!-- For example: "Closes #1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Goutam V. <goutam@anyscale.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? Previously, the vector that was holding the values from OneHotEncoder was of type `int64`. We can reduce this to `uint8`, which should result in 8x lower memory usage ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Goutam V. <goutam@anyscale.com>
<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? Previously, the vector that was holding the values from OneHotEncoder was of type `int64`. We can reduce this to `uint8`, which should result in 8x lower memory usage ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Goutam V. <goutam@anyscale.com>
Why are these changes needed?
Previously, the vector that was holding the values from OneHotEncoder was of type
int64. We can reduce this touint8, which should result in 8x lower memory usageRelated issue number
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.