[LANG-1772] Restrict size of cache to prevent overflow errors #1379

Merged

jcwinters
Contributor

@jcwinters jcwinters commented May 8, 2025

https://issues.apache.org/jira/browse/LANG-1772

Added a length restriction to RandomStringUtils, limiting the cache to 60M entries. Because of rejections, the bitIndex in the underlying cache can overflow when right shifting. Also added a test to verify the fix.

This test takes quite a while to run, so if necessary I can create a profile for slow tests to exclude the test from the normal build.

@ppkarwasz
Contributor

Added a length restriction to RandomStringUtils, limiting the cache to 60M entries. Because of rejections, the bitIndex in the underlying cache can overflow when right shifting. Also added a test to verify the fix.

The problem is caused by an integer overflow of bitIndex in:

result |= cache[bitIndex >> 3] >> (bitIndex & 0x7) & (1 << generatedBitsInIteration) - 1;
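To make the failure mode concrete, here is a minimal sketch of the arithmetic only. It does not allocate a cache, and the class and method names are hypothetical rather than taken from the PR. It assumes bitIndex is an int that keeps growing as bits are consumed; the single increment below stands in for whatever step size the real code uses.

```java
final class BitIndexOverflowDemo {
    /** Byte index derived the same way as in the quoted expression: bitIndex >> 3. */
    static int byteIndex(int bitIndex) {
        return bitIndex >> 3;
    }

    public static void main(String[] args) {
        int bitIndex = Integer.MAX_VALUE;          // largest bit position an int can hold
        System.out.println(byteIndex(bitIndex));   // 268435455

        bitIndex++;                                // silently wraps to Integer.MIN_VALUE
        System.out.println(bitIndex);              // -2147483648
        System.out.println(byteIndex(bitIndex));   // -268435456, not a valid array index
    }
}
```

Once the index has wrapped, a guard of the form `bitIndex >> 3 >= cache.length` no longer fires, because the shifted value is negative, so the array access is what fails.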

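A simpler solution would be to:

  • Change the type of bitIndex to long, so it can count up to 8 * Integer.MAX_VALUE
  • Refactor the expression that computes the cacheSize argument in a way that it does not overflow: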

@garydgregory garydgregory changed the title LANG-1772 restrict size of cache to prevent overflow errors [LANG-1772] Restrict size of cache to prevent overflow errors May 8, 2025
@garydgregory
Member

The test as is blows up GitHub builds so let's use something like @EnabledIfSystemProperty(named = "test.large.heap", matches = "true")
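For readers unfamiliar with the annotation: JUnit 5's `@EnabledIfSystemProperty` runs a test only when the named JVM system property is defined and its value matches the given regular expression. The plain-Java sketch below approximates that condition without depending on JUnit. It is not JUnit's implementation, and the class and method names are hypothetical.

```java
final class LargeHeapGate {
    /** True only if the property is defined and its value matches the regex. */
    static boolean isEnabled(String named, String matches) {
        String value = System.getProperty(named);
        return value != null && value.matches(matches);
    }

    public static void main(String[] args) {
        // Run with: java -Dtest.large.heap=true LargeHeapGate
        System.out.println(isEnabled("test.large.heap", "true"));
    }
}
```

With a gate like this, the slow test is skipped in ordinary builds and runs only when the property is passed explicitly.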

@jcwinters
Contributor Author

A simpler solution would be to:

  • Change the type of bitIndex to long, so it can count up to 8 * Integer.MAX_VALUE
  • Refactor the expression that computes the cacheSize argument in a way that it does not overflow:

I considered that. Changing bitIndex to long adds some casting complexity I didn't want to deal with, as arrays are int-indexed. I also didn't particularly want to change the randomization algorithm: to get rid of overflow possibilities I'd have to switch to something that doesn't have rejections, and it made my head hurt.
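For reference, the casting involved in the long-index alternative can be sketched as follows. This is an illustration of the idea under discussion, not code from either contributor. The class and method names are hypothetical, and it reads a single bit rather than reproducing the real nextBits logic.

```java
final class LongBitIndexSketch {
    /** Reads one bit from the cache at an absolute bit position held in a long. */
    static int bitAt(byte[] cache, long bitIndex) {
        // The narrowing cast is safe as long as bitIndex < 8L * cache.length.
        int byteIndex = (int) (bitIndex >>> 3);
        int bitInByte = (int) (bitIndex & 0x7);
        return (cache[byteIndex] >> bitInByte) & 1;
    }

    public static void main(String[] args) {
        byte[] cache = {0b0000_0101, (byte) 0b1000_0000};
        System.out.println(bitAt(cache, 0L));   // 1
        System.out.println(bitAt(cache, 1L));   // 0
        System.out.println(bitAt(cache, 15L));  // 1

        // A long counter can address every bit of a maximally sized int-indexed array.
        System.out.println(8L * Integer.MAX_VALUE);  // 17179869176
    }
}
```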

@garydgregory
Member

Hi @ppkarwasz

You've proposed an alternative solution. Would you show it in a PR?

@ppkarwasz
Contributor

You've proposed an alternative solution. Would you show it in a PR?

I'll submit a PR by the end of the week.

…nside the CachedRandomBits constructor - also checking if the padding produces overflow. No longer using an arbitrary value but being more precise.
Member

@garydgregory garydgregory left a comment

@jcwinters
Thank you for your update.
I think the test should be more of a white box test and test just below and above the overflow. WDYT? @ppkarwasz ?

Contributor

@ppkarwasz ppkarwasz left a comment

This looks OK to me.

I don’t think we need to preemptively generate more than 256 MiB of random data. The goal of generating data in bulk is to take advantage of the fact that random number generators are typically more efficient when producing large chunks of data rather than individual bytes. However, I suspect the optimal amount is significantly less than 256 MiB—we should run some benchmarks to determine the best value.

@ppkarwasz ppkarwasz requested a review from garydgregory May 13, 2025 14:01
@garydgregory
Member

@jcwinters
Please run 'mvn' by itself before you push to catch all build errors.

@jcwinters
Contributor Author

@jcwinters Please run 'mvn' by itself before you push to catch all build errors.
@garydgregory Sorry about that, I'm so used to my workflow with pre_commit hooks running everything, I didn't even look. Thanks for the patience with the newbie, and I promise to do better 😄

* The maximum size of the cache.
*
* <p>
* This is dictated by the {@code if (bitIndex >> 3 >= cache.length)} in the {@link #nextBits(int)} method.
Member

@garydgregory garydgregory May 14, 2025

If the 3 in this expression MUST match the 3 in the expression building cacheSize in the random(...) method, then it should be refactored from a magic number into a constant IMO.

This makes me wonder about the other magic numbers 5 and 10 which beg for documentation if only to help with maintenance.

WDYT?
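As an illustration of the refactoring being asked for, the sketch below names the shift and mask used in the bit-extraction expression quoted earlier in the thread. The constant, class, and method names are hypothetical and are not the ones adopted in the merged change. The 5 and 10 are left out because the thread does not say what they mean.

```java
final class NamedShiftSketch {
    /** log2(8): shifting a bit index right by this many places yields a byte index. */
    static final int BITS_TO_BYTES_SHIFT = 3;

    /** Mask selecting the position of a bit within its byte (0 to 7). */
    static final int BIT_IN_BYTE_MASK = (1 << BITS_TO_BYTES_SHIFT) - 1;

    /** Extracts {@code bits} bits starting at {@code bitIndex}; must not cross a byte boundary. */
    static int extract(byte[] cache, int bitIndex, int bits) {
        return (cache[bitIndex >> BITS_TO_BYTES_SHIFT] >> (bitIndex & BIT_IN_BYTE_MASK))
                & ((1 << bits) - 1);
    }

    /** True once the bit index points past the last cached byte. */
    static boolean exhausted(byte[] cache, int bitIndex) {
        return (bitIndex >> BITS_TO_BYTES_SHIFT) >= cache.length;
    }
}
```

With one shared constant, the exhaustion check and the extraction expression cannot drift apart if one of them is later edited.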

@garydgregory
Member

Hi all,

Where are we on this one? @ppkarwasz do you still plan on providing an alternative solution?

@ppkarwasz
Contributor

Where are we on this one? @ppkarwasz do you still plan on providing an alternative solution?

This pull request appears to be nearly ready. It includes the following changes:

  • Sets the maximum size of the internal CachedRandomBits class to Integer.MAX_VALUE / 8.
  • Limits the maximum requested size for CachedRandomBits to approximately Integer.MAX_VALUE / 5.
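The first bullet can be sketched as a clamp on the requested size. This is a minimal illustration assuming the cap is exactly Integer.MAX_VALUE / 8 bytes, as stated above. The class, constant, and method names are hypothetical, and the merged code may structure this differently.

```java
final class CacheSizeClampSketch {
    /** Largest byte count whose bit positions all fit in a non-negative int. */
    static final int MAX_CACHE_SIZE = Integer.MAX_VALUE / 8;

    /** Returns the requested size, capped so that 8 * size cannot overflow an int. */
    static int clampedCacheSize(long requestedBytes) {
        if (requestedBytes <= 0) {
            throw new IllegalArgumentException("cacheSize must be positive");
        }
        return (int) Math.min(requestedBytes, MAX_CACHE_SIZE);
    }

    public static void main(String[] args) {
        System.out.println(clampedCacheSize(1_024L));           // 1024
        System.out.println(clampedCacheSize(Long.MAX_VALUE));   // 268435455
        System.out.println(8L * MAX_CACHE_SIZE);                // 2147483640
    }
}
```

Taking the request as a long lets the caller compute it without overflow before the clamp is applied.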

I could switch the bitIndex field to long, which would allow increasing the maximum size of CachedRandomBits up to Integer.MAX_VALUE. However, it's unclear whether this would provide any practical benefit. A benchmark would likely show that caching such a large number of random bits offers no significant performance advantage.

…ng the result to MAX_INT/5, also restricting max cache length to MAX_INT/3, there are now no opportunities for overflow. The test checks at the boundary condition
@jcwinters
Contributor Author

I believe I've now incorporated most of the suggestions (I didn't use constants for the divide by 3 for instance) - sorry for the length of time this has taken

@garydgregory
Member

Hi @jcwinters,

No need to apologize, we're all busy 😃

@garydgregory
Member

Hi @jcwinters
I still don't understand what the magic numbers mean, so constants with comments or a better code comment are needed IMO.

…ve documentation around the nextBits method and the size allocation for the cache
@jcwinters
Contributor Author

I think I'm there now

You're right, should be outside the min

Co-authored-by: Piotr P. Karwasz
Contributor

@ppkarwasz ppkarwasz left a comment

Thank you for diligently addressing all of my feedback — I have no further suggestions.

As far as I’m concerned, the only remaining task is to add a changelog entry to src/changes/changes.xml.

@garydgregory, have your concerns been addressed as well? If so, I believe we're ready to merge this PR.

@garydgregory garydgregory merged commit c2260f0 into apache:master May 24, 2025
16 of 19 checks passed
@jcwinters jcwinters deleted the LANG-1772-fix-huge-string-randomization branch May 24, 2025 21:58
4 participants