Allow BigSizedJoinIterator#buildPartitioner to produce more subpartitions #12372


Merged
2 commits merged into NVIDIA:branch-25.04 on Mar 27, 2025

Conversation

binmahone
Collaborator

This PR closes #12367 by introducing a new config called spark.rapids.sql.join.sizedJoin.buildPartitionNumberAmplification

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
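
For illustration, a minimal sketch of setting the new config on a Spark session. The config name comes from this PR; the value and the session setup are made-up examples, and since the config is internal it should only be set when hitting the issue described in #12367 and on expert advice.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: "4" is a made-up amplification value.
// The config is internal; set it only when hitting the CudfColumnSizeOverflowException
// described in #12367 and after consulting the RAPIDS team.
val spark = SparkSession.builder()
  .appName("sized-join-amplification-example")
  .config("spark.rapids.sql.join.sizedJoin.buildPartitionNumberAmplification", "4")
  .getOrCreate()
```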
@binmahone
Collaborator Author

build

Collaborator

@firestarman left a comment

LGTM, but it would be better to have more reviews from others.

And in the future, we may choose to repartition with a heuristic that calculates the proper partition number to overcome the skew case mentioned in the linked issue. It would be similar to the repartitioning done in the GPU hash aggregate or the GPU sub hash join.

@binmahone requested review from revans2 and abellina on March 24, 2025 08:36
@sameerz added the feature request (New feature or request) label on Mar 25, 2025
Collaborator

@revans2 left a comment

I really hate this. I get that you might need a fix quickly, but this is not a long-term solution. I like that the config is private so we can remove it in the future, but the only way I am willing to merge this in is if there is a follow-on issue early in 25.06 that would find a way to deal with this case properly.

The heuristic being used currently is looking at the build side of the join to estimate how many output rows there would be for each input row on average, aka the amplification. This is not a perfect solution because it assumes that the distribution of the keys on the stream side matches that of the build side, which is not guaranteed.
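
To make that concrete, here is a rough illustrative sketch of that kind of build-side estimate (not the plugin's actual code; the function name and inputs are hypothetical): the build side's average rows per key is taken as the expected output rows per stream row.

```scala
// Illustrative sketch only, not the plugin's actual heuristic.
// Average build-side rows per join key; assuming the stream side has the same
// key distribution, this approximates the output rows per stream-side row
// (the "amplification"). A skewed stream side breaks that assumption.
def estimateBuildAmplification(buildKeys: Seq[Any]): Double = {
  val totalRows    = buildKeys.size.toDouble
  val distinctKeys = buildKeys.distinct.size.toDouble
  if (distinctKeys == 0) 0.0 else totalRows / distinctKeys
}
```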

Perhaps we can look at using a sketch to estimate the number of output rows for the equality portion of a join, and then assume that any non-equality parts are just going to reduce the number of output rows.

Doing a quick bit of research, it looks like there are a lot of sketches we could look at using.

Most of them look like something we could build on the GPU fairly quickly. Probably at least as fast as the distinct count we are doing today.
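
As a strawman for what a sketch-based estimate could look like (exact frequency maps stand in here for approximate sketches such as Count-Min; all names are hypothetical):

```scala
// Strawman only: exact maps stand in for approximate frequency sketches that
// could be built on the GPU. For an equi-join, key k contributes
// freqBuild(k) * freqStream(k) output rows; non-equality predicates can only
// reduce that, so the sum is an upper bound on the join output size.
def estimateEquiJoinOutputRows(buildKeys: Seq[Any], streamKeys: Seq[Any]): Long = {
  val buildFreq  = buildKeys.groupBy(identity).map { case (k, v) => k -> v.size.toLong }
  val streamFreq = streamKeys.groupBy(identity).map { case (k, v) => k -> v.size.toLong }
  buildFreq.iterator.map { case (k, n) => n * streamFreq.getOrElse(k, 0L) }.sum
}
```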

@binmahone
Collaborator Author

binmahone commented Mar 26, 2025

Talked with Bobby offline; conclusions:

  1. He's okay with it as a workaround for the issue described in [FEA] Allow BigSizedJoinIterator#buildPartitioner to produce more subpartitions to avoid CudfColumnSizeOverflowException #12367. The new config is internal; it is not supposed to be used unless a user encounters the same issue and experts from our side advise them to use it.
  2. In 25.06 we will use a follow-up issue, [FEA] revisit on partitioning of BuildSidePartitioner #12387, to address the concern about "why another config".

@binmahone
Collaborator Author

binmahone commented Mar 26, 2025

This PR is intended to address some corner cases from our customer. I fully understand that this is not a clean solution, but I'm also aware that we don't have a perfect dynamic solution to this issue in the short term (see #12354 (review) for more; we don't have a clear roadmap on this yet). My past experience has shown that when users in production encounter unavoidable bugs, they prefer having some special configuration as an escape route rather than being stuck without options or forced to wait for a new release. That's why I'm introducing a new internal config in case they really need it (after consulting our experts). Once we have finally finished the proper dynamic solution we can of course remove this kind of internal config; normal users will not be aware of it.

Still, if there's strong resistance to the internal config, we can choose not to check in this PR, or we can use the customer's private repo for now. @revans2 @sameerz @GaryShen2008 @winningsix @abellina

Collaborator

@abellina left a comment

Agree with @revans2. For now, if this helps a specific use case we can merge it, but it should be cleaned up in 25.06.

@binmahone merged commit 33bce74 into NVIDIA:branch-25.04 on Mar 27, 2025
55 checks passed
Labels
feature request (New feature or request)

Successfully merging this pull request may close these issues:
[FEA] Allow BigSizedJoinIterator#buildPartitioner to produce more subpartitions to avoid CudfColumnSizeOverflowException (#12367)

5 participants