Allow BigSizedJoinIterator#buildPartitioner to produce more subpartitions #12372


Merged
2 commits merged into NVIDIA:branch-25.04 on Mar 27, 2025

Conversation

binmahone
Collaborator

This PR closes #12367 by introducing a new config called spark.rapids.sql.join.sizedJoin.buildPartitionNumberAmplification

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
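
For illustration, a minimal sketch of setting the new config on a Spark session. The config name comes from this PR; the value and the session setup are made-up examples, and since the config is internal it should only be set when hitting the issue described in #12367 and on expert advice.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: "4" is a made-up amplification value.
// The config is internal; set it only when hitting the CudfColumnSizeOverflowException
// described in #12367 and after consulting the RAPIDS team.
val spark = SparkSession.builder()
  .appName("sized-join-amplification-example")
  .config("spark.rapids.sql.join.sizedJoin.buildPartitionNumberAmplification", "4")
  .getOrCreate()
```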
@binmahone
Collaborator Author

build

Collaborator

@firestarman left a comment

LGTM, but it would be better to have more reviews from others.

And in the future, we may choose to repartition with a heuristic that calculates the proper partition number to overcome the skew case mentioned in the linked issue. It would be similar to the repartitioning done in the GPU hash aggregate or the GPU sub hash join.

@binmahone requested review from revans2 and abellina on March 24, 2025 08:36
@sameerz added the feature request (New feature or request) label on Mar 25, 2025
Collaborator

@revans2 left a comment

I really hate this. I get that you might need a fix quickly, but this is not a long-term solution. I like that the config is private so we can remove it in the future, but the only way I am willing to merge this in is if there is a follow-on issue early in 25.06 that would find a way to deal with this case properly.

The heuristic being used currently is looking at the build side of the join to estimate how many output rows there would be for each input row on average, aka the amplification. This is not a perfect solution because it assumes that the distribution of the keys on the stream side matches that of the build side, which is not guaranteed.
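
To make that concrete, here is a rough illustrative sketch of that kind of build-side estimate (not the plugin's actual code; the function name and inputs are hypothetical): the build side's average rows per key is taken as the expected output rows per stream row.

```scala
// Illustrative sketch only, not the plugin's actual heuristic.
// Average build-side rows per join key; assuming the stream side has the same
// key distribution, this approximates the output rows per stream-side row
// (the "amplification"). A skewed stream side breaks that assumption.
def estimateBuildAmplification(buildKeys: Seq[Any]): Double = {
  val totalRows    = buildKeys.size.toDouble
  val distinctKeys = buildKeys.distinct.size.toDouble
  if (distinctKeys == 0) 0.0 else totalRows / distinctKeys
}
```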

Perhaps we can look at using a sketch to estimate the number of output rows for the equality portion of a join, and then assume that any non-equality parts are just going to reduce the number of output rows.

Doing a quick bit of research, it looks like there are a lot of sketches we could look at using.

Most of them look like something we could build on the GPU fairly quickly. Probably at least as fast as the distinct count we are doing today.
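
As a strawman for what a sketch-based estimate could look like (exact frequency maps stand in here for approximate sketches such as Count-Min; all names are hypothetical):

```scala
// Strawman only: exact maps stand in for approximate frequency sketches that
// could be built on the GPU. For an equi-join, key k contributes
// freqBuild(k) * freqStream(k) output rows; non-equality predicates can only
// reduce that, so the sum is an upper bound on the join output size.
def estimateEquiJoinOutputRows(buildKeys: Seq[Any], streamKeys: Seq[Any]): Long = {
  val buildFreq  = buildKeys.groupBy(identity).map { case (k, v) => k -> v.size.toLong }
  val streamFreq = streamKeys.groupBy(identity).map { case (k, v) => k -> v.size.toLong }
  buildFreq.iterator.map { case (k, n) => n * streamFreq.getOrElse(k, 0L) }.sum
}
```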

@binmahone
Collaborator Author

binmahone commented Mar 26, 2025

Talked with Bobby offline; conclusions:

  1. He's okay with it as a workaround for the issue described in [FEA] Allow BigSizedJoinIterator#buildPartitioner to produce more subpartitions to avoid CudfColumnSizeOverflowException #12367. The new config is internal; it is not supposed to be used unless a user encounters the same issue and experts from our side advise them to use it.
  2. In 25.06 we will use a follow-up issue, [FEA] revisit on partitioning of BuildSidePartitioner #12387, to address the concern about "why another config".

@binmahone
Collaborator Author

binmahone commented Mar 26, 2025

This PR is intended to address some corner cases from our customer. I fully understand that this is not a clean solution, but I'm also aware that we don't have a perfect dynamic solution to this issue in the short term (see #12354 (review) for more; we don't have a clear roadmap on this yet). My past experience has shown that when users in production encounter unavoidable bugs, they prefer having some special configuration as an escape route rather than being stuck without options or forced to wait for a new release. That's why I'm introducing a new internal config in case they really need it (after consulting our experts). Once we have finally finished the proper dynamic solution we can of course remove this kind of internal config; normal users will not be aware of it.

Still, if there's strong resistance to the internal config, we can choose not to check in this PR, or we can use the customer's private repo for now. @revans2 @sameerz @GaryShen2008 @winningsix @abellina

Collaborator

@abellina left a comment

Agree with @revans2. For now, if this helps a specific use case we can merge it, but it should be cleaned up in 25.06.

@binmahone merged commit 33bce74 into NVIDIA:branch-25.04 on Mar 27, 2025
55 checks passed
Labels
feature request (New feature or request)

Successfully merging this pull request may close these issues:
[FEA] Allow BigSizedJoinIterator#buildPartitioner to produce more subpartitions to avoid CudfColumnSizeOverflowException (#12367)

5 participants