Replace bigSizedJoin with SubPartitionHashJoin in SizedHashJoin to avoid CudfColumnSizeOverflow #12734
base: branch-25.08
Conversation
Signed-off-by: Haoyang Li <[email protected]>
private def realTargetBatchSize(): Long = {
  val configValue = RapidsConf.GPU_BATCH_SIZE_BYTES.get(conf)
  // The 10k is mostly for tests, hopefully no one is setting anything that low in production.
  Math.max(configValue, 10 * 1024)
}
nit: what must we set this to?
Need tests on: 1. NDS; 2. customer queries (we can selectively pick 20 queries whose per-task build and stream side sizes are both big).
Pull Request Overview
This PR replaces the big-sized join implementation with a sub-partition hash join variant to mitigate potential overflow issues with large build-side batches.
- Introduces mixins for GpuHashJoin and GpuSubPartitionHashJoin
- Adds a new method, realTargetBatchSize, to enforce a minimum GPU batch size
- Removes the legacy BigSizedJoinIterator and updates join execution to use the new sub-partitioning approach (see the sketch after this list for the general idea)
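For context, here is a minimal sketch of the sub-partitioning idea on plain Scala collections. It is not the plugin's GpuSubPartitionHashJoin implementation (which operates on GPU ColumnarBatches via cuDF); all names below are hypothetical, and only the control flow is meant to match: split both sides by a hash of the join key into numParts buckets so each build-side piece stays small, then join bucket by bucket.

// Hypothetical, simplified illustration of a sub-partitioned hash join.
object SubPartitionJoinSketch {
  // Route rows to one of `numParts` sub-partitions by hashing their join key.
  private def subPartition[K, R](rows: Seq[R], key: R => K, numParts: Int): Map[Int, Seq[R]] =
    rows.groupBy(r => Math.floorMod(key(r).hashCode, numParts))

  // Join matching sub-partitions independently, so only one small build-side
  // piece needs to be hash-table resident at a time.
  def join[K, L, R](
      build: Seq[L],
      stream: Seq[R],
      buildKey: L => K,
      streamKey: R => K,
      numParts: Int): Seq[(L, R)] = {
    val buildParts = subPartition(build, buildKey, numParts)
    val streamParts = subPartition(stream, streamKey, numParts)
    (0 until numParts).flatMap { p =>
      val hashTable: Map[K, Seq[L]] = buildParts.getOrElse(p, Seq.empty).groupBy(buildKey)
      streamParts.getOrElse(p, Seq.empty).flatMap { r =>
        hashTable.getOrElse(streamKey(r), Seq.empty).map(l => (l, r))
      }
    }
  }
}

Per the discussion below, the current draft splits the data into 16 parts; the sketch leaves numParts as a parameter.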
Comments suppressed due to low confidence (1)
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffledSizedHashJoinExec.scala:409
- Ensure that unit tests validate the behavior of realTargetBatchSize, especially for configuration values below 10 * 1024, to confirm that the enforced lower limit works as expected (see the test sketch below).
private def realTargetBatchSize(): Long = {
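A minimal, self-contained sketch of such a test, assuming ScalaTest. It does not exercise RapidsConf or the private method directly; the local helper simply mirrors the Math.max clamp from the snippet above, and the class and value names are hypothetical:

import org.scalatest.funsuite.AnyFunSuite

class RealTargetBatchSizeSuite extends AnyFunSuite {
  // Hypothetical floor mirroring the 10 * 1024 constant in the reviewed snippet.
  private val minBatchBytes = 10L * 1024L

  // Stand-in for the private method; mirrors Math.max(configValue, 10 * 1024).
  private def realTargetBatchSize(configValue: Long): Long =
    Math.max(configValue, minBatchBytes)

  test("config values below the 10 KiB floor are clamped up") {
    assert(realTargetBatchSize(1L) === minBatchBytes)
    assert(realTargetBatchSize(4L * 1024L) === minBatchBytes)
  }

  test("config values at or above the floor pass through unchanged") {
    assert(realTargetBatchSize(minBatchBytes) === minBatchBytes)
    assert(realTargetBatchSize(1L << 30) === (1L << 30))
  }
}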
We are still running tests on the customer side to see what happens when we really hit the bigSizedJoin path; it will probably take a few more days to get those results back. I just ran NDS six times on two A100s with 3k data and got an 8.61% average gain. @abellina could you please review the code if you have time, since the 25.06 release is approaching? Thanks.

Update: note that both the NDS and customer runs are against #12354.
Got some results on #12354 vs this PR.

In the following query, the sub-partition hash join (this PR, on the right) is significantly slower than the bigSizedJoin (Mahone's #12354, on the left). Also, the sub-partition hash join's spill is twice as big as the bigSizedJoin's in this query. This seems unnatural, because #12354 reads all following batches into a spillable queue and should therefore spill more. Mahone pointed out that it could be because the sub-partition hash join uses smaller batches, which use less GPU memory, so dynamic concurrentGpuTasks (#12374) takes effect and the additional concurrent GPU tasks cause the larger spill size. Setting

In another query, some nodes downstream of the sub-partition hash join also got slower, seemingly because the batch sizes got smaller.

So I think we can switch back to PR #12354's approach for now. What do you think? @binmahone @abellina, thanks!
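To make the "smaller batches, more concurrent tasks, more spill" explanation concrete, here is a toy back-of-envelope sketch. Every number in it is made up for illustration and is not measured from the runs above:

// Toy arithmetic only (hypothetical figures): admitting more concurrent tasks
// can grow the combined working set past the GPU budget even when each task's
// batches are smaller, and the overshoot is what spills.
object SpillBackOfEnvelope extends App {
  val gpuBudgetGiB = 32.0
  // bigSizedJoin-style: larger per-task working set, fewer concurrent tasks.
  val bigBatchWorkingSet = 4.0 * 6    // 4 GiB/task * 6 tasks  = 24 GiB
  // sub-partitioned: smaller per-task working set, but more tasks admitted.
  val smallBatchWorkingSet = 1.5 * 24 // 1.5 GiB/task * 24 tasks = 36 GiB
  def spilled(workingSetGiB: Double): Double = math.max(0.0, workingSetGiB - gpuBudgetGiB)
  println(f"big-batch spill:   ${spilled(bigBatchWorkingSet)}%.1f GiB")   // 0.0 GiB
  println(f"small-batch spill: ${spilled(smallBatchWorkingSet)}%.1f GiB") // 4.0 GiB
}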
Looking at #12734 (comment), one thing I see is that the row counts are different between the two joins. Is that a metric issue in this draft? (Potentially, given the second graph shows a project with the right row count.) It feels like we need to get an idea of why there is a perf difference. I see likely causes in the comments, but no definitive "this is the reason for the slowness". In other words, should we be improving sub-partitioning instead of moving away from it?
Hi @abellina, we'll investigate the row count diff in the first query (I checked the second query and saw the same row count, but didn't notice the diff in the first query). We should definitely align the row counts before any meaningful analysis. For the second query, to be honest, we haven't done an in-depth analysis yet, because intuitively the new approach starts by splitting the data into 16 parts for separate processing, which comes with inherent overhead, so it's not surprising if it's slower (at least in some cases). However, if you're still keen on understanding the new approach better, we will dive deeper into its NSYS traces and flame graphs. But taking a step back, is the original solution (#12354) really that bad? It does risk more spills, but it also comes up with a better number of sub-partitions, right?
@abellina I just double-confirmed with Haoyang: the row number difference should be a metrics problem. He ran the two runs at the same time, so the input data should be identical.
Yes, it's a metric bug; fixed in c098407. The test data should be the same.
Closes #12353
I think it's at an early stage; at the very least it needs more tests.