Replace bigSizedJoin with SubPartitionHashJoin in SizedHashJoin to avoid CudfColumnSizeOverflow #12734
base: branch-25.08
Conversation
Signed-off-by: Haoyang Li <[email protected]>
private def realTargetBatchSize(): Long = {
  val configValue = RapidsConf.GPU_BATCH_SIZE_BYTES.get(conf)
  // The 10k is mostly for tests, hopefully no one is setting anything that low in production.
  Math.max(configValue, 10 * 1024)
}
nit: what must we set this to?
Need tests on: 1. NDS; 2. customer queries (we can selectively pick 20 queries whose per-task build and stream side sizes are both big).
Pull Request Overview
This PR replaces the big-sized join implementation with a sub-partition hash join variant to mitigate potential overflow issues with large build-side batches.
- Introduces mixins for GpuHashJoin and GpuSubPartitionHashJoin
- Adds a new method, realTargetBatchSize, to enforce a minimum GPU batch size
- Removes the legacy BigSizedJoinIterator and updates join execution to use the new sub-partitioning approach (see the sketch after this list for the general idea)
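For context, here is a minimal sketch of the sub-partitioning idea on plain Scala collections. It is not the plugin's GpuSubPartitionHashJoin implementation (which operates on GPU ColumnarBatches via cuDF); all names below are hypothetical, and only the control flow is meant to match: split both sides by a hash of the join key into numParts buckets so each build-side piece stays small, then join bucket by bucket.

// Hypothetical, simplified illustration of a sub-partitioned hash join.
object SubPartitionJoinSketch {
  // Route rows to one of `numParts` sub-partitions by hashing their join key.
  private def subPartition[K, R](rows: Seq[R], key: R => K, numParts: Int): Map[Int, Seq[R]] =
    rows.groupBy(r => Math.floorMod(key(r).hashCode, numParts))

  // Join matching sub-partitions independently, so only one small build-side
  // piece needs to be hash-table resident at a time.
  def join[K, L, R](
      build: Seq[L],
      stream: Seq[R],
      buildKey: L => K,
      streamKey: R => K,
      numParts: Int): Seq[(L, R)] = {
    val buildParts = subPartition(build, buildKey, numParts)
    val streamParts = subPartition(stream, streamKey, numParts)
    (0 until numParts).flatMap { p =>
      val hashTable: Map[K, Seq[L]] = buildParts.getOrElse(p, Seq.empty).groupBy(buildKey)
      streamParts.getOrElse(p, Seq.empty).flatMap { r =>
        hashTable.getOrElse(streamKey(r), Seq.empty).map(l => (l, r))
      }
    }
  }
}

Per the discussion below, the current draft splits the data into 16 parts; the sketch leaves numParts as a parameter.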
Comments suppressed due to low confidence (1)
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffledSizedHashJoinExec.scala:409
- Ensure that unit tests validate the behavior of realTargetBatchSize, especially for configuration values below 10 * 1024, to confirm that the enforced lower limit works as expected (see the test sketch below).
private def realTargetBatchSize(): Long = {
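A minimal, self-contained sketch of such a test, assuming ScalaTest. It does not exercise RapidsConf or the private method directly; the local helper simply mirrors the Math.max clamp from the snippet above, and the class and value names are hypothetical:

import org.scalatest.funsuite.AnyFunSuite

class RealTargetBatchSizeSuite extends AnyFunSuite {
  // Hypothetical floor mirroring the 10 * 1024 constant in the reviewed snippet.
  private val minBatchBytes = 10L * 1024L

  // Stand-in for the private method; mirrors Math.max(configValue, 10 * 1024).
  private def realTargetBatchSize(configValue: Long): Long =
    Math.max(configValue, minBatchBytes)

  test("config values below the 10 KiB floor are clamped up") {
    assert(realTargetBatchSize(1L) === minBatchBytes)
    assert(realTargetBatchSize(4L * 1024L) === minBatchBytes)
  }

  test("config values at or above the floor pass through unchanged") {
    assert(realTargetBatchSize(minBatchBytes) === minBatchBytes)
    assert(realTargetBatchSize(1L << 30) === (1L << 30))
  }
}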
We are still running tests on the customer side to see what happens when we really hit the bigSizedJoin path; it will probably take a few more days to get those results back. I just ran NDS six times on two A100s with 3k data and got an 8.61% average gain. @abellina could you please review the code if you have time, since the 25.06 release is approaching? Thanks.

Update: note that both the NDS and customer runs are against #12354.
Got some results on #12354 vs this PR.

In the following query, the sub-partition hash join (this PR, on the right) is significantly slower than the bigSizedJoin (Mahone's #12354, on the left). Also, the sub-partition hash join's spill is twice as big as the bigSizedJoin's in this query. This seems unnatural, because #12354 reads all following batches into a spillable queue and should therefore spill more. Mahone pointed out that it could be because the sub-partition hash join uses smaller batches, which use less GPU memory, so dynamic concurrentGpuTasks (#12374) takes effect and the additional concurrent GPU tasks cause the larger spill size. Setting

In another query, some nodes downstream of the sub-partition hash join also got slower, seemingly because the batch sizes got smaller.

So I think we can switch back to PR #12354's approach for now. What do you think? @binmahone @abellina, thanks!
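To make the "smaller batches, more concurrent tasks, more spill" explanation concrete, here is a toy back-of-envelope sketch. Every number in it is made up for illustration and is not measured from the runs above:

// Toy arithmetic only (hypothetical figures): admitting more concurrent tasks
// can grow the combined working set past the GPU budget even when each task's
// batches are smaller, and the overshoot is what spills.
object SpillBackOfEnvelope extends App {
  val gpuBudgetGiB = 32.0
  // bigSizedJoin-style: larger per-task working set, fewer concurrent tasks.
  val bigBatchWorkingSet = 4.0 * 6    // 4 GiB/task * 6 tasks  = 24 GiB
  // sub-partitioned: smaller per-task working set, but more tasks admitted.
  val smallBatchWorkingSet = 1.5 * 24 // 1.5 GiB/task * 24 tasks = 36 GiB
  def spilled(workingSetGiB: Double): Double = math.max(0.0, workingSetGiB - gpuBudgetGiB)
  println(f"big-batch spill:   ${spilled(bigBatchWorkingSet)}%.1f GiB")   // 0.0 GiB
  println(f"small-batch spill: ${spilled(smallBatchWorkingSet)}%.1f GiB") // 4.0 GiB
}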
Looking at #12734 (comment), one thing I see is that the row counts are different between the two joins. Is that a metric issue in this draft? (Potentially, given the second graph shows a project with the right row count.) It feels like we need to get an idea of why there is a perf difference. I see likely causes in the comments, but no definitive "this is the reason for the slowness". In other words, should we be improving sub-partitioning instead of moving away from it?
Hi @abellina, we'll investigate the row count diff in the first query (I checked the second query and saw the same row count, but didn't notice the diff in the first query). We should definitely align the row counts before any meaningful analysis. For the second query, to be honest, we haven't done an in-depth analysis yet, because intuitively the new approach starts by splitting the data into 16 parts for separate processing, which comes with inherent overhead, so it's not surprising if it's slower (at least in some cases). However, if you're still keen on understanding the new approach better, we will dive deeper into its NSYS traces and flame graphs. But taking a step back, is the original solution (#12354) really that bad? It does risk more spills, but it also comes up with a better number of sub-partitions, right?
@abellina I just double-confirmed with Haoyang: the row number difference should be a metrics problem. He ran the two runs at the same time, so the input data should be identical.
Yes, it's a metric bug; fixed in c098407. The test data should be the same.
Closes #12353
I think it's at an early stage; at the very least it needs more tests.