Sort-based inner join for high-multiplicity tables #18318

Merged: 118 commits into rapidsai:branch-25.06 on May 2, 2025

Conversation

Contributor

@shrshi shrshi commented Mar 19, 2025

Description

Contributes to #18533
Addresses performance hotspots outlined in #16025
This PR introduces a sort-based approach for inner joins on low-cardinality, high-multiplicity tables, i.e. tables with few unique keys, each of which is repeated many times.
Sort-merge join implementation:

  1. Sort left and right tables using their respective keys.
  2. Iterate through the larger of the two tables and, for each of its keys, compute the lower and upper bounds of that key in the sorted smaller table.
  3. For left indices, compute the number of elements $n$ in the bounds range for each key, and insert that key's row index $n$ times in the output array.
  4. For right indices, insert the positions between lower and upper bound using the sorted ordering of the smaller table.
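
The steps above can be sketched on the host with standard-library binary searches. This is a minimal single-threaded illustration, not the libcudf implementation: the names `sorted_order` and `sort_merge_inner_join_sketch` are hypothetical, keys are assumed to be a single non-null int64 column, and only the smaller table is sorted because the binary searches do not require the larger table to be ordered.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: returns the stable permutation that sorts `keys`.
std::vector<std::size_t> sorted_order(std::vector<std::int64_t> const& keys)
{
  std::vector<std::size_t> order(keys.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
    return keys[a] < keys[b];
  });
  return order;
}

// Hypothetical sketch: returns (larger-table row indices, smaller-table row indices)
// for every pair of rows whose keys are equal.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>> sort_merge_inner_join_sketch(
  std::vector<std::int64_t> const& larger, std::vector<std::int64_t> const& smaller)
{
  // Step 1 (simplified): sort the smaller table, remembering original positions.
  auto const order = sorted_order(smaller);
  std::vector<std::int64_t> sorted_keys(smaller.size());
  for (std::size_t i = 0; i < order.size(); ++i) {
    sorted_keys[i] = smaller[order[i]];
  }

  std::vector<std::size_t> larger_idx;
  std::vector<std::size_t> smaller_idx;
  for (std::size_t i = 0; i < larger.size(); ++i) {
    // Step 2: bounds of this key within the sorted smaller table.
    auto const lo = std::lower_bound(sorted_keys.begin(), sorted_keys.end(), larger[i]);
    auto const hi = std::upper_bound(lo, sorted_keys.end(), larger[i]);

    // Step 3: repeat the larger-table row index once per match (n = hi - lo).
    auto const n = static_cast<std::size_t>(hi - lo);
    larger_idx.insert(larger_idx.end(), n, i);

    // Step 4: map positions in [lo, hi) back to original smaller-table rows.
    for (auto it = lo; it != hi; ++it) {
      smaller_idx.push_back(order[static_cast<std::size_t>(it - sorted_keys.begin())]);
    }
  }

  assert(larger_idx.size() == smaller_idx.size());
  return {std::move(larger_idx), std::move(smaller_idx)};
}
```

Calling `sort_merge_inner_join_sketch({1, 2, 1, 3}, {2, 1, 1})` pairs each row of the first table with every equal-key row of the second; the key 3 has an empty bounds range and contributes nothing. A GPU version would presumably compute the per-key counts first and use a prefix sum to size and fill the output in parallel, which this sequential sketch does not attempt.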

Progress

  1. Benchmarking results for join on int64 keys for input tables of varying key multiplicity: Performance comparison plot
  2. Benchmarking results after optimizing right indices construction: Profiles and updated benchmarks

TODO:

  • Inner join on nested columns
  • Inner join on nullable keys with null equality set to false.
  • Remaining join types (left, semi, full, ...)
  • Merge join for sorted left and right keys

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot commented Mar 19, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf and CMake labels Mar 19, 2025
@shrshi shrshi added the feature request and non-breaking labels Mar 19, 2025
Contributor Author

shrshi commented Mar 19, 2025

/ok to test

@shrshi shrshi changed the base branch from branch-25.04 to branch-25.06 March 24, 2025 19:13
Contributor Author

shrshi commented Mar 24, 2025

/ok to test

Contributor Author

shrshi commented May 1, 2025

/ok to test eaa76fa

@shrshi shrshi added the 5 - Ready to Merge label May 1, 2025
Contributor Author

shrshi commented May 1, 2025

/ok to test a083c71

Comment on lines +240 to +241
merge_inner_join(cudf::table_view const& left_keys,
                 cudf::table_view const& right_keys,
Contributor

Can we match the parameter names in sort_merge_join? That means using left and right without the _keys suffix. I don't see any values tables, so the suffix isn't needed.

Contributor Author

I retained the _keys suffix for consistency - it matches the join declarations in join/join.hpp

inner_join(cudf::table_view const& left_keys,

Comment on lines +65 to +71
struct unprocessed_table_mapper {
  bitmask_type const* const _validity_mask;
  __device__ auto operator()(size_type idx) const noexcept
  {
    return cudf::bit_is_set(_validity_mask, idx);
  }
};
Contributor

I see this is used only once. So can we just use a lambda in its caller?

Contributor Author

Since the validity mask is a member of a private nested struct of the sort-merge join class, using a device lambda instead of a free functor fails to compile with the following error: "enclosing parent function for an extended device lambda cannot have private or protected access".

Contributor Author

shrshi commented May 1, 2025

/ok to test 57b4d3f

@shrshi shrshi requested a review from ttnghia May 1, 2025 23:03
Contributor Author

shrshi commented May 2, 2025

/ok to test 4ed9c9e

Contributor Author

shrshi commented May 2, 2025

/ok to test e9d0612

Contributor Author

shrshi commented May 2, 2025

/merge

@rapids-bot rapids-bot bot merged commit 68aa7e2 into rapidsai:branch-25.06 May 2, 2025
112 checks passed
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change

9 participants