-
Notifications
You must be signed in to change notification settings - Fork 961
Add new cudf::top_k API #19303
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new cudf::top_k API #19303
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One suggestion but LGTM otherwise.
cpp/tests/sort/sort_test.cpp
Outdated
TYPED_TEST(Sort, TopK) | ||
{ | ||
using T = TypeParam; | ||
if (std::is_same_v<T, bool>) { GTEST_SKIP(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL about GTEST_SKIP()
Can we have a cheaper solution for this? As for various algorithms (
|
This is mostly a place-holder to get the API right, etc. |
@davidwendt do you know by any chance what algorithm does CUB's topk implement (or will implement)?. I couldn't really find any information about this online. |
It is called AIR topk. https://dl.acm.org/doi/10.1145/3581784.3607062 |
That's what Perplexity told me. Asking it further and here is the answer:
|
This is great. Would it be possible/simple to have an API that optionally takes a second table and produces a table as output? Our code for top_k, is an out of core algorithm. It does a top_k on an input batch, and then if there is more than one batch it will concat the new batch with the previous batch and do another top_k on that. It repeats this until there are no more batches to process. It would be great to drop the concat, too, but that is mostly from a worst case memory standpoint. |
I've included a |
Is the idea that The reason I'm bringing this up is because the implementation we're working on in CUB will not guarantee any order - at least the initial version won't - and so a subsequent |
No. The
This is good to know. I can add an unordered-ness statement to the doxygen. |
/merge |
Description
Adds new
cudf::top_k
API to compute the top K values for a given column.This essentially performs a descending sort followed by a gather of the first K elements.
Reference: #19096
Checklist