Add new cudf::top_k API #19303

Merged
11 commits merged into rapidsai:branch-25.08 on Jul 17, 2025

Conversation

davidwendt
Contributor

@davidwendt davidwendt commented Jul 7, 2025

Description

Adds a new cudf::top_k API to compute the top K values for a given column.
This essentially performs a descending sort followed by a gather of the first K elements.
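
The sort-then-gather approach described above can be sketched in plain Python. This is only an illustration of the idea, not libcudf code; the function name `top_k_by_sort` is hypothetical and chosen for this example.

```python
def top_k_by_sort(values, k):
    """Return the k largest values in descending order (sort, then take the first k)."""
    if k <= 0:
        return []
    # Descending sort of the whole input: O(N log N) regardless of k.
    ordered = sorted(values, reverse=True)
    # "Gather" the first k elements; slicing clamps k to len(values).
    return ordered[:k]


print(top_k_by_sort([5, 1, 9, 3, 7], 3))  # [9, 7, 5]
```

The full sort does more work than necessary when k is much smaller than N, which is the cost the later discussion in this thread is about.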

Reference: #19096

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt self-assigned this Jul 7, 2025
@davidwendt davidwendt added the feature request, 2 - In Progress, libcudf, and non-breaking labels Jul 7, 2025

copy-pr-bot bot commented Jul 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the CMake label Jul 7, 2025
@davidwendt
Contributor Author

/ok to test

@GregoryKimball GregoryKimball moved this to In progress in libcudf Jul 11, 2025
@GregoryKimball GregoryKimball moved this from In progress to Burndown in libcudf Jul 11, 2025
@davidwendt davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label Jul 14, 2025
@davidwendt davidwendt marked this pull request as ready for review July 14, 2025 13:05
@davidwendt davidwendt requested review from a team as code owners July 14, 2025 13:05
Member

@mhaseeb123 mhaseeb123 left a comment

One suggestion but LGTM otherwise.

TYPED_TEST(Sort, TopK)
{
using T = TypeParam;
if (std::is_same_v<T, bool>) { GTEST_SKIP(); }

Member

TIL about GTEST_SKIP()

@ttnghia
Contributor

ttnghia commented Jul 14, 2025

> This essentially performs a descending sort followed by a gather of the first K elements.

Can we have a cheaper solution for this? Here is a comparison of candidate algorithms (N is the column size):

| Algorithm | Time Complexity | Space Complexity | Best Use Case |
|---|---|---|---|
| Heap-based | O(N log k) | O(k) | When k ≪ N, streaming data |
| Quickselect | O(N) average, O(N^2) worst | O(1) or O(log N) | When average-case performance is acceptable |
| Sorting | O(N log N) | O(1) or O(N) | When simplicity is preferred or k ≈ N |

@davidwendt
Contributor Author

> This essentially performs a descending sort followed by a gather of the first K elements.

> Can we have a cheaper solution for this? Here is a comparison of candidate algorithms (N is the column size):

This is mostly a placeholder to get the API right.
We will use CUB's topk for fixed-width types when it becomes available but will likely keep the sort-based one for string types.
Maybe you can point me to the heap-based and quick-select algorithms you mentioned for the string specialization.

@mhaseeb123
Member

> We will use CUB's topk for fixed-width types when it becomes available

@davidwendt do you know, by any chance, which algorithm CUB's topk implements (or will implement)? I couldn't find any information about this online.

@davidwendt
Contributor Author

> > We will use CUB's topk for fixed-width types when it becomes available

> @davidwendt do you know, by any chance, which algorithm CUB's topk implements (or will implement)? I couldn't find any information about this online.

It is called AIR topk. https://dl.acm.org/doi/10.1145/3581784.3607062
It is a kind of radix select-k algorithm. I'm not sure if the implementation has been finalized but I expect we can use it in 25.10.
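
To give a rough sense of what a radix select-k does, here is a minimal single-threaded sketch in Python. It is not the AIR Top-K algorithm and not CUB code; it only shows the basic idea of narrowing down the k-th largest value one digit at a time instead of sorting everything. The function name `radix_top_k` and its parameters are hypothetical, and the sketch assumes non-negative integers below `2**bits`, with `bits` a multiple of `radix_bits`.

```python
def radix_top_k(values, k, bits=32, radix_bits=8):
    """Return the k largest values (in no particular order) by MSD radix selection."""
    if k <= 0:
        return []
    if k >= len(values):
        return list(values)
    selected = []              # values already known to be in the top k
    candidates = list(values)  # values that may still contain the k-th largest
    mask = (1 << radix_bits) - 1
    # Walk the digits from most significant to least significant.
    for shift in range(bits - radix_bits, -1, -radix_bits):
        need = k - len(selected)
        # Histogram step: bucket the candidates by their current digit.
        buckets = [[] for _ in range(mask + 1)]
        for v in candidates:
            buckets[(v >> shift) & mask].append(v)
        candidates = []
        # Accept whole buckets from the largest digit down while they fit.
        for digit in range(mask, -1, -1):
            bucket = buckets[digit]
            if len(bucket) <= need:
                selected.extend(bucket)
                need -= len(bucket)
                if need == 0:
                    return selected
            else:
                # The k-th largest lies in this bucket; refine it on the next digit.
                candidates = bucket
                break
    # Remaining candidates agree on every digit, so they are all equal.
    return selected + candidates[: k - len(selected)]


print(sorted(radix_top_k([5, 1, 9, 3, 7, 9, 1000, 70000], 3), reverse=True))  # [70000, 1000, 9]
```

Each pass only touches the surviving candidates, so the work shrinks quickly, and the result comes back unordered.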

@ttnghia
Contributor

ttnghia commented Jul 15, 2025

> Maybe you can point me to the heap-based and quick-select algorithms you mentioned for the string specialization.

That's what Perplexity told me. I asked it further, and here is the answer:

Heap-Based Implementation
Algorithm Overview:
Use a min-heap of size k to keep track of the top k elements as you iterate through the array. For each new element, if the heap is not full, add it; if it is full and the new element is larger than the smallest in the heap, replace the smallest.

Python Example:

python
import heapq

def find_top_k_elements(arr, k):
    if k <= 0:
        return []
    min_heap = []
    for num in arr:
        if len(min_heap) < k:
            heapq.heappush(min_heap, num)
        elif num > min_heap[0]:
            heapq.heappop(min_heap)
            heapq.heappush(min_heap, num)
    return min_heap

This approach is detailed in guides and tutorials, such as those on Algocademy and Interviewing.io.

Further Reading:

[Python heapq documentation]

[Algocademy: Top K Elements with Heap]

[Interviewing.io: Top K Frequent Elements]

Quickselect Implementation
Algorithm Overview:
Quickselect is a selection algorithm related to quicksort. It partitions the array in descending order around the k-th largest element, after which the k largest elements occupy the first k positions (in no particular order).

Python Example:

python
import random

def quickselect(arr, k):
    """Return the k-th largest element (1-based); partially reorders arr in place."""
    if not 1 <= k <= len(arr):
        return None
    def partition(left, right, pivot_idx):
        pivot = arr[pivot_idx]
        # Park the pivot at the end, then sweep larger elements to the front.
        arr[pivot_idx], arr[right] = arr[right], arr[pivot_idx]
        store_idx = left
        for i in range(left, right):
            if arr[i] > pivot:
                arr[store_idx], arr[i] = arr[i], arr[store_idx]
                store_idx += 1
        arr[right], arr[store_idx] = arr[store_idx], arr[right]
        return store_idx
    def select(left, right):
        if left == right:
            return arr[left]
        pivot_idx = random.randint(left, right)
        pivot_idx = partition(left, right, pivot_idx)
        if k == pivot_idx + 1:
            return arr[pivot_idx]
        elif k < pivot_idx + 1:
            return select(left, pivot_idx - 1)
        else:
            return select(pivot_idx + 1, right)
    return select(0, len(arr) - 1)

def find_top_k_elements_quickselect(arr, k):
    if k <= 0:
        return []
    work = list(arr)        # copy so the caller's list is not reordered
    k = min(k, len(work))   # asking for more than N returns everything
    quickselect(work, k)
    # After selection the k largest elements sit in work[:k], even with duplicates.
    return work[:k]

This method is explained in detail on Algocademy, GeeksforGeeks, and Stack Overflow.

@revans2
Contributor

revans2 commented Jul 16, 2025

This is great. Would it be possible/simple to have an API that optionally takes a second table and produces a table as output?

Our code for top_k is an out-of-core algorithm. It runs top_k on an input batch and, if there is more than one batch, concatenates the new batch with the previous result and runs top_k again, repeating until there are no more batches to process. It would be great to drop the concat too, but that is mostly a concern for worst-case memory use.

https://github.com/NVIDIA/spark-rapids/blob/99defa9c8828131e6c21b134412560fc0e0f6348/sql-plugin/src/main/scala/com/nvidia/spark/rapids/limit.scala#L293-L313
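
The batch loop described in this comment can be sketched in plain Python as follows. This is not the spark-rapids code linked above, only an illustration of the concat-and-reselect pattern; `batched_top_k` is a hypothetical name, and `heapq.nlargest` stands in for the top_k call.

```python
import heapq


def batched_top_k(batches, k):
    """Keep a running top-k across batches without holding all the data at once."""
    running = []
    for batch in batches:
        # Reduce the incoming batch first, so the concat holds at most 2k elements.
        batch_top = heapq.nlargest(k, batch)
        # Concat with the previous result and select again.
        running = heapq.nlargest(k, running + batch_top)
    return running


print(batched_top_k([[5, 1, 9], [3, 7], [8, 2, 10]], 3))  # [10, 9, 8]
```

The `running + batch_top` concatenation is the step the comment would like to avoid; an API that accepts two inputs directly would remove that temporary.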

@davidwendt
Contributor Author

davidwendt commented Jul 16, 2025

> This is great. Would it be possible/simple to have an API that optionally takes a second table and produces a table as output?

I've included a top_k_order API as well so you can do a gather on a separate table/column.

@elstehle
Contributor

> > This is great. Would it be possible/simple to have an API that optionally takes a second table and produces a table as output?

> I've included a top_k_order API as well so you can do a gather on a separate table/column.

Is the idea that top_k does not necessarily return the top k elements in any particular order, while top_k_order would return them in ascending or descending order? If that is the idea behind the two cudf interfaces, it would align well with our interfaces.

The reason I'm bringing this up is because the implementation we're working on in CUB will not guarantee any order - at least the initial version won't - and so a subsequent cub::DeviceRadixSort call would be required if the output order is of concern. The motivation simply is that most users do not require the output to be ordered and returning the top-k as a "set" of items allows for a more efficient code path. Ultimately, we'll add another overload in CUB that takes care of order too but it's not that high on our list (because usually k<<N and so a subsequent cub::DeviceRadixSort on the user end should be ok until then).

@davidwendt
Contributor Author

davidwendt commented Jul 16, 2025

> Is the idea that top_k does not necessarily return the top k elements in any particular order, while top_k_order would return them in ascending or descending order? If that is the idea behind the two cudf interfaces, it would align well with our interfaces.

No. The top_k_order is simply the indices and is meant to have parity with the sorted_order function we already have.
It will be useful to have just the indices to do a gather on a different set of columns.
Here we would call the CUB TopKPairs function to get the indices to return.
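
As a rough illustration of returning indices and gathering from a different column, here is a plain-Python sketch. It is not the cudf or CUB API; the names `top_k_indices` and `gather` are hypothetical stand-ins for the index-returning call and the gather step.

```python
def top_k_indices(keys, k):
    """Indices of the k largest keys, largest first (ties keep input order)."""
    if k <= 0:
        return []
    # Sort row indices by their key, descending, then keep the first k.
    order = sorted(range(len(keys)), key=lambda i: keys[i], reverse=True)
    return order[:k]


def gather(column, indices):
    """Pick the rows of `column` at the given indices."""
    return [column[i] for i in indices]


scores = [0.2, 0.9, 0.5, 0.7]
names = ["a", "b", "c", "d"]
idx = top_k_indices(scores, 2)
print(idx, gather(names, idx))  # [1, 3] ['b', 'd']
```

Because only indices are returned, the same result can be applied to any number of other columns without recomputing the selection.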

> The reason I'm bringing this up is because the implementation we're working on in CUB will not guarantee any order - at least the initial version won't - and so a subsequent cub::DeviceRadixSort call would be required if the output order is of concern. The motivation simply is that most users do not require the output to be ordered and returning the top-k as a "set" of items allows for a more efficient code path. Ultimately, we'll add another overload in CUB that takes care of order too but it's not that high on our list (because usually k<<N and so a subsequent cub::DeviceRadixSort on the user end should be ok until then).

This is good to know. I can add an unordered-ness statement to the doxygen.

@davidwendt
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit d245a63 into rapidsai:branch-25.08 Jul 17, 2025
91 checks passed
@davidwendt davidwendt deleted the topk-api branch July 17, 2025 21:23
@mhaseeb123 mhaseeb123 moved this from Burndown to Landed in libcudf Jul 17, 2025
Labels
3 - Ready for Review (Ready for review by team), CMake (CMake build issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change)
Projects
Status: Landed

6 participants