Add new cudf::top_k API #19303

davidwendt · 2025-07-07T20:05:07Z

Description

Adds new cudf::top_k API to compute the top K values for a given column.
This essentially performs a descending sort followed by a gather of the first K elements.
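
For illustration, the sort-then-gather behaviour described above can be sketched on the CPU in plain Python. This is only a sketch of the semantics, not the libcudf implementation; `top_k_sketch` is a hypothetical name, and clamping `k` to the column length is an assumption.

```python
def top_k_sketch(values, k):
    """Return the k largest values, largest first (hypothetical helper)."""
    # Assumption: k is clamped to the valid range [0, len(values)].
    k = max(0, min(k, len(values)))
    # Step 1: descending sort, expressed as a permutation of row indices.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    # Step 2: gather the first k rows of that permutation.
    return [values[i] for i in order[:k]]


print(top_k_sketch([5, 1, 9, 3, 7], 3))  # [9, 7, 5]
```

The full sort costs O(N log N) even when k is small, which is the cost the later comments in this thread discuss.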

Reference: #19096

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2025-07-07T20:05:10Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

davidwendt · 2025-07-09T17:15:49Z

/ok to test

mhaseeb123

One suggestion but LGTM otherwise.

mhaseeb123 · 2025-07-14T18:40:36Z

cpp/tests/sort/sort_test.cpp

+TYPED_TEST(Sort, TopK)
+{
+  using T = TypeParam;
+  if (std::is_same_v<T, bool>) { GTEST_SKIP(); }


TIL about GTEST_SKIP()

ttnghia · 2025-07-14T22:17:00Z

> This essentially performs a descending sort followed by a gather of the first K elements.

Can we have a cheaper solution for this? As for various algorithms (N is the column size):

| Algorithm | Time Complexity | Space Complexity | Best Use Case |
| --- | --- | --- | --- |
| Heap-based | O(N log k) | O(k) | When k ≪ N, streaming data |
| Quickselect | O(N) average, O(N^2) worst | O(1) or O(log N) | When average-case performance is acceptable |
| Sorting | O(N log N) | O(1) or O(N) | When simplicity is preferred or k ≈ N |

davidwendt · 2025-07-14T22:22:04Z

> > This essentially performs a descending sort followed by a gather of the first K elements.
>
> Can we have a cheaper solution for this? As for various algorithms (N is the column size):

This is mostly a placeholder to get the API right, etc.
We will use CUB's topk for fixed-width types when it becomes available but will likely keep the sort-based one for string types.
Maybe you can point me to the heap-based and quick-select algorithms you mentioned for the string specialization.

mhaseeb123 · 2025-07-14T22:24:58Z

> We will use CUB's topk for fixed-width types when it becomes available

@davidwendt do you know by any chance what algorithm CUB's topk implements (or will implement)? I couldn't really find any information about this online.

davidwendt · 2025-07-14T22:34:20Z

> > We will use CUB's topk for fixed-width types when it becomes available
>
> @davidwendt do you know by any chance what algorithm CUB's topk implements (or will implement)? I couldn't really find any information about this online.

It is called AIR topk. https://dl.acm.org/doi/10.1145/3581784.3607062
It is a kind of radix select-k algorithm. I'm not sure if the implementation has been finalized but I expect we can use it in 25.10.
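
As a rough illustration of the radix select-k idea mentioned above, here is a sequential CPU sketch in Python. It is not the AIR top-k algorithm and not CUB's implementation; `radix_select_top_k` is a hypothetical name, and the sketch assumes non-negative integers that fit in `bits` bits, with `bits` a multiple of `radix_bits`.

```python
def radix_select_top_k(values, k, bits=32, radix_bits=8):
    """Return the k largest values in no particular order (hypothetical helper).

    Assumes non-negative integers below 2**bits.
    """
    if k <= 0:
        return []
    if k >= len(values):
        return list(values)
    selected = []              # values already known to be in the top k
    candidates = list(values)  # values that may still belong to the top k
    mask = (1 << radix_bits) - 1
    # Examine one digit at a time, most significant digit first.
    for shift in range(bits - radix_bits, -1, -radix_bits):
        # Histogram of the current digit over the remaining candidates.
        counts = [0] * (1 << radix_bits)
        for v in candidates:
            counts[(v >> shift) & mask] += 1
        # Walk buckets from the largest digit down to find the bucket
        # that contains the k-th largest value.
        remaining = k - len(selected)
        digit = mask
        while counts[digit] < remaining:
            remaining -= counts[digit]
            digit -= 1
        # Larger buckets are entirely in the top k; only the boundary
        # bucket needs further refinement on the next digit.
        selected.extend(v for v in candidates if ((v >> shift) & mask) > digit)
        candidates = [v for v in candidates if ((v >> shift) & mask) == digit]
    # All remaining candidates are equal; take as many as still needed.
    selected.extend(candidates[: k - len(selected)])
    return selected


print(sorted(radix_select_top_k([170, 45, 75, 90, 802, 24, 2, 66, 802], 3)))
# [170, 802, 802]
```

Each pass only histograms and filters the shrinking candidate set, so no full sort is needed, and the result comes back unordered.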

ttnghia · 2025-07-15T20:15:56Z

> Maybe you can point me to the heap-based and quick-select algorithms you mentioned for the string specialization.

That's what Perplexity told me. I asked it further, and here is the answer:

Heap-Based Implementation
Algorithm Overview:
Use a min-heap of size k to keep track of the top k elements as you iterate through the array. For each new element, if the heap is not full, add it; if it is full and the new element is larger than the smallest in the heap, replace the smallest.

Python Example:

```python
import heapq

def find_top_k_elements(arr, k):
    """Return the k largest elements of arr (in heap order, not sorted)."""
    if k <= 0:
        return []
    min_heap = []  # holds the k largest elements seen so far
    for num in arr:
        if len(min_heap) < k:
            heapq.heappush(min_heap, num)
        elif num > min_heap[0]:
            # Evict the current smallest and insert num in one step.
            heapq.heapreplace(min_heap, num)
    return min_heap
```

This approach is detailed in guides and tutorials, such as those on Algocademy and Interviewing.io.

Further Reading:

[Python heapq documentation]

[Algocademy: Top K Elements with Heap]

[Interviewing.io: Top K Frequent Elements]

Quickselect Implementation
Algorithm Overview:
Quickselect is a selection algorithm related to quicksort. It partitions the array to find the k-th largest element, then collects all elements greater than or equal to this value.

Python Example:

```python
import random

def quickselect(arr, k):
    """Return the k-th largest element of arr (1-based); partitions arr in place."""
    if not 1 <= k <= len(arr):
        return None

    def partition(left, right, pivot_idx):
        # Move elements greater than the pivot to the front (descending order).
        pivot = arr[pivot_idx]
        arr[pivot_idx], arr[right] = arr[right], arr[pivot_idx]
        store_idx = left
        for i in range(left, right):
            if arr[i] > pivot:
                arr[store_idx], arr[i] = arr[i], arr[store_idx]
                store_idx += 1
        arr[right], arr[store_idx] = arr[store_idx], arr[right]
        return store_idx

    def select(left, right):
        if left == right:
            return arr[left]
        pivot_idx = random.randint(left, right)
        pivot_idx = partition(left, right, pivot_idx)
        if k == pivot_idx + 1:
            return arr[pivot_idx]
        elif k < pivot_idx + 1:
            return select(left, pivot_idx - 1)
        else:
            return select(pivot_idx + 1, right)

    return select(0, len(arr) - 1)

def find_top_k_elements_quickselect(arr, k):
    if k <= 0:
        return []
    if k >= len(arr):
        return list(arr)
    # Work on a copy because quickselect reorders its input.
    kth_largest = quickselect(list(arr), k)
    # Take everything strictly larger, then pad with ties so exactly k are returned.
    top = [num for num in arr if num > kth_largest]
    top.extend([kth_largest] * (k - len(top)))
    return top
```

This method is explained in detail on Algocademy, GeeksforGeeks, and Stack Overflow.

revans2 · 2025-07-16T13:33:47Z

This is great. Would it be possible/simple to have an API that optionally takes a second table and produces a table as output?

Our code for top_k is an out-of-core algorithm. It does a top_k on an input batch, and if there is more than one batch, it concats the new batch with the previous batch and does another top_k on that. It repeats this until there are no more batches to process. It would be great to drop the concat too, but that is mostly from a worst-case memory standpoint.

https://github.com/NVIDIA/spark-rapids/blob/99defa9c8828131e6c21b134412560fc0e0f6348/sql-plugin/src/main/scala/com/nvidia/spark/rapids/limit.scala#L293-L313
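
The batch-wise pattern described above can be sketched roughly as follows. This is a CPU illustration in Python, not the spark-rapids code; `batched_top_k` is a hypothetical name, and it assumes that what gets concatenated with each new batch is the running top-k result from the earlier batches.

```python
import heapq


def batched_top_k(batches, k):
    """Keep a running top-k across batches (hypothetical helper)."""
    running = []
    for batch in batches:
        # Concatenate the running result with the new batch, then
        # reduce back down to at most k rows.
        running = heapq.nlargest(k, running + list(batch))
    return running


print(batched_top_k([[4, 9, 1], [7, 3], [10, 2, 8]], 3))  # [10, 9, 8]
```

The temporary `running + list(batch)` is the concat step; its size is at most k plus one batch, which is the worst-case memory cost mentioned above.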

davidwendt · 2025-07-16T13:47:41Z

> This is great. Would it be possible/simple to have an API that optionally takes a second table and produces a table as output?

I've included a top_k_order API as well so you can do a gather on a separate table/column.

elstehle · 2025-07-16T16:10:50Z

> > This is great. Would it be possible/simple to have an API that optionally takes a second table and produces a table as output?
>
> I've included a top_k_ordered API as well so you can do a gather on a separate table/column.

Is the idea that top_k does not necessarily return the top k elements in any given order, while top_k_ordered would return them in ascending or descending order? If this is the idea behind the two cudf interfaces, this would align well with our interfaces.

The reason I'm bringing this up is because the implementation we're working on in CUB will not guarantee any order - at least the initial version won't - and so a subsequent cub::DeviceRadixSort call would be required if the output order is of concern. The motivation simply is that most users do not require the output to be ordered and returning the top-k as a "set" of items allows for a more efficient code path. Ultimately, we'll add another overload in CUB that takes care of order too but it's not that high on our list (because usually k<<N and so a subsequent cub::DeviceRadixSort on the user end should be ok until then).

davidwendt · 2025-07-16T17:14:01Z

> Is the idea that top_k does not necessarily return the top k elements in any given order, while top_k_ordered would return them in ascending or descending order? If this is the idea behind the two cudf interfaces, this would align well with our interfaces.

No. The top_k_order is simply the indices and is meant to have parity with the sorted_order function we already have.
It will be useful to have just the indices to do a gather on a different set of columns.
Here we would call the CUB TopKPairs function to get the indices to return.
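
A small Python sketch of the index-based variant described above, for illustration only: `top_k_order_sketch` and `gather` are hypothetical helpers mirroring the roles of top_k_order and a gather, not the libcudf signatures. The sketch returns indices largest-first, which is an assumption of this sort-based illustration rather than a guarantee of the API.

```python
def top_k_order_sketch(keys, k):
    """Return the row indices of the k largest keys (hypothetical helper)."""
    k = max(0, min(k, len(keys)))
    # Sort row indices by key, descending, and keep the first k.
    return sorted(range(len(keys)), key=lambda i: keys[i], reverse=True)[:k]


def gather(column, indices):
    """Pick rows of `column` at the given indices (hypothetical helper)."""
    return [column[i] for i in indices]


scores = [0.2, 0.9, 0.5, 0.7]
names = ["a", "b", "c", "d"]

idx = top_k_order_sketch(scores, 2)
print(idx)                 # [1, 3]
print(gather(names, idx))  # ['b', 'd']
```

Returning indices instead of values lets the same selection be applied to any number of other columns.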

> The reason I'm bringing this up is because the implementation we're working on in CUB will not guarantee any order - at least the initial version won't - and so a subsequent cub::DeviceRadixSort call would be required if the output order is of concern. The motivation simply is that most users do not require the output to be ordered and returning the top-k as a "set" of items allows for a more efficient code path. Ultimately, we'll add another overload in CUB that takes care of order too but it's not that high on our list (because usually k<<N and so a subsequent cub::DeviceRadixSort on the user end should be ok until then).

This is good to know. I can add an unordered-ness statement to the doxygen.

davidwendt · 2025-07-17T21:22:51Z

/merge

Add new cudf::top_k API

790f110

davidwendt self-assigned this Jul 7, 2025

davidwendt added the feature request, 2 - In Progress, libcudf, and non-breaking labels Jul 7, 2025

github-actions bot added the CMake label Jul 7, 2025

davidwendt added 3 commits July 8, 2025 19:00

Merge branch 'branch-25.08' into topk-api

39bd888

add top_k_order() and benchmarks

7c8598e

Merge branch 'branch-25.08' into topk-api

e72aa35

GregoryKimball added this to libcudf Jul 11, 2025

GregoryKimball moved this to In progress in libcudf Jul 11, 2025

GregoryKimball moved this from In progress to Burndown in libcudf Jul 11, 2025

davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label Jul 14, 2025

davidwendt marked this pull request as ready for review July 14, 2025 13:05

davidwendt requested review from a team as code owners July 14, 2025 13:05

davidwendt requested review from karthikeyann and PointKernel July 14, 2025 13:05

mhaseeb123 reviewed Jul 14, 2025

mhaseeb123 approved these changes Jul 14, 2025

davidwendt added 2 commits July 14, 2025 19:59

Merge branch 'branch-25.08' into topk-api

d8f4d46

add constexpr

d72c7c2

Merge branch 'branch-25.08' into topk-api

d1b214c

PointKernel approved these changes Jul 15, 2025

davidwendt added 2 commits July 16, 2025 08:52

Merge branch 'branch-25.08' into topk-api

6751256

add const to var decls

44be2c8

davidwendt added 2 commits July 16, 2025 13:19

add non-sorted k elements description to doxygen

e506a41

Merge branch 'branch-25.08' into topk-api

6547569

rapids-bot bot merged commit d245a63 into rapidsai:branch-25.08 Jul 17, 2025
91 checks passed

davidwendt deleted the topk-api branch July 17, 2025 21:23

mhaseeb123 moved this from Burndown to Landed in libcudf Jul 17, 2025

Matt711 mentioned this pull request Jul 18, 2025

Implement top k expression in cudf-polars using cudf::top_k #19431

Open

3 tasks
