Add `coalesce` kernel and `BatchCoalescer` for statefully combining selected batches #7597
Conversation
````rust
/// assert!(coalescer.next_batch().is_none());
/// ```
///
/// # Background
````
The major differences between this and what is in DataFusion:
- Does not implement `limit`, which I think is more appropriate to leave to a higher level structure
- Outputs exactly the target batch size -- which will (hopefully) result in the ability to avoid any additional allocations (see the usage sketch below)
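A usage sketch based on the API surface visible in this PR (`push_batch` and `next_batch`); the constructor name/arguments and the import path are assumptions, not necessarily the final API:

```rust
use std::sync::Arc;
use arrow_array::{Int32Array, RecordBatch};
use arrow_schema::{ArrowError, DataType, Field, Schema};
// Assumed import path; the PR adds `BatchCoalescer` to `arrow-select`.
use arrow_select::coalesce::BatchCoalescer;

fn example() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    // Assumed constructor: schema plus a target output batch size.
    let mut coalescer = BatchCoalescer::new(Arc::clone(&schema), 4096);
    for _ in 0..10 {
        let batch = RecordBatch::try_new(
            Arc::clone(&schema),
            vec![Arc::new(Int32Array::from(vec![1; 1000]))],
        )?;
        coalescer.push_batch(batch)?; // buffers rows, completing full batches
    }
    // Completed batches come out at exactly the target size.
    while let Some(batch) = coalescer.next_batch() {
        assert_eq!(batch.num_rows(), 4096);
    }
    Ok(())
}
```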
```rust
/// However, after a while (e.g., after `FilterExec` or `HashJoinExec`) the
/// `StringViewArray` may only refer to a small portion of the buffer,
/// significantly increasing memory usage.
fn gc_string_view_batch(batch: &RecordBatch) -> RecordBatch {
```
Suggested change -- take the batch by value:

```diff
-fn gc_string_view_batch(batch: &RecordBatch) -> RecordBatch {
+fn gc_string_view_batch(batch: RecordBatch) -> RecordBatch {
```

This can avoid some allocations / `Arc` clones in the implementation.
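A toy illustration of the `Arc`-clone point (generic Rust, not the arrow-rs types): with a borrowed input every column's `Arc` must be cloned into the output, while an owned input lets the columns be moved.

```rust
use std::sync::Arc;

struct Batch {
    columns: Vec<Arc<Vec<u8>>>,
}

// Borrowed input: each column clone bumps an atomic refcount.
fn rebuild_borrowed(b: &Batch) -> Batch {
    Batch { columns: b.columns.iter().map(Arc::clone).collect() }
}

// Owned input: the Arcs are simply moved, no refcount traffic.
fn rebuild_owned(b: Batch) -> Batch {
    Batch { columns: b.columns }
}
```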
```rust
let Some(s) = c.as_string_view_opt() else {
    return Arc::clone(c);
};
let ideal_buffer_size: usize = s
```
I think it makes sense to have another fast path here before looking into the views: if the data buffer is small compared to the views, gc doesn't have much impact.
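A minimal sketch of such a fast path (hypothetical helper, not the PR's code), using the `views()` and `data_buffers()` accessors on `StringViewArray`:

```rust
use arrow_array::StringViewArray;

/// Hypothetical fast-path check: if the data buffers are no larger than the
/// views themselves, compaction cannot reclaim much, so skip gc entirely.
fn gc_worth_it(s: &StringViewArray) -> bool {
    let views_bytes = s.views().len() * std::mem::size_of::<u128>();
    let buffer_bytes: usize = s.data_buffers().iter().map(|b| b.len()).sum();
    buffer_bytes > views_bytes
}
```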
```rust
if actual_buffer_size > (ideal_buffer_size * 2) {
    // We set the block size to `ideal_buffer_size` so that the new
    // StringViewArray only has one buffer, which accelerates later concat_batches.
    // See https://github.com/apache/arrow-rs/issues/6094 for more details.
    let mut builder = StringViewBuilder::with_capacity(s.len());
```
Reusing the views is likely quite a bit faster (I didn't test the transmute; I think that should go through `From` / `as_u128` if it shows to be faster in benchmarks):
```diff
-let mut builder = StringViewBuilder::with_capacity(s.len());
+let mut buffer: Vec<u8> = Vec::with_capacity(ideal_buffer_size);
+let views: Vec<u128> = s.views().as_ref().iter().cloned().map(|v| {
+    // SAFETY: ByteView has same memory layout as u128
+    let mut b: ByteView = unsafe { std::mem::transmute(v) };
+    if b.length > 12 {
+        let offset = buffer.len() as u32;
+        buffer.extend_from_slice(
+            buffers[b.buffer_index as usize]
+                .get(b.offset as usize..b.offset as usize + b.length as usize)
+                .expect("Invalid buffer slice"),
+        );
+        b.offset = offset;
+        b.buffer_index = 0; // Set buffer index to 0, as we only have one buffer
+    }
+    unsafe { std::mem::transmute(b) }
+}).collect();
```
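For reference, the safe conversion the comment alludes to would look roughly like this, using `arrow_data::ByteView`'s `From<u128>` / `as_u128` (a sketch to benchmark against the transmute, not tested here):

```rust
use arrow_data::ByteView;

// Decode the packed 128-bit view, retarget non-inlined views at the single
// compacted buffer, and re-encode -- no unsafe required. `length` is valid
// for both inlined and non-inlined views, and inlined views round-trip
// unchanged.
fn remap_view(v: u128, new_offset: u32) -> u128 {
    let mut b = ByteView::from(v);
    if b.length > 12 {
        b.offset = new_offset;
        b.buffer_index = 0; // single-buffer output
    }
    b.as_u128()
}
```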
```rust
    return Ok(());
}
```

```rust
let mut batch = gc_string_view_batch(&batch);
```
I think ideally we would support gc'ing multiple batches together, which is faster (and makes concat faster); there is also the risk here of buffering too many small batches.
Indeed -- this is a great idea. I hope to have some benchmarks written up today so that we can then start optimizing this kernel substantially.
```rust
///
/// See [`Self::next_batch()`] to retrieve any completed batches.
pub fn push_batch(&mut self, batch: RecordBatch) -> Result<(), ArrowError> {
    if batch.num_rows() == 0 {
```
Another fast path could be pushing the batch through as-is if the buffer is empty and the added batch is bigger than a certain limit (e.g. `batch_size / 2`). This avoids concatenating batches that are already large enough.
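A minimal sketch of that condition (hypothetical helper, not the PR's code):

```rust
use arrow_array::RecordBatch;

/// Hypothetical pass-through check: with nothing buffered and an incoming
/// batch at least half the target size, emit it as-is and skip the concat.
fn can_pass_through(buffered_rows: usize, batch: &RecordBatch, batch_size: usize) -> bool {
    buffered_rows == 0 && batch.num_rows() >= batch_size / 2
}
```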
Ok, I think this PR is ready for a real review and hopefully merge. My proposed next steps are:

It pains me to leave so much potential performance on the table (I really want to try several of the optimizations that @Dandandan has listed), but I think we can do them as follow-on PRs because:
This looks nice to start iterating on.

Thank you for the review @Dandandan. I'll plan to merge this tomorrow so we can begin iterating. cc @zhuqi-lucas and @tustvold
LGTM, thank you @alamb, let's go!
```rust
///
/// # Heuristic
///
/// If the average size of each view is larger than 32 bytes, we compact the array.
```
Minor question: do we have some benchmark results for the "average size of each view is larger than 32 bytes" heuristic?
I think the current heuristic is a bit different? It triggers when the total buffer size is more than 2x the combined length of the non-inlined (>12 byte) views.
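A sketch of that heuristic as described (helper name assumed; `ByteView::from` decodes the packed view, and its `length` field is valid for both inlined and non-inlined views):

```rust
use arrow_array::StringViewArray;
use arrow_data::ByteView;

/// Hypothetical restatement of the check: compact only when the buffers hold
/// more than twice the bytes actually referenced by non-inlined views.
fn should_gc(s: &StringViewArray) -> bool {
    let ideal_buffer_size: usize = s
        .views()
        .iter()
        .map(|v| ByteView::from(*v).length as usize)
        .filter(|len| *len > 12)
        .sum();
    let actual_buffer_size: usize = s.data_buffers().iter().map(|b| b.len()).sum();
    actual_buffer_size > ideal_buffer_size * 2
}
```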
Yeah @Dandandan, it seems there is no 32-byte check in the code implementation now.
```rust
// Re-creating the array copies data and can be time consuming.
// We only do it if the array is sparse
if actual_buffer_size > (ideal_buffer_size * 2) {
```
Here we change to only one buffer; maybe we can investigate emitting a `StringArray` for those cases that naturally have only one buffer, and compare the performance.

I remember that for some comparison ops, `StringArray` has better performance, especially when the `StringViewArray` has a large buffer size.
I think `StringArray` can be faster for some cases (where all the strings are used and are longer than 12 bytes, for example), so this is an interesting idea.

The challenge is that the kernels typically have a known output type -- the output type only depends on the input type, it doesn't vary based on input VALUE. It would be pretty hard to use a kernel that sometimes returns a `StringViewArray` and sometimes returns a `StringArray` given the current code structure.
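A small illustration of that constraint (toy functions, not kernels from this PR): a value-dependent output type has to be erased to `ArrayRef`, forcing every caller to downcast at runtime.

```rust
use std::sync::Arc;
use arrow_array::{ArrayRef, StringViewArray};

// Output type fixed by the signature: callers know statically what they get.
fn gc_fixed(input: &StringViewArray) -> StringViewArray {
    input.clone()
}

// Output type that could vary by the input *values* must be boxed behind
// ArrayRef, so callers have to downcast to recover the concrete array.
fn gc_dynamic(input: &StringViewArray) -> ArrayRef {
    Arc::new(input.clone())
}
```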
Thank you @alamb, I agree, it's hard for us to move this investigation forward.

Thanks @alamb, let's try some suggestions.

@Dandandan made a great PR here:
…ws (#7619)

# Which issue does this PR close?
- Follow on to #7597

# Rationale for this change
While reviewing the code and the concat kernel for
- #7617

I realized there is a non-trivial difference when there are all inlined views vs some inlined views vs mostly large strings, so the benchmarks should capture that.

# What changes are included in this PR?
1. Add variations of the benchmark with different size strings in StringViewArray

# Are there any user-facing changes?
If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out.
# Which issue does this PR close?
- Closes #7615
- Follow on to #7597

# Rationale for this change
Improve performance of `gc_string_view_batch`:

```
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.001      1.00    30.4±1.05ms   ? ?/sec   1.29    39.3±0.88ms   ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.01       1.00     4.3±0.17ms   ? ?/sec   1.20     5.2±0.15ms   ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.1        1.00  1805.1±25.77µs  ? ?/sec   1.32     2.4±0.20ms   ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0, selectivity: 0.8        1.00     2.6±0.12ms   ? ?/sec   1.48     3.8±0.11ms   ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.001    1.00    42.5±0.48ms   ? ?/sec   1.23    52.2±1.33ms   ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.01     1.00     5.8±0.12ms   ? ?/sec   1.28     7.4±0.20ms   ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.1      1.00     2.2±0.02ms   ? ?/sec   1.37     3.1±0.18ms   ? ?/sec
filter: mixed_utf8view, 8192, nulls: 0.1, selectivity: 0.8      1.00     3.6±0.15ms   ? ?/sec   1.43     5.1±0.12ms   ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001     1.00    51.0±0.59ms   ? ?/sec   1.38    70.3±1.11ms   ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01      1.00     6.7±0.03ms   ? ?/sec   1.32     8.8±0.16ms   ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1       1.00     3.0±0.01ms   ? ?/sec   1.41     4.3±0.09ms   ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8       1.00     4.5±0.34ms   ? ?/sec   1.71     7.7±0.28ms   ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001   1.00    64.2±0.74ms   ? ?/sec   1.33    85.1±1.52ms   ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01    1.00     9.4±0.09ms   ? ?/sec   1.35    12.6±0.26ms   ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1     1.00     3.8±0.03ms   ? ?/sec   1.46     5.6±0.11ms   ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8     1.00     5.7±0.28ms   ? ?/sec   1.73     9.9±0.27ms   ? ?/sec
```

# What changes are included in this PR?
* Avoiding recreating the views from scratch.
* Specialize concat for view types.
* Takes owned RecordBatch (effect on performance is small, might be measurable with smaller batch size / more columns).

# Are there any user-facing changes?
No.

Co-authored-by: Andrew Lamb <[email protected]>
# Which issue does this PR close?

# Rationale for this change
The pattern of combining multiple small RecordBatches to form one larger one for subsequent processing is common in query engines like DataFusion, which filter or partition incoming Arrays. Current best practice is to use the `filter` or `take` kernels and then the `concat` kernel, as explained in:

This pattern also appears in my attempt to improve parquet filter performance (to cache the result of applying a filter rather than re-decoding the results). See:
The current pattern is non-optimal as it requires:
- buffering the intermediate results (until there are enough rows to `concat`)
- copying the data twice (once to create the output of `filter` and then to create the output of `concat`)

The theory is that with sufficient optimization we can reduce the peak memory requirements and (possibly) make it faster as well.
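For context, a sketch of the filter-then-concat pattern described above, using the existing `filter_record_batch` and `concat_batches` kernels (a simplified sketch that buffers everything rather than working incrementally):

```rust
use arrow::compute::{concat_batches, filter_record_batch};
use arrow_array::{BooleanArray, RecordBatch};
use arrow_schema::ArrowError;

/// The existing two-step pattern: filter each input batch into a small
/// intermediate, then concat all intermediates at the end.
fn filter_then_concat(
    batches: &[RecordBatch],
    predicates: &[BooleanArray],
) -> Result<RecordBatch, ArrowError> {
    let filtered: Vec<RecordBatch> = batches
        .iter()
        .zip(predicates)
        .map(|(batch, predicate)| filter_record_batch(batch, predicate)) // copy 1
        .collect::<Result<_, _>>()?;
    let schema = batches[0].schema();
    concat_batches(&schema, &filtered) // copy 2
}
```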
However, to add a benchmark for this filter+concat, I basically had nothing to
benchmark. Specifically, there needed to be an API to call.
- Add `coalesce` kernel in DataFusion: datafusion#16249

# What changes are included in this PR?
I ported the code from DataFusion upstream into arrow-rs so that:
- We can use it in the parquet reader
- We can benchmark and optimize it appropriately

This PR:
- Adds `BatchCoalescer` to `arrow-select`, and tests
- Updates documentation
- Adds examples
- Adds a `pub` export in `arrow`
- Adds a benchmark
# Are there any user-facing changes?
This is a new API.

I next plan to make a benchmark for this particular