POC: Sketch out parquet cached filter result API #7513

alamb · 2025-05-15T19:23:14Z

Draft until:

Pull out StringViewBuilder::concat_array into its own PR
Avoid double buffering of intermediate results
Add memory limit for results cache

Which issue does this PR close?

Builds on Introduce ReadPlan to encapsulate the calculation of what parquet rows to decode #7502
part of [EPIC] Faster performance for parquet predicate evaluation for non selective filters #7456

Rationale for this change

I am trying to sketch out enough of a cached filter result API to show performance improvements. Once I have done that, I will start proposing how to break it up into smaller PRs

What changes are included in this PR?

Add code to cache columns which are reused in filter and scan

Are there any user-facing changes?

zhuqi-lucas · 2025-05-16T04:48:39Z

parquet/src/arrow/arrow_reader/read_plan.rs

+    filters: Vec<BooleanArray>,
+}
+
+impl CachedPredicateResultBuilder {


Nice, this is very clear to get the cached result!

zhuqi-lucas · 2025-05-16T14:24:50Z

parquet/src/arrow/arrow_reader/read_plan.rs

+    /// TODO: potentially incrementally build the result of the predicate
+    /// evaluation without holding all the batches in memory. See
+    /// <https://github.com/apache/arrow-rs/issues/6692>
+    in_progress_arrays: Vec<Box<dyn InProgressArray>>,


Thank you @alamb ,

Does it mean, this in_progress_arrays is not the final result for us to generate the final batch?

For example:
Predicate a > 1 => in_progress_array_a filtered by a > 1
Predicate b >2 => in_progress_array_b filtered by b > 2 also based filtered by a > 1, but we don't update the in_progress_array_a

This is an excellent question

What I was thinking is that CachedPredicateResult would manage the "currently" applied predicate

So in the case where there are multiple predicates, I was thinking of a method like CachedPredicateResult::merge method which could take the result of filtering a and apply the result of filtering by b

We can then put heuristics / logic for if/when we materialize the filters into CachedPredicateResult

But that is sort of speculation at this point -- I don't have it all worked out yet

My plan is to get far enough to show this structure works and can improve performance, and then I'll work on the trickier logic of applying multiple filters

CachedPredicateResult::merge method which could take the result of filtering a and apply the result of filtering by b

Great idea!

But that is sort of speculation at this point -- I don't have it all worked out yet

Sure, i will continue to review, thank you @alamb !

alamb · 2025-05-16T15:05:43Z

I tested this branch using a query that filters and selects the same column (NOTE it is critical to NOT use --all-features as all features turns on force_validate

cargo bench --features="arrow async" --bench arrow_reader_clickbench -- Q24

Here are the benchmark results (30ms --> 22ms) (25 % faster)

Gnuplot not found, using plotters backend
Looking for ClickBench files starting in current_dir and all parent directories: "/Users/andrewlamb/Software/arrow-rs/parquet"
arrow_reader_clickbench/sync/Q24
                        time:   [22.532 ms 22.604 ms 22.682 ms]
                        change: [-27.751% -27.245% -26.791%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

arrow_reader_clickbench/async/Q24
                        time:   [24.043 ms 24.171 ms 24.308 ms]
                        change: [-26.223% -25.697% -25.172%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

I realize this branch currently uses more memory (to buffer the filter results), but I think the additional memory growth can be limited with a setting.

zhuqi-lucas · 2025-05-16T15:09:31Z

I tested this branch using a query that filters and selects the same column (NOTE it is critical to NOT use --all-features as all features turns on force_validate

cargo bench --features="arrow async" --bench arrow_reader_clickbench -- Q24

Here are the benchmark results (30ms --> 22ms) (25 % faster)

Gnuplot not found, using plotters backend
Looking for ClickBench files starting in current_dir and all parent directories: "/Users/andrewlamb/Software/arrow-rs/parquet"
arrow_reader_clickbench/sync/Q24
                        time:   [22.532 ms 22.604 ms 22.682 ms]
                        change: [-27.751% -27.245% -26.791%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

arrow_reader_clickbench/async/Q24
                        time:   [24.043 ms 24.171 ms 24.308 ms]
                        change: [-26.223% -25.697% -25.172%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

I realize this branch currently uses more memory (to buffer the filter results), but I think the additional memory growth can be limited with a setting.

Amazing result , i think it will be the perfect way instead of page cache, because page caching will have cache missing, but this PR will always cache the result!

alamb · 2025-05-16T15:28:48Z

Amazing result , i think it will be the perfect way instead of page cache, because page caching will have cache missing, but this PR will always cache the result!

Thanks -- I think one potential problem is that the cached results may consume too much memory (but I will try and handle that shortly)

I think we should proceed with starting to merge some refactorings; I left some suggestions here:

[EPIC] Faster performance for parquet predicate evaluation for non selective filters #7456 (comment)

zhuqi-lucas · 2025-05-16T15:33:56Z

Amazing result , i think it will be the perfect way instead of page cache, because page caching will have cache missing, but this PR will always cache the result!

Thanks -- I think one potential problem is that the cached results may consume too much memory (but I will try and handle that shortly)

I think we should proceed with starting to merge some refactorings; I left some suggestions here:

[EPIC] Faster performance for parquet predicate evaluation for non selective filters #7456 (comment)

It makes sense! Thank you @alamb.

alamb · 2025-05-16T18:47:45Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/cache_filter_result (f1f7103) to 1a5999a diff
BENCH_NAME=arrow_reader_clickbench
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench arrow_reader_clickbench
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_cache_filter_result
Results will be posted here when complete

alamb · 2025-05-16T19:14:08Z

🤖: Benchmark completed

Details

group                                alamb_cache_filter_result              main
-----                                -------------------------              ----
arrow_reader_clickbench/async/Q1     1.00      2.0±0.03ms        ? ?/sec    1.15      2.4±0.01ms        ? ?/sec
arrow_reader_clickbench/async/Q10    1.00     12.9±0.06ms        ? ?/sec    1.08     13.9±0.11ms        ? ?/sec
arrow_reader_clickbench/async/Q11    1.00     14.9±0.16ms        ? ?/sec    1.06     15.8±0.14ms        ? ?/sec
arrow_reader_clickbench/async/Q12    1.00     24.4±0.26ms        ? ?/sec    1.59     38.8±0.25ms        ? ?/sec
arrow_reader_clickbench/async/Q13    1.00     37.6±0.33ms        ? ?/sec    1.39     52.3±0.32ms        ? ?/sec
arrow_reader_clickbench/async/Q14    1.00     35.5±0.24ms        ? ?/sec    1.41     50.0±0.37ms        ? ?/sec
arrow_reader_clickbench/async/Q19    1.01      5.1±0.05ms        ? ?/sec    1.00      5.0±0.07ms        ? ?/sec
arrow_reader_clickbench/async/Q20    1.00    114.6±0.51ms        ? ?/sec    1.42    162.8±0.60ms        ? ?/sec
arrow_reader_clickbench/async/Q21    1.00    132.0±0.61ms        ? ?/sec    1.59    209.4±0.69ms        ? ?/sec
arrow_reader_clickbench/async/Q22    1.00    200.4±0.94ms        ? ?/sec    2.12    425.7±1.52ms        ? ?/sec
arrow_reader_clickbench/async/Q23    1.00   414.9±12.61ms        ? ?/sec    1.18   491.6±11.23ms        ? ?/sec
arrow_reader_clickbench/async/Q24    1.00     41.9±0.46ms        ? ?/sec    1.38     57.7±0.51ms        ? ?/sec
arrow_reader_clickbench/async/Q27    1.00    105.5±0.37ms        ? ?/sec    1.58    166.9±1.13ms        ? ?/sec
arrow_reader_clickbench/async/Q28    1.00    103.1±0.51ms        ? ?/sec    1.59    164.1±0.89ms        ? ?/sec
arrow_reader_clickbench/async/Q30    1.00     64.2±0.58ms        ? ?/sec    1.00     64.2±0.51ms        ? ?/sec
arrow_reader_clickbench/async/Q36    1.38    234.5±1.60ms        ? ?/sec    1.00    169.6±0.96ms        ? ?/sec
arrow_reader_clickbench/async/Q37    1.58    162.6±0.64ms        ? ?/sec    1.00    102.6±0.55ms        ? ?/sec
arrow_reader_clickbench/async/Q38    1.00     38.9±0.26ms        ? ?/sec    1.00     39.1±0.26ms        ? ?/sec
arrow_reader_clickbench/async/Q39    1.00     48.5±0.26ms        ? ?/sec    1.00     48.6±0.25ms        ? ?/sec
arrow_reader_clickbench/async/Q40    1.00     47.8±0.32ms        ? ?/sec    1.11     53.2±0.48ms        ? ?/sec
arrow_reader_clickbench/async/Q41    1.00     40.0±0.30ms        ? ?/sec    1.00     39.9±0.42ms        ? ?/sec
arrow_reader_clickbench/async/Q42    1.00     14.3±0.07ms        ? ?/sec    1.00     14.4±0.18ms        ? ?/sec
arrow_reader_clickbench/sync/Q1      1.00  1848.0±17.20µs        ? ?/sec    1.19      2.2±0.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q10     1.00     12.0±0.08ms        ? ?/sec    1.05     12.6±0.04ms        ? ?/sec
arrow_reader_clickbench/sync/Q11     1.00     13.8±0.07ms        ? ?/sec    1.05     14.4±0.07ms        ? ?/sec
arrow_reader_clickbench/sync/Q12     1.00     25.5±1.92ms        ? ?/sec    1.59     40.6±0.46ms        ? ?/sec
arrow_reader_clickbench/sync/Q13     1.00     36.2±1.26ms        ? ?/sec    1.49     54.0±2.06ms        ? ?/sec
arrow_reader_clickbench/sync/Q14     1.00     33.9±0.35ms        ? ?/sec    1.52     51.4±0.30ms        ? ?/sec
arrow_reader_clickbench/sync/Q19     1.04      4.4±0.11ms        ? ?/sec    1.00      4.2±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q20     1.00    138.2±1.41ms        ? ?/sec    1.29    178.9±1.12ms        ? ?/sec
arrow_reader_clickbench/sync/Q21     1.00    135.0±1.00ms        ? ?/sec    1.75    236.5±1.61ms        ? ?/sec
arrow_reader_clickbench/sync/Q22     1.00    198.4±4.34ms        ? ?/sec    2.47    490.2±2.72ms        ? ?/sec
arrow_reader_clickbench/sync/Q23     1.00    375.7±8.92ms        ? ?/sec    1.16   433.9±10.25ms        ? ?/sec
arrow_reader_clickbench/sync/Q24     1.00     38.5±0.44ms        ? ?/sec    1.42     54.8±0.67ms        ? ?/sec
arrow_reader_clickbench/sync/Q27     1.00     95.3±0.41ms        ? ?/sec    1.64    156.6±0.97ms        ? ?/sec
arrow_reader_clickbench/sync/Q28     1.00     93.2±0.54ms        ? ?/sec    1.65    153.8±0.74ms        ? ?/sec
arrow_reader_clickbench/sync/Q30     1.01     62.3±0.39ms        ? ?/sec    1.00     61.8±0.38ms        ? ?/sec
arrow_reader_clickbench/sync/Q36     5.27   835.6±11.42ms        ? ?/sec    1.00    158.6±0.82ms        ? ?/sec
arrow_reader_clickbench/sync/Q37     5.89    561.1±3.23ms        ? ?/sec    1.00     95.2±0.51ms        ? ?/sec
arrow_reader_clickbench/sync/Q38     1.00     31.6±0.24ms        ? ?/sec    1.00     31.7±0.32ms        ? ?/sec
arrow_reader_clickbench/sync/Q39     1.01     35.0±0.32ms        ? ?/sec    1.00     34.7±0.28ms        ? ?/sec
arrow_reader_clickbench/sync/Q40     1.00     44.2±0.28ms        ? ?/sec    1.12     49.3±0.33ms        ? ?/sec
arrow_reader_clickbench/sync/Q41     1.01     37.1±0.29ms        ? ?/sec    1.00     36.8±0.19ms        ? ?/sec
arrow_reader_clickbench/sync/Q42     1.00     13.6±0.06ms        ? ?/sec    1.00     13.5±0.06ms        ? ?/sec

zhuqi-lucas · 2025-05-20T13:04:46Z

🤖: Benchmark completed

Details

group                                alamb_cache_filter_result              main
-----                                -------------------------              ----
arrow_reader_clickbench/async/Q1     1.00      2.0±0.03ms        ? ?/sec    1.15      2.4±0.01ms        ? ?/sec
arrow_reader_clickbench/async/Q10    1.00     12.9±0.06ms        ? ?/sec    1.08     13.9±0.11ms        ? ?/sec
arrow_reader_clickbench/async/Q11    1.00     14.9±0.16ms        ? ?/sec    1.06     15.8±0.14ms        ? ?/sec
arrow_reader_clickbench/async/Q12    1.00     24.4±0.26ms        ? ?/sec    1.59     38.8±0.25ms        ? ?/sec
arrow_reader_clickbench/async/Q13    1.00     37.6±0.33ms        ? ?/sec    1.39     52.3±0.32ms        ? ?/sec
arrow_reader_clickbench/async/Q14    1.00     35.5±0.24ms        ? ?/sec    1.41     50.0±0.37ms        ? ?/sec
arrow_reader_clickbench/async/Q19    1.01      5.1±0.05ms        ? ?/sec    1.00      5.0±0.07ms        ? ?/sec
arrow_reader_clickbench/async/Q20    1.00    114.6±0.51ms        ? ?/sec    1.42    162.8±0.60ms        ? ?/sec
arrow_reader_clickbench/async/Q21    1.00    132.0±0.61ms        ? ?/sec    1.59    209.4±0.69ms        ? ?/sec
arrow_reader_clickbench/async/Q22    1.00    200.4±0.94ms        ? ?/sec    2.12    425.7±1.52ms        ? ?/sec
arrow_reader_clickbench/async/Q23    1.00   414.9±12.61ms        ? ?/sec    1.18   491.6±11.23ms        ? ?/sec
arrow_reader_clickbench/async/Q24    1.00     41.9±0.46ms        ? ?/sec    1.38     57.7±0.51ms        ? ?/sec
arrow_reader_clickbench/async/Q27    1.00    105.5±0.37ms        ? ?/sec    1.58    166.9±1.13ms        ? ?/sec
arrow_reader_clickbench/async/Q28    1.00    103.1±0.51ms        ? ?/sec    1.59    164.1±0.89ms        ? ?/sec
arrow_reader_clickbench/async/Q30    1.00     64.2±0.58ms        ? ?/sec    1.00     64.2±0.51ms        ? ?/sec
arrow_reader_clickbench/async/Q36    1.38    234.5±1.60ms        ? ?/sec    1.00    169.6±0.96ms        ? ?/sec
arrow_reader_clickbench/async/Q37    1.58    162.6±0.64ms        ? ?/sec    1.00    102.6±0.55ms        ? ?/sec
arrow_reader_clickbench/async/Q38    1.00     38.9±0.26ms        ? ?/sec    1.00     39.1±0.26ms        ? ?/sec
arrow_reader_clickbench/async/Q39    1.00     48.5±0.26ms        ? ?/sec    1.00     48.6±0.25ms        ? ?/sec
arrow_reader_clickbench/async/Q40    1.00     47.8±0.32ms        ? ?/sec    1.11     53.2±0.48ms        ? ?/sec
arrow_reader_clickbench/async/Q41    1.00     40.0±0.30ms        ? ?/sec    1.00     39.9±0.42ms        ? ?/sec
arrow_reader_clickbench/async/Q42    1.00     14.3±0.07ms        ? ?/sec    1.00     14.4±0.18ms        ? ?/sec
arrow_reader_clickbench/sync/Q1      1.00  1848.0±17.20µs        ? ?/sec    1.19      2.2±0.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q10     1.00     12.0±0.08ms        ? ?/sec    1.05     12.6±0.04ms        ? ?/sec
arrow_reader_clickbench/sync/Q11     1.00     13.8±0.07ms        ? ?/sec    1.05     14.4±0.07ms        ? ?/sec
arrow_reader_clickbench/sync/Q12     1.00     25.5±1.92ms        ? ?/sec    1.59     40.6±0.46ms        ? ?/sec
arrow_reader_clickbench/sync/Q13     1.00     36.2±1.26ms        ? ?/sec    1.49     54.0±2.06ms        ? ?/sec
arrow_reader_clickbench/sync/Q14     1.00     33.9±0.35ms        ? ?/sec    1.52     51.4±0.30ms        ? ?/sec
arrow_reader_clickbench/sync/Q19     1.04      4.4±0.11ms        ? ?/sec    1.00      4.2±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q20     1.00    138.2±1.41ms        ? ?/sec    1.29    178.9±1.12ms        ? ?/sec
arrow_reader_clickbench/sync/Q21     1.00    135.0±1.00ms        ? ?/sec    1.75    236.5±1.61ms        ? ?/sec
arrow_reader_clickbench/sync/Q22     1.00    198.4±4.34ms        ? ?/sec    2.47    490.2±2.72ms        ? ?/sec
arrow_reader_clickbench/sync/Q23     1.00    375.7±8.92ms        ? ?/sec    1.16   433.9±10.25ms        ? ?/sec
arrow_reader_clickbench/sync/Q24     1.00     38.5±0.44ms        ? ?/sec    1.42     54.8±0.67ms        ? ?/sec
arrow_reader_clickbench/sync/Q27     1.00     95.3±0.41ms        ? ?/sec    1.64    156.6±0.97ms        ? ?/sec
arrow_reader_clickbench/sync/Q28     1.00     93.2±0.54ms        ? ?/sec    1.65    153.8±0.74ms        ? ?/sec
arrow_reader_clickbench/sync/Q30     1.01     62.3±0.39ms        ? ?/sec    1.00     61.8±0.38ms        ? ?/sec
arrow_reader_clickbench/sync/Q36     5.27   835.6±11.42ms        ? ?/sec    1.00    158.6±0.82ms        ? ?/sec
arrow_reader_clickbench/sync/Q37     5.89    561.1±3.23ms        ? ?/sec    1.00     95.2±0.51ms        ? ?/sec
arrow_reader_clickbench/sync/Q38     1.00     31.6±0.24ms        ? ?/sec    1.00     31.7±0.32ms        ? ?/sec
arrow_reader_clickbench/sync/Q39     1.01     35.0±0.32ms        ? ?/sec    1.00     34.7±0.28ms        ? ?/sec
arrow_reader_clickbench/sync/Q40     1.00     44.2±0.28ms        ? ?/sec    1.12     49.3±0.33ms        ? ?/sec
arrow_reader_clickbench/sync/Q41     1.01     37.1±0.29ms        ? ?/sec    1.00     36.8±0.19ms        ? ?/sec
arrow_reader_clickbench/sync/Q42     1.00     13.6±0.06ms        ? ?/sec    1.00     13.5±0.06ms        ? ?/sec

It seems regression for Q36/Q37.

alamb · 2025-05-20T14:11:09Z

It seems regression for Q36/Q37.

Yes, I agree -- I will figure out why

alamb · 2025-05-20T17:33:49Z

It seems regression for Q36/Q37.

Yes, I agree -- I will figure out why

I did some profiling:

samply record target/release/deps/arrow_reader_clickbench-aef15514767c9665 --bench arrow_reader_clickbench/sync/Q36

Basically, the issue is that calling slice() is taking a non trivial amount of the time for Q36

I added some printlns and it seems like we have 181k rows in total that pass but the number of buffers is crazy (I think this is related to concat not compacting the ByteViewArray). Working on this...

ByteViewArray::slice offset=8192 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=16384 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=24576 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=32768 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=40960 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=49152 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=57344 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=65536 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=73728 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=81920 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=90112 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=98304 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=106496 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=114688 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=122880 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=131072 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=139264 length=8192, total_rows: 181198 buffer_count: 542225

alamb · 2025-05-20T18:59:21Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/cache_filter_result (c0c3eb4) to 45bda04 diff
BENCH_NAME=arrow_reader_clickbench
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench arrow_reader_clickbench
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_cache_filter_result
Results will be posted here when complete

alamb · 2025-05-20T19:24:22Z

🤖: Benchmark completed

Details

group                                alamb_cache_filter_result              main
-----                                -------------------------              ----
arrow_reader_clickbench/async/Q1     1.00      2.0±0.01ms        ? ?/sec    1.16      2.4±0.01ms        ? ?/sec
arrow_reader_clickbench/async/Q10    1.00     14.2±0.16ms        ? ?/sec    1.03     14.7±0.13ms        ? ?/sec
arrow_reader_clickbench/async/Q11    1.00     16.1±0.14ms        ? ?/sec    1.03     16.5±0.18ms        ? ?/sec
arrow_reader_clickbench/async/Q12    1.00     27.4±0.33ms        ? ?/sec    1.39     38.0±0.30ms        ? ?/sec
arrow_reader_clickbench/async/Q13    1.00     39.9±0.33ms        ? ?/sec    1.29     51.6±0.41ms        ? ?/sec
arrow_reader_clickbench/async/Q14    1.00     38.3±0.34ms        ? ?/sec    1.30     49.7±0.31ms        ? ?/sec
arrow_reader_clickbench/async/Q19    1.01      5.2±0.07ms        ? ?/sec    1.00      5.1±0.08ms        ? ?/sec
arrow_reader_clickbench/async/Q20    1.00    114.5±0.73ms        ? ?/sec    1.38    158.5±0.67ms        ? ?/sec
arrow_reader_clickbench/async/Q21    1.00    131.5±0.79ms        ? ?/sec    1.68    220.4±1.03ms        ? ?/sec
arrow_reader_clickbench/async/Q22    1.00    234.3±8.23ms        ? ?/sec    2.07    486.1±2.04ms        ? ?/sec
arrow_reader_clickbench/async/Q23    1.00   440.6±13.11ms        ? ?/sec    1.11   489.2±17.69ms        ? ?/sec
arrow_reader_clickbench/async/Q24    1.00     45.0±0.37ms        ? ?/sec    1.29     58.1±0.59ms        ? ?/sec
arrow_reader_clickbench/async/Q27    1.00    119.0±0.58ms        ? ?/sec    1.36    161.5±0.80ms        ? ?/sec
arrow_reader_clickbench/async/Q28    1.00    115.4±0.73ms        ? ?/sec    1.39    160.0±0.95ms        ? ?/sec
arrow_reader_clickbench/async/Q30    1.01     65.7±0.48ms        ? ?/sec    1.00     64.8±0.60ms        ? ?/sec
arrow_reader_clickbench/async/Q36    1.00    129.4±0.83ms        ? ?/sec    1.29    167.2±0.84ms        ? ?/sec
arrow_reader_clickbench/async/Q37    1.00     99.2±0.68ms        ? ?/sec    1.00     98.9±0.53ms        ? ?/sec
arrow_reader_clickbench/async/Q38    1.01     39.9±0.27ms        ? ?/sec    1.00     39.5±0.30ms        ? ?/sec
arrow_reader_clickbench/async/Q39    1.01     49.4±0.40ms        ? ?/sec    1.00     49.0±0.38ms        ? ?/sec
arrow_reader_clickbench/async/Q40    1.00     49.1±0.66ms        ? ?/sec    1.09     53.5±0.43ms        ? ?/sec
arrow_reader_clickbench/async/Q41    1.00     41.2±0.47ms        ? ?/sec    1.00     41.0±0.40ms        ? ?/sec
arrow_reader_clickbench/async/Q42    1.00     14.7±0.18ms        ? ?/sec    1.00     14.6±0.15ms        ? ?/sec
arrow_reader_clickbench/sync/Q1      1.00   1843.8±9.23µs        ? ?/sec    1.20      2.2±0.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q10     1.00     13.0±0.07ms        ? ?/sec    1.03     13.3±0.12ms        ? ?/sec
arrow_reader_clickbench/sync/Q11     1.00     14.8±0.11ms        ? ?/sec    1.02     15.2±0.10ms        ? ?/sec
arrow_reader_clickbench/sync/Q12     1.00     32.4±0.50ms        ? ?/sec    1.25     40.6±0.31ms        ? ?/sec
arrow_reader_clickbench/sync/Q13     1.00     44.3±0.42ms        ? ?/sec    1.21     53.7±0.46ms        ? ?/sec
arrow_reader_clickbench/sync/Q14     1.00     42.9±0.51ms        ? ?/sec    1.22     52.3±0.46ms        ? ?/sec
arrow_reader_clickbench/sync/Q19     1.02      4.4±0.02ms        ? ?/sec    1.00      4.3±0.05ms        ? ?/sec
arrow_reader_clickbench/sync/Q20     1.00   121.8±10.69ms        ? ?/sec    1.44    175.5±1.15ms        ? ?/sec
arrow_reader_clickbench/sync/Q21     1.00    137.2±9.68ms        ? ?/sec    1.70    233.1±1.71ms        ? ?/sec
arrow_reader_clickbench/sync/Q22     1.00    214.2±9.00ms        ? ?/sec    2.22    475.1±3.54ms        ? ?/sec
arrow_reader_clickbench/sync/Q23     1.00   383.2±15.35ms        ? ?/sec    1.16   442.7±15.50ms        ? ?/sec
arrow_reader_clickbench/sync/Q24     1.00     41.7±0.48ms        ? ?/sec    1.31     54.5±0.58ms        ? ?/sec
arrow_reader_clickbench/sync/Q27     1.13   172.6±10.81ms        ? ?/sec    1.00    152.3±1.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q28     1.06    158.6±6.71ms        ? ?/sec    1.00    150.2±0.76ms        ? ?/sec
arrow_reader_clickbench/sync/Q30     1.03     64.3±0.70ms        ? ?/sec    1.00     62.5±0.48ms        ? ?/sec
arrow_reader_clickbench/sync/Q36     1.00    119.8±0.89ms        ? ?/sec    1.31    157.5±0.88ms        ? ?/sec
arrow_reader_clickbench/sync/Q37     1.01     93.6±0.71ms        ? ?/sec    1.00     92.3±0.40ms        ? ?/sec
arrow_reader_clickbench/sync/Q38     1.02     32.3±0.25ms        ? ?/sec    1.00     31.7±0.21ms        ? ?/sec
arrow_reader_clickbench/sync/Q39     1.02     35.1±0.39ms        ? ?/sec    1.00     34.3±0.26ms        ? ?/sec
arrow_reader_clickbench/sync/Q40     1.00     45.5±0.48ms        ? ?/sec    1.11     50.5±0.37ms        ? ?/sec
arrow_reader_clickbench/sync/Q41     1.01     38.2±0.32ms        ? ?/sec    1.00     37.9±0.28ms        ? ?/sec
arrow_reader_clickbench/sync/Q42     1.01     13.7±0.07ms        ? ?/sec    1.00     13.6±0.06ms        ? ?/sec

alamb · 2025-05-20T19:41:46Z

🤖: Benchmark completed

Well, that is looking quite a bit better

I am now working on a way to reduce buffering requirements (will require incremental concat'ing)

zhuqi-lucas · 2025-05-21T07:02:00Z

🤖: Benchmark completed

Well, that is looking quite a bit better

I am now working on a way to reduce buffering requirements (will require incremental concat'ing)

Amazing result @alamb , it looks pretty cool!

alamb · 2025-06-02T14:30:27Z

I looked at the profiling for using the incremental record batch builder in DataFusion -- 5% of the time is taken copying StringView data. Maybe we can improve that as well.

Dandandan · 2025-06-02T21:57:26Z

arrow-array/src/builder/generic_bytes_view_builder.rs

+                let view_len = *v as u32;
+                // if view_len is 12 or less, data is inlined and doesn't need an update
+                // if view is 12 or more, need to update the buffer offset
+                if view_len > 12 {


I think the check can be omitted (and will be faster without it).

Views with length less than 12 have the entire string inlined -- so if we update the buffer_index "field" for such small views we may well be updating the inlined string bytes. So I do think we need the check here

Dandandan · 2025-06-02T22:00:16Z

arrow-array/src/builder/generic_bytes_view_builder.rs

+            }
+
+            // Update any views that point to the old buffers
+            for v in views.as_slice_mut()[starting_view..].iter_mut() {


It would be faster to extend in one go, rather than first append_from_slice and later iter_mut (in views.append_slice(array.views()))

Dandandan · 2025-06-02T22:01:18Z

arrow-array/src/builder/generic_bytes_view_builder.rs

+                // if view_len is 12 or less, data is inlined and doesn't need an update
+                // if view is 12 or more, need to update the buffer offset
+                if view_len > 12 {
+                    let mut view = ByteView::from(*v);


ByteView can use std::mem::transmute for from / as_u128 which I think would be faster.

Thank you. I need to make some benchmarks so I can test these ideas more easily. Right now the cycle time is pretty long as I need to rebuild datafusion and then kick off a benchmark run.

Dandandan · 2025-06-02T22:06:19Z

arrow-array/src/builder/generic_bytes_view_builder.rs

+        } else {
+            // otherwise the array is sparse so copy the data into a single new
+            // buffer as well as updating the views
+            let mut new_buffer: Vec<u8> = Vec::with_capacity(used_buffer_size);


I wonder if this doesn't create too small buffer size after filtering?

Shouldn't we create a in progress buffer with a larger buffer size if used_buffer_size is smaller than that (based on allocation strategy) and then use that one instead of creating small buffers and flushing them every time?

(That could be one benefit from concat / potentially buffering filter - we can know the exact capacity upfront).

I think you are 100% right - the copying should be handled in a similar way to the rest of the builder: fill up any remaining allocation first and allocate new buffers following the normal allocation strategy. I will pursue that approach

Dandandan · 2025-06-02T22:06:57Z

arrow-array/src/builder/generic_bytes_view_builder.rs

+            let mut new_buffer: Vec<u8> = Vec::with_capacity(used_buffer_size);
+            let new_buffer_index = buffers.len() as u32; // making one new buffer
+                                                         // Update any views that point to the old buffers.
+            for v in views.as_slice_mut()[starting_view..].iter_mut() {


same - better to extend

Dandandan · 2025-06-02T22:24:02Z

I added some insights for the string view data when looking at the concat and string gc implementations.

Dandandan · 2025-06-02T22:29:17Z

arrow-array/src/array/byte_view_array.rs

@@ -538,6 +538,18 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {

        l_full_data.cmp(r_full_data)
    }
+
+    /// return the total number of bytes required to hold all strings pointed to by views in this array
+    pub fn minimum_buffer_size(&self) -> usize {


probably compiles to the same, but I think using sum is slightly more idiomatic.

self.views.iter() .map(|v| { let len = (*v as u32) as usize; if len > 12 { len } else { 0 } }) .sum()

Yeah, I used sum before and it showed up in a profile -- I was trying to see if I could get the code faster but I didn't do it scientifically

Dandandan · 2025-06-02T22:55:35Z

arrow-select/src/incremental_batch_builder.rs

+    ///
+    /// This method optimizes for the case where the filter selects all or no rows
+    /// and ensures all output arrays in `current` is at most `batch_size` rows long.
+    pub fn append_filtered(


I think for any binary data it makes sense to buffer the BooleanArray up until batch_size row limit (or number of array limit if it gets too large) so the final capacity of data buffer can be determined in one go.

The challenge with this strategy is that it may hold on to many many buffers (for example, if the array only holds on to 1 row from each batch, each 8MB or whatever buffer is mostly unused.

I will make a benchmark to explore the options more fully

Yeah true I thought about this as well (right now this is the case with concat after filtering views as well, but I think we can do better with e.g. setting a memory budget)

Well, I started writing a benchmark and then realized I had nothing to benchmark. So I ported a version of code from DataFusion here:

Add coalesce kernel andBatchCoalescer for statefully combining selected b…atches: #7597

I plan to make a benchmark tomorrow and hopefully start playing around with the optimizations

Dandandan · 2025-06-02T23:14:10Z

arrow-select/src/filter.rs

+        // Flush the in-progress buffer
+        unsafe {
+            // All views were coped from `array`
+            self.finalize_copied_views(starting_view, array);


it would be good to delay this until finishing the buffer (so it can allocate larger buffer and go through the views in one go.

Right, but as mentioned above that will hold on to quite a bit more memory for selective filters. I have some other ideas of how to improve it

Dandandan · 2025-06-03T10:21:47Z

arrow-array/src/builder/generic_bytes_view_builder.rs

+    /// Append all views from the given array into the inprogress builder
+    ///
+    /// Will copy the underlying views based on the value of target_buffer_load_factor
+    pub fn append_array(&mut self, array: &GenericByteViewArray<T>) {


This can be used in concat as well (a similar implementation is almost 1.5x as fast on string views)

I will file a ticket for this idea which we can pursue separately

…lected b…atches: (#7597) # Which issue does this PR close? - Part of #6692 - Part of #7589 # Rationale for this change The pattern of combining multiple small RecordBatches to form one larger one for subsequent processing is common in query engines like DataFusion which filter or partition incoming Arrays. Current best practice is to use the `filter` or `take` kernels and then `concat` kernels as explained in - #6692 This pattern also appears in my attempt to improve parquet filter performance (to cache the result of applying a filter rather than re-decoding the results). See - apache/datafusion#3463 - #7513 The current pattern is non optimal as it requires: 1. At least 2x peak memory (holding the input and output of `concat`) 2. 2 copies of the data (to create the output of `filter` and then create the output of `concat`) The theory is that with sufficient optimization we can reduce the peak memory requirements and (possibly) make it faster as well. However, to add a benchmark for this filter+concat, I basically had nothing to benchmark. Specifically, there needed to be an API to call. - Note I also made a PR to DataFusion showing this API can be used and it is not slower: apache/datafusion#16249 # What changes are included in this PR? I ported the code from DataFusion downstream upstream into arrow-rs so 1. We can use it in the parquet reader 2. We can benchmark and optimize it appropriately 1. Add `BatchCoalescer` to `arrow-select`, and tests 2. Update documentation 2. Add examples 3. Add a `pub` export in `arrow` 4. Add Benchmark # Are there any user-facing changes? This is a new API. I next plan to make an benchmark for this particular

alamb · 2025-06-06T19:40:30Z

In case anyone is following along, we added the coalesce kernel and have started optimizing it

alamb · 2025-06-24T12:07:25Z

Let's continue work on POC: Parquet predicate results cache #7760

github-actions bot added the parquet Changes to the parquet crate label May 15, 2025

alamb force-pushed the alamb/cache_filter_result branch from 78f96d1 to 31f2fa1 Compare May 15, 2025 19:39

alamb mentioned this pull request May 15, 2025

Introduce ReadPlan to encapsulate the calculation of what parquet rows to decode #7502

Merged

alamb force-pushed the alamb/cache_filter_result branch from 31f2fa1 to 244e187 Compare May 15, 2025 20:33

zhuqi-lucas reviewed May 16, 2025

View reviewed changes

alamb force-pushed the alamb/cache_filter_result branch 2 times, most recently from 8961196 to 9e91e9f Compare May 16, 2025 12:48

zhuqi-lucas reviewed May 16, 2025

View reviewed changes

alamb force-pushed the alamb/cache_filter_result branch from 9e91e9f to 147c7a7 Compare May 16, 2025 14:50

alamb force-pushed the alamb/cache_filter_result branch from 147c7a7 to f1f7103 Compare May 16, 2025 15:08

alamb mentioned this pull request May 16, 2025

[EPIC] Faster performance for parquet predicate evaluation for non selective filters #7456

Open

8 tasks

This was referenced May 16, 2025

Refactor build_array_reader into a struct #7521

Merged

Weekly Plan: Andrew Lamb 2025-05-12 apache/datafusion#16022

Closed

alamb force-pushed the alamb/cache_filter_result branch from f1f7103 to a0e4b29 Compare May 20, 2025 17:12

alamb force-pushed the alamb/cache_filter_result branch from 4dd1d69 to 893819a Compare May 20, 2025 20:52

alamb mentioned this pull request May 20, 2025

Optimize take/filter/concat from multiple input arrays to a single large output array #6692

Open

github-actions bot added the arrow Changes to the arrow crate label May 22, 2025

alamb added 10 commits June 2, 2025 09:59

Speed up with trusted len iter

29e42cf

clippy

0936edf

Optimize StringView filter

9185a3f

fix bug found in datafusion

5129b5d

Make incremental batch builders accessable

d74affd

export more

91e2516

remove println

ce57cff

temp: only use incremental builder for stringview

53b3430

Use finish allocation

fc9269e

Add FixedSizeBinary to take_kernel benchmark

80b30ca

alamb force-pushed the alamb/cache_filter_result branch from ffb8f44 to 80b30ca Compare June 2, 2025 14:29

tweak minimum buffer size

055c7f9

Dandandan reviewed Jun 2, 2025

View reviewed changes

Dandandan reviewed Jun 3, 2025

View reviewed changes

alamb mentioned this pull request Jun 3, 2025

Add coalesce kernel andBatchCoalescer for statefully combining selected b…atches: #7597

Merged

alamb changed the title ~~POC: Sketch out cached filter result API~~ POC: Sketch out parquet cached filter result API Jun 24, 2025

alamb mentioned this pull request Jun 24, 2025

POC: Parquet predicate results cache #7760

Draft

4 tasks

alamb closed this Jun 24, 2025

POC: Sketch out parquet cached filter result API #7513

POC: Sketch out parquet cached filter result API #7513

Uh oh!

Conversation

alamb commented May 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented May 16, 2025

Uh oh!

zhuqi-lucas commented May 16, 2025

Uh oh!

alamb commented May 16, 2025

Uh oh!

zhuqi-lucas commented May 16, 2025

Uh oh!

alamb commented May 16, 2025

Uh oh!

alamb commented May 16, 2025

Uh oh!

zhuqi-lucas commented May 20, 2025

Uh oh!

alamb commented May 20, 2025

Uh oh!

alamb commented May 20, 2025

Uh oh!

alamb commented May 20, 2025

Uh oh!

alamb commented May 20, 2025

Uh oh!

alamb commented May 20, 2025

Uh oh!

zhuqi-lucas commented May 21, 2025

Uh oh!

alamb commented Jun 2, 2025

Uh oh!

Dandandan Jun 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Dandandan Jun 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Dandandan Jun 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Dandandan Jun 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Dandandan commented Jun 2, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Dandandan Jun 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

alamb commented May 15, 2025 •

edited

Loading

Dandandan Jun 2, 2025 •

edited

Loading

Dandandan Jun 2, 2025 •

edited

Loading

Dandandan Jun 2, 2025 •

edited

Loading

Dandandan Jun 2, 2025 •

edited

Loading

Dandandan Jun 2, 2025 •

edited

Loading

Dandandan Jun 3, 2025 •

edited

Loading