POC: Sketch out parquet cached filter result API #7513


Closed
wants to merge 21 commits into from

Conversation

Contributor

@alamb alamb commented May 15, 2025

Draft until:

  • Pull out StringViewBuilder::concat_array into its own PR
  • Avoid double buffering of intermediate results
  • Add memory limit for results cache

Which issue does this PR close?

Rationale for this change

I am trying to sketch out enough of a cached filter result API to show performance improvements. Once I have done that, I will start proposing how to break it up into smaller PRs

What changes are included in this PR?

  1. Add code to cache columns which are reused in filter and scan

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label May 15, 2025
@alamb alamb force-pushed the alamb/cache_filter_result branch from 78f96d1 to 31f2fa1 Compare May 15, 2025 19:39
@alamb alamb force-pushed the alamb/cache_filter_result branch from 31f2fa1 to 244e187 Compare May 15, 2025 20:33
filters: Vec<BooleanArray>,
}

impl CachedPredicateResultBuilder {
Contributor

Nice, this is a very clear way to get the cached result!

@alamb alamb force-pushed the alamb/cache_filter_result branch 2 times, most recently from 8961196 to 9e91e9f Compare May 16, 2025 12:48
/// TODO: potentially incrementally build the result of the predicate
/// evaluation without holding all the batches in memory. See
/// <https://github.com/apache/arrow-rs/issues/6692>
in_progress_arrays: Vec<Box<dyn InProgressArray>>,
Contributor

Thank you @alamb,

Does this mean that in_progress_arrays is not the final result used to generate the final batch?

For example:
Predicate a > 1 => in_progress_array_a filtered by a > 1
Predicate b > 2 => in_progress_array_b filtered by b > 2 (and also by a > 1), but we don't update in_progress_array_a

Contributor Author

This is an excellent question

What I was thinking is that CachedPredicateResult would manage the "currently" applied predicate

So in the case where there are multiple predicates, I was thinking of a method like CachedPredicateResult::merge which could take the result of filtering by a and apply the result of filtering by b

We can then put heuristics / logic for if/when we materialize the filters into CachedPredicateResult

But that is sort of speculation at this point -- I don't have it all worked out yet

My plan is to get far enough to show this structure works and can improve performance, and then I'll work on the trickier logic of applying multiple filters
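
For illustration, a minimal sketch of what composing two cached filter results could look like (this is not the PR's actual CachedPredicateResult::merge; the helper name and signature are hypothetical). The second filter is evaluated only on the rows the first one kept, so merging means scattering its values back into the first filter's selected positions:

use arrow_array::BooleanArray;

/// Hypothetical helper: `outer` selects rows of the original batch, and
/// `inner` was evaluated only on the rows `outer` selected. The result
/// selects the original rows that pass both predicates.
fn compose_filters(outer: &BooleanArray, inner: &BooleanArray) -> BooleanArray {
    // the inner filter must have one entry per row selected by the outer filter
    assert_eq!(outer.true_count(), inner.len());
    let mut inner_iter = inner.iter();
    outer
        .iter()
        .map(|selected| {
            if selected == Some(true) {
                // consume the next inner result for this selected row
                inner_iter.next().unwrap()
            } else {
                Some(false)
            }
        })
        .collect()
}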

Contributor

CachedPredicateResult::merge method which could take the result of filtering a and apply the result of filtering by b

Great idea!

But that is sort of speculation at this point -- I don't have it all worked out yet

Sure, I will continue to review, thank you @alamb!

@alamb alamb force-pushed the alamb/cache_filter_result branch from 9e91e9f to 147c7a7 Compare May 16, 2025 14:50
@alamb
Contributor Author

alamb commented May 16, 2025

I tested this branch using a query that filters and selects the same column (NOTE: it is critical NOT to use --all-features, as --all-features turns on force_validate):

cargo bench --features="arrow async" --bench arrow_reader_clickbench -- Q24

Here are the benchmark results (30ms --> 22ms, ~25% faster):

Gnuplot not found, using plotters backend
Looking for ClickBench files starting in current_dir and all parent directories: "/Users/andrewlamb/Software/arrow-rs/parquet"
arrow_reader_clickbench/sync/Q24
                        time:   [22.532 ms 22.604 ms 22.682 ms]
                        change: [-27.751% -27.245% -26.791%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

arrow_reader_clickbench/async/Q24
                        time:   [24.043 ms 24.171 ms 24.308 ms]
                        change: [-26.223% -25.697% -25.172%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

I realize this branch currently uses more memory (to buffer the filter results), but I think the additional memory growth can be limited with a setting.

@alamb alamb force-pushed the alamb/cache_filter_result branch from 147c7a7 to f1f7103 Compare May 16, 2025 15:08
@zhuqi-lucas
Contributor

I tested this branch using a query that filters and selects the same column ... Here are the benchmark results (30ms --> 22ms, ~25% faster) ...

Amazing result! I think this will be a better approach than a page cache, because a page cache can have cache misses, but this PR will always cache the result!

@alamb
Contributor Author

alamb commented May 16, 2025

Amazing result! I think this will be a better approach than a page cache, because a page cache can have cache misses, but this PR will always cache the result!

Thanks -- I think one potential problem is that the cached results may consume too much memory (but I will try and handle that shortly)

I think we should proceed with starting to merge some refactorings; I left some suggestions here:

@zhuqi-lucas
Contributor

Thanks -- I think one potential problem is that the cached results may consume too much memory ... I think we should proceed with starting to merge some refactorings; I left some suggestions here:

It makes sense! Thank you @alamb.

@alamb
Contributor Author

alamb commented May 16, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/cache_filter_result (f1f7103) to 1a5999a diff
BENCH_NAME=arrow_reader_clickbench
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench arrow_reader_clickbench
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_cache_filter_result
Results will be posted here when complete

@alamb
Contributor Author

alamb commented May 16, 2025

🤖: Benchmark completed

Details

group                                alamb_cache_filter_result              main
-----                                -------------------------              ----
arrow_reader_clickbench/async/Q1     1.00      2.0±0.03ms        ? ?/sec    1.15      2.4±0.01ms        ? ?/sec
arrow_reader_clickbench/async/Q10    1.00     12.9±0.06ms        ? ?/sec    1.08     13.9±0.11ms        ? ?/sec
arrow_reader_clickbench/async/Q11    1.00     14.9±0.16ms        ? ?/sec    1.06     15.8±0.14ms        ? ?/sec
arrow_reader_clickbench/async/Q12    1.00     24.4±0.26ms        ? ?/sec    1.59     38.8±0.25ms        ? ?/sec
arrow_reader_clickbench/async/Q13    1.00     37.6±0.33ms        ? ?/sec    1.39     52.3±0.32ms        ? ?/sec
arrow_reader_clickbench/async/Q14    1.00     35.5±0.24ms        ? ?/sec    1.41     50.0±0.37ms        ? ?/sec
arrow_reader_clickbench/async/Q19    1.01      5.1±0.05ms        ? ?/sec    1.00      5.0±0.07ms        ? ?/sec
arrow_reader_clickbench/async/Q20    1.00    114.6±0.51ms        ? ?/sec    1.42    162.8±0.60ms        ? ?/sec
arrow_reader_clickbench/async/Q21    1.00    132.0±0.61ms        ? ?/sec    1.59    209.4±0.69ms        ? ?/sec
arrow_reader_clickbench/async/Q22    1.00    200.4±0.94ms        ? ?/sec    2.12    425.7±1.52ms        ? ?/sec
arrow_reader_clickbench/async/Q23    1.00   414.9±12.61ms        ? ?/sec    1.18   491.6±11.23ms        ? ?/sec
arrow_reader_clickbench/async/Q24    1.00     41.9±0.46ms        ? ?/sec    1.38     57.7±0.51ms        ? ?/sec
arrow_reader_clickbench/async/Q27    1.00    105.5±0.37ms        ? ?/sec    1.58    166.9±1.13ms        ? ?/sec
arrow_reader_clickbench/async/Q28    1.00    103.1±0.51ms        ? ?/sec    1.59    164.1±0.89ms        ? ?/sec
arrow_reader_clickbench/async/Q30    1.00     64.2±0.58ms        ? ?/sec    1.00     64.2±0.51ms        ? ?/sec
arrow_reader_clickbench/async/Q36    1.38    234.5±1.60ms        ? ?/sec    1.00    169.6±0.96ms        ? ?/sec
arrow_reader_clickbench/async/Q37    1.58    162.6±0.64ms        ? ?/sec    1.00    102.6±0.55ms        ? ?/sec
arrow_reader_clickbench/async/Q38    1.00     38.9±0.26ms        ? ?/sec    1.00     39.1±0.26ms        ? ?/sec
arrow_reader_clickbench/async/Q39    1.00     48.5±0.26ms        ? ?/sec    1.00     48.6±0.25ms        ? ?/sec
arrow_reader_clickbench/async/Q40    1.00     47.8±0.32ms        ? ?/sec    1.11     53.2±0.48ms        ? ?/sec
arrow_reader_clickbench/async/Q41    1.00     40.0±0.30ms        ? ?/sec    1.00     39.9±0.42ms        ? ?/sec
arrow_reader_clickbench/async/Q42    1.00     14.3±0.07ms        ? ?/sec    1.00     14.4±0.18ms        ? ?/sec
arrow_reader_clickbench/sync/Q1      1.00  1848.0±17.20µs        ? ?/sec    1.19      2.2±0.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q10     1.00     12.0±0.08ms        ? ?/sec    1.05     12.6±0.04ms        ? ?/sec
arrow_reader_clickbench/sync/Q11     1.00     13.8±0.07ms        ? ?/sec    1.05     14.4±0.07ms        ? ?/sec
arrow_reader_clickbench/sync/Q12     1.00     25.5±1.92ms        ? ?/sec    1.59     40.6±0.46ms        ? ?/sec
arrow_reader_clickbench/sync/Q13     1.00     36.2±1.26ms        ? ?/sec    1.49     54.0±2.06ms        ? ?/sec
arrow_reader_clickbench/sync/Q14     1.00     33.9±0.35ms        ? ?/sec    1.52     51.4±0.30ms        ? ?/sec
arrow_reader_clickbench/sync/Q19     1.04      4.4±0.11ms        ? ?/sec    1.00      4.2±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q20     1.00    138.2±1.41ms        ? ?/sec    1.29    178.9±1.12ms        ? ?/sec
arrow_reader_clickbench/sync/Q21     1.00    135.0±1.00ms        ? ?/sec    1.75    236.5±1.61ms        ? ?/sec
arrow_reader_clickbench/sync/Q22     1.00    198.4±4.34ms        ? ?/sec    2.47    490.2±2.72ms        ? ?/sec
arrow_reader_clickbench/sync/Q23     1.00    375.7±8.92ms        ? ?/sec    1.16   433.9±10.25ms        ? ?/sec
arrow_reader_clickbench/sync/Q24     1.00     38.5±0.44ms        ? ?/sec    1.42     54.8±0.67ms        ? ?/sec
arrow_reader_clickbench/sync/Q27     1.00     95.3±0.41ms        ? ?/sec    1.64    156.6±0.97ms        ? ?/sec
arrow_reader_clickbench/sync/Q28     1.00     93.2±0.54ms        ? ?/sec    1.65    153.8±0.74ms        ? ?/sec
arrow_reader_clickbench/sync/Q30     1.01     62.3±0.39ms        ? ?/sec    1.00     61.8±0.38ms        ? ?/sec
arrow_reader_clickbench/sync/Q36     5.27   835.6±11.42ms        ? ?/sec    1.00    158.6±0.82ms        ? ?/sec
arrow_reader_clickbench/sync/Q37     5.89    561.1±3.23ms        ? ?/sec    1.00     95.2±0.51ms        ? ?/sec
arrow_reader_clickbench/sync/Q38     1.00     31.6±0.24ms        ? ?/sec    1.00     31.7±0.32ms        ? ?/sec
arrow_reader_clickbench/sync/Q39     1.01     35.0±0.32ms        ? ?/sec    1.00     34.7±0.28ms        ? ?/sec
arrow_reader_clickbench/sync/Q40     1.00     44.2±0.28ms        ? ?/sec    1.12     49.3±0.33ms        ? ?/sec
arrow_reader_clickbench/sync/Q41     1.01     37.1±0.29ms        ? ?/sec    1.00     36.8±0.19ms        ? ?/sec
arrow_reader_clickbench/sync/Q42     1.00     13.6±0.06ms        ? ?/sec    1.00     13.5±0.06ms        ? ?/sec

@zhuqi-lucas
Contributor

🤖: Benchmark completed ...

There seems to be a regression for Q36/Q37.

@alamb
Contributor Author

alamb commented May 20, 2025

There seems to be a regression for Q36/Q37.

Yes, I agree -- I will figure out why

@alamb alamb force-pushed the alamb/cache_filter_result branch from f1f7103 to a0e4b29 Compare May 20, 2025 17:12
@alamb
Contributor Author

alamb commented May 20, 2025

There seems to be a regression for Q36/Q37.

Yes, I agree -- I will figure out why

I did some profiling:

samply record target/release/deps/arrow_reader_clickbench-aef15514767c9665 --bench arrow_reader_clickbench/sync/Q36

Basically, the issue is that calling slice() takes a non-trivial amount of the time for Q36.

[Screenshot: profiler output, 2025-05-20 1:23 PM]

I added some printlns and it seems like we have 181k rows in total that pass, but the number of buffers is crazy (I think this is related to concat not compacting the ByteViewArray). Working on this...

ByteViewArray::slice offset=8192 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=16384 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=24576 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=32768 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=40960 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=49152 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=57344 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=65536 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=73728 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=81920 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=90112 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=98304 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=106496 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=114688 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=122880 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=131072 length=8192, total_rows: 181198 buffer_count: 542225
ByteViewArray::slice offset=139264 length=8192, total_rows: 181198 buffer_count: 542225
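
For reference, one existing tool for this kind of compaction is GenericByteViewArray::gc, which copies only the bytes that the views still reference into freshly packed buffers; this is not necessarily the fix used in this PR. A tiny illustrative helper (the threshold heuristic is made up):

use arrow_array::{Array, StringViewArray};

/// Illustrative only: `gc` rewrites a view array so its views point into
/// freshly packed data buffers, dropping the (possibly huge) set of buffers
/// inherited from the source arrays.
fn compact_if_bloated(array: &StringViewArray) -> StringViewArray {
    // made-up heuristic: compact when there are far more buffers than
    // we would expect for arrays of this length
    if array.data_buffers().len() > 4 * (array.len() / 8192 + 1) {
        array.gc()
    } else {
        array.clone()
    }
}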

@alamb
Contributor Author

alamb commented May 20, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/cache_filter_result (c0c3eb4) to 45bda04 diff
BENCH_NAME=arrow_reader_clickbench
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench arrow_reader_clickbench
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_cache_filter_result
Results will be posted here when complete

@alamb
Contributor Author

alamb commented May 20, 2025

🤖: Benchmark completed

Details

group                                alamb_cache_filter_result              main
-----                                -------------------------              ----
arrow_reader_clickbench/async/Q1     1.00      2.0±0.01ms        ? ?/sec    1.16      2.4±0.01ms        ? ?/sec
arrow_reader_clickbench/async/Q10    1.00     14.2±0.16ms        ? ?/sec    1.03     14.7±0.13ms        ? ?/sec
arrow_reader_clickbench/async/Q11    1.00     16.1±0.14ms        ? ?/sec    1.03     16.5±0.18ms        ? ?/sec
arrow_reader_clickbench/async/Q12    1.00     27.4±0.33ms        ? ?/sec    1.39     38.0±0.30ms        ? ?/sec
arrow_reader_clickbench/async/Q13    1.00     39.9±0.33ms        ? ?/sec    1.29     51.6±0.41ms        ? ?/sec
arrow_reader_clickbench/async/Q14    1.00     38.3±0.34ms        ? ?/sec    1.30     49.7±0.31ms        ? ?/sec
arrow_reader_clickbench/async/Q19    1.01      5.2±0.07ms        ? ?/sec    1.00      5.1±0.08ms        ? ?/sec
arrow_reader_clickbench/async/Q20    1.00    114.5±0.73ms        ? ?/sec    1.38    158.5±0.67ms        ? ?/sec
arrow_reader_clickbench/async/Q21    1.00    131.5±0.79ms        ? ?/sec    1.68    220.4±1.03ms        ? ?/sec
arrow_reader_clickbench/async/Q22    1.00    234.3±8.23ms        ? ?/sec    2.07    486.1±2.04ms        ? ?/sec
arrow_reader_clickbench/async/Q23    1.00   440.6±13.11ms        ? ?/sec    1.11   489.2±17.69ms        ? ?/sec
arrow_reader_clickbench/async/Q24    1.00     45.0±0.37ms        ? ?/sec    1.29     58.1±0.59ms        ? ?/sec
arrow_reader_clickbench/async/Q27    1.00    119.0±0.58ms        ? ?/sec    1.36    161.5±0.80ms        ? ?/sec
arrow_reader_clickbench/async/Q28    1.00    115.4±0.73ms        ? ?/sec    1.39    160.0±0.95ms        ? ?/sec
arrow_reader_clickbench/async/Q30    1.01     65.7±0.48ms        ? ?/sec    1.00     64.8±0.60ms        ? ?/sec
arrow_reader_clickbench/async/Q36    1.00    129.4±0.83ms        ? ?/sec    1.29    167.2±0.84ms        ? ?/sec
arrow_reader_clickbench/async/Q37    1.00     99.2±0.68ms        ? ?/sec    1.00     98.9±0.53ms        ? ?/sec
arrow_reader_clickbench/async/Q38    1.01     39.9±0.27ms        ? ?/sec    1.00     39.5±0.30ms        ? ?/sec
arrow_reader_clickbench/async/Q39    1.01     49.4±0.40ms        ? ?/sec    1.00     49.0±0.38ms        ? ?/sec
arrow_reader_clickbench/async/Q40    1.00     49.1±0.66ms        ? ?/sec    1.09     53.5±0.43ms        ? ?/sec
arrow_reader_clickbench/async/Q41    1.00     41.2±0.47ms        ? ?/sec    1.00     41.0±0.40ms        ? ?/sec
arrow_reader_clickbench/async/Q42    1.00     14.7±0.18ms        ? ?/sec    1.00     14.6±0.15ms        ? ?/sec
arrow_reader_clickbench/sync/Q1      1.00   1843.8±9.23µs        ? ?/sec    1.20      2.2±0.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q10     1.00     13.0±0.07ms        ? ?/sec    1.03     13.3±0.12ms        ? ?/sec
arrow_reader_clickbench/sync/Q11     1.00     14.8±0.11ms        ? ?/sec    1.02     15.2±0.10ms        ? ?/sec
arrow_reader_clickbench/sync/Q12     1.00     32.4±0.50ms        ? ?/sec    1.25     40.6±0.31ms        ? ?/sec
arrow_reader_clickbench/sync/Q13     1.00     44.3±0.42ms        ? ?/sec    1.21     53.7±0.46ms        ? ?/sec
arrow_reader_clickbench/sync/Q14     1.00     42.9±0.51ms        ? ?/sec    1.22     52.3±0.46ms        ? ?/sec
arrow_reader_clickbench/sync/Q19     1.02      4.4±0.02ms        ? ?/sec    1.00      4.3±0.05ms        ? ?/sec
arrow_reader_clickbench/sync/Q20     1.00   121.8±10.69ms        ? ?/sec    1.44    175.5±1.15ms        ? ?/sec
arrow_reader_clickbench/sync/Q21     1.00    137.2±9.68ms        ? ?/sec    1.70    233.1±1.71ms        ? ?/sec
arrow_reader_clickbench/sync/Q22     1.00    214.2±9.00ms        ? ?/sec    2.22    475.1±3.54ms        ? ?/sec
arrow_reader_clickbench/sync/Q23     1.00   383.2±15.35ms        ? ?/sec    1.16   442.7±15.50ms        ? ?/sec
arrow_reader_clickbench/sync/Q24     1.00     41.7±0.48ms        ? ?/sec    1.31     54.5±0.58ms        ? ?/sec
arrow_reader_clickbench/sync/Q27     1.13   172.6±10.81ms        ? ?/sec    1.00    152.3±1.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q28     1.06    158.6±6.71ms        ? ?/sec    1.00    150.2±0.76ms        ? ?/sec
arrow_reader_clickbench/sync/Q30     1.03     64.3±0.70ms        ? ?/sec    1.00     62.5±0.48ms        ? ?/sec
arrow_reader_clickbench/sync/Q36     1.00    119.8±0.89ms        ? ?/sec    1.31    157.5±0.88ms        ? ?/sec
arrow_reader_clickbench/sync/Q37     1.01     93.6±0.71ms        ? ?/sec    1.00     92.3±0.40ms        ? ?/sec
arrow_reader_clickbench/sync/Q38     1.02     32.3±0.25ms        ? ?/sec    1.00     31.7±0.21ms        ? ?/sec
arrow_reader_clickbench/sync/Q39     1.02     35.1±0.39ms        ? ?/sec    1.00     34.3±0.26ms        ? ?/sec
arrow_reader_clickbench/sync/Q40     1.00     45.5±0.48ms        ? ?/sec    1.11     50.5±0.37ms        ? ?/sec
arrow_reader_clickbench/sync/Q41     1.01     38.2±0.32ms        ? ?/sec    1.00     37.9±0.28ms        ? ?/sec
arrow_reader_clickbench/sync/Q42     1.01     13.7±0.07ms        ? ?/sec    1.00     13.6±0.06ms        ? ?/sec

@alamb
Contributor Author

alamb commented May 20, 2025

🤖: Benchmark completed

Well, that is looking quite a bit better :bowtie:

I am now working on a way to reduce buffering requirements (will require incremental concat'ing)

@zhuqi-lucas
Contributor

Well, that is looking quite a bit better :bowtie: ...

Amazing result @alamb, it looks pretty cool!

@github-actions github-actions bot added the arrow Changes to the arrow crate label May 22, 2025
@alamb alamb force-pushed the alamb/cache_filter_result branch from ffb8f44 to 80b30ca Compare June 2, 2025 14:29
@alamb
Contributor Author

alamb commented Jun 2, 2025

I looked at the profiling for using the incremental record batch builder in DataFusion -- 5% of the time is taken copying StringView data. Maybe we can improve that as well.

let view_len = *v as u32;
// if view_len is 12 or less, the data is inlined and doesn't need an update
// if view_len is more than 12, we need to update the buffer offset
if view_len > 12 {
Contributor

@Dandandan Dandandan Jun 2, 2025

I think the check can be omitted (and will be faster without it).

Contributor Author

Views with length of 12 or less have the entire string inlined -- so if we update the buffer_index "field" for such small views we may well be corrupting the inlined string bytes. So I do think we need the check here
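
For illustration, a small sketch of that rule using the ByteView helper seen in the diff above (the remap_view function and buffer_offset parameter are hypothetical):

use arrow_data::ByteView;

/// Shift a raw u128 view so its buffer_index points past `buffer_offset`
/// pre-existing buffers. Views of length <= 12 carry the string bytes
/// inline in the upper bits, so the bits that would hold buffer_index are
/// actually string data and must not be touched.
fn remap_view(raw: u128, buffer_offset: u32) -> u128 {
    let len = raw as u32; // the low 32 bits are always the length
    if len <= 12 {
        raw // fully inlined: nothing to update
    } else {
        let mut view = ByteView::from(raw);
        view.buffer_index += buffer_offset;
        view.as_u128()
    }
}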

}

// Update any views that point to the old buffers
for v in views.as_slice_mut()[starting_view..].iter_mut() {
Contributor

@Dandandan Dandandan Jun 2, 2025

It would be faster to extend in one go, rather than first append_from_slice and later iter_mut (in views.append_slice(array.views()))

// if view_len is 12 or less, the data is inlined and doesn't need an update
// if view_len is more than 12, we need to update the buffer offset
if view_len > 12 {
let mut view = ByteView::from(*v);
Contributor

@Dandandan Dandandan Jun 2, 2025

ByteView can use std::mem::transmute for from / as_u128 which I think would be faster.

Contributor Author

Thank you. I need to make some benchmarks so I can test these ideas more easily. Right now the cycle time is pretty long as I need to rebuild datafusion and then kick off a benchmark run.

} else {
// otherwise the array is sparse so copy the data into a single new
// buffer as well as updating the views
let mut new_buffer: Vec<u8> = Vec::with_capacity(used_buffer_size);
Contributor

@Dandandan Dandandan Jun 2, 2025

I wonder if this creates too small a buffer after filtering?

Shouldn't we create an in-progress buffer with a larger size if used_buffer_size is smaller than that (based on the allocation strategy), and then use that one instead of creating small buffers and flushing them every time?

Contributor

(That could be one benefit of concat / buffering the filter results - we would know the exact capacity upfront.)

Contributor Author

I think you are 100% right - the copying should be handled in a similar way to the rest of the builder: fill up any remaining allocation first and allocate new buffers following the normal allocation strategy. I will pursue that approach
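
For illustration, a rough sketch of that approach (the struct, its fields, and the size constants are all made up, not the builder's actual internals): fill whatever capacity remains first, then freeze the buffer and allocate the next one with a doubling strategy up to a cap.

/// Rough sketch: append bytes by filling the remaining capacity of the
/// in-progress buffer before allocating a new, larger one.
struct InProgressBuffer {
    completed: Vec<Vec<u8>>,
    current: Vec<u8>,
    next_capacity: usize,
}

impl InProgressBuffer {
    fn new() -> Self {
        Self {
            completed: vec![],
            current: Vec::with_capacity(8 * 1024),
            next_capacity: 16 * 1024,
        }
    }

    fn append(&mut self, mut data: &[u8]) {
        while !data.is_empty() {
            let remaining = self.current.capacity() - self.current.len();
            if remaining == 0 {
                // current buffer is full: freeze it and start a larger one
                let next = Vec::with_capacity(self.next_capacity);
                self.completed.push(std::mem::replace(&mut self.current, next));
                // doubling strategy, capped at 2 MB (made-up limit)
                self.next_capacity = (self.next_capacity * 2).min(2 * 1024 * 1024);
                continue;
            }
            // fill the remaining allocation before growing
            let take = remaining.min(data.len());
            self.current.extend_from_slice(&data[..take]);
            data = &data[take..];
        }
    }
}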

let mut new_buffer: Vec<u8> = Vec::with_capacity(used_buffer_size);
let new_buffer_index = buffers.len() as u32; // making one new buffer
// Update any views that point to the old buffers.
for v in views.as_slice_mut()[starting_view..].iter_mut() {
Contributor

same - better to extend

@Dandandan
Contributor

I added some insights about the string view data from looking at the concat and string gc implementations.

@@ -538,6 +538,18 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {

l_full_data.cmp(r_full_data)
}

/// return the total number of bytes required to hold all strings pointed to by views in this array
pub fn minimum_buffer_size(&self) -> usize {
Contributor

probably compiles to the same, but I think using sum is slightly more idiomatic.

self.views
    .iter()
    .map(|v| {
        let len = (*v as u32) as usize;
        if len > 12 { len } else { 0 }
    })
    .sum()

Contributor Author

Yeah, I used sum before and it showed up in a profile -- I was trying to see if I could get the code faster but I didn't do it scientifically

///
/// This method optimizes for the case where the filter selects all or no rows
/// and ensures all output arrays in `current` are at most `batch_size` rows long.
pub fn append_filtered(
Contributor

@Dandandan Dandandan Jun 2, 2025

I think for any binary data it makes sense to buffer the BooleanArrays up to the batch_size row limit (or a number-of-arrays limit if it gets too large) so the final capacity of the data buffer can be determined in one go.
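
A sketch of that idea (the buffering container and the function name are made up for illustration): with the values and their filters buffered, the exact output size can be computed before any bytes are copied.

use arrow_array::{BooleanArray, StringViewArray};

/// Illustrative: given buffered (values, filter) pairs, compute exactly how
/// many data-buffer bytes the filtered output needs, so one correctly sized
/// buffer can be allocated up front.
fn required_data_bytes(pending: &[(StringViewArray, BooleanArray)]) -> usize {
    pending
        .iter()
        .map(|(values, filter)| {
            values
                .views()
                .iter()
                .zip(filter.iter())
                .filter(|(_, keep)| *keep == Some(true))
                .map(|(v, _)| {
                    let len = (*v as u32) as usize;
                    // strings of 12 bytes or less are inlined in the view
                    if len > 12 { len } else { 0 }
                })
                .sum::<usize>()
        })
        .sum()
}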

Contributor Author

The challenge with this strategy is that it may hold on to many buffers (for example, if the array only keeps 1 row from each batch, each 8MB-or-whatever buffer is mostly unused).

I will make a benchmark to explore the options more fully

Contributor

Yeah, true, I thought about this as well (right now this is also the case with concat after filtering views, but I think we can do better with e.g. a memory budget)

Contributor Author

Well, I started writing a benchmark and then realized I had nothing to benchmark. So I ported a version of code from DataFusion here:

I plan to make a benchmark tomorrow and hopefully start playing around with the optimizations

// Flush the in-progress buffer
unsafe {
// All views were copied from `array`
self.finalize_copied_views(starting_view, array);
Contributor

It would be good to delay this until finishing the buffer (so it can allocate a larger buffer and go through the views in one go).

Contributor Author

Right, but as mentioned above that will hold on to quite a bit more memory for selective filters. I have some other ideas of how to improve it

/// Append all views from the given array into the in-progress builder
///
/// Will copy the underlying views based on the value of target_buffer_load_factor
pub fn append_array(&mut self, array: &GenericByteViewArray<T>) {
Contributor

@Dandandan Dandandan Jun 3, 2025

This can be used in concat as well (a similar implementation is almost 1.5x as fast on string views)

Contributor Author

I will file a ticket for this idea which we can pursue separately

Dandandan pushed a commit that referenced this pull request Jun 5, 2025
…lected b…atches: (#7597)

# Which issue does this PR close?

- Part of #6692
- Part of #7589


# Rationale for this change

The pattern of combining multiple small RecordBatches to form one larger one
for subsequent processing is common in query engines like DataFusion which
filter or partition incoming Arrays. Current best practice is to use the
`filter` or `take` kernels and then the `concat` kernel, as explained in
- #6692

This pattern also appears in my attempt to improve parquet filter performance
(to cache the result of applying a filter rather than re-decoding the results). See
- apache/datafusion#3463
- #7513

The current pattern is non-optimal as it requires:
1. At least 2x peak memory (holding the input and output of `concat`)
2. 2 copies of the data (to create the output of `filter` and then the output of `concat`)

The theory is that with sufficient optimization we can reduce the peak memory
requirements and (possibly) make it faster as well.

However, to add a benchmark for this filter+concat, I basically had nothing to
benchmark. Specifically, there needed to be an API to call.

- Note I also made a PR to DataFusion showing this API can be used and it is
not slower: apache/datafusion#16249


# What changes are included in this PR?

I ported the code upstream from DataFusion into arrow-rs so that:
1. We can use it in the parquet reader
2. We can benchmark and optimize it appropriately

1. Add `BatchCoalescer` to `arrow-select`, and tests
2. Update documentation
3. Add examples
4. Add a `pub` export in `arrow`
5. Add a benchmark

# Are there any user-facing changes?

This is a new API. 

I next plan to make a benchmark for this particular
@alamb
Contributor Author

alamb commented Jun 6, 2025

@alamb alamb changed the title POC: Sketch out cached filter result API POC: Sketch out parquet cached filter result API Jun 24, 2025
@alamb alamb mentioned this pull request Jun 24, 2025
@alamb
Contributor Author

alamb commented Jun 24, 2025

@alamb alamb closed this Jun 24, 2025