Skip to content

Improve performance of ClickBench Q21 by removing the cast #16591

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

When looking into a purported regression in #16575 (comment) I noticed something else about clickbench 21:

SELECT "SearchPhrase", MIN("URL"), COUNT(*) AS c FROM hits WHERE "URL" LIKE '%google%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

About 30% of the overall execution time is going to CASTing!

Image

Describe the solution you'd like

Reproduce like this:

cd benchmarks
./bench.sh data clickbench_partitioned
datafusion-cli -c "SELECT \"SearchPhrase\", MIN(\"URL\"), COUNT(*) AS c FROM 'data/hits_partitioned' WHERE \"URL\" LIKE '%google%' AND \"SearchPhrase\" <> '' GROUP BY \"SearchPhrase\" ORDER BY c DESC LIMIT 10;"

you can run the explain plan like this

(venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion/benchmarks$ datafusion-cli -c "EXPLAIN SELECT \"SearchPhrase\", MIN(\"URL\"), COUNT(*) AS c FROM 'data/hits_partitioned' WHERE \"URL\" LIKE '%google%' AND \"SearchPhrase\" <> '' GROUP BY \"SearchPhrase\" ORDER BY c DESC LIMIT 10;"
DataFusion CLI v48.0.0
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │  SortPreservingMergeExec  │ |
|               | │    --------------------   │ |
|               | │      c DESClimit: 10      │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       SortExec(TopK)      │ |
|               | │    --------------------   │ |
|               | │          c@2 DESC         │ |
|               | │                           │ |
|               | │         limit: 10         │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       ProjectionExec      │ |
|               | │    --------------------   │ |
|               | │       SearchPhrase:       │ |
|               | │        SearchPhrase       │ |
|               | │                           │ |
|               | │     c: count(Int64(1))    │ |
|               | │                           │ |
|               | │ min(data/hits_partitioned │ |
|               | │           .URL):          │ |
|               | │ min(data/hits_partitioned │ |
|               | │           .URL)           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       AggregateExec       │ |
|               | │    --------------------   │ |
|               | │           aggr:           │ |
|               | │ min(data/hits_partitioned │ |
|               | │      .URL), count(1)      │ |
|               | │                           │ |
|               | │         group_by:         │ |
|               | │        SearchPhrase       │ |
|               | │                           │ |
|               | │           mode:           │ |
|               | │      FinalPartitioned     │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │    CoalesceBatchesExec    │ |
|               | │    --------------------   │ |
|               | │     target_batch_size:    │ |
|               | │            8192           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │      RepartitionExec      │ |
|               | │    --------------------   │ |
|               | │ partition_count(in->out): │ |
|               | │          16 -> 16         │ |
|               | │                           │ |
|               | │    partitioning_scheme:   │ |
|               | │ Hash([SearchPhrase@0], 16)│ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       AggregateExec       │ |
|               | │    --------------------   │ |
|               | │           aggr:           │ |
|               | │ min(data/hits_partitioned │ |
|               | │      .URL), count(1)      │ |
|               | │                           │ |
|               | │         group_by:         │ |
|               | │        SearchPhrase       │ |
|               | │                           │ |
|               | │       mode: Partial       │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │    CoalesceBatchesExec    │ |
|               | │    --------------------   │ |
|               | │     target_batch_size:    │ |
|               | │            8192           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │         FilterExec        │ |
|               | │    --------------------   │ |
|               | │         predicate:        │ |
|               | │ CAST(URL AS Utf8View) LIKE│ |
|               | │        %google% AND       │ |
|               | │        SearchPhrase       │ |
|               | │             !=            │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         files: 115        │ |
|               | │      format: parquet      │ |
|               | │                           │ |
|               | │         predicate:        │ |
|               | │ CAST(URL AS Utf8View) LIKE│ | <--- this CAST is the problem, for some reason the LIKE is not evaluated natively on the query
|               | │        %google% AND       │ |
|               | │        SearchPhrase       │ |
|               | │             !=            │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.058 seconds.

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions