Description
Is your feature request related to a problem or challenge?
When looking into a purported regression in #16575 (comment) I noticed something else about clickbench 21:
SELECT "SearchPhrase", MIN("URL"), COUNT(*) AS c FROM hits WHERE "URL" LIKE '%google%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;
About 30% of the overall execution time is going to CAST
ing!
Describe the solution you'd like
Reproduce like this:
cd benchmarks
./bench.sh data clickbench_partitioned
datafusion-cli -c "SELECT \"SearchPhrase\", MIN(\"URL\"), COUNT(*) AS c FROM 'data/hits_partitioned' WHERE \"URL\" LIKE '%google%' AND \"SearchPhrase\" <> '' GROUP BY \"SearchPhrase\" ORDER BY c DESC LIMIT 10;"
you can run the explain plan like this
(venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion/benchmarks$ datafusion-cli -c "EXPLAIN SELECT \"SearchPhrase\", MIN(\"URL\"), COUNT(*) AS c FROM 'data/hits_partitioned' WHERE \"URL\" LIKE '%google%' AND \"SearchPhrase\" <> '' GROUP BY \"SearchPhrase\" ORDER BY c DESC LIMIT 10;"
DataFusion CLI v48.0.0
+---------------+-------------------------------+
| plan_type | plan |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
| | │ SortPreservingMergeExec │ |
| | │ -------------------- │ |
| | │ c DESClimit: 10 │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ SortExec(TopK) │ |
| | │ -------------------- │ |
| | │ c@2 DESC │ |
| | │ │ |
| | │ limit: 10 │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ ProjectionExec │ |
| | │ -------------------- │ |
| | │ SearchPhrase: │ |
| | │ SearchPhrase │ |
| | │ │ |
| | │ c: count(Int64(1)) │ |
| | │ │ |
| | │ min(data/hits_partitioned │ |
| | │ .URL): │ |
| | │ min(data/hits_partitioned │ |
| | │ .URL) │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ AggregateExec │ |
| | │ -------------------- │ |
| | │ aggr: │ |
| | │ min(data/hits_partitioned │ |
| | │ .URL), count(1) │ |
| | │ │ |
| | │ group_by: │ |
| | │ SearchPhrase │ |
| | │ │ |
| | │ mode: │ |
| | │ FinalPartitioned │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ CoalesceBatchesExec │ |
| | │ -------------------- │ |
| | │ target_batch_size: │ |
| | │ 8192 │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ RepartitionExec │ |
| | │ -------------------- │ |
| | │ partition_count(in->out): │ |
| | │ 16 -> 16 │ |
| | │ │ |
| | │ partitioning_scheme: │ |
| | │ Hash([SearchPhrase@0], 16)│ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ AggregateExec │ |
| | │ -------------------- │ |
| | │ aggr: │ |
| | │ min(data/hits_partitioned │ |
| | │ .URL), count(1) │ |
| | │ │ |
| | │ group_by: │ |
| | │ SearchPhrase │ |
| | │ │ |
| | │ mode: Partial │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ CoalesceBatchesExec │ |
| | │ -------------------- │ |
| | │ target_batch_size: │ |
| | │ 8192 │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ FilterExec │ |
| | │ -------------------- │ |
| | │ predicate: │ |
| | │ CAST(URL AS Utf8View) LIKE│ |
| | │ %google% AND │ |
| | │ SearchPhrase │ |
| | │ != │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ DataSourceExec │ |
| | │ -------------------- │ |
| | │ files: 115 │ |
| | │ format: parquet │ |
| | │ │ |
| | │ predicate: │ |
| | │ CAST(URL AS Utf8View) LIKE│ | <--- this CAST is the problem, for some reason the LIKE is not evaluated natively on the query
| | │ %google% AND │ |
| | │ SearchPhrase │ |
| | │ != │ |
| | └───────────────────────────┘ |
| | |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.058 seconds.
Describe alternatives you've considered
No response
Additional context
No response