Optimize `regex_replace` for scalar patterns #3614

isidentical · 2022-09-25T23:07:29Z

Which issue does this PR close?

Closes #3613.

Rationale for this change

@Dandandan noticed in #3518 that `regex_replace` with a known pattern takes an extremely long time in the ClickBench suite. Several factors contribute, but the main one is how generic the `regex_replace` implementation is: it handles all 2⁴ combinations of scalar and array arguments. Having a generic version is good for compatibility, but it makes us pay overhead in common cases (like the example in #3518, where the pattern is static).
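To make the cost of the generic path concrete, here is a minimal Python sketch (not the DataFusion code, which is Rust). It assumes the four arguments are source, pattern, replacement and flags, and it leaves flags out of the loop for brevity. The name `generic_regex_replace` and the per-row cache are illustrative assumptions, not the actual implementation.

```python
import re
from itertools import product

# Each of the four arguments can be either a single scalar or a per-row
# array, which gives 2**4 = 16 possible input shapes.
ARGS = ("source", "pattern", "replacement", "flags")
shapes = list(product(("scalar", "array"), repeat=len(ARGS)))


def generic_regex_replace(sources, patterns, replacements):
    """Hypothetical fully generic path: every argument is treated as a
    per-row array, so the pattern is looked up (and compiled on first
    sight) for every single row, even when it never changes."""
    cache = {}
    out = []
    for src, pat, rep in zip(sources, patterns, replacements):
        if pat not in cache:
            cache[pat] = re.compile(pat)
        # count=1 replaces only the first match (no global flag here).
        out.append(cache[pat].sub(rep, src, count=1))
    return out
```

Even with a cache, a static pattern still costs one hash lookup per row here, which is the kind of overhead a specialized variant can avoid.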

What changes are included in this PR?

This PR adds a scalarity-based (not sure if this is a real word) specialization system: at runtime, the best `regex_replace` variant is picked and executed for the given set of inputs. This is just a start; if the gains are large enough, we might add a third case where the replacement is also known.
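As a rough illustration of the idea, the following Python sketch dispatches on whether the pattern is a scalar. All function names are hypothetical, and the real implementation lives in Rust in `regex_expressions.rs`; this only shows the shape of the technique.

```python
import re


def replace_generic(sources, patterns, replacements):
    """Fallback: the pattern may differ per row, so it is resolved inside the loop."""
    return [re.sub(p, r, s, count=1) for s, p, r in zip(sources, patterns, replacements)]


def replace_static_pattern(sources, pattern, replacements):
    """Specialized variant: the pattern is one scalar, so compile it exactly once."""
    compiled = re.compile(pattern)
    return [compiled.sub(r, s, count=1) for s, r in zip(sources, replacements)]


def regex_replace(sources, pattern, replacement):
    """Pick the best variant at runtime based on which inputs are scalars."""
    # Broadcast a scalar replacement so both variants see one value per row.
    if isinstance(replacement, str):
        replacements = [replacement] * len(sources)
    else:
        replacements = replacement
    if isinstance(pattern, str):
        return replace_static_pattern(sources, pattern, replacements)
    return replace_generic(sources, pattern, replacements)
```

A call such as `regex_replace(urls, r"^https?://", "")` takes the specialized path, while passing a list of patterns falls back to the generic one. A further variant for a scalar replacement would slot into the same dispatcher.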

Are there any user-facing changes?

This is mainly an optimization, and there shouldn't be any user-facing changes.

Benchmarks

New benchmarks are in #3614 (comment); overall they show a speed-up in the range of 20-35x, depending on the query and input.

Old benchmarks

All benchmarks were run in --release mode (using the datafusion-cli crate with the -f option).

The initial benchmark is Query 28 from ClickHouse's ClickBench suite:

SELECT
    REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k,
    AVG(length("Referer")) AS l,
    COUNT(*) AS c,
    MIN("Referer")
FROM hits_1
    WHERE "Referer" <> ''
    GROUP BY k
    HAVING COUNT(*) > 100000
    ORDER BY l DESC
LIMIT 25;

| Run | Master | This Branch | Factor |
| --- | --- | --- | --- |
| Cold Run | 2.875 seconds | 0.318 seconds | 9.04x speed-up |
| Hot Run (6th consecutive run) | 2.252 seconds | 0.266 seconds | 8.46x speed-up |
| Average | 2.408 seconds | 0.277 seconds | 8.69x speed-up |

(Note: I don't have the full ClickBench data, only a partition of it [1/100 scale], so these numbers might not be representative.)
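For reference, the factor column is simply the master time divided by the branch time. A small Python sketch that recomputes it from the timings reported above (the `speedup` helper is just an illustrative name):

```python
# (master_seconds, branch_seconds) as reported in the table above.
timings = {
    "Cold Run": (2.875, 0.318),
    "Hot Run (6th consecutive run)": (2.252, 0.266),
    "Average": (2.408, 0.277),
}


def speedup(master_seconds, branch_seconds):
    """Speed-up factor: how many times faster the branch is than master."""
    return master_seconds / branch_seconds


for name, (master, branch) in timings.items():
    print(f"{name}: {speedup(master, branch):.2f}x speed-up")
```

Rounded to two decimals, the hot-run factor comes out as 8.47x, so the 8.46x reported in the table appears to be truncated rather than rounded.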

The second benchmark passes both the source and the replacement as arrays; it shows a speed-up factor of 1.7x.

-- Generate data
--
-- import secrets
-- import random
--
-- rows = 1_000_000
--
-- data = {"user_id": [], "website": []}
-- for _ in range(rows):
--     data["user_id"].append(secrets.token_hex(8))
--
--     # Sometimes it is proper URL, and sometimes it is not.
--     data["website"].append(
--         random.choice(["http", "https", "unknown", ""])
--         + random.choice([":", "://"])
--         + random.choice(["google", "facebook"])
--         + random.choice([".com", ".org", ""])
--     )
--
-- import pandas as pd
-- df = pd.DataFrame(data)
-- df.to_parquet("data.parquet")

CREATE EXTERNAL TABLE generated_data
STORED AS PARQUET
LOCATION 'data.parquet';

-- Query 1
EXPLAIN ANALYZE
SELECT
    REGEXP_REPLACE("website", '^https?://(?:www.)?([^/]+)$', "user_id") AS encoded_website
FROM generated_data;

codecov-commenter · 2022-09-26T00:03:01Z

Codecov Report

Merging #3614 (bbb8c8b) into master (ebb28f5) will decrease coverage by 0.07%.
The diff coverage is 85.23%.

❗ Current head bbb8c8b differs from pull request most recent head d0f1020. Consider uploading reports for the commit d0f1020 to get more accurate results

@@            Coverage Diff             @@
##           master    #3614      +/-   ##
==========================================
- Coverage   86.07%   85.99%   -0.08%     
==========================================
  Files         300      300              
  Lines       56314    56449     +135     
==========================================
+ Hits        48473    48546      +73     
- Misses       7841     7903      +62

| Impacted Files | Coverage Δ | |
| --- | --- | --- |
| datafusion/physical-expr/src/functions.rs | `92.66% <50.00%> (-0.10%)` | ⬇️ |
| datafusion/optimizer/src/simplify_expressions.rs | `82.67% <82.60%> (-0.01%)` | ⬇️ |
| datafusion/physical-expr/src/regex_expressions.rs | `65.76% <86.88%> (-17.27%)` | ⬇️ |
| datafusion/common/src/scalar.rs | `85.18% <0.00%> (-0.07%)` | ⬇️ |
| datafusion/expr/src/logical_plan/plan.rs | `77.10% <0.00%> (ø)` | |
| datafusion/core/src/physical_plan/metrics/value.rs | `87.56% <0.00%> (+0.49%)` | ⬆️ |


Dandandan · 2022-09-27T06:32:50Z

Thank you @isidentical !

ursabot · 2022-09-27T06:42:23Z

Benchmark runs are scheduled for baseline = ea3dbb6 and contender = 15c19c3. 15c19c3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions bot added the physical-expr Changes to the physical-expr crates label Sep 25, 2022

isidentical force-pushed the gh-3613 branch 2 times, most recently from 9ae012e to fe85ff8 Compare September 25, 2022 23:22

Optimize regex_replace for scalar patterns

d0f1020

isidentical force-pushed the gh-3613 branch from fe85ff8 to d0f1020 Compare September 26, 2022 15:50

isidentical marked this pull request as ready for review September 26, 2022 15:51

Dandandan reviewed Sep 26, 2022


Change the hot-path on regexp_replace to only variadic source (#2)

b542898

isidentical force-pushed the gh-3613 branch from a7d139e to b542898 Compare September 26, 2022 23:16

isidentical requested a review from Dandandan September 26, 2022 23:16

Dandandan approved these changes Sep 27, 2022

Dandandan merged commit 15c19c3 into apache:master Sep 27, 2022
