@isidentical isidentical commented Sep 25, 2022

Which issue does this PR close?

Closes #3613.

Rationale for this change

@Dandandan noticed in #3518 that regex_replace with a known pattern takes an extremely long time when running the ClickBench suite. Several factors contribute, but the main one is how generic the regex_replace implementation is: it can handle 2⁴ combinations of scalar and array arguments. Having a generic version is good for compatibility, but it makes us pay the overhead even in common cases (like the example in #3518, where the pattern is static).
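
To make the 2⁴ figure concrete, here is a small Python enumeration. It assumes the four arguments are the source, pattern, replacement, and flags; those names are used for illustration only and are not taken from the implementation.

```python
from itertools import product

# Assumed argument names, for illustration only.
ARGS = ("source", "pattern", "replacement", "flags")

# Each argument arrives either as one scalar or as a per-row array.
combinations = list(product(("scalar", "array"), repeat=len(ARGS)))

# Shapes in which the pattern is a scalar, i.e. the static-pattern case.
static_pattern = [c for c in combinations if c[ARGS.index("pattern")] == "scalar"]

print(len(combinations))    # 16
print(len(static_pattern))  # 8
```

Half of the shapes have a static pattern, so a fast path for that case alone covers a large share of the input space.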

What changes are included in this PR?

This PR adds a scalarity-based (not sure if this is a real word) specialization system in which the best regex_replace variant for the given set of inputs is picked and executed at runtime. The system here is just a start; if the gains are large enough, we might add a third case where the replacement is also known.
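
As a rough illustration of the idea (a Python sketch, not the Rust code in this PR), the dispatcher below checks whether the pattern is a scalar and, if so, takes a path that compiles the regex once for the whole batch. All function names are hypothetical, and replacing only the first match is an assumption about the default behavior when no flags are given.

```python
import re
from typing import List, Optional, Union

# An argument is either one scalar shared by every row or a per-row array (a list here).
Arg = Union[Optional[str], List[Optional[str]]]


def is_scalar(arg: Arg) -> bool:
    return not isinstance(arg, list)


def broadcast(arg: Arg, n: int) -> List[Optional[str]]:
    """Expand a scalar to n rows; leave an array untouched."""
    return [arg] * n if is_scalar(arg) else arg


def replace_generic(sources, patterns, replacements):
    """Generic path: the pattern may differ per row, so it is resolved row by row."""
    cache = {}
    out = []
    for src, pat, rep in zip(sources, patterns, replacements):
        if src is None or pat is None or rep is None:
            out.append(None)
            continue
        if pat not in cache:
            cache[pat] = re.compile(pat)
        # count=1 replaces only the first match (assumed default, no flags).
        out.append(cache[pat].sub(rep, src, count=1))
    return out


def replace_static_pattern(sources, pattern, replacements):
    """Specialized path: the pattern is a scalar, so it is compiled exactly once."""
    if pattern is None:
        return [None] * len(sources)
    compiled = re.compile(pattern)
    return [
        None if src is None or rep is None else compiled.sub(rep, src, count=1)
        for src, rep in zip(sources, replacements)
    ]


def regex_replace(sources, pattern, replacement):
    """Pick the variant at runtime from the scalarity of the pattern."""
    n = len(sources)
    if is_scalar(pattern):
        return replace_static_pattern(sources, pattern, broadcast(replacement, n))
    return replace_generic(sources, pattern, broadcast(replacement, n))


urls = ["https://www.example.com/a/b", "not a url", None]
print(regex_replace(urls, r"^https?://(?:www\.)?([^/]+)/.*$", r"\1"))
# ['example.com', 'not a url', None]
```

The specialized path skips the per-row pattern lookup entirely, which is where the static-pattern case is expected to save time.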

Are there any user-facing changes?

This is mainly an optimization, and there shouldn't be any user-facing changes.

Benchmarks

New benchmarks are in #3614 (comment); overall they show a speed-up in the range of 20-35X depending on the query and input.

Old benchmarks

All benchmarks were run in --release mode (using the datafusion-cli crate with the -f option).

The initial benchmark is Query 28 from ClickHouse's ClickBench suite:

SELECT
    REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k,
    AVG(length("Referer")) AS l,
    COUNT(*) AS c,
    MIN("Referer")
FROM hits_1
    WHERE "Referer" <> ''
    GROUP BY k
    HAVING COUNT(*) > 100000
    ORDER BY l DESC
LIMIT 25;

| | Master | This Branch | Factor |
|---|---|---|---|
| Cold Run | 2.875 seconds | 0.318 seconds | 9.04x speed-up |
| Hot Run (6th consecutive run) | 2.252 seconds | 0.266 seconds | 8.46x speed-up |
| Average | 2.408 seconds | 0.277 seconds | 8.69x speed-up |

(Note: I don't have the full ClickBench data, only a partition of it [1/100 scale], so this might not be very representative.)

The second benchmark is one where both the source and the replacements are arrays; it shows a speed-up factor of 1.7X.

-- Generate data
--
-- import secrets
-- import random
--
-- rows = 1_000_000
--
-- data = {"user_id": [], "website": []}
-- for _ in range(rows):
--     data["user_id"].append(secrets.token_hex(8))
--
--     # Sometimes it is proper URL, and sometimes it is not.
--     data["website"].append(
--         random.choice(["http", "https", "unknown", ""])
--         + random.choice([":", "://"])
--         + random.choice(["google", "facebook"])
--         + random.choice([".com", ".org", ""])
--     )
--
-- import pandas as pd
-- df = pd.DataFrame(data)
-- df.to_parquet("data.parquet")

CREATE EXTERNAL TABLE generated_data
STORED AS PARQUET
LOCATION 'data.parquet';

-- Query 1
EXPLAIN ANALYZE
SELECT
    REGEXP_REPLACE("website", '^https?://(?:www\.)?([^/]+)$', "user_id") AS encoded_website
FROM generated_data;
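
For readers who want to see what Query 1 computes per row, here is a minimal Python sketch. The three sample rows are made up for illustration, the dot after `www` is escaped, and only the first match is replaced (an assumption about the behavior without flags). It is not a benchmark and says nothing about the timings above.

```python
import re

# Static pattern from Query 1, compiled once.
pattern = re.compile(r"^https?://(?:www\.)?([^/]+)$")

# Hypothetical (user_id, website) rows in the shape the generator produces.
rows = [
    ("3f9a1c0e5b7d2e48", "https://google.com"),
    ("a0b1c2d3e4f50617", "http:facebook.org"),
    ("0123456789abcdef", "unknown://google"),
]

# The replacement comes from a column, so it differs on every row.
encoded = [pattern.sub(user_id, website, count=1) for user_id, website in rows]
print(encoded)
# ['3f9a1c0e5b7d2e48', 'http:facebook.org', 'unknown://google']
```

Only the first row is a well-formed URL, so only it is replaced by its user_id; the other two rows pass through unchanged.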

@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Sep 25, 2022
@isidentical isidentical force-pushed the gh-3613 branch 2 times, most recently from 9ae012e to fe85ff8 Compare September 25, 2022 23:22
codecov-commenter commented Sep 26, 2022

Codecov Report

Merging #3614 (bbb8c8b) into master (ebb28f5) will decrease coverage by 0.07%.
The diff coverage is 85.23%.

❗ Current head bbb8c8b differs from pull request most recent head d0f1020. Consider uploading reports for the commit d0f1020 to get more accurate results

@@            Coverage Diff             @@
##           master    #3614      +/-   ##
==========================================
- Coverage   86.07%   85.99%   -0.08%     
==========================================
  Files         300      300              
  Lines       56314    56449     +135     
==========================================
+ Hits        48473    48546      +73     
- Misses       7841     7903      +62     

| Impacted Files | Coverage Δ |
|---|---|
| datafusion/physical-expr/src/functions.rs | 92.66% <50.00%> (-0.10%) ⬇️ |
| datafusion/optimizer/src/simplify_expressions.rs | 82.67% <82.60%> (-0.01%) ⬇️ |
| datafusion/physical-expr/src/regex_expressions.rs | 65.76% <86.88%> (-17.27%) ⬇️ |
| datafusion/common/src/scalar.rs | 85.18% <0.00%> (-0.07%) ⬇️ |
| datafusion/expr/src/logical_plan/plan.rs | 77.10% <0.00%> (ø) |
| datafusion/core/src/physical_plan/metrics/value.rs | 87.56% <0.00%> (+0.49%) ⬆️ |


@isidentical isidentical marked this pull request as ready for review September 26, 2022 15:51
@Dandandan Dandandan merged commit 15c19c3 into apache:master Sep 27, 2022
@Dandandan

Thank you @isidentical !

ursabot commented Sep 27, 2022

Benchmark runs are scheduled for baseline = ea3dbb6 and contender = 15c19c3. 15c19c3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

