
MrPowers Benchmarks

This repo performs benchmarking analyses on common datasets with popular query engines like pandas, Polars, DataFusion, and PySpark. For example, here are the results for the slower ClickBench queries on a few selected engines:

clickbench-slow

It draws inspiration from the h2o benchmarks but also includes different types of queries and uses some different execution methodologies (e.g. modern file formats).

The h2o benchmarks have been a great contribution to the data community, but they've made some decisions that aren't as relevant for modern workflows; see this section for more details.

Most readers of this repo are interested in the benchmark results and don't actually want to reproduce them. Nevertheless, this repo makes it easy for readers to reproduce the results themselves. This is particularly useful if you'd like to run the benchmarks on a specific set of hardware.

This repo provides clear instructions on how to generate the datasets and descriptions of the results, so you can easily gain intuition about the actual benchmarks that are run.
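As an illustration of what the generated datasets look like (this is not the repo's actual generation script), an h2o-style groupby table can be sketched in pandas. The column names follow the h2o convention (id and v columns), but the generator, cardinalities, and value ranges below are assumptions:

```python
import numpy as np
import pandas as pd

def make_h2o_groupby(n_rows: int, n_groups: int = 100, seed: int = 42) -> pd.DataFrame:
    """Generate an h2o-style groupby table with id (key) and v (value) columns."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "id1": rng.integers(1, n_groups + 1, n_rows),       # low-cardinality key
        "id2": rng.integers(1, n_groups + 1, n_rows),       # low-cardinality key
        "id3": rng.integers(1, n_rows // 10 + 2, n_rows),   # high-cardinality key
        "v1": rng.integers(1, 6, n_rows),
        "v2": rng.integers(1, 16, n_rows),
        "v3": rng.random(n_rows) * 100,
    })

df = make_h2o_groupby(1_000)
# df.to_parquet("h2o_groupby_1e7.parquet")  # stored as a single Parquet file, per this repo
```

Scaling `n_rows` to 1e7 or 1e8 gives tables of roughly the sizes benchmarked below.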

TPC-H on localhost

Here are the results for the TPC-H queries at scale factor 50:

tpch-sf50

Note that these benchmarks are derived from TPC-H and don't follow the official TPC methodology.
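To give a flavor of the workload, here is a pandas sketch of a TPC-H Q1-style pricing summary over a tiny, made-up lineitem slice. The real benchmark runs the full query suite against the scale-factor-50 data; the column subset and values below are illustrative assumptions:

```python
import pandas as pd

# A tiny, made-up slice of a TPC-H lineitem table (illustrative values only)
lineitem = pd.DataFrame({
    "l_returnflag": ["N", "N", "R", "A"],
    "l_linestatus": ["O", "O", "F", "F"],
    "l_quantity": [17.0, 36.0, 8.0, 28.0],
    "l_extendedprice": [21168.23, 45983.16, 13309.60, 42738.92],
    "l_discount": [0.04, 0.09, 0.10, 0.09],
})

# Q1-style aggregation: sums and averages grouped by return flag and line status
summary = (
    lineitem
    .assign(disc_price=lambda d: d.l_extendedprice * (1 - d.l_discount))
    .groupby(["l_returnflag", "l_linestatus"], as_index=False)
    .agg(sum_qty=("l_quantity", "sum"),
         sum_disc_price=("disc_price", "sum"),
         avg_disc=("l_discount", "mean"),
         count_order=("l_quantity", "size"))
)
```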

h2o join on localhost

Here are the results for the h2o join queries on the 1e7 dataset:

h2o_join_1e7

And here are the results on the 1e8 dataset:

h2o_join_1e8
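For context, the h2o join queries combine a large table with smaller lookup tables on shared keys. A minimal pandas sketch of that query shape, with made-up keys and sizes (not the benchmark's actual inputs):

```python
import pandas as pd

# Hypothetical stand-ins for the h2o join inputs
big = pd.DataFrame({"id1": [1, 2, 2, 3], "v1": [10.0, 20.0, 30.0, 40.0]})
small = pd.DataFrame({"id1": [1, 2], "v2": [0.5, 1.5]})

# Inner join on the shared key, the basic shape of the h2o join queries
joined = big.merge(small, on="id1", how="inner")
```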

h2o groupby on localhost

Here are the results for the h2o groupby queries on the 10 million row dataset (stored in a single Parquet file):

fast_h2o_groupby_1e7

Here are the results for the h2o groupby queries on the 100 million row dataset:

fast_h2o_groupby_1e8

Here are the longer-running groupby queries:

slow_h2o_groupby_1e7

Here they are on the bigger dataset:

slow_h2o_groupby_1e8
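The h2o groupby queries are simple aggregations over the id/value columns. As a hedged illustration, the first query ("sum v1 by id1") looks roughly like this in pandas — column names follow the h2o schema, but the data is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "id1": ["a", "b", "a", "c"],
    "v1": [1, 2, 3, 4],
})

# h2o groupby query 1: sum of v1 grouped by id1
result = df.groupby("id1", as_index=False).agg(v1=("v1", "sum"))
```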

Revised h2o methodology

These queries were run on a MacBook with an M3 chip and 16 GB of RAM.

Here's how the benchmarking methodology differs from the h2o benchmarks:

  • The full query time is measured, including the time to load the data (the h2o benchmarks only measure query time after the data is already loaded in memory)
  • Parquet files are used instead of CSV files
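A minimal sketch of what "full query time" means in practice: the data load is timed together with the query itself, rather than only the in-memory computation. The function and file names here are hypothetical:

```python
import time

def time_full_query(load, query) -> float:
    """Time data loading plus the query, matching the revised methodology."""
    start = time.perf_counter()
    df = load()   # loading is counted, unlike in the h2o benchmarks
    query(df)
    return time.perf_counter() - start

# Hypothetical usage against a Parquet file:
# elapsed = time_full_query(
#     load=lambda: pd.read_parquet("h2o_groupby_1e7.parquet"),
#     query=lambda df: df.groupby("id1")["v1"].sum(),
# )
```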

Single table query results

Here are the results for single table queries on the 1e7 dataset:

single_table_1e7

And here are the results on the 1e8 dataset:

single_table_1e8
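The single table queries run filters and aggregations over one table, without joins. A hedged pandas sketch of that style of query, with made-up column names and data:

```python
import pandas as pd

df = pd.DataFrame({
    "id1": [1, 2, 1, 3, 2],
    "v1": [5.0, 7.0, 9.0, 2.0, 4.0],
})

# A typical single-table query shape: filter, then aggregate
result = df.loc[df["v1"] > 4.0, "v1"].mean()
```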

ClickBench queries

Here are the very fast ClickBench queries:

clickbench-very-fast

Here are the fast ClickBench queries:

clickbench-fast

Here are the slow ClickBench queries:

clickbench-slow

About

Benchmarks for DataFusion, Daft, DuckDB, Polars, and pandas
