
MrPowers Benchmarks

This repo performs benchmarking analyses on common datasets with popular query engines like pandas, Polars, DataFusion, and PySpark. For example, here are the results for the slower ClickBench queries on a few selected engines:

clickbench-slow

It draws inspiration from the h2o benchmarks but also includes different types of queries and uses some different execution methodologies (e.g. modern file formats).

The h2o benchmarks have been a great contribution to the data community, but they've made some decisions that aren't as relevant for modern workflows; see this section for more details.

Most readers of this repo are interested in the benchmark results and don't actually want to reproduce them. Nevertheless, this repo makes it easy for readers to reproduce the results themselves. This is particularly useful if you'd like to run the benchmarks on a specific set of hardware.

This repo provides clear instructions on how to generate the datasets and descriptions of the results, so you can easily gain intuition about the actual benchmarks that are run.
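As an illustration of what the generated datasets look like (this is not the repo's actual generation script), an h2o-style groupby table can be sketched in pandas. The column names follow the h2o convention (id and v columns), but the generator, cardinalities, and value ranges below are assumptions:

```python
import numpy as np
import pandas as pd

def make_h2o_groupby(n_rows: int, n_groups: int = 100, seed: int = 42) -> pd.DataFrame:
    """Generate an h2o-style groupby table with id (key) and v (value) columns."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "id1": rng.integers(1, n_groups + 1, n_rows),       # low-cardinality key
        "id2": rng.integers(1, n_groups + 1, n_rows),       # low-cardinality key
        "id3": rng.integers(1, n_rows // 10 + 2, n_rows),   # high-cardinality key
        "v1": rng.integers(1, 6, n_rows),
        "v2": rng.integers(1, 16, n_rows),
        "v3": rng.random(n_rows) * 100,
    })

df = make_h2o_groupby(1_000)
# df.to_parquet("h2o_groupby_1e7.parquet")  # stored as a single Parquet file, per this repo
```

Scaling `n_rows` to 1e7 or 1e8 gives tables of roughly the sizes benchmarked below.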

TPC-H on localhost

Here are the results for the TPC-H queries at scale factor 50:

tpch-sf50

Note that these benchmarks are derived from TPC-H and don't follow the official TPC methodology.
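To give a flavor of the workload, here is a pandas sketch of a TPC-H Q1-style pricing summary over a tiny, made-up lineitem slice. The real benchmark runs the full query suite against the scale-factor-50 data; the column subset and values below are illustrative assumptions:

```python
import pandas as pd

# A tiny, made-up slice of a TPC-H lineitem table (illustrative values only)
lineitem = pd.DataFrame({
    "l_returnflag": ["N", "N", "R", "A"],
    "l_linestatus": ["O", "O", "F", "F"],
    "l_quantity": [17.0, 36.0, 8.0, 28.0],
    "l_extendedprice": [21168.23, 45983.16, 13309.60, 42738.92],
    "l_discount": [0.04, 0.09, 0.10, 0.09],
})

# Q1-style aggregation: sums and averages grouped by return flag and line status
summary = (
    lineitem
    .assign(disc_price=lambda d: d.l_extendedprice * (1 - d.l_discount))
    .groupby(["l_returnflag", "l_linestatus"], as_index=False)
    .agg(sum_qty=("l_quantity", "sum"),
         sum_disc_price=("disc_price", "sum"),
         avg_disc=("l_discount", "mean"),
         count_order=("l_quantity", "size"))
)
```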

h2o join on localhost

Here are the results for the h2o join queries on the 1e7 dataset:

h2o_join_1e7

And here are the results on the 1e8 dataset:

h2o_join_1e8
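For context, the h2o join queries combine a large table with smaller lookup tables on shared keys. A minimal pandas sketch of that query shape, with made-up keys and sizes (not the benchmark's actual inputs):

```python
import pandas as pd

# Hypothetical stand-ins for the h2o join inputs
big = pd.DataFrame({"id1": [1, 2, 2, 3], "v1": [10.0, 20.0, 30.0, 40.0]})
small = pd.DataFrame({"id1": [1, 2], "v2": [0.5, 1.5]})

# Inner join on the shared key, the basic shape of the h2o join queries
joined = big.merge(small, on="id1", how="inner")
```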

h2o groupby on localhost

Here are the results for the h2o groupby queries on the 10 million row dataset (stored in a single Parquet file):

fast_h2o_groupby_1e7

Here are the results for the h2o groupby queries on the 100 million row dataset:

fast_h2o_groupby_1e8

Here are the longer-running groupby queries:

slow_h2o_groupby_1e7

Here they are on the bigger dataset:

slow_h2o_groupby_1e8
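The h2o groupby queries are simple aggregations over the id/value columns. As a hedged illustration, the first query ("sum v1 by id1") looks roughly like this in pandas — column names follow the h2o schema, but the data is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "id1": ["a", "b", "a", "c"],
    "v1": [1, 2, 3, 4],
})

# h2o groupby query 1: sum of v1 grouped by id1
result = df.groupby("id1", as_index=False).agg(v1=("v1", "sum"))
```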

Revised h2o methodology

These queries were run on a MacBook with an M3 chip and 16 GB of RAM.

Here's how the benchmarking methodology differs from the h2o benchmarks:

  • The full query time is measured, including the time to load the data (the h2o benchmarks only measure query time after the data is already loaded in memory)
  • Parquet files are used instead of CSV files
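A minimal sketch of what "full query time" means in practice: the data load is timed together with the query itself, rather than only the in-memory computation. The function and file names here are hypothetical:

```python
import time

def time_full_query(load, query) -> float:
    """Time data loading plus the query, matching the revised methodology."""
    start = time.perf_counter()
    df = load()   # loading is counted, unlike in the h2o benchmarks
    query(df)
    return time.perf_counter() - start

# Hypothetical usage against a Parquet file:
# elapsed = time_full_query(
#     load=lambda: pd.read_parquet("h2o_groupby_1e7.parquet"),
#     query=lambda df: df.groupby("id1")["v1"].sum(),
# )
```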

Single table query results

Here are the results for single table queries on the 1e7 dataset:

single_table_1e7

And here are the results on the 1e8 dataset:

single_table_1e8
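The single table queries run filters and aggregations over one table, without joins. A hedged pandas sketch of that style of query, with made-up column names and data:

```python
import pandas as pd

df = pd.DataFrame({
    "id1": [1, 2, 1, 3, 2],
    "v1": [5.0, 7.0, 9.0, 2.0, 4.0],
})

# A typical single-table query shape: filter, then aggregate
result = df.loc[df["v1"] > 4.0, "v1"].mean()
```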

ClickBench queries

Here are the very fast ClickBench queries:

clickbench-very-fast

Here are the fast ClickBench queries:

clickbench-fast

Here are the slow ClickBench queries:

clickbench-slow

About

Benchmarks for DataFusion, Daft, DuckDB, Polars, and pandas
