Add DataFusion datasource implementation in Python for pandas and DataFrame Interchange #438
Conversation
```diff
 @pytest.mark.skipif(pa_major_minor < (11, 0), reason="pyarrow 11+ required")
-@pytest.mark.parametrize("connection", get_connections())
-def test_gh_286(connection):
+def test_gh_286():
```
One thing lost in this PR is the ability to query DataFrame Interchange Protocol objects with the DuckDB connection. This was needed (under the current architecture) to avoid converting to arrow up front.
This file is copied from https://data-apis.org/dataframe-protocol/latest/API.html
This is the Python interface that the Rust logic calls into.
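To make the shape of that interface concrete, here is a minimal sketch of what a Python-side datasource contract could look like. The class and method names (`Datasource`, `schema`, `fetch`) are assumptions for illustration, not VegaFusion's actual API; the key idea from the PR is that only the requested columns are materialized.

```python
from abc import ABC, abstractmethod
from typing import List


class Datasource(ABC):
    """Sketch of a Python datasource interface (hypothetical names)."""

    @abstractmethod
    def schema(self):
        """Return the column names of the underlying table."""

    @abstractmethod
    def fetch(self, columns: List[str]):
        """Return only the requested columns, converted on demand."""


class DictDatasource(Datasource):
    """Toy implementation backed by a dict of column lists."""

    def __init__(self, data):
        self._data = data

    def schema(self):
        return list(self._data)

    def fetch(self, columns):
        # Only the projected columns are touched; the rest are never converted
        return {name: self._data[name] for name in columns}


ds = DictDatasource({"a": [1, 2], "b": ["x", "y"], "c": [0.5, 1.5]})
```

The query engine can then call `fetch` with just the columns a query needs, which is the mechanism behind the speedups reported below.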
```python
from ._dfi_types import DtypeKind, DataFrame as DfiDataFrame
from .datasource import Datasource


# Taken from private pyarrow utilities
```
I copied some private utilities from pyarrow that handle converting DataFrame Interchange Protocol types to pyarrow schema types.
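As a rough illustration of what that conversion involves, here is a self-contained sketch that maps interchange-protocol dtypes to Arrow type names. The `DtypeKind` values follow the published DataFrame Interchange Protocol spec; the `dfi_to_arrow_typename` helper is hypothetical and far less complete than the real pyarrow utilities (which also handle datetimes, categoricals, bitmasks, etc.).

```python
from enum import IntEnum


class DtypeKind(IntEnum):
    # Values as defined by the DataFrame Interchange Protocol spec
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


def dfi_to_arrow_typename(kind, bit_width):
    """Map a (DtypeKind, bit_width) pair to an Arrow type name (sketch)."""
    if kind == DtypeKind.INT:
        return f"int{bit_width}"
    if kind == DtypeKind.UINT:
        return f"uint{bit_width}"
    if kind == DtypeKind.FLOAT:
        # Arrow's naming for floating point widths
        return {16: "halffloat", 32: "float", 64: "double"}[bit_width]
    if kind == DtypeKind.BOOL:
        return "bool"
    if kind == DtypeKind.STRING:
        # Interchange strings map to plain utf8, never large_utf8
        return "string"
    raise NotImplementedError(f"Unhandled dtype kind: {kind}")
```

Note the comment on `STRING`: the converted types are always plain `utf8`, which is what motivates the cast discussed below.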
```python
columns = list(columns)
projected_schema = pa.schema([f for f in self._schema if f.name in columns])
table = from_dataframe(self._dataframe.select_columns_by_name(columns))
return table.cast(projected_schema, safe=False)
```
I found that this cast was needed to handle the case where polars returns a LargeUTF8 column, since the converted pyarrow types are never LargeUTF8.
```rust
let result_updates = py.allow_threads(|| {
    self.tokio_runtime
        .block_on(self.state.update(&self.runtime, updates))
})?;
```
This was needed to avoid a deadlock, now that the DataFusion datasource may need to acquire the GIL.
This file is adapted from the DataFusion custom datasource example: https://github.com/apache/arrow-datafusion/blob/47fd9bf5b7a1b931e6e8bd323a01ae54fda261e5/datafusion-examples/examples/custom_datasource.rs
Closes #386
This PR adds custom DataFusion datasources, written in Python, for pandas DataFrames and for objects that adhere to the DataFrame Interchange Protocol (i.e. that have a `__dataframe__` method). With this approach, we no longer convert these objects to arrow up front before passing them into VegaFusion. Instead, they are converted to arrow dynamically during the DataFusion query. This makes it possible to down-select to the required columns before converting to Arrow, which can be much faster for DataFrames that include lots of columns.

Example of a 10-million-row histogram with pandas:
10243200
Timing `chart.to_dict()`:
Before: 7.04 s ± 81.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
After: 336 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That's 20x faster!
This is because we're only converting 1 out of 17 columns to arrow, and skipping several string columns which are particularly slow to convert.
The duckdb connection against pandas is still a bit faster, but the DataFusion connection is now within a factor of 2 for this example.
DuckDB: 199 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The performance results for DataFrame Interchange Protocol objects will vary depending on how expensive it is to convert their contents to arrow using PyArrow. In this case we're using `dfi.select_columns_by_name(columns)` to filter down the source columns before converting to Arrow.

cc @ivirshup
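For reference, a minimal sketch of that projection step against a pandas DataFrame (pandas 1.5+ exposes the interchange protocol via `__dataframe__()`; the column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["x", "y", "z"],
    "c": [1.0, 2.0, 3.0],
})

# Obtain the DataFrame Interchange Protocol view of the DataFrame
dfi = df.__dataframe__()

# Project down to one column before any Arrow conversion happens,
# so the string column "b" is never converted
projected = dfi.select_columns_by_name(["a"])
```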