Push down column selections when using __dataframe__ protocol #386

@ivirshup


Hi,

I would like to expose a __dataframe__ protocol on a large object where you would never actually want to request all columns. In a feature request over on the altair repo, it was suggested that vegafusion should be able to push down column selections when handling the dataframe interchange protocol. At first glance, this does not seem to be happening.

In the example below, I subclass the pandas dataframe interchange object so that it prints the column name every time get_column_by_name is called. When making a 2D histogram of "Origin" by "Miles_per_Gallon", I would expect only those two columns to be accessed. However:

import pandas as pd
import vega_datasets
import altair as alt
import vegafusion
vegafusion.enable()

from pandas.core.interchange.dataframe import PandasDataFrameXchg

class NoisyDfInterface(PandasDataFrameXchg):
    """Interchange object that reports every column it is asked for."""

    def __dataframe__(self, allow_copy: bool = True):
        # Return the noisy subclass so consumers keep going through the logging methods
        return NoisyDfInterface(self._df, allow_copy=allow_copy)

    def get_column_by_name(self, name):
        # Log the access, then defer to the pandas implementation
        print(f"get_column_by_name('{name}')")
        return super().get_column_by_name(name)

cars = vega_datasets.data.cars()

(
    alt.Chart(NoisyDfInterface(cars))
    .mark_rect()
    .encode(
        x=alt.X("Origin"),
        y=alt.Y("Miles_per_Gallon:Q", bin=True),
        color="count()",
    )
)

Output:

get_column_by_name('Origin')
get_column_by_name('Name')
get_column_by_name('Miles_per_Gallon')
get_column_by_name('Cylinders')
get_column_by_name('Displacement')
get_column_by_name('Horsepower')
get_column_by_name('Weight_in_lbs')
get_column_by_name('Acceleration')
get_column_by_name('Year')
get_column_by_name('Origin')
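
For comparison, here is a rough sketch of the access pattern I would expect from a consumer that pushes the column selection down. It does not use vegafusion or the pandas interchange classes. ToyInterchangeFrame, materialize, and materialize_with_pushdown are hypothetical names made up for illustration, and the data values are invented. Only the method names column_names, get_column_by_name, and select_columns_by_name mirror the interchange protocol.

```python
from typing import Dict, List, Sequence


class ToyInterchangeFrame:
    """Toy stand-in for an interchange object (hypothetical, not the pandas class)."""

    def __init__(self, columns: Dict[str, list], log: List[str]):
        self._columns = columns
        self._log = log  # shared record of every column fetched via get_column_by_name

    def column_names(self) -> List[str]:
        return list(self._columns)

    def get_column_by_name(self, name: str) -> list:
        self._log.append(name)
        return self._columns[name]

    def select_columns_by_name(self, names: Sequence[str]) -> "ToyInterchangeFrame":
        # Narrow the frame without going through get_column_by_name
        return ToyInterchangeFrame({n: self._columns[n] for n in names}, self._log)


def materialize(frame: ToyInterchangeFrame) -> Dict[str, list]:
    """Eager consumer: fetches every column the frame exposes."""
    return {name: frame.get_column_by_name(name) for name in frame.column_names()}


def materialize_with_pushdown(frame: ToyInterchangeFrame, needed: Sequence[str]) -> Dict[str, list]:
    """Consumer that narrows the frame to the needed columns before fetching."""
    return materialize(frame.select_columns_by_name(list(needed)))


log: List[str] = []
cars_toy = ToyInterchangeFrame(
    {
        "Name": ["a", "b"],  # invented values
        "Origin": ["USA", "Japan"],
        "Miles_per_Gallon": [18.0, 31.0],
        "Horsepower": [130, 65],
    },
    log,
)
materialize_with_pushdown(cars_toy, ["Origin", "Miles_per_Gallon"])
print(log)  # ['Origin', 'Miles_per_Gallon']
```

With the eager materialize alone, the log would contain all four column names, which is analogous to what the output above shows.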

This is using the latest releases of altair and vegafusion.

Environment info

Output of session_info.show(dependencies=True, html=False):

-----
altair              5.1.1
pandas              2.1.0
session_info        1.0.0
vega_datasets       0.9.0
vegafusion          1.4.0
-----
anyio                       NA
appnope                     0.1.2
arrow                       1.2.3
asttokens                   NA
attr                        23.1.0
attrs                       23.1.0
babel                       2.12.1
backcall                    0.2.0
certifi                     2022.09.24
chardet                     5.1.0
charset_normalizer          2.1.0
cloudpickle                 2.2.1
colorama                    0.4.6
cython_runtime              NA
dateutil                    2.8.2
debugpy                     1.5.1
decorator                   5.1.0
duckdb                      0.8.1
executing                   0.8.2
fastjsonschema              NA
fqdn                        NA
google                      NA
idna                        3.3
importlib_metadata          NA
ipykernel                   6.17.1
ipywidgets                  8.0.7
isoduration                 NA
jedi                        0.18.1
jinja2                      3.1.1
json5                       NA
jsonpointer                 2.4
jsonschema                  4.18.0
jsonschema_specifications   NA
jupyter_events              0.6.3
jupyter_server              2.7.0
jupyterlab_server           2.23.0
markupsafe                  2.1.1
mpl_toolkits                NA
nbformat                    5.9.0
numexpr                     2.8.1
numpy                       1.24.4
overrides                   NA
packaging                   23.1
parso                       0.8.2
pexpect                     4.8.0
pickleshare                 0.7.5
pkg_resources               NA
platformdirs                3.8.1
polars                      0.18.15
prometheus_client           NA
prompt_toolkit              3.0.38
psutil                      5.9.0
ptyprocess                  0.7.0
pure_eval                   0.2.1
pyarrow                     13.0.0
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.6.0
pydevd_concurrency_analyser NA
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.13.0
pythonjsonlogger            NA
pytz                        2022.7.1
referencing                 NA
requests                    2.31.0
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
rpds                        NA
ruamel                      NA
send2trash                  NA
setuptools                  65.6.3
simplejson                  3.17.6
sitecustomize               NA
six                         1.16.0
sniffio                     1.2.0
sphinxcontrib               NA
stack_data                  0.1.4
toolz                       0.12.0
tornado                     6.2
traitlets                   5.6.0
typing_extensions           NA
uri_template                NA
urllib3                     1.26.12
vegafusion_embed            NA
vegafusion_jupyter          1.4.0
vl_convert                  0.13.1
wcwidth                     0.2.5
webcolors                   1.13
websocket                   1.2.1
yaml                        5.4.1
zipp                        NA
zmq                         24.0.1
zoneinfo                    NA
-----
IPython             8.14.0
jupyter_client      8.3.0
jupyter_core        5.3.1
jupyterlab          4.0.2
notebook            7.0.2
-----
Python 3.9.16 (main, Dec  7 2022, 10:15:13) [Clang 13.0.0 (clang-1300.0.29.30)]
macOS-13.4.1-x86_64-i386-64bit
-----
Session information updated at 2023-09-05 23:09

Any idea what's up here? Is my expectation correct?
