Description
Hi,
I would like to expose a __dataframe__ protocol on a large object for which you would never actually want to request all columns. In a feature request over on the altair repo, it was suggested that vegafusion should be able to push down column selections when handling the dataframe interchange protocol. At first glance, this does not seem to be happening.
In this example, I create a subclass of the pandas dataframe interchange object that prints the column name every time get_column_by_name is called. When making a 2D histogram of "Origin" by "Miles_per_Gallon", I would expect to see only those two columns accessed. However:
import pandas as pd
import vega_datasets
import altair as alt
import vegafusion

vegafusion.enable()

from pandas.core.interchange.dataframe import PandasDataFrameXchg


class NoisyDfInterface(PandasDataFrameXchg):
    def __dataframe__(self, allow_copy: bool = True):
        return NoisyDfInterface(self._df, allow_copy=allow_copy)

    def get_column_by_name(self, name):
        print(f"get_column_by_name('{name}')")
        return super().get_column_by_name(name)


cars = vega_datasets.data.cars()
(
    alt.Chart(NoisyDfInterface(cars))
    .mark_rect()
    .encode(
        x=alt.X("Origin"),
        y=alt.Y("Miles_per_Gallon:Q", bin=True),
        color="count()",
    )
)

get_column_by_name('Origin')
get_column_by_name('Name')
get_column_by_name('Miles_per_Gallon')
get_column_by_name('Cylinders')
get_column_by_name('Displacement')
get_column_by_name('Horsepower')
get_column_by_name('Weight_in_lbs')
get_column_by_name('Acceleration')
get_column_by_name('Year')
get_column_by_name('Origin')
This is using the latest altair and vegafusion.
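To make the expectation concrete, here is a minimal sketch of what I mean by pushing down the column selection. The interchange protocol defines select_columns_by_name, so a consumer could narrow the object to the needed columns before reading any data. The names RecordingDfInterface and fetch_needed_columns are hypothetical and only for illustration; this is not how vegafusion or altair is implemented, and it relies on the same private pandas class as the example above.

```python
import pandas as pd
from pandas.core.interchange.dataframe import PandasDataFrameXchg


class RecordingDfInterface(PandasDataFrameXchg):
    """Hypothetical interchange object that records every column read."""

    # Class-level log, shared by all instances so that narrowed copies
    # created below report into the same list.
    accessed = []

    def __dataframe__(self, allow_copy: bool = True):
        return RecordingDfInterface(self._df, allow_copy=allow_copy)

    def get_column_by_name(self, name):
        self.accessed.append(name)
        return super().get_column_by_name(name)

    def select_columns_by_name(self, names):
        # Return the subclass so reads on the narrowed object are still logged.
        return RecordingDfInterface(
            self._df.loc[:, list(names)], allow_copy=self._allow_copy
        )


def fetch_needed_columns(obj, needed):
    """Hypothetical consumer: narrow to `needed` first, then read columns."""
    xchg = obj.__dataframe__(allow_copy=True)
    narrowed = xchg.select_columns_by_name(list(needed))
    return {
        name: narrowed.get_column_by_name(name)
        for name in narrowed.column_names()
    }


df = pd.DataFrame(
    {
        "Origin": ["USA", "Japan"],
        "Miles_per_Gallon": [18.0, 31.0],
        "Horsepower": [130, 65],
    }
)
columns = fetch_needed_columns(
    RecordingDfInterface(df), ["Origin", "Miles_per_Gallon"]
)
print(RecordingDfInterface.accessed)  # ['Origin', 'Miles_per_Gallon']
```

With a consumer like this, "Horsepower" is never requested, which is the behaviour I was hoping to see for the chart above.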
Environment info
Output of session_info.show(dependencies=True, html=False)
-----
altair 5.1.1
pandas 2.1.0
session_info 1.0.0
vega_datasets 0.9.0
vegafusion 1.4.0
-----
anyio NA
appnope 0.1.2
arrow 1.2.3
asttokens NA
attr 23.1.0
attrs 23.1.0
babel 2.12.1
backcall 0.2.0
certifi 2022.09.24
chardet 5.1.0
charset_normalizer 2.1.0
cloudpickle 2.2.1
colorama 0.4.6
cython_runtime NA
dateutil 2.8.2
debugpy 1.5.1
decorator 5.1.0
duckdb 0.8.1
executing 0.8.2
fastjsonschema NA
fqdn NA
google NA
idna 3.3
importlib_metadata NA
ipykernel 6.17.1
ipywidgets 8.0.7
isoduration NA
jedi 0.18.1
jinja2 3.1.1
json5 NA
jsonpointer 2.4
jsonschema 4.18.0
jsonschema_specifications NA
jupyter_events 0.6.3
jupyter_server 2.7.0
jupyterlab_server 2.23.0
markupsafe 2.1.1
mpl_toolkits NA
nbformat 5.9.0
numexpr 2.8.1
numpy 1.24.4
overrides NA
packaging 23.1
parso 0.8.2
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.8.1
polars 0.18.15
prometheus_client NA
prompt_toolkit 3.0.38
psutil 5.9.0
ptyprocess 0.7.0
pure_eval 0.2.1
pyarrow 13.0.0
pydev_ipython NA
pydevconsole NA
pydevd 2.6.0
pydevd_concurrency_analyser NA
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.13.0
pythonjsonlogger NA
pytz 2022.7.1
referencing NA
requests 2.31.0
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
rpds NA
ruamel NA
send2trash NA
setuptools 65.6.3
simplejson 3.17.6
sitecustomize NA
six 1.16.0
sniffio 1.2.0
sphinxcontrib NA
stack_data 0.1.4
toolz 0.12.0
tornado 6.2
traitlets 5.6.0
typing_extensions NA
uri_template NA
urllib3 1.26.12
vegafusion_embed NA
vegafusion_jupyter 1.4.0
vl_convert 0.13.1
wcwidth 0.2.5
webcolors 1.13
websocket 1.2.1
yaml 5.4.1
zipp NA
zmq 24.0.1
zoneinfo NA
-----
IPython 8.14.0
jupyter_client 8.3.0
jupyter_core 5.3.1
jupyterlab 4.0.2
notebook 7.0.2
-----
Python 3.9.16 (main, Dec 7 2022, 10:15:13) [Clang 13.0.0 (clang-1300.0.29.30)]
macOS-13.4.1-x86_64-i386-64bit
-----
Session information updated at 2023-09-05 23:09
Any idea what's up here? Is my expectation correct?