Can we use df.select_columns_by_name to subselect columns coming from __dataframe__ protocol #3134

@jcrist

With #3114 (and some recent work in ibis) we can now pass ibis tables directly to altair.Chart and things just work.

import ibis
import altair as alt

t = ibis.examples.penguins.fetch()

chart = (
    alt.Chart(t, width=600)
    .mark_circle(size=50)
    .encode(
        x=alt.X("bill_length_mm").scale(zero=False),
        y=alt.Y("bill_depth_mm").scale(zero=False),
        color="species"
    )
    .interactive()
)

However, it currently appears that the entire table is loaded into memory, even if the plot needs only a few columns. For example, the penguins dataset above has 8 columns, but only 3 of them are used to generate the plot. Could altair use this information to subselect columns before conversion when using the __dataframe__ protocol?

With some recent work in ibis, the following can all happen without loading data into memory:

import pyarrow as pa
import pyarrow.interchange  # make the pa.interchange submodule available

df = t.__dataframe__()

# subselect columns
df = df.select_columns_by_name(["bill_length_mm", "bill_depth_mm", "species"])

# view dtypes, as needed for altair's type inference
df.get_column_by_name("bill_length_mm").dtype

# convert to pyarrow here. Only this step will actually execute the query
t = pa.interchange.from_dataframe(df)

Especially for wide input tables, having altair handle subselecting columns automatically may be useful for improving performance.
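
To make the request concrete, here is a rough sketch of the kind of helper I have in mind. It is not altair's actual implementation. The names `field_name`, `select_used_columns`, and `FakeFrame` are hypothetical, and the shorthand parsing is an assumption that only handles plain field names with an optional `:Q`-style type suffix (no aggregates, no column names that contain a colon). The only interchange-protocol methods it relies on are `column_names()` and `select_columns_by_name()`.

```python
def field_name(shorthand):
    """Strip an optional ':Q'-style type suffix from an encoding shorthand."""
    return shorthand.split(":", 1)[0]


def select_used_columns(dfi, shorthands):
    """Return an interchange object restricted to the columns the chart uses.

    `dfi` is anything exposing `column_names()` and
    `select_columns_by_name()`, e.g. the result of `t.__dataframe__()`.
    """
    wanted = {field_name(s) for s in shorthands}
    # Keep the table's own column order and skip names it does not have.
    keep = [name for name in dfi.column_names() if name in wanted]
    return dfi.select_columns_by_name(keep)


class FakeFrame:
    """Hypothetical stand-in so the sketch runs without ibis or pyarrow."""

    def __init__(self, names):
        self._names = list(names)

    def column_names(self):
        return list(self._names)

    def select_columns_by_name(self, names):
        return FakeFrame(names)


wide = FakeFrame([
    "species", "island", "bill_length_mm", "bill_depth_mm",
    "flipper_length_mm", "body_mass_g", "sex", "year",
])
narrow = select_used_columns(
    wide, ["bill_length_mm:Q", "bill_depth_mm", "species:N"]
)
print(narrow.column_names())
```

With a real ibis table, `wide` would be `t.__dataframe__()`, and the narrowed object could then be handed to `pa.interchange.from_dataframe`, so only the needed columns are materialized.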
