Dataframe namespaces #23
@datapythonista Great writeup. What do you see as the biggest benefit of defining things this way? From my perspective it feels a bit cumbersome, but I can see the value in making it visible that the shape (dimension) of the data will change (in this example) without needing to understand the details. I don't think I'm leaning in any particular way yet, but I want to understand your thoughts.
Thanks @devin-petersohn, I 100% agree with your comments. The syntax is trickier than some other implementations, and we also need to consider extra parameters. The main advantages I see are:
I think similar things can be achieved with another syntax. But to me, the main thing would be to use a divide-and-conquer approach to reduce the complexity. I see this as similar to what Linux achieves with the X server (and many other components). You're building a complex piece of software, and instead of dealing with all the complexity of desktops, you just create an interface to build on top of. Besides allowing a free market of software to interact with yours, the complexity is reduced dramatically, and the modularity makes changes much simpler.

Also, I think we could end up having an independent project for reductions (and maps, like string functions...). If every dataframe library needs the same reductions, and all of them are implemented on top of the buffer protocol, array API or whatever, it feels like it could be healthier for the ecosystem to have a common dependency (better maintained, fewer bugs, more optimized...).

So, in summary, even if I also find the functional API somewhat cumbersome, I think it's the one that best captures all these goals and advantages, and makes things simpler. Certainly for the implementation, I'd say, but also for users, who are presented with a more structured interface.
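The shared-reductions idea can be sketched roughly. Everything below (the function names, the `ToyColumn` class, the protocol) is hypothetical, just to illustrate reductions written once against a minimal column protocol and reused by any conforming dataframe library:

```python
# Hypothetical sketch: a library-agnostic reductions module. Any dataframe
# library whose columns expose a minimal protocol (here, just iteration)
# could reuse these instead of reimplementing them.

def reduce_sum(column):
    """Sum a column from any library exposing iteration over values."""
    total = 0
    for value in column:
        total += value
    return total

def reduce_max(column):
    """Maximum of a column, again relying only on iteration."""
    return max(column)

class ToyColumn:
    """Stand-in for a column object from any conforming library."""
    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        return iter(self._values)

print(reduce_sum(ToyColumn([1, 2, 3])))  # 6
print(reduce_max(ToyColumn([1, 2, 3])))  # 3
```

A real version would presumably build on the buffer protocol or the array API rather than plain iteration, but the dependency structure would be the same.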
Small comment: the
A quick count with
To be more specific on the figure I gave: there are around 200 methods for dataframe, and some more than that for series (many are the same, but I guess the union is probably between 250 and 300). And then there are the accessors, around 50 string methods, and around 50 more for datetime. So, if we implement most things in pandas, merge series functionality into a single dataframe class, and don't use accessors (everything is a direct method of dataframe), I guess we're talking about that order of magnitude (300 or more).
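For what it's worth, the figure can be roughly reproduced by counting public attributes on the pandas classes (exact numbers vary by pandas version, so treat these as approximate):

```python
import pandas as pd

# Public (non-underscore) attributes on each class; most are methods.
df_names = {name for name in dir(pd.DataFrame) if not name.startswith('_')}
series_names = {name for name in dir(pd.Series) if not name.startswith('_')}

print(len(df_names))                  # roughly 200 on recent pandas
print(len(df_names | series_names))   # the union, closer to 300
```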
This one seems pretty horrifying to me. This is giving end users and consumer libraries the ability to basically monkeypatch the DataFrame object. This is Python so anyone can monkeypatch things anyway (unless things are implemented as a compiled extension, then the language won't let you), but providing an API to do this seems very odd. If pandas would like to do that it's of course free to do so, but I'd much prefer to not standardize such a pattern.
pandas' accessors are a bit more structured than monkeypatching since we don't let you overwrite existing methods :)

But agreed that they are not appropriate or necessary for the API standard.
On Thu, Jul 30, 2020 at 6:28 AM Ralf Gommers wrote:

> Also, I think it would be good that the API can be extended easily. A couple of examples of how pandas can be extended with custom functions:
>
> `df.apply(my_custom_function)`: this is not API-extending I'd say, this is regular use of the apply method.
>
> ```python
> @pd.api.extensions.register_dataframe_accessor('my_accessor')
> class MyAccessor:
>     def my_custom_method(self):
>         return True
>
> df.my_accessor.my_custom_method()
> ```
>
> This one seems pretty horrifying to me. This is giving end users and consumer libraries the ability to basically monkeypatch the DataFrame object. This is Python so anyone can monkeypatch things anyway (unless things are implemented as a compiled extension, then the language won't let you), but providing an API to do this seems very odd. If pandas would like to do that it's of course free to do so, but I'd much prefer to not standardize such a pattern.
Some comments made in the meeting:
Is this still being considered? I'd love using an API with few methods that appear in the autocomplete. I'd vote for the accessor approach:

```python
df.reduce.sum()  # or add
```

as it can include the functional one by making the accessor itself callable:

```python
df.reduce(np.sum)
```

Then, when using the autocomplete, we would only see a short list of primitive actions to perform on a DataFrame (reduce, transform, plot, export, etc).

A drawback of this would be when using auto-formatters, which can split the accessor and the chosen method onto separate lines. Instead of this:

```python
(
    df
    .fill_nan(0)
    .max()
    .sort()
)
```

we would write this:

```python
(
    df
    .transform.fill_nan(0)
    .reduce.max()
    .transform.sort()
)
```

but it would be auto-formatted to this:

```python
(
    df
    .transform
    .fill_nan(0)
    .reduce
    .max()
    .transform
    .sort()
)
```
Nope. But if you like great autocomplete and static typing support, you may be interested in Narwhals.

Going back to the example in the original post, it can be expressed as:
In #10, it's been discussed that it would be convenient if the dataframe API allows method chaining. For example:
This implies that most functionality is implemented as methods of the dataframe class. Based on pandas, the number of methods can be 300 or more, so it may be problematic to implement everything in the same namespace. pandas uses a mixed approach, with different techniques to try to organize the API.
Approaches
Top-level methods
Many of the methods are simply implemented directly as methods of dataframe.
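For instance, in pandas most operations are methods on `DataFrame` itself, which is what makes chaining natural:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, None, 3.0]})

# fillna and sum are both direct methods of DataFrame, so they chain
result = df.fillna(0).sum()
print(result['a'])  # 4.0
```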
Prefixed methods
Some of the methods are grouped with a common prefix.
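An example of this in pandas is the `to_*` family of export methods (`to_csv`, `to_json`, `to_dict`, `to_numpy`, ...), which share a prefix but otherwise live flat on the class:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})

# Export methods are grouped only by their common to_ prefix
print(df.to_csv(index=False))
print(df.to_dict(orient='list'))  # {'a': [1, 2]}
```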
Accessors
Accessors are a property of dataframe (or series, but assuming only one dataframe class for simplicity) that groups some methods under it.
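pandas' `.str` and `.dt` accessors are the canonical examples: the methods live on the accessor object, not on `Series` directly:

```python
import pandas as pd

s = pd.Series(['foo', 'bar'])
# String methods are grouped under the .str accessor
print(s.str.upper().tolist())  # ['FOO', 'BAR']

dates = pd.Series(pd.to_datetime(['2020-07-30', '2021-01-01']))
# Datetime methods are grouped under the .dt accessor
print(dates.dt.year.tolist())  # [2020, 2021]
```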
Functions
In some cases, functions are used instead of methods.
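In pandas, operations that combine or reshape dataframes, such as `concat`, are exposed as module-level functions rather than methods:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1]})
df2 = pd.DataFrame({'a': [2]})

# concat is a function in the pandas namespace, not a DataFrame method
combined = pd.concat([df1, df2], ignore_index=True)
print(combined['a'].tolist())  # [1, 2]
```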
Functional API
pandas also provides a more functional API, where functions can be passed as parameters
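For example, `DataFrame.apply` and `DataFrame.agg` take the function to run as a parameter (`agg` also accepts known functions by name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# The operation itself is passed in as an argument
print(df.apply(np.sum)['a'])   # 6
print(df.agg('max')['a'])      # 3, known functions can be named by string
```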
Standard API
I guess we will all agree that a uniform and consistent API would be better for the standard. That should make things easier to implement, and also provide a more intuitive experience for the user.
Also, I think it would be good if the API can be extended easily. A couple of examples of how pandas can be extended with custom functions:
Conceptually, I think there are some methods that should go together, grouped not so much by topic as by the API they follow. The clearest example is reductions; there was some discussion in #11 (comment).
I think no solution will be perfect, and the options that we have are (feel free to add to the list if I'm missing any option worth considering):
Top-level methods
Prefixed methods
Accessors
Functions
(`mod` represents the implementation module, e.g. `pandas`)
Functional API
Personally, my preference is the functional API. I think it's the simplest that keeps things organized, and the simplest to extend. The main drawback is its readability; it may be too verbose. There is the option to allow using a string instead of the function for known functions (e.g. `df.reduce('sum')`).

Thoughts? Other ideas?
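A minimal sketch of what that could look like, with a toy (entirely hypothetical) `DataFrame` whose `reduce` accepts either a callable or the name of a known reduction:

```python
# Hypothetical sketch of the functional API with string aliases for known
# reductions. The DataFrame class is a toy stand-in, not any real library.
_KNOWN_REDUCTIONS = {
    'sum': sum,
    'max': max,
}

class DataFrame:
    def __init__(self, data):
        self._data = data  # dict of column name -> list of values

    def reduce(self, func):
        # Allow df.reduce('sum') as shorthand for df.reduce(sum)
        if isinstance(func, str):
            func = _KNOWN_REDUCTIONS[func]
        return {name: func(values) for name, values in self._data.items()}

df = DataFrame({'a': [1, 2, 3]})
print(df.reduce('sum'))  # {'a': 6}
print(df.reduce(max))    # {'a': 3}
```

The string form keeps chains readable for common cases while the callable form keeps the API open to arbitrary user functions.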