Dataframe namespaces #23
@datapythonista Great writeup. What do you see as the biggest benefit of defining things this way? From my perspective it feels a bit cumbersome, but I can see the value in making it visible that the shape (dimension) of the data will change (in this example) without needing to understand the details. I don't think I'm leaning in any particular way yet, but I want to understand your thoughts.
Thanks @devin-petersohn, I 100% agree with your comments. The syntax is trickier than some other implementations, and we also need to consider extra parameters. The main advantages I see are:
I think similar things can be achieved with another syntax. But to me, the main thing would be to use a divide-and-conquer approach to reduce the complexity. I see this as similar to what Linux achieves with the X server (and many other components). You're building a complex piece of software, and instead of dealing with all the complexity of desktops, you just create an interface to build on top of. Besides allowing a free market of software to interact with yours, the complexity is reduced dramatically, and the modularity makes changes much simpler.

Also, I think we could end up having an independent project for reductions (and maps, like string functions...). If every dataframe library needs the same reductions, and all of them are implemented on top of the buffer protocol, array API or whatever, it feels like it could be healthier for the ecosystem to have a common dependency (better maintained, fewer bugs, more optimized...).

So, in summary, even if I also find the functional API somewhat cumbersome, I think it's the one that best captures all these goals and advantages, and makes things simpler. Certainly for the implementation, I'd say, but also for users, who are presented with a more structured interface.
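The shared-reductions idea can be sketched roughly. Everything below (the function names, the `ToyColumn` class, the protocol) is hypothetical, just to illustrate reductions written once against a minimal column protocol and reused by any conforming dataframe library:

```python
# Hypothetical sketch: a library-agnostic reductions module. Any dataframe
# library whose columns expose a minimal protocol (here, just iteration)
# could reuse these instead of reimplementing them.

def reduce_sum(column):
    """Sum a column from any library exposing iteration over values."""
    total = 0
    for value in column:
        total += value
    return total

def reduce_max(column):
    """Maximum of a column, again relying only on iteration."""
    return max(column)

class ToyColumn:
    """Stand-in for a column object from any conforming library."""
    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        return iter(self._values)

print(reduce_sum(ToyColumn([1, 2, 3])))  # 6
print(reduce_max(ToyColumn([1, 2, 3])))  # 3
```

A real version would presumably build on the buffer protocol or the array API rather than plain iteration, but the dependency structure would be the same.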
Small comment: the
A quick count with
To be more specific on the figure I gave: there are around 200 methods for dataframe, and some more than that for series (many are the same, but I guess the union is probably between 250 and 300). And then there are the accessors, around 50 string methods, and around 50 more for datetime. So, if we implement most things in pandas, merge series functionality into a single dataframe class, and don't use accessors (everything is a direct method of dataframe), I guess we're talking about that order of magnitude (300 or more).
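For what it's worth, the figure can be roughly reproduced by counting public attributes on the pandas classes (exact numbers vary by pandas version, so treat these as approximate):

```python
import pandas as pd

# Public (non-underscore) attributes on each class; most are methods.
df_names = {name for name in dir(pd.DataFrame) if not name.startswith('_')}
series_names = {name for name in dir(pd.Series) if not name.startswith('_')}

print(len(df_names))                  # roughly 200 on recent pandas
print(len(df_names | series_names))   # the union, closer to 300
```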
This one seems pretty horrifying to me. This is giving end users and consumer libraries the ability to basically monkeypatch the DataFrame object. This is Python so anyone can monkeypatch things anyway (unless things are implemented as a compiled extension, then the language won't let you), but providing an API to do this seems very odd. If pandas would like to do that it's of course free to do so, but I'd much prefer to not standardize such a pattern.
pandas' accessors are a bit more structured than monkeypatching since we don't let you overwrite existing methods :)

But agreed that they are not appropriate or necessary for the API standard.
On Thu, Jul 30, 2020 at 6:28 AM Ralf Gommers wrote:

> Also, I think it would be good that the API can be extended easily. A couple of examples of how pandas can be extended with custom functions:
>
> `df.apply(my_custom_function)`: this is not API-extending I'd say, this is regular use of the apply method.
>
> ```python
> @pd.api.extensions.register_dataframe_accessor('my_accessor')
> class MyAccessor:
>     def my_custom_method(self):
>         return True
>
> df.my_accessor.my_custom_method()
> ```
>
> This one seems pretty horrifying to me. This is giving end users and consumer libraries the ability to basically monkeypatch the DataFrame object. This is Python so anyone can monkeypatch things anyway (unless things are implemented as a compiled extension, then the language won't let you), but providing an API to do this seems very odd. If pandas would like to do that it's of course free to do so, but I'd much prefer to not standardize such a pattern.
Some comments made in the meeting:
Is this still being considered? I'd love using an API with few methods that appear in the autocomplete. I'd vote for the accessor approach:

```python
df.reduce.sum()  # or add
```

as it can include the functional one by making the accessor itself callable:

```python
df.reduce(np.sum)
```

Then, when using the autocomplete, we would only see a short list of primitive actions to perform on a DataFrame (reduce, transform, plot, export, etc).

A drawback of this would be when using auto-formatters, which can split the accessor and the chosen method onto separate lines. Instead of this:

```python
(
    df
    .fill_nan(0)
    .max()
    .sort()
)
```

we would write this:

```python
(
    df
    .transform.fill_nan(0)
    .reduce.max()
    .transform.sort()
)
```

but it would be auto-formatted to this:

```python
(
    df
    .transform
    .fill_nan(0)
    .reduce
    .max()
    .transform
    .sort()
)
```
Nope. But if you like great autocomplete and static typing support, you may be interested in Narwhals.

Going back to the example in the original post, it can be expressed as:
In #10, it's been discussed that it would be convenient if the dataframe API allows method chaining. For example:
This implies that most functionality is implemented as methods of the dataframe class. Based on pandas, the number of methods can be 300 or more, so it may be problematic to implement everything in the same namespace. pandas uses a mixed approach, with different techniques to try to organize the API.
Approaches
Top-level methods
Many of the methods are simply implemented directly as methods of dataframe.
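For instance, in pandas most operations are methods on `DataFrame` itself, which is what makes chaining natural:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, None, 3.0]})

# fillna and sum are both direct methods of DataFrame, so they chain
result = df.fillna(0).sum()
print(result['a'])  # 4.0
```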
Prefixed methods
Some of the methods are grouped with a common prefix.
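An example of this in pandas is the `to_*` family of export methods (`to_csv`, `to_json`, `to_dict`, `to_numpy`, ...), which share a prefix but otherwise live flat on the class:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})

# Export methods are grouped only by their common to_ prefix
print(df.to_csv(index=False))
print(df.to_dict(orient='list'))  # {'a': [1, 2]}
```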
Accessors
Accessors are a property of dataframe (or series, but assuming only one dataframe class for simplicity) that groups some methods under it.
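pandas' `.str` and `.dt` accessors are the canonical examples: the methods live on the accessor object, not on `Series` directly:

```python
import pandas as pd

s = pd.Series(['foo', 'bar'])
# String methods are grouped under the .str accessor
print(s.str.upper().tolist())  # ['FOO', 'BAR']

dates = pd.Series(pd.to_datetime(['2020-07-30', '2021-01-01']))
# Datetime methods are grouped under the .dt accessor
print(dates.dt.year.tolist())  # [2020, 2021]
```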
Functions
In some cases, functions are used instead of methods.
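In pandas, operations that combine or reshape dataframes, such as `concat`, are exposed as module-level functions rather than methods:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1]})
df2 = pd.DataFrame({'a': [2]})

# concat is a function in the pandas namespace, not a DataFrame method
combined = pd.concat([df1, df2], ignore_index=True)
print(combined['a'].tolist())  # [1, 2]
```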
Functional API
pandas also provides a more functional API, where functions can be passed as parameters
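For example, `DataFrame.apply` and `DataFrame.agg` take the function to run as a parameter (`agg` also accepts known functions by name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# The operation itself is passed in as an argument
print(df.apply(np.sum)['a'])   # 6
print(df.agg('max')['a'])      # 3, known functions can be named by string
```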
Standard API
I guess we will all agree that a uniform and consistent API would be better for the standard. That should make things easier to implement, and also provide a more intuitive experience for the user.
Also, I think it would be good if the API can be extended easily. A couple of examples of how pandas can be extended with custom functions:
Conceptually, I think there are some methods that should go together, grouped not so much by topic as by the API they follow. The clearest example is reductions; there was some discussion in #11 (comment).
I think no solution will be perfect, and the options that we have are (feel free to add to the list if I'm missing any option worth considering):
Top-level methods
Prefixed methods
Accessors
Functions
(`mod` represents the implementation module, e.g. `pandas`)
Functional API
Personally, my preference is the functional API. I think it's the simplest that keeps things organized, and the simplest to extend. The main drawback is its readability; it may be too verbose. There is the option to allow using a string instead of the function for known functions (e.g. `df.reduce('sum')`).

Thoughts? Other ideas?
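A minimal sketch of what that could look like, with a toy (entirely hypothetical) `DataFrame` whose `reduce` accepts either a callable or the name of a known reduction:

```python
# Hypothetical sketch of the functional API with string aliases for known
# reductions. The DataFrame class is a toy stand-in, not any real library.
_KNOWN_REDUCTIONS = {
    'sum': sum,
    'max': max,
}

class DataFrame:
    def __init__(self, data):
        self._data = data  # dict of column name -> list of values

    def reduce(self, func):
        # Allow df.reduce('sum') as shorthand for df.reduce(sum)
        if isinstance(func, str):
            func = _KNOWN_REDUCTIONS[func]
        return {name: func(values) for name, values in self._data.items()}

df = DataFrame({'a': [1, 2, 3]})
print(df.reduce('sum'))  # {'a': 6}
print(df.reduce(max))    # {'a': 3}
```

The string form keeps chains readable for common cases while the callable form keeps the API open to arbitrary user functions.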