Switch dataframe constructor to use dispatch #32844
Conversation
pandas/core/frame.py (outdated)
```diff
@@ -36,6 +37,7 @@
 import numpy as np
 import numpy.ma as ma
+import numpy.ma.mrecords as mrecords
```
I think we import this lazily to speed up import time.
I thought maybe that was the case. However, looking at the PR that added the import (https://github.com/pandas-dev/pandas/pull/5579/files), I didn't see any discussion about import time. Do you know of any benchmarks I could run to back this up?
If it's a deal breaker, I can move the mrecords check into the ma check like it was in the original code.
I like the idea; the code looks much cleaner and more readable (besides being extensible). Not sure if the pytest errors are caused by your changes (the linting problems are from this PR; the docs failure is unrelated and should be fixed now).
I have moved this to the …. The tests are passing locally on my Mac with ….
Can you add a test where you import dask and register this like in your example? See test_downstream.py; for now we skip if it's not available.
I'm not opposed to the idea in principle, but I don't think we should be publicly exposing the block manager in any way to downstream libraries, even if it's just for libraries and not users. I think a top-level `dataframe` function would let dask write something like

```python
@dataframe.register
def _(data: dask.dataframe.DataFrame, index=None, columns=None):
    return pd.DataFrame(data.compute(), index=index, columns=columns)
```

without dask having to know or care what a BlockManager is. And the default implementation is just our current DataFrame constructor.
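A self-contained sketch of that idea, using `functools.singledispatch`. The `LazyFrame` class below is a hypothetical stand-in for something like `dask.dataframe.DataFrame` (an assumption for illustration, so the example runs without dask installed):

```python
from functools import singledispatch

import pandas as pd


@singledispatch
def dataframe(data, index=None, columns=None):
    # Default implementation: defer to the existing constructor.
    return pd.DataFrame(data, index=index, columns=columns)


class LazyFrame:
    # Hypothetical third-party container (think dask.dataframe.DataFrame).
    def __init__(self, records):
        self._records = records

    def compute(self):
        return pd.DataFrame(self._records)


@dataframe.register
def _(data: LazyFrame, index=None, columns=None):
    # The third-party library registers its own conversion; it never
    # touches a BlockManager.
    return data.compute()


df = dataframe(LazyFrame([{"a": 1}, {"a": 2}]))
```

Here `dataframe(LazyFrame(...))` routes through the registered overload, while `dataframe({"a": [1]})` falls through to the plain constructor.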
Seconded.
So not change the existing constructor at all, just add a separate single-dispatch function that by default just calls the constructor?
Yep.
Would that maybe be more confusing for users then, since there are now two public methods of constructing dataframes instead of one? Also, as you see from the example, the dask authors don't have to know what a block_manager is, because they can just call ….
Would it make sense to do both separately? What we've got now for the block manager constructor, but private, so the code is cleaner. And then a public ….
Sure, but separately is the key :) Then we can decide whether we like the singledispatch-style code structure better than the status quo, without having to worry about API discussions. We'll also need to measure performance.
It's not clear to me how singledispatch works on instance methods like …. I'd also like a bit more research on who would benefit from a dispatchable ….
There is a ….
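The reply here presumably refers to `functools.singledispatchmethod` (added in Python 3.8), which handles dispatch on instance methods by looking at the first non-`self` argument. A minimal sketch, with a toy class that is purely illustrative:

```python
from functools import singledispatchmethod


class Frame:
    # Toy class: dispatch happens on the type of `data`, not `self`.
    @singledispatchmethod
    def from_data(self, data):
        return ("generic", data)

    @from_data.register
    def _(self, data: list):
        return ("list", data)

    @from_data.register
    def _(self, data: dict):
        return ("dict", data)


f = Frame()
```

Calling `f.from_data([1])` hits the list overload, `f.from_data({})` the dict overload, and anything else falls through to the generic implementation.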
The supposed user would be downstream libraries like sklearn, so they could call ….
Is scikit-learn discussing turning arbitrary objects into pandas dataframes? Currently scikit-learn/enhancement_proposals#37 is discussing pandas in -> pandas out. There's some discussion there about other objects (e.g. xarray DataArray and Dataset), but it seems just as likely that scikit-learn will develop an interface for preserving the type of the input, rather than coercing to a DataFrame.
Yep, that's a possibility. I was hoping that by providing this alternative, it would help pull out the use cases for sklearn that would benefit from a protocol that wouldn't be addressed by this proposal. I.e., why don't they just depend on pandas? Is it that they don't want the hard dependency? Or is there some other reason?
Theoretically a nice idea, but closing as stale. If you want to continue, please open a new PR.
Opened #34799 for discussing the idea of a singly-dispatched ….
This is an attempt to add extensibility to the `DataFrame` constructor so that third party libraries can register their own ways of converting to a Pandas DataFrame. It does this by creating a `singledispatch` function that is used in the constructor.

For example, Dask could implement the function like this:

Then, if a downstream library tries to construct a Pandas dataframe from a dask dataframe, it will work:
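The PR description's code examples were lost in extraction. As a stand-in, here is a self-contained sketch of the mechanism being proposed: the constructor routes its input through a single-dispatch conversion function that third parties can extend. The names `_to_arrays`, `MiniFrame`, and `Delayed` are all hypothetical simplifications, not the PR's actual code:

```python
from functools import singledispatch


@singledispatch
def _to_arrays(data):
    # Fallback: assume data is already in a supported form. (Hypothetical
    # simplification of the PR's dispatchable conversion function.)
    return data


class MiniFrame:
    # Toy stand-in for DataFrame: its constructor calls the dispatchable
    # function, so registration extends the constructor itself.
    def __init__(self, data):
        self.data = _to_arrays(data)


class Delayed:
    # Hypothetical third-party container (think a dask dataframe).
    def __init__(self, payload):
        self.payload = payload


# The third-party library registers its conversion once at import time...
@_to_arrays.register
def _(data: Delayed):
    return data.payload


# ...and then plain construction from its objects "just works".
mf = MiniFrame(Delayed({"a": [1, 2]}))
```

The key property is that registration is additive: ordinary inputs still take the default path, while `Delayed` inputs take the registered one.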
This is in response to the thread about providing a protocol for dataframes, to present an alternative for the underlying use case. The alternative: libraries just call `pandas.DataFrame` on their input data to see if it can be turned into a dataframe. It doesn't try to solve any sort of out-of-core dataframe API conversation, and it does require all libraries to have Pandas as a hard dependency.
- `black pandas`
- `git diff upstream/master -u -- "*.py" | flake8 --diff`