Switch dataframe constructor to use dispatch #32844

Closed
wants to merge 14 commits into from

Conversation

saulshanabrook
Contributor

@saulshanabrook saulshanabrook commented Mar 19, 2020

This is an attempt to add extensibility to the DataFrame constructor so that third party libraries can register their own ways of converting to a Pandas Dataframe. It does this by creating a singledispatch function that is used in the constructor.

For example, Dask could implement the function like this:

from pandas.core.construction import create_dataframe
import dask.dataframe

# Materialize the dask DataFrame, then dispatch again on the resulting pandas object.
@create_dataframe.register
def _create_dataframe_dask(data: dask.dataframe.DataFrame, *args, **kwargs):
    return create_dataframe(data.compute(), *args, **kwargs)

Then, if a downstream library tries to construct a Pandas dataframe from a dask dataframe, it will work:

import dask
import pandas

df = dask.datasets.timeseries()

assert isinstance(pandas.DataFrame(df), pandas.DataFrame)
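
To show the mechanism without depending on pandas or dask, here is a minimal stdlib-only sketch. `Frame`, `LazyFrame`, and `create_frame_data` are hypothetical stand-ins (for `pandas.DataFrame`, a third-party dataframe, and this PR's `create_dataframe`); this is an illustration of the idea, not the code in the PR.

```python
from functools import singledispatch


@singledispatch
def create_frame_data(data):
    """Fallback: reject input types nobody has registered."""
    raise TypeError(f"cannot build a Frame from {type(data).__name__}")


@create_frame_data.register
def _from_dict(data: dict):
    # Column name -> list of values, copied so the Frame owns its data.
    return {name: list(values) for name, values in data.items()}


class Frame:
    """Hypothetical stand-in for pandas.DataFrame."""

    def __init__(self, data):
        # The constructor delegates to the dispatch function.
        self.columns = create_frame_data(data)


class LazyFrame:
    """Hypothetical stand-in for a third-party, lazily evaluated dataframe."""

    def __init__(self, columns):
        self._columns = columns

    def compute(self):
        return dict(self._columns)


# What a third-party library would add, without touching Frame itself.
@create_frame_data.register
def _from_lazy(data: LazyFrame):
    return create_frame_data(data.compute())


frame = Frame(LazyFrame({"x": [1, 2, 3]}))
print(frame.columns)
```

The constructor never needs to know about `LazyFrame`; the registration is the only coupling between the two libraries.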

This is a response to the thread about providing a protocol for dataframes; it presents an alternative for the underlying use case. The alternative is:

  1. Force libraries like scikit-learn to depend on pandas
  2. Have them call pandas.DataFrame on their input data to see if it can be turned into a dataframe
  3. Have third party libraries with alternative dataframe implementations register themselves with the function provided here.

It doesn't try to solve any sort of out-of-core dataframe API conversation and it does require all libraries to have Pandas as a hard dependency.

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
  • Review name of function

@@ -36,6 +37,7 @@

import numpy as np
import numpy.ma as ma
import numpy.ma.mrecords as mrecords
Member

i think we import this lazily to speed up import time

Contributor Author

I thought that might be the case. However, looking at the PR that added the import (https://github.com/pandas-dev/pandas/pull/5579/files), I didn't see any discussion about import time. Do you know of any benchmarks I could run to back this up?

If it's a deal breaker, I can move the mrecords check into the ma check like it was in the original code.
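
One way to get a rough number for the import-time question is to time the import in a fresh interpreter. The helper below is a hypothetical sketch, not an existing pandas benchmark. It is demonstrated on the stdlib `json` module so that it runs anywhere; `numpy.ma.mrecords` would be substituted where NumPy is installed.

```python
import subprocess
import sys


def import_seconds(module: str, repeats: int = 3) -> float:
    """Best-of-N wall time to import `module` in a fresh interpreter."""
    snippet = (
        "import time; start = time.perf_counter(); "
        f"import {module}; print(time.perf_counter() - start)"
    )
    timings = []
    for _ in range(repeats):
        # A new process each time, so nothing is served from sys.modules.
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            text=True,
            check=True,
        )
        timings.append(float(result.stdout.strip()))
    return min(timings)


elapsed = import_seconds("json")
print(f"import json: {elapsed * 1000:.2f} ms")
```

Python 3.7+ also ships `python -X importtime -c "import <module>"`, which prints a per-module breakdown and may be more informative than a single total.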

@datapythonista datapythonista added Clean DataFrame DataFrame data structure labels Mar 19, 2020
@datapythonista
Member

I like the idea; the code looks much cleaner and more readable (besides being extensible).

Not sure if the pytest errors are caused by your changes (the linting problems are from this PR; the docs failure is unrelated and should be fixed now).

@saulshanabrook saulshanabrook marked this pull request as ready for review March 20, 2020 00:05
@saulshanabrook saulshanabrook requested a review from jreback March 20, 2020 00:10
@saulshanabrook
Contributor Author

I have moved this to the construction.py file, added a test for it, and changed it to just accept the dataframe class, instead of the instance.

The tests are passing locally on my mac with ./test_fast.sh but they seem to be failing in CI with some error that is maybe hiding the underlying error?

INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/_pytest/main.py", line 191, in wrap_session
INTERNALERROR>     session.exitstatus = doit(config, session) or 0
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/_pytest/main.py", line 247, in _main
INTERNALERROR>     config.hook.pytest_runtestloop(session=session)
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pluggy/hooks.py", line 286, in __call__
INTERNALERROR>     return self._hookexec(self, self.get_hookimpls(), kwargs)
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pluggy/manager.py", line 93, in _hookexec
INTERNALERROR>     return self._inner_hookexec(hook, methods, kwargs)
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pluggy/manager.py", line 87, in <lambda>
INTERNALERROR>     firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pluggy/callers.py", line 208, in _multicall
INTERNALERROR>     return outcome.get_result()
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pluggy/callers.py", line 80, in get_result
INTERNALERROR>     raise ex[1].with_traceback(ex[2])
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pluggy/callers.py", line 187, in _multicall
INTERNALERROR>     res = hook_impl.function(*args)
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/xdist/dsession.py", line 112, in pytest_runtestloop
INTERNALERROR>     self.loop_once()
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/xdist/dsession.py", line 135, in loop_once
INTERNALERROR>     call(**kwargs)
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/xdist/dsession.py", line 272, in worker_collectreport
INTERNALERROR>     self._failed_worker_collectreport(node, rep)
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/xdist/dsession.py", line 302, in _failed_worker_collectreport
INTERNALERROR>     if rep.longrepr not in self._failed_collection_errors:
INTERNALERROR> TypeError: unhashable type: 'ExceptionChainRepr'
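
One plausible reading of the last line (an assumption, not something confirmed in this thread) is that the membership check in `_failed_worker_collectreport` needs to hash the report, and the report object is not hashable, so the `TypeError` masks whatever collection error the report was carrying. The sketch below reproduces that failure mode with a hypothetical `Report` class; it is not pytest or xdist code.

```python
class Report:
    """Hypothetical stand-in for an error-report object."""

    def __init__(self, message):
        self.message = message

    def __eq__(self, other):
        # Defining __eq__ without __hash__ sets __hash__ to None,
        # which makes instances unhashable.
        return isinstance(other, Report) and self.message == other.message


seen = set()
report = Report("ImportError while collecting tests")

try:
    # Membership tests on a set hash the candidate first.
    if report not in seen:
        seen.add(report)
    outcome = "stored"
except TypeError as exc:
    outcome = str(exc)

print(outcome)
```

If that reading is right, the original collection error would still have to be found by running the failing test module without xdist.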

@jreback
Contributor

jreback commented Mar 20, 2020

can u add a test where you import dask and register this like in your example

see test_downstream.py for now we skip if it’s not available

@TomAugspurger
Contributor

TomAugspurger commented Mar 20, 2020

I'm not opposed to the idea in principle, but I don't think we should be publicly exposing the block manager in any way to downstream libraries, even if it's just for libraries and not users.

I think a top-level pd.dataframe function which returns a DataFrame is much friendlier to single dispatch. Then we could have

# pd.dataframe is the proposed singledispatch function; it does not exist yet.
@pd.dataframe.register
def _dataframe_dask(data: dask.dataframe.DataFrame, index, columns, ...):
    return pd.DataFrame(data.compute(), index, columns, ...)

without dask having to know or care what a BlockManager is.

And the default implementation is just our current DataFrame constructor.
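
A minimal stdlib-only sketch of that shape, with the constructor left untouched and the dispatch living in a separate top-level function. `Frame`, `LazyFrame`, and `frame` are hypothetical stand-ins for `pandas.DataFrame`, a dask-like dataframe, and the proposed `pd.dataframe`.

```python
from functools import singledispatch


class Frame:
    """Hypothetical stand-in for pandas.DataFrame; its constructor is unchanged."""

    def __init__(self, data, index=None):
        self.data = dict(data)
        self.index = index


@singledispatch
def frame(data, index=None):
    # Default implementation: simply call the existing constructor.
    return Frame(data, index=index)


class LazyFrame:
    """Hypothetical stand-in for a third-party, lazily evaluated dataframe."""

    def __init__(self, data):
        self._data = data

    def compute(self):
        return self._data


# Registered by the third-party library; it only uses the public constructor.
@frame.register
def _frame_from_lazy(data: LazyFrame, index=None):
    return Frame(data.compute(), index=index)


eager = frame({"x": [1, 2]})
lazy = frame(LazyFrame({"x": [1, 2]}), index=["a", "b"])
print(type(eager).__name__, type(lazy).__name__, lazy.index)
```

Note that `functools.singledispatch` looks only at the type of the first positional argument, so `data` has to be passed positionally for the dispatch to happen.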

@jbrockmendel
Member

but I don't think we should be publicly exposing the block manager in any way to downstream libraries, even if it's just for libraries and not users.

Seconded.

@saulshanabrook
Contributor Author

I'm not opposed to the idea in principle, but I don't think we should be publicly exposing the block manager in any way to downstream libraries, even if it's just for libraries and not users.

I think a top-level pd.dataframe function which returns a DataFrame is much friendlier to single dispatch.

So not change the existing constructor at all, just add a separate single dispatch function that by default just calls the constructor?

@simonjayhawkins simonjayhawkins added Constructors Series/DataFrame/Index/pd.array Constructors Needs Discussion Requires discussion from core team before further action labels Mar 20, 2020
@TomAugspurger
Contributor

Yep.

@saulshanabrook
Contributor Author

saulshanabrook commented Mar 20, 2020

Would that maybe be more confusing for users then, since there are now two public methods of constructing dataframes instead of one?

Also, as you can see from the example, the dask authors don't have to know what a block_manager is, because they can just call create_dataframe recursively with a dataframe itself.

@datapythonista
Member

Would it make sense to do both separately?

What we've got now for the block manager constructor, but private, so the code is cleaner.

And then a public create_dataframe, pandas.dataframe, the __init__ itself, or whatever for other libraries to override the DataFrame constructor?

@TomAugspurger
Contributor

Would it make sense to do both separately?

Sure, but separately is the key :) Then we can decide whether we like the singledispatch-style code structure better than the status quo, without having to worry about API discussions. We'll also need to measure performance.

the __init__ itself, or whatever for other libraries to override the DataFrame constructor?

It's not clear to me how singledispatch works on instance methods like __init__, since the first argument is an uninitialized DataFrame.
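
The concern can be reproduced in a few lines. This is a sketch with a hypothetical `Frame` class, not pandas code: plain `functools.singledispatch` dispatches on the first positional argument, which for a method is `self`, so the registration for `dict` is never selected.

```python
from functools import singledispatch


class Frame:
    """Hypothetical stand-in for pandas.DataFrame."""

    @singledispatch
    def __init__(self, data):
        self.source = "default"

    @__init__.register(dict)
    def _(self, data):
        # Never reached: dispatch sees type(self), not type(data).
        self.source = "dict"


frame = Frame({"a": [1, 2]})
print(frame.source)  # default
```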


I'd also like a bit more research on who would benefit from a dispatchable pandas.dataframe method for creating dataframes. Who would be the consumers of that API? Dask is one candidate, but they've gotten along OK without pd.DataFrame(dask_dataframe) working (I recall one issue where a user was confused by it not working).

@saulshanabrook
Contributor Author

It's not clear to me how singledispatch works on instance methods like __init__, since the first argument is an uninitialized DataFrame.

There is a singledispatchmethod, which could work if it is just used internally. However, all overrides for it would then have to be other methods on the DataFrame class, so they would have to live there rather than in construction.py, and third party libraries wouldn't be able to add to them (which I guess maybe you want?)
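
For reference, a minimal sketch of the `functools.singledispatchmethod` variant, with the overrides written as methods on the class as described above. `Frame` is a hypothetical stand-in for `pandas.DataFrame`. Note that `singledispatchmethod` was added in Python 3.8, while the CI traceback earlier in this thread shows a Python 3.6 environment, so a backport or version bump would be needed.

```python
from functools import singledispatchmethod


class Frame:
    """Hypothetical stand-in for pandas.DataFrame."""

    @singledispatchmethod
    def __init__(self, data):
        # Fallback for input types without a registered override.
        raise TypeError(f"cannot build a Frame from {type(data).__name__}")

    @__init__.register
    def _from_dict(self, data: dict):
        # Dispatch is on the type of `data`, the first argument after self.
        self.columns = dict(data)

    @__init__.register
    def _from_list(self, data: list):
        # Treat a list of (name, values) pairs as columns.
        self.columns = {name: values for name, values in data}


from_dict = Frame({"x": [1, 2]})
from_pairs = Frame([("x", [1, 2])])
print(from_dict.columns == from_pairs.columns)
```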

I'd also like a bit more research on who would benefit from a dispatchable pandas.dataframe method for creating dataframes. Who would be the consumers of that API? Dask is one candidate, but they've gotten along OK without pd.DataFrame(dask_dataframe) working (I recall one issue where a user was confused by it not working).

The supposed user would be downstream libraries like sklearn, so they could call pd.DataFrame on their input data and have it be coerced to a pandas dataframe, if possible.

@TomAugspurger
Contributor

Is scikit-learn discussing turning arbitrary objects into pandas dataframes? Currently scikit-learn/enhancement_proposals#37 is discussing pandas in -> pandas out. There's some discussion there around other objects (e.g. xarray DataArray and Dataset), but it seems just as likely for scikit-learn to develop an interface for preserving the type of the input, rather than coercing to a DataFrame.

@saulshanabrook
Contributor Author

saulshanabrook commented Mar 20, 2020

it seems just as likely for scikit-learn to develop an interface for preserving the type of the input, rather than coercing to a DataFrame.

Yep that's a possibility.

I was hoping that by providing this alternative, it would help pull out the use cases for sklearn that would benefit from a protocol that wouldn't be addressed by this proposal.

I.e. why don't they just depend on pandas? Is it that they don't want the hard dependency? Or is there some other reason?

@jreback
Contributor

jreback commented Jun 14, 2020

theoretically a nice idea, but

closing as stale. If you want to continue, please open a new PR.

@TomAugspurger
Contributor

Opened #34799 for discussing the idea of a singly-dispatched pd.dataframe.

Labels
Clean Constructors Series/DataFrame/Index/pd.array Constructors DataFrame DataFrame data structure Needs Discussion Requires discussion from core team before further action
6 participants