Switch dataframe constructor to use dispatch #32844

saulshanabrook · 2020-03-19T21:55:08Z

This is an attempt to add extensibility to the DataFrame constructor so that third party libraries can register their own ways of converting to a Pandas Dataframe. It does this by creating a singledispatch function that is used in the constructor.

For example, Dask could implement the function like this:

from pandas.core.construction import create_dataframe
import dask.dataframe

@create_dataframe.register
def _create_dataframe_dask(data: dask.dataframe.DataFrame, *args, **kwargs):
    # Materialize the lazy dask frame, then re-dispatch on the pandas result.
    return create_dataframe(data.compute(), *args, **kwargs)

Then, if a downstream library tries to construct a Pandas dataframe from a dask dataframe, it will work:

import dask
import pandas

df = dask.datasets.timeseries()

assert isinstance(pandas.DataFrame(df), pandas.DataFrame)
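
For readers without dask installed, the same mechanism can be sketched with the standard library alone. This is only an illustration under stated assumptions: Table, LazyTable and create_table are hypothetical stand-ins for pandas.DataFrame, a third-party lazy frame and the create_dataframe function proposed here; none of them are pandas or dask APIs.

```python
from functools import singledispatch


@singledispatch
def create_table(data, columns=None):
    """Default path (hypothetical): treat `data` as an iterable of rows."""
    rows = [tuple(row) for row in data]
    return rows, list(columns) if columns is not None else None


class Table:
    """Toy stand-in for pandas.DataFrame; the constructor only dispatches."""

    def __init__(self, data, columns=None):
        self.rows, self.columns = create_table(data, columns)


class LazyTable:
    """Toy stand-in for a third-party lazy frame such as a dask DataFrame."""

    def __init__(self, make_rows):
        self._make_rows = make_rows

    def compute(self):
        return self._make_rows()


# The third-party library registers its own conversion, mirroring the dask
# example: materialize first, then re-dispatch on the concrete result.
@create_table.register
def _create_table_lazy(data: LazyTable, columns=None):
    return create_table(data.compute(), columns)


lazy = LazyTable(lambda: [[1, 2], [3, 4]])
table = Table(lazy, columns=["a", "b"])
```

Because the registered function calls create_table again, the third-party code never touches the internal representation that the default path builds.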

This is a response to the thread about providing a protocol for dataframes. It presents an alternative for the underlying use case. The alternative is:

Force libraries like scikit-learn to depend on pandas.
Have them call pandas.DataFrame on their input data to see if it can be turned into a dataframe.
Have third party libraries with alternative dataframe implementations register themselves with the function provided here.

It doesn't try to solve any sort of out-of-core dataframe API conversation and it does require all libraries to have Pandas as a hard dependency.

jbrockmendel · 2020-03-19T22:11:29Z

pandas/core/frame.py

@@ -36,6 +37,7 @@

 import numpy as np
 import numpy.ma as ma
+import numpy.ma.mrecords as mrecords


I think we import this lazily to speed up import time.

I thought maybe that was the case. However, looking at the PR that added the import (https://github.com/pandas-dev/pandas/pull/5579/files), I didn't see any discussion about import time. Do you know of any benchmarks I could run to back this up?

If it's a deal breaker, I can move the mrecords check into the ma check like it was in the original code.
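
One way to check the import-time concern, offered as a rough sketch rather than an established pandas benchmark, is CPython's -X importtime flag (Python 3.7+), which reports per-module import cost on stderr. The helper name import_time_us is made up for this illustration, and colorsys is used only as a placeholder module.

```python
import subprocess
import sys


def import_time_us(module: str) -> int:
    """Cumulative import time of `module`, in microseconds, in a fresh interpreter."""
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True,
        text=True,
        check=True,
    )
    prefix = "import time:"
    # Each stderr line looks like "import time: <self> | <cumulative> | <name>".
    for line in reversed(proc.stderr.splitlines()):
        if not line.startswith(prefix):
            continue
        _self_us, cumulative_us, name = (
            part.strip() for part in line[len(prefix):].split("|", 2)
        )
        if name == module:
            return int(cumulative_us)
    raise LookupError(f"{module!r} not found in importtime output")


elapsed = import_time_us("colorsys")
```

Calling it with "numpy.ma.mrecords" in an environment that has NumPy installed would give a number to compare against the total for "pandas"; a single run is noisy, so several runs should be averaged.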

datapythonista · 2020-03-19T22:44:07Z

I like the idea; the code looks much cleaner and more readable (besides being extensible).

Not sure if the pytest errors are caused by your changes (the linting problems are from this PR; the docs failure is unrelated and should be fixed now).

saulshanabrook · 2020-03-20T00:32:25Z

I have moved this to the construction.py file, added a test for it, and changed it to just accept the dataframe class, instead of the instance.

The tests are passing locally on my mac with ./test_fast.sh but they seem to be failing in CI with some error that is maybe hiding the underlying error?

INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/_pytest/main.py", line 191, in wrap_session
INTERNALERROR>     session.exitstatus = doit(config, session) or 0
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/_pytest/main.py", line 247, in _main
INTERNALERROR>     config.hook.pytest_runtestloop(session=session)
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pluggy/hooks.py", line 286, in __call__
INTERNALERROR>     return self._hookexec(self, self.get_hookimpls(), kwargs)
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pluggy/manager.py", line 93, in _hookexec
INTERNALERROR>     return self._inner_hookexec(hook, methods, kwargs)
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pluggy/manager.py", line 87, in <lambda>
INTERNALERROR>     firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pluggy/callers.py", line 208, in _multicall
INTERNALERROR>     return outcome.get_result()
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pluggy/callers.py", line 80, in get_result
INTERNALERROR>     raise ex[1].with_traceback(ex[2])
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pluggy/callers.py", line 187, in _multicall
INTERNALERROR>     res = hook_impl.function(*args)
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/xdist/dsession.py", line 112, in pytest_runtestloop
INTERNALERROR>     self.loop_once()
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/xdist/dsession.py", line 135, in loop_once
INTERNALERROR>     call(**kwargs)
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/xdist/dsession.py", line 272, in worker_collectreport
INTERNALERROR>     self._failed_worker_collectreport(node, rep)
INTERNALERROR>   File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.6/site-packages/xdist/dsession.py", line 302, in _failed_worker_collectreport
INTERNALERROR>     if rep.longrepr not in self._failed_collection_errors:
INTERNALERROR> TypeError: unhashable type: 'ExceptionChainRepr'

jreback · 2020-03-20T01:28:52Z

Can you add a test where you import dask and register this like in your example?

See test_downstream.py; for now we skip if it's not available.

TomAugspurger · 2020-03-20T02:00:20Z

I'm not opposed to the idea in principle, but I don't think we should be publicly exposing the block manager in any way to downstream libraries, even if it's just for libraries and not users.

I think a top-level pd.dataframe function which returns a DataFrame is much friendlier to single dispatch. Then we could have

@singledispatch.register
def dataframe(data: dask.dataframe.DataFrame, index, columns, ...):
    return pd.DataFrame(data.compute(), index, columns, ...)

without dask having to know or care what a BlockManager is.

And the default implementation is just our current DataFrame constructor.
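
A minimal runnable sketch of that shape, assuming nothing about pandas internals: Frame stands in for pandas.DataFrame, dataframe for the proposed pd.dataframe, and RemoteFrame for a third-party container such as a dask DataFrame. All three names are hypothetical. With functools.singledispatch, registration goes through the register attribute of the dispatching function itself.

```python
from functools import singledispatch


class Frame:
    """Toy stand-in for pandas.DataFrame with an unchanged constructor."""

    def __init__(self, data, index=None, columns=None):
        self.data = [list(row) for row in data]
        self.index = index
        self.columns = columns


@singledispatch
def dataframe(data, index=None, columns=None):
    """Default implementation: defer to the existing constructor."""
    return Frame(data, index=index, columns=columns)


class RemoteFrame:
    """Hypothetical third-party container that must be computed first."""

    def __init__(self, rows):
        self._rows = rows

    def compute(self):
        return self._rows


@dataframe.register(RemoteFrame)
def _dataframe_from_remote(data, index=None, columns=None):
    # Only the public constructor is used; no internals are exposed.
    return Frame(data.compute(), index=index, columns=columns)


result = dataframe(RemoteFrame([[1, 2]]), columns=["x", "y"])
```

In this arrangement the constructor stays untouched and the dispatch table lives entirely in the top-level function.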

jbrockmendel · 2020-03-20T02:47:30Z

but I don't think we should be publicly exposing the block manager in any way to downstream libraries, even if it's just for libraries and not users.

Seconded.

saulshanabrook · 2020-03-20T03:04:26Z

I'm not opposed to the idea in principle, but I don't think we should be publicly exposing the block manager in any way to downstream libraries, even if it's just for libraries and not users.

I think a top-level pd.dataframe function which returns a DataFrame is much friendlier to single dispatch.

So not change the existing constructor at all, just add a separate single dispatch function that by default just calls the constructor?

TomAugspurger · 2020-03-20T11:48:53Z

Yep.

saulshanabrook · 2020-03-20T12:19:46Z

Would that maybe be more confusing for users then, since there are now two public methods of constructing dataframes instead of one?

Also, as you see from the example, the dask authors don't have to know what a block_manager is, because they can just call create_dataframe recursively with a dataframe itself.

datapythonista · 2020-03-20T12:23:12Z

Would it make sense to do both separately?

What we've got now for the block manager constructor, but private, so the code is cleaner.

And then a public create_dataframe, pandas.dataframe, the __init__ itself, or whatever for other libraries to override the DataFrame constructor?

TomAugspurger · 2020-03-20T13:59:27Z

Would it make sense to do both separately?

Sure, but separately is the key :) Then we can decide whether we like the singledispatch-style code structure better than the status quo, without having to worry about API discussions. We'll also need to measure performance.

the __init__ itself, or whatever for other libraries to override the DataFrame constructor?

It's not clear to me how singledispatch works on instance methods like __init__, since the first argument is an uninitialized DataFrame.

I'd also like a bit more research on who would benefit from a dispatchable pandas.dataframe method for creating dataframes. Who would be the consumers of that API? Dask is one candidate, but they've gotten along OK without pd.DataFrame(dask_dataframe) working (I recall one issue where a user was confused by it not working).

saulshanabrook · 2020-03-20T14:08:47Z

It's not clear to me how singledispatch works on instance methods like __init__, since the first argument is an uninitialized DataFrame.

There is a singledispatchmethod that could work if it is just used internally. However, all overrides for it would then have to be other methods on the DataFrame class, so they would have to live there rather than in construction.py, and third party libraries wouldn't be able to add to them (which I guess maybe you want?)
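
As a rough illustration of the singledispatchmethod approach (functools, Python 3.8+), the sketch below dispatches __init__ on the type of its first non-self argument. Frame is a hypothetical toy class, not pandas.DataFrame, and the overload names are made up for this example.

```python
from functools import singledispatchmethod


class Frame:
    """Toy class whose constructor dispatches on the type of `data`."""

    @singledispatchmethod
    def __init__(self, data):
        # Fallback for types without a registered overload.
        raise TypeError(f"cannot construct Frame from {type(data).__name__}")

    @__init__.register(dict)
    def _init_from_dict(self, data):
        # Column-oriented input: keys are column names.
        self.columns = list(data)
        self.rows = list(zip(*data.values()))

    @__init__.register(list)
    def _init_from_rows(self, data):
        # Row-oriented input: columns are numbered positionally.
        self.columns = list(range(len(data[0]))) if data else []
        self.rows = [tuple(row) for row in data]


from_dict = Frame({"a": [1, 2], "b": [3, 4]})
from_rows = Frame([[1, 3], [2, 4]])
```

Here the overloads are written as methods in the class body, which is the constraint described above.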

I'd also like a bit more research on who would benefit from a dispatchable pandas.dataframe method for creating dataframes. Who would be the consumers of that API? Dask is one candidate, but they've gotten along OK without pd.DataFrame(dask_dataframe) working (I recall one issue where a user was confused by it not working).

The intended users would be downstream libraries like sklearn, so they could call pd.DataFrame on their input data and have it be coerced to a pandas dataframe, if possible.

TomAugspurger · 2020-03-20T14:27:19Z

Is scikit-learn discussing turning arbitrary objects into pandas dataframes? Currently scikit-learn/enhancement_proposals#37 is discussing pandas in -> pandas out. There's some discussion there around other objects (e.g. xarray DataArray and Dataset) but it seems just as likely for scikit-learn to develop an interface for preserving the type of the input, rather than coercing to a DataFrame.

saulshanabrook · 2020-03-20T14:29:46Z

it seems just as likely for scikit-learn to develop an interface for preserving the type of the input, rather than coercing to a DataFrame.

Yep that's a possibility.

I was hoping that by providing this alternative, it would help pull out the use cases for sklearn that would benefit from a protocol that wouldn't be addressed by this proposal.

I.e. why don't they just depend on pandas? Is it that they don't want the hard dependency? Or is there some other reason?

jreback · 2020-06-14T15:42:03Z

Theoretically a nice idea, but

closing as stale. If you want to continue, please open a new PR.

TomAugspurger · 2020-06-15T14:06:21Z

Opened #34799 for discussing the idea of a singly-dispatched pd.dataframe.

saulshanabrook added 3 commits March 19, 2020 17:39

Switch dataframe constructor to use dispatch

6fe0831

blacken

e123ab4

flake8

4dfb4b7

jbrockmendel reviewed Mar 19, 2020

saulshanabrook force-pushed the rework-constructor branch from 819952c to d1c0609 Compare March 19, 2020 22:12

mypy fixes

522029d

saulshanabrook force-pushed the rework-constructor branch from d1c0609 to 522029d Compare March 19, 2020 22:14

datapythonista added Clean DataFrame DataFrame data structure labels Mar 19, 2020

saulshanabrook added 2 commits March 19, 2020 19:13

style fixes

b8dd353

Merge pandas/master into rework-constructor

7259590

jreback requested changes Mar 19, 2020

saulshanabrook added 5 commits March 19, 2020 19:41

rename and take class instead of instance

3e4b466

Move create_dataframe to construction

898e3d7

Sort imports

7e82826

Fix calling

b307c4b

Add test for custom constructor

edd85c3

saulshanabrook marked this pull request as ready for review March 20, 2020 00:05

Added whats new

8240d86

saulshanabrook requested a review from jreback March 20, 2020 00:10

unused imports

b856214

Merge pandas/master into rework-constructor

a4b3ed0

simonjayhawkins added Constructors Series/DataFrame/Index/pd.array Constructors Needs Discussion Requires discussion from core team before further action labels Mar 20, 2020

jreback closed this Jun 14, 2020

TomAugspurger mentioned this pull request Jun 15, 2020

ENH: At top-level dataframe function for single-dispatched construction of a DataFrame #34799

Open
