Skip to content

ExtensionArray being checked if is instance of collections.abc.Sequence #28424

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
fpim opened this issue Sep 13, 2019 · 7 comments · Fixed by #37972
Closed

ExtensionArray being checked if is instance of collections.abc.Sequence #28424

fpim opened this issue Sep 13, 2019 · 7 comments · Fixed by #37972
Labels
Constructors Series/DataFrame/Index/pd.array Constructors ExtensionArray Extending pandas with custom dtypes or arrays. Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@fpim
Copy link

fpim commented Sep 13, 2019

from pandas.api.extensions import ExtensionDtype,ExtensionArray, ExtensionScalarOpsMixin
class FP:

    def __add__(self, other):
        return self + other
    pass
@pd.api.extensions.register_extension_dtype
class FPType(ExtensionDtype):
    name = 'fp'
    type = FP
    @classmethod
    def construct_from_string(cls, string):
        if string == cls.name:
            return cls()
        else:
            raise TypeError("Cannot construct a '{}' from "
                            "'{}'".format(cls, string))

# collections.abc.Sequence -> data is sequence, _reduce is possible
class FPArray(ExtensionArray,ExtensionScalarOpsMixin,collections.abc.Sequence):
    _dtype = FPType()
    @property
    def dtype(self):
        return self._dtype

    def __init__(self, values):
        values = np.array(values)  # TODO: avoid potential copy
        self.data = values

    def __len__(self):
        return len(self.data)
    def __getitem__(self, *args):
        result = operator.getitem(self.data, *args)
        return result


    def take(self,indexer):
        return self.data.take(indexer)

    def __contains__(self,item):
        return item in self.data

    def __iter__(self):
        for i in self.data:
            yield i

    def __reversed__(self):
        return reversed(self.data)

    def _reduce(self, name, skipna=True, **kwargs):
        return sum(self.data+100)

The ExtensionArray array itself is just for illustration and is not the problem, the problematic part is

arr = FPArray([FP(),FP(),FP(),FP()])
df = pd.DataFrame(arr )

when it reached line 444 of pandas/core/frame.py (https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L444)

            if not isinstance(data, abc.Sequence):
                data = list(data)

where data is referring to my FPArray, if data is not abc.Sequence, my ExtensionArray will be casted to a list.

Having my ExtensionArray casted to a list mean my ExtensionArray will eventually casted back to a normal Pandas series.

The Solution should be having ExtensionArray to be a subclass of abc.Sequence.

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@TomAugspurger
Copy link
Contributor

I'm confused. You're passing an ExtensionArray to the DataFrame constructor? Does it work if you pass a dict DataFrame({"A": arr})?

@fpim
Copy link
Author

fpim commented Sep 13, 2019

I'm confused. You're passing an ExtensionArray to the DataFrame constructor? Does it work if you pass a dict DataFrame({"A": arr})?

That would behave the same.

I understand the confusion, the most observable side effect of the problem would be:
If I do

class FPArray(ExtensionArray,ExtensionScalarOpsMixin,collections.abc.Sequence):
    def _reduce(self, name, skipna=True, **kwargs):
        return 'xxxxx'
arr = FPArray([FP(),FP(),FP(),FP()])
df = pd.DataFrame(arr )
df[0].sum()

def _reduce can be reached

while if i do :

class FPArray(ExtensionArray,ExtensionScalarOpsMixin):
    def _reduce(self, name, skipna=True, **kwargs):
        return 'xxxxx'
arr = FPArray([FP(),FP(),FP(),FP()])
df = pd.DataFrame(arr )
df[0].sum()

def _reduce cannot be reached

@jorisvandenbossche
Copy link
Member

@TomAugspurger I alsways forget whether it creates a row or column, but passing a single list/array-like to DataFrame constructor creates a DataFrame with one column:

In [2]: pd.DataFrame([1, 2, 3])                                                                                                                                                                                    
Out[2]: 
   0
0  1
1  2
2  3

Since that is a way to pass data to DataFrame, you can indeed expect that passing an ExtensionArray also preserves the type, which it currently doesn't:

In [3]: pd.DataFrame(pd.array([1, 2, None], dtype='Int64'))                                                                                                                                                        
Out[3]: 
     0
0  1.0
1  2.0
2  NaN

So I think we can consider this a bug.
Patches welcome!

@jorisvandenbossche jorisvandenbossche added Bug Constructors Series/DataFrame/Index/pd.array Constructors ExtensionArray Extending pandas with custom dtypes or arrays. labels Sep 13, 2019
@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Sep 13, 2019
@simonjayhawkins
Copy link
Member

The Solution should be having ExtensionArray to be a subclass of abc.Sequence.

that does not seem to work since that code block doesn't preserve dtypes. However, the following seems to work.

diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index f1ed3a125..dc72caa5a 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -427,7 +427,7 @@ class DataFrame(NDFrame):
                     data = data.copy()
                 mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)

-        elif isinstance(data, (np.ndarray, Series, Index)):
+        elif isinstance(data, (np.ndarray, Series, Index, ExtensionArray)):
             if data.dtype.names:
                 data_columns = list(data.dtype.names)
                 data = {k: data[k] for k in data_columns}
>>> import pandas as pd
>>> pd.DataFrame(pd.array([1, 2, None], dtype='Int64')).dtypes
0    Int64
dtype: object
>>>

@TomAugspurger
Copy link
Contributor

I alsways forget whether it creates a row or column, but passing a single list/array-like to DataFrame constructor creates a DataFrame with one column:

Huh, I would not have guessed that.

Anyway, looks like Simon has a fix.

We wouldn't want to inherit from abc.Sequence, since isinstance checks on ABCs are so much slower.

@louisng5
Copy link

louisng5 commented Sep 13, 2019

I alsways forget whether it creates a row or column, but passing a single list/array-like to DataFrame constructor creates a DataFrame with one column:

Huh, I would not have guessed that.

Anyway, looks like Simon has a fix.

We wouldn't want to inherit from abc.Sequence, since isinstance checks on ABCs are so much slower.

I totally understand the fix. This will also change the behavior of all existing ExtensionArrays by having them fall into the elif isinstance(data, (np.ndarray, Series, Index, ExtensionArray)): block.

@arw2019
Copy link
Member

arw2019 commented Nov 14, 2020

This works on 1.2 master

@arw2019 arw2019 added Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Nov 14, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.2 Nov 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Constructors Series/DataFrame/Index/pd.array Constructors ExtensionArray Extending pandas with custom dtypes or arrays. Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants