pd.concat() does not work generically for ExtensionArrays #20735

Closed
bfollinprm opened this issue Apr 18, 2018 · 8 comments
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.

@bfollinprm

bfollinprm commented Apr 18, 2018

Code Sample, a copy-pastable example if possible

Sadly, this is pseudocode because the underlying container is proprietary, but the problem is generic. If I get a chance this weekend, I will write up a mock open-source example for illustration and testing.

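import pandas as pd
from pandas.api.extensions import ExtensionArray

class MyArray(ExtensionArray):
    def __init__(self, values, **kwargs):
        # All the things
        # MyNotNumpyContainer is a placeholder for the proprietary, non-NumPy container
        self.values = MyNotNumpyContainer(values)
    # All the other methods

df = pd.DataFrame({'mycolumn': MyArray(values)})
pd.concat([df, df])

# Raises AttributeError: 'MyNotNumpyContainer' object has no attribute 'dtype'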

Problem description

There is a check at the top of _isna_ndarraylike():

def _isna_ndarraylike(obj):
    values = getattr(obj, 'values', obj)
    dtype = values.dtype

    if is_extension_array_dtype(obj):
        if isinstance(obj, (ABCIndexClass, ABCSeries)):
            values = obj._values
        else:
            values = obj
        result = values.isna()

This fails for ExtensionArray objects which define a values attribute, but whose values attribute does not have a dtype attribute.
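
To make the failure mode concrete, here is a dependency-free sketch of the lookup. The classes and `lookup_dtype` are hypothetical stand-ins, not pandas code; only the `getattr(obj, 'values', obj)` line mirrors the snippet above.

```python
class FakeExtensionArray:
    """Hypothetical stand-in for an ExtensionArray subclass (not pandas code)."""

    dtype = "my-dtype"  # the array itself exposes a dtype

    def __init__(self, data):
        self.data = list(data)


class FakeExtensionArrayWithValues(FakeExtensionArray):
    """Same array, but it also exposes its raw container as `.values`."""

    @property
    def values(self):
        return self.data  # a plain list, which has no `.dtype`


def lookup_dtype(obj):
    # Mirrors the first two lines of the snippet above: prefer obj.values
    # when it exists, otherwise fall back to obj itself.
    values = getattr(obj, "values", obj)
    return values.dtype


# No `.values` attribute: the lookup falls back to the array's own dtype.
print(lookup_dtype(FakeExtensionArray([1, 2])))

# With `.values` defined, the lookup lands on the list and fails.
try:
    lookup_dtype(FakeExtensionArrayWithValues([1, 2]))
except AttributeError as exc:
    print("AttributeError:", exc)
```

The array's own `dtype` is never consulted once a `values` attribute exists, which is why the error names the inner container rather than the ExtensionArray.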

I hit this through pd.concat, but the same code path is no doubt reached from elsewhere.

Expected Output

Since dtype is a required attribute of an ExtensionArray, _isna_ndarraylike() should at least get the dtype from the ExtensionArray itself. Really, a check for whether we are dealing with an ExtensionArray should occur somewhere upstream, since there is no guarantee that an ExtensionArray is backed by a NumPy array-like object.
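
One way to picture the expected behaviour is to branch on the extension type before `.values` is ever read. This is a minimal sketch under stated assumptions, not the fix that went into pandas: `FakeExtensionArray`, `ArrayWithValues`, and `isna_guarded` are hypothetical names, and the non-extension fallback is deliberately simplified.

```python
class FakeExtensionArray:
    """Hypothetical stand-in for pandas' ExtensionArray base class."""

    dtype = "my-dtype"

    def __init__(self, data):
        self.data = list(data)

    def isna(self):
        # Element-wise missing-value mask supplied by the array itself.
        return [x is None for x in self.data]


class ArrayWithValues(FakeExtensionArray):
    @property
    def values(self):
        return self.data  # plain list: no `.dtype`


def isna_guarded(obj):
    # Check for the extension type *before* touching `.values`, so the
    # array's own isna() is used and `.values` is never read.
    if isinstance(obj, FakeExtensionArray):
        return obj.isna()
    # Simplified fallback for everything else in this sketch.
    values = getattr(obj, "values", obj)
    return [x is None for x in values]


print(isna_guarded(ArrayWithValues([1, None, 3])))
```

With the type check first, an array that happens to define `values` behaves the same as one that does not.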

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-1048-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.0.dev0+762.gbb095a6
pytest: 3.2.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.28.2
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.4
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Apr 18, 2018
@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Apr 18, 2018
@TomAugspurger
Contributor

TomAugspurger commented Apr 18, 2018

Reproducible example

In [21]: from pandas.tests.extension.json.array import JSONArray

In [22]: class JSONArray2(JSONArray):
    ...:     @property
    ...:     def values(self): return self.data
    ...:

In [23]: arr = JSONArray2([{"A": 1}])

In [24]: df = pd.DataFrame({"A": arr})
In [25]: pd.concat([df, df])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-25-86fe636506f1> in <module>()
----> 1 pd.concat([df, df])

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    211                        verify_integrity=verify_integrity,
    212                        copy=copy)
--> 213     return op.get_result()
    214
    215

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/reshape/concat.py in get_result(self)
    406             new_data = concatenate_block_managers(
    407                 mgrs_indexers, self.new_axes, concat_axis=self.axis,
--> 408                 copy=self.copy)
    409             if not self.copy:
    410                 new_data._consolidate_inplace()

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   5402                 values = values.view()
   5403             b = b.make_block_same_class(values, placement=placement)
-> 5404         elif is_uniform_join_units(join_units):
   5405             b = join_units[0].block.concat_same_type(
   5406                 [ju.block for ju in join_units], placement=placement)

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/internals.py in is_uniform_join_units(join_units)
   5426         # no blocks that would get missing values (can lead to type upcasts)
   5427         # unless we're an extension dtype.
-> 5428         all(not ju.is_na or ju.block.is_extension for ju in join_units) and
   5429         # no blocks with indexers (as then the dimensions do not fit)
   5430         all(not ju.indexers for ju in join_units) and

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/internals.py in <genexpr>(.0)
   5426         # no blocks that would get missing values (can lead to type upcasts)
   5427         # unless we're an extension dtype.
-> 5428         all(not ju.is_na or ju.block.is_extension for ju in join_units) and
   5429         # no blocks with indexers (as then the dimensions do not fit)
   5430         all(not ju.indexers for ju in join_units) and

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()
     34             val = <object> PyDict_GetItem(cache, self.name)
     35         else:
---> 36             val = self.func(obj)
     37             PyDict_SetItem(cache, self.name, val)
     38         return val

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/internals.py in is_na(self)
   5782         chunk_len = max(total_len // 40, 1000)
   5783         for i in range(0, total_len, chunk_len):
-> 5784             if not isna(values_flat[i:i + chunk_len]).all():
   5785                 return False
   5786

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/dtypes/missing.py in isna(obj)
    104     Name: 1, dtype: bool
    105     """
--> 106     return _isna(obj)
    107
    108

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/dtypes/missing.py in _isna_new(obj)
    118     elif isinstance(obj, (ABCSeries, np.ndarray, ABCIndexClass,
    119                           ABCExtensionArray)):
--> 120         return _isna_ndarraylike(obj)
    121     elif isinstance(obj, ABCGeneric):
    122         return obj._constructor(obj._data.isna(func=isna))

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/dtypes/missing.py in _isna_ndarraylike(obj)
    185 def _isna_ndarraylike(obj):
    186     values = getattr(obj, 'values', obj)
--> 187     dtype = values.dtype
    188
    189     if is_extension_array_dtype(obj):

AttributeError: 'list' object has no attribute 'dtype'

@jorisvandenbossche
Member

The main problem here is that, for now, an ExtensionArray is effectively "not allowed" to have a .values attribute, because we use that name internally?

@TomAugspurger
Contributor

Right. We’ve hit other places like this (setitem, IIRC) and have had to throw in extra checks.

@jorisvandenbossche
Member

Maybe we can, for now, add a test to the extension interface tests that checks that the ExtensionArray has no values attribute, to warn the author about that.

@TomAugspurger
Contributor

TomAugspurger commented Apr 19, 2018 via email

@jorisvandenbossche
Member

I wouldn't make it a runtime warning, but a test in, e.g., BaseInterfaceTests that asserts the attribute is not present. That at least warns extension authors when they subclass the tests (and of course we should document it as well).
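
A rough, dependency-free sketch of what such a check could look like. `BaseInterfaceChecks`, `GoodArray`, and `BadArray` are hypothetical names used only for illustration; in the real test suite the array would presumably come from the extension author's fixture.

```python
class BaseInterfaceChecks:
    """Hypothetical mixin in the spirit of BaseInterfaceTests (not pandas code)."""

    def check_no_values_attribute(self, data):
        # Extension arrays should not expose `.values`, because internal
        # code treats that name as "the underlying array".
        assert not hasattr(data, "values"), (
            f"{type(data).__name__} must not define a 'values' attribute"
        )


class GoodArray:
    def __init__(self, data):
        self.data = list(data)


class BadArray(GoodArray):
    @property
    def values(self):
        return self.data


checks = BaseInterfaceChecks()
checks.check_no_values_attribute(GoodArray([1, 2]))  # passes silently

try:
    checks.check_no_values_attribute(BadArray([1, 2]))
except AssertionError as exc:
    print("caught:", exc)
```

Because the check lives in the shared test base rather than in library code, it costs nothing at runtime and only fires for authors who run the interface tests.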

@TomAugspurger
Contributor

TomAugspurger commented Apr 21, 2018 via email

@TomAugspurger
Contributor

I think everything here has been solved. We still have #22994 for a more general concat protocol.

4 participants