pd.concat() does not work generically for ExtensionArrays #20735
Reproducible example
In [21]: from pandas.tests.extension.json.array import JSONArray
In [22]: class JSONArray2(JSONArray):
    ...:     @property
    ...:     def values(self): return self.data
    ...:
In [23]: arr = JSONArray2([{"A": 1}])
In [24]: df = pd.DataFrame({"A": arr})
In [25]: pd.concat([df, df])
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-25-86fe636506f1> in <module>()
----> 1 pd.concat([df, df])
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
211 verify_integrity=verify_integrity,
212 copy=copy)
--> 213 return op.get_result()
214
215
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/reshape/concat.py in get_result(self)
406 new_data = concatenate_block_managers(
407 mgrs_indexers, self.new_axes, concat_axis=self.axis,
--> 408 copy=self.copy)
409 if not self.copy:
410 new_data._consolidate_inplace()
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
5402 values = values.view()
5403 b = b.make_block_same_class(values, placement=placement)
-> 5404 elif is_uniform_join_units(join_units):
5405 b = join_units[0].block.concat_same_type(
5406 [ju.block for ju in join_units], placement=placement)
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/internals.py in is_uniform_join_units(join_units)
5426 # no blocks that would get missing values (can lead to type upcasts)
5427 # unless we're an extension dtype.
-> 5428 all(not ju.is_na or ju.block.is_extension for ju in join_units) and
5429 # no blocks with indexers (as then the dimensions do not fit)
5430 all(not ju.indexers for ju in join_units) and
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/internals.py in <genexpr>(.0)
5426 # no blocks that would get missing values (can lead to type upcasts)
5427 # unless we're an extension dtype.
-> 5428 all(not ju.is_na or ju.block.is_extension for ju in join_units) and
5429 # no blocks with indexers (as then the dimensions do not fit)
5430 all(not ju.indexers for ju in join_units) and
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()
34 val = <object> PyDict_GetItem(cache, self.name)
35 else:
---> 36 val = self.func(obj)
37 PyDict_SetItem(cache, self.name, val)
38 return val
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/internals.py in is_na(self)
5782 chunk_len = max(total_len // 40, 1000)
5783 for i in range(0, total_len, chunk_len):
-> 5784 if not isna(values_flat[i:i + chunk_len]).all():
5785 return False
5786
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/dtypes/missing.py in isna(obj)
104 Name: 1, dtype: bool
105 """
--> 106 return _isna(obj)
107
108
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/dtypes/missing.py in _isna_new(obj)
118 elif isinstance(obj, (ABCSeries, np.ndarray, ABCIndexClass,
119 ABCExtensionArray)):
--> 120 return _isna_ndarraylike(obj)
121 elif isinstance(obj, ABCGeneric):
122 return obj._constructor(obj._data.isna(func=isna))
~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/dtypes/missing.py in _isna_ndarraylike(obj)
185 def _isna_ndarraylike(obj):
186 values = getattr(obj, 'values', obj)
--> 187 dtype = values.dtype
188
189 if is_extension_array_dtype(obj):
AttributeError: 'list' object has no attribute 'dtype'
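The last frame of the traceback can be reproduced without pandas. The following is a minimal sketch, not the pandas implementation: `FakeDtype`, `ListBackedArray` and `lookup_dtype_like_traceback` are hypothetical names introduced here, and the function only mirrors the two lines of `_isna_ndarraylike` that appear in the traceback.

```python
class FakeDtype:
    """Stand-in for an extension dtype (hypothetical, not a pandas class)."""
    name = "json"


class ListBackedArray:
    """Hypothetical array-like that, like JSONArray2, exposes `values` as a plain list."""

    dtype = FakeDtype()

    def __init__(self, data):
        self.data = list(data)

    @property
    def values(self):
        # Returns the raw Python list, which has no `dtype` attribute.
        return self.data


def lookup_dtype_like_traceback(obj):
    # Mirrors the two lines shown in the traceback: unwrap `values` if the
    # object has it, then read `dtype` from whatever came back.
    values = getattr(obj, "values", obj)
    return values.dtype


arr = ListBackedArray([{"A": 1}])
try:
    lookup_dtype_like_traceback(arr)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

The array itself has a perfectly good `dtype`; the lookup fails only because `values` is unwrapped first and the list it returns is treated as if it were ndarray-like.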
Main problem here is that for now it is kind of "not allowed" as an ExtensionArray to have a `values` attribute.
Right. We’ve hit other places like this (setitem IIRC) and have to throw in extra checks.
Maybe we can, for now, add a test to the extension interface tests that checks the ExtensionArray has *no* attribute `values` to warn the author about that.
That may be the most pragmatic approach right now. I'm guessing there's a long tail of these kinds of places where we "expect" an array-like, but actually use ndarray methods.
Is just documenting it sufficient, or do you think it should be an actual warning at runtime? And when / where would we emit the warning?
I wouldn't make it a runtime warning, but a test in e.g. `BaseInterfaceTests` that asserts it is not present. So at least that gives a warning to extension authors when they subclass the tests (and of course documenting it as well).
That seems reasonable for now.
I think everything here has been solved. We still have #22994 for a more general concat protocol.
Code Sample, a copy-pastable example if possible
Sadly, pseudocode because the underlying container is proprietary, but this is a generic problem. If I get a chance this weekend I will write up a mock open-source-compliant example for illustration and testing.
Problem description
There is a check at the top of `_isna_ndarraylike()`: it takes `getattr(obj, 'values', obj)` and then reads `.dtype` from the result, as the traceback shows. This fails for `ExtensionArray` objects which define a `values` attribute, but whose `values` attribute does not have a `dtype` attribute. I noticed this call in `pd.concat` above, but no doubt it occurs elsewhere.
Expected Output
Since `dtype` is a required attribute of an `ExtensionArray`, `_isna_ndarraylike()` should at least get the dtype from the `ExtensionArray` class. Really, a check for whether we are dealing with an `ExtensionArray` should occur upstream somewhere, since there is no guarantee an ExtensionArray is backed by a numpy array-like object.
Output of `pd.show_versions()`
INSTALLED VERSIONS
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-1048-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.23.0.dev0+762.gbb095a6
pytest: 3.2.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.28.2
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.4
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
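The behaviour requested under Expected Output, reading `dtype` from the extension array itself before any `values` unwrapping, could be sketched as follows. This is an assumption-laden illustration rather than the fix that went into pandas: `ExtensionArraySketch`, `ListBacked` and `resolve_dtype` are hypothetical names, and the string dtype is a placeholder.

```python
class ExtensionArraySketch:
    """Hypothetical minimal base class standing in for an ExtensionArray."""

    dtype = "sketch-dtype"  # placeholder; a real array exposes a dtype object


class ListBacked(ExtensionArraySketch):
    """Hypothetical extension array whose `values` is a plain list."""

    def __init__(self, data):
        self.data = list(data)

    @property
    def values(self):
        return self.data


def resolve_dtype(obj):
    # Extension arrays are required to define `dtype`, so read it from the
    # array directly and never unwrap `values` for them.
    if isinstance(obj, ExtensionArraySketch):
        return obj.dtype
    # Other objects keep the unwrap-then-read lookup seen in the traceback.
    values = getattr(obj, "values", obj)
    return values.dtype


resolved = resolve_dtype(ListBacked([{"A": 1}]))
print(resolved)
```

Checking the type before unwrapping means the backing store of the extension array never has to look like a numpy array.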