ExtensionArray being checked if is instance of collections.abc.Sequence #28424
Comments
I'm confused. You're passing an ExtensionArray to the DataFrame constructor? Does it work if you pass a dict |
That would behave the same. I understand the confusion. The most observable side effect of the problem is:

    import collections.abc
    import pandas as pd
    from pandas.api.extensions import ExtensionArray, ExtensionScalarOpsMixin

    # FP is the scalar class held by the array (its definition is not shown here).
    class FPArray(ExtensionArray, ExtensionScalarOpsMixin, collections.abc.Sequence):
        def _reduce(self, name, skipna=True, **kwargs):
            return 'xxxxx'

    arr = FPArray([FP(), FP(), FP(), FP()])
    df = pd.DataFrame(arr)
    df[0].sum()

whereas if I do:

    class FPArray(ExtensionArray, ExtensionScalarOpsMixin):
        def _reduce(self, name, skipna=True, **kwargs):
            return 'xxxxx'

    arr = FPArray([FP(), FP(), FP(), FP()])
    df = pd.DataFrame(arr)
    df[0].sum()
|
@TomAugspurger I always forget whether it creates a row or column, but passing a single list/array-like to
Since that is a way to pass data to DataFrame, you can indeed expect that passing an ExtensionArray also preserves the type, which it currently doesn't:
So I think we can consider this a bug. |
That does not seem to work, since that code block doesn't preserve dtypes. However, the following seems to work.

    diff --git a/pandas/core/frame.py b/pandas/core/frame.py
    index f1ed3a125..dc72caa5a 100644
    --- a/pandas/core/frame.py
    +++ b/pandas/core/frame.py
    @@ -427,7 +427,7 @@ class DataFrame(NDFrame):
                     data = data.copy()
                 mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
    -        elif isinstance(data, (np.ndarray, Series, Index)):
    +        elif isinstance(data, (np.ndarray, Series, Index, ExtensionArray)):
                 if data.dtype.names:
                     data_columns = list(data.dtype.names)
                     data = {k: data[k] for k in data_columns}

    >>> import pandas as pd
    >>> pd.DataFrame(pd.array([1, 2, None], dtype='Int64')).dtypes
    0    Int64
    dtype: object
|
Huh, I would not have guessed that. Anyway, looks like Simon has a fix. We wouldn't want to inherit from abc.Sequence, since isinstance checks on ABCs are so much slower. |
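To give a rough feel for the cost mentioned above, here is a minimal timing sketch using only the standard library. `PlainArray` and `time_isinstance_checks` are hypothetical names made up for illustration and are not pandas code. Absolute numbers depend on the machine and Python version, so no expected timings are shown.

```python
import collections.abc
import timeit


class PlainArray:
    """Hypothetical stand-in for an array type that is not a Sequence."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, index):
        return self._values[index]


def time_isinstance_checks(number=200_000):
    """Return (concrete_seconds, abc_seconds) for `number` isinstance calls each."""
    obj = PlainArray([1, 2, 3])
    # Check against a tuple of concrete classes, as the patched branch does.
    concrete = timeit.timeit(
        lambda: isinstance(obj, (list, tuple, PlainArray)), number=number
    )
    # Check against an ABC, which goes through ABCMeta.__instancecheck__.
    via_abc = timeit.timeit(
        lambda: isinstance(obj, collections.abc.Sequence), number=number
    )
    return concrete, via_abc


if __name__ == "__main__":
    concrete, via_abc = time_isinstance_checks()
    print(f"concrete classes: {concrete:.4f}s")
    print(f"abc.Sequence:     {via_abc:.4f}s")
```

The ABC check is usually the slower of the two, which is one reason to prefer listing concrete classes in a hot constructor path.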
I totally understand the fix. This will also change the behavior of all existing ExtensionArrays by having them fall into the |
This works on 1.2 master |
The ExtensionArray itself is just for illustration and is not the problem. The problematic part is when execution reaches line 444 of pandas/core/frame.py (https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L444), where data refers to my FPArray: if data is not an abc.Sequence, my ExtensionArray is cast to a list. Being cast to a list means my ExtensionArray will eventually be converted back to a normal pandas Series.
The solution should be to make ExtensionArray a subclass of abc.Sequence.
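The behavior described above can be sketched without pandas. This is a simplified stand-in based only on the description in this report, not the actual constructor code. `MyArray`, `MySequenceArray`, and `normalize` are hypothetical names.

```python
import collections.abc


class MyArray:
    """Hypothetical array-like: iterable and indexable, but not a Sequence."""

    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        return iter(self._values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, index):
        return self._values[index]


class MySequenceArray(MyArray, collections.abc.Sequence):
    """Same container, but also declared as a Sequence."""


def normalize(data):
    """Simplified stand-in for the constructor branch described above."""
    if isinstance(data, collections.abc.Iterable) and not isinstance(data, (str, bytes)):
        if not isinstance(data, collections.abc.Sequence):
            # Non-Sequence iterables are materialized; the custom type is lost.
            data = list(data)
    return data


print(type(normalize(MyArray([1, 2, 3]))).__name__)          # list
print(type(normalize(MySequenceArray([1, 2, 3]))).__name__)  # MySequenceArray
```

Only the class that inherits from `collections.abc.Sequence` survives the branch with its type intact, which matches the difference between the two FPArray definitions in this thread.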
INSTALLED VERSIONS
commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None