BUG: pd.merge with ExtensionArray does not preserve extension dtype #20743

Closed
jorisvandenbossche opened this issue Apr 19, 2018 · 2 comments
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Comments

@jorisvandenbossche
Member

In [1]: from pandas.tests.extension.decimal.array import DecimalArray, make_data

In [2]: import pandas as pd

In [5]: dec_arr = DecimalArray(make_data())

In [6]: df1 = pd.DataFrame({'int1': [1, 2, 3], 'key':[0, 1, 2], 'ext1': dec_arr[:3]})

In [7]: df2 = pd.DataFrame({'int2': [1, 2, 3, 4], 'key':[0, 0, 1, 3], 'ext2': dec_arr[3:7]})

In [8]: pd.merge(df1, df2)
Out[8]: 
                                                ext1  int1  key                                               ext2  int2
0  0.90013275661511904512934734157170169055461883...     1    0  0.67786011817398117429434023506473749876022338...     1
1  0.90013275661511904512934734157170169055461883...     1    0  0.94029656863099908559178174982662312686443328...     2
2  0.96839085663514357094072693143971264362335205...     2    1  0.12455159685855177187363551638554781675338745...     3

In [9]: pd.merge(df1, df2, how='outer')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-a573147da092> in <module>()
----> 1 pd.merge(df1, df2, how='outer')

/home/joris/scipy/pandas/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     59                          copy=copy, indicator=indicator,
     60                          validate=validate)
---> 61     return op.get_result()
     62 
     63 

/home/joris/scipy/pandas/pandas/core/reshape/merge.py in get_result(self)
    579             [(ldata, lindexers), (rdata, rindexers)],
    580             axes=[llabels.append(rlabels), join_index],
--> 581             concat_axis=0, copy=self.copy)
    582 
    583         typ = self.left._constructor

/home/joris/scipy/pandas/pandas/core/internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   5407         else:
   5408             b = make_block(
-> 5409                 concatenate_join_units(join_units, concat_axis, copy=copy),
   5410                 placement=placement)
   5411         blocks.append(b)

/home/joris/scipy/pandas/pandas/core/internals.py in concatenate_join_units(join_units, concat_axis, copy)
   5533         raise AssertionError("Concatenating join units along axis0")
   5534 
-> 5535     empty_dtype, upcasted_na = get_empty_dtype_and_na(join_units)
   5536 
   5537     to_concat = [ju.get_reindexed_values(empty_dtype=empty_dtype,

/home/joris/scipy/pandas/pandas/core/internals.py in get_empty_dtype_and_na(join_units)
   5458             has_none_blocks = True
   5459         else:
-> 5460             dtypes[i] = unit.dtype
   5461 
   5462     upcast_classes = defaultdict(list)

/home/joris/scipy/pandas/pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()

/home/joris/scipy/pandas/pandas/core/internals.py in dtype(self)
   5754         else:
   5755             return _get_dtype(maybe_promote(self.block.dtype,
-> 5756                                             self.block.fill_value)[0])
   5757 
   5758     @cache_readonly

/home/joris/scipy/pandas/pandas/core/dtypes/common.py in _get_dtype(arr_or_dtype)
   1830     if hasattr(arr_or_dtype, 'dtype'):
   1831         arr_or_dtype = arr_or_dtype.dtype
-> 1832     return np.dtype(arr_or_dtype)
   1833 
   1834 

TypeError: data type not understood
@jorisvandenbossche added the ExtensionArray label on Apr 19, 2018
@jorisvandenbossche added this to the 0.23.0 milestone on Apr 19, 2018
@jorisvandenbossche
Member Author

Something else: although the inner join seems to work, it loses the extension dtype:

In [8]: df1.dtypes
Out[8]: 
ext1    decimal
int1      int64
key       int64
dtype: object

In [9]: pd.merge(df1, df2).dtypes
Out[9]: 
ext1    object
int1     int64
key      int64
ext2    object
int2     int64
dtype: object
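
The comparison above can be automated. The following is a minimal sketch, not part of pandas: `changed_dtypes` is a hypothetical helper that reports columns whose dtype name differs between two dtype listings. The sample values are copied from the `df1.dtypes` and `pd.merge(df1, df2).dtypes` output above.

```python
def changed_dtypes(before, after):
    """Map each shared column to (old, new) where the dtype name changed.

    Hypothetical helper, not a pandas API. Both arguments only need an
    ``.items()`` method, so a dict or a ``DataFrame.dtypes`` Series works.
    """
    after_names = {col: str(dtype) for col, dtype in after.items()}
    return {
        col: (str(dtype), after_names[col])
        for col, dtype in before.items()
        if col in after_names and after_names[col] != str(dtype)
    }


# dtype names copied from the In [8] / In [9] output in this issue
df1_dtypes = {"ext1": "decimal", "int1": "int64", "key": "int64"}
merged_dtypes = {"ext1": "object", "int1": "int64", "key": "int64",
                 "ext2": "object", "int2": "int64"}

lost = changed_dtypes(df1_dtypes, merged_dtypes)
print(lost)  # {'ext1': ('decimal', 'object')}
```

With real frames, `changed_dtypes(df1.dtypes, pd.merge(df1, df2).dtypes)` should return an empty dict once the merge preserves the extension dtype.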

@jorisvandenbossche changed the title from "BUG: outer merge with ExtensionArray failing" to "BUG: pd.merge with ExtensionArray does not preserve extension dtype" on Apr 19, 2018
@jorisvandenbossche
Member Author

This solves the error:

diff --git a/pandas/core/dtypes/common.py b/pandas/core/dtypes/common.py
index 3a90feb..2f26a06 100644
--- a/pandas/core/dtypes/common.py
+++ b/pandas/core/dtypes/common.py
@@ -1807,6 +1807,8 @@ def _get_dtype(arr_or_dtype):
         return arr_or_dtype
     elif isinstance(arr_or_dtype, type):
         return np.dtype(arr_or_dtype)
+    elif isinstance(arr_or_dtype, ExtensionDtype):
+        return arr_or_dtype
     elif isinstance(arr_or_dtype, CategoricalDtype):
         return arr_or_dtype
     elif isinstance(arr_or_dtype, DatetimeTZDtype):

but the problem of not preserving the dtype still exists.
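
To illustrate why the extra branch avoids the `TypeError`, here is a simplified sketch of the same dispatch. It is not the pandas implementation: `HypotheticalExtensionDtype` and `get_dtype_sketch` are stand-ins invented for this example, and only NumPy is assumed to be installed. The point is that `np.dtype()` cannot interpret an arbitrary dtype object, so such objects have to be returned before that call is reached.

```python
import numpy as np


class HypotheticalExtensionDtype:
    """Stand-in for a pandas ExtensionDtype (not the real class)."""

    def __init__(self, name):
        self.name = name


def get_dtype_sketch(arr_or_dtype):
    """Simplified analogue of the patched dispatch (an assumption, not pandas code)."""
    if isinstance(arr_or_dtype, np.dtype):
        return arr_or_dtype
    # The guard added by the diff: hand extension dtypes back unchanged
    # instead of letting them fall through to np.dtype().
    if isinstance(arr_or_dtype, HypotheticalExtensionDtype):
        return arr_or_dtype
    # Array-likes expose their dtype as an attribute; unwrap and retry.
    if hasattr(arr_or_dtype, "dtype"):
        return get_dtype_sketch(arr_or_dtype.dtype)
    return np.dtype(arr_or_dtype)


ext = HypotheticalExtensionDtype("decimal")

# Without the guard, NumPy rejects the object.
try:
    np.dtype(ext)
    numpy_rejected = False
except TypeError:
    numpy_rejected = True

# With the guard, the extension dtype passes through untouched.
passed_through = get_dtype_sketch(ext) is ext
```

As the issue notes, a guard like this only removes the exception; it does nothing to keep the merged columns from being converted to `object`.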
