BUG: concat of ExtensionBlock with different dtypes #29569

Closed
jorisvandenbossche opened this issue Nov 12, 2019 · 2 comments
Labels
Bug · Duplicate Report (Duplicate issue or pull request) · ExtensionArray (Extending pandas with custom dtypes or arrays) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@jorisvandenbossche
Member

jorisvandenbossche commented Nov 12, 2019

Consider concatenating two DataFrames that each have a column with an extension dtype, but a different dtype in each (here string and nullable int):

In [9]: df1 = pd.DataFrame({'A': ['a', 'b']}).astype("string")   

In [10]: df2 = pd.DataFrame({'A': [1, 2]}).astype('Int64')  

In [11]: pd.concat([df1, df2])   
...
~/scipy/pandas/pandas/core/internals/blocks.py in concat_same_type(self, to_concat, placement)
  1838         Concatenate list of single blocks of the same type.
  1839         """
-> 1840         values = self._holder._concat_same_type([blk.values for blk in to_concat])
  1841         placement = placement or slice(0, len(values), 1)
  1842         return self.make_block_same_class(values, ndim=self.ndim, placement=placement)

~/scipy/pandas/pandas/core/arrays/numpy_.py in _concat_same_type(cls, to_concat)
   157     @classmethod
   158     def _concat_same_type(cls, to_concat):
--> 159         return cls(np.concatenate(to_concat))
   160 
   161     # ------------------------------------------------------------------------

~/scipy/pandas/pandas/core/arrays/string_.py in __init__(self, values, copy)
   154         self._dtype = StringDtype()
   155         if not skip_validation:
--> 156             self._validate()
   157 
   158     def _validate(self):

~/scipy/pandas/pandas/core/arrays/string_.py in _validate(self)
   160         if len(self._ndarray) and not lib.is_string_array(self._ndarray, skipna=True):
   161             raise ValueError(
--> 162                 "StringArray requires a sequence of strings or missing values."
   163             )
   164         if self._ndarray.dtype != "object":

ValueError: StringArray requires a sequence of strings or missing values.

This errors because in the concatenation, we have the following code:

    elif is_uniform_join_units(join_units):
        b = join_units[0].block.concat_same_type(
            [ju.block for ju in join_units], placement=placement
        )

and is_uniform_join_units only checks that the blocks are all ExtensionBlock, not that they share a dtype. Therefore, ExtensionBlock.concat_same_type -> ExtensionArray._concat_same_type gets called on the assumption that all values are of the same dtype, which is not the case here and leads to the above error.
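The difference between the two checks can be illustrated with a toy model. The `Block` class and both helper functions below are hypothetical stand-ins, not the pandas internals; they only assume that a block carries a kind (its block class) and a dtype.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a pandas internal block."""

    kind: str   # e.g. "ExtensionBlock"
    dtype: str  # e.g. "string" or "Int64"


def is_uniform_by_kind(blocks):
    # Mirrors the check described above: only the block class is compared.
    return len({blk.kind for blk in blocks}) == 1


def is_uniform_by_kind_and_dtype(blocks):
    # Stricter variant: the dtype has to match as well.
    return len({(blk.kind, blk.dtype) for blk in blocks}) == 1


blocks = [Block("ExtensionBlock", "string"), Block("ExtensionBlock", "Int64")]
print(is_uniform_by_kind(blocks))            # True
print(is_uniform_by_kind_and_dtype(blocks))  # False
```

With the kind-only check, the string and Int64 blocks are treated as uniform and would be routed to the same-type fast path; the stricter variant rejects them.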

The easy fix is to make is_uniform_join_units do a stricter check. But we also need to decide on the alternative handling: how can some blocks still coerce? (e.g. should Int64 and Int32 result in Int64?)
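One possible shape for the alternative handling, as a sketch only: decide on a common dtype first, and use the same-type path only when all inputs already agree. The `common_dtype` function, its promotion table, and the object fallback are assumptions made for illustration and do not describe existing pandas behaviour.

```python
# Hypothetical bit widths for the nullable integer dtypes; not a pandas table.
_INT_WIDTHS = {"Int8": 8, "Int16": 16, "Int32": 32, "Int64": 64}


def common_dtype(dtypes):
    """Return a dtype name that a non-empty list of dtype names could share (toy rules)."""
    unique = set(dtypes)
    if len(unique) == 1:
        # Same dtype everywhere: the _concat_same_type fast path is safe.
        return unique.pop()
    if unique.issubset(_INT_WIDTHS):
        # Only signed nullable ints: pick the widest one.
        return max(unique, key=_INT_WIDTHS.get)
    # Assumed fallback for anything else.
    return "object"


print(common_dtype(["Int32", "Int64"]))   # Int64
print(common_dtype(["string", "Int64"]))  # object
```

Under these toy rules, Int32 and Int64 coerce to Int64 as suggested above, while string and Int64 have no shared extension dtype and fall back to object.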

@jorisvandenbossche jorisvandenbossche added Bug ExtensionArray Extending pandas with custom dtypes or arrays. Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 12, 2019
@TomAugspurger
Contributor

Maybe a duplicate of #22994?

@jorisvandenbossche
Member Author

jorisvandenbossche commented Nov 12, 2019

Ah, yes, I searched for it but didn't find it ..

Closing as a duplicate of #22994

@jorisvandenbossche jorisvandenbossche added this to the No action milestone Nov 12, 2019
@jorisvandenbossche jorisvandenbossche added the Duplicate Report Duplicate issue or pull request label Nov 12, 2019