BUG: concat of ExtensionBlock with different dtypes  #29569

Closed
@jorisvandenbossche

Consider concatenating two DataFrames that each have a column with an extension dtype, but where the two dtypes differ (here string and nullable int):

In [9]: df1 = pd.DataFrame({'A': ['a', 'b']}).astype("string")   

In [10]: df2 = pd.DataFrame({'A': [1, 2]}).astype('Int64')  

In [11]: pd.concat([df1, df2])   
...
~/scipy/pandas/pandas/core/internals/blocks.py in concat_same_type(self, to_concat, placement)
  1838         Concatenate list of single blocks of the same type.
  1839         """
-> 1840         values = self._holder._concat_same_type([blk.values for blk in to_concat])
  1841         placement = placement or slice(0, len(values), 1)
  1842         return self.make_block_same_class(values, ndim=self.ndim, placement=placement)

~/scipy/pandas/pandas/core/arrays/numpy_.py in _concat_same_type(cls, to_concat)
   157     @classmethod
   158     def _concat_same_type(cls, to_concat):
--> 159         return cls(np.concatenate(to_concat))
   160 
   161     # ------------------------------------------------------------------------

~/scipy/pandas/pandas/core/arrays/string_.py in __init__(self, values, copy)
   154         self._dtype = StringDtype()
   155         if not skip_validation:
--> 156             self._validate()
   157 
   158     def _validate(self):

~/scipy/pandas/pandas/core/arrays/string_.py in _validate(self)
   160         if len(self._ndarray) and not lib.is_string_array(self._ndarray, skipna=True):
   161             raise ValueError(
--> 162                 "StringArray requires a sequence of strings or missing values."
   163             )
   164         if self._ndarray.dtype != "object":

ValueError: StringArray requires a sequence of strings or missing values.

This errors because in the concatenation, we have the following code:

    elif is_uniform_join_units(join_units):
        b = join_units[0].block.concat_same_type(
            [ju.block for ju in join_units], placement=placement
        )

and is_uniform_join_units only checks that every block is an ExtensionBlock, not that the blocks share the same dtype. Therefore ExtensionBlock.concat_same_type -> ExtensionArray._concat_same_type gets called on the assumption that all values have the same dtype, which is not the case here and leads to the error above.
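To make the gap concrete, here is a minimal sketch of the difference between a check that only looks at the block kind and one that also compares dtypes. FakeBlock, is_uniform_loose and is_uniform_strict are hypothetical stand-ins written for illustration; they are not the pandas internals and do not reproduce the real is_uniform_join_units logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeBlock:
    """Hypothetical stand-in for an internal block (not the pandas class)."""

    is_extension: bool
    dtype: str  # dtype name, e.g. "string" or "Int64"


def is_uniform_loose(blocks: List[FakeBlock]) -> bool:
    # Only the block kind is compared, as described in the issue text.
    first = blocks[0]
    return all(b.is_extension == first.is_extension for b in blocks)


def is_uniform_strict(blocks: List[FakeBlock]) -> bool:
    # Stricter variant: extension blocks must also share one dtype.
    first = blocks[0]
    return all(
        b.is_extension == first.is_extension
        and (not b.is_extension or b.dtype == first.dtype)
        for b in blocks
    )


mixed = [FakeBlock(True, "string"), FakeBlock(True, "Int64")]
same = [FakeBlock(True, "Int64"), FakeBlock(True, "Int64")]

print(is_uniform_loose(mixed))   # the loose check accepts mixed dtypes
print(is_uniform_strict(mixed))  # the strict check rejects them
print(is_uniform_strict(same))   # identical dtypes still pass
```

With the loose check, the string/Int64 pair is treated as uniform and would be routed to the same-type concatenation path; the strict variant sends it elsewhere.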

The easy fix is to make is_uniform_join_units do a stricter check. But we also need to decide on the alternative handling: how can some blocks still be coerced (e.g. should Int64 and Int32 result in Int64)?
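One possible shape for that alternative handling is a "common dtype" lookup performed before concatenation. The sketch below works on dtype names only and is purely illustrative: common_dtype is a hypothetical helper, the rule that nullable integers of the same signedness widen to the largest width is an assumption, and the fallback to object for everything else (including mixed signed/unsigned) is also an assumption, not something decided in this issue.

```python
import re
from typing import List

# Matches nullable integer dtype names such as "Int32" or "UInt8".
_NULLABLE_INT = re.compile(r"^(U?)Int(8|16|32|64)$")


def common_dtype(names: List[str]) -> str:
    """Hypothetical helper: pick a result dtype name for a concat."""
    if len(set(names)) == 1:
        # All inputs already agree, so no coercion is needed.
        return names[0]

    matches = [_NULLABLE_INT.match(n) for n in names]
    same_sign = all(matches) and len({m.group(1) for m in matches}) == 1
    if same_sign:
        # Assumed rule: widen to the largest bit width among the inputs.
        width = max(int(m.group(2)) for m in matches)
        return f"{matches[0].group(1)}Int{width}"

    # Assumed fallback when no common extension dtype is known.
    return "object"


print(common_dtype(["Int64", "Int32"]))
print(common_dtype(["string", "Int64"]))
```

Under these assumptions the Int64/Int32 pair resolves to Int64, while the string/Int64 pair from the report falls back to object instead of raising.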

Labels: Bug, Duplicate Report, ExtensionArray, Reshaping