Description
Consider concatenating two DataFrames that each have a column with an extension dtype, but a different one in each (here string and nullable int):
In [9]: df1 = pd.DataFrame({'A': ['a', 'b']}).astype("string")
In [10]: df2 = pd.DataFrame({'A': [1, 2]}).astype('Int64')
In [11]: pd.concat([df1, df2])
...
~/scipy/pandas/pandas/core/internals/blocks.py in concat_same_type(self, to_concat, placement)
1838 Concatenate list of single blocks of the same type.
1839 """
-> 1840 values = self._holder._concat_same_type([blk.values for blk in to_concat])
1841 placement = placement or slice(0, len(values), 1)
1842 return self.make_block_same_class(values, ndim=self.ndim, placement=placement)
~/scipy/pandas/pandas/core/arrays/numpy_.py in _concat_same_type(cls, to_concat)
157 @classmethod
158 def _concat_same_type(cls, to_concat):
--> 159 return cls(np.concatenate(to_concat))
160
161 # ------------------------------------------------------------------------
~/scipy/pandas/pandas/core/arrays/string_.py in __init__(self, values, copy)
154 self._dtype = StringDtype()
155 if not skip_validation:
--> 156 self._validate()
157
158 def _validate(self):
~/scipy/pandas/pandas/core/arrays/string_.py in _validate(self)
160 if len(self._ndarray) and not lib.is_string_array(self._ndarray, skipna=True):
161 raise ValueError(
--> 162 "StringArray requires a sequence of strings or missing values."
163 )
164 if self._ndarray.dtype != "object":
ValueError: StringArray requires a sequence of strings or missing values.
This raises because the concatenation goes through the following code:
pandas/pandas/core/internals/managers.py
Lines 2021 to 2024 in 5c36aa1
and is_uniform_join_units only checks that the blocks are ExtensionBlocks, not what dtype each block holds. Therefore ExtensionBlock.concat_same_type -> ExtensionArray._concat_same_type gets called on the assumption that all values have the same dtype (which is not the case here, leading to the above error).
The easy fix is to make is_uniform_join_units do a stricter check. But we also need to decide on the alternative handling: how can blocks with different dtypes still be coerced to a common one? (e.g. should Int64 and Int32 result in Int64?)
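To make the two questions concrete, here is a minimal, pandas-free sketch of the idea. Everything in it is hypothetical and for illustration only: FakeBlock stands in for an ExtensionBlock, is_uniform_loose and is_uniform_strict model the current and the stricter check, and common_dtype is one possible coercion rule. The fallback to object for unrelated dtypes is an assumption, not a decision made in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeBlock:
    """Hypothetical stand-in for an ExtensionBlock; it only records a dtype name."""
    dtype: str


def is_uniform_loose(blocks):
    # Models the behaviour described above: only the block class is compared.
    return len({type(b) for b in blocks}) == 1


def is_uniform_strict(blocks):
    # Stricter variant: both the block class and the dtype must match.
    return len({(type(b), b.dtype) for b in blocks}) == 1


# Bit widths of the nullable integer dtypes, used to pick the widest one.
_INT_WIDTHS = {"Int8": 8, "Int16": 16, "Int32": 32, "Int64": 64}


def common_dtype(dtypes):
    """Pick a result dtype name for a list of dtype names.

    Identical dtypes are kept, nullable ints are widened to the largest,
    and anything else falls back to "object" (an assumed policy).
    """
    unique = set(dtypes)
    if len(unique) == 1:
        return next(iter(unique))
    if unique.issubset(_INT_WIDTHS):
        return max(unique, key=_INT_WIDTHS.__getitem__)
    return "object"


blocks = [FakeBlock("string"), FakeBlock("Int64")]
print(is_uniform_loose(blocks))    # True: the same-type fast path would be taken
print(is_uniform_strict(blocks))   # False: the same-type fast path would be skipped
print(common_dtype(["Int64", "Int32"]))   # Int64
print(common_dtype(["string", "Int64"]))  # object
```

With the strict check, the string/Int64 case from the example would no longer reach _concat_same_type, and a rule along the lines of common_dtype would decide what the result column becomes.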