BUG: Poor GroupBy Performance with ArrowDtype(...) wrapped types #60861


Open · kzvezdarov opened this issue Feb 5, 2025 · 13 comments
Labels
Arrow (pyarrow functionality) · Bug · Dtype Conversions (Unexpected or buggy dtype conversions) · Needs Discussion (Requires discussion from core team before further action)

Comments

@kzvezdarov

kzvezdarov commented Feb 5, 2025

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({"key": range(100000), "val": "test"})
%timeit df.groupby(["key"]).first();

pa_df = df.convert_dtypes(dtype_backend="pyarrow")
%timeit pa_df.groupby(["key"]).first();

pa_df = pa_df.astype({"val": pd.StringDtype("pyarrow")})
%timeit pa_df.groupby(["key"]).first();

Issue Description

Grouping and then aggregating on a dataframe that contains ArrowDtype(pyarrow.string()) columns is orders of magnitude slower than performing the same operations on an equivalent dataframe whose string column has any other acceptable string type (e.g. string, StringDtype("python"), StringDtype("pyarrow")). This is particularly surprising because StringDtype("pyarrow") does not exhibit the same problem.

Note that in the bug reproduction example, DataFrame.convert_dtypes with dtype_backend="pyarrow" converts string columns to ArrowDtype(pyarrow.string()) rather than StringDtype("pyarrow").

Finally, here's a sample run with dtypes printed out for clarity; I've reproduced this on both macOS and openSUSE Tumbleweed for the listed pandas and pyarrow versions (as well as on current main):

In [7]: import pandas as pd

In [8]: df = pd.DataFrame({"key": range(100000), "val": "test"})

In [9]: df["val"].dtype
Out[9]: dtype('O')

In [10]: %timeit df.groupby(["key"]).first();
8.37 ms ± 599 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: pa_df = df.convert_dtypes(dtype_backend="pyarrow")

In [13]: type(pa_df["val"].dtype)
Out[13]: pandas.core.dtypes.dtypes.ArrowDtype

In [14]: %timeit pa_df.groupby(["key"]).first();
2.39 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]: pa_df = pa_df.astype({"val": pd.StringDtype("pyarrow")})
    ...:

In [16]: type(pa_df["val"].dtype)
Out[16]: pandas.core.arrays.string_.StringDtype

In [17]: %timeit pa_df.groupby(["key"]).first();
12.9 ms ± 306 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Expected Behavior

Aggregation performance on ArrowDtype(pyarrow.string()) columns should be comparable to aggregation performance on StringDtype("pyarrow") and plain string-typed columns.

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.13.1
python-bits : 64
OS : Darwin
OS-release : 24.3.0
Version : Darwin Kernel Version 24.3.0: Thu Jan 2 20:24:16 PST 2025; root:xnu-11215.81.4~3/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : en_CA.UTF-8
LANG : None
LOCALE : en_CA.UTF-8

pandas : 2.2.3
numpy : 2.2.2
pytz : 2025.1
dateutil : 2.9.0.post0
pip : 24.3.1
Cython : None
sphinx : None
IPython : 8.32.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2025.2.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.5
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 19.0.0
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2025.1
qtpy : None
pyqt5 : None

@kzvezdarov added the Bug and Needs Triage labels on Feb 5, 2025
@rhshadrach
Member

Thanks for the report!

Aggregation performance on ArrowDtype(pyarrow.string()) columns should be comparable to aggregation performance on StringDtype("pyarrow"), string typed columns.

I don't think this expectation is correct. ArrowExtensionArray (corresponding to ArrowDtype) is essentially a wrapper around general PyArrow data, whereas ArrowStringArray (corresponding to StringDtype) is a data container specially crafted to support PyArrow strings. I do not believe we aim for either feature or performance parity between the two. Users should prefer StringDtype.
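
A minimal sketch of the two containers being compared (assuming pyarrow is installed; the class paths in the comments are the usual pandas locations):

import pandas as pd
import pyarrow as pa

# ArrowDtype -> ArrowExtensionArray: a thin wrapper around arbitrary PyArrow data
s_wrapped = pd.Series(["a", "b", None], dtype=pd.ArrowDtype(pa.string()))

# StringDtype("pyarrow") -> ArrowStringArray: a string-specific container backed by PyArrow
s_string = pd.Series(["a", "b", None], dtype=pd.StringDtype("pyarrow"))

print(type(s_wrapped.array))  # pandas.core.arrays.arrow.array.ArrowExtensionArray
print(type(s_string.array))   # pandas.core.arrays.string_arrow.ArrowStringArray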

@jorisvandenbossche @WillAyd - should

df.convert_dtypes(dtype_backend="pyarrow")

be converting to StringDtype here?

@rhshadrach added the Dtype Conversions, Needs Discussion, and Arrow labels and removed the Needs Triage label on Feb 5, 2025
@WillAyd
Member

WillAyd commented Feb 6, 2025

Hmm I don't think so - the StringDtype is associated with the "numpy_nullable" backend. The pyarrow backend has always provided back the ArrowExtensionArray wrapper types.

Do we know where the bottleneck is performance-wise? I agree with the OP that the performance should be equivalent between the two types, if not slightly faster for the ArrowExtensionArray

@rhshadrach
Member

Hmm I don't think so - the StringDtype is associated with the "numpy_nullable" backend.

_dtype: StringDtype # type: ignore[assignment]

@snitish
Member

snitish commented Feb 6, 2025

@WillAyd looks like the bottleneck is due to ArrowExtensionArray._groupby_op not properly handling the string[pyarrow] dtype. We have special handling for StringDtype, but the ArrowDtype variant of string[pyarrow] hits to_masked(), which again does not handle this dtype, so the groupby operation falls back to pure Python, causing the slowness.
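
A quick way to confirm the pure-Python fallback is to profile the slow call (a sketch; _agg_py_fallback is the internal fallback path referenced above):

import cProfile

import pandas as pd

df = pd.DataFrame({"key": range(100_000), "val": "test"}).convert_dtypes(
    dtype_backend="pyarrow"
)

# For the ArrowDtype(pa.string()) column, cumulative time concentrates in the
# object-dtype fallback (_agg_py_fallback) rather than in the cython groupby kernels.
cProfile.run('df.groupby("key").first()', sort="cumulative")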

@WillAyd
Member

WillAyd commented Feb 6, 2025

Thanks for that insight, @snitish. Yeah, anything we can do to remove the layers of indirection, and in particular the Arrow <-> NumPy copies, would be very beneficial.

@WillAyd
Member

WillAyd commented Feb 6, 2025

FWIW I realize the OP is talking about strings, but this likely applies to the wrapped Arrow types in general. You can see the same performance issues using another Arrow type like decimal:

In [108]: df = pd.DataFrame({
     ...:     "key": pd.Series(range(100_000), dtype=pd.ArrowDtype(pa.int64())),
     ...:     "val": pd.Series(["3.14"] * 100_000, dtype=pd.ArrowDtype(pa.decimal128(10, 4)))
     ...: })
In [109]: %timeit df.groupby("key").sum()
2.83 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [110]: %timeit pa.TableGroupBy(pa.Table.from_pandas(df), "key").aggregate([("val", "sum")])
8.18 ms ± 589 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@WillAyd changed the title from "BUG: Abnormally slow DataFrame aggregations on ArrowDtype(pa.string()) columns" to "BUG: Poor GroupBy Performance with ArrowDtype(...) wrapped types" on Feb 6, 2025
@snitish
Member

snitish commented Feb 9, 2025

@WillAyd also noticed that .astype("string[pyarrow]") converts the dtype to StringDtype and the aggregation is much faster, while .convert_dtypes(dtype_backend="pyarrow") does not. Both dtypes display as string[pyarrow], but only one is a StringDtype. Is this expected?

In [2]: df = pd.DataFrame({"key": range(100000), "val": "test"}).astype({"val": "string[pyarrow]"})

In [3]: df["val"].dtype
Out[3]: string[pyarrow]

In [4]: isinstance(df["val"].dtype, pd.StringDtype)
Out[4]: True

In [6]: %timeit df.groupby("key").first()
16.5 ms ± 234 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: pa_df = df.convert_dtypes(dtype_backend="pyarrow")

In [8]: pa_df["val"].dtype
Out[8]: string[pyarrow]

In [9]: isinstance(pa_df["val"].dtype, pd.StringDtype)
Out[9]: False

In [10]: %timeit pa_df.groupby("key").first()
4.73 s ± 44.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@kzvezdarov
Author

I meant to open a separate issue about convert_dtypes but got carried away and forgot to, sorry about that. At least per the pandas.DataFrame.convert_dtypes doc, it seems that when convert_string is set to True, string objects should be converted to StringDtype, but in practice they appear to get converted to the wrapper type ArrowDtype(pyarrow.string()) instead. In both cases the dtype's name property resolves to string[pyarrow].

In [3]: df = pd.DataFrame({"val": ["test"]}).convert_dtypes(dtype_backend="pyarrow")

In [4]: type(df.dtypes["val"])
Out[4]: pandas.core.dtypes.dtypes.ArrowDtype

@jorisvandenbossche
Member

Both dtypes display as string[pyarrow], but only one is a StringDtype. Is this expected?

Yes, that is expected (although it is very annoying that they have the same repr ...). See also https://pandas.pydata.org/pdeps/0014-string-dtype.html for context. The StringDtype variant will become the default in 3.0.

@jorisvandenbossche
Member

At least per the pandas.DataFrame.convert_dtypes doc, it seems that when convert_string is set to True, string objects should be converted to StringDtype, but in practice they appear to get converted to the wrapper type ArrowDtype(pyarrow.string()) instead.

This is of course not a simple case with one obvious correct answer, but in general the idea is that convert_dtypes() is meant to give you the nullable dtypes, and when additionally passing dtype_backend="pyarrow", you are asking to get ArrowDtype dtypes (that is also what the docstring says).

Of course, if more of the non-ArrowDtype nullable dtypes end up backed by pyarrow (like StringDtype now), this division gets a bit tricky (which is another reason to move forward with #58455).
Now, I assume one issue here is that the default StringDtype currently uses the "python" string storage rather than pyarrow, and that might be why you are asking for dtype_backend="pyarrow". On the dev version we have already switched the default storage of StringDtype to pyarrow, so starting with pandas 3.0, if you want nullable dtypes through convert_dtypes and want pyarrow just for the strings, you won't need to specify dtype_backend="pyarrow".
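
A small sketch of the distinction being described (the exact string storage of the default StringDtype depends on the pandas version):

import pandas as pd

df = pd.DataFrame({"val": ["test"]})

# Default backend: nullable extension dtypes, so strings become StringDtype
print(type(df.convert_dtypes().dtypes["val"]))
# -> pandas.core.arrays.string_.StringDtype

# dtype_backend="pyarrow": ArrowDtype wrappers throughout
print(type(df.convert_dtypes(dtype_backend="pyarrow").dtypes["val"]))
# -> pandas.core.dtypes.dtypes.ArrowDtype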

@kzvezdarov
Author

Makes sense, thanks for clarifying. I'd just misinterpreted the relation between convert_string=True and dtype_backend="pyarrow" to imply converting to the nullable extension type, set to the pyarrow backend.

@kzvezdarov
Author

FWIW I realize the OP is talking about strings, but this likely applies to the wrapped Arrow types in general. You can see the same performance issues using another Arrow type like decimal:

In [108]: df = pd.DataFrame({
...: "key": pd.Series(range(100_000), dtype=pd.ArrowDtype(pa.int64())),
...: "val": pd.Series(["3.14"] * 100_000, dtype=pd.ArrowDtype(pa.decimal128(10, 4)))
...: })
In [109]: %timeit df.groupby("key").sum()
2.83 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [110]: %timeit pa.TableGroupBy(pa.Table.from_pandas(df), "key").aggregate([("val", "sum")])
8.18 ms ± 589 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It seems to only happen for the decimal and string ArrowDtypes; the rest that I've tested with (float64, int64, boolean, timestamp) all seem to be only marginally slower. cProfile shows that for both ArrowDtype(pa.string()) and ArrowDtype(pa.decimal128(10, 4)), aggregation falls through to _agg_py_fallback:
(cProfile screenshot showing the call ending in _agg_py_fallback)

whereas for other types it continues on to _cython_operation:
(cProfile screenshot showing the call continuing into _cython_operation)

Making ArrowDtype(pyarrow.string()) follow the handling of StringDtype seems to work, in the sense of fixing the performance issue, but I'm not that familiar with pandas' internals, so I'm not sure whether this is a correct/acceptable approach:

diff --git a/pandas/core/arrays/arrow/array.py b/pandas/core/arrays/arrow/array.py
index 0b546bed1c..89952b8f96 100644
--- a/pandas/core/arrays/arrow/array.py
+++ b/pandas/core/arrays/arrow/array.py
@@ -2376,7 +2376,9 @@ class ArrowExtensionArray(
         ids: npt.NDArray[np.intp],
         **kwargs,
     ):
-        if isinstance(self.dtype, StringDtype):
+        if isinstance(self.dtype, StringDtype) or (
+            self.dtype.type == str and self.dtype.storage == "pyarrow"
+        ):
             if how in [
                 "prod",
                 "mean",
diff --git a/pandas/core/arrays/base.py b/pandas/core/arrays/base.py
index 33745438e2..f4824255d3 100644
--- a/pandas/core/arrays/base.py
+++ b/pandas/core/arrays/base.py
@@ -2608,7 +2608,9 @@ class ExtensionArray:
         op = WrappedCythonOp(how=how, kind=kind, has_dropped_na=has_dropped_na)
 
         # GH#43682
-        if isinstance(self.dtype, StringDtype):
+        if isinstance(self.dtype, StringDtype) or (
+            self.dtype.type == str and self.dtype.storage == "pyarrow"
+        ):
             # StringArray
             if op.how in [
                 "prod",
@@ -2648,7 +2650,9 @@ class ExtensionArray:
             #  through cython_operation
             return res_values
 
-        if isinstance(self.dtype, StringDtype):
+        if isinstance(self.dtype, StringDtype) or (
+            self.dtype.type == str and self.dtype.storage == "pyarrow"
+        ):
             dtype = self.dtype
             string_array_cls = dtype.construct_array_type()
             return string_array_cls._from_sequence(res_values, dtype=dtype)

@WillAyd
Member

WillAyd commented Feb 10, 2025

Thanks @kzvezdarov, that is a great analysis.

It seems to only happen for the decimal and string ArrowDtype, the rest that I've tested with - float64, int64, boolean, timestamp - all seem to be only marginally slower.

Fair point. The qualification should be that this applies to ArrowDtypes where there is no zero-copy translation from PyArrow to NumPy. Integral, floating point, and timestamp (without any null values) are all zero copy. Boolean should take a performance hit since it requires a copy. Decimal and string were already called out, but I would expect other types like binary, string_view, list, map, etc. to also exhibit poor performance with the current implementation.
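
A rough way to probe which Arrow types can be handed to NumPy without a copy (a sketch using pyarrow's Array.to_numpy(zero_copy_only=True); null-free arrays assumed):

from decimal import Decimal

import pyarrow as pa

examples = {
    "int64": pa.array([1, 2, 3]),
    "double": pa.array([1.0, 2.0, 3.0]),
    "timestamp[us]": pa.array([1, 2, 3], type=pa.timestamp("us")),
    "bool": pa.array([True, False, True]),  # bit-packed, so a copy is needed
    "string": pa.array(["a", "b", "c"]),  # variable-width, so a copy is needed
    "decimal128(10, 4)": pa.array([Decimal("3.14")], type=pa.decimal128(10, 4)),
}

for name, arr in examples.items():
    try:
        arr.to_numpy(zero_copy_only=True)
        print(f"{name}: zero copy")
    except pa.ArrowInvalid:
        print(f"{name}: requires a copy")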
