
PyArrow StringDtype / StringArray fallback policy #42613

Closed
TomAugspurger opened this issue Jul 19, 2021 · 3 comments · Fixed by #46732
Labels
API Design Arrow pyarrow functionality Needs Discussion Requires discussion from core team before further action Strings String extension data type and string data

Comments

@TomAugspurger
Contributor

In #35169 / #42597 we discussed the desired behavior of PyArrow-backed StringArray when a certain method is not implemented in pyarrow.compute.

For string methods like str_normalize, which aren't currently implemented in pyarrow.compute, I believe we (silently) cast from string[pyarrow] to an object-dtype ndarray of Python str objects at these lines:

    mask = isna(self)
    arr = np.asarray(self)

That's going to be slow and more than doubles the memory usage of the array.
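To give a feel for the memory claim, here is a rough back-of-the-envelope estimate using only the standard library. It is a sketch, not a measurement of pandas itself: the helper names are hypothetical, the object-array figure assumes 64-bit CPython with one distinct str object per element, and the Arrow figure assumes the standard variable-size binary layout (UTF-8 data, int32 offsets, validity bitmap).

```python
import sys


def object_array_bytes(values):
    """Rough size of an object-dtype array of str (hypothetical helper).

    Assumes 64-bit CPython: one 8-byte pointer per element plus one
    distinct Python str object per element.
    """
    return sum(8 + sys.getsizeof(v) for v in values)


def arrow_string_bytes(values):
    """Rough size of an Arrow string array (hypothetical helper).

    Assumes UTF-8 data, (n + 1) int32 offsets, and one validity bit
    per element.
    """
    data = sum(len(v.encode("utf-8")) for v in values)
    offsets = 4 * (len(values) + 1)
    validity = (len(values) + 7) // 8
    return data + offsets + validity


words = ["pandas", "pyarrow", "string", "fallback"] * 1000
obj_size = object_array_bytes(words)
arrow_size = arrow_string_bytes(words)
print(obj_size, arrow_size, round(obj_size / arrow_size, 1))
```

The object-dtype copy exists in addition to the original Arrow buffers, so for short strings the peak footprint during the fallback is well above twice the Arrow size.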

These kinds of performance cliffs are difficult for users to debug. I don't think we should do that conversion on behalf of the user. If something isn't implemented yet, then I think we should raise with a message saying they should convert to string[python] dtype first.

If we don't want to raise, we could emit a PerformanceWarning, similar to what we do for SparseArray when converting to dense.
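The two policies can be compared side by side in a small standard-library sketch. Everything here is hypothetical: `FAST_KERNELS` stands in for whatever pyarrow.compute provides, `SLOW_FALLBACKS` stands in for the object-dtype code path, `call_string_method` is an invented dispatcher, and the local `PerformanceWarning` class stands in for `pandas.errors.PerformanceWarning`.

```python
import unicodedata
import warnings


class PerformanceWarning(Warning):
    """Local stand-in for pandas.errors.PerformanceWarning."""


# Hypothetical registry of methods that have a fast native kernel.
FAST_KERNELS = {
    "upper": lambda values: [v.upper() for v in values],
}

# Hypothetical slow paths that operate on Python str objects.
SLOW_FALLBACKS = {
    "normalize": lambda values, form: [
        unicodedata.normalize(form, v) for v in values
    ],
}


def call_string_method(name, values, *args, on_missing="warn"):
    """Dispatch to a fast kernel, else raise or warn and fall back."""
    if name in FAST_KERNELS:
        return FAST_KERNELS[name](values, *args)
    if on_missing == "raise":
        raise NotImplementedError(
            f"{name!r} is not implemented for this dtype; "
            "convert to string[python] first"
        )
    warnings.warn(
        f"{name!r} has no native kernel; falling back to Python objects",
        PerformanceWarning,
        stacklevel=2,
    )
    return SLOW_FALLBACKS[name](values, *args)
```

With `on_missing="warn"` the call returns a result and reports the slow path; with `on_missing="raise"` the caller has to convert explicitly before retrying.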

@TomAugspurger TomAugspurger added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 19, 2021
@simonjayhawkins
Member

maybe also consider a nopython=True/False setting (pandas option, dtype constructor argument and/or array attribute)

@simonjayhawkins simonjayhawkins added API Design Needs Discussion Requires discussion from core team before further action Strings String extension data type and string data and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 20, 2021
@TomAugspurger
Contributor Author

Yep, those would be worth exploring. Global options are harder for Dask, since it's difficult to propagate them to multiple processes (it'd be easier if we had a config system like Dask's). A per-instance option on StringArray / StringDtype might also be useful, though we'd need to propagate that setting through operations, which sounds a bit hard.
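The propagation concern can be illustrated with a toy container. This is only a sketch under assumptions: `StringsSketch` is a hypothetical class, not a pandas type, and it assumes `nopython=True` means "never fall back to Python objects", which the thread does not spell out.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class StringsSketch:
    """Hypothetical stand-in for a string array with a per-instance flag."""

    values: tuple
    nopython: bool = False  # assumed meaning: forbid Python fallbacks

    def _wrap(self, new_values):
        # Every result must be built through this helper, otherwise the
        # setting is silently lost after the first operation.
        return StringsSketch(tuple(new_values), nopython=self.nopython)

    def upper(self):
        # Stands in for a method that has a native kernel.
        return self._wrap(v.upper() for v in self.values)

    def normalize(self, form):
        # Stands in for a method that only has a Python-level path.
        if self.nopython:
            raise NotImplementedError(
                "normalize would fall back to Python objects"
            )
        return self._wrap(unicodedata.normalize(form, v) for v in self.values)


strict = StringsSketch(("a", "b"), nopython=True)
result = strict.upper()
print(result.nopython)
```

The difficulty raised above shows up in `_wrap`: any method, concatenation, or indexing path that constructs a result without it would drop the flag.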

@mroeschke
Member

I think the nicest behavior for users would be for operations to "just work" via a fallback method, so I would favor emitting a PerformanceWarning when the fallback happens. The big assumption here is that the fallback method can always produce the same result as the pyarrow compute method.

The easiest thing for us to do is probably to raise and tell the user to convert, but one inconvenience is that users would have to revert their conversions to string[python] in their code base if they ever upgrade pyarrow to access more performant methods.

So I think for users

  • User's codebase calls foo() and gets PerformanceWarning and a result
  • User upgrades Pyarrow
  • User's codebase calls foo() and gets a performant result

is a better experience than

  • User's codebase calls foo() and gets NotImplementedError and is told to convert to string[python]
  • User's codebase calls .astype("string[python]").foo() and gets a result
  • User upgrades Pyarrow
  • User needs to remove .astype("string[python]") from their codebase to get a performant result
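One point in favor of the warning-based flow is that users who prefer the strict behavior can opt into it themselves with Python's standard warning filters. The sketch below uses only the standard library; `foo` is the placeholder method from the lists above, `strict_call` is a hypothetical helper, and the local `PerformanceWarning` class stands in for `pandas.errors.PerformanceWarning`.

```python
import warnings


class PerformanceWarning(Warning):
    """Local stand-in for pandas.errors.PerformanceWarning."""


def foo(values):
    """Hypothetical string method that currently has only a slow path."""
    warnings.warn(
        "foo fell back to a Python implementation",
        PerformanceWarning,
        stacklevel=2,
    )
    return [v.title() for v in values]


# Default flow: the call keeps working and the user code needs no change
# if a later release swaps in a fast implementation.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", PerformanceWarning)
    result = foo(["tom", "simon"])


def strict_call(values):
    """Escalate the fallback warning to an exception for this call only."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", PerformanceWarning)
        return foo(values)


try:
    strict_call(["tom"])
    fell_back = False
except PerformanceWarning:
    fell_back = True
```

The reverse is not possible: if the library raises NotImplementedError, user code cannot turn that into a working fallback without adding the explicit conversion.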
