
Add a method to ExtensionArray interface for whether an array contains NAs #22680


Closed
TomAugspurger opened this issue Sep 12, 2018 · 4 comments · Fixed by #45024
Labels: Enhancement, ExtensionArray, Performance

@TomAugspurger
Contributor

In lots of places, pandas does something like if np.any(arr.isna()), which is wasteful as we have to create an ndarray of booleans just to check whether there are any True values.
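To make the cost concrete, here is a minimal sketch of the pattern described above. The choice of a nullable `Int64` array is only an assumption for illustration; the point is that `isna()` produces a full-length boolean array that is discarded right after `np.any` reduces it to one value.

```python
import numpy as np
import pandas as pd

# Illustrative array; any ExtensionArray with an isna() method would do.
arr = pd.array([1, 2, None, 4], dtype="Int64")

# isna() returns a boolean ndarray with one element per value ...
mask = arr.isna()

# ... which is only used to answer a single yes/no question.
has_na = bool(np.any(mask))
```

The intermediate `mask` grows linearly with the array, even though the answer is a single boolean.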

Some arrays, like Arrow, know ahead of time whether there are any NAs in the array. Would it make sense to expose an API that lets an array say whether it has any missing values?

With indexes, we work around this by caching a _hasnans value. That wouldn't work for mutable arrays.
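The mutability problem can be sketched in plain Python. `CachedHasNans` below is a hypothetical container written for this illustration, not pandas code: it caches the flag on first access and, because nothing invalidates the cache on assignment, reports a stale answer after an in-place write.

```python
import math


class CachedHasNans:
    """Hypothetical mutable container that caches its has-NaN flag."""

    def __init__(self, values):
        self._values = list(values)
        self._hasnans = None  # computed lazily, then reused

    @property
    def hasnans(self):
        if self._hasnans is None:
            self._hasnans = any(math.isnan(v) for v in self._values)
        return self._hasnans

    def __setitem__(self, i, value):
        # The cache is deliberately NOT invalidated here: this is the
        # failure mode being illustrated.
        self._values[i] = value


data = CachedHasNans([1.0, 2.0, 3.0])
before = data.hasnans       # computed and cached
data[0] = float("nan")      # in-place mutation
stale = data.hasnans        # cached value, no longer correct
actual = any(math.isnan(v) for v in data._values)
```

An immutable Index never hits this case, which is why the cached value is safe there; a mutable array would need every write path to reset the cache.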

@TomAugspurger TomAugspurger added Performance Memory or execution speed performance API Design ExtensionArray Extending pandas with custom dtypes or arrays. labels Sep 12, 2018
@WillAyd
Member

WillAyd commented Sep 12, 2018

How wasteful do you think this is exactly? Conceptually this makes sense, although ensuring the proper state management of a property like that will bring its own nuances, so we would want to make sure there's a need for it up front.

@TomAugspurger
Contributor Author

Depends on the array, but an extreme case is SparseArray. If the fill_value is np.nan, checking whether there are any NA points is a simple lookup. But currently we would have to allocate a dense boolean ndarray.
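As a rough sketch of that lookup: `sparse_has_na` below is a hypothetical helper, not a pandas function. It assumes that when `fill_value` is NaN the stored values themselves contain no NaN (which holds when the array is built from dense data), so any gap in the sparse index means a missing value.

```python
import numpy as np
import pandas as pd


def sparse_has_na(sp):
    """Hypothetical helper: answer 'any NA?' without a dense mask."""
    if pd.isna(sp.fill_value):
        # Every position not explicitly stored holds the NaN fill value.
        # Assumption: the stored values contain no NaN themselves.
        return bool(sp.npoints < len(sp))
    # Otherwise only the explicitly stored values can be NA.
    return bool(np.any(pd.isna(sp.sp_values)))


with_gap = pd.arrays.SparseArray([1.0, np.nan, 3.0])        # fill_value is NaN
no_gap = pd.arrays.SparseArray([1.0, 2.0])                  # fill_value is NaN
zeros = pd.arrays.SparseArray([0.0, 0.0, 1.0], fill_value=0.0)

results = (sparse_has_na(with_gap), sparse_has_na(no_gap), sparse_has_na(zeros))
```

In both branches the work is proportional to the number of stored points at most, not to the dense length.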

> ensuring the proper state management of a property like that will bring its own nuances

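To be clear, we could make a default implementation:

from typing import Union

import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype

def any_na(arr: "Union[np.ndarray, ExtensionArray]") -> bool:
    # Let extension arrays answer for themselves; fall back for ndarrays.
    if is_extension_array_dtype(arr):
        return arr._any_na()
    else:
        return np.any(pd.isna(arr))

class ExtensionArray:
    ...
    def _any_na(self):
        # Default: build the boolean mask; subclasses can override cheaply.
        return np.any(pd.isna(self))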

I don't think this will prove to be a large maintenance burden.

@xhochy
Contributor

xhochy commented Sep 13, 2018

In Arrow we also use this parameter to determine different code execution paths. When there are no nulls, you can often switch to a lot more efficient version of an algorithm. Thus this property is a small maintenance burden for a big performance win.
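The idea of switching code paths on a known null count can be sketched with NumPy alone. `NullCountedArray` is a hypothetical class made up for this illustration (it is neither Arrow's nor pandas' implementation): the null count is recorded once at construction, and the reduction skips all mask handling when it is zero.

```python
import numpy as np


class NullCountedArray:
    """Hypothetical masked array that records its null count up front."""

    def __init__(self, values, mask):
        self.values = np.asarray(values, dtype="float64")
        self.mask = np.asarray(mask, dtype=bool)  # True marks a null
        self.null_count = int(self.mask.sum())    # computed once

    def _any_na(self):
        # Constant-time answer, no boolean array allocated.
        return self.null_count > 0

    def total(self):
        if not self._any_na():
            # Fast path: no nulls, so the mask can be ignored entirely.
            return float(self.values.sum())
        # Slow path: exclude the null positions before summing.
        return float(self.values[~self.mask].sum())


dense = NullCountedArray([1, 2, 3], [False, False, False])
holey = NullCountedArray([1, 2, 3], [False, True, False])
totals = (dense.total(), holey.total())
```

This sketch treats the array as immutable; a mutable version would have to update `null_count` on every write, which is the state-management concern raised earlier in the thread.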

@jorisvandenbossche
Member

Yes, +1 to adding such a property / method. BTW, we actually already have that on Series as well: Series.hasnans, and there are quite a few places in the datetime-related index/array code where this property is used (even if it would not be more efficient for arrays, I think it makes the code cleaner).
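For reference, the existing property mentioned here can be used as follows; the sample data is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
idx = pd.Index([1.0, 2.0, 3.0])

# hasnans is available on both Series and Index.
s_has = s.hasnans
idx_has = idx.hasnans
```

Call sites can then read `if s.hasnans:` rather than `if np.any(s.isna()):`, which is the readability gain referred to above.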


6 participants