PERF: improve StringArray.isna#57733
Conversation
| cnp.PyArray_ITER_NEXT(it) | ||
| if val is C_NA: | ||
| # Dereference pointer (set value) | ||
| (<uint8_t *>(cnp.PyArray_ITER_DATA(it2)))[0] = <uint8_t>1 |
There was a problem hiding this comment.
You can just assign 1U - no need for the cast on the right hand side
|
|
||
| @cython.wraparound(False) | ||
| @cython.boundscheck(False) | ||
| cpdef ndarray[uint8_t] isna_string(ndarray arr): |
There was a problem hiding this comment.
Can we return a const uint8_t memory view instead of an ndarray? I still think generally better to use memoryviews over the long term to stay somewhat backend agnostic
There was a problem hiding this comment.
I think for the return value we want a numpy array? (inside the function we could use more memoryviews)
I am actually only using this function from python, so can make this a def function and remove the return type
There was a problem hiding this comment.
probably doesn't make a difference but could type arr as ndarray[object]?
|
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
|
needs rebase, otherwise LGTM |
|
Looking at the CI failures, either there's a real bug here or I messed something up by fiddling with joris's branch. |
|
Looks like the CI failure is surfacing a real bug: StringArray[python].astype(object) creates a view instead of a copy, so in in replace_regex we can end up modifying the original. e.g. |
… replace closes pandas-dev#57733 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
a99f9d8 to
a2ac346
Compare
|
@WillAyd i think your comments have been addressed? |
|
@jbrockmendel thanks for updating this PR and fixing the uncovered bug! |
…-comparison * upstream/main: PERF: use lookup instead of hash_inner_join for merge with unique right keys (pandas-dev#64691) BUG : update `SeriesGroupBy.ohlc()` to honor `as_index=False` (pandas-dev#65141) PERF: Use DataFrame-level reductions in DataFrame.agg with list of funcs (pandas-dev#65031) DOC: document required external libraries in read_* I/O docstrings (pandas-dev#65143) DOC: improve MultiIndex.is_monotonic_increasing/decreasing docstrings (pandas-dev#65154) BUG: Raise ValueError for non-boolean numeric_only in DataFrame/Series reductions (GH#53098) (pandas-dev#65131) BUG: Timedelta.round() raises ZeroDivisionError when internal unit is 's' and target frequency is sub-second (pandas-dev#64836) ENH: Add replace method to Index (closes pandas-dev#19495) (pandas-dev#65099) PERF: improve StringArray.isna (pandas-dev#57733) BUG: read parquet files with older pytz (DEP: keep lower pytz minimum version) (pandas-dev#65133) DEPR: deprecate dates-with-datetime64 in _maybe_downcast_for_indexing (pandas-dev#64871) DOC: note that DataFrame.values is not writeable (pandas-dev#65142) CLN: Update groupby observed defaults (pandas-dev#65148) PERF: avoid materializing values[indexer] in Block.setitem (pandas-dev#64251) DOC: update GroupBy.sum/min/max See Also sections (pandas-dev#65144)
See #57431 (comment) for context.
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.Speed-up is around 2x for this case (extracted from one of our ASVs):
Not entirely sure it is worth the extra code, but so it definitely gives a decent speedup for a common operation.