PERF: improve StringArray.isna by jorisvandenbossche · Pull Request #57733 · pandas-dev/pandas

jorisvandenbossche · 2024-03-05T09:38:01Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Speed-up is around 2x for this case (extracted from one of our ASVs):

dtype = "string"
N = 10**6
data = np.array([str(i) * 5 for i in range(N)], dtype=object)
na_value = pd.NA
ser = pd.Series(data, dtype=dtype)

%timeit ser.isna()
# 11.3 ms ± 47.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)    <-- main
# 5.01 ms ± 55.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)    <-- PR

Not entirely sure it is worth the extra code, but it definitely gives a decent speedup for a common operation.
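For anyone wanting to reproduce the timing outside IPython, here is a minimal sketch using the standard-library `timeit` module. The helper name `time_isna` is hypothetical, and `"string[python]"` is chosen here (an assumption, the snippet above uses `"string"`) so that the object-backed StringArray is exercised regardless of the configured string storage. It reports the best run rather than the mean ± std. dev. that `%timeit` prints, so numbers will not match exactly.

```python
import timeit

import numpy as np
import pandas as pd


def time_isna(n=10**6, repeat=7, number=100, dtype="string[python]"):
    """Return the best per-call time (in seconds) of Series.isna() on n strings."""
    # Same data construction as in the PR description.
    data = np.array([str(i) * 5 for i in range(n)], dtype=object)
    ser = pd.Series(data, dtype=dtype)
    # timeit.repeat returns the total seconds for `number` calls, once per repeat.
    totals = timeit.repeat(ser.isna, repeat=repeat, number=number)
    return min(totals) / number


# Example usage (takes a few seconds with the default size):
# print(f"isna: {time_isna() * 1e3:.2f} ms per call")
```

Running it once on main and once on the PR branch should give a comparable ratio, though absolute values depend on the machine.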

WillAyd · 2024-03-05T14:24:00Z

+        cnp.PyArray_ITER_NEXT(it)
+        if val is C_NA:
+            # Dereference pointer (set value)
+            (<uint8_t *>(cnp.PyArray_ITER_DATA(it2)))[0] = <uint8_t>1


You can just assign 1U - no need for the cast on the right hand side

WillAyd · 2024-03-05T14:27:07Z


+@cython.wraparound(False)
+@cython.boundscheck(False)
+cpdef ndarray[uint8_t] isna_string(ndarray arr):


Can we return a const uint8_t memoryview instead of an ndarray? I still think it is generally better to use memoryviews over the long term to stay somewhat backend agnostic

jorisvandenbossche · 2024-03-05

I think for the return value we want a numpy array? (Inside the function we could use more memoryviews.)
I am actually only using this function from Python, so I can make this a def function and remove the return type.

jbrockmendel · 2025-07-08

Probably doesn't make a difference, but could we type arr as ndarray[object]?
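As a reading aid for the diff fragments quoted in this thread, the identity check they perform can be sketched in pure Python. This is not the Cython code from the PR: the name `isna_string_reference` is hypothetical, the sketch assumes missing values are stored as the `pd.NA` singleton, and the final view to bool is an assumption about how a caller might consume the uint8 result.

```python
import numpy as np
import pandas as pd


def isna_string_reference(arr):
    """Pure-Python sketch: mark positions whose element is the pd.NA singleton."""
    out = np.zeros(arr.shape, dtype=np.uint8)
    # np.ndenumerate yields (index, element) pairs, so any number of dimensions works.
    for idx, val in np.ndenumerate(arr):
        if val is pd.NA:  # identity check, mirroring `val is C_NA` in the diff
            out[idx] = 1
    # Reinterpret the 0/1 uint8 buffer as booleans without copying.
    return out.view(np.bool_)
```

A loop like this is far slower than the compiled version, but it can serve as a correctness oracle when comparing against `ser.isna()` in a test.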

github-actions · 2024-04-05T00:05:58Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

jbrockmendel · 2025-07-08T17:02:06Z

needs rebase, otherwise LGTM

jbrockmendel · 2026-03-07T19:16:42Z

Looking at the CI failures, either there's a real bug here or I messed something up by fiddling with joris's branch.

jbrockmendel · 2026-03-19T02:04:05Z

Looks like the CI failure is surfacing a real bug: StringArray[python].astype(object) creates a view instead of a copy, so in replace_regex we can end up modifying the original. e.g.

df = pd.DataFrame({'b': list('ab..')}, dtype="string[python]")
res = df.replace([r'\s*\.\s*', 'b'], 0, regex=True)

>>> df["b"]
0    a
1    b
2    0
3    0
Name: b, dtype: string
>>> df["b"][2]
0
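The aliasing hazard behind this can be illustrated with plain NumPy object arrays. This is only a sketch of the general view-versus-copy behaviour, not the pandas code path itself; where exactly the view was created inside pandas is not shown in this thread.

```python
import numpy as np

original = np.array(["a", "b", ".", "."], dtype=object)

# With copy=False, astype may hand back the input array itself when no
# conversion is needed, so the result aliases the original buffer.
alias = original.astype(object, copy=False)
# The default (copy=True) always allocates a new array.
independent = original.astype(object)

assert np.shares_memory(alias, original)
assert not np.shares_memory(independent, original)

# Writing through the alias leaks into the original ...
alias[2] = 0
# ... while writing to the copy leaves the original untouched.
independent[3] = 0
```

`np.shares_memory` is a convenient way to assert in a regression test that a non-inplace operation worked on its own buffer.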


jbrockmendel · 2026-04-02T15:05:00Z

@WillAyd i think your comments have been addressed?


jorisvandenbossche · 2026-04-10T17:00:12Z

@jbrockmendel thanks for updating this PR and fixing the uncovered bug!

…-comparison * upstream/main:
  PERF: use lookup instead of hash_inner_join for merge with unique right keys (pandas-dev#64691)
  BUG : update `SeriesGroupBy.ohlc()` to honor `as_index=False` (pandas-dev#65141)
  PERF: Use DataFrame-level reductions in DataFrame.agg with list of funcs (pandas-dev#65031)
  DOC: document required external libraries in read_* I/O docstrings (pandas-dev#65143)
  DOC: improve MultiIndex.is_monotonic_increasing/decreasing docstrings (pandas-dev#65154)
  BUG: Raise ValueError for non-boolean numeric_only in DataFrame/Series reductions (GH#53098) (pandas-dev#65131)
  BUG: Timedelta.round() raises ZeroDivisionError when internal unit is 's' and target frequency is sub-second (pandas-dev#64836)
  ENH: Add replace method to Index (closes pandas-dev#19495) (pandas-dev#65099)
  PERF: improve StringArray.isna (pandas-dev#57733)
  BUG: read parquet files with older pytz (DEP: keep lower pytz minimum version) (pandas-dev#65133)
  DEPR: deprecate dates-with-datetime64 in _maybe_downcast_for_indexing (pandas-dev#64871)
  DOC: note that DataFrame.values is not writeable (pandas-dev#65142)
  CLN: Update groupby observed defaults (pandas-dev#65148)
  PERF: avoid materializing values[indexer] in Block.setitem (pandas-dev#64251)
  DOC: update GroupBy.sum/min/max See Also sections (pandas-dev#65144)

jorisvandenbossche added the Performance (Memory or execution speed performance) and Strings (String extension data type and string data) labels Mar 5, 2024

jorisvandenbossche requested a review from WillAyd as a code owner March 5, 2024 09:38

jorisvandenbossche mentioned this pull request Mar 5, 2024

Potential perf regressions introduced by Copy-on-Write #57431

Closed

50 tasks

WillAyd previously requested changes Mar 5, 2024

github-actions Bot added the Stale label Apr 5, 2024

jbrockmendel removed the Stale label Mar 7, 2026

jbrockmendel added a commit to jbrockmendel/pandas that referenced this pull request Mar 21, 2026

BUG: Fix _replace_regex mutating StringArray in-place for non-inplace…

3678205

… replace closes pandas-dev#57733 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jbrockmendel mentioned this pull request Mar 21, 2026

BUG: Fix _replace_regex mutating StringArray in-place for non-inplace replace #64752

Merged

jbrockmendel approved these changes Apr 1, 2026

PERF: improve StringArray.isna

a2ac346

jbrockmendel force-pushed the perf-isna-string branch from a99f9d8 to a2ac346 Compare April 1, 2026 21:58

Merge branch 'main' into perf-isna-string

d1359f3

jorisvandenbossche added this to the 3.1 milestone Apr 10, 2026

jorisvandenbossche merged commit 3fc16af into pandas-dev:main Apr 10, 2026
45 checks passed

jorisvandenbossche deleted the perf-isna-string branch April 10, 2026 17:00

jorisvandenbossche merged 2 commits into pandas-dev:main from jorisvandenbossche:perf-isna-string