PERF: improve StringArray.isna#57733

Merged
jorisvandenbossche merged 2 commits into pandas-dev:main from jorisvandenbossche:perf-isna-string
Apr 10, 2026

@jorisvandenbossche

Member

See #57431 (comment) for context.

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Speed-up is around 2x for this case (extracted from one of our ASVs):

import numpy as np
import pandas as pd

dtype = "string"
N = 10**6
data = np.array([str(i) * 5 for i in range(N)], dtype=object)
na_value = pd.NA
ser = pd.Series(data, dtype=dtype)

%timeit ser.isna()
# 11.3 ms ± 47.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)    <-- main
# 5.01 ms ± 55.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)    <-- PR

Not entirely sure it is worth the extra code, but it definitely gives a decent speedup for a common operation.
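
For readers who want a rough picture of what a dedicated string `isna` check does, here is a minimal pure-Python sketch. It is based only on the `val is C_NA` hunk shown in the review thread on `pandas/_libs/missing.pyx`. The function name `isna_string_sketch` is hypothetical, and this is not the Cython code from the PR; it only mirrors the idea of an identity check against `pd.NA` that writes into a `uint8` mask.

```python
import numpy as np
import pandas as pd


def isna_string_sketch(arr: np.ndarray) -> np.ndarray:
    """Hypothetical illustration: mark positions holding pd.NA in an object array."""
    # Freshly allocated, C-contiguous uint8 buffer with the same shape as the input.
    result = np.zeros(arr.shape, dtype=np.uint8)
    # reshape(-1) on a fresh contiguous array is a view, so writes land in `result`.
    flat_result = result.reshape(-1)
    # arr.flat walks the input in C order, matching the flat view of the output.
    for i, val in enumerate(arr.flat):
        if val is pd.NA:  # identity check only, no generic missing-value dispatch
            flat_result[i] = 1
    # Reinterpret the uint8 buffer as booleans without copying.
    return result.view(bool)


data = np.array(["a", pd.NA, "c"], dtype=object)
mask = isna_string_sketch(data)
print(mask.tolist())  # [False, True, False]
```

The sketch assumes that `pd.NA` is the only missing-value sentinel stored in the array. Any speed-up in the PR comes from doing this loop in Cython, which the pure-Python version above does not reproduce.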

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance Strings String extension data type and string data labels Mar 5, 2024
WillAyd
WillAyd previously requested changes Mar 5, 2024
Comment thread pandas/_libs/missing.pyx
Comment thread pandas/_libs/missing.pyx Outdated
cnp.PyArray_ITER_NEXT(it)
if val is C_NA:
# Dereference pointer (set value)
(<uint8_t *>(cnp.PyArray_ITER_DATA(it2)))[0] = <uint8_t>1

Member

You can just assign 1U - no need for the cast on the right hand side

Comment thread pandas/_libs/missing.pyx Outdated

@cython.wraparound(False)
@cython.boundscheck(False)
cpdef ndarray[uint8_t] isna_string(ndarray arr):

Member

Can we return a const uint8_t memory view instead of an ndarray? I still think generally better to use memoryviews over the long term to stay somewhat backend agnostic

Member Author

I think for the return value we want a numpy array? (inside the function we could use more memoryviews)
I am actually only using this function from python, so can make this a def function and remove the return type

Member

probably doesn't make a difference but could type arr as ndarray[object]?

@github-actions

github-actions Bot commented Apr 5, 2024

Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions Bot added the Stale label Apr 5, 2024
@jbrockmendel

Member

needs rebase, otherwise LGTM

@jbrockmendel jbrockmendel removed the Stale label Mar 7, 2026
@jbrockmendel

Member

Looking at the CI failures, either there's a real bug here or I messed something up by fiddling with joris's branch.

@jbrockmendel

Member

Looks like the CI failure is surfacing a real bug: StringArray[python].astype(object) creates a view instead of a copy, so in replace_regex we can end up modifying the original, e.g.

df = pd.DataFrame({'b': list('ab..')}, dtype="string[python]")
res = df.replace([r'\s*\.\s*', 'b'], 0, regex=True)

>>> df["b"]
0    a
1    b
2    0
3    0
Name: b, dtype: string
>>> df["b"][2]
0
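
The aliasing problem described above can be illustrated with plain NumPy. This is a generic sketch of view-versus-copy semantics on an object array, not the pandas internals or the fix that was applied; the variable names are made up for illustration.

```python
import numpy as np

original = np.array(["a", "b", ".", "."], dtype=object)

# A view shares its buffer with the original; a copy owns its own buffer.
as_view = original.view()
as_copy = original.copy()

# Writing through the view silently changes the original array.
as_view[2] = 0
# Writing to the copy leaves the original untouched.
as_copy[3] = 0

shares_with_view = np.shares_memory(original, as_view)
shares_with_copy = np.shares_memory(original, as_copy)

print(original.tolist())   # ['a', 'b', 0, '.']
print(shares_with_view)    # True
print(shares_with_copy)    # False
```

A check such as `np.shares_memory` is one way to confirm whether a conversion returned a view before mutating its result in place.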

jbrockmendel added a commit to jbrockmendel/pandas that referenced this pull request Mar 21, 2026
… replace

closes pandas-dev#57733

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jbrockmendel

Member

@WillAyd i think your comments have been addressed?

@WillAyd WillAyd dismissed their stale review April 10, 2026 00:20

outdated

@jorisvandenbossche

Member Author

@jbrockmendel thanks for updating this PR and fixing the uncovered bug!

@jorisvandenbossche jorisvandenbossche added this to the 3.1 milestone Apr 10, 2026
@jorisvandenbossche jorisvandenbossche merged commit 3fc16af into pandas-dev:main Apr 10, 2026
45 checks passed
@jorisvandenbossche jorisvandenbossche deleted the perf-isna-string branch April 10, 2026 17:00
Sharl0tteIsTaken added a commit to Sharl0tteIsTaken/pandas that referenced this pull request Apr 12, 2026
…-comparison

* upstream/main:
  PERF: use lookup instead of hash_inner_join for merge with unique right keys (pandas-dev#64691)
  BUG : update `SeriesGroupBy.ohlc()` to honor `as_index=False` (pandas-dev#65141)
  PERF: Use DataFrame-level reductions in DataFrame.agg with list of funcs (pandas-dev#65031)
  DOC: document required external libraries in read_* I/O docstrings (pandas-dev#65143)
  DOC: improve MultiIndex.is_monotonic_increasing/decreasing docstrings (pandas-dev#65154)
  BUG: Raise ValueError for non-boolean numeric_only in DataFrame/Series reductions (GH#53098) (pandas-dev#65131)
  BUG: Timedelta.round() raises ZeroDivisionError when internal unit is 's' and target frequency is sub-second (pandas-dev#64836)
  ENH: Add replace method to Index (closes pandas-dev#19495) (pandas-dev#65099)
  PERF: improve StringArray.isna (pandas-dev#57733)
  BUG: read parquet files with older pytz (DEP: keep lower pytz minimum version) (pandas-dev#65133)
  DEPR: deprecate dates-with-datetime64 in _maybe_downcast_for_indexing (pandas-dev#64871)
  DOC: note that DataFrame.values is not writeable (pandas-dev#65142)
  CLN: Update groupby observed defaults (pandas-dev#65148)
  PERF: avoid materializing values[indexer] in Block.setitem (pandas-dev#64251)
  DOC: update GroupBy.sum/min/max See Also sections (pandas-dev#65144)