BUG: str.find returns byte offset instead of character offset with str dtype by Mr-Neutr0n · Pull Request #64133 · pandas-dev/pandas

Mr-Neutr0n · 2026-02-13T13:37:23Z

pc.find_substring returns byte positions rather than character positions for multi-byte UTF-8 encoded strings. This causes Series.str.find() to return incorrect results when using Arrow-backed StringDtype.

For example, find('a') in '永a' returns 3 (the byte offset of 'a', since '永' occupies three bytes in UTF-8) instead of the expected 1 (the character offset).
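
A minimal sketch of the discrepancy, using only the Python standard library rather than pyarrow itself: searching the UTF-8 bytes gives a byte offset, while searching the `str` gives a character offset. The variable names are illustrative.

```python
text = "永a"

# Character offset: Python's str.find counts code points.
char_offset = text.find("a")

# Byte offset: searching the UTF-8 encoding counts bytes instead.
# '永' (U+6C38) occupies 3 bytes in UTF-8, so 'a' starts at byte 3.
byte_offset = text.encode("utf-8").find(b"a")

print(char_offset, byte_offset)  # prints: 1 3
```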

The fix replaces the pc.find_substring call with an elementwise application of Python's str.find, which correctly returns character offsets. This matches the approach already used by _str_rfind in ArrowExtensionArray. As noted by @rhshadrach in the issue, there is no pyarrow.compute function that returns character offsets for this operation.

Test added: test_find_multibyte_chars covers 1-byte (ASCII), 2-byte (Á), 3-byte (永), and 4-byte (🐍) UTF-8 characters across all string dtypes.
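
To make the four width classes concrete, the sketch below computes, for one sample character per class, its UTF-8 byte length and both offsets of a trailing 'a'. The sample list and variable names are illustrative and are not taken from the PR's test.

```python
# One sample character per UTF-8 width class named in the test description.
samples = ["A", "\u00c1", "\u6c38", "\U0001f40d"]  # A, Á, 永, 🐍

rows = []
for ch in samples:
    text = ch + "a"
    width = len(ch.encode("utf-8"))             # bytes used by the first character
    char_off = text.find("a")                   # character offset of 'a'
    byte_off = text.encode("utf-8").find(b"a")  # byte offset of 'a'
    rows.append((width, char_off, byte_off))

for row in rows:
    print(row)
# (1, 1, 1)
# (2, 1, 2)
# (3, 1, 3)
# (4, 1, 4)
```

The character offset is always 1, while the byte offset equals the width of the leading character, so the two only agree in the 1-byte (ASCII) case.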

…ti-byte UTF-8 chars (pandas-dev#64123) pc.find_substring returns byte positions rather than character positions for multi-byte UTF-8 encoded strings. This causes Series.str.find() to return incorrect results when using Arrow-backed StringDtype, e.g. find('a') in '永a' returns 3 instead of 1. Fix by falling back to elementwise Python str.find(), which correctly returns character offsets. This matches the approach already used by _str_rfind in ArrowExtensionArray.

jorisvandenbossche

Looks good, thanks for the PR!

One thing I am wondering is how much faster the pyarrow method would be than the Python fallback in the ASCII-only case, and how that compares to the cost of checking whether all elements are ASCII. If the difference is big enough, it might still be worth doing a pc.string_is_ascii(..).all() check first.

Mr-Neutr0n · 2026-02-13T18:42:17Z

Good point. For ASCII-only strings the pyarrow path would definitely be faster since it avoids the Python object overhead. A hybrid approach that uses pc.string_is_ascii to fast-path ASCII-only data while falling back to elementwise for mixed content could be worth it. Happy to add that if you think the tradeoff makes sense here, though I'd guess most real-world Series have at least some non-ASCII rows so the fallback would trigger often anyway.

jorisvandenbossche · 2026-02-16T10:32:28Z

> I'd guess most real-world Series have at least some non-ASCII rows so the fallback would trigger often anyway.

For sure there are lots of non-ASCII cases, but personally my guess is that there is also still a significant use case of simple pure ASCII strings, making it worth optimizing this (str.find is then maybe not the most important method to have optimized, though).

There seem to be some relevant CI test failures:

FAILED pandas/tests/extension/test_arrow.py::test_str_find[ab-0-None-exp0-exp_type0] - AssertionError: Attributes of Series are different

Attribute "dtype" are different
[left]:  int64[pyarrow]
[right]: int32[pyarrow]
FAILED pandas/tests/strings/test_strings.py::test_empty_str_methods[string=str[pyarrow]] - AssertionError: Attributes of Series are different

Attribute "dtype" are different
[left]:  int64
[right]: object

jorisvandenbossche · 2026-02-16T10:35:16Z

pandas/core/arrays/_arrow_string_mixins.py

+        # character offsets for multi-byte UTF-8 characters, so we fall back
+        # to Python str.find which correctly returns character offsets.
+        res_list = self._apply_elementwise(lambda val: val.find(sub, start, end))
+        return self._convert_int_result(pa.chunked_array(res_list))


For an empty array, res_list is no longer guaranteed to be of int type. I suppose that can be fixed inside the _convert_int_result method to ensure the result is always cast to int.

jorisvandenbossche · 2026-02-16T10:37:04Z

For the pandas/tests/extension/test_arrow.py::test_str_find[ab-0-None-exp0-exp_type0] failure, it might be that we actually want to update the test, because it is not entirely clear to me why that one case should return int32 instead of int64.

rhshadrach · 2026-02-16T22:07:08Z

> Happy to add that if you think the tradeoff makes sense here, though I'd guess most real-world Series have at least some non-ASCII rows so the fallback would trigger often anyway.

In my experience, ASCII-only is the common case. The performance benefit is substantial:

import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc

def always(self, sub):
    # Always use the elementwise Python str.find fallback.
    start, end = 0, None
    res_list = self._apply_elementwise(lambda val: val.find(sub, start, end))
    return self._convert_int_result(pa.chunked_array(res_list))

def sometimes(self, sub):
    # Call _str_find when every element is ASCII; otherwise fall back.
    start, end = 0, None
    # .as_py() turns the pyarrow scalar into a plain Python bool.
    if pc.all(pc.string_is_ascii(self._pa_array)).as_py():
        return self._str_find(sub, start, end)
    else:
        res_list = self._apply_elementwise(lambda val: val.find(sub, start, end))
        return self._convert_int_result(pa.chunked_array(res_list))

# 100,000 ASCII-only strings.
arr = pd.array([str(e) for e in range(100_000)], dtype="str")

%timeit always(arr, "5")
# 13.1 ms ± 75.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit sometimes(arr, "5")
# 681 μs ± 4.25 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

I think we should check for ASCII-only first.
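
Why the ASCII check makes the fast path safe can be sketched in pure Python. This is only a model of the dispatch discussed above, not the pandas implementation: `find_char_offsets` is a hypothetical helper, `str.isascii` stands in for `pc.string_is_ascii`, and a search over the UTF-8 bytes stands in for `pc.find_substring`.

```python
def find_char_offsets(values, sub):
    """Return the character offset of `sub` in each string, or -1 if absent."""
    if all(v.isascii() for v in values):
        # Fast-path stand-in: every character is one byte, so byte offsets
        # and character offsets coincide.
        needle = sub.encode("utf-8")
        return [v.encode("utf-8").find(needle) for v in values]
    # Fallback: str.find always reports character offsets.
    return [v.find(sub) for v in values]


print(find_char_offsets(["abc", "xa"], "a"))   # [0, 1]
print(find_char_offsets(["abc", "永a"], "a"))  # [0, 1]
```

A single non-ASCII element sends the whole input down the fallback, which matches the all-or-nothing check in `sometimes` above.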

Mr-Neutr0n mentioned this pull request Feb 13, 2026

BUG: Different result from str.find depending on dtype #64123

Open


jorisvandenbossche reviewed Feb 13, 2026


jorisvandenbossche added the Bug, Strings (String extension data type and string data), and Arrow (pyarrow functionality) labels Feb 13, 2026

jorisvandenbossche added this to the 3.0.1 milestone Feb 13, 2026

jorisvandenbossche reviewed Feb 16, 2026


jorisvandenbossche modified the milestones: 3.0.1, 3.0.2 Feb 17, 2026

Mr-Neutr0n wants to merge 1 commit into pandas-dev:main from Mr-Neutr0n:fix-str-find-byte-offset
