BUG: str.find returns byte offset instead of character offset with str dtype #64133

Open
Mr-Neutr0n wants to merge 1 commit into pandas-dev:main from Mr-Neutr0n:fix-str-find-byte-offset

@Mr-Neutr0n

Fixes #64123

pc.find_substring returns byte positions rather than character positions for multi-byte UTF-8 encoded strings. This causes Series.str.find() to return incorrect results when using Arrow-backed StringDtype.

For example, find('a') in '永a' returns 3 (the byte offset of 'a', because '永' occupies three bytes in UTF-8) instead of the expected 1 (the character offset).
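
The difference between the two offsets can be reproduced with plain Python, independent of pandas or pyarrow. This is only an illustration of the two counting schemes, not the code path that pandas takes:

```python
# Character offset vs. byte offset for the example in the description.
text = "\u6c38a"  # '永a': one 3-byte character followed by ASCII 'a'

# str.find counts code points, so 'a' is at position 1.
char_offset = text.find("a")

# Searching the UTF-8 bytes counts bytes, so 'a' is at position 3.
byte_offset = text.encode("utf-8").find(b"a")

print(char_offset, byte_offset)
```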

The fix replaces the pc.find_substring call with an elementwise application of Python's str.find, which correctly returns character offsets. This matches the approach already used by _str_rfind in ArrowExtensionArray. As noted by @rhshadrach in the issue, there is no pyarrow.compute function that returns character offsets for this operation.
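
As a rough sketch of what "elementwise application of Python's str.find" means, the following uses a plain list in place of an Arrow array. The helper name `find_elementwise` is hypothetical and is not the pandas internal; the assumption that missing values propagate as None is made only for this illustration:

```python
def find_elementwise(values, sub, start=0, end=None):
    """Apply str.find to each element, passing missing values through.

    Hypothetical stand-in for an elementwise fallback; `values` is a plain
    list of str or None rather than an Arrow array.
    """
    return [None if v is None else v.find(sub, start, end) for v in values]


# Character offsets are returned regardless of how many bytes each
# preceding character needs in UTF-8; -1 means "not found".
result = find_elementwise(["\u6c38a", "abc", None, "xyz"], "a")
print(result)
```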

Test added: test_find_multibyte_chars covers 1-byte (ASCII), 2-byte (Á), 3-byte (永), and 4-byte (🐍) UTF-8 characters across all string dtypes.
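
The four character widths mentioned for the test can be checked with a short standalone script. This is not the test added in the PR, only a sketch of why each width matters: the byte offset of a trailing 'a' equals the UTF-8 width of the character before it, while the character offset is always 1. The ASCII prefix 'x' is an arbitrary choice for the 1-byte case:

```python
# One sample character per UTF-8 width (1 to 4 bytes).
samples = {
    "x": 1,           # ASCII
    "\u00c1": 2,      # 'Á'
    "\u6c38": 3,      # '永'
    "\U0001F40D": 4,  # snake emoji
}

observed = {}
for ch, width in samples.items():
    text = ch + "a"
    char_offset = text.find("a")                   # counts code points
    byte_offset = text.encode("utf-8").find(b"a")  # counts bytes
    observed[ch] = (char_offset, byte_offset, width)

for ch, (char_offset, byte_offset, width) in observed.items():
    print(repr(ch), char_offset, byte_offset, width)
```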

…ti-byte UTF-8 chars (pandas-dev#64123)

pc.find_substring returns byte positions rather than character positions
for multi-byte UTF-8 encoded strings. This causes Series.str.find() to
return incorrect results when using Arrow-backed StringDtype, e.g.
find('a') in '永a' returns 3 instead of 1.

Fix by falling back to elementwise Python str.find(), which correctly
returns character offsets. This matches the approach already used by
_str_rfind in ArrowExtensionArray.
Member

@jorisvandenbossche jorisvandenbossche left a comment

Looks good, thanks for the PR!

One thing I am wondering is how much faster the pyarrow method would be than the Python fallback for ASCII-only data, once the cost of checking whether all elements are ASCII is included. If the difference is big enough, it might still be worth doing a pc.string_is_ascii(..).all() check first.

@jorisvandenbossche added the Bug, Strings (String extension data type and string data), and Arrow (pyarrow functionality) labels on Feb 13, 2026
@jorisvandenbossche added this to the 3.0.1 milestone on Feb 13, 2026
@Mr-Neutr0n
Author

Good point. For ASCII-only strings the pyarrow path would definitely be faster, since it avoids the Python object overhead. A hybrid approach that uses pc.string_is_ascii to fast-path ASCII-only data and falls back to elementwise for mixed content could be worth it. Happy to add that if you think the tradeoff makes sense here, though I'd guess most real-world Series have at least some non-ASCII rows so the fallback would trigger often anyway.

@jorisvandenbossche
Member

> I'd guess most real-world Series have at least some non-ASCII rows so the fallback would trigger often anyway.

For sure there are lots of non-ASCII cases, but my personal guess is that there is also still a significant use case of simple pure-ASCII strings, making it worth optimizing this (str.find is then maybe not the most important method to have optimized, though).


There seem to be some relevant CI test failures:

FAILED pandas/tests/extension/test_arrow.py::test_str_find[ab-0-None-exp0-exp_type0] - AssertionError: Attributes of Series are different

Attribute "dtype" are different
[left]:  int64[pyarrow]
[right]: int32[pyarrow]
FAILED pandas/tests/strings/test_strings.py::test_empty_str_methods[string=str[pyarrow]] - AssertionError: Attributes of Series are different

Attribute "dtype" are different
[left]:  int64
[right]: object

# character offsets for multi-byte UTF-8 characters, so we fall back
# to Python str.find which correctly returns character offsets.
res_list = self._apply_elementwise(lambda val: val.find(sub, start, end))
return self._convert_int_result(pa.chunked_array(res_list))
Member

The res_list is now no longer guaranteed to be of int type, in case of an empty array. I suppose that can be fixed inside the _convert_int_result method to ensure the result is always cast to int.

@jorisvandenbossche
Member

For the pandas/tests/extension/test_arrow.py::test_str_find[ab-0-None-exp0-exp_type0] failure, it might be we actually want to update the test, because it is not entirely clear to me why that one case should return int32 instead of int64

@rhshadrach
Member

rhshadrach commented Feb 16, 2026

> Happy to add that if you think the tradeoff makes sense here, though I'd guess most real-world Series have at least some non-ASCII rows so the fallback would trigger often anyway.

In my experience, ASCII-only is the common case. The performance benefit is substantial:

import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc

def always(self, sub):
    # Always use the elementwise Python fallback.
    start, end = 0, None
    res_list = self._apply_elementwise(lambda val: val.find(sub, start, end))
    return self._convert_int_result(pa.chunked_array(res_list))

def sometimes(self, sub):
    # Use the existing _str_find only when every element is ASCII.
    start, end = 0, None
    # .as_py() converts the pyarrow scalar to a plain Python bool (or None).
    if pc.all(pc.string_is_ascii(self._pa_array)).as_py():
        return self._str_find(sub, start, end)
    else:
        res_list = self._apply_elementwise(lambda val: val.find(sub, start, end))
        return self._convert_int_result(pa.chunked_array(res_list))

arr = pd.array([str(e) for e in range(100_000)], dtype="str")

%timeit always(arr, "5")
# 13.1 ms ± 75.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit sometimes(arr, "5")
# 681 μs ± 4.25 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

I think we should check for ASCII-only first.
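
The reason an ASCII check makes the byte-offset path safe can be sketched in plain Python: when every string is ASCII, each character is exactly one byte, so byte offsets and character offsets coincide. The function `find_hybrid` below is a hypothetical model of that dispatch on plain lists; it says nothing about performance and is not the pandas implementation:

```python
def find_hybrid(values, sub, start=0, end=None):
    """Model of an 'ASCII check first' dispatch (hypothetical helper).

    Fast path: search the UTF-8 bytes, which is valid only because every
    value is ASCII. Otherwise: per-element str.find on the decoded text.
    """
    if all(v.isascii() for v in values):
        needle = sub.encode("utf-8")
        # Byte offsets equal character offsets for ASCII-only data.
        return [v.encode("utf-8").find(needle, start, end) for v in values]
    # Mixed content: fall back to character-based search.
    return [v.find(sub, start, end) for v in values]


ascii_only = find_hybrid(["abc", "xya", "zzz"], "a")
mixed = find_hybrid(["abc", "\u6c38a"], "a")
print(ascii_only, mixed)
```

Both branches return the same values for ASCII-only input, which is the property that would let a vectorized byte-based search stand in for the fallback in that case.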

@jorisvandenbossche modified the milestones: 3.0.1, 3.0.2 on Feb 17, 2026
Development

Successfully merging this pull request may close these issues.

BUG: Different result from str.find depending on dtype

3 participants