Skip to content

Incorrect handling of negative start values on PyUnicodeErrorObject #123378

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
picnixz opened this issue Aug 27, 2024 · 20 comments
Closed

Incorrect handling of negative start values on PyUnicodeErrorObject #123378

picnixz opened this issue Aug 27, 2024 · 20 comments
Assignees
Labels
3.14 bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-C-API type-bug An unexpected behavior, bug, or error type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@picnixz
Copy link
Member

picnixz commented Aug 27, 2024

Bug report

Bug description:

Found when implementing #123343. We have:

int
PyUnicodeEncodeError_GetStart(PyObject *exc, Py_ssize_t *start)
{
    Py_ssize_t size;
    PyObject *obj = get_unicode(((PyUnicodeErrorObject *)exc)->object,
                                "object");
    if (!obj)
        return -1;
    *start = ((PyUnicodeErrorObject *)exc)->start;
    size = PyUnicode_GET_LENGTH(obj);
    if (*start<0)
        *start = 0; /*XXX check for values <0*/
    if (*start>=size)
        *start = size-1;
    Py_DECREF(obj);
    return 0;
}

The line *start = size-1 might set start to -1 when start = 0, in which case this leads to assertion failures when the index is used normally.

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux

Linked PRs

@picnixz picnixz added type-bug An unexpected behavior, bug, or error topic-C-API labels Aug 27, 2024
@picnixz picnixz self-assigned this Aug 27, 2024
@picnixz picnixz changed the title Incorrect handling of PyUnicode{Decode,Encode}Error_GetStart when start = Incorrect handling of PyUnicode{Decode,Encode}Error_GetStart when start = 0 Aug 27, 2024
@picnixz picnixz changed the title Incorrect handling of PyUnicode{Decode,Encode}Error_GetStart when start = 0 Incorrect handling of PyUnicode{Decode,Encode}Error_GetStart when start = size = 0 Aug 27, 2024
@python python deleted a comment Aug 27, 2024
@serhiy-storchaka
Copy link
Member

This clipping for start and end is weird. It was here from the beginning, from adding support for codec error handling callbacks in 3aeb632 (gh-34615).

@doerwalter, сan you shed some light on this? I guess that the case with size == 0 was not considered, as it does not make much sense. But why not use range 0..size for both start and end?

cc @malemburg

@picnixz
Copy link
Member Author

picnixz commented Aug 27, 2024

This clipping for start and end is weird.

Yes, but I didn't want to change the way it was done. With start = size = 0, we should (at least) not set size = -1... I would have expected the clipping to be usable with self.object[self.start:self.end]. (I observed the weirdness in the tests...)

@malemburg
Copy link
Member

I'm not too familiar with this code, but from looking at it, it seems that the if-statements should be swapped:

    if (*start>=size)
        *start = size-1;
    if (*start<0)
        *start = 0; /*XXX check for values <0*/

It doesn't make sense to set *start to a negative value.

@picnixz
Copy link
Member Author

picnixz commented Aug 27, 2024

I'll update the PR accordingly. Should I also reject calls that set Unicode*Error.start to negative values? currently we have

int
PyUnicodeDecodeError_SetStart(PyObject *exc, Py_ssize_t start)
{
    ((PyUnicodeErrorObject *)exc)->start = start;
    return 0;
}

so the call unconditionally succeeds. I wanted to return -1 if start is < 0 but I'm not sure if it could break existing code. There is no mention of start or end to be >= 0 (though, setting it to -1 and then retrieve it using GetStart would currently return 0...)

UnicodeError has attributes that describe the encoding or decoding error. For example, err.object[err.start:err.end] gives the particular invalid input that the codec failed on.

How should I proceed?

@serhiy-storchaka
Copy link
Member

It would be a double work if you leave it in the getter. We cannot use the check start < size in the setter, because object can be set after start, so the check in the setter can only be one-side. Splitting checks between the getter and the setter is not good.

@picnixz
Copy link
Member Author

picnixz commented Aug 28, 2024

We cannot use the check start < size in the setter

I wasn't thinking of doing this check. What I thought about was doing the check start < 0 and raise an exception if this is not the case. Because in the getter, we are not interpreting negative start indices as positive ones from the end (we just clamp them to 0).

So, I have various ideas:

  • Idea 1: Leave the setter alone, and use what Marc-Andre suggested.
  • Idea 2: Add a start < 0 check in the setter and use what M-A. suggested.
  • Idea 3: Leave the setter alone, interpret negative start values as in Python.

Idea 3 would be a feature, though I don't think we should do it. However, we should document the C API to indicate that start should not be negative. We can enforce this condition by using idea 2. If you don't want to enforce the condition and silently clamp to 0, then I'd go for idea 1.

@doerwalter
Copy link
Contributor

doerwalter commented Aug 28, 2024

That code is nearly 22 year old, so I can no longer remember what my intention back then was. But the code does not treat negative values as being relative to the end, but clips them to valid positive offset positions.

So for an empty string, the only reasonable start value is 0. start should never point outside of the string.

The simplest solution would probably be to fix the getter to never return -1.

(Fixing the setter would be possible, but since there's no PyUnicodeDecodeError_SetObject it isn't clear how replacing the object should be handled).

@picnixz
Copy link
Member Author

picnixz commented Aug 28, 2024

By the way, I found this:

>>> str(UnicodeEncodeError('utf-8', '', -1, 0, ''))
Fatal Python error: _Py_CheckFunctionResult: a function returned a result with an exception set
Python runtime state: initialized
IndexError: string index out of range

The above exception was the direct cause of the following exception:

SystemError: <class 'str'> returned a result with an exception set

Current thread 0x00007f6e8678e740 (most recent call first):
  File "<python-input-3>", line 1 in <module>
  File "/lib/python/cpython/Lib/code.py", line 91 in runcode
  File "/lib/python/cpython/Lib/_pyrepl/console.py", line 205 in runsource
  File "/lib/python/cpython/Lib/code.py", line 312 in push
  File "/lib/python/cpython/Lib/_pyrepl/simple_interact.py", line 157 in run_multiline_interactive_console
  File "/lib/python/cpython/Lib/_pyrepl/main.py", line 59 in interactive_console
  File "/lib/python/cpython/Lib/_pyrepl/__main__.py", line 6 in <module>
  File "/lib/python/cpython/Lib/runpy.py", line 88 in _run_code
  File "/lib/python/cpython/Lib/runpy.py", line 198 in _run_module_as_main
Aborted (core dumped)

The start index should definitely be non-negative since it's being used without checks. We need to fix the setter by disallowing start to be negative in PyUnicode*Error_SetStart.

EDIT: we need to fix the constructor, not the setter actually in this case.

@doerwalter
Copy link
Contributor

Not neccesssarily. We might just have to update UnicodeEncodeError_str to use the getters instead of accessing the start directly.

But fixing the setters might be the safer option.

@picnixz
Copy link
Member Author

picnixz commented Aug 28, 2024

Ah yes, you could also update to use the getters, but I think it's safer to just check that users don't pass a negative value to the setter as well (sorry for panicking)

@picnixz picnixz changed the title Incorrect handling of PyUnicode{Decode,Encode}Error_GetStart when start = size = 0 Incorrect handling negative start values on PyUnicodeError Aug 28, 2024
@picnixz picnixz changed the title Incorrect handling negative start values on PyUnicodeError Incorrect handling of negative start values on PyUnicodeError Aug 28, 2024
@picnixz picnixz changed the title Incorrect handling of negative start values on PyUnicodeError Incorrect handling of negative start values on PyUnicodeErrorObject Aug 28, 2024
@picnixz
Copy link
Member Author

picnixz commented Aug 28, 2024

So I've made that the constructor of UnicodeErrors do not accept negative start values (for the end values, I've left them out for now; I don't know if you want to enforce it in the same PR). I've added tests for that. I've also added the check inside the setter for negative start values (again, I've left out the end case for now).

I don't think user should directly access the start field without using the getter. If they do, they should be ensure that their start is within the expected bounds. I can also change the way the str function queries the start value if you want.

@encukou
Copy link
Member

encukou commented Sep 23, 2024

While people are looking into this:
I propose changing the validation for end: instead of [1, len(object)], allow [clamp(start), len(object)] (where clamp(start) is the adjusted/validated start value).

  • end=0 should be valid if start=0, just like any other case of start==end
  • “negative” ranges (end<start) are nearly always unexpected, and likely to result in untested code paths if not outright bugs.

An alternative would be disallowing “empty” ranges (start==end), but they might be useful so I'd prefer to keep backwards-comptible behavior there.

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 8, 2024
(cherry picked from commit ba14dfa)

Co-authored-by: Bénédikt Tran <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 8, 2024
(cherry picked from commit ba14dfa)

Co-authored-by: Bénédikt Tran <[email protected]>
@vstinner
Copy link
Member

vstinner commented Oct 8, 2024

UnicodeError.__str__() bug was fixed by change ba14dfa.

vstinner pushed a commit that referenced this issue Oct 8, 2024
…125098)

gh-123378: fix a crash in `UnicodeError.__str__` (GH-124935)
(cherry picked from commit ba14dfa)

Co-authored-by: Bénédikt Tran <[email protected]>
vstinner pushed a commit that referenced this issue Oct 8, 2024
…125099)

gh-123378: fix a crash in `UnicodeError.__str__` (GH-124935)
(cherry picked from commit ba14dfa)

Co-authored-by: Bénédikt Tran <[email protected]>
efimov-mikhail pushed a commit to efimov-mikhail/cpython that referenced this issue Oct 9, 2024
@encukou
Copy link
Member

encukou commented Oct 10, 2024

Victor/Serhiy, do you think the accessors should change?

@vstinner
Copy link
Member

UnicodeError.str() bug was fixed by change ba14dfa.

After this change, is it still possible to crash Python?

@picnixz
Copy link
Member Author

picnixz commented Oct 14, 2024

Err... I don't know, I don't think so but I didn't check whether we can still make it crash the commented PyCodecs API test (I'm no more on my dev env for today so I cannot check).

@vstinner
Copy link
Member

If it's not possible to crash Python, I don't think that it's needed to change the setter/getter.

@picnixz
Copy link
Member Author

picnixz commented Oct 22, 2024

I'm still able to make the interpreter crash as follows:

./python -c "import codecs; codecs.xmlcharrefreplace_errors(UnicodeEncodeError('bad', '', 0, 1, 'reason'))"
Checked 112 modules (34 built-in, 77 shared, 1 n/a on linux-x86_64, 0 disabled, 0 missing, 0 failed on import)
python: ./Include/cpython/unicodeobject.h:339: PyUnicode_READ_CHAR: Assertion `index >= 0' failed.
Aborted (core dumped)

It appears that the handler should be fixed to not blindly use the getters but maybe the choice of the getters to set start/end to some value in some corner cases should also be re-thought (namely if (*start>=size) *start = size-1; will make the getter return start = -1 and then the xmlcharrefreplace_errors crashes because of this).

@picnixz
Copy link
Member Author

picnixz commented Oct 27, 2024

Ok, I've managed to crash or raise a SystemError other handlers. We should definitely do something:

./python -c "import codecs; codecs.backslashreplace_errors(UnicodeDecodeError('utf-8', b'00000', 9, 2, 'reason'))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import codecs; codecs.backslashreplace_errors(UnicodeDecodeError('utf-8', b'00000', 9, 2, 'reason'))
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SystemError: Negative size passed to PyUnicode_New
./python -c "import codecs; codecs.replace_errors(UnicodeTranslateError('000', 1, -7, 'reason'))"
python: Python/codecs.c:743: PyCodec_ReplaceErrors: Assertion `PyUnicode_KIND(res) == PyUnicode_2BYTE_KIND' failed.
Aborted (core dumped)

I have some suggestions:

  • we fix the handlers so that they only consider positive indices and return garbage otherwise (which could be a breaking change)
  • same as above but handlers raise exceptions instead of returning garbage

Note that just fixing the negative values does not seem to patch the above issues. It does patch the following however:

./python -c "import codecs; codecs.xmlcharrefreplace_errors(UnicodeEncodeError('bad', '', 0, 1, 'reason'))"

@picnixz picnixz added type-crash A hard crash of the interpreter, possibly with a core dump interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Oct 27, 2024
encukou pushed a commit that referenced this issue Nov 1, 2024
encukou pushed a commit that referenced this issue Dec 4, 2024
@picnixz
Copy link
Member Author

picnixz commented Dec 6, 2024

We decided NOT to backport this one even though it's a bug fix. The reason is that it could annoy users (although this would annoy them only in the case of an empty message) but to mitigate breakage, we'll just leave 3.12/3.13 broken and only fix 3.14 and later (see the PR post-merge discussion for details).

In particular, fixes for codec handlers will only be 3.14+ as well (note that we did not have an issue report for the past 20+ years that this code existed so I don't think users will really see a change).

@picnixz picnixz closed this as completed Dec 6, 2024
@picnixz picnixz added the 3.14 bugs and security fixes label Dec 6, 2024
picnixz added a commit to picnixz/cpython that referenced this issue Dec 8, 2024
srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this issue Jan 8, 2025
ebonnal pushed a commit to ebonnal/cpython that referenced this issue Jan 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.14 bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-C-API type-bug An unexpected behavior, bug, or error type-crash A hard crash of the interpreter, possibly with a core dump
Projects
Development

No branches or pull requests

6 participants