Incorrect handling of negative start values on PyUnicodeErrorObject #123378
Comments
This clipping for @doerwalter, can you shed some light on this? I guess that the case with size == 0 was not considered, as it does not make much sense. But why not use range cc @malemburg |
Yes, but I didn't want to change the way it was done. With start = size = 0, we should (at least) not set size = -1... I would have expected the clipping to be usable with |
I'm not too familiar with this code, but from looking at it, it seems that the if-statements should be swapped:
It doesn't make sense to set *start to a negative value. |
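The effect of swapping the two checks can be modelled in a few lines of Python. This is only a sketch: it assumes the getter clamps with two consecutive if-statements (one for negative values, one for values at or past the end), and both function names are made up for illustration.

```python
def clamp_start_current(start, size):
    """Model of the check order described in this issue (hypothetical name)."""
    if start < 0:
        start = 0
    if start >= size:
        start = size - 1  # becomes -1 when size == 0
    return start


def clamp_start_swapped(start, size):
    """Same two checks in the opposite order (hypothetical name)."""
    if start >= size:
        start = size - 1
    if start < 0:
        start = 0  # runs last, so the result is never negative
    return start


print(clamp_start_current(0, 0))  # -1
print(clamp_start_swapped(0, 0))  # 0
```

For a non-empty object both orders give the same result; they only differ when size is 0.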
I'll update the PR accordingly. Should I also reject calls that set
int
PyUnicodeDecodeError_SetStart(PyObject *exc, Py_ssize_t start)
{
    ((PyUnicodeErrorObject *)exc)->start = start;
    return 0;
}
so the call unconditionally succeeds. I wanted to return -1 if start is < 0, but I'm not sure if it could break existing code. There is no mention of start or end having to be >= 0 (though setting it to -1 and then retrieving it using GetStart would currently return 0...)
How should I proceed? |
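To make the trade-off concrete, here is a small Python model of the two setter behaviours being weighed. It is a sketch under assumptions: UnicodeErrorModel and its methods are hypothetical stand-ins, not CPython APIs, and a strict C setter would presumably signal failure by returning -1 with an exception set rather than by raising directly.

```python
class UnicodeErrorModel:
    """Hypothetical stand-in for PyUnicodeErrorObject; not a CPython API."""

    def __init__(self, obj, start, end):
        self.obj = obj
        self.start = start
        self.end = end

    def set_start_unchecked(self, start):
        # Mirrors the setter quoted above: stores the value and always succeeds.
        self.start = start

    def set_start_strict(self, start):
        # Assumed alternative: reject negative values before storing anything.
        if start < 0:
            raise ValueError("start must be non-negative")
        self.start = start


exc = UnicodeErrorModel("abc", 0, 1)
exc.set_start_unchecked(-1)  # accepted silently; a later reader must cope with -1
try:
    exc.set_start_strict(-1)
except ValueError as err:
    print("rejected:", err)
```

The strict variant is the one that could break existing callers, which is the compatibility concern raised in this comment.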
It would be a double work if you leave it in the getter. We cannot use the check |
I wasn't thinking of doing this check. What I thought about was doing the check So, I have various ideas:
Idea 3 would be a feature, though I don't think we should do it. However, we should document the C API to indicate that start should not be negative. We can enforce this condition by using idea 2. If you don't want to enforce the condition and silently clamp to 0, then I'd go for idea 1. |
That code is nearly 22 years old, so I can no longer remember what my intention back then was. But the code does not treat negative values as being relative to the end, but clips them to valid positive offset positions. So for an empty string, the only reasonable The simplest solution would probably be to fix the getter to never return -1. (Fixing the setter would be possible, but since there's no |
By the way, I found this:
>>> str(UnicodeEncodeError('utf-8', '', -1, 0, ''))
Fatal Python error: _Py_CheckFunctionResult: a function returned a result with an exception set
Python runtime state: initialized
IndexError: string index out of range
The above exception was the direct cause of the following exception:
SystemError: <class 'str'> returned a result with an exception set
Current thread 0x00007f6e8678e740 (most recent call first):
File "<python-input-3>", line 1 in <module>
File "/lib/python/cpython/Lib/code.py", line 91 in runcode
File "/lib/python/cpython/Lib/_pyrepl/console.py", line 205 in runsource
File "/lib/python/cpython/Lib/code.py", line 312 in push
File "/lib/python/cpython/Lib/_pyrepl/simple_interact.py", line 157 in run_multiline_interactive_console
File "/lib/python/cpython/Lib/_pyrepl/main.py", line 59 in interactive_console
File "/lib/python/cpython/Lib/_pyrepl/__main__.py", line 6 in <module>
File "/lib/python/cpython/Lib/runpy.py", line 88 in _run_code
File "/lib/python/cpython/Lib/runpy.py", line 198 in _run_module_as_main
Aborted (core dumped)
The EDIT: we need to fix the constructor, not the setter actually in this case. |
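One way to picture how the IndexError above can arise is the following Python model. It is a sketch, not a quote of the C source: it assumes the string conversion reads the single character at start whenever end == start + 1 and start is below the length, and every name in it is hypothetical.

```python
def read_char(text, index):
    """Bounds-checked read; unlike Python indexing, negatives never wrap."""
    if not 0 <= index < len(text):
        raise IndexError("string index out of range")
    return text[index]


def describe_unguarded(text, start, end):
    # Only the upper bound of start is checked before the read.
    if start < len(text) and end == start + 1:
        ch = read_char(text, start)
        return f"can't encode character {ch!r} in position {start}"
    return f"can't encode characters in position {start}-{end - 1}"


def describe_guarded(text, start, end):
    # Additionally require start >= 0 before the single-character branch.
    if 0 <= start < len(text) and end == start + 1:
        ch = read_char(text, start)
        return f"can't encode character {ch!r} in position {start}"
    return f"can't encode characters in position {start}-{end - 1}"


try:
    describe_unguarded("", -1, 0)
except IndexError as err:
    print("unguarded:", err)
print("guarded:", describe_guarded("", -1, 0))
```

With start = -1 and end = 0 the condition end == start + 1 holds, so the unguarded model attempts the out-of-range read. Rejecting negative start values in the constructor avoids ever reaching that state.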
Not necessarily. We might just have to update But fixing the setters might be the safer option. |
Ah yes, you could also update to use the getters, but I think it's safer to just check that users don't pass a negative value to the setter as well (sorry for panicking) |
So I've made it so that the constructors of UnicodeErrors do not accept negative start values (I've left the end values out for now; I don't know if you want to enforce that in the same PR). I've added tests for that. I've also added the check inside the setter for negative start values (again, I've left out the end case for now). I don't think users should directly access the |
While people are looking into this:
An alternative would be disallowing “empty” ranges ( |
…125098) gh-123378: fix a crash in `UnicodeError.__str__` (GH-124935) (cherry picked from commit ba14dfa) Co-authored-by: Bénédikt Tran <[email protected]>
…125099) gh-123378: fix a crash in `UnicodeError.__str__` (GH-124935) (cherry picked from commit ba14dfa) Co-authored-by: Bénédikt Tran <[email protected]>
Victor/Serhiy, do you think the accessors should change? |
After this change, is it still possible to crash Python? |
Err... I don't know. I don't think so, but I didn't check whether we can still make it crash with the commented PyCodecs API test (I'm no longer on my dev env today, so I cannot check). |
If it's not possible to crash Python, I don't think that it's needed to change the setter/getter. |
I'm still able to make the interpreter crash as follows:
./python -c "import codecs; codecs.xmlcharrefreplace_errors(UnicodeEncodeError('bad', '', 0, 1, 'reason'))"
Checked 112 modules (34 built-in, 77 shared, 1 n/a on linux-x86_64, 0 disabled, 0 missing, 0 failed on import)
python: ./Include/cpython/unicodeobject.h:339: PyUnicode_READ_CHAR: Assertion `index >= 0' failed.
Aborted (core dumped)
It appears that the handler should be fixed to not blindly use the getters, but maybe the choice of the getters to set start/end to some value in some corner cases should also be re-thought (namely |
Ok, I've managed to crash or raise a SystemError in other handlers. We should definitely do something:
./python -c "import codecs; codecs.backslashreplace_errors(UnicodeDecodeError('utf-8', b'00000', 9, 2, 'reason'))"
Traceback (most recent call last):
File "<string>", line 1, in <module>
import codecs; codecs.backslashreplace_errors(UnicodeDecodeError('utf-8', b'00000', 9, 2, 'reason'))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SystemError: Negative size passed to PyUnicode_New
./python -c "import codecs; codecs.replace_errors(UnicodeTranslateError('000', 1, -7, 'reason'))"
python: Python/codecs.c:743: PyCodec_ReplaceErrors: Assertion `PyUnicode_KIND(res) == PyUnicode_2BYTE_KIND' failed.
Aborted (core dumped)
I have some suggestions:
Note that just fixing the negative values does not seem to patch the above issues. It does patch the following however:
./python -c "import codecs; codecs.xmlcharrefreplace_errors(UnicodeEncodeError('bad', '', 0, 1, 'reason'))" |
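The reproducers above share one property: the raw (start, end) pair does not describe a valid slice of the object. The following Python sketch shows the raw lengths a handler would compute and one hypothetical way to normalise the pair. The clamp_range helper is an illustration only and is not claimed to be the fix adopted in CPython.

```python
def clamp_range(start, end, size):
    """Hypothetical normalisation: force 0 <= start <= end <= size."""
    start = min(max(start, 0), size)
    end = min(max(end, start), size)
    return start, end


# (start, end, len(object)) taken from the three reproducers in this comment
# and the previous one.
cases = [
    (9, 2, 5),   # UnicodeDecodeError('utf-8', b'00000', 9, 2, 'reason')
    (1, -7, 3),  # UnicodeTranslateError('000', 1, -7, 'reason')
    (0, 1, 0),   # UnicodeEncodeError('bad', '', 0, 1, 'reason')
]

for start, end, size in cases:
    raw_length = end - start  # what a handler computes from unvalidated values
    clamped = clamp_range(start, end, size)
    print((start, end, size), "raw length:", raw_length, "clamped:", clamped)
```

The first two cases give a negative raw length, which matches the "Negative size" error, and the third asks for one character from an empty string. After clamping, all three describe an empty slice inside the object.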
…mped (GH-123380) Co-authored-by: Sergey B Kirpichev <[email protected]>
We decided NOT to backport this one even though it's a bug fix. The reason is that it could annoy users (although only in the case of an empty message). To mitigate breakage, we'll just leave 3.12/3.13 broken and only fix 3.14 and later (see the PR post-merge discussion for details). In particular, fixes for codec handlers will only be 3.14+ as well (note that we have not had an issue report in the 20+ years that this code has existed, so I don't think users will really see a change). |
For now, skip some crashers (tracked in pythongh-123378).
…re clamped (pythonGH-123380) Co-authored-by: Sergey B Kirpichev <[email protected]>
Bug report
Bug description:
Found when implementing #123343. We have:
The line `*start = size-1` might set `start` to `-1` when `start = 0`, in which case this leads to assertion failures when the index is used normally.
CPython versions tested on:
CPython main branch
Operating systems tested on:
Linux
Linked PRs
UnicodeError.__str__ #124935
UnicodeError.__str__ (GH-124935) #125098
UnicodeError.__str__ (GH-124935) #125099
start and end values in PyUnicodeErrorObject #123380