gh-124188: Fix PyErr_ProgramTextObject() #124189
Conversation
* Detect source file encoding.
* Use the "replace" error handler even for UTF-8 (default) encoding.
* Remove the BOM.
* Fix detection of too long lines if they contain NUL.
* Return the head rather than the tail for truncated long lines.
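As a rough illustration of the effect (a hedged sketch, not the PR's own test code; the file name and the cp1251 example are made up): a script whose source uses a non-UTF-8 coding declaration should now have its failing line decoded correctly when the interpreter prints the traceback.

# Hypothetical demo, not from the PR: the C-level traceback printer should
# now honor the coding declaration when echoing the source line.
import subprocess
import sys

source = '# -*- coding: cp1251 -*-\nassert 1 > 2, "повідомлення"\n'
with open('cp1251_assert.py', 'wb') as f:
    f.write(source.encode('cp1251'))

proc = subprocess.run([sys.executable, 'cp1251_assert.py'],
                      capture_output=True)
# stderr is expected to contain the decoded assert line rather than mojibake.
print(proc.stderr.decode('utf-8', 'replace'))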
if isinstance(source, str):
    with open(TESTFN, 'w', encoding='utf-8') as testfile:
        testfile.write(dedent(source))
else:
    with open(TESTFN, 'wb') as testfile:
        testfile.write(source)
Should this be replaced by a call to script_helper.make_script()?
I considered this option, but I doubt that it would make the code simpler. This code calls dedent() for string sources, so you need branching in any case. The rest of make_script() is not useful here: you would need to pass os.curdir as the first argument, the .py suffix is not relevant here, and neither is the importlib cache invalidation.
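For comparison, here is a hedged sketch of the two options being weighed; only script_helper.make_script() and TESTFN come from the stdlib test support, the two wrapper functions are hypothetical.

# Sketch for illustration only; the PR keeps the first, inline variant.
import os
from textwrap import dedent
from test.support.os_helper import TESTFN
from test.support.script_helper import make_script

def write_source_inline(source):
    # Variant used in the PR: branch on the source type and dedent str sources.
    if isinstance(source, str):
        with open(TESTFN, 'w', encoding='utf-8') as testfile:
            testfile.write(dedent(source))
    else:
        with open(TESTFN, 'wb') as testfile:
            testfile.write(source)

def write_source_via_make_script(source):
    # Suggested alternative: the same branching is still needed, plus a
    # directory argument; make_script() also appends a .py suffix and
    # invalidates importlib caches, neither of which matters here.
    if isinstance(source, str):
        return make_script(os.curdir, TESTFN, dedent(source))
    with open(TESTFN, 'wb') as testfile:
        testfile.write(source)
    return TESTFN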
} while (*pLastChar != '\0' && *pLastChar != '\n');
const char *line = linebuf;
/* Skip BOM. */
if (lineno == 1 && line_size >= 3 && memcmp(line, "\xef\xbb\xbf", 3) == 0) {
Maybe #define BOM_PREFIX "\xef\xbb\xbf"?
This string only occurs once in the sources. I do not think that defining a name for it (potentially in a place far from its use) would make the code clearer. If the comment is not enough, suggest a better one.
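For context, a hedged Python-level illustration of the same BOM-skip idea (not the PR's code), using the stdlib codecs.BOM_UTF8 constant instead of a hard-coded literal:

# Python illustration of the check above: only line 1 may carry a UTF-8 BOM.
import codecs

def strip_bom(line: bytes, lineno: int) -> bytes:
    if lineno == 1 and line.startswith(codecs.BOM_UTF8):   # b'\xef\xbb\xbf'
        return line[len(codecs.BOM_UTF8):]
    return line

assert strip_bom(b'\xef\xbb\xbfimport sys\n', 1) == b'import sys\n'
assert strip_bom(b'import sys\n', 2) == b'import sys\n'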
@@ -1960,20 +1960,41 @@ PyErr_ProgramText(const char *filename, int lineno)
    return res;
}

/* Function from Parser/tokenizer/file_tokenizer.c */
extern char* _PyTokenizer_FindEncodingFilename(int, PyObject *);

PyObject *
_PyErr_ProgramDecodedTextObject(PyObject *filename, int lineno, const char* encoding)
should this be _PyErr_DecodedProgramTextObject?
This already was here.
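For readers following along, a hedged Python-level analogue of what this C helper is meant to do (detect the declared encoding or BOM, then decode the requested line with a forgiving error handler); this is an illustration, not the PR's implementation:

# Rough Python equivalent of the C-level line extraction (illustrative only).
import io
from tokenize import detect_encoding

def program_text(filename, lineno, encoding=None):
    with open(filename, 'rb') as fp:
        if encoding is None:
            # Reads the coding declaration and/or BOM from the first lines;
            # a BOM yields 'utf-8-sig', so it is stripped on decoding.
            encoding, _ = detect_encoding(fp.readline)
            fp.seek(0)
        text = io.TextIOWrapper(fp, encoding=encoding, errors='replace')
        for current, line in enumerate(text, start=1):
            if current == lineno:
                return line
    return None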
@@ -2052,11 +2057,32 @@ def test_assertion_error_location(self):
                'AssertionError',
            ],
        ),
-       ('assert 1 > 2, "message"',
+       ('assert 1 > 2, "messäge"',
These tests are not relevant to compiler errors. I updated/added them just to test that tracebacks for other errors also extract non-ASCII lines correctly.
A test fails on WASI:
Oh, it's a bug in unittest. It calls str() on a bytes object indirectly. Workaround:

diff --git a/Lib/unittest/case.py b/Lib/unittest/case.py
index 55c79d35353..4e00267e77c 100644
--- a/Lib/unittest/case.py
+++ b/Lib/unittest/case.py
@@ -1455,6 +1455,8 @@ class _SubTest(TestCase):

     def __init__(self, test_case, message, params):
         super().__init__()
+        if message is not _subtest_msg_sentinel and not isinstance(message, str):
+            message = repr(message)
         self._message = message
         self.test_case = test_case
         self.params = params
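For what it's worth, a hypothetical minimal reproduction of that unittest issue (assumed, not taken from the discussion; it presumes the test run uses python -bb, which turns BytesWarning into an error): a bytes subtest message gets formatted into the subtest description on failure, which implicitly calls str() on the bytes object.

# repro_subtest_bytes.py -- hypothetical repro, run as:
#   python -bb -m unittest repro_subtest_bytes
import unittest

class Repro(unittest.TestCase):
    def test_bytes_message(self):
        # When this subtest fails, unittest formats the bytes message into the
        # description, implicitly calling str() on bytes; under -bb that
        # BytesWarning becomes an error.
        with self.subTest(b'\xef\xbb\xbf'):
            self.fail("boom")

if __name__ == "__main__":
    unittest.main()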
Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13.
Sorry, @serhiy-storchaka, I could not cleanly backport this to
GH-124423 is a backport of this pull request to the 3.13 branch.
GH-124426 is a backport of this pull request to the 3.12 branch.
That's a big piece of work. While the _PyErr_ProgramDecodedTextObject() change is short, you added tons of tests for it: thanks.
Because it contained tons of bugs. Not all tests were backported to 3.12, because they exposed other bugs that are not fixed in 3.12.