gh-108590: Improve sqlite3 docs on encoding issues and how to handle those #108699

erlend-aasland · 2023-08-30T21:57:28Z

Document how to handle table columns with invalid Unicode sequences.

Issue: sqlite3.iterdump() incompatible with binary data #108590

📚 Documentation preview 📚: https://cpython-previews--108699.org.readthedocs.build/

erlend-aasland · 2023-08-30T22:01:00Z

IMO, the existing text_factory examples look contrived; the added example where we solve the issue of invalid Unicode sequences is a more interesting and useful example.

erlend-aasland · 2023-08-30T22:16:54Z

cc. @AlexWaygood, if you have time to look at the prose.

Doc/library/sqlite3.rst

Co-authored-by: Alex Waygood <[email protected]>

erlend-aasland · 2023-08-30T22:54:26Z

Thanks, Alex!

serhiy-storchaka

print() will fail, because sys.stdout.errors is "strict" by default. It is not a new issue, you have the same issue when you print the result of os.listdir(), for example.
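A minimal sketch of the failure described above, assuming the text was decoded with the `surrogateescape` error handler (the sample bytes are borrowed from later in this thread). It only illustrates how strict encoding treats lone surrogates; it is not taken from the PR itself.

```python
# Bytes that are valid ISO 8859-2 but not valid UTF-8.
raw = b"Polo\xbeka"

# surrogateescape maps each undecodable byte to a lone surrogate code point.
text = raw.decode("utf-8", errors="surrogateescape")

# A strict encoder (the handler the comment above attributes to sys.stdout)
# rejects lone surrogates.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# For display, escape the surrogates instead of failing.
printable = text.encode("utf-8", errors="backslashreplace").decode("utf-8")
print(printable)

# To recover the original bytes, encode with the same error handler.
restored = text.encode("utf-8", errors="surrogateescape")
```

Escaping with `backslashreplace` is one option for display; reconfiguring the output stream's error handler would be another.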

Another issue is that you cannot simply feed the result of iterdump() to execute(), because the Python interface only accepts UTF-8 encodable strings.

A minor problem is that using anything besides str, bytes or bytearray as text_factory slows down requests, so a custom text_factory should be used only when necessary.
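The trade-off can be sketched as follows, assuming an in-memory database and a hypothetical `entries` table. The three `text_factory` settings are the ones discussed in this thread; the table, column and sample value are illustrative only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entries(label TEXT)")
# CAST lets us store bytes that are not valid UTF-8 in a TEXT column.
con.execute("INSERT INTO entries VALUES (CAST(? AS TEXT))", (b"Polo\xbeka",))

query = "SELECT label FROM entries"

# Default text_factory (str): strict UTF-8 decoding fails.
try:
    con.execute(query).fetchone()
    default_error = None
except sqlite3.OperationalError as exc:
    default_error = exc

# text_factory=bytes: the raw bytes come back untouched.
con.text_factory = bytes
as_bytes = con.execute(query).fetchone()[0]

# Custom callable with a known legacy codec.
con.text_factory = lambda data: data.decode("iso-8859-2")
as_text = con.execute(query).fetchone()[0]

# Custom callable that keeps undecodable bytes as lone surrogates.
con.text_factory = lambda data: data.decode("utf-8", errors="surrogateescape")
as_escaped = con.execute(query).fetchone()[0]
con.close()
```

Since a callable is invoked for every text value, the built-in `str` or `bytes` factories remain the cheaper choice when they are sufficient.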

Doc/library/sqlite3.rst

CorvinM · 2023-08-31T08:39:14Z

It may be helpful to emphasize that strings containing surrogate escapes cannot be passed back to SQLite through Python.
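A short sketch of that limitation, assuming a hypothetical `entries` table. The `CAST(? AS TEXT)` workaround follows the approach named in this PR's commit list; the exception type shown is what current CPython versions are expected to raise, so treat it as an assumption.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entries(label TEXT)")

raw = b"Polo\xbeka"
escaped = raw.decode("utf-8", errors="surrogateescape")

# Passing the surrogate-escaped str back to sqlite3 fails, because the
# module has to encode it as strict UTF-8.
try:
    con.execute("INSERT INTO entries VALUES (?)", (escaped,))
    rejected = False
except UnicodeEncodeError:
    rejected = True

# Workaround sketch: turn the str back into the original bytes and let
# SQLite reinterpret them as TEXT.
original = escaped.encode("utf-8", errors="surrogateescape")
con.execute("INSERT INTO entries VALUES (CAST(? AS TEXT))", (original,))

con.text_factory = bytes
stored = [row[0] for row in con.execute("SELECT label FROM entries")]
con.close()
```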

… to sqlite3 API

serhiy-storchaka · 2023-08-31T09:43:18Z

A plausible real-world example: a DB created from a CSV file or another source in a legacy encoding.

Ukrainian or Korean are bad as examples, because all characters are non-ASCII, so you cannot recognize the word in its bytes repr. "Österreich" is better: it contains only one non-ASCII letter. Of course, the example needs an encoding other than UTF-8. It would be better if the natural legacy encoding were different from Latin1, but if there is nothing better, Latin1 works too. "Österreich" is a bit too long; a shorter Norwegian geographical name may be better.

The new example only shows the use of iterdump() and does not show the use of bytes. I think that examples with fetch() are important too. You can show it for bytes, encoding='latin1' and errors='surrogateescape', then say that iterdump() only works with a text_factory that produces a string, and show how to save its output to a file (you can choose whether to recreate the DB with the originally encoded data or recode it to UTF-8 if you know the encoding used).
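The suggestion above might look roughly like the following sketch. The table name, the temporary file path and the choice of ISO 8859-2 for recoding are assumptions made for illustration, not the example that eventually landed in the docs.

```python
import os
import sqlite3
import tempfile

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entries(label TEXT)")
con.execute("INSERT INTO entries VALUES (CAST(? AS TEXT))", (b"Polo\xbeka",))
con.commit()

# iterdump() builds SQL text, so text_factory must return str here.
con.text_factory = lambda data: data.decode("utf-8", errors="surrogateescape")
dump = "\n".join(con.iterdump())
con.close()

dump_path = os.path.join(tempfile.mkdtemp(), "dump.sql")

# Option 1: keep the original bytes by encoding with the same error handler.
with open(dump_path, "w", encoding="utf-8", errors="surrogateescape") as f:
    f.write(dump)
with open(dump_path, "rb") as f:
    dumped_bytes = f.read()

# Option 2: if the legacy encoding is known, recode the dump as real UTF-8 text.
recoded = dumped_bytes.decode("iso-8859-2")
```

Option 1 reproduces the database byte for byte when the dump is replayed by a tool that accepts raw bytes; option 2 only works when the rest of the dump is compatible with the chosen legacy codec.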

encukou · 2023-08-31T11:40:13Z

The CSV export from my Czech bank account uses iso-8859-2 and contains the column b'\xc8\xe1stka'.
Částka means "sum" (an amount of money). Decoding this with Latin-1 gives you Èástka, which is wrong.

Or you could use b'Polo\xbeka' (položka, "entry"), which latin1 decodes to the more obviously wrong Polo¾ka.
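Both examples can be checked with the standard codecs. This is only a verification sketch of the byte strings quoted above.

```python
samples = [b"\xc8\xe1stka", b"Polo\xbeka"]

# Pair each sample's correct decoding (ISO 8859-2) with the wrong one (Latin-1).
decoded = [(raw.decode("iso-8859-2"), raw.decode("latin-1")) for raw in samples]

# decoded[0] is ('Částka', 'Èástka'); decoded[1] is ('Položka', 'Polo¾ka').
for right, wrong in decoded:
    # ascii() keeps the output printable on any console encoding.
    print(ascii(right), ascii(wrong))
```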

serhiy-storchaka · 2023-08-31T11:46:16Z

Good examples!

Doc/library/sqlite3.rst

serhiy-storchaka

I think we all agree that the current version has serious issues. No need to beat it more; it is already dead. Let @erlend-aasland address our comments and prepare a new version. Then we will start a new round of bikeshedding.

Doc/library/sqlite3.rst

erlend-aasland · 2023-10-12T10:14:57Z

@ezio-melotti, would you like to take another look?

Doc/library/sqlite3.rst

CAM-Gerlach

Looks very good now, thanks! Just a few small suggestions.

Doc/library/sqlite3.rst

Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Ezio Melotti <[email protected]>

Doc/library/sqlite3.rst

Co-authored-by: Ezio Melotti <[email protected]>

erlend-aasland · 2023-10-13T08:01:48Z

I think this is ready to land. Shout out if you disagree. OTOH, we can easily patch up things with a new PR. I'll be merging this tomorrow morning CET.

erlend-aasland · 2023-10-13T08:02:30Z

A big thanks to everyone who helped shape this PR!

serhiy-storchaka

This change removes any mention of bytes. It makes the documentation less informative.

It also does not resolve the original issue; in particular, it does not document that iterdump() is not compatible with text_factory=bytes. And it does not say what to do with the result of iterdump() with a custom text_factory.

erlend-aasland · 2023-10-13T11:04:54Z

> This change removes any mention of bytes. It makes the documentation less informative.

Serhiy, please see #108699 (comment). I do not intend to mention bytes unless there is a reason to do it. If you care strongly for it, I suggest you create a follow-up PR with your use-case. IMO, it does not block this PR.

> It also does not resolve the original issue; in particular, it does not document that iterdump() is not compatible with text_factory=bytes. And it does not say what to do with the result of iterdump() with a custom text_factory.

It does so implicitly; it helps the user understand how to work around such encoding issues. If you disagree with this approach, please open a competing PR.

serhiy-storchaka · 2023-10-13T11:56:45Z

Okay. I do not fully understand what is the benefit of this change, but I do not oppose it.

erlend-aasland · 2023-10-13T12:42:03Z

> Okay. I do not fully understand what is the benefit of this change, but I do not oppose it.

Why so dismissive? I've asked you twice for a concrete proposal regarding your bytes comment, but you seem to ignore this. I do not oppose such a change, I only ask for a use-case for such an example.

serhiy-storchaka · 2023-10-13T13:42:02Z

Sorry, I meant no disrespect. There were so many changes and comments in this PR that I was already lost. After the old examples were removed, there was a lot of discussion about new examples, but they never appeared, so I didn't see much to comment on, as this PR looked far from finished. If you consider it complete, I will perhaps continue the work in a new PR. text_factory should be fully documented (and special cases for bytes and bytearray are specially optimized in the code).

erlend-aasland · 2023-10-13T14:54:27Z

Ok, thanks for your response, Serhiy. It is ok to continue improving the docs in follow-up PRs.

Let's land this, then we can follow up your concerns in a new PR. I appreciate your comments; I just find some of them hard to address.

miss-islington-app · 2023-10-25T13:58:04Z

Thanks @erlend-aasland for the PR 🌮🎉.. I'm working now to backport this PR to: 3.11, 3.12.
🐍🍒⛏🤖 I'm not a witch! I'm not a witch!

…andle those (pythonGH-108699) Add a guide for how to handle non-UTF-8 text encodings. Link to that guide from the 'text_factory' docs. (cherry picked from commit 1262e41) Co-authored-by: Erlend E. Aasland <[email protected]> Co-authored-by: Alex Waygood <[email protected]> Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Corvin <[email protected]> Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>

bedevere-app · 2023-10-25T13:58:19Z

GH-111324 is a backport of this pull request to the 3.12 branch.

bedevere-app · 2023-10-25T13:58:22Z

GH-111325 is a backport of this pull request to the 3.11 branch.

…handle those (GH-108699) (#111325) Add a guide for how to handle non-UTF-8 text encodings. Link to that guide from the 'text_factory' docs. (cherry picked from commit 1262e41) Co-authored-by: Erlend E. Aasland <[email protected]> Co-authored-by: Alex Waygood <[email protected]> Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Corvin <[email protected]> Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>

…handle those (GH-108699) (#111324) Add a guide for how to handle non-UTF-8 text encodings. Link to that guide from the 'text_factory' docs. (cherry picked from commit 1262e41) Co-authored-by: Erlend E. Aasland <[email protected]> Co-authored-by: Alex Waygood <[email protected]> Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Corvin <[email protected]> Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>

…andle those (python#108699) Add a guide for how to handle non-UTF-8 text encodings. Link to that guide from the 'text_factory' docs. Co-authored-by: Alex Waygood <[email protected]> Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Corvin <[email protected]> Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>

pythongh-108590: Add sqlite3 text factory howto

139ad73

Document how to handle table columns with invalid Unicode sequences.

erlend-aasland added the skip news, needs backport to 3.11 (only security fixes), and needs backport to 3.12 (only security fixes) labels Aug 30, 2023

erlend-aasland requested a review from ezio-melotti August 30, 2023 21:57

bedevere-bot mentioned this pull request Aug 30, 2023

sqlite3.iterdump() incompatible with binary data #108590

Closed

2 tasks

bedevere-bot added the docs Documentation in the Doc dir label Aug 30, 2023

AlexWaygood reviewed Aug 30, 2023

Apply suggestions from code review

88a7599

Co-authored-by: Alex Waygood <[email protected]>

serhiy-storchaka reviewed Aug 31, 2023

erlend-aasland changed the title ~~gh-108590: Add sqlite3 text factory howto~~ gh-108590: Improve sqlite3 docs on encoding issues and how to handle those Aug 31, 2023

erlend-aasland added 4 commits August 31, 2023 09:53

Add note to iterdump(); explain why we use CAST

f9aac63

Merge Alex's review

60aebce

Pull in main

1fead55

Mention that execute() and friends only accept UTF-8 encoded strings

fdc240f

serhiy-storchaka reviewed Aug 31, 2023

erlend-aasland marked this pull request as ready for review August 31, 2023 08:18

erlend-aasland requested a review from berkerpeksag as a code owner August 31, 2023 08:18

bedevere-bot added the awaiting core review label Aug 31, 2023

erlend-aasland added 2 commits August 31, 2023 10:48

Try to emphasize that strings with surrogate escapes cannot be passed…

62073e5

… to sqlite3 API

Add seealso for the Unicode HOWTO

35c6e9d

ezio-melotti reviewed Aug 31, 2023

serhiy-storchaka reviewed Sep 1, 2023

erlend-aasland requested review from serhiy-storchaka and AlexWaygood October 12, 2023 10:14

ezio-melotti reviewed Oct 12, 2023

CAM-Gerlach reviewed Oct 12, 2023

Apply suggestions from CAM and Ezio

dc4c820

Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Ezio Melotti <[email protected]>

ezio-melotti reviewed Oct 13, 2023

Keep it simple

7e0e615

Co-authored-by: Ezio Melotti <[email protected]>

serhiy-storchaka reviewed Oct 13, 2023

erlend-aasland merged commit 1262e41 into python:main Oct 25, 2023

erlend-aasland deleted the sqlite/doc-text-factory branch October 25, 2023 13:58

bedevere-app bot removed the awaiting core review label Oct 25, 2023

bedevere-app bot removed the needs backport to 3.12 only security fixes label Oct 25, 2023

bedevere-app bot removed the needs backport to 3.11 only security fixes label Oct 25, 2023
