gh-108590: Improve sqlite3 docs on encoding issues and how to handle those #108699
Document how to handle table columns with invalid Unicode sequences.
IMO, the existing |
cc. @AlexWaygood, if you have time to look at the prose. |
Co-authored-by: Alex Waygood <[email protected]>
Thanks, Alex! |
print() will fail, because sys.stdout.errors is "strict" by default. This is not a new issue; you have the same problem when you print the result of os.listdir(), for example.
Another issue is that you cannot simply feed the result of iterdump() to execute(), because the Python interface only accepts UTF-8-encodable strings.
A minor problem is that using anything besides str, bytes or bytearray as text_factory slows down requests, so you should only use a custom text_factory when necessary.
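The points in this comment can be reproduced with a minimal sketch. It is not the example from the PR; the table name t, the sample value, and the use of CAST(? AS TEXT) to store unvalidated bytes as TEXT are assumptions chosen for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(name TEXT)")

# Simulate a legacy import: CAST stores the Latin-1 bytes as TEXT
# without checking that they are valid UTF-8.
raw = "Österreich".encode("latin1")  # b'\xd6sterreich'
con.execute("INSERT INTO t VALUES (CAST(? AS TEXT))", (raw,))

# The default text_factory (str) decodes strictly and rejects the row.
default_error = None
try:
    con.execute("SELECT name FROM t").fetchone()
except sqlite3.OperationalError as exc:
    default_error = exc

# A callable text_factory receives the raw bytes of each TEXT value.
# It is slower than the built-in str/bytes/bytearray factories.
con.text_factory = lambda data: str(data, errors="surrogateescape")
escaped = con.execute("SELECT name FROM t").fetchone()[0]

# A strict encoder (what print() uses when sys.stdout.errors is "strict")
# rejects the lone surrogate; ascii() gives a form that is safe to print.
strict_error = None
try:
    escaped.encode("utf-8")
except UnicodeEncodeError as exc:
    strict_error = exc
printable = ascii(escaped)

# Binding the escaped str as a query parameter is rejected as well.
bind_error = None
try:
    con.execute("SELECT ?", (escaped,))
except UnicodeEncodeError as exc:
    bind_error = exc

# Workaround: turn the surrogates back into the original bytes first.
restored = escaped.encode("utf-8", "surrogateescape")
```

If the original bytes need to go back into the database, restored can be bound as a bytes parameter instead of the escaped str.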
It may be helpful to emphasize that strings containing surrogate escapes cannot be passed back to SQLite through Python.
A plausible real-world example is a DB created from a CSV file or another source in a legacy encoding. Ukrainian or Korean are bad as examples, because all characters are non-ASCII, so you cannot recognize the word in its bytes repr. "Österreich" is better, since it contains only one non-ASCII letter. Of course, the example needs to use an encoding other than UTF-8. It would be better if the natural legacy encoding were different from Latin1, but if there is nothing better, Latin1 works too. "Österreich" is a bit too long; some shorter Norwegian geographical name may be better. The new example only shows the use of |
The CSV export from my Czech bank account uses iso-8859-2 and contains the column Or you could use |
Good examples! |
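The legacy-encoding scenario discussed above might look like the following sketch. The place name, the places table, and the CAST-based insert are hypothetical stand-ins for a real import; only the idea of decoding with the source's actual encoding comes from the discussion.

```python
import sqlite3

# Hypothetical data: a place name as it would appear in an ISO-8859-2 export.
raw = "Plzeň".encode("iso-8859-2")  # b'Plze\xf2'

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE places(name TEXT)")
# Store the legacy bytes as TEXT without transcoding them to UTF-8.
con.execute("INSERT INTO places VALUES (CAST(? AS TEXT))", (raw,))

# Decoding with the encoding the data was written in recovers the text.
con.text_factory = lambda data: str(data, encoding="iso-8859-2")
decoded = con.execute("SELECT name FROM places").fetchone()[0]

# Decoding with the wrong single-byte encoding succeeds silently
# but yields the wrong character.
con.text_factory = lambda data: str(data, encoding="latin1")
mojibake = con.execute("SELECT name FROM places").fetchone()[0]

# Why a mostly-ASCII word makes a better example: its bytes repr stays
# readable, while an all-Cyrillic word becomes nothing but escapes.
mostly_ascii = "Österreich".encode("latin1")
all_escapes = "Україна".encode("koi8-u")
```

Because the wrong codec does not raise, the encoding has to be known from the data's origin; it cannot be reliably detected from a successful decode.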
I think we all agree that the current version has serious issues. No need to beat it more, it is already dead. Let @erlend-aasland address our comments and prepare a new version. Then we will start a new round of bikeshedding.
@ezio-melotti, would you like to take another look? |
Looks very good now, thanks! Just a few small suggestions.
Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Ezio Melotti <[email protected]>
Co-authored-by: Ezio Melotti <[email protected]>
I think this is ready to land. Shout out if you disagree. OTOH, we can easily patch up things with a new PR. I'll be merging this tomorrow morning CET. |
A big thanks to everyone who helped shape this PR! |
This change removes any mention of bytes, which makes the documentation less informative.
It also does not resolve the original issue; in particular, it does not document that iterdump() is not compatible with text_factory=bytes. Nor does it say what to do with the result of iterdump() when a custom text_factory is set.
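One possible way to handle the iterdump() case raised here is sketched below. It is an assumption-laden illustration, not something the PR documents: the table t and the CAST-based insert are hypothetical, and the exact exception raised under text_factory=bytes is deliberately not relied upon.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(name TEXT)")
# Store bytes that are not valid UTF-8 as a TEXT value.
con.execute("INSERT INTO t VALUES (CAST(? AS TEXT))", (b"\xd6sterreich",))
con.commit()

# With text_factory=bytes, iterdump() is expected to fail, because its
# implementation treats the schema rows as str.
con.text_factory = bytes
bytes_dump_error = None
try:
    list(con.iterdump())
except Exception as exc:  # exact exception type is not assumed here
    bytes_dump_error = exc

# A str-returning factory with surrogateescape lets the dump complete.
con.text_factory = lambda data: str(data, errors="surrogateescape")
dump_text = "\n".join(con.iterdump())

# The dump contains lone surrogates, so encode it with the same error
# handler to get back the original bytes before writing it anywhere.
dump_bytes = dump_text.encode("utf-8", "surrogateescape")
```

As noted earlier in the thread, dump_text cannot be fed straight back to execute(), since the Python interface only accepts UTF-8-encodable strings; dump_bytes could instead be written to a file opened in binary mode.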
Serhiy, please see #108699 (comment). I do not intend to mention
It does so implicitly; it helps the user understand how to work around such encoding issues. If you disagree with this approach, please open a competing PR. |
Okay. I do not fully understand what the benefit of this change is, but I do not oppose it. |
Why so dismissive? I've asked you twice for a concrete proposal regarding your |
Sorry, I meant no disrespect. There were so many changes and comments in this PR that I was already lost. After the old examples were removed, there was a lot of discussion about new examples, but they never appeared, so I didn't see much to comment on, as this PR looked far from finished. If you consider it complete, I will perhaps continue the work in a new PR. |
Ok, thanks for your response, Serhiy. It is ok to continue improving the docs in follow-up PRs. Let's land this, then we can follow up your concerns in a new PR. I appreciate your comments; I just find some of them hard to address. |
Thanks @erlend-aasland for the PR 🌮🎉.. I'm working now to backport this PR to: 3.11, 3.12. |
…andle those (pythonGH-108699) Add a guide for how to handle non-UTF-8 text encodings. Link to that guide from the 'text_factory' docs. (cherry picked from commit 1262e41) Co-authored-by: Erlend E. Aasland <[email protected]> Co-authored-by: Alex Waygood <[email protected]> Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Corvin <[email protected]> Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>
GH-111324 is a backport of this pull request to the 3.12 branch. |
GH-111325 is a backport of this pull request to the 3.11 branch. |
…handle those (GH-108699) (#111325) Add a guide for how to handle non-UTF-8 text encodings. Link to that guide from the 'text_factory' docs. (cherry picked from commit 1262e41) Co-authored-by: Erlend E. Aasland <[email protected]> Co-authored-by: Alex Waygood <[email protected]> Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Corvin <[email protected]> Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>
…handle those (GH-108699) (#111324) Add a guide for how to handle non-UTF-8 text encodings. Link to that guide from the 'text_factory' docs. (cherry picked from commit 1262e41) Co-authored-by: Erlend E. Aasland <[email protected]> Co-authored-by: Alex Waygood <[email protected]> Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Corvin <[email protected]> Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>
…andle those (python#108699) Add a guide for how to handle non-UTF-8 text encodings. Link to that guide from the 'text_factory' docs. Co-authored-by: Alex Waygood <[email protected]> Co-authored-by: C.A.M. Gerlach <[email protected]> Co-authored-by: Corvin <[email protected]> Co-authored-by: Ezio Melotti <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>
📚 Documentation preview 📚: https://cpython-previews--108699.org.readthedocs.build/