Commit 07664c9

miss-islington, erlend-aasland, AlexWaygood, CAM-Gerlach, and CorvinM authored
[3.11] gh-108590: Improve sqlite3 docs on encoding issues and how to handle those (GH-108699) (#111325)
Add a guide for how to handle non-UTF-8 text encodings.
Link to that guide from the 'text_factory' docs.

(cherry picked from commit 1262e41)

Co-authored-by: Erlend E. Aasland
Co-authored-by: Alex Waygood
Co-authored-by: C.A.M. Gerlach
Co-authored-by: Corvin
Co-authored-by: Ezio Melotti
Co-authored-by: Serhiy Storchaka
1 parent fc9a5ef commit 07664c9

1 file changed

+50
-33
lines changed

Doc/library/sqlite3.rst

@@ -1029,6 +1029,10 @@ Connection objects
 f.write('%s\n' % line)
 con.close()

+.. seealso::
+
+:ref:`sqlite3-howto-encoding`
+

 .. method:: backup(target, *, pages=-1, progress=None, name="main", sleep=0.250)

@@ -1095,6 +1099,10 @@ Connection objects

 .. versionadded:: 3.7

+.. seealso::
+
+:ref:`sqlite3-howto-encoding`
+
 .. method:: getlimit(category, /)

 Get a connection runtime limit.
@@ -1253,39 +1261,8 @@ Connection objects
 and returns a text representation of it.
 The callable is invoked for SQLite values with the ``TEXT`` data type.
 By default, this attribute is set to :class:`str`.
-If you want to return ``bytes`` instead, set *text_factory* to ``bytes``.

-Example:
-
-.. testcode::
-
-con = sqlite3.connect(":memory:")
-cur = con.cursor()
-
-AUSTRIA = "Österreich"
-
-# by default, rows are returned as str
-cur.execute("SELECT ?", (AUSTRIA,))
-row = cur.fetchone()
-assert row[0] == AUSTRIA
-
-# but we can make sqlite3 always return bytestrings ...
-con.text_factory = bytes
-cur.execute("SELECT ?", (AUSTRIA,))
-row = cur.fetchone()
-assert type(row[0]) is bytes
-# the bytestrings will be encoded in UTF-8, unless you stored garbage in the
-# database ...
-assert row[0] == AUSTRIA.encode("utf-8")
-
-# we can also implement a custom text_factory ...
-# here we implement one that appends "foo" to all strings
-con.text_factory = lambda x: x.decode("utf-8") + "foo"
-cur.execute("SELECT ?", ("bar",))
-row = cur.fetchone()
-assert row[0] == "barfoo"
-
-con.close()
+See :ref:`sqlite3-howto-encoding` for more details.

 .. attribute:: total_changes

@@ -1423,7 +1400,6 @@ Cursor objects
 COMMIT;
 """)

-
 .. method:: fetchone()

 If :attr:`~Cursor.row_factory` is ``None``,
@@ -2369,6 +2345,47 @@ With some adjustments, the above recipe can be adapted to use a
 instead of a :class:`~collections.namedtuple`.


+.. _sqlite3-howto-encoding:
+
+How to handle non-UTF-8 text encodings
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+By default, :mod:`!sqlite3` uses :class:`str` to adapt SQLite values
+with the ``TEXT`` data type.
+This works well for UTF-8 encoded text, but it might fail for other encodings
+and invalid UTF-8.
+You can use a custom :attr:`~Connection.text_factory` to handle such cases.
+
+Because of SQLite's `flexible typing`_, it is not uncommon to encounter table
+columns with the ``TEXT`` data type containing non-UTF-8 encodings,
+or even arbitrary data.
+To demonstrate, let's assume we have a database with ISO-8859-2 (Latin-2)
+encoded text, for example a table of Czech-English dictionary entries.
+Assuming we now have a :class:`Connection` instance :py:data:`!con`
+connected to this database,
+we can decode the Latin-2 encoded text using this :attr:`~Connection.text_factory`:
+
+.. testcode::
+
+con.text_factory = lambda data: str(data, encoding="latin2")
+
+For invalid UTF-8 or arbitrary data stored in ``TEXT`` table columns,
+you can use the following technique, borrowed from the :ref:`unicode-howto`:
+
+.. testcode::
+
+con.text_factory = lambda data: str(data, errors="surrogateescape")
+
+.. note::
+
+The :mod:`!sqlite3` module API does not support strings
+containing surrogates.
+
+.. seealso::
+
+:ref:`unicode-howto`
+
+
 .. _sqlite3-explanation:

 Explanation
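
To see the two ``text_factory`` recipes from the new how-to in action, here is a minimal, self-contained sketch. It is not part of the commit: the in-memory database, the ``dictionary`` table, and the sample word are assumptions made for illustration, and ``CAST(? AS TEXT)`` is used only to simulate a writer that stored Latin-2 bytes in a ``TEXT`` column.

```python
import sqlite3

# Hypothetical setup (not from the commit): an in-memory database with a
# small Czech-English dictionary table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dictionary(czech TEXT, english TEXT)")

# Simulate a writer that stored Latin-2 bytes in a TEXT column.
# CAST(? AS TEXT) labels the bytes as TEXT without re-encoding them.
czech_word = "žlutý"
latin2_bytes = czech_word.encode("latin2")
con.execute(
    "INSERT INTO dictionary VALUES (CAST(? AS TEXT), ?)",
    (latin2_bytes, "yellow"),
)

query = "SELECT czech, english FROM dictionary"

# With the default text_factory (str), the Latin-2 bytes are not valid
# UTF-8, so fetching the row is expected to fail.
try:
    con.execute(query).fetchone()
    default_failed = False
except sqlite3.Error:
    default_failed = True

# Recipe 1: decode every TEXT value as Latin-2.
con.text_factory = lambda data: str(data, encoding="latin2")
decoded_row = con.execute(query).fetchone()

# Recipe 2: keep undecodable bytes as lone surrogates so that the
# original bytes can be recovered later.
con.text_factory = lambda data: str(data, errors="surrogateescape")
escaped_row = con.execute(query).fetchone()
recovered = escaped_row[0].encode("utf-8", errors="surrogateescape")

con.close()
```

The first recipe yields ordinary strings, but it only suits databases where every ``TEXT`` value uses the same known encoding. The second recipe produces strings containing surrogates, which the note in the how-to says the module API does not support, so it is probably safer to convert them back to bytes (as ``recovered`` does) before passing them on.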
