Skip to content

gh-108590: Improve sqlite3 docs on encoding issues and how to handle those #108699

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Oct 25, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
83 changes: 50 additions & 33 deletions Doc/library/sqlite3.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1157,6 +1157,10 @@ Connection objects
f.write('%s\n' % line)
con.close()

.. seealso::

:ref:`sqlite3-howto-encoding`


.. method:: backup(target, *, pages=-1, progress=None, name="main", sleep=0.250)

Expand Down Expand Up @@ -1223,6 +1227,10 @@ Connection objects

.. versionadded:: 3.7

.. seealso::

:ref:`sqlite3-howto-encoding`

.. method:: getlimit(category, /)

Get a connection runtime limit.
Expand Down Expand Up @@ -1444,39 +1452,8 @@ Connection objects
and returns a text representation of it.
The callable is invoked for SQLite values with the ``TEXT`` data type.
By default, this attribute is set to :class:`str`.
If you want to return ``bytes`` instead, set *text_factory* to ``bytes``.

Example:

.. testcode::

con = sqlite3.connect(":memory:")
cur = con.cursor()

AUSTRIA = "Österreich"

# by default, rows are returned as str
cur.execute("SELECT ?", (AUSTRIA,))
row = cur.fetchone()
assert row[0] == AUSTRIA

# but we can make sqlite3 always return bytestrings ...
con.text_factory = bytes
cur.execute("SELECT ?", (AUSTRIA,))
row = cur.fetchone()
assert type(row[0]) is bytes
# the bytestrings will be encoded in UTF-8, unless you stored garbage in the
# database ...
assert row[0] == AUSTRIA.encode("utf-8")

# we can also implement a custom text_factory ...
# here we implement one that appends "foo" to all strings
con.text_factory = lambda x: x.decode("utf-8") + "foo"
cur.execute("SELECT ?", ("bar",))
row = cur.fetchone()
assert row[0] == "barfoo"

con.close()
See :ref:`sqlite3-howto-encoding` for more details.

.. attribute:: total_changes

Expand Down Expand Up @@ -1635,7 +1612,6 @@ Cursor objects
COMMIT;
""")


.. method:: fetchone()

If :attr:`~Cursor.row_factory` is ``None``,
Expand Down Expand Up @@ -2614,6 +2590,47 @@ With some adjustments, the above recipe can be adapted to use a
instead of a :class:`~collections.namedtuple`.


.. _sqlite3-howto-encoding:

How to handle non-UTF-8 text encodings
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, :mod:`!sqlite3` uses :class:`str` to adapt SQLite values
with the ``TEXT`` data type.
This works well for UTF-8 encoded text, but it might fail for other encodings
and invalid UTF-8.
You can use a custom :attr:`~Connection.text_factory` to handle such cases.

Because of SQLite's `flexible typing`_, it is not uncommon to encounter table
columns with the ``TEXT`` data type containing non-UTF-8 encodings,
or even arbitrary data.
To demonstrate, let's assume we have a database with ISO-8859-2 (Latin-2)
encoded text, for example a table of Czech-English dictionary entries.
Assuming we now have a :class:`Connection` instance :py:data:`!con`
connected to this database,
we can decode the Latin-2 encoded text using this :attr:`~Connection.text_factory`:

.. testcode::

con.text_factory = lambda data: str(data, encoding="latin2")

For invalid UTF-8 or arbitrary data in stored in ``TEXT`` table columns,
you can use the following technique, borrowed from the :ref:`unicode-howto`:

.. testcode::

con.text_factory = lambda data: str(data, errors="surrogateescape")

.. note::

The :mod:`!sqlite3` module API does not support strings
containing surrogates.

.. seealso::

:ref:`unicode-howto`


.. _sqlite3-explanation:

Explanation
Expand Down