Documentation request: Unicode conversions page #591

jbarlow83 · 2017-01-04T20:47:54Z

I think it would be helpful to have a new section under Type conversions that describes how pybind11 deals with Unicode conversions in Python 2.7 and 3. (I can't find this documented anywhere.)

jagerman · 2017-01-28T17:11:45Z

The quick version is that pybind11 loads and casts between std::string and Python strings assuming UTF-8 (calling Python core functions to do the interpretation/conversion), and assumes UTF-16 or UTF-32 when using std::wstring (UTF-16 if wchar_t is 2 bytes, UTF-32 if it is 4 bytes).
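
To make those encodings concrete, here is a pure-Python sketch (not pybind11 code) of the byte sequences a C++ function would end up holding for a given Python string. The helper name encode_like_cpp is hypothetical, and little-endian byte order for the wide encodings is an assumption.

```python
def encode_like_cpp(text, wchar_size=None):
    """Return the bytes a C++ string type would hold for `text`.

    wchar_size=None models std::string (UTF-8); 2 or 4 models std::wstring
    with a 2-byte or 4-byte wchar_t. Little-endian order is an assumption.
    """
    if wchar_size is None:
        return text.encode("utf-8")
    if wchar_size == 2:
        return text.encode("utf-16-le")
    if wchar_size == 4:
        return text.encode("utf-32-le")
    raise ValueError("wchar_size must be None, 2 or 4")


sample = "h\u00e9"  # 'h' followed by e-acute
utf8 = encode_like_cpp(sample)       # models std::string
utf16 = encode_like_cpp(sample, 2)   # models std::wstring, 2-byte wchar_t
utf32 = encode_like_cpp(sample, 4)   # models std::wstring, 4-byte wchar_t
```

The same two-character string occupies 3, 4 and 8 bytes respectively, which is why the width of wchar_t matters on the C++ side.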

jbarlow83 · 2017-01-28T22:57:33Z

Here are some more specific questions whose answers I think should be documented:

1. If a bytes (Py2 str) object is passed to a C++ function accepting std::string, is it implicitly converted to UTF-8 or left alone?
2. Is it possible to return std::string and have Python receive it as bytes (Py2 str)?
3. In Python 2, will a returned std::string be converted to an unencoded str, a UTF-8 encoded str, or unicode?
4. What happens in Py2 and 3 if a std::string cannot be implicitly converted to/from UTF-8?
5. Is there any way to disable UTF-8 conversion and treat all std::string as bytes (str)?

jagerman · 2017-01-29T00:45:59Z

I agree that it would be good to have this documented. Based on my reading of the code (the template <> class type_caster<std::string> { in include/pybind11/cast.h), and Python C API documentation, I believe the answers (and remaining questions!) are:

1. When going from Python to C++ std::string (i.e. type_caster<std::string>::load()), if the passed object is a unicode object (or subclass), the created std::string will be the UTF-8 encoding of the unicode string. If you give it bytes, it'll be left alone.
2. When casting a std::string into Python (i.e. returning a std::string), we call PyUnicode_FromStringAndSize on it unconditionally: we have no way to know whether the string came in as bytes or unicode. The documentation for the Python function just says that it interprets it as UTF-8, but it doesn't say what happens if it is passed invalid UTF-8 data.
3. It is converted to unicode.
4. Good question. The Python C API documentation is remarkably lacking in description of error handling (the Python API is better).
5. Not directly, but you can interact with bytes (or str in Python 2) via the py::bytes class, e.g. by returning a py::bytes(s) where s is a std::string.
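
The load and cast rules above can be modelled in a few lines of Python 3. This is only a sketch of the behaviour as described in this thread, not pybind11's actual implementation; load_std_string and cast_std_string are hypothetical names, and bytes stands in for the contents of a std::string.

```python
def load_std_string(obj):
    """Model of Python -> std::string: str is UTF-8 encoded, bytes pass through."""
    if isinstance(obj, str):
        return obj.encode("utf-8")
    if isinstance(obj, bytes):
        return obj
    raise TypeError("expected str or bytes")


def cast_std_string(data):
    """Model of std::string -> Python: always decoded as UTF-8 into str."""
    return data.decode("utf-8")


from_text = load_std_string("\u00e9")       # str in: UTF-8 bytes on the C++ side
from_bytes = load_std_string(b"\xff\x00")   # bytes in: left alone, even if not UTF-8
round_trip = cast_std_string(load_std_string(b"abc"))  # bytes in, str out
```

Note that in this model a bytes argument that passes straight through still comes back as str, because the return path cannot tell how the data arrived.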

jbarlow83 · 2017-01-29T05:24:19Z

By code inspection it looks like Python will raise a UnicodeDecodeError if PyUnicode_FromStringAndSize fails.

However the current behavior from pybind11 2.0.1 (arguably a bug) is to return this kind of error:

TypeError: Unable to convert function return value to a Python type! The signature was
	() -> str

It's possibly a bug because it suppresses information that could be used to solve the problem.

My test function was:

    m.def("bad_utf8",
        []() -> std::string {
            return std::string("\xd0\xd0\xd0"); // not utf-8
        }
    );

jbarlow83 · 2017-01-29T05:29:11Z

It would also be useful to document what pybind11 does with single character literals and wchar_t in each direction.

jagerman · 2017-01-29T07:58:19Z

I'm not sure if it should just report a better error, or actually return a bytes in that case. (The latter would make round-tripping of bytes data work, as long as the data didn't happen to be a valid UTF-8 sequence with high-bit bytes).

jbarlow83 · 2017-01-29T10:00:55Z

My thinking is that there should be a 1:1 correspondence between std::string and Python 3 str. It is already true that any str can be represented as a UTF-8 encoded std::string. The wrapper code then has the burden of ensuring that any strings generated in C++ are normalized to UTF-8 before being returned to Python. (Another thing to explain in documentation.)

From Python, you almost never want a function that sometimes returns str and sometimes bytes. That breaks too many simple things that ought to be simple and reliable:

    print("I talked to C++ and it said: " + wrapped_cpp_sometimes_returns_bytes())

Ideally the error would be the underlying UnicodeDecodeError rather than that TypeError, because the former gives the byte offset and offending character sequence.
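
For reference, this is the information a UnicodeDecodeError carries when the same three bytes are decoded in plain Python 3. It is a sketch of what the underlying exception exposes, not of what pybind11 reports; the variable names are illustrative.

```python
bad = b"\xd0\xd0\xd0"  # same bytes as the bad_utf8 test function returns

error = None
try:
    bad.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc                                # keep a reference past the except block
    offset = exc.start                         # byte offset where decoding failed
    offending = exc.object[exc.start:exc.end]  # the bytes that could not be decoded
    message = str(exc)                         # human-readable description
```

The attributes start, end and object are part of the standard exception, so a caller that receives it can locate the bad data without re-scanning the string.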

Round-tripping bytes (possibly containing NULs) could be done with passing and returning py::bytes as you mentioned, and it's conveniently explicit.

jagerman · 2017-01-30T04:49:10Z

PR #624 addresses the error being propagated back to Python.

I didn't address the documentation (except to add u16/u32 types to the table).

jbarlow83 · 2017-02-01T09:14:58Z

Well thanks for this, I think the picture is a lot clearer.

I do think pybind11 core devs may want to evaluate whether the implicit bytes -> std::string conversion should be allowed, since it is not symmetric with the return path: std::string is automatically converted to str, and getting bytes back requires the py::bytes -> bytes workaround.

anntzer · 2017-11-15T02:53:53Z

I agree that it would be nice at least to mark a function as disallowing an implicit bytes ->(utf8)-> std::string. (Here "bytes" and "str" have their Py3 meanings.) An example case would be pathnames: if Python passes in a str, we want to encode it using the filesystem encoding (not necessarily UTF-8); if Python passes in a bytes, we should assume os.fsencode() has already been called on it and just pass it along as is. If pybind11 always does the cast, I believe we can't distinguish between the two cases (other than taking a py::object as argument and typechecking ourselves).

jbarlow83 · 2017-11-15T03:04:26Z

You can mark a function as such by accepting py::bytes as the argument. Then you can implement any conversion in the lambda before dispatching to the C++ codebase. I was thinking the best thing for pathnames would be a special py::pathname type that takes care of all the cases in a version independent way. (Also handling os.PathLike and pathlib paths.)

anntzer · 2017-11-15T03:25:17Z

Ah, great, thanks.
On recent Python versions it's just a matter of calling fsencode, so it's not too much effort to handle this. If pybind11 is going to have special support for this (not saying it has to, but possibly nice), perhaps it's better to provide casters between pathlikes and std::filesystem instead of inventing its own class...
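
A minimal Python-side sketch of that fsencode approach, assuming the wrapped C++ function is bound to accept py::bytes. The helper name to_fs_bytes is hypothetical; os.fsencode itself is standard library and accepts str, bytes and os.PathLike objects.

```python
import os
import pathlib


def to_fs_bytes(path):
    """Hypothetical helper: normalise str, bytes or os.PathLike to bytes.

    str is encoded with the filesystem encoding, bytes is returned unchanged,
    and path-like objects are first resolved through their __fspath__ method.
    """
    return os.fsencode(path)


from_str = to_fs_bytes("data/input.txt")
from_raw = to_fs_bytes(b"data/input.txt")
from_pathlike = to_fs_bytes(pathlib.PurePosixPath("data/input.txt"))
```

All three calls produce the same bytes for this ASCII path, so the C++ side only ever has to deal with one argument type.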

jbarlow83 · 2017-11-15T07:04:42Z

I suppose std::filesystem would be better but it requires C++17. Maybe there isn't an elegant solution yet.

jbarlow83 changed the title ~~Feature request: Unicode conversions page~~ Documentation request: Unicode conversions page Jan 4, 2017

jagerman mentioned this issue Jan 30, 2017

Unicode fixes and docs #624

Merged

jbarlow83 mentioned this issue Feb 1, 2017

RFC - Add documentation for strings and Unicode issues #636

Merged

jbarlow83 closed this as completed Feb 6, 2017

rwgk mentioned this issue Feb 9, 2023

FWD pybind11 google/pybind11clif#591

Closed
