Documentation request: Unicode conversions page #591

Closed
jbarlow83 opened this issue Jan 4, 2017 · 13 comments

@jbarlow83
Contributor

I think it would be helpful to have a new section under Type conversions that describes how pybind11 deals with Unicode conversions in Python 2.7 and 3. (I can't find this documented anywhere.)

@jbarlow83 jbarlow83 changed the title Feature request: Unicode conversions page Documentation request: Unicode conversions page Jan 4, 2017
@jagerman
Member

jagerman commented Jan 28, 2017

The quick version is that pybind11 loads and casts between std::string and python strings assuming UTF-8 (calling Python core functions to do the interpretation/conversion), and assumes UTF-16 or UTF-32 when using std::wstring (the former if wchar_t is 2 bytes, the latter if 4 bytes).
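As a rough illustration of those encodings (plain Python, not pybind11 code; `encoded_sizes` and `sample` are names made up for this sketch), the same text occupies a different number of bytes in each encoding, and `ctypes` can report how wide `wchar_t` is on the current platform:

```python
import ctypes


def encoded_sizes(text):
    """Return the byte length of `text` in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# One ASCII letter, one accented letter, one character outside the BMP.
sample = "a\u00e9\U0001F600"
sizes = encoded_sizes(sample)

# Width of the platform's wchar_t in bytes (typically 2 on Windows, 4 elsewhere).
wchar_width = ctypes.sizeof(ctypes.c_wchar)

print(sizes, wchar_width)
```

The `wchar_width` value is what decides, per the description above, whether a `std::wstring` would be treated as UTF-16 or UTF-32.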

@jbarlow83
Contributor Author

Here are some more specific questions whose answers I think should be documented:

  • If a bytes (Py2 str) is passed to a C++ function accepting std::string, is it implicitly converted to UTF-8 or left alone?
  • Is it possible to return std::string and have Python receive it as bytes (Py2 str)?
  • In Python 2, will a returned std::string be converted to unencoded str, UTF-8 encoded str or unicode?
  • What happens in Py2 and 3 if a std::string cannot be implicitly converted to/from UTF-8?
  • Is there any way to disable UTF-8 conversion and treat all std::string as bytes (str)?

@jagerman
Member

I agree that it would be good to have this documented. Based on my reading of the code (the template <> class type_caster<std::string> { in include/pybind11/cast.h), and Python C API documentation, I believe the answers (and remaining questions!) are:

  • when going from Python to C++ std::string (i.e. type_caster<std::string>::load()) if the passed object is a unicode object (or subclass) the created std::string will be the UTF-8 encoding of the unicode string. If you give it bytes, it'll be left alone.

  • when casting a std::string into Python (i.e. returning a std::string) we call PyUnicode_FromStringAndSize on it unconditionally: we have no way to know whether the string came in as bytes or unicode. The documentation for the Python function just says that it interprets it as UTF-8, but it doesn't say what happens if it is passed invalid UTF-8 data.

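  • unicode: since the cast always goes through PyUnicode_FromStringAndSize, Python 2 receives a unicode object rather than a str.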

  • Good question. The Python C API documentation is remarkably lacking in description of error handling (the Python API is better).

  • Not directly, but you can interact with bytes (or str in Python 2) via the py::bytes class, e.g. by returning a py::bytes(s) where s is a std::string.
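The two directions described in these answers can be modelled in plain Python 3. This is only a sketch of the behavior as read from the comment above, not pybind11's actual implementation; `load_as_std_string` and `cast_to_python` are hypothetical names, and a `bytes` object stands in for the contents of a `std::string`:

```python
def load_as_std_string(obj):
    """Model of Python -> std::string: str is UTF-8 encoded, bytes pass through."""
    if isinstance(obj, str):
        return obj.encode("utf-8")
    if isinstance(obj, bytes):
        return obj  # left alone, whatever encoding it happens to be in
    raise TypeError("expected str or bytes")


def cast_to_python(raw):
    """Model of std::string -> Python: always interpreted as UTF-8."""
    return raw.decode("utf-8")


# A str survives the round trip unchanged.
round_tripped = cast_to_python(load_as_std_string("h\u00e9llo"))

# Bytes go in untouched, but come back as str if they happen to be valid UTF-8.
came_back = cast_to_python(load_as_std_string(b"abc"))
```

The asymmetry is visible in the last line: `bytes` was passed in, yet `str` comes back, which is why returning `py::bytes` explicitly is suggested for binary data.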

@jbarlow83
Contributor Author

By code inspection it looks like Python will raise a UnicodeDecodeError if PyUnicode_FromStringAndSize fails.

However the current behavior from pybind11 2.0.1 (arguably a bug) is to return this kind of error:

TypeError: Unable to convert function return value to a Python type! The signature was
	() -> str

It's possibly a bug because it suppresses information that could be used to solve the problem.

My test function was:

    m.def("bad_utf8",
        []() -> std::string {
            return std::string("\xd0\xd0\xd0"); // not utf-8
        }
    );

@jbarlow83
Contributor Author

It would also be useful to document what pybind11 does with single character literals and wchar_t in each direction.

@jagerman
Member

I'm not sure if it should just report a better error, or actually return a bytes in that case. (The latter would make round-tripping of bytes data work, as long as the data didn't happen to be a valid UTF-8 sequence with high-bit bytes).

@jbarlow83
Contributor Author

jbarlow83 commented Jan 29, 2017

My thinking is that there should be a 1:1 correspondence between std::string and Python3 str. It is already true that any str can be represented as a utf-8 encoded std::string. The wrapper code then has the burden of ensuring that any strings generated in C++ are normalized to utf-8 before being returned to Python. (Another thing to explain in documentation.)

From Python, you almost never want a function that sometimes returns str and sometimes bytes. That breaks too many simple things that ought to be simple and reliable:

    print("I talked to C++ and it said: " + wrapped_cpp_sometimes_returns_bytes())
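A small Python 3 sketch of how that line breaks. The wrapper here is a hypothetical stand-in written in Python (it takes the raw data as an argument so both outcomes can be shown); it imitates a binding that falls back to `bytes` whenever the data is not valid UTF-8:

```python
def sometimes_returns_bytes(raw):
    """Hypothetical wrapper: str when `raw` is valid UTF-8, bytes otherwise."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw


def report(raw):
    # Works only while the wrapper happens to return str.
    return "I talked to C++ and it said: " + sometimes_returns_bytes(raw)


ok = report(b"hello")

try:
    report(b"\xd0\xd0\xd0")
    failed_with_type_error = False
except TypeError:
    # str + bytes is not allowed in Python 3.
    failed_with_type_error = True
```

The caller's code is identical in both cases; whether it works depends entirely on the data the C++ side happened to produce.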

Ideally the error would be the underlying UnicodeDecodeError rather than that TypeError, because the former gives the byte offset and offending character sequence.
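To show what information the underlying exception carries, here is a sketch in plain Python 3 that decodes the same bytes as the `bad_utf8` test function above. `describe_decode_failure` is a made-up helper name; the attributes it reads (`encoding`, `start`, `end`, `object`, `reason`) are standard on `UnicodeDecodeError`:

```python
def describe_decode_failure(raw):
    """Return details of the UTF-8 decode error for `raw`, or None if it decodes."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return {
            "encoding": exc.encoding,
            "start": exc.start,                         # offset of the first bad byte
            "offending": exc.object[exc.start:exc.end],  # the bytes that failed
            "reason": exc.reason,
        }
    return None


info = describe_decode_failure(b"\xd0\xd0\xd0")
print(info)
```

None of these details are available from the generic TypeError, which only reports the function signature.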

Round-tripping bytes (possibly containing NULs) could be done with passing and returning py::bytes as you mentioned, and it's conveniently explicit.

@jagerman
Member

PR #624 addresses the error being propagated back to Python.

I didn't address the documentation (except to add u16/u32 types to the table).

@jbarlow83
Contributor Author

Well thanks for this, I think the picture is a lot clearer.

I do think pybind11 core devs may want to evaluate whether implicit bytes -> std::string conversion should be allowed, since it is not symmetric with the return direction: a returned std::string is automatically converted to str, and getting bytes back requires the py::bytes workaround.

@anntzer
Contributor

anntzer commented Nov 15, 2017

I agree that it would be nice at least to be able to mark a function as disallowing the implicit bytes ->(utf8)-> std::string conversion. (Here "bytes" and "str" have their Py3 meanings.) An example case would be pathnames: if Python passes in a str, we want to encode it using the filesystem encoding (not necessarily utf-8); if Python passes in a bytes, we should assume os.fsencode() has already been called on it and pass it through unchanged. If pybind11 always does the cast, I believe we can't distinguish between the two cases (other than taking a py::object as argument and typechecking ourselves).
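On the Python side, the distinction being asked for looks roughly like the sketch below. `to_filesystem_bytes` is a hypothetical helper, not part of pybind11; `os.fsencode` and `os.fsdecode` are the standard library functions for the filesystem encoding:

```python
import os


def to_filesystem_bytes(path):
    """Encode str with the filesystem encoding; pass bytes through untouched."""
    if isinstance(path, bytes):
        return path  # assume the caller already applied os.fsencode()
    if isinstance(path, str):
        return os.fsencode(path)
    raise TypeError("path must be str or bytes")


from_str = to_filesystem_bytes("data.txt")
from_bytes = to_filesystem_bytes(b"\xff-not-utf8.txt")
```

A wrapper could call something like this before handing the result to a C++ function, so the C++ side only ever sees already-encoded bytes.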

@jbarlow83
Contributor Author

jbarlow83 commented Nov 15, 2017 via email

@anntzer
Contributor

anntzer commented Nov 15, 2017

Ah, great, thanks.
On recent pythons it's just a matter of calling fsencode so it's not too much effort to handle this. If pybind11 is going to have special support for this (not saying it has to, but possibly nice), perhaps it's better to provide casters between pathlikes and std::filesystem instead of inventing its own class...

@jbarlow83
Contributor Author

I suppose std::filesystem would be better but it requires C++17. Maybe there isn't an elegant solution yet.
