Unicode fixes and docs #624
include/pybind11/cast.h
Outdated
std::is_same<CharT, char>::value ? 8 :
std::is_same<CharT, char16_t>::value ? 16 :
std::is_same<CharT, char32_t>::value ? 32 :
(sizeof(CharT) == 2 ? 16 : 32); /* std::wstring is UTF-16 on Windows, UTF-32 everywhere else */
Could this assumption be removed using something like sizeof(wchar_t)?
UTF_N = sizeof(CharT)*8?
The char16_t and char32_t aren't guaranteed to be 2 and 4 bytes, only that they are at least 2/4 bytes (they are more like std::uint_least16_t). They are guaranteed to be unique types, however, hence checking the actual types rather than just the sizes.
That all makes sense then.
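The reasoning in this thread can be sketched as a small compile-time helper. This is only an illustration under stated assumptions: `utf_bits` is a hypothetical name, not a pybind11 function, and the `sizeof` fallback is meant to apply to `wchar_t` only.

```cpp
#include <type_traits>

// Hypothetical helper, not pybind11's actual code: choose the UTF width from
// the character type itself and fall back to sizeof only for wchar_t.
template <typename CharT>
constexpr int utf_bits() {
    return std::is_same<CharT, char>::value     ? 8
         : std::is_same<CharT, char16_t>::value ? 16
         : std::is_same<CharT, char32_t>::value ? 32
         // wchar_t: 2 bytes on Windows, typically 4 bytes elsewhere
         : (sizeof(CharT) == 2 ? 16 : 32);
}

static_assert(utf_bits<char>() == 8, "char is treated as UTF-8");
static_assert(utf_bits<char16_t>() == 16, "char16_t is treated as UTF-16");
static_assert(utf_bits<char32_t>() == 32, "char32_t is treated as UTF-32");
```

Checking the types first means the result for `char16_t` and `char32_t` does not depend on how wide those types happen to be on a given platform.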
include/pybind11/cast.h
Outdated
}
// Helper class for UTF-{8,16,32} strings:
template <typename CharT, class Traits, class Allocator>
struct type_caster<std::basic_string<CharT, Traits, Allocator>, enable_if_t<is_std_char_type<CharT>::value>> {
SFINAE potentially overkill: is there ever a situation where we could get a std::basic_string that uses a non-character type?
In theory you could create a std::basic_string<int>. We aren't likely to see it, but I figured better to be explicit in what we support here.
ok!
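A minimal sketch of the "be explicit about what we support" idea, using only the standard library. The names `is_char_type` and `is_supported_string` are hypothetical stand-ins for illustration; the snippet above uses `is_std_char_type` inside pybind11's `type_caster`, which is not reproduced here.

```cpp
#include <string>
#include <type_traits>

// Hypothetical trait: true only for the four standard character types.
template <typename T>
struct is_char_type
    : std::integral_constant<bool,
          std::is_same<T, char>::value || std::is_same<T, char16_t>::value ||
          std::is_same<T, char32_t>::value || std::is_same<T, wchar_t>::value> {};

// Primary template: anything we do not explicitly claim to support.
template <typename T, typename Enable = void>
struct is_supported_string : std::false_type {};

// Partial specialization, enabled via SFINAE only for character-based strings.
template <typename CharT, class Traits, class Allocator>
struct is_supported_string<
    std::basic_string<CharT, Traits, Allocator>,
    typename std::enable_if<is_char_type<CharT>::value>::type> : std::true_type {};

static_assert(is_supported_string<std::u16string>::value, "u16string is supported");
static_assert(!is_supported_string<int>::value, "non-strings are not supported");
```

With this shape, a `std::basic_string<int>` simply falls through to the primary template instead of being picked up by the string specialization.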
include/pybind11/cast.h
Outdated
object utfNbytes = reinterpret_steal<object>(PyUnicode_AsEncodedString(
load_src.ptr(),
UTF_N == 8 ? "utf8" : UTF_N == 16 ? "utf16" : "utf32",
could be moved out of the function call into a constexpr const char *
Won't it be treated that way anyway, given that the UTF_N is a constexpr? Never mind, I see it's used more than once.
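One way the suggestion could look, as a sketch: the encoding name is computed once from the UTF width and reused by both the encode and decode paths. `encoding_name` is a hypothetical helper name; the strings are the ones shown in the snippet above.

```cpp
// Hypothetical helper: map a UTF width (8, 16 or 32) to the codec name used
// in the reviewed snippet, so the ternary chain is written only once.
template <int UTF_N>
constexpr const char *encoding_name() {
    static_assert(UTF_N == 8 || UTF_N == 16 || UTF_N == 32, "unsupported UTF width");
    return UTF_N == 8 ? "utf8" : UTF_N == 16 ? "utf16" : "utf32";
}

// Both call sites can then share the same constant.
constexpr const char *utf16_name = encoding_name<16>();
```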
include/pybind11/cast.h
Outdated
static handle cast(const StringType &src, return_value_policy /* policy */, handle /* parent */) {
const char *buffer = reinterpret_cast<const char *>(src.c_str());
ssize_t nbytes = ssize_t(src.size() * sizeof(CharT));
handle s = PyUnicode_Decode(buffer, nbytes, UTF_N == 8 ? "utf8" : UTF_N == 16 ? "utf16" : "utf32", nullptr);
same here
This looks great -- I added a few minor comments.
df6e1e8 to a2f41b4
Rebased onto master (and squashed in the
a2f41b4 to 6fe53d2
include/pybind11/cast.h
Outdated
return PyUnicode_FromWideChar(wstr, 1);
static handle cast(CharT src, return_value_policy policy, handle parent) {
if (std::is_same<char, CharT>::value) {
handle s = PyUnicode_DecodeLatin1((const char *) &src, 1, nullptr);
Why is DecodeLatin1 used here?
Because we have to do something with a single char in the 128-255 range, and since Latin1 codepoints are identical to unicode codepoints, this seemed the most logical choice.
That makes sense.
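The point about Latin-1 can be illustrated without the Python C API. In this sketch (hypothetical helper names, standard library only), a single `char` is mapped to a Unicode code point by reading its byte value unchanged, and the UTF-8 encoder shows what that code point looks like once encoded.

```cpp
#include <string>

// Latin-1 maps byte values 0x00..0xFF onto code points U+0000..U+00FF
// unchanged, so the conversion is just an unsigned widening.
inline char32_t latin1_to_codepoint(char c) {
    return static_cast<unsigned char>(c);
}

// UTF-8 encoding for code points below U+0800 only (enough for Latin-1).
// Precondition (assumed, not checked): cp < 0x800.
inline std::string utf8_encode_small(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

For example, the byte 0xE9 becomes U+00E9, which UTF-8 encodes as the two bytes 0xC3 0xA9.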
@jagerman Have you checked what happens in this PR if a Python str is cast to a char and the Unicode code point is U+0100 and above, or char16_t for U+10000 and above? I found that in master this currently truncates the code point which is probably incorrect (and maybe endian unsafe too). Since you changed char casting perhaps your changes address this already? If not I will write up the issue.
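The question above amounts to a range check on the code point before narrowing it to a single character. A rough sketch, under two assumptions that are not taken from the PR's code: `char` is taken to carry code points up to 0xFF (the Latin-1 reading used elsewhere in this conversation), and any other 2-byte character type holds one UTF-16 code unit. `fits_in_one_unit` is a hypothetical name.

```cpp
#include <type_traits>

// Hypothetical check: can `cp` be represented by exactly one CharT value?
template <typename CharT>
constexpr bool fits_in_one_unit(char32_t cp) {
    return std::is_same<CharT, char>::value ? cp <= 0xFFu      // assumed Latin-1 range
         : sizeof(CharT) == 2               ? cp <= 0xFFFFu    // one UTF-16 code unit
                                            : cp <= 0x10FFFFu; // any Unicode code point
}

static_assert(!fits_in_one_unit<char>(0x100), "U+0100 does not fit in a char");
```

Truncating instead of checking would silently turn U+0100 into 0x00 for `char`, which is the behavior the comment describes on master.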
(Edit: ignore all these; see my later comment below). For all of the char types, it decodes to the UTF-n encoding in native byte order, then returns the first For For
Okay, I will open a new issue to describe this behavior.
No need, your comment here suffices.
Actually, scratch all that I said above. The So I guess my question: is this merely a theoretical problem, or is this a problem you're running into in actual code?
It's not a problem I have, it's a corner case I discovered because @wjakob asked for my Unicode documentation to mention character literals. I agree that modern C++11 code is unlikely to accept The interface has several functions like We've established that the pybind11 behavior is incorrect in that it mangles UTF-8, but it is a corner case. I agree it may not even be worth fully fixing, given that Do you think it would be possible to have a
I just pushed a change to split up the
a0a98d4 to a79b5f6
Will fight with MSVC and figure out what to do with PyPy tomorrow.
a79b5f6 to 49625f4
Pushed a rebased/squashed version with MSVC/PyPy fixes, and a cleanup to the pointer type_caster implementation.
49625f4 to 31c4c78
... and again to fix the conflict.
include/pybind11/cast.h
Outdated
static PYBIND11_DESCR name() { return type_descr(_(PYBIND11_STRING_NAME)); }
// chr()/unichr() isn't technically a type, but it should get the point across:
PYBIND11_TYPE_CASTER(CharT, _(PYBIND11_CHR_NAME "(") + _<(max <= 0xffff)>(_("<=") + _<max>(), _("")) + _(")"));
This seems like a quite strange type annotation (explicitly specifying the range). Perhaps just chr/unichr?
I agree that it's a bit strange, but I have a feeling it's going to be rather confusing to see an error about invalid function arguments with a unichar accepted, but only certain unichar actually accepted. (I guess this is more or less the same issue with #651).
I suppose, bigger picture, it would be nice to be able to override or supplement the call failure message so that a caster could return an argument failure with a reason (e.g. "Only char values <= 255 are accepted") and have that reason displayed in the exception (but, I suppose, only if the function isn't overloaded).
If you insist on having an informative type annotation, could we call it unichr16 or something like that? Or the type annotation could be more pythonic so that type signature parsers won't be thrown off (e.g. unichr[encoding='utf16'] or unichr[max=0xffff]), but I think that's less readable.
I don't insist; the latest push changed it to just chr()/unichr(). I'm not sure of a better choice here, since the type is inherently non-pythonic.
Re: unichar[encoding='utf16'] is a bit misleading: we don't care about the specific encoding. I propose we leave it without the range and just leave it up to the binder to mention it in the function description.
Are the brackets in chr() going to cause problems--i.e. should it be shortened to just chr?
Hmm. I'm sort of ambivalent about what to report here, but I suppose I'd marginally rank them as char[max=0xffff] > chr() > char16 > chr. (With "uni" prepended for python 2.7).
Round brackets will cause problems. Parameterized types use square brackets in type annotations.
(ok, please feel free to take your top pick then.)
I'm a bit skeptical about the spun-off
The main point is that the unified caster is broken in its current state for char values: it isn't valid to send the first byte of a UTF-8 string as a
The single-gatekeeper class MyCaster { ... };
DECLARE_CUSTOM_CASTER(MyType, MyCaster); rather than having custom casters declare themselves in the (But of course, this doesn't necessarily mean we need the char caster support).
I get the issue, I'm just wondering if the expense is worth it for a pretty obscure case. The alternative being documenting it as a limitation which isn't even likely to come up.
Wouldn't just moving
I don't see how it could be done without breaking backwards compatibility (which, I'm guessing, is the main reason it's in But basically, this was just an afterthought: I didn't write it with that intention in the first place.
@dean0x7d I provided an example of it in use further. It was not difficult to find working C++ that accepts I think refusing to compile a direct binding to
I think an informative error is quite doable:
// the unified char caster with potential silent failure for a single char
operator CharT&() { return value[0]; }
// unified caster with loud failure
operator CharT&() {
    if (value.size() == 1)
        return value[0];
    else
        throw InformativeError();
}
@dean0x7d: Oh wow, I'm really surprised that it wasn't like that before. That's definitely a very severe oversight. I also agree with @dean0x7d's point that adding a layer of indirection to every type caster just to special-case
fb174cf to d1964f9
If returning a std::string with invalid utf-8 data, we currently fail with an uninformative TypeError instead of propagating the UnicodeDecodeError that Python sets on failure.
This adds support for char{16,32}_t character literals and the associated std::u{16,32}string types. It also folds the character/string conversion into a single type_caster template, since the type casters for string and wstring were mostly the same anyway.
d1964f9 to 24df2d9
I've reverted the caster changes, and did the loud failure during load-casting as @dean0x7d suggested. (Though a little more complex to distinguish between codepoint-too-large and too-many-character errors).
With this commit, when casting to a single character, as opposed to a C-style string, we make sure the input wasn't a multi-character string or a single character with codepoint too large for the character type.
This also changes the character cast op to CharT instead of CharT& (we need to be able to return a temporary decoded char value, but also because there's little gained by bothering with an lvalue return here).
Finally it changes the char caster to 'has-a-string-caster' instead of 'is-a-string-caster' because, with the cast_op change above, there's nothing at all gained from inheritance. This also lets us move the `success` from the string caster (which was only there for the char caster) into the char caster itself. (I also renamed it to 'none' and inverted its value to better reflect its purpose).
The None -> nullptr loading also now takes place only under a `convert = true` load pass. Although it's unlikely that a function taking a char also has overloads that can take a None, it seems marginally more correct to treat it as a conversion.
This commit simplifies the size assumptions about character sizes with static_asserts to back them up.
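The two failure modes described in this commit message can be sketched in isolation. The function below is a hypothetical, standard-library-only illustration, not pybind11's implementation: it assumes the input is well-formed UTF-8 and that a `char` may carry code points 0 through 0xFF, and it reports "too many characters" and "code point too large" as different exception types.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a UTF-8 string to a single `char`, failing
// loudly instead of silently returning the first byte.
inline char utf8_to_single_char(const std::string &s) {
    if (s.empty())
        throw std::invalid_argument("Expected a character, but got an empty string");
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    // Byte length of the first encoded code point, read from its lead byte.
    const std::size_t len = b0 < 0x80           ? 1
                          : (b0 & 0xE0) == 0xC0 ? 2
                          : (b0 & 0xF0) == 0xE0 ? 3
                          : (b0 & 0xF8) == 0xF0 ? 4
                                                : 0;
    if (len == 0 || s.size() < len)
        throw std::invalid_argument("Invalid or truncated UTF-8 sequence");
    if (s.size() > len)
        throw std::invalid_argument("Expected a character, but multi-character string found");
    if (len == 1)
        return s[0];
    // Lead bytes 0xC2 and 0xC3 cover exactly U+0080..U+00FF.
    if (len == 2 && (b0 == 0xC2 || b0 == 0xC3)) {
        const unsigned char b1 = static_cast<unsigned char>(s[1]);
        return static_cast<char>(((b0 & 0x03) << 6) | (b1 & 0x3F));
    }
    throw std::out_of_range("Character code point not in range(0x100)");
}
```

Returning the decoded value by copy, rather than a reference into the string, mirrors the change from `CharT&` to `CharT` mentioned above, since the result for a two-byte sequence is a temporary.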
24df2d9 to 86ca73f
Fixes clang's sign conversion warnings.
This looks really great -- thank you for working out all the nitty gritty details regarding character casts.
The string conversion logic added in PR pybind#624 for all std::basic_strings was derived from the old std::wstring logic, but that was underused and turns out to have had a bug in accepting almost anything convertible to unicode, while the previous std::string logic was much stricter. This restores the previous std::string logic by only allowing actual unicode or string types. Fixes pybind#685.
* Make string conversion stricter
  The string conversion logic added in PR #624 for all std::basic_strings was derived from the old std::wstring logic, but that was underused and turns out to have had a bug in accepting almost anything convertible to unicode, while the previous std::string logic was much stricter. This restores the previous std::string logic by only allowing actual unicode or string types. Fixes #685.
* Added missing 'requires numpy' decorator
  (I forgot that the change to a global decorator here is in the not-yet-merged Eigen PR)
The Unicode support added in 2.1 (PR pybind#624) inadvertently broke accepting `bytes` as std::string/char* arguments. This restores it with a separate path that does a plain conversion (i.e. completely bypassing all the encoding/decoding code), but only for single-byte string types.
As discussed in #591, we currently don't fail gracefully when encountering invalid unicode. This PR propagates the unicode error, while also adding support for std::u16string, std::u32string and the associated char16_t/char32_t types.