Description
Hello,
Thanks for your work on this, it's been most useful for me.
I have encountered one error and fixed it, so I'm passing it on to see if you agree it's an error, and if it needs inclusion as a revision.
Currently I'm moving contact data between programs, and it's crashing on the decode of some French names.
The problem is in _pack_string. It's calculating the length of the string before it's encoded to UTF-8.
I think you must encode the string before you find its length, as some characters encode to two or more bytes in UTF-8.
An example would be the French name Allagbé, or the French word précédent, where the 'e' carries an acute accent (U+00E9, http://www.fileformat.info/info/unicode/char/e9/index.htm)
Python's encoder makes this b'Allagb\xc3\xa9', which is 8 bytes, one more than the 7 characters of the original string.
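To make the mismatch concrete, here is a minimal sketch, assuming Python 3, that compares the character count with the encoded byte count. The variable names are mine and only for illustration.

```python
# The name "Allagbé", with the accented letter written as an escape (U+00E9)
name = "Allagb\u00e9"

encoded = name.encode("utf-8")

char_count = len(name)      # counts code points in the str
byte_count = len(encoded)   # counts bytes after UTF-8 encoding

print(char_count, byte_count, encoded)
```

The msgpack length prefix has to describe the bytes that follow it, so it is the second number that belongs in the header.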
u-msgpack encodes this as b'\xa7Allagb\xc3\xa9'. Notice that the header byte \xa7 declares a length of only 7 bytes while the payload is 8 bytes, so a decoder reads b'Allagb\xc3' and drops the trailing \xa9.
When you feed this trimmed string through a .decode('utf-8') method, it crashes with a Python error: 'utf-8' codec can't decode byte 0xc3 in position 6: unexpected end of data
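The failure can be reproduced without the library. The sketch below assumes Python 3; buggy_pack_fixstr and read_fixstr_payload are hypothetical stand-ins I wrote to mimic the behaviour described above, not functions from u-msgpack.

```python
import struct

def buggy_pack_fixstr(s):
    # Hypothetical stand-in for the reported bug: the length is taken
    # from the str (characters) instead of from the encoded bytes.
    return struct.pack("B", 0xa0 | len(s)) + s.encode("utf-8")

def read_fixstr_payload(packed):
    # Hypothetical minimal reader: the low 5 bits of a fixstr header
    # hold the payload length in bytes.
    length = packed[0] & 0x1f
    return packed[1:1 + length]

packed = buggy_pack_fixstr("Allagb\u00e9")
payload = read_fixstr_payload(packed)   # last byte of the name is cut off

try:
    payload.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(packed, payload, decode_error)
```

The reader trusts the header, so it stops one byte short and hands an incomplete two-byte sequence to the UTF-8 decoder.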
The solution is to encode to UTF-8 before calculating the string length, as detailed below:
    def _pack_string(x):
        # Encode first so that len(x) counts bytes, not characters
        x = x.encode('utf-8')
        if len(x) <= 31:
            # fixstr: length stored in the low 5 bits of the header byte
            return struct.pack("B", 0xa0 | len(x)) + x
        elif len(x) <= 2**8-1:
            return b"\xd9" + struct.pack("B", len(x)) + x
        elif len(x) <= 2**16-1:
            return b"\xda" + struct.pack(">H", len(x)) + x
        elif len(x) <= 2**32-1:
            return b"\xdb" + struct.pack(">I", len(x)) + x
        else:
            raise UnsupportedTypeException("huge string")
With this patch in place, I am able to pass all French names in u-msgpack without error.
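For anyone who wants to check the patch in isolation, here is a self-contained sketch, assuming Python 3. pack_string_fixed is my own standalone copy of the patched logic, and it raises ValueError in place of the library's UnsupportedTypeException so that it runs without u-msgpack installed.

```python
import struct

def pack_string_fixed(s):
    # Standalone copy of the patched logic (hypothetical name):
    # every length check uses the encoded byte count.
    data = s.encode("utf-8")
    n = len(data)
    if n <= 31:
        return struct.pack("B", 0xa0 | n) + data
    elif n <= 2**8 - 1:
        return b"\xd9" + struct.pack("B", n) + data
    elif n <= 2**16 - 1:
        return b"\xda" + struct.pack(">H", n) + data
    elif n <= 2**32 - 1:
        return b"\xdb" + struct.pack(">I", n) + data
    # ValueError is a stand-in for the library's own exception type
    raise ValueError("huge string")

# 7 characters, 8 bytes: the header must declare 8
short = pack_string_fixed("Allagb\u00e9")

# 31 characters but 32 bytes: too long for fixstr, so str 8 is needed
boundary = pack_string_fixed("a" * 30 + "\u00e9")

print(short, boundary[:2], len(boundary))
```

The second case shows that the byte count also decides which string format is chosen, not just the length value written into the header.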
-end-