Double-byte (or larger) UTF-8 strings are encoded with the wrong size. #8

Hello,

 Thanks for your work on this, it's been most useful for me. 

 I have encountered one error and fixed it, so I'm passing it on to see if you agree it's an error, and if it needs inclusion as a revision.

 Currently I'm moving contact data between programs, and it's crashing on the decode of some French names. 

 The problem is in _pack_string. It's calculating the length of the string before it's encoded to UTF-8.

 I think you must encode the string before you find its length, because some characters encode to two or more bytes in UTF-8, so the character count and the byte count differ.

 An example would be the French name Allagbé, or the French word précédent, where the 'e' carries an acute accent (http://www.fileformat.info/info/unicode/char/e9/index.htm)

 Python's encoder makes this b'Allagb\xc3\xa9', which is one byte longer than the original 7-character string.
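
 To make the mismatch concrete, here is a minimal sketch (Python 3 assumed; the variable names are only for illustration) comparing the character count with the encoded byte count:

```python
# "Allagbé", written with an escape so the snippet itself stays ASCII.
name = "Allagb\u00e9"
encoded = name.encode("utf-8")

char_count = len(name)     # counts Unicode code points
byte_count = len(encoded)  # counts bytes after UTF-8 encoding

print(char_count, byte_count, encoded)  # 7 8 b'Allagb\xc3\xa9'
```

 The msgpack length field has to carry the second number, not the first.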

 u-msgpack encodes this as b'\xa7Allagb\xc3\xa9' - notice how the \xa7 header declares a length of only 7 bytes while the payload is 8 bytes long, so a decoder trims the \xa9 byte from the string.

 When you feed this trimmed string through a .decode('utf-8') method, you'll crash with a Python error: 'utf-8' codec can't decode byte 0xc3 in position 6: unexpected end of data
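
 The failure can be reproduced without the library. The sketch below (Python 3 assumed) hand-parses the fixstr header the way a decoder would; the variable names are hypothetical and this is not u-msgpack's actual unpacking code:

```python
# Bytes produced by the unpatched packer for "Allagbé".
packed = b"\xa7Allagb\xc3\xa9"

# In a fixstr header (0xa0..0xbf) the low 5 bits hold the declared length.
declared_len = packed[0] & 0x1f

# A decoder reads exactly that many bytes, cutting the 2-byte 'é' in half.
payload = packed[1:1 + declared_len]

try:
    payload.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(declared_len, payload, error)
```

 The leftover \xa9 byte would also be misread as the start of the next msgpack object, so the damage is not limited to this one string.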

 The solution is to encode to UTF-8 before calculating the string length, as detailed below:

def _pack_string(x):
    # Encode first so that every len(x) below counts bytes, not characters.
    x = x.encode('utf-8')
    if len(x) <= 31:
        return struct.pack("B", 0xa0 | len(x)) + x
    elif len(x) <= 2**8-1:
        return b"\xd9" + struct.pack("B", len(x)) + x
    elif len(x) <= 2**16-1:
        return b"\xda" + struct.pack(">H", len(x)) + x
    elif len(x) <= 2**32-1:
        return b"\xdb" + struct.pack(">I", len(x)) + x
    else:
        raise UnsupportedTypeException("huge string")

With this patch in place, I am able to pass all French names in u-msgpack without error. 
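
 If it helps with review, here is a rough standalone check of the patched logic (Python 3 assumed). pack_string_patched is a hypothetical copy of the function above, and the exception class is a stand-in so the snippet runs outside the library:

```python
import struct


class UnsupportedTypeException(Exception):
    """Stand-in for the library's exception so this sketch runs on its own."""


def pack_string_patched(x):
    # Same logic as the patch: encode first, then measure the byte length.
    data = x.encode("utf-8")
    n = len(data)
    if n <= 31:
        return struct.pack("B", 0xa0 | n) + data
    elif n <= 2**8 - 1:
        return b"\xd9" + struct.pack("B", n) + data
    elif n <= 2**16 - 1:
        return b"\xda" + struct.pack(">H", n) + data
    elif n <= 2**32 - 1:
        return b"\xdb" + struct.pack(">I", n) + data
    raise UnsupportedTypeException("huge string")


# 7 characters, 8 bytes: the header must now be 0xa8 rather than 0xa7.
short = pack_string_patched("Allagb\u00e9")

# 31 accented characters are 62 bytes, too long for fixstr, so the
# str 8 form (0xd9 followed by a one-byte length) has to be chosen.
longer = pack_string_patched("\u00e9" * 31)

print(short, longer[:2], len(longer))
```

 The second case matters because the character count also picked the wrong header format near the size boundaries, not just the wrong length value.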

-end-

