
Double-byte (or larger) UTF-8 strings are encoded with the wrong size. #8

@cforger

Hello,

Thanks for your work on this, it's been most useful for me.

I have encountered one error and fixed it, so I'm passing it on to see if you agree it's an error, and if it needs inclusion as a revision.

Currently I'm moving contact data between programs, and it's crashing on the decode of some French names.

The problem is in _pack_string. It's calculating the length of the string before it's encoded to UTF-8.

I think you must encode the string before you take its length, because some characters encode to two or more bytes in UTF-8.

An example would be the French name Allagbé, or the French word précédent, where the 'e' carries an acute accent (http://www.fileformat.info/info/unicode/char/e9/index.htm).

Python's encoder makes this b'Allagb\xc3\xa9', which is one byte longer than the original string.
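The difference between character count and byte count can be checked directly. This is a minimal sketch using only built-in Python 3 behaviour; the variable names are illustrative, and the accented letter is written as the escape \u00e9 so the snippet does not depend on the source file's encoding.

```python
# The name from the report, with the accented 'e' (U+00E9) written as an escape.
name = "Allagb\u00e9"
encoded = name.encode("utf-8")

print(len(name))     # number of characters in the str: 7
print(len(encoded))  # number of bytes after UTF-8 encoding: 8
print(encoded)       # b'Allagb\xc3\xa9'
```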

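u-msgpack encodes this as b'\xa7Allagb\xc3\xa9'. Notice that the \xa7 header declares a length of only 7 bytes, while the UTF-8 payload is 8 bytes long, so a decoder that trusts the header reads 7 bytes and drops the trailing \xa9.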

When you feed this trimmed string through a .decode('utf-8') method, you'll crash with a Python error: 'utf-8' codec can't decode byte 0xc3 in position 6: unexpected end of data
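The failure can be reproduced without the library. The sketch below is a hypothetical reconstruction of the behaviour described above, not the actual u-msgpack source: pack_fixstr_buggy is an invented helper that takes the length from the str and only then encodes, and the decoding step simply reads as many bytes as the header declares.

```python
import struct

def pack_fixstr_buggy(s):
    # Assumed shape of the bug: the length is counted in characters,
    # but the payload that follows is UTF-8 bytes.
    header = struct.pack("B", 0xa0 | len(s))
    return header + s.encode("utf-8")

packed = pack_fixstr_buggy("Allagb\u00e9")

# A decoder trusts the low 5 bits of the fixstr header for the byte count.
declared_len = packed[0] & 0x1f
payload = packed[1:1 + declared_len]

try:
    payload.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(packed)        # b'\xa7Allagb\xc3\xa9'
print(declared_len)  # 7
print(payload)       # b'Allagb\xc3' (the trailing \xa9 is cut off)
print(error)
```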

The solution is to encode to UTF-8 before calculating the string length, as detailed below:

def _pack_string(x):
    # Encode first, so that len(x) counts UTF-8 bytes rather than characters.
    x = x.encode('utf-8')
    if len(x) <= 31:
        return struct.pack("B", 0xa0 | len(x)) + x
    elif len(x) <= 2**8-1:
        return b"\xd9" + struct.pack("B", len(x)) + x
    elif len(x) <= 2**16-1:
        return b"\xda" + struct.pack(">H", len(x)) + x
    elif len(x) <= 2**32-1:
        return b"\xdb" + struct.pack(">I", len(x)) + x
    else:
        raise UnsupportedTypeException("huge string")

With this patch in place, I am able to pass all French names in u-msgpack without error.
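For anyone who wants to check the patch in isolation, here is a self-contained round-trip sketch. It is written under assumptions: pack_string mirrors the patched function, UnsupportedTypeException is a local stand-in for the library's exception, and unpack_string is a deliberately minimal decoder for the four MessagePack str formats, written only for this check and not taken from u-msgpack.

```python
import struct

class UnsupportedTypeException(Exception):
    """Local stand-in so the sketch runs on its own."""

def pack_string(x):
    x = x.encode("utf-8")          # encode first: n is a byte count
    n = len(x)
    if n <= 31:
        return struct.pack("B", 0xa0 | n) + x
    elif n <= 2**8 - 1:
        return b"\xd9" + struct.pack("B", n) + x
    elif n <= 2**16 - 1:
        return b"\xda" + struct.pack(">H", n) + x
    elif n <= 2**32 - 1:
        return b"\xdb" + struct.pack(">I", n) + x
    raise UnsupportedTypeException("huge string")

def unpack_string(data):
    """Minimal decoder for fixstr, str 8, str 16 and str 32 (checking only)."""
    first = data[0]
    if first & 0xe0 == 0xa0:
        n, offset = first & 0x1f, 1
    elif first == 0xd9:
        n, offset = data[1], 2
    elif first == 0xda:
        n, offset = struct.unpack(">H", data[1:3])[0], 3
    elif first == 0xdb:
        n, offset = struct.unpack(">I", data[1:5])[0], 5
    else:
        raise ValueError("not a msgpack string")
    return data[offset:offset + n].decode("utf-8")

# Accented samples sized to land in each of the four formats.
samples = [
    "Allagb\u00e9",          # 8 bytes      -> fixstr
    "pr\u00e9c\u00e9dent",   # 11 bytes     -> fixstr
    "\u00e9" * 16,           # 32 bytes     -> str 8
    "\u00e9" * 200,          # 400 bytes    -> str 16
    "\u00e9" * 40000,        # 80000 bytes  -> str 32
]
round_trips = [unpack_string(pack_string(s)) == s for s in samples]
print(round_trips)
```

Each sample consists of or contains two-byte characters, so any header that counted characters instead of bytes would make the corresponding round trip fail.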

-end-
