Support SAX-style unpack API #418

@jpetso

It would be super neat if I could use msgpack as a pure wire format reader/visitor without having to create a result msgpack::object.

For instance, my class msgpack_from_json_handler can be passed to rapidjson::Reader::Parse() as a parameter and will be called back on every element in the JSON tree.

With unpack_reference_func(), I can avoid copying all string data with a zone. However, I still have to let msgpack generate a msgpack::object tree, with allocations depending on the non-leaf objects / depth of the tree. In some situations, a simple visitor pattern would work better.

Here are two example use cases:

  • Converting msgpack to a custom type/variant without allocations ("use case 1")

Elvis's library deals mostly with JSON data. Internally he stores much of his data as RapidJSON variant values. He decides he also wants to support serialization to and from msgpack. He needs to process lots of data and packing/unpacking is a hot code path, so it needs to perform as fast as possible. For serializing to msgpack, he can use the pack()/pack_map()/pack_array() functions and json-msgpack. He also wants to be able to read directly from a msgpack buffer into a RapidJSON variant, skipping the intermediate msgpack::object, because his variant can represent the same data in a format that the rest of his code already uses.

[Note that instead of unpacking into JSON, one could also make use cases for msgpack::type::variant, known user-defined message objects, etc.]

  • Checking msgpack contents without unpacking them ("use case 2")

Jakob wants to implement a SQLite extension function that checks if a serialized msgpack buffer (stored in a column of a SQLite table) contains a certain string. He wants to issue SQL queries such as SELECT * FROM table WHERE msgpack_array_contains(msgpack_column, id) and implement the msgpack_array_contains() extension function in the most efficient way, because it will be executed for a lot of rows. He would like to write a visitor class that gets called on each item and aborts unpacking when an array item equals id (after setting a found flag to true), or when the top-level msgpack type is something other than ARRAY. He does not care about a return type, neither msgpack::object nor any other variant.
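
For use case 1, a visitor could build the target representation directly. The following is only a rough sketch written against the draft interface proposed later in this issue: `node` is a hypothetical stand-in for whatever custom variant is being filled (a RapidJSON value, a user-defined type, ...), `tree_builder` is an invented name, and only the callbacks for nil, booleans, unsigned integers, strings and arrays are shown. The calls are made by hand because the proposed unpack() overload does not exist yet.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a custom variant type.
struct node {
    enum class kind { nil, boolean, uint, str, array };
    kind type = kind::nil;
    bool b = false;
    std::uint64_t u = 0;
    std::string s;
    std::vector<node> items;  // used when type == kind::array
};

// Builds a node tree straight from visitor callbacks; no msgpack::object
// is created in between. The stack of open arrays lives in the visitor.
class tree_builder {
public:
    node root;

    bool visit_nil() { put(node{}); return true; }
    bool visit_boolean(bool v) {
        node n; n.type = node::kind::boolean; n.b = v;
        put(std::move(n)); return true;
    }
    bool visit_positive_integer(std::uint64_t v) {
        node n; n.type = node::kind::uint; n.u = v;
        put(std::move(n)); return true;
    }
    bool visit_str(const char* p, std::uint32_t size, bool /*needs_copy*/) {
        node n; n.type = node::kind::str; n.s.assign(p, size);
        put(std::move(n)); return true;
    }
    bool start_array(std::uint32_t /*num_elements*/) {
        node n; n.type = node::kind::array;
        open_.push_back(put(std::move(n)));
        return true;
    }
    bool end_array() {
        if (open_.empty()) return false;  // unbalanced end_array()
        open_.pop_back();
        return true;
    }

private:
    // Stores n as the root or appends it to the innermost open array.
    // Pointers in open_ stay valid because a parent array receives no new
    // items while one of its children is still open.
    node* put(node n) {
        if (open_.empty()) { root = std::move(n); return &root; }
        open_.back()->items.push_back(std::move(n));
        return &open_.back()->items.back();
    }
    std::vector<node*> open_;
};

// Hand-driven callbacks for the value [7, ["hi"]].
inline void tree_builder_demo() {
    tree_builder t;
    t.start_array(2);
    t.visit_positive_integer(7);
    t.start_array(1);
    t.visit_str("hi", 2, false);
    t.end_array();
    t.end_array();
    assert(t.root.items.size() == 2);
    assert(t.root.items[1].items[0].s == "hi");
}
```

The allocations that remain here are those of the target type itself; a variant with its own allocator or in-place storage could avoid them.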

I think RapidJSON's SAX interface (passed to rapidjson::Reader::Parse()) would be a pretty good starting point, with minor adaptations required to suit msgpack's object model. Here's an initial draft of a (template) interface for C++ classes implementing a msgpack visitor, and a suitable unpack() overload:

struct unpack_visitor {
    bool visit_nil();
    bool visit_boolean(bool);
    bool visit_positive_integer(uint64_t);
    bool visit_negative_integer(int64_t);
    bool visit_float(double);
    bool visit_str(const char*, uint32_t size, bool needs_copy); // still not sure why you're not using size_t
    bool visit_bin(const char*, uint32_t size, bool needs_copy);
    bool visit_ext(int8_t type, const char* data, uint32_t size, bool needs_copy);
    bool start_array(uint32_t num_elements);
    bool end_array();
    bool start_map(uint32_t num_kv_pairs);
    bool start_map_key(); // this one could be void, but since all others are bool, might as well keep this one too
    bool end_map_key();
    bool end_map();
    void parse_error(size_t parsed_offset, size_t error_offset);
    void insufficient_bytes(size_t parsed_offset, size_t error_offset);
};

template <typename UnpackVisitor>
bool unpack(const char* data, size_t len, size_t& off, UnpackVisitor&);

[Edit: changed begin_*() to start_*() as RapidJSON, Qt and the Java SAX API all use start as well.]

I prepended visit_ to single-value functions so we don't get issues with keywords (bool float(double);) or the Apple nil define, and msgpack's lower-case naming can still be employed.
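
To make the shape of the proposed unpack() overload more concrete, here is a minimal sketch of a driver that dispatches to a visitor. It is not msgpack-c code: the `sketch` namespace and `trace_visitor` are invented for illustration, and only a handful of single-byte formats (positive/negative fixint, nil, booleans, fixstr, fixarray) are handled. Everything else is reported through parse_error(), which is a simplification, since those bytes can be perfectly valid msgpack. The offsets passed to the error functions and the `false` passed for needs_copy are guesses at the intended semantics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

namespace sketch {

// Parses exactly one value starting at data[off] and advances off past it.
// Recursion depth is unbounded here; a real implementation would limit it.
template <typename UnpackVisitor>
bool unpack(const char* data, std::size_t len, std::size_t& off, UnpackVisitor& v) {
    const std::size_t start = off;
    if (off >= len) { v.insufficient_bytes(start, off); return false; }
    const auto b = static_cast<std::uint8_t>(data[off++]);

    if (b <= 0x7f) return v.visit_positive_integer(b);  // positive fixint
    if (b >= 0xe0)                                      // negative fixint
        return v.visit_negative_integer(static_cast<std::int8_t>(b));
    if (b == 0xc0) return v.visit_nil();
    if (b == 0xc2 || b == 0xc3) return v.visit_boolean(b == 0xc3);
    if ((b & 0xe0) == 0xa0) {                           // fixstr
        const std::uint32_t n = b & 0x1f;
        if (len - off < n) { v.insufficient_bytes(start, off); return false; }
        const char* s = data + off;
        off += n;
        return v.visit_str(s, n, false);  // needs_copy: assumed false
    }
    if ((b & 0xf0) == 0x90) {                           // fixarray
        const std::uint32_t n = b & 0x0f;
        if (!v.start_array(n)) return false;
        for (std::uint32_t i = 0; i < n; ++i)
            if (!unpack(data, len, off, v)) return false;
        return v.end_array();
    }
    v.parse_error(start, off);  // format not covered by this sketch
    return false;
}

// Hypothetical visitor that records every callback as text.
struct trace_visitor {
    std::string log;
    int errors = 0;

    bool visit_nil() { log += "nil "; return true; }
    bool visit_boolean(bool v) { log += v ? "true " : "false "; return true; }
    bool visit_positive_integer(std::uint64_t v) { log += "+" + std::to_string(v) + " "; return true; }
    bool visit_negative_integer(std::int64_t v) { log += std::to_string(v) + " "; return true; }
    bool visit_str(const char* p, std::uint32_t n, bool) { log += "str(" + std::string(p, n) + ") "; return true; }
    bool start_array(std::uint32_t) { log += "[ "; return true; }
    bool end_array() { log += "] "; return true; }
    void parse_error(std::size_t, std::size_t) { ++errors; }
    void insufficient_bytes(std::size_t, std::size_t) { ++errors; }
};

// Usage: the three-element array [1, true, "hi"].
inline void demo() {
    const char buf[] = {'\x93', '\x01', '\xc3', '\xa2', 'h', 'i'};
    std::size_t off = 0;
    trace_visitor v;
    const bool ok = unpack(buf, sizeof buf, off, v);
    assert(ok && off == sizeof buf && v.errors == 0);
    assert(v.log == "[ +1 true str(hi) ] ");
}

}  // namespace sketch
```

Because the driver is a template, a visitor only has to provide the callbacks the driver actually calls, and no virtual dispatch is involved.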

The bool return value of all of those represents a successful parse (true for "continue parsing", false for "parse error"). The unpack() function is specified in a way that an exception-less interface can be built with it and all current unpack() functions can be implemented with this unpack() overload and a visitor:

  • The zone and stack of result msgpack objects can be members of the visitor object, which uses/writes/manages them.
  • Stream-level errors (parse_error and insufficient_bytes) are called as error functions in the visitor, which can decide to either throw or store them as state in the visitor object (and do custom error handling afterwards).
    • For use case 2: A caller who avoids throwing exceptions can distinguish between legitimate parse errors and "found, now abort" early-return situations by checking whether the error functions have been called. If I know that my object is valid msgpack, an early return will save me some reads and processing time.
    • I chose to include offset parameters for the error functions, because it could be useful for error reporting (invalid bytes range from parsed_offset to error_offset). error_offset is the same value as the off parameter in unpack(). When used within an unpacker, parsed_offset equals parsed_size() and error_offset equals parsed_size() + nonparsed_size() (assuming I understand the purpose of these values correctly).
  • unpack_limit and size_overflow errors can be thrown by the visitor functions themselves.
  • unpack_reference_func can be called by visit_str(), visit_bin() and visit_ext().
    • user_data can be a member of the visitor class.
  • unpacker can't be implemented on top of this overload because unpack() doesn't take a parse context object. (The visitor object likely contains a stack of msgpack objects, but does not tell the calling parser about the stack depth. That's a good thing because it makes the visitor object more lightweight and easier to implement if no stack information is necessary.) However, it should be possible to use the same visitor object for unpacker as for regular unpack() overloads, assuming that insufficient_bytes() doesn't throw.
  • The off result parameter might be important for error messages. It's required here in addition to the visitor's error functions because if false is returned from a visit_*()/start_*()/end_*() function, the visitor itself does not know the current offset so it can't store it as state. Hence unpack() itself must maintain and return it.
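
The points above about error handling and early returns can be illustrated with a visitor for use case 2. This is a sketch under assumptions, not a finished design: `array_contains_visitor`, `scan_result` and `classify()` are hypothetical names, and only items of the top-level array are compared. The callbacks are invoked by hand at the end because the proposed unpack() overload does not exist yet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical visitor: stops as soon as a top-level array item equals the
// needle, and rejects input whose top-level value is not an array.
class array_contains_visitor {
public:
    explicit array_contains_visitor(std::string needle) : needle_(std::move(needle)) {}

    bool found = false;         // set on a match, right before aborting
    bool stream_error = false;  // set only by the two error functions

    bool visit_nil() { return inside_array(); }
    bool visit_boolean(bool) { return inside_array(); }
    bool visit_positive_integer(std::uint64_t) { return inside_array(); }
    bool visit_negative_integer(std::int64_t) { return inside_array(); }
    bool visit_float(double) { return inside_array(); }
    bool visit_str(const char* s, std::uint32_t size, bool /*needs_copy*/) {
        if (depth_ == 1 && size == needle_.size() &&
            std::memcmp(s, needle_.data(), size) == 0) {
            found = true;
            return false;  // early abort, not a parse error
        }
        return inside_array();
    }
    bool visit_bin(const char*, std::uint32_t, bool) { return inside_array(); }
    bool visit_ext(std::int8_t, const char*, std::uint32_t, bool) { return inside_array(); }
    bool start_array(std::uint32_t) { ++depth_; return true; }
    bool end_array() { --depth_; return true; }
    bool start_map(std::uint32_t) {
        if (depth_ == 0) return false;  // top-level value must be an array
        ++depth_;
        return true;
    }
    bool start_map_key() { return true; }
    bool end_map_key() { return true; }
    bool end_map() { --depth_; return true; }
    void parse_error(std::size_t, std::size_t) { stream_error = true; }
    void insufficient_bytes(std::size_t, std::size_t) { stream_error = true; }

private:
    bool inside_array() const { return depth_ > 0; }
    std::string needle_;
    int depth_ = 0;  // nesting level; 1 means "directly in the top-level array"
};

enum class scan_result { found, not_found, rejected, malformed };

// Interprets the bool returned by unpack() together with the visitor state.
inline scan_result classify(bool unpack_ok, const array_contains_visitor& v) {
    if (v.stream_error) return scan_result::malformed;
    if (v.found) return scan_result::found;
    return unpack_ok ? scan_result::not_found : scan_result::rejected;
}

// Hand-driven callbacks for ["ab", "id"], as a parser would issue them.
inline void array_contains_demo() {
    array_contains_visitor v("id");
    const bool ok = v.start_array(2) && v.visit_str("ab", 2, false) &&
                    v.visit_str("id", 2, false);
    assert(!ok);  // parsing stopped early
    assert(classify(ok, v) == scan_result::found);
}
```

Since a false return alone is ambiguous, the caller looks at stream_error and found to tell a malformed buffer, a rejected top-level type and a successful early abort apart, without any exceptions being thrown.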

Personally, I find this interface a lot more approachable and easier to understand than unpack_reference_func and unpack_limit. Even though it is simpler in concept, it is also more flexible. I hope something like this can be supported by msgpack-c.
