Support SAX-style unpack API

It would be super neat if I could use msgpack as a pure wire format reader/visitor without having to create a result `msgpack::object`.

For instance, my class [msgpack_from_json_handler](https://github.com/tplgy/json-msgpack/blob/master/include/json_msgpack/json_msgpack_sax.hpp#L113) can be passed to `rapidjson::Reader::Parse()` as a parameter and will be called back on every element in the JSON tree.

With `unpack_reference_func()`, I can avoid copying all string data with a zone. However, I still have to let msgpack generate a `msgpack::object` tree, with allocations depending on the non-leaf objects / depth of the tree. In some situations, a simple visitor pattern would work better.

Here are two example use cases:
- Converting msgpack to a custom type/variant without allocations ("use case 1")

> Elvis's library deals mostly with JSON data. Internally he stores much of his data as RapidJSON variant values. He decides he also wants to support serialization to and from msgpack. He needs to process lots of data and packing/unpacking is a hot code path, so it needs to perform as fast as possible. For serializing to msgpack, he can use `pack()`/`pack_map()`/`pack_array()` functions and [json-msgpack](https://github.com/tplgy/json-msgpack). He also wants to be able to read directly from a msgpack buffer into a RapidJSON variant, skipping the intermediate step of the `msgpack::object` because his variant can represent the same data in a format that the rest of his code uses.

[Note that instead of unpacking into JSON, one could also make use cases for `msgpack::type::variant`, known user-defined message objects, etc.]
- Checking msgpack contents without unpacking them ("use case 2")

> Jakob wants to implement a SQLite extension function that checks if a serialized msgpack buffer (stored in a column of a SQLite table) contains a certain string. He wants to issue SQL queries such as `SELECT * FROM table WHERE msgpack_array_contains(msgpack_column, id)` and implement the `msgpack_array_contains()` extension function in the most efficient way, because it will be executed for a lot of rows. He would like to write a visitor class that gets called on each item and aborts unpacking when an array item equals `id` (after setting a `found` flag to true), or when the top-level msgpack type is something other than ARRAY. He does not care about a return type, neither `msgpack::object` nor any other variant.

I think RapidJSON's SAX interface (passed to `rapidjson::Reader::Parse()`) would be a pretty good starting point, with minor adaptations required to suit msgpack's object model. Here's an initial draft of a (template) interface for C++ classes implementing a msgpack visitor, and a suitable `unpack()` overload:

``` C++
struct unpack_visitor {
    bool visit_nil();
    bool visit_boolean(bool);
    bool visit_positive_integer(uint64_t);
    bool visit_negative_integer(int64_t);
    bool visit_float(double);
    bool visit_str(const char*, uint32_t size, bool needs_copy); // still not sure why you're not using size_t
    bool visit_bin(const char*, uint32_t size, bool needs_copy);
    bool visit_ext(int8_t type, const char* data, uint32_t size, bool needs_copy);
    bool start_array(uint32_t num_elements);
    bool end_array();
    bool start_map(uint32_t num_kv_pairs);
    bool start_map_key(); // this one could be void, but since all others are bool, might as well keep this one too
    bool end_map_key();
    bool end_map();
    void parse_error(size_t parsed_offset, size_t error_offset);
    void insufficient_bytes(size_t parsed_offset, size_t error_offset);
};

template <typename UnpackVisitor>
bool unpack(const char* data, size_t len, size_t& off, UnpackVisitor&);
```

[Edit: changed `begin_*()` to `start_*()` as RapidJSON, Qt and the Java SAX API all use `start` as well.]

I prepended `visit_` to single-value functions so we don't get issues with keywords (`bool float(double);`) or the Apple `nil` define, and msgpack's lower-case naming can still be employed.
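To make the callback order concrete, here is a minimal sketch in the spirit of use case 1: a visitor that renders events as JSON text instead of building a `msgpack::object` tree. The names `json_text_visitor` and `replay_example()` are made up for illustration, and the events are replayed by hand, the way I would expect a parser to emit them for `{"id": 7, "tags": ["a", "b"]}`. It assumes the parser brackets every map key with `start_map_key()`/`end_map_key()` and that the value follows directly. There is no string escaping and no handling of non-string keys, so treat it as a sketch only.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical visitor (not part of msgpack-c): writes JSON text as callbacks arrive.
struct json_text_visitor {
    struct frame { bool is_map; std::uint32_t items; };
    std::string out;
    std::vector<frame> stack;  // one frame per open array/map

    // Array elements are comma-separated; map entries get their comma in start_map_key().
    void separate_array_item() {
        if (!stack.empty() && !stack.back().is_map && stack.back().items++ > 0) out += ',';
    }

    bool visit_nil() { separate_array_item(); out += "null"; return true; }
    bool visit_boolean(bool b) { separate_array_item(); out += (b ? "true" : "false"); return true; }
    bool visit_positive_integer(std::uint64_t v) { separate_array_item(); out += std::to_string(v); return true; }
    bool visit_negative_integer(std::int64_t v) { separate_array_item(); out += std::to_string(v); return true; }
    bool visit_str(const char* s, std::uint32_t size, bool /*needs_copy*/) {
        separate_array_item();
        out += '"';
        out.append(s, size);  // no escaping in this sketch
        out += '"';
        return true;
    }
    bool start_array(std::uint32_t) { separate_array_item(); out += '['; stack.push_back({false, 0}); return true; }
    bool end_array() { stack.pop_back(); out += ']'; return true; }
    bool start_map(std::uint32_t) { separate_array_item(); out += '{'; stack.push_back({true, 0}); return true; }
    bool start_map_key() { if (stack.back().items++ > 0) out += ','; return true; }
    bool end_map_key() { out += ':'; return true; }
    bool end_map() { stack.pop_back(); out += '}'; return true; }
};

// Replays, by hand, the event sequence assumed for {"id": 7, "tags": ["a", "b"]}.
std::string replay_example() {
    json_text_visitor v;
    v.start_map(2);
    v.start_map_key(); v.visit_str("id", 2, false); v.end_map_key();
    v.visit_positive_integer(7);
    v.start_map_key(); v.visit_str("tags", 4, false); v.end_map_key();
    v.start_array(2);
    v.visit_str("a", 1, false);
    v.visit_str("b", 1, false);
    v.end_array();
    v.end_map();
    return v.out;
}
```

The only state the visitor keeps is a small stack of open containers. The parser itself never needs to know about it.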

The `bool` return value of all of those represents a successful parse (`true` for "continue parsing", `false` for "parse error"). The `unpack()` function is specified in a way that an exception-less interface can be built with it and all current `unpack()` functions can be implemented with this `unpack()` overload and a visitor:
- The zone and stack of result msgpack objects can be members of the visitor object, which uses/writes/manages them.
- Stream-level errors (`parse_error` and `insufficient_bytes`) are called as error functions in the visitor, which can decide to either throw or store them as state in the visitor object (and do custom error handling afterwards).
  - For use case 2: A caller who avoids throwing exceptions can distinguish between legitimate parse errors and "found, now abort" early return situations by determining whether error functions have been called. If I know that my object is valid msgpack, an early return will save me some reads and processing time.
  - I chose to include offset parameters for the error functions, because it could be useful for error reporting (invalid bytes range from `parsed_offset` to `error_offset`). `error_offset` is the same value as the `off` parameter in `unpack()`. When used within an `unpacker`, `parsed_offset` equals `parsed_size()` and `error_offset` equals `parsed_size() + nonparsed_size()` (assuming I understand the purpose of these values correctly).
- `unpack_limit` and `size_overflow` errors can be thrown by the visitor functions themselves.
- `unpack_reference_func` can be called by `visit_str()`, `visit_bin()` and `visit_ext()`.
  - `user_data` can be a member of the visitor class.
- `unpacker` can't be implemented on top of this overload because `unpack()` doesn't take a parse context object. (The visitor object likely contains a stack of msgpack objects, but does not tell the calling parser about the stack depth. That's a good thing because it makes the visitor object more lightweight and easier to implement if no stack information is necessary.) However, it should be possible to use the same visitor object for `unpacker` as for regular `unpack()` overloads, assuming that `insufficient_bytes()` doesn't throw.
- The `off` result parameter might be important for error messages. It's required here in addition to the visitor's error functions because if `false` is returned from a `visit_*()`/`start_*()`/`end_*()` function, the visitor itself does not know the current offset so it can't store it as state. Hence `unpack()` itself must maintain and return it.
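The points above about error callbacks, early aborts and `off` can be sketched roughly as follows. This is a toy driver, not msgpack-c code: `toy_unpack()` is a hypothetical stand-in for the proposed `unpack()` overload and only understands positive fixint, fixstr, fixarray, nil, true and false. `array_contains_visitor` and `array_contains()` are likewise made-up names for use case 2. I am assuming here that `parsed_offset` is the offset at which the call started and `error_offset` is where the problem was detected. The visitor only defines the callbacks this toy driver actually calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Toy driver (hypothetical): returns false if the visitor stopped or an error occurred.
template <typename Visitor>
bool toy_unpack_value(const char* data, std::size_t len, std::size_t& off,
                      Visitor& v, std::size_t begin) {
    if (off >= len) { v.insufficient_bytes(begin, off); return false; }
    const unsigned char tag = static_cast<unsigned char>(data[off]);
    if (tag <= 0x7f) { ++off; return v.visit_positive_integer(tag); }
    if (tag == 0xc0) { ++off; return v.visit_nil(); }
    if (tag == 0xc2 || tag == 0xc3) { ++off; return v.visit_boolean(tag == 0xc3); }
    if ((tag & 0xe0) == 0xa0) {                    // fixstr: 101xxxxx
        const std::uint32_t n = tag & 0x1f;
        if (len - off - 1 < n) { v.insufficient_bytes(begin, off); return false; }
        const char* s = data + off + 1;
        off += 1 + n;
        return v.visit_str(s, n, true);
    }
    if ((tag & 0xf0) == 0x90) {                    // fixarray: 1001xxxx
        const std::uint32_t n = tag & 0x0f;
        ++off;
        if (!v.start_array(n)) return false;
        for (std::uint32_t i = 0; i < n; ++i)
            if (!toy_unpack_value(data, len, off, v, begin)) return false;
        return v.end_array();
    }
    v.parse_error(begin, off);                     // tag outside the toy subset
    return false;
}

template <typename Visitor>
bool toy_unpack(const char* data, std::size_t len, std::size_t& off, Visitor& v) {
    return toy_unpack_value(data, len, off, v, off);
}

// Use case 2: stop as soon as a top-level array item equals `needle`.
struct array_contains_visitor {
    std::string needle;
    bool found = false;
    bool error = false;  // set only by the stream-level error callbacks
    int depth = 0;

    bool visit_nil() { return depth > 0; }
    bool visit_boolean(bool) { return depth > 0; }
    bool visit_positive_integer(std::uint64_t) { return depth > 0; }
    bool visit_str(const char* s, std::uint32_t n, bool) {
        if (depth == 0) return false;              // top level is not an array
        if (depth == 1 && n == needle.size() && std::memcmp(s, needle.data(), n) == 0) {
            found = true;
            return false;                          // early abort, not an error
        }
        return true;
    }
    bool start_array(std::uint32_t) { ++depth; return true; }
    bool end_array() { --depth; return true; }
    void parse_error(std::size_t, std::size_t) { error = true; }
    void insufficient_bytes(std::size_t, std::size_t) { error = true; }
};

// Returns 1 if found, 0 if not found, -1 on a stream-level error.
int array_contains(const std::string& buf, const std::string& id) {
    array_contains_visitor v;
    v.needle = id;
    std::size_t off = 0;
    toy_unpack(buf.data(), buf.size(), off, v);
    if (v.error) return -1;
    return v.found ? 1 : 0;
}

// msgpack bytes for ["ab", "cd", 5].
std::string sample_buffer() { return std::string("\x93\xa2" "ab" "\xa2" "cd" "\x05"); }
```

Because `toy_unpack()` returns `false` both for "visitor stopped" and for stream errors, the caller tells them apart only through the visitor's own `error` flag, which is the split described above.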

Personally, I find this interface a lot more approachable and easier to understand than `unpack_reference_func` and `unpack_limit`. Even though it is simpler in concept, it is also more flexible. I hope something like this can be supported by msgpack-c.


Support SAX-style unpack API #418
