It would be super neat if I could use msgpack as a pure wire format reader/visitor without having to create a result msgpack::object.
For instance, my class msgpack_from_json_handler can be passed to rapidjson::Reader::Parse() as a parameter and will be called back on every element in the JSON tree.
With unpack_reference_func(), I can avoid copying all string data with a zone. However, I still have to let msgpack generate a msgpack::object tree, with allocations depending on the non-leaf objects / depth of the tree. In some situations, a simple visitor pattern would work better.
Here are two example use cases:
- Converting msgpack to a custom type/variant without allocations ("use case 1")
Elvis's library deals mostly with JSON data. Internally he stores much of his data as RapidJSON variant values. He decides he also wants to support serialization to and from msgpack. He needs to process lots of data, and packing/unpacking is a hot code path, so it needs to perform as fast as possible. For serializing to msgpack, he can use the pack()/pack_map()/pack_array() functions and json-msgpack. He also wants to be able to read directly from a msgpack buffer into a RapidJSON variant, skipping the intermediate step of the msgpack::object, because his variant can represent the same data in a format that the rest of his code uses.
[Note that instead of unpacking into JSON, one could also make use cases for msgpack::type::variant, known user-defined message objects, etc.]
- Checking msgpack contents without unpacking them ("use case 2")
Jakob wants to implement a SQLite extension function that checks if a serialized msgpack buffer (stored in a column of a SQLite table) contains a certain string. He wants to issue SQL queries such as

    SELECT * FROM table WHERE msgpack_array_contains(msgpack_column, id)

and implement the msgpack_array_contains() extension function in the most efficient way, because it will be executed for a lot of rows. He would like to write a visitor class that gets called on each item and aborts unpacking when an array item equals id (after setting a found flag to true), or when the top-level msgpack type is something other than ARRAY. He does not care about a return type, neither msgpack::object nor any other variant.
I think RapidJSON's SAX interface (passed to rapidjson::Reader::Parse()) would be a pretty good starting point, with minor adaptations required to suit msgpack's object model. Here's an initial draft of a (template) interface for C++ classes implementing a msgpack visitor, and a suitable unpack() overload:
    struct unpack_visitor {
        bool visit_nil();
        bool visit_boolean(bool);
        bool visit_positive_integer(uint64_t);
        bool visit_negative_integer(int64_t);
        bool visit_float(double);
        bool visit_str(const char*, uint32_t size, bool needs_copy); // still not sure why you're not using size_t
        bool visit_bin(const char*, uint32_t size, bool needs_copy);
        bool visit_ext(int8_t type, const char* data, uint32_t size, bool needs_copy);
        bool start_array(uint32_t num_elements);
        bool end_array();
        bool start_map(uint32_t num_kv_pairs);
        bool start_map_key(); // this one could be void, but since all others are bool, might as well keep this one too
        bool end_map_key();
        bool end_map();
        void parse_error(size_t parsed_offset, size_t error_offset);
        void insufficient_bytes(size_t parsed_offset, size_t error_offset);
    };

    template <typename UnpackVisitor>
    bool unpack(const char* data, size_t len, size_t& off, UnpackVisitor&);

[Edit: changed begin_*() to start_*() as RapidJSON, Qt and the Java SAX API all use start as well.]
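To make the calling convention concrete, here is a rough sketch of how a parser could drive such a visitor. It is not msgpack-c code: unpack_subset and trace_visitor are hypothetical names, only a small subset of the wire format is handled (nil, booleans, positive fixint, fixstr, fixarray), and the offsets passed to the error functions and the value passed for needs_copy reflect one possible reading of the draft.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical driver for a SUBSET of msgpack (nil, bool, positive fixint,
// fixstr, fixarray). Parses one value at data[off] and advances off.
template <typename UnpackVisitor>
bool unpack_subset(const char* data, std::size_t len, std::size_t& off, UnpackVisitor& v) {
    assert(off <= len);
    const std::size_t start = off;
    if (off == len) {                       // nothing left to read
        v.insufficient_bytes(start, len);
        return false;
    }
    const std::uint8_t b = static_cast<std::uint8_t>(data[off++]);
    if (b <= 0x7f) return v.visit_positive_integer(b);
    if (b == 0xc0) return v.visit_nil();
    if (b == 0xc2 || b == 0xc3) return v.visit_boolean(b == 0xc3);
    if ((b & 0xe0) == 0xa0) {               // fixstr: low 5 bits are the length
        const std::uint32_t n = b & 0x1f;
        if (len - off < n) {
            off = len;                      // assumption: off ends up equal to error_offset
            v.insufficient_bytes(start, len);
            return false;
        }
        const char* s = data + off;
        off += n;
        // Assumption: needs_copy is false because the caller owns the buffer.
        return v.visit_str(s, n, false);
    }
    if ((b & 0xf0) == 0x90) {               // fixarray: low 4 bits are the element count
        const std::uint32_t n = b & 0x0f;
        if (!v.start_array(n)) return false;
        for (std::uint32_t i = 0; i < n; ++i) {
            if (!unpack_subset(data, len, off, v)) return false;
        }
        return v.end_array();
    }
    v.parse_error(start, off);              // type byte outside the supported subset
    return false;
}

// Hypothetical visitor that records every callback in a string.
struct trace_visitor {
    std::string log;
    bool visit_nil() { log += "nil "; return true; }
    bool visit_boolean(bool b) { log += b ? "true " : "false "; return true; }
    bool visit_positive_integer(std::uint64_t n) { log += std::to_string(n) + " "; return true; }
    bool visit_str(const char* s, std::uint32_t n, bool) {
        log += "'" + std::string(s, n) + "' ";
        return true;
    }
    bool start_array(std::uint32_t n) { log += "[" + std::to_string(n) + " "; return true; }
    bool end_array() { log += "] "; return true; }
    void parse_error(std::size_t, std::size_t) { log += "parse_error "; }
    void insufficient_bytes(std::size_t, std::size_t) { log += "insufficient_bytes "; }
};
```

Because the driver is a template, a visitor only needs the member functions that are actually called for the subset it handles; no msgpack::object or zone is involved at any point.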
I prepended visit_ to single-value functions so we don't get issues with keywords (bool float(double);) or the Apple nil define, and msgpack's lower-case naming can still be employed.
The bool return value of all of those represents a successful parse (true for "continue parsing", false for "parse error"). The unpack() function is specified in a way that an exception-less interface can be built with it and all current unpack() functions can be implemented with this unpack() overload and a visitor:
- The zone and stack of result msgpack objects can be members of the visitor object, which uses/writes/manages them.
- Stream-level errors (parse_error and insufficient_bytes) are reported by calling error functions on the visitor, which can decide to either throw or store them as state in the visitor object (and do custom error handling afterwards).
- For use case 2: A caller who avoids throwing exceptions can distinguish between legitimate parse errors and "found, now abort" early return situations by determining whether error functions have been called. If I know that my object is valid msgpack, an early return will save me some reads and processing time.
- I chose to include offset parameters for the error functions, because they could be useful for error reporting (invalid bytes range from parsed_offset to error_offset). error_offset is the same value as the off parameter in unpack(). When used within an unpacker, parsed_offset equals parsed_size() and error_offset equals parsed_size() + nonparsed_size() (assuming I understand the purpose of these values correctly).
- unpack_limit and size_overflow errors can be thrown by the visitor functions themselves.
- unpack_reference_func can be called by visit_str(), visit_bin() and visit_ext().
- user_data can be a member of the visitor class.
- unpacker can't be implemented on top of this overload because unpack() doesn't take a parse context object. (The visitor object likely contains a stack of msgpack objects, but does not tell the calling parser about the stack depth. That's a good thing because it makes the visitor object more lightweight and easier to implement if no stack information is necessary.) However, it should be possible to use the same visitor object for unpacker as for regular unpack() overloads, assuming that insufficient_bytes() doesn't throw.
- The off result parameter might be important for error messages. It's required here in addition to the visitor's error functions because if false is returned from a visit_*()/start_*()/end_*() function, the visitor itself does not know the current offset so it can't store it as state. Hence unpack() itself must maintain and return it.
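As an illustration of the use case 2 point above, a visitor along these lines might look as follows. This is only a sketch against the draft interface: array_contains_visitor, outcome and classify are hypothetical names, and since the proposed unpack() overload does not exist yet, the callbacks would have to be invoked by hand (or by a prototype parser) to try it out.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical visitor for use case 2: stops parsing as soon as a direct
// element of the top-level array equals the needle.
struct array_contains_visitor {
    explicit array_contains_visitor(std::string needle) : needle_(std::move(needle)) {}

    bool found = false;         // set right before the early abort
    bool stream_error = false;  // set only by the error callbacks

    // Scalars are acceptable only inside a container; a scalar at the
    // top level means the buffer is not an ARRAY, so parsing is aborted.
    bool visit_nil() { return inside_container(); }
    bool visit_boolean(bool) { return inside_container(); }
    bool visit_positive_integer(std::uint64_t) { return inside_container(); }
    bool visit_negative_integer(std::int64_t) { return inside_container(); }
    bool visit_float(double) { return inside_container(); }
    bool visit_bin(const char*, std::uint32_t, bool) { return inside_container(); }
    bool visit_ext(std::int8_t, const char*, std::uint32_t, bool) { return inside_container(); }

    bool visit_str(const char* data, std::uint32_t size, bool) {
        if (depth_ == 1 && size == needle_.size() &&
            std::memcmp(data, needle_.data(), size) == 0) {
            found = true;
            return false;  // "found, now abort": not an error
        }
        return inside_container();
    }

    bool start_array(std::uint32_t) { ++depth_; return true; }
    bool end_array() { --depth_; return true; }
    bool start_map(std::uint32_t) {
        if (depth_ == 0) return false;  // top-level type is not ARRAY
        ++depth_;
        return true;
    }
    bool start_map_key() { return true; }
    bool end_map_key() { return true; }
    bool end_map() { --depth_; return true; }

    // Errors are stored as state instead of being thrown.
    void parse_error(std::size_t, std::size_t) { stream_error = true; }
    void insufficient_bytes(std::size_t, std::size_t) { stream_error = true; }

private:
    bool inside_container() const { return depth_ > 0; }
    std::string needle_;
    unsigned depth_ = 0;
};

enum class outcome { found, not_found, malformed };

// After unpack() returns, the visitor state tells an early abort apart
// from a stream-level error without any exceptions.
inline outcome classify(const array_contains_visitor& v) {
    if (v.stream_error) return outcome::malformed;
    return v.found ? outcome::found : outcome::not_found;
}
```

The return value of unpack() alone would be false both for a match and for a broken buffer; the two flags in the visitor are what separate those cases.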
Personally, I find this interface a lot more approachable and easier to understand than unpack_reference_func and unpack_limit. Even though it is simpler in concept, it is also more flexible. I hope something like this can be supported by msgpack-c.
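For use case 1, a visitor could write straight into the target representation and keep whatever stack it needs as a member. The sketch below uses a std::string of JSON text as a stand-in for a RapidJSON value to stay self-contained; json_text_visitor is a hypothetical name, string escaping is omitted, and floats, bin, ext and non-string map keys are not handled.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical visitor for use case 1: renders visitor events as compact
// JSON text. Its own small stack replaces the msgpack::object tree.
struct json_text_visitor {
    std::string out;

    bool visit_nil() { sep(); out += "null"; return true; }
    bool visit_boolean(bool b) { sep(); out += b ? "true" : "false"; return true; }
    bool visit_positive_integer(std::uint64_t n) { sep(); out += std::to_string(n); return true; }
    bool visit_negative_integer(std::int64_t n) { sep(); out += std::to_string(n); return true; }
    bool visit_str(const char* s, std::uint32_t n, bool) {
        sep();
        out += '"';
        out.append(s, n);  // assumption: no characters that need JSON escaping
        out += '"';
        return true;
    }
    bool start_array(std::uint32_t) {
        sep();
        out += '[';
        frames_.push_back({false, 0});
        return true;
    }
    bool end_array() { return close(']'); }
    bool start_map(std::uint32_t) {
        sep();
        out += '{';
        frames_.push_back({true, 0});
        return true;
    }
    bool start_map_key() { return true; }  // key/value position is tracked by counting
    bool end_map_key() { return true; }
    bool end_map() { return close('}'); }

private:
    struct frame {
        bool is_map;
        std::uint32_t count;  // items seen so far; in a map, keys and values both count
    };

    // Emits the separator that belongs before the next item of the
    // innermost open container, if there is one.
    void sep() {
        if (frames_.empty()) return;
        frame& f = frames_.back();
        if (f.is_map && f.count % 2 == 1) {
            out += ':';  // the next item is the value for the previous key
        } else if (f.count > 0) {
            out += ',';
        }
        ++f.count;
    }

    bool close(char c) {
        if (frames_.empty()) return false;  // unbalanced end_*() call
        out += c;
        frames_.pop_back();
        return true;
    }

    std::vector<frame> frames_;
};
```

Feeding it the events for a two-entry map (start_map, key "a", an array of 1 and true, key "b", nil, end_map) produces the text {"a":[1,true],"b":null}.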