Failure to encode object types when used with zarr.full #2081
Thanks for raising this issue. I think this reveals some fundamental problems with the "object" dtype. Basically, because an "object" can be anything, there's no defined way to encode an object fill value in the JSON array metadata. In principle we could alter the zarr v2 spec to include language describing a JSON encoding for object fill values.

What's the goal of setting the fill value to be a pydantic model? Maybe there's another way to achieve what you want.
good to know! actually i would rather not dump the pickled object, and would much rather be able to provide a hook to serialize it to JSON!

context: i'm translating neurodata without borders to linkml, and the models use another tool I wrote, numpydantic, to be able to do shape and dtype specifications with arbitrary array backends. the case here comes up because NWB has a lot of inter-object references as arrays, so for example with this:

```python
from numpydantic import NDArray, Shape
from pydantic import BaseModel
from nwb_linkml.models import Unit
class UnitTable(BaseModel):
    units: NDArray[Shape["* n_units"], Unit]
```

`units` then behaves like an array of `Unit` models. The interface system, as well as my code generator, give me pretty good control over the objects and models that are created, and what would be ideal for me is to have some kind of hook that i can add to my models for serialization/deserialization. So, in the same way pydantic lets a model customize its own serialization, imagine something like:

```python
import zarr
from dataclasses import dataclass
from typing import Optional
from pydantic import BaseModel
from numcodecs import Codec

@dataclass
class ZarrSerialization:
    data: dict[str, str | float | int]
    """whatever representation of the object is JSON-able"""

    array: Optional[zarr.Array]
    """if this object can be directly converted into a zarr array..."""

    source_object: str
    """module.object_name"""

    metadata: dict[str, str | float | int]
    """Any other json-able stuff"""

class MyClass(BaseModel):
    def __zarr_serialization__(self, codec: Codec, ctx: Optional[zarr.SerializationContext] = None) -> ZarrSerialization:
        # return something zarr knows how to make
        return ZarrSerialization(
            data=self.model_dump(),
            source_object=".".join([self.__module__, type(self).__name__]),
            metadata={'whatever': 'else'},
        )

    @classmethod
    def _from_zarr(cls, serialization: ZarrSerialization) -> 'MyClass':
        # rehydrate the model from the serialization
        ...
```

just as a super rough example. So maybe I take the codec that is requested during serialization, give enough information as would be needed to re-create the object (or, from a multi-language perspective, I could also specify that this came from a Python object, so other languages would know they weren't supposed to try and handle it - you know what's needed there better than me), and any other information that would be useful. Then I take that object back when loading the array (either from another fixed-name method, or one given during serialization).

Many of these objects have arrays nested within them, so if i could hook into the zarr serialization process generally, I could return the model fields that are arrays as arrays, and then store the object metadata around them.

So then, like yaml, there would be a 'safe load' that just returns the JSON object, and an 'unsafe load' which tries to rehydrate/cast objects. That may cut down on the complexity of supporting arbitrary objects: "we only support objects that have specifically implemented our serialization protocol."

The reason it would be good to have arbitrary control over what gets serialized (rather than always just a pure dict of the object contents) is that in e.g. NWB i'm sharing object references between my instantiated models to imitate HDF5 object references, and so when serializing i would want to save the instantiated model in only one place, and in other places save a reference to it.

Y'all know more about what's good for the format than I do obviously, and understood that arrays of objects are intrinsically awkward, but what i imagine happening if support for objects is dropped (the numcodecs system is nice!) is that people will just store things as long opaque strings, which is also not great for cross-platform use.

I would be more than happy to implement this if you're interested, because then zarr becomes sort of like a magic format to me: i can just transparently use it as a backing store for this and other data, and numpydantic can sort of behave as an "ORM-like" interface to zarr stores. lmk!
If the fill value is JSON, then maybe it's simpler to think of the zarr array having a JSON dtype? I don't think this is very ergonomic in zarr today, because zarr is designed more for numeric types. But at least using JSON gets you around overfitting to python data structures. That being said, I'm not sure I fully understand the plan to serialize an array model inside an array (correct me if this is not an accurate characterization).
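For what it's worth, you can get something like a JSON dtype today with the v2 API by using the numcodecs JSON codec as the object codec - a rough sketch (this doesn't address the fill value problem from the OP):

```python
import numcodecs
import zarr  # zarr v2 API

# Each element round-trips through the JSON codec instead of pickle, so the
# stored bytes stay language-agnostic (the dtype is still "object", though).
z = zarr.empty(4, dtype=object, object_codec=numcodecs.JSON())
z[0] = {"some_field": "whatever", "n": 5}
print(z[0])  # {'some_field': 'whatever', 'n': 5}
```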
sorry, issue fell out of my notifs -

tl;dr: it would be nice to have a hook into zarr's serialization methods (via the extension points in zarr 3) for storing objects rather than storing them as serialized blobs, which was the initial idea.

So my overall goal is to be able to patch into zarr as a backend for data models that include arrays, and sometimes those arrays include things like references to other arrays (or more generically, objects that require custom serialization). This is for neurodata without borders, if it helps with context at all, since i think y'all have overlap with that dev team.

The existing behavior of being able to provide your own serialization codec works, but it's a little awkward: i need to implement the serialization behavior in the thing that contains the special type, rather than having a hook that lets the special type provide its own serialization. That's one option, but imperfect, because it's basically ignorant of the zarr storage model - the object would basically be stored as a variable-length string. That's the OP of the issue.

What i would really like to be able to do is to patch directly into the zarr serialization format itself and serialize the object as zarr, rather than serialize the object in zarr - especially if y'all are dropping support for objects. That's the later comment.

So take for example this relatively simple data model:

```python
from typing import Any

from pydantic import BaseModel
from numpydantic import NDArray, Shape

class Trajectory(BaseModel):
    some_field: str = "whatever"
    latitude: NDArray[Shape["*"], float]
    longitude: NDArray[Shape["*"], float]
    time: NDArray[Shape["*"], float]

class Flock(BaseModel):
    other_field: int = 5
    trajectories: NDArray[Any, Trajectory]
```

So:

```python
import zarr
from zarr.store import LocalStore
t_store = LocalStore('trajectory_1.zarr', mode='w')
trajectory = zarr.group(store=t_store, attributes={'some_field': 'whatever'})

latitude = trajectory.create_array('latitude', shape=(10, 1), fill_value=0)
longitude = trajectory.create_array('longitude', shape=(10, 1), fill_value=0)
time = trajectory.create_array('time', shape=(10, 1), fill_value=0)
```

and that same thing would work with numpydantic, which wraps zarr:

```python
import numpy as np

trajectory = Trajectory(
    latitude=zarr.zeros(shape=(10, 1)),
    longitude=("trajectory_1.zarr", "longitude"),  # reference to the array written above
    time=np.arange(10),
)
```

But it would also be nice to be able to provide a serialization hook so that, for these models, I can tell zarr how they map onto zarr's group structure. So for:

```python
flock = Flock(
    trajectories=[t1, t2, t3]  # previously created Trajectory instances
)
zarr.save('my_data.zarr', flock)
```
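and have that come out as a group hierarchy on disk - something like this (a hypothetical layout, assuming one subgroup per trajectory and zarr v3 metadata documents):

```
my_data.zarr/
├── zarr.json              # Flock attributes, e.g. {"other_field": 5}
└── trajectories/
    ├── 0/                 # one subgroup per Trajectory
    │   ├── zarr.json      # {"some_field": "whatever"}
    │   ├── latitude/
    │   ├── longitude/
    │   └── time/
    ├── 1/
    └── 2/
```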
i want to make a model s.t. something behaves like an array of models, but doesn't necessarily get stored as an array of serialized blobs. This seems like it might fit in with zarr 3's extension points - eg. if there was a hook where I could specify that something should be stored with a custom Sorry that this issue drifted focus, i can split off into a separate one if we want to keep this just as a bug report for the specific problem in the OP |
I think I'm seeing two challenges in this issue (feel free to correct me if this summary is bad). The first is how to map pydantic models onto zarr hierarchies, and the second is how to serialize references to zarr arrays / groups.

Regarding mapping pydantic models to zarr hierarchies, you say:

> it would also be nice to be able to provide a serialization hook so that, for these models, I can tell zarr how they map onto zarr's group structure

My approach to this has been to explicitly model zarr's hierarchical group structure in pydantic, and then serialization from zarr-the-model to zarr-the-format is relatively simple (there's a rough sketch of what I mean at the end of this comment). Modelling zarr hierarchies explicitly comes at a cost -- I can't serialize an arbitrary pydantic model to a zarr hierarchy, but that's a potentially unbounded problem: generally speaking, if you have some data structure you want to persist, you need an explicit mapping from that structure onto the structures zarr supports.

Regarding serializing references to arrays: Zarr has no formal support for this, so you would basically need to create your own serialization scheme that can map the references to the types that zarr does support: JSON and numerical values. It seems like the former is a bit easier than the latter. There's some prior art in how hdf5 models virtual datasets, and there might be users of Zarr today who make use of references to arrays and groups, but I don't have a lot of experience with this.
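As a rough sketch of what I mean by modeling the hierarchy explicitly (simplified, hypothetical classes - not a real API):

```python
from typing import Any, Union

from pydantic import BaseModel

# A minimal mirror of a zarr hierarchy: a group is attributes plus named
# members, and a member is either an array spec or another group.
class ArraySpec(BaseModel):
    shape: tuple[int, ...]
    dtype: str
    attributes: dict[str, Any] = {}

class GroupSpec(BaseModel):
    attributes: dict[str, Any] = {}
    members: dict[str, Union[ArraySpec, "GroupSpec"]] = {}

# Because this model is already shaped like zarr, writing it to a store is
# a mechanical walk over `members` - nothing about the source object's
# python-level structure has to be guessed.
spec = GroupSpec(
    attributes={"some_field": "whatever"},
    members={"latitude": ArraySpec(shape=(10, 1), dtype="float64")},
)
```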
Exactly - that's why you'd provide a serialization hook in the first place. Ideally the method signature would look something like this,
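restating the rough shape of the `ZarrSerialization` sketch from above (`SerializationContext` is a hypothetical name, not anything zarr defines today):

```python
from typing import Optional

from numcodecs import Codec
from pydantic import BaseModel

class MyClass(BaseModel):
    def __zarr_serialize__(
        self, codec: Codec, ctx: Optional["SerializationContext"] = None
    ) -> "ZarrSerialization":
        # return the JSON-able data plus enough provenance to rehydrate it
        ...
```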
where the return value carries the JSON-able `data` plus enough provenance (`source_object`, `metadata`) to rehydrate the object downstream.

And yes, this would be one of the purposes of providing a serialization hook: being able to serialize things that aren't currently supported, like references, in such a way that the downstream application can understand how to deserialize them without needing to overcomplicate the base zarr library.
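For example, references could just be an application-level JSON convention plus a resolver on load - purely illustrative, nothing zarr defines:

```python
import zarr

# Hypothetical convention: a reference is a small JSON-able payload
# pointing at a store and a path within it.
def encode_ref(store: str, path: str) -> dict:
    return {"$ref": {"store": store, "path": path}}

def resolve_ref(ref: dict) -> zarr.Array:
    # The application resolves the payload back to a live array on load.
    target = ref["$ref"]
    return zarr.open_array(store=target["store"], path=target["path"], mode="r")
```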
What class(es) in zarr-python would implement this?
potentially none, if you didn't want to use it internally - it would be something called during the various methods like the one here:

zarr-python/src/zarr/core/array.py, line 501 in 60b4f57

such that an object defines a `__zarr_serialize__` method and it returns whatever is expected there. Otherwise it would be on the `Array` and `Group` classes, and it seems like the thing that would be returned is `GroupMetadata` or `ArrayMetadata`. So maybe another idea would be to have two separate methods like `__zarr_array__` or `__zarr_group__` for an object to declare whether it should be treated like an array or a group, if those things are handled separately.
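e.g., a rough sketch of the two-method idea (a hypothetical protocol, nothing zarr defines today):

```python
from typing import Any

import numpy as np

class Trajectory:
    """An object that opts into being stored as a zarr *group*."""

    def __init__(self) -> None:
        self.some_field = "whatever"
        self.latitude = np.zeros((10, 1))
        self.longitude = np.zeros((10, 1))
        self.time = np.arange(10)

    def __zarr_group__(self) -> dict[str, Any]:
        # attributes become group attributes; array-valued members
        # become member arrays of the group
        return {
            "attributes": {"some_field": self.some_field},
            "members": {
                "latitude": self.latitude,
                "longitude": self.longitude,
                "time": self.time,
            },
        }
```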
edit: or maybe another way would be to pass the metadata objects around directly.

I haven't read the v3 spec or implementation yet, but if this was something y'all might be interested in, i could do a more thorough proposal that includes potential implementations - at this point i'm just pitching an idea that amounts to "i would really like to be able to hook into the zarr serialization process so that I can encode models that contain arrays natively," but again i would love to help implement it if there is interest
Zarr version
v2.18.2
Numcodecs version
v0.13.0
Python Version
3.11
Operating System
Mac
Installation
pip :)
Description
When using an object (specifically a pydantic model) as the `fill_value` in `zarr.full`, the metadata encoding step fails to encode (pickle) the model. It is instead passed unencoded to the JSON encoder, which chokes.

Steps to reproduce
Additional output
Failure happens in `encode_array_metadata`, where it tries to call `json_dumps` on the metadata dict containing the raw fill value, which, of course, fails :(
(sorry, some of my values are different between the meta dict and the example - running this from my tests atm, but it can be reproduced just by running the example)