I will summarize a few concerns I have about the way codecs are handled in the v3 spec, and propose some changes that I think could improve this situation.
the codec problem space
We need Zarr implementations across multiple languages to agree on standard JSON serialization for different codecs. This protects users from fragmentation, e.g. a situation where we end up with multiple flavors of JSON serialization for the same popular codec. At the same time, we want to make it easy for users to experiment with and create new codecs; this enables users to get the most from Zarr.
Also, codecs are generally useful for users outside of Zarr. There are plenty of non-Zarr use cases for compressing / rearranging array data. So I think the codec standardization should support these non-Zarr use cases.
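To make the fragmentation risk concrete, here is a minimal sketch of two JSON flavors for the same Gzip codec: the flat, `id`-keyed form used by numcodecs in Zarr v2 metadata, and the `name` / `configuration` form used in Zarr v3 metadata. The `normalize` helper is hypothetical and is not part of any Zarr library; it only illustrates the reconciliation work every consumer would have to do if flavors multiplied.

```python
from __future__ import annotations

import json

# numcodecs-style document (Zarr v2): flat object keyed by "id".
v2_style = json.loads('{"id": "gzip", "level": 5}')

# Zarr v3-style document: "name" plus a nested "configuration" object.
v3_style = json.loads('{"name": "gzip", "configuration": {"level": 5}}')


def normalize(codec: dict) -> tuple[str, dict]:
    """Hypothetical helper: reduce either flavor to (name, parameters)."""
    if "name" in codec:
        return codec["name"], dict(codec.get("configuration", {}))
    if "id" in codec:
        params = {k: v for k, v in codec.items() if k != "id"}
        return codec["id"], params
    raise ValueError(f"unrecognized codec document: {codec!r}")


# Both flavors describe the same codec once normalized.
same_codec = normalize(v2_style) == normalize(v3_style)
```

Every additional flavor adds another branch to a helper like this, in every language that reads the data.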
concerns with codecs in the v3 spec
- The v3 spec explicitly states that it does not define a list of codecs, yet it does define one. We can't have blatant contradictions in the spec, so this needs to be sorted out at a minimum, regardless of whatever other decisions we make. The contradiction between the text of the spec and the codec definitions has already caused confusion in a pull request in zarr-python.
- Suppose we resolve the above contradiction by stating that Zarr v3 does in fact define a fixed set of codecs, which are listed in the spec. This leads to two sub-problems:
  - How does someone design and use a new codec? We cannot require PRs against the spec for every new codec. If writing a new codec started with getting a PR accepted in zarr-specs, nobody would ever write a new codec.
  - What happens if an implementation does not support a codec from the standard list? There is no enforcement mechanism for the requirement that an implementation support that fixed set, so in practice the requirement is toothless, which means it cannot be a requirement. Requirements in the spec should be restricted to essential features, and supporting the Gzip compressor is simply not essential for users who don't work with Gzip-compressed data. So any list of codecs should be a recommendation, not a requirement.
- The v3 spec states that the unique identifier for a codec must be "... a URI that dereferences to a human-readable specification of the codec". Software cannot check whether a URI dereferences to a human-readable document. If we want Zarr v3 hierarchies to be validated by software, we must remove this requirement.
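The last point can be illustrated with a small sketch. A validator can check that an identifier is syntactically shaped like a URI, but nothing in that check says anything about what the URI points to. The function name and the example URL below are assumptions made for illustration, and the check assumes http(s)-style URIs with a network location.

```python
from urllib.parse import urlparse


def looks_like_uri(identifier: str) -> bool:
    """Syntactic check only: the identifier has a scheme and a network location.

    This says nothing about whether the URI resolves, and certainly nothing
    about whether the target is a human-readable codec specification.
    """
    parts = urlparse(identifier)
    return bool(parts.scheme) and bool(parts.netloc)


# Hypothetical identifiers, for illustration only.
uri_ok = looks_like_uri("https://example.com/codecs/gzip")
bare_name_ok = looks_like_uri("gzip")
```

Even a validator that went further and fetched the URI could only confirm that some document came back, not that a human would find it readable or that it specifies the codec.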
how to resolve these concerns
I don't think naming a closed set of "official codecs" in the spec is realistic. There is no enforcement mechanism, and ultimately users don't care if an implementation doesn't support a codec they don't use. That is, if an implementation doesn't support codec X, and none of the users of that implementation use codec X, then IMO this is fine.
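As a rough sketch of what this looks like in practice, an implementation can compare the codecs named in an array's metadata against the set it actually supports, and fail only when the data in hand needs something it lacks. The `SUPPORTED` set and the `unsupported_codecs` function are hypothetical; the metadata layout assumes the v3 convention of a `codecs` array whose entries carry a `name`.

```python
from __future__ import annotations

# Hypothetical: the codecs one particular implementation chose to support.
SUPPORTED = {"bytes", "gzip"}


def unsupported_codecs(array_metadata: dict, supported: set[str]) -> list[str]:
    """Return the names of codecs in the metadata that are not supported."""
    return [
        codec["name"]
        for codec in array_metadata.get("codecs", [])
        if codec["name"] not in supported
    ]


# Illustrative metadata fragments, not taken from a real dataset.
gzip_array = {
    "codecs": [
        {"name": "bytes", "configuration": {"endian": "little"}},
        {"name": "gzip", "configuration": {"level": 5}},
    ]
}
blosc_array = {
    "codecs": [
        {"name": "bytes", "configuration": {"endian": "little"}},
        {"name": "blosc", "configuration": {"cname": "zstd", "clevel": 5}},
    ]
}

missing_for_gzip = unsupported_codecs(gzip_array, SUPPORTED)
missing_for_blosc = unsupported_codecs(blosc_array, SUPPORTED)
```

Users whose arrays look like `gzip_array` never notice that this implementation lacks Blosc; only users of `blosc_array` do, and they get a precise error rather than a spec violation.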
To express this differently, I think the Zarr spec should not enumerate the features / behavior an implementation must have. The Zarr spec should just describe the Zarr format, and we leave it to implementations to choose how they implement that format.
Extending this logic, the Zarr format is actually agnostic with respect to particular codecs. So specific codecs should not appear in the Zarr spec! I actually think codecs should be defined entirely in another spec, and we refer to this spec in the Zarr spec, e.g. "codecs is a JSON array of JSON objects that implement the Numcodecs spec (link to the numcodecs spec)" (we can choose a different name for the codecs spec, but it shouldn't refer to zarr).
Recall that in Zarr v2, codecs were basically standardized by the behavior of the numcodecs Python library, which was a stand-alone library with no Zarr dependency. I think this illustrates the right relationship between codecs and the Zarr format, but we shouldn't rely on a Python library to define a standard for a cross-language concern. Zarr v3 tries to fix the latter problem by folding codec definitions into the spec itself, but as I have argued, this introduces a different set of problems. The solution is to define codecs separately, and make the Zarr spec depend on that codec spec. The codec specification can manage a registry of codecs, etc., thereby abstracting the current behavior of numcodecs in a language-agnostic way.
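To give a feel for what a registry in a separate codec spec might provide, here is a minimal in-memory sketch. Everything in it is an assumption made for illustration: the `REGISTRY` structure, the `register` and `validate` functions, and the choice to describe a codec only by its permitted parameter names. A real registry would more likely be a set of published documents than a Python dict.

```python
from __future__ import annotations

# Hypothetical registry: codec name -> set of permitted parameter names.
REGISTRY: dict[str, set[str]] = {}


def register(name: str, parameters: set[str]) -> None:
    """Add a codec definition; refuse to silently redefine an existing one."""
    if name in REGISTRY:
        raise ValueError(f"codec {name!r} is already registered")
    REGISTRY[name] = set(parameters)


def validate(codec: dict) -> None:
    """Check a {"name": ..., "configuration": {...}} document against the registry."""
    name = codec["name"]
    if name not in REGISTRY:
        raise KeyError(f"codec {name!r} is not registered")
    unknown = set(codec.get("configuration", {})) - REGISTRY[name]
    if unknown:
        raise ValueError(f"unknown parameters for {name!r}: {sorted(unknown)}")


# Registering a new codec needs no change to the Zarr spec itself.
register("gzip", {"level"})
validate({"name": "gzip", "configuration": {"level": 5}})
```

Nothing here depends on Zarr, which is the point: any project that wants to describe array compression in JSON could validate against the same registry.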
Another advantage of a separate spec for codecs is that this spec could be used by any project that wants to compress arrays in a standard way. There is nothing Zarr-specific about serializing Gzip parameters to JSON, so let's reflect this in the structure of the specification document.
tl;dr: I think the list of codecs in v3 is trying to solve a problem (a language-agnostic list of codecs) that we can solve in a better way: by migrating the codec specification from Zarr v3 into its own spec.
is this too much churn in the spec?
I know it sucks to hear complaints about the spec after it's been finalized. Sorry. But I want zarr v3 to be really good, and I think the way we do codecs in v3 right now is very problematic; if my concerns are valid, then we owe it to users to get this resolved as soon as possible.