[v3] sharding api #1662
Comments
The sharding codec is more complex than other codecs, but I don't share your worry that it is too complex. It is nicely encapsulated within the ShardingCodec.
That is handled through the partial encode/decode interface. Currently, only sharding implements that. However, as you noted elsewhere, other codecs (e.g. BytesCodec) could also make use of these interfaces. Also, both are optional mixins, so new codecs don't have to worry about that at all.
Yes, but the same codec pipeline is reused. So, there is no duplication in terms of abstractions.
We could see if we could increase code reuse from the array. I actually like that the complexity of sharding is captured within the codec. It composes quite nicely. I don't see how it would become less complex if other parts of the code also needed to know about sharding. I guess a more concrete proposal would be required. Please note that sharding can be arbitrarily nested.
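To make the "optional mixin" point concrete, here is a minimal sketch of how a reader could use partial decoding when a codec offers it and fall back to a full decode otherwise. All names here (`Codec`, `PartialDecodeMixin`, `IdentityCodec`, `SliceableCodec`, `read_range`) are hypothetical and chosen for illustration; this is not the actual zarr-python interface.

```python
class Codec:
    """Hypothetical base codec: decodes a whole buffer."""

    def decode(self, data: bytes) -> bytes:
        raise NotImplementedError


class PartialDecodeMixin:
    """Hypothetical optional mixin: decodes only a byte range."""

    def decode_partial(self, data: bytes, start: int, stop: int) -> bytes:
        raise NotImplementedError


class IdentityCodec(Codec):
    """A codec that does not opt in to partial decoding."""

    def decode(self, data: bytes) -> bytes:
        return data


class SliceableCodec(Codec, PartialDecodeMixin):
    """A codec that opts in to partial decoding."""

    def decode(self, data: bytes) -> bytes:
        return data

    def decode_partial(self, data: bytes, start: int, stop: int) -> bytes:
        # Only the requested range is touched.
        return data[start:stop]


def read_range(codec: Codec, data: bytes, start: int, stop: int) -> bytes:
    # Use the partial path if the codec provides it; otherwise decode
    # everything and slice afterwards.
    if isinstance(codec, PartialDecodeMixin):
        return codec.decode_partial(data, start, stop)
    return codec.decode(data)[start:stop]
```

Under these assumptions, a codec that ignores the mixin still works through the fallback path, which is why new codecs would not need to know about partial reads.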
In #1670, I refactored the codec APIs a bit. The code of the sharding codec is now much closer to the code of the array because both delegate most of their reading/writing logic to the |
The sharding codec in v3 stands out as significantly more complex than the other codecs.
I propose that the extra complexity of the sharding codec is pressure that can be relieved by introducing abstraction higher in the stack.
I'm not sure what this abstraction should look like. But I feel like we should start from a change in perspective: we should think of every zarr array as sharded (even v2 arrays), with a variable number of chunks per shard, starting with 1 chunk per shard. This means every zarr chunk is addressed not just by the string name of the object it is stored in, but also by a byte range in that object. So we expand the type of the chunk keys, and we start thinking of every array as requiring some sharding machinery for chunk IO.
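As a rough sketch of the expanded chunk key type, assuming a store that maps object names to bytes: the `ChunkAddress` type and the `read_chunk` helper below are hypothetical names for illustration, not existing zarr-python API. An unsharded chunk is just the case where the byte range covers the whole object.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class ChunkAddress:
    """Hypothetical chunk key: an object name plus an optional byte range."""

    key: str
    # (offset, length) within the object; None means the whole object,
    # i.e. the "1 chunk per shard" case.
    byte_range: Optional[Tuple[int, int]] = None


def read_chunk(store: Mapping[str, bytes], addr: ChunkAddress) -> bytes:
    """Fetch the bytes for one chunk from a dict-like store."""
    data = store[addr.key]
    if addr.byte_range is None:
        return data
    offset, length = addr.byte_range
    return data[offset:offset + length]


# One object holding a single chunk, and one holding two chunks.
store = {"c/0/0": b"AAAA", "c/0/1": b"BBBBCCCC"}
whole = read_chunk(store, ChunkAddress("c/0/0"))
second = read_chunk(store, ChunkAddress("c/0/1", (4, 4)))
```

A real store would issue a ranged read rather than fetch the full object and slice it; the in-memory slice here only stands in for that.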
Sorry for not making this more concrete -- it's late, and I really don't have concrete ideas for this yet. But I think the complexity gradient in the v3 codec code is a red flag we can't ignore.
Do you have any ideas here, @normanrz?