[v3] sharding api #1662

Closed
d-v-b opened this issue Feb 8, 2024 · 2 comments

Comments

@d-v-b
Contributor

d-v-b commented Feb 8, 2024

The sharding codec in v3 stands out as significantly more complex than the other codecs.

  • unlike the other codecs which transform in-memory data, the sharding codec accesses storage to get data.
  • the sharding codec has its own codecs,
  • the sharding codec has its own implementations of chunk reading and writing.

I propose that the extra complexity of the sharding codec is pressure that can be relieved by introducing abstraction higher in the stack.

I'm not sure what this abstraction should look like. But I feel like we should start from a change in perspective: we should think of every zarr array as sharded (even v2 arrays), with a variable number of chunks per shard, starting with 1 chunk per shard. This means every zarr chunk is addressed not just by the string name of the object it is stored in, but also by a byte range in that object. So we expand the type of the chunk keys, and we start thinking of every array as requiring some sharding machinery for chunk IO.
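One way to picture the expanded chunk key described above is the minimal sketch below. The names `ChunkAddress` and `read_chunk` are hypothetical and not part of zarr-python, and a plain dict stands in for a store. An unsharded chunk is simply the case where the byte range is absent and the whole object is the chunk.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ChunkAddress:
    """Hypothetical expanded chunk key: object name plus an optional byte range."""

    key: str
    # (offset, length) within the stored object; None means "the whole object",
    # which is the 1-chunk-per-shard case.
    byte_range: Optional[Tuple[int, int]] = None


def read_chunk(store: Dict[str, bytes], addr: ChunkAddress) -> bytes:
    """Fetch a chunk's bytes from a dict that stands in for a store."""
    data = store[addr.key]
    if addr.byte_range is None:
        return data
    offset, length = addr.byte_range
    return data[offset:offset + length]


# Two stored objects: one holds a single chunk, the other holds two chunks.
store = {"c/0/0": b"AAAA", "c/0/1": b"BBBBCCCC"}
unsharded = ChunkAddress("c/0/0")
sharded = ChunkAddress("c/0/1", (4, 4))
```

Under this view the same `read_chunk` path serves both layouts; only the address differs.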

Sorry for not making this more concrete -- it's late, and I really don't have concrete ideas for this yet. But I think the complexity gradient in the v3 codec code is a red flag we can't ignore.

do you have any ideas here @normanrz?

@normanrz
Member

normanrz commented Feb 9, 2024

The sharding codec is more complex than the other codecs, but I don't share your worry that it is too complex. The complexity is nicely encapsulated within the ShardingCodec.

  • unlike the other codecs which transform in-memory data, the sharding codec accesses storage to get data.

That is handled through the partial encode/decode interface. Currently, only sharding implements it. However, as you noted elsewhere, other codecs (e.g. BytesCodec) could also make use of these interfaces. Also, both are optional mixins, so new codecs don't have to worry about them at all.
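A rough sketch of the "optional mixin" idea, for illustration only: every class and function name here is hypothetical and simplified, and none of it reflects the actual zarr-python signatures. A reader checks whether a codec opts in to partial decoding and otherwise falls back to decoding the whole buffer.

```python
from typing import Callable


class Codec:
    """Hypothetical minimal codec: decodes a full in-memory buffer."""

    def decode(self, data: bytes) -> bytes:
        raise NotImplementedError


class PartialDecodeMixin:
    """Hypothetical optional mixin: decode a slice by fetching only the bytes needed."""

    def decode_partial(
        self, fetch: Callable[[int, int], bytes], start: int, stop: int
    ) -> bytes:
        raise NotImplementedError


class PlainCodec(Codec):
    """A codec that does not opt in; it only sees whole buffers."""

    def decode(self, data: bytes) -> bytes:
        return data


class RawBytesCodec(Codec, PartialDecodeMixin):
    """A codec that opts in; for uncompressed bytes a slice maps to a byte range."""

    def decode(self, data: bytes) -> bytes:
        return data

    def decode_partial(self, fetch, start, stop):
        return fetch(start, stop - start)


def read_slice(codec: Codec, blob: bytes, start: int, stop: int) -> bytes:
    """Use partial decoding when the codec supports it, else decode everything."""
    if isinstance(codec, PartialDecodeMixin):
        # fetch(offset, length) stands in for a ranged read against storage.
        return codec.decode_partial(lambda off, n: blob[off:off + n], start, stop)
    return codec.decode(blob)[start:stop]
```

Codecs that ignore the mixin keep working unchanged, which is the point made above about new codecs not having to care.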

  • the sharding codec has its own codecs,

Yes, but the same codec pipeline is reused. So, there is no duplication in terms of abstractions.

We could see if we could increase code reuse from the array.

I actually like that the complexity of sharding is captured within the codec. It composes quite nicely. I don't see how it would become less complex if different parts of the code also need to know about sharding. I guess a more concrete proposal would be required. Please note that sharding can be arbitrarily nested.
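To illustrate the nesting point, here is a sketch of what nested sharding codec metadata could look like, built as plain Python dicts. The field names follow the v3 `sharding_indexed` codec specification as I understand it; the helper functions `sharding` and `nesting_depth` and the chunk shapes are made up for this example.

```python
BYTES = {"name": "bytes", "configuration": {"endian": "little"}}


def sharding(chunk_shape, codecs):
    """Hypothetical helper: build a sharding codec metadata dict."""
    return {
        "name": "sharding_indexed",
        "configuration": {
            "chunk_shape": chunk_shape,
            "codecs": codecs,  # the inner codec pipeline, itself a list of codecs
            "index_codecs": [BYTES, {"name": "crc32c"}],
        },
    }


def nesting_depth(codecs):
    """Count how many sharding levels are nested inside a codec list."""
    depth = 0
    for codec in codecs:
        if codec["name"] == "sharding_indexed":
            inner = codec["configuration"]["codecs"]
            depth = max(depth, 1 + nesting_depth(inner))
    return depth


# A shard whose inner chunks are themselves shards.
inner = sharding([8, 8], [BYTES])
outer = sharding([64, 64], [inner])
```

Because the inner `codecs` entry is just another codec list, any code outside the codec that wants to know about sharding would have to handle this recursion as well.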

@jhamman jhamman moved this to Todo in Zarr-Python - 3.0 Apr 5, 2024
@jhamman jhamman added this to the 3.0.0.alpha milestone Apr 5, 2024
@jhamman jhamman moved this from Todo to In Progress in Zarr-Python - 3.0 Apr 5, 2024
@jhamman jhamman moved this from In Progress to Todo in Zarr-Python - 3.0 Apr 5, 2024
@normanrz
Member

In #1670, I refactored the codec APIs a bit. The code of the sharding codec is now much closer to the code of the array because both delegate most of their reading/writing logic to the CodecPipeline. This is facilitated by the ByteGetter and ByteSetter type protocols that proxy either StorePaths or shard representations.
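The proxy idea can be sketched roughly as follows. This is a simplified, synchronous illustration with hypothetical classes (`StorePathProxy`, `ShardEntryProxy`) and dicts standing in for a store and a shard; the real ByteGetter and ByteSetter protocols in zarr-python may have different signatures.

```python
from typing import Dict, Optional, Protocol, Tuple


class ByteSetter(Protocol):
    """Simplified stand-in for the getter/setter protocols: anything with get and set."""

    def get(self) -> Optional[bytes]: ...

    def set(self, value: bytes) -> None: ...


class StorePathProxy:
    """Hypothetical proxy for one key in a store (a dict here)."""

    def __init__(self, store: Dict[str, bytes], key: str) -> None:
        self.store, self.key = store, key

    def get(self) -> Optional[bytes]:
        return self.store.get(self.key)

    def set(self, value: bytes) -> None:
        self.store[self.key] = value


class ShardEntryProxy:
    """Hypothetical proxy for one inner chunk of an in-memory shard representation."""

    def __init__(self, shard: Dict[Tuple[int, ...], bytes], coords: Tuple[int, ...]) -> None:
        self.shard, self.coords = shard, coords

    def get(self) -> Optional[bytes]:
        return self.shard.get(self.coords)

    def set(self, value: bytes) -> None:
        self.shard[self.coords] = value


def write_then_read(target: ByteSetter, payload: bytes) -> Optional[bytes]:
    """Pipeline-style code that does not care what kind of target it is given."""
    target.set(payload)
    return target.get()
```

The same `write_then_read` works for a whole stored object and for an inner chunk of a shard, which is how array code and sharding code can share one read/write path.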

@jhamman jhamman closed this as completed May 17, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in Zarr-Python - 3.0 May 17, 2024