[v3] sharding api #1662

d-v-b · 2024-02-08T22:09:58Z

The sharding codec in v3 stands out as significantly more complex than the other codecs.

unlike the other codecs which transform in-memory data, the sharding codec accesses storage to get data.
the sharding codec has its own codecs,
the sharding codec has its own implementations of chunk reading and writing.

I propose that the extra complexity of the sharding codec is pressure that can be relieved by introducing abstraction higher in the stack.

I'm not sure what this abstraction should look like. But I feel like we should start from a change in perspective: we should think of every zarr array as sharded (even v2 arrays), with a variable number of chunks per shard, starting with 1 chunk per shard. This means every zarr chunk is addressed not just by the string name of the object it is stored in, but also by a byte range in that object. So we expand the type of the chunk keys, and we start thinking of every array as requiring some sharding machinery for chunk IO.
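To make the "every array is sharded" idea a bit more tangible, here is a minimal sketch of what an expanded chunk key could look like. Everything in it is an assumption for illustration: `ChunkAddress`, `read_chunk`, and the dict-backed store are hypothetical names, not part of zarr-python.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ChunkAddress:
    """Hypothetical expanded chunk key: object name plus an optional byte range."""

    key: str
    # (offset, length) inside the stored object; None means "the whole object",
    # which is the 1-chunk-per-shard case.
    byte_range: Optional[Tuple[int, int]] = None


def read_chunk(store: dict[str, bytes], addr: ChunkAddress) -> bytes:
    """Fetch the bytes of one chunk from a dict standing in for a store."""
    data = store[addr.key]
    if addr.byte_range is None:
        return data
    offset, length = addr.byte_range
    return data[offset : offset + length]


# A toy store: one object holding three 4-byte chunks, one holding a single chunk.
store = {"c/0/0": b"AAAABBBBCCCC", "c/0/1": b"ZZZZ"}

unsharded = ChunkAddress("c/0/1")        # whole object is the chunk
sharded = ChunkAddress("c/0/0", (4, 4))  # second chunk inside the shard
```

With this shape, `read_chunk` has a single code path for both cases, and an unsharded array is simply the special case where `byte_range` is `None`.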

Sorry for not making this more concrete -- it's late, and I really don't have concrete ideas for this yet. But I think the complexity gradient in the v3 codec code is a red flag we can't ignore.

Do you have any ideas here, @normanrz?

normanrz · 2024-02-09T11:02:55Z

The sharding codec is more complex than the other codecs, but I don't share your worry that it is too complex. The complexity is nicely encapsulated within the ShardingCodec.

> unlike the other codecs which transform in-memory data, the sharding codec accesses storage to get data.

That is handled through the partial encode/decode interface. Currently, only sharding implements that. However, as you noted elsewhere, other codecs (e.g. BytesCodec) could also make use of these interfaces. Also, both are optional mixins, so new codecs don't have to worry about them at all.
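As a rough illustration of the optional-mixin idea, here is a simplified, synchronous sketch. The class and method names (`Codec`, `PartialDecodeMixin`, `decode_partial`, `read_slice`) are made up for this example and are not the actual zarr-python signatures.

```python
from abc import ABC, abstractmethod
from typing import Callable


class Codec(ABC):
    """Hypothetical base codec: transforms in-memory bytes."""

    @abstractmethod
    def decode(self, data: bytes) -> bytes: ...


class PartialDecodeMixin(ABC):
    """Hypothetical optional mixin: the codec may pull byte ranges from storage itself."""

    @abstractmethod
    def decode_partial(
        self, read_range: Callable[[int, int], bytes], start: int, stop: int
    ) -> bytes: ...


class ReverseCodec(Codec):
    """A plain codec that knows nothing about partial decoding."""

    def decode(self, data: bytes) -> bytes:
        return data[::-1]


class IdentityBytesCodec(Codec, PartialDecodeMixin):
    """A codec whose stored layout equals the decoded layout, so ranges map 1:1."""

    def decode(self, data: bytes) -> bytes:
        return data

    def decode_partial(self, read_range, start, stop):
        # Only the requested bytes are fetched from storage.
        return read_range(start, stop - start)


def read_slice(codec: Codec, stored: bytes, start: int, stop: int) -> bytes:
    """Use the partial path when the codec opts in, else decode everything."""
    if isinstance(codec, PartialDecodeMixin):
        return codec.decode_partial(lambda off, n: stored[off : off + n], start, stop)
    return codec.decode(stored)[start:stop]
```

The caller checks for the mixin once; codecs that don't opt in, like `ReverseCodec` here, never see the storage-facing interface.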

> the sharding codec has its own codecs,

Yes, but the same codec pipeline is reused. So, there is no duplication in terms of abstractions.

> the sharding codec has its own implementations of chunk reading and writing.

We could look into reusing more of the array's chunk reading and writing code here.

I actually like that the complexity of sharding is captured within the codec. It composes quite nicely. I don't see how it would become less complex if different parts of the code also need to know about sharding. I guess a more concrete proposal would be required. Please note that sharding can be arbitrarily nested.

normanrz · 2024-04-22T12:28:34Z

In #1670, I refactored the codec APIs a bit. The code of the sharding codec is now much closer to the code of the array because both delegate most of their reading/writing logic to the CodecPipeline. This is facilitated by the ByteGetter and ByteSetter type protocols that proxy either StorePaths or shard representations.
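A simplified sketch of how such a protocol lets the array and the sharding codec share one read path. This is only an illustration under stated assumptions: it is synchronous, the `get()` signature is reduced to the bare minimum, and `StorePathGetter`, `ShardChunkGetter`, and `read_chunk` are hypothetical names rather than the code from #1670.

```python
from typing import Dict, Optional, Protocol, Tuple


class ByteGetter(Protocol):
    """Simplified stand-in for the protocol: anything that can return chunk bytes."""

    def get(self) -> Optional[bytes]: ...


class StorePathGetter:
    """Hypothetical proxy for a key in a store (the array-level case)."""

    def __init__(self, store: Dict[str, bytes], key: str) -> None:
        self.store = store
        self.key = key

    def get(self) -> Optional[bytes]:
        return self.store.get(self.key)


class ShardChunkGetter:
    """Hypothetical proxy for one inner chunk of an in-memory shard."""

    def __init__(
        self,
        shard: bytes,
        index: Dict[Tuple[int, ...], Tuple[int, int]],  # coords -> (offset, length)
        coords: Tuple[int, ...],
    ) -> None:
        self.shard = shard
        self.index = index
        self.coords = coords

    def get(self) -> Optional[bytes]:
        entry = self.index.get(self.coords)
        if entry is None:
            return None  # chunk is missing from the shard
        offset, length = entry
        return self.shard[offset : offset + length]


def read_chunk(getter: ByteGetter) -> Optional[bytes]:
    """Shared read path: it does not care where the bytes come from."""
    return getter.get()


store = {"c/0": b"AAAABBBB"}
index = {(0,): (0, 4), (1,): (4, 4)}
```

Because a shard is itself just bytes with an index, nesting falls out naturally: the bytes returned by one `ShardChunkGetter` can serve as the `shard` of another.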

d-v-b added the V3 label Feb 8, 2024

d-v-b added this to Zarr-Python - 3.0 Feb 8, 2024

jhamman moved this to Todo in Zarr-Python - 3.0 Apr 5, 2024

jhamman added this to the 3.0.0.alpha milestone Apr 5, 2024

jhamman moved this from Todo to In Progress in Zarr-Python - 3.0 Apr 5, 2024

jhamman moved this from In Progress to Todo in Zarr-Python - 3.0 Apr 5, 2024

jhamman added the design discussion label Apr 5, 2024

jhamman closed this as completed May 17, 2024

github-project-automation bot moved this from Todo to Done in Zarr-Python - 3.0 May 17, 2024
