
add storage_transformers and get/set_partial_values #1096


Conversation

@jstriebel (Member) commented Jul 28, 2022

This PR adds two features of the current v3 spec to the zarr-python implementation:

  • storage transformers
  • get_partial_values and set_partial_values

The current spec is not yet finalized and is awaiting final review and approval of ZEP001. Since all v3 features are currently behind the ZARR_V3_EXPERIMENTAL_API flag, I think it might still be fine to incorporate these features to gain early feedback.

Both features are prerequisites for sharding as described here, which I'd like to add as a follow-up PR, together with a new ZEP for it.
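For orientation, here is a rough, self-contained sketch of the two new methods against a stand-in dict instead of a real store ("x/0" is a hypothetical chunk key; the (key, (start, length)) read form matches the examples later in this thread, while the (key, start, value) write form is an assumption):

```python
store = {"x/0": bytes(range(32))}  # stand-in mapping holding one 32-byte "chunk"

def get_partial_values(key_ranges):
    # one buffer per requested (key, (start, length)) range
    return [store[key][start:start + length] for key, (start, length) in key_ranges]

def set_partial_values(key_start_values):
    # assumed form: (key, start, new_bytes) triples
    for key, start, value in key_start_values:
        buf = bytearray(store[key])
        buf[start:start + len(value)] = value
        store[key] = bytes(buf)

assert get_partial_values([("x/0", (0, 16))]) == [bytes(range(16))]
set_partial_values([("x/0", 16, b"\x00" * 16)])  # overwrite bytes 16..31
```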

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst (v3 features seem not to be documented yet?)
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)
  • Add follow-up issue to optimize get_partial_values and set_partial_values for specific stores: v3 stores: Implement efficient get/set_partial_values #1106

@jstriebel marked this pull request as ready for review July 28, 2022 15:56
@jstriebel (Member Author)

The Linux tests fail due to some s3 tests, and the Windows tests time out in the s3 fixture setup; both also fail on other branches and seem to be unrelated to this PR.

@jstriebel (Member Author)

@joshmoore & @grlee77 Would one of you have time to review my changes here? Asking since you authored the initial v3 additions here. From my side this PR is ready for review.

@grlee77 (Contributor) left a comment

Thanks, @jstriebel! I have some minor suggestions, but this looks pretty good already. I think we will get more experience and a better idea of a real-world use case with the follow-up sharding PR.

A couple more thoughts that just came up from an end-user perspective:

  • does passing storage_transformers=[] work equivalently to storage_transformers=None?
  • are there plans to expose storage_transformers via the higher level convenience/creation functions? (e.g. zarr/creation.py)

@jstriebel (Member Author) commented Jul 29, 2022

I have some minor suggestions, but this looks pretty good already.

Thank you so much for the review, @grlee77! I applied your feedback in the recent commits.

I think we will get more experience and a better idea of a real-world use case with the follow-up sharding PR.

Yes, I think the follow-up PR will help those methods make more sense.

  • does passing storage_transformers=[] work equivalently to storage_transformers=None?

Yes, it does; I added a note about this in the docstring. I also thought about just having an empty list as the default, but mutable objects such as lists are discouraged as default argument values in Python.
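(A minimal sketch of that pitfall, in plain Python; nothing zarr-specific:)

```python
def append_bad(item, acc=[]):   # the [] is created once and shared by all calls
    acc.append(item)
    return acc

append_bad(1)  # [1]
append_bad(2)  # [1, 2]  <- state leaks between calls

def append_ok(item, acc=None):  # the usual workaround: a None sentinel
    acc = [] if acc is None else acc
    acc.append(item)
    return acc
```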

  • are there plans to expose storage_transformers via the higher level convenience/creation functions? (e.g. zarr/creation.py)

I think it is already exposed via create. I also thought about exposing specific shorthand parameters for single storage transformers, e.g. create(…, chunks_per_shard=(32, 32, 32)), but that's not strictly necessary. Are there any other places where storage transformers should be passed that I'm missing?

@joshmoore (Member) left a comment

@jstriebel I think it might still be fine to incorporate these features to gain early feedback.

Concur. ❤️ for hammering on the Python API a bit while we also debate the spec.

I also thought about just having an empty list as the default, but mutable objects such as lists are discouraged as default argument values in Python.

And a tuple?

"""Base class for storage transformers. The methods simply pass on the data as-is
and should be overwritten by sub-classes."""

_store_version = 3

Member:

Slight surprise to find this in store.py rather than under v3.

@jstriebel (Member Author):

I assumed that the base classes go into store.py, similar to StoreV3, but I'm also happy to move this into _storage/v3.py if that's a better fit.
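For illustration, a pass-through subclass in the spirit of that base class might look roughly like this (hypothetical names, simplified; not the actual zarr-python API):

```python
class PassThroughTransformer:
    """Hypothetical transformer that forwards everything to an inner store.

    A real subclass would override only the methods whose behavior it
    wants to change (e.g. rewriting keys or regrouping chunks).
    """

    _store_version = 3

    def __init__(self):
        self.inner_store = None  # assumed to be bound when attached to an array

    def __getitem__(self, key):
        return self.inner_store[key]

    def __setitem__(self, key, value):
        self.inner_store[key] = value

    def __delitem__(self, key):
        del self.inner_store[key]

    def get_partial_values(self, key_ranges):
        return self.inner_store.get_partial_values(key_ranges)
```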

@jakirkham (Member)

cc @martindurant (in case this is of interest)

@martindurant (Member)

@jakirkham, I must be missing something: allowing partial byte ranges to pass through to the storage layer and storage transformers seem completely orthogonal to me.

In the case of this code, on this line we get the full chunk and then slice it anyway. Is the intent for some storage classes (fsspec!) to subclass this and implement a new get_partial_values?

@jstriebel (Member Author)

allowing partial byte ranges to pass through to the storage layer and transformers seem completely orthogonal to me.

Yes, it is to some extent. I could also split them into separate PRs. My motivation was to implement sharding, and I tried to keep the groundwork for it in a separate PR; that's how both features ended up here.

In the case of this code, on this line we get the full chunk and then slice it anyway. Is the intent for some storage classes (fsspec!) to subclass this and implement a new get_partial_values?

Yes, the idea is to override those functions for stores that allow partial access. Here I just implemented the API and the fallback strategy (sketched below); I'll open an issue for the optimized paths as a follow-up, as also mentioned in the TODOs in the description above:

  • Add follow-up issue to optimize get_partial_values and set_partial_values for specific stores
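(A minimal sketch of the fallback strategy, assuming a store that exposes a plain mapping interface and the convention that a length of None means "to the end of the value"; the real implementation handles more edge cases:)

```python
def get_partial_values_fallback(store, key_ranges):
    """Naive fallback: fetch each full value, then slice out the range."""
    results = []
    for key, (start, length) in key_ranges:
        value = store[key]                        # full read (the slow path)
        end = None if length is None else start + length
        results.append(value[start:end])          # keep only the requested bytes
    return results
```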

Does that answer your questions, @martindurant?

@martindurant (Member)

I didn't mean to denigrate this PR; I know very little about how the storage transformers or sharding are to be implemented. I am wondering how the new get_partial_values would get called, and then it sounds like we can agree on something that works for us all.

@jstriebel (Member Author)

I also thought about just having an empty list as the default, but mutable objects such as lists are discouraged as default argument values in Python.

And a tuple?

Good idea, I changed the default to an empty tuple now 👍

@codecov (bot) commented Aug 1, 2022

Codecov Report

Merging #1096 (eba9006) into main (1fd607a) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              main     #1096    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           35        35            
  Lines        14157     14388   +231     
==========================================
+ Hits         14157     14388   +231     
Impacted Files Coverage Δ
zarr/creation.py 100.00% <ø> (ø)
zarr/_storage/store.py 100.00% <100.00%> (ø)
zarr/core.py 100.00% <100.00%> (ø)
zarr/meta.py 100.00% <100.00%> (ø)
zarr/storage.py 100.00% <100.00%> (ø)
zarr/tests/test_core.py 100.00% <100.00%> (ø)
zarr/tests/test_creation.py 100.00% <100.00%> (ø)
zarr/tests/test_storage_v3.py 100.00% <100.00%> (ø)

@martindurant (Member)

The methods can just be called where appropriate

Can I be super annoying and ask for the smallest possible example here? That other PR and thread are very long and not focused on partial reads.

x = group.create_dataset(name="x", data=np.array([1,2,3,4]), compression=None, chunks=(4, ))
x[:2]

where the second line should read only 16 bytes from the store, not all 32, if the store supports partials.

The context would then need to force stores into different behavior, and the returned content would depend heavily on the context.

I don't follow this. For the above case,

context={"key": "data/0.0", "out_selection": [slice(0, 2)], "chunk_selection": [slice(0, 2)], "meta_array": np.ndarray, "dtype": np.int64}

where all these things are known in Array._chunk_getitem (meta_array is new and shows that we can do more with this idea, from the other PR). The returned bytes for the chunk would be

b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00'

plus 16 more bytes of empty/zero padding, if we can't just truncate instead.
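(Worked arithmetic for the example above, assuming an uncompressed little-endian int64 chunk:)

```python
import numpy as np

chunk = np.array([1, 2, 3, 4], dtype=np.int64)
itemsize = chunk.dtype.itemsize               # 8 bytes per element
start, length = 0, 2 * itemsize               # x[:2] -> bytes 0..15 of the 32
assert chunk.tobytes()[start:start + length] == (
    b"\x01\x00\x00\x00\x00\x00\x00\x00"
    b"\x02\x00\x00\x00\x00\x00\x00\x00"
)
```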

@jstriebel (Member Author)

@martindurant No worries! This is a short example for your case:

x = group.create_dataset(name="x", data=np.array([1,2,3,4]), compression=None, chunks=(4, ))
x[:2]

# with PR #1111 in place, the indexing would do:
store.get_partial_values([(store_key, (0, 2 * itemsize))])
# where store_key is probably something like "x/0", and itemsize=8 for the int64 dtype
# returning [b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00']

(For uncompressed partial reads, the implementation is here in PR #1111.)

Also the tests have some more examples:
https://github.com/zarr-developers/zarr-python/pull/1096/files#diff-49f50043d526642ad824a8d6a944b896540238b06ec11f182476ff8fdb2b5977R257-R316

I think I finally understand where you are coming from. With the context variant, the parts that are not requested would simply be filled, I guess (or truncated, as you mentioned). However, I'd prefer to only have a buffer for the actually requested parts of the chunk, which might be spread through the chunk, so one single buffer would not be enough. Therefore the return type would need to be different for partial reads, which fits more nicely with a new method IMO. Does that make sense? A sketch of what I mean is below.
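(Self-contained illustration with a stand-in dict store and a hypothetical key; each requested range comes back as its own buffer:)

```python
import numpy as np

# Stand-in for a store holding one uncompressed 4-element int64 chunk:
data = {"x/0": np.array([1, 2, 3, 4], dtype=np.int64).tobytes()}

def get_partial_values(key_ranges):
    return [data[key][start:start + length] for key, (start, length) in key_ranges]

first, last = get_partial_values([
    ("x/0", (0, 8)),    # first element
    ("x/0", (24, 8)),   # last element; not contiguous with the first range
])
# Two separate buffers come back; a single buffer would have to pad or
# include the unrequested bytes in between.
```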

@martindurant (Member)

OK, I got it: so the code that calls the methods introduced here is actually in the other PR. That call is not really to do with sharding, which is why I got confused. Actually, I don't see how that uncompressed partial buffer gets used either; there seem to be many levels here.

Are you saying that you want a single call to chunk-getitem to produce many buffers, which will then be filled into the target output array? That would work with contexts too, get([key1, key1], contexts=[context1, context2]). I think in the proposal contexts= was a dict, so it would indeed need to be a list instead. That makes the API the same as here (and we could even have both).

@jstriebel (Member Author)

Are you saying that you want a single call to chunk-getitem to produce many buffers, which will then be filled into the target output array?

Exactly 👍

That would work with contexts too, get([key1, key1], contexts=[context1, context2]). I think in the proposal contexts= was a dict, so it would indeed need to be a list instead. That makes the API the same as here (and we could even have both).

That would work, but it would also make the signature of the get call very complicated (the return type would depend on the supplied arguments). Personally I'm in favor of the current variant with the additional method. @grlee77 @joshmoore @normanrz any other opinions here? I'd like to get this merged soon-ish, so I can finalize #1111 as well.

@joshmoore (Member) left a comment

Two more general points, more geared towards potential future refactorings. I don't see anything that causes alarm, knowing that v3 is still in flux and protected by the environment variables. The primary question is whether you would see this still getting into a 2.13 release (especially if one were to come out over the holidays), or whether it would make more sense to roll it into 2.14.

Anyone else have thoughts?

@@ -1800,7 +1813,7 @@ def _set_selection(self, indexer, value, fields=None):
         check_array_shape('value', value, sel_shape)

         # iterate over chunks in range
-        if not hasattr(self.store, "setitems") or self._synchronizer is not None \
+        if not hasattr(self.chunk_store, "setitems") or self._synchronizer is not None \

Member:

I have to wonder if this wouldn't ever need to be the transformed_chunk_store. In general, this chain of replacement stores feels slightly like an anti-pattern.

@jstriebel (Member Author):

Yes, I'm also not particularly happy with this pattern, but I couldn't come up with a better solution for now. Happy to hear any ideas 👍

@@ -459,6 +463,36 @@ def _decode_codec_metadata(cls, meta: Optional[Mapping]) -> Optional[Codec]:

return codec

@classmethod
def _encode_storage_transformer_metadata(

Member:

After the splitting of storage.py into separate files, this makes me wonder if meta and some of the other core classes shouldn't also have v3 variants.

@joshmoore (Member)

I should add: re-reviewing the interaction with #1131/meta_array, I start to wonder whether the definitions of "context" and "partial" don't need to end up in the spec, or at least in a clarifying ZEP, together.

@jstriebel (Member Author)

@joshmoore Thanks for the review!

The primary question is whether you would see this still getting into a 2.13 release (especially if one were to come out over the holidays), or whether it would make more sense to roll it into 2.14.

I wouldn't mind having this in the 2.13 release, so we can play around with it a bit already. I can also add the additional flag for sharding in #1111 so we are on the safe side.

I should add: re-reviewing the interaction with #1131, I start to wonder whether the definitions of "context" and "partial" don't need to end up in the spec, or at least in a clarifying ZEP, together.

The partial access is part of the v3 spec already: https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#abstract-store-interface.

@rabernat (Contributor) left a comment

I finally got around to reviewing this PR! 🙃

Overall it makes sense to me. My impression is that more type hints would be helpful in a few places.

@martindurant (Member)

Is there any reason I should not press the big green button? We have three approvals already. Last chance to object!

@joshmoore (Member) commented Jan 16, 2023

Now with the somewhat large 2.13.4 out the door, I'm going to go ahead and click the button (but would have also supported @martindurant's pushing thereof 😉 ❤️)

@joshmoore enabled auto-merge (squash) January 16, 2023 14:43
@joshmoore disabled auto-merge January 16, 2023 14:43
@joshmoore (Member)

@jstriebel: can you please push from scalableminds#3? (I could not push to scalableminds forks as an admin)

cc: @normanrz

@jstriebel (Member Author)

I'm going to go ahead and click the button

🚀

@jstriebel: can you please push from scalableminds#3? (I could not push to scalableminds forks as an admin)

Sure, done 👍

@joshmoore (Member)

Sorry, @jstriebel, but one more needs to go in: #1320

@joshmoore (Member)

Ok. Merging to spare you any more pain!

@joshmoore merged commit 385b5d3 into zarr-developers:main Jan 16, 2023
@sanketverma1704 (Member)

🎉

@d-v-b mentioned this pull request Jan 20, 2023
@jstriebel deleted the storage-transformers-and-partial-get-set branch February 3, 2023 16:04