
add storage_transformers and get/set_partial_values #1096


Conversation

@jstriebel (Member) commented Jul 28, 2022

This PR adds two features of the current v3 spec to the zarr-python implementation:

  • storage transformers
  • get_partial_values and set_partial_values

The current spec is not yet finalized and is awaiting final review and approval of ZEP001. Since all v3 features are currently behind the ZARR_V3_EXPERIMENTAL_API flag, I think it might still be fine to incorporate these features to gain early feedback.

Both features are prerequisites for sharding as described here, which I'd like to add as a follow-up PR, together with a new ZEP for it.
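For orientation, here is a rough, self-contained sketch of the two new methods against a stand-in dict instead of a real store ("x/0" is a hypothetical chunk key; the (key, (start, length)) read form matches the examples later in this thread, while the (key, start, value) write form is an assumption):

```python
store = {"x/0": bytes(range(32))}  # stand-in mapping holding one 32-byte "chunk"

def get_partial_values(key_ranges):
    # one buffer per requested (key, (start, length)) range
    return [store[key][start:start + length] for key, (start, length) in key_ranges]

def set_partial_values(key_start_values):
    # assumed form: (key, start, new_bytes) triples
    for key, start, value in key_start_values:
        buf = bytearray(store[key])
        buf[start:start + len(value)] = value
        store[key] = bytes(buf)

assert get_partial_values([("x/0", (0, 16))]) == [bytes(range(16))]
set_partial_values([("x/0", 16, b"\x00" * 16)])  # overwrite bytes 16..31
```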

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst (v3 features seem not to be documented yet?)
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)
  • Add follow-up issue to optimize get_partial_values and set_partial_values for specific stores: v3 stores: Implement efficient get/set_partial_values #1106

@jstriebel marked this pull request as ready for review July 28, 2022 15:56
@jstriebel (Member Author)

The Linux tests fail due to some s3 tests, and the Windows tests time out in the s3 fixture setup; both also fail on other branches and seem to be unrelated to this PR.

@jstriebel (Member Author)

@joshmoore & @grlee77 Would one of you have time to review my changes here? Asking since you authored the initial v3 additions here. From my side this PR is ready for review.

@grlee77 (Contributor) left a comment

Thanks, @jstriebel! I have some minor suggestions, but this looks pretty good already. I think we will get more experience and a better idea of a real-world use case with the follow-up sharding PR.

A couple more thoughts that just came up from an end-user perspective:

  • does passing storage_transformers=[] work equivalently to storage_transformers=None?
  • are there plans to expose storage_transformers via the higher level convenience/creation functions? (e.g. zarr/creation.py)

@jstriebel (Member Author) commented Jul 29, 2022

I have some minor suggestions, but this looks pretty good already.

Thank you so much for the review, @grlee77! I applied your feedback in the recent commits.

I think we will get more experience and a better idea of a real-world use case with the follow-up sharding PR.

Yes, I think the follow-up PR will help those methods make more sense.

  • does passing storage_transformers=[] work equivalently to storage_transformers=None?

Yes, it does; I added a note about this in the docstring. I also thought about just having an empty list as the default, but mutable objects such as lists are discouraged as default argument values in Python.
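(A minimal sketch of that pitfall, in plain Python; nothing zarr-specific:)

```python
def append_bad(item, acc=[]):   # the [] is created once and shared by all calls
    acc.append(item)
    return acc

append_bad(1)  # [1]
append_bad(2)  # [1, 2]  <- state leaks between calls

def append_ok(item, acc=None):  # the usual workaround: a None sentinel
    acc = [] if acc is None else acc
    acc.append(item)
    return acc
```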

  • are there plans to expose storage_transformers via the higher level convenience/creation functions? (e.g. zarr/creation.py)

I think it is already exposed via create. I also thought about exposing specific shorthand parameters for single storage transformers, e.g. create(…, chunks_per_shard=(32, 32, 32)), but that's not strictly necessary. Are there any other places where storage transformers should be passed that I'm missing?

@joshmoore (Member) left a comment

@jstriebel I think it might still be fine to incorporate these features to gain early feedback.

Concur. ❤️ for hammering on the Python API a bit while we also debate the spec.

I also thought about just having an empty list as the default, but mutable objects such as lists are discouraged as default argument values in Python.

And a tuple?

"""Base class for storage transformers. The methods simply pass on the data as-is
and should be overwritten by sub-classes."""

_store_version = 3

Member:

Slight surprise to find this in store.py rather than under v3.

@jstriebel (Member Author):

I assumed that the base classes go into store.py, similar to StoreV3, but I'm also happy to move this into _storage/v3.py if that's a better fit.
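For illustration, a pass-through subclass in the spirit of that base class might look roughly like this (hypothetical names, simplified; not the actual zarr-python API):

```python
class PassThroughTransformer:
    """Hypothetical transformer that forwards everything to an inner store.

    A real subclass would override only the methods whose behavior it
    wants to change (e.g. rewriting keys or regrouping chunks).
    """

    _store_version = 3

    def __init__(self):
        self.inner_store = None  # assumed to be bound when attached to an array

    def __getitem__(self, key):
        return self.inner_store[key]

    def __setitem__(self, key, value):
        self.inner_store[key] = value

    def __delitem__(self, key):
        del self.inner_store[key]

    def get_partial_values(self, key_ranges):
        return self.inner_store.get_partial_values(key_ranges)
```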

@jakirkham (Member)

cc @martindurant (in case this is of interest)

@martindurant (Member)

@jakirkham, I must be missing something: allowing partial byte ranges to pass through to the storage layer and storage transformers seem completely orthogonal to me.

In the case of this code, on this line we get the full chunk and then slice it anyway. Is the intent for some storage classes (fsspec!) to subclass this and implement a new get_partial_values?

@jstriebel (Member Author)

allowing partial byte ranges to pass through to the storage layer and transformers seem completely orthogonal to me.

Yes, it is to some extent. I could also split them into separate PRs. My motivation was to implement sharding, and I tried to keep the groundwork for it in a separate PR; that's how both features ended up here.

In the case of this code, on this line we get the full chunk and then slice it anyway. Is the intent for some storage classes (fsspec!) to subclass this and implement a new get_partial_values?

Yes, the idea is to override those functions for stores that allow partial access. Here I just implemented the API and the fallback strategy (sketched below); I'll open an issue for the optimized paths as a follow-up, as also mentioned in the TODOs in the description above:

  • Add follow-up issue to optimize get_partial_values and set_partial_values for specific stores
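(A minimal sketch of the fallback strategy, assuming a store that exposes a plain mapping interface and the convention that a length of None means "to the end of the value"; the real implementation handles more edge cases:)

```python
def get_partial_values_fallback(store, key_ranges):
    """Naive fallback: fetch each full value, then slice out the range."""
    results = []
    for key, (start, length) in key_ranges:
        value = store[key]                        # full read (the slow path)
        end = None if length is None else start + length
        results.append(value[start:end])          # keep only the requested bytes
    return results
```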

Does that answer your questions, @martindurant?

@martindurant (Member)

I didn't mean to denigrate this PR; I know very little about how the storage transformers or sharding are to be implemented. I am wondering how the new get_partial_values would get called, and then it sounds like we can agree on something that works for us all.

@jstriebel (Member Author)

I also thought about just having an empty list as the default, but mutable objects such as lists are discouraged as default argument values in Python.

And a tuple?

Good idea, I changed the default to an empty tuple now 👍

@codecov (bot) commented Aug 1, 2022

Codecov Report

Merging #1096 (eba9006) into main (1fd607a) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              main     #1096    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           35        35            
  Lines        14157     14388   +231     
==========================================
+ Hits         14157     14388   +231     
Impacted Files Coverage Δ
zarr/creation.py 100.00% <ø> (ø)
zarr/_storage/store.py 100.00% <100.00%> (ø)
zarr/core.py 100.00% <100.00%> (ø)
zarr/meta.py 100.00% <100.00%> (ø)
zarr/storage.py 100.00% <100.00%> (ø)
zarr/tests/test_core.py 100.00% <100.00%> (ø)
zarr/tests/test_creation.py 100.00% <100.00%> (ø)
zarr/tests/test_storage_v3.py 100.00% <100.00%> (ø)

@martindurant (Member)

The methods can just be called where appropriate

Can I be super annoying and ask for the smallest possible example here? That other PR and thread are very long and not focused on partial reads.

x = group.create_dataset(name="x", data=np.array([1,2,3,4]), compression=None, chunks=(4, ))
x[:2]

where the second line should read only 16 bytes from the store, not all 32, if the store supports partials.

The context would then need to force stores into different behavior, and the returned content would depend heavily on the context.

I don't follow this. For the above case,

context={"key": "data/0.0", "out_selection": [slice(0, 2)], "chunk_selection": [slice(0, 2)], "meta_array": np.ndarray, "dtype": np.int64}

where all these things are known in Array._chunk_getitem (meta_array is new and shows that we can do more with this idea, from the other PR). The returned bytes for the chunk would be

b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00'

plus 16 more bytes of empty/zero padding, if we can't just truncate instead.
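(Worked arithmetic for the example above, assuming an uncompressed little-endian int64 chunk:)

```python
import numpy as np

chunk = np.array([1, 2, 3, 4], dtype=np.int64)
itemsize = chunk.dtype.itemsize               # 8 bytes per element
start, length = 0, 2 * itemsize               # x[:2] -> bytes 0..15 of the 32
assert chunk.tobytes()[start:start + length] == (
    b"\x01\x00\x00\x00\x00\x00\x00\x00"
    b"\x02\x00\x00\x00\x00\x00\x00\x00"
)
```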

@jstriebel (Member Author)

@martindurant No worries! This is a short example for your case:

x = group.create_dataset(name="x", data=np.array([1,2,3,4]), compression=None, chunks=(4, ))
x[:2]

# with PR #1111 in place, the indexing would do:
store.get_partial_values([(store_key, (0, 2 * itemsize))])
# where store_key is probably something like "x/0", and itemsize=8 for the int64 dtype
# returning [b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00']

(For uncompressed partial reads, the implementation is here in PR #1111.)

Also the tests have some more examples:
https://github.com/zarr-developers/zarr-python/pull/1096/files#diff-49f50043d526642ad824a8d6a944b896540238b06ec11f182476ff8fdb2b5977R257-R316

I think I finally understand where you are coming from. With the context variant, the parts that are not requested would simply be filled, I guess (or truncated, as you mentioned). However, I'd prefer to only have a buffer for the actually requested parts of the chunk, which might be spread through the chunk, so one single buffer would not be enough. Therefore the return type would need to be different for partial reads, which fits more nicely with a new method IMO. Does that make sense? A sketch of what I mean is below.
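(Self-contained illustration with a stand-in dict store and a hypothetical key; each requested range comes back as its own buffer:)

```python
import numpy as np

# Stand-in for a store holding one uncompressed 4-element int64 chunk:
data = {"x/0": np.array([1, 2, 3, 4], dtype=np.int64).tobytes()}

def get_partial_values(key_ranges):
    return [data[key][start:start + length] for key, (start, length) in key_ranges]

first, last = get_partial_values([
    ("x/0", (0, 8)),    # first element
    ("x/0", (24, 8)),   # last element; not contiguous with the first range
])
# Two separate buffers come back; a single buffer would have to pad or
# include the unrequested bytes in between.
```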

@martindurant (Member)

OK, I got it: so the code that calls the methods introduced here is actually in the other PR. That call is not really to do with sharding, which is why I got confused. Actually, I don't see how that uncompressed partial buffer gets used either; there seem to be many levels here.

Are you saying that you want a single call to chunk-getitem to produce many buffers, which will then be filled into the target output array? That would work with contexts too, get([key1, key1], contexts=[context1, context2]). I think in the proposal contexts= was a dict, so it would indeed need to be a list instead. That makes the API the same as here (and we could even have both).

@jstriebel (Member Author)

Are you saying that you want a single call to chunk-getitem to produce many buffers, which will then be filled into the target output array?

Exactly 👍

That would work with contexts too, get([key1, key1], contexts=[context1, context2]). I think in the proposal contexts= was a dict, so it would indeed need to be a list instead. That makes the API the same as here (and we could even have both).

That would work, but it would also make the signature of the get call very complicated (the return type would depend on the supplied arguments). Personally I'm in favor of the current variant with the additional method. @grlee77 @joshmoore @normanrz any other opinions here? I'd like to get this merged soon-ish, so I can finalize #1111 as well.

@joshmoore (Member) left a comment

Two more general points, more geared towards potential future refactorings. I don't see anything that causes alarm, knowing that v3 is still in flux and protected by the environment variables. The primary question is whether you would see this still getting into a 2.13 release (especially if one were to come out over the holidays), or whether it would make more sense to roll it into 2.14.

Anyone else have thoughts?

@@ -1800,7 +1813,7 @@ def _set_selection(self, indexer, value, fields=None):
         check_array_shape('value', value, sel_shape)

         # iterate over chunks in range
-        if not hasattr(self.store, "setitems") or self._synchronizer is not None \
+        if not hasattr(self.chunk_store, "setitems") or self._synchronizer is not None \

Member:

I have to wonder if this wouldn't ever need to be the transformed_chunk_store. In general, this chain of replacement stores feels slightly like an anti-pattern.

@jstriebel (Member Author):

Yes, I'm also not particularly happy with this pattern, but I couldn't come up with a better solution for now. Happy to hear any ideas 👍

@@ -459,6 +463,36 @@ def _decode_codec_metadata(cls, meta: Optional[Mapping]) -> Optional[Codec]:

return codec

@classmethod
def _encode_storage_transformer_metadata(

Member:

After the splitting of storage.py into separate files, this makes me wonder if meta and some of the other core classes shouldn't also have v3 variants.

@joshmoore (Member)

I should add: re-reviewing the interaction with #1131/meta_array, I start to wonder whether the definitions of "context" and "partial" don't need to end up in the spec, or at least in a clarifying ZEP, together.

@jstriebel (Member Author)

@joshmoore Thanks for the review!

The primary question is whether you would see this still getting into a 2.13 release (especially if one were to come out over the holidays), or whether it would make more sense to roll it into 2.14.

I wouldn't mind having this in the 2.13 release, so we can play around with it a bit already. I can also add the additional flag for sharding in #1111 so we are on the safe side.

I should add: re-reviewing the interaction with #1131, I start to wonder whether the definitions of "context" and "partial" don't need to end up in the spec, or at least in a clarifying ZEP, together.

The partial access is part of the v3 spec already: https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#abstract-store-interface.

@rabernat (Contributor) left a comment

I finally got around to reviewing this PR! 🙃

Overall it makes sense to me. My impression is that more type hints would be helpful in a few places.

@martindurant (Member)

Is there any reason I should not press the big green button? We have three approvals already. Last chance to object!

@joshmoore (Member) commented Jan 16, 2023

Now with the somewhat large 2.13.4 out the door, I'm going to go ahead and click the button (but would have also supported @martindurant's pushing thereof 😉 ❤️)

@joshmoore enabled auto-merge (squash) January 16, 2023 14:43
@joshmoore disabled auto-merge January 16, 2023 14:43
@joshmoore (Member)

@jstriebel: can you please push from scalableminds#3? (I could not push to scalableminds forks as an admin)

cc: @normanrz

@jstriebel (Member Author)

I'm going to go ahead and click the button

🚀

@jstriebel: can you please push from scalableminds#3? (I could not push to scalableminds forks as an admin)

Sure, done 👍

@joshmoore (Member)

Sorry, @jstriebel, but one more needs to go in: #1320

@joshmoore (Member)

Ok. Merging to spare you any more pain!

@joshmoore merged commit 385b5d3 into zarr-developers:main Jan 16, 2023
@sanketverma1704 (Member)

🎉

@d-v-b mentioned this pull request Jan 20, 2023
@jstriebel deleted the storage-transformers-and-partial-get-set branch February 3, 2023 16:04