add storage_transformers and get/set_partial_values #1096
Conversation
The Linux tests fail due to some s3 tests, and the Windows tests time out in the s3 fixture setup; both also fail on other branches and seem to be unrelated to this PR.
@joshmoore & @grlee77 Would one of you have time to review my changes here? I'm asking since you authored the initial v3 additions. From my side this PR is ready for review.
Thanks, @jstriebel! I have some minor suggestions, but this looks pretty good already. I think we will get more experience and a better idea of a real-world use case with the follow-up sharding PR.
A couple more thoughts that just came up from an end-user perspective:
- Does passing storage_transformers=[] work equivalently to storage_transformers=None?
- Are there plans to expose storage_transformers via the higher-level convenience/creation functions (e.g. zarr/creation.py)?
Co-authored-by: Gregory Lee <[email protected]>
Thank you so much for the review, @grlee77! I applied your feedback in the recent commits.
Yes, I think the follow-up PR will help those methods make more sense.
Yes, it does, I added a notice for it in the docstring. I also thought about just having an empty list as the default, but lists as default arguments are somewhat deprecated due to their mutability.
I think it is already exposed via
@jstriebel I think it might still be fine to incorporate those features to gain early feedback.
Concur. ❤️ for hammering on the Python API a bit while we also debate the spec.
I also thought about just having an empty list as the default, but lists as default arguments are somewhat deprecated due to their mutability.
And a tuple?
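For readers following along, a minimal illustration of the trade-off being discussed here; the function names below are made up for this example and are not from the PR.

```python
# Illustrative only: the mutable-default pitfall, and the tuple alternative.
def create_with_list(storage_transformers=[]):   # the same list object is reused across calls
    storage_transformers.append("oops")
    return storage_transformers

def create_with_tuple(storage_transformers=()):  # an immutable empty tuple is safe to share
    return list(storage_transformers)

print(create_with_list())   # ['oops']
print(create_with_list())   # ['oops', 'oops']  -- state leaks between calls
print(create_with_tuple())  # []
print(create_with_tuple())  # []
```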
"""Base class for storage transformers. The methods simply pass on the data as-is | ||
and should be overwritten by sub-classes.""" | ||
|
||
_store_version = 3 |
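As an illustration of how such a pass-through base class is meant to be used, here is a hypothetical sketch; the class names, attributes, and methods below are invented for this example and are not the actual API added in this PR.

```python
# Hypothetical sketch only; the real base class in this PR may differ.
class PassThroughTransformer:
    _store_version = 3

    def __init__(self, inner_store):
        # The next layer down: another transformer or the actual store.
        self.inner_store = inner_store

    def __getitem__(self, key):
        return self.inner_store[key]      # default: forward reads unchanged

    def __setitem__(self, key, value):
        self.inner_store[key] = value     # default: forward writes unchanged


class LoggingTransformer(PassThroughTransformer):
    def __getitem__(self, key):
        print(f"read {key}")              # a subclass overrides only what it changes
        return super().__getitem__(key)
```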
Slight surprise to find this in store.py rather than under v3.
I assumed that the base classes go into store.py, similar to StoreV3, but I'm also happy to move this into _storage/v3.py if that's a better fit.
cc @martindurant (in case this is of interest)
@jakirkham, I must be missing something: allowing partial byte ranges to pass through to the storage layer and the transformers seem completely orthogonal to me. In the case of this code, on this line we get the full chunk and then slice it anyway. Is the intent for some storage classes (fsspec!) to subclass this and implement a new get_partial_values?
Yes, it is to some extent. I could also have them as separate PRs. My motivation was to implement sharding, and I tried to keep the bases for it as a separate PR; that's how they both ended up here.
Yes, the idea is to override those functions for stores that allow partial access. Here I just implemented the API and the fallback strategy; I'll open an issue for the optimized paths as a follow-up, as also mentioned in the TODOs in the description above.
Does that answer your questions, @martindurant?
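To make that fallback concrete, a rough sketch of the idea; the argument layout of (key, (start, length)) pairs is an approximation, not necessarily the exact signature added in this PR.

```python
# Approximate sketch of the fallback path: read the whole value, then slice out each range.
def get_partial_values_fallback(store, key_ranges):
    results = []
    for key, (start, length) in key_ranges:
        value = store[key]                         # full read from the underlying store
        results.append(value[start:start + length])
    return results
```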
I didn't mean to denigrate this PR; I know very little about how the storage transformers or sharding are to be implemented. I am wondering how the new get_partial_values would get called, and then it sounds like we can agree on something that works for us all.
Good idea, I changed the default to an empty tuple now 👍
Codecov Report
@@ Coverage Diff @@
## main #1096 +/- ##
==========================================
Coverage 100.00% 100.00%
==========================================
Files 35 35
Lines 14157 14388 +231
==========================================
+ Hits 14157 14388 +231
Can I be super annoying and ask for a smallest possible example here? That other PR and thread is very long and not focussed on partial reads.
x = group.create_dataset(name="x", data=np.array([1,2,3,4]), compression=None, chunks=(4, ))
x[:2]
where the second line should read only 16 bytes from the store, not all 32, if the store supports partials.
I don't follow this. For the above case,
where all these things are known in Array._chunk_getitem (meta_array is new and shows that we can do more with this idea, from the other PR). Return bytes for the chunk would be
and 16 more bytes of or empty/zeros; if we can't just truncate instead.
@martindurant No worries! This is a short example for your case:
(For uncompressed partial reads, the implementation is here in PR #1111.) Also, the tests have some more examples. I think I finally understand where you are coming from: with the context variant, the parts that are not requested would simply be filled, I guess (or truncated, as you mentioned). However, I'd prefer to only have a buffer for the actually requested parts of the chunk, which might be spread through the chunk, so one single buffer would not be enough. Therefore the return type would need to be different for the partial reads, which fits more nicely with a new method IMO. Does that make sense?
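A sketch of the kind of call this argues for: several disjoint ranges, possibly of the same chunk, each returned as its own buffer rather than as one contiguous chunk-sized buffer. The key names and the exact method signature below are hypothetical.

```python
def read_scattered_ranges(store):
    # `store` is assumed to expose get_partial_values as added in this PR.
    ranges = [
        ("data/root/x/c0", (0, 16)),   # first 16 bytes of chunk c0
        ("data/root/x/c0", (48, 16)),  # a later 16-byte span of the same chunk
        ("data/root/x/c1", (0, 8)),    # the start of a different chunk
    ]
    return store.get_partial_values(ranges)   # one buffer per requested range
```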
OK, I got it: so the code that calls the methods introduced here is actually in the other PR. That call is not really to do with sharding, which is why I got confused. Actually, I don't see how that uncompressed-partial-buffer gets used either; there seem to be many levels here. Are you saying that you want a single call to chunk-getitem to produce many buffers, which will then be filled into the target output array? That would work with contexts too: get([key1, key1], contexts=[context1, context2]). I think in the proposal contexts= was a dict, so it would indeed need to be a list instead. That makes the API the same as here (and we could even have both).
Exactly 👍
That would work, but it would also make the signature of the get call very complicated (the return type would depend on the supplied arguments). Personally I'm in favor of the current variant with the additional method. @grlee77 @joshmoore @normanrz, any other opinions here? I'd like to get this merged soon-ish, so I can finalize #1111 as well.
Two more general points, more geared towards potential future refactorings. I don't see anything that causes alarm, knowing that v3 is still in flux and protected by the environment variables. The primary question is whether you would still see getting this into a 2.13 release (especially if one were to come out over the holidays) or whether it would make more sense to roll it into 2.14.
Anyone else have thoughts?
@@ -1800,7 +1813,7 @@ def _set_selection(self, indexer, value, fields=None):
     check_array_shape('value', value, sel_shape)

     # iterate over chunks in range
-    if not hasattr(self.store, "setitems") or self._synchronizer is not None \
+    if not hasattr(self.chunk_store, "setitems") or self._synchronizer is not None \
I have to wonder if this wouldn't ever need to be the transformed_chunk_store. In general, this chain of replacement stores feels slightly like an anti-pattern.
Yes, I'm also not particularly happy with this pattern, but also couldn't come up with a better solution for now. Happy about any ideas 👍
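Purely to illustrate the chain being discussed: each transformer wraps the next layer, with the real store innermost, and the array talks to the outermost layer. The helper and attribute below are hypothetical, not code from this PR.

```python
# Hypothetical illustration of the wrapping chain; not the actual implementation.
def build_chunk_store(base_store, storage_transformers=()):
    store = base_store
    for transformer in reversed(storage_transformers):
        transformer.inner_store = store   # hypothetical attribute: wrap the layer below
        store = transformer
    return store                          # what the array would then use as its chunk_store
```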
@@ -459,6 +463,36 @@ def _decode_codec_metadata(cls, meta: Optional[Mapping]) -> Optional[Codec]:

         return codec

+    @classmethod
+    def _encode_storage_transformer_metadata(
After the splitting of storage.py into separate files, this makes me wonder if meta and some of the other core classes shouldn't also have v3 variants.
I should add: re-reviewing the interaction with #1131/meta_array, I start to wonder whether the definitions of "context" and "partial" don't need to end up in the spec, or at least in a clarifying ZEP, together.
@joshmoore Thanks for the review!
I wouldn't mind having this in the 2.13 release, so we can play around with it a bit already. I can also add the additional flag for sharding in #1111 so we are on the safe side.
The partial access is part of the v3 spec already: https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#abstract-store-interface.
I finally got around to reviewing this PR! 🙃
Overall it makes sense to me. My impression is that more type hints would be helpful in a few places.
Is there any reason I should not press the big green button? We have three approvals already. Last chance to object!
Now with the somewhat large 2.13.4 out the door, I'm going to go ahead and click the button (but would have also supported @martindurant's pushing thereof 😉 ❤️)
…and-partial-get-set
@jstriebel: can you please push from scalableminds#3? (I could not push to scalableminds forks as an admin) cc: @normanrz
…-get-set Storage transformers and partial get set
🚀
Sure, done 👍
Sorry, @jstriebel, but one more needs to go in: #1320
Ok. Merging to spare you any more pain!
🎉
This PR adds two features of the current v3 spec to the zarr-python implementation:
- storage_transformers
- get_partial_values and set_partial_values
The current spec is not finalized and is waiting for review and approval of ZEP001. Since all v3 features are currently behind the ZARR_V3_EXPERIMENTAL_API flag, I think it might still be fine to incorporate those features to gain early feedback. Both features are prerequisites for sharding as described here, which I'd like to add as a follow-up PR, together with a new ZEP for it.
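To make the storage_transformers half of the description concrete, a purely illustrative sketch of the idea: the array metadata carries an ordered list of transformer descriptions, which the implementation turns into the wrapping chain sketched earlier in the thread. The key names below are invented for illustration; consult the v3 spec and the ZEP for the actual schema.

```python
# Illustrative metadata shape only; the real key names are defined by the v3 spec.
array_metadata = {
    "shape": [100],
    "data_type": "<i4",
    "storage_transformers": [
        {"type": "example-transformer", "configuration": {"setting": 1}},
    ],
}
```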
TODO:
- get_partial_values and set_partial_values for specific stores: v3 stores: Implement efficient get/set_partial_values #1106