[v3] Batch array / group access #1805
I would also add a data-getter, so in the first model, something like
or similar. In addition, a convenience addition to the functions above
(where store and options are fixed). I am not suggesting these are the right signatures, but this is the functionality I would want. After all, "open many nodes" is sort of already covered in v2 for the special case of consolidated metadata (one call, no more latency, rather than many concurrent calls).
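For reference, a minimal sketch of the v2 consolidated-metadata pattern mentioned above, which loads all hierarchy metadata in a single call; the store path and node name below are hypothetical, and the store is assumed to already contain a written hierarchy:

```python
# Minimal sketch of zarr v2's consolidated-metadata pattern (hypothetical
# local store path and node name; assumes the hierarchy already exists).
import zarr

# One-time step: gather all group/array metadata into a single .zmetadata key.
zarr.consolidate_metadata("example.zarr")

# Later: one metadata read on open, then any node is reachable without
# further store round-trips.
root = zarr.open_consolidated("example.zarr", mode="r")
arr = root["some_group/some_array"]
```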
Just noting that in the context of xarray, some function or an
Just dropping by to say that a feature to enable concurrently loading many (e.g. 50-100) small (~1MB) arrays would be very useful :) let me know if I can support in any way.
You should already be able to use the async class to open them and then `asyncio.gather` to await a bunch of them. Providing this as a convenience method callable from sync code is a good idea. xarray does something similar to read coordinates of a large hierarchy on open.
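For example, a rough sketch of that pattern (not an official zarr convenience API; the store URL and array paths below are made-up placeholders):

```python
# Rough sketch of the suggestion above: open many nodes concurrently using
# zarr's asynchronous API plus asyncio.gather. The URL and paths here are
# hypothetical placeholders, not a real dataset.
import asyncio

import zarr.api.asynchronous as async_api


async def open_many(store_url, paths):
    # Each open() runs concurrently; gather preserves the input order.
    return await asyncio.gather(
        *(async_api.open(store=store_url, path=p, mode="r") for p in paths)
    )


arrays = asyncio.run(open_many("s3://bucket/data.zarr", ["a/0", "a/1", "b/0"]))
```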
I am struggling with conflicting event loops when I try to do this, but this may just be me not understanding how to do it properly. E.g.
Do you have a link to this code? It would be a helpful pointer.
@oliverwm1 do you have a runnable code example?
My issues seem to be due to trying to use the sync and async versions of zarr in the same script. I was doing that because I was profiling each. For example, the following code fails:
Traceback:
But if you comment out the
I'm also confused! I could get this to work, but only by using anonymous credentials:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "zarr",
#     "fsspec",
#     "gcsfs"
# ]
# ///
import asyncio

import zarr


async def open(url):
    return await zarr.api.asynchronous.open(store=url)


url = 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'

# open with zarr sync API
# note the anonymous credentials
store = zarr.storage.FsspecStore.from_url(url, storage_options={'token': 'anon'})
group = zarr.open_group(store, mode='r')

# open with zarr async API
g = asyncio.run(open(url))
print(g)
# <AsyncGroup <FsspecStore(GCSFileSystem, gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3)>>
```

I don't know why the choice of credentials would matter here.
I can only suppose that zarr isn't being careful about passing asynchronous=True when required, and ending up with the same instance in both cases. You can test this by looking at
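In case it helps, a hedged guess at the kind of check meant here (not the commenter's exact suggestion): inspect the fsspec filesystem instance that zarr's store actually holds and whether it was constructed in asynchronous mode.

```python
# Hedged, illustrative check only: look at the fsspec filesystem object that
# zarr's FsspecStore is holding. fsspec caches filesystem instances, so a
# synchronous instance created earlier could be handed back here; the
# `asynchronous` attribute shows which mode it was constructed in.
import zarr

url = 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'
store = zarr.storage.FsspecStore.from_url(url, storage_options={'token': 'anon'})

fs = store.fs
print(type(fs).__name__, id(fs), fs.asynchronous)
```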
Interesting, yes that also made the script work for me.
Maybe this is the same issue: #2946
In v3, since the storage API is asynchronous, we can open multiple arrays or groups concurrently. This would be broadly useful, but we don't have a good template from `zarr-python` v2 to extrapolate from, so we have to invent something new here (new relative to `zarr-python`, that is).

Over in #1804 @martindurant brought this up, and I suggested something like this:

I was imagining that the arguments to these functions would be the paths of arrays / groups anywhere in a Zarr hierarchy; we could also have a `group.open_groups()` method which can only "see" sub-groups, and similarly for `group.open_arrays()`.

An alternative would be to use a more general transactional context manager:

I'm a lot less sure of this second design, since I have never implemented anything like it. For example, should we use futures for the results of `tx.open_array()`?

Are there other ideas, or examples from other implementations / domains we could draw from?
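For discussion only, here is a rough sketch of what the second, transactional design could look like if `tx.open_array()` returned futures. Every name and signature here is an assumption, not something proposed in the issue, and it presumes the block is driven from synchronous code with no event loop already running.

```python
# Illustrative sketch only, not a zarr-python proposal: one possible shape for
# the transactional context manager, where tx.open_array() hands back a
# Future that is resolved concurrently when the with-block exits. All names
# here are hypothetical; it assumes no event loop is already running.
import asyncio
from concurrent.futures import Future

import zarr.api.asynchronous as async_api


class OpenTransaction:
    def __init__(self, store):
        self.store = store
        self._pending = []  # (Future, path) pairs requested inside the block

    def open_array(self, path):
        fut = Future()  # resolved in __exit__, once all opens have completed
        self._pending.append((fut, path))
        return fut

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        async def _resolve():
            # One concurrent batch of metadata reads for everything requested.
            return await asyncio.gather(
                *(async_api.open(store=self.store, path=path, mode="r")
                  for _, path in self._pending)
            )

        for (fut, _), node in zip(self._pending, asyncio.run(_resolve())):
            fut.set_result(node)
        return False
```

Usage would then be `with OpenTransaction(store) as tx: fut = tx.open_array("a/0")`, with `fut.result()` available once the block exits; whether that future-based ergonomics is worth the extra machinery is exactly the open question above.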