Avoid Adapters in task graphs? #1895
Yes, that should be the case here. Each of these array objects is very lightweight and should pickle/unpickle quickly. On the other hand, once evaluated they do correspond to a large chunk of data (entire arrays); if the future needs to be evaluated before being passed around, that would be a problem. Getitem fusing is pretty essential here for performance.
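To make the fusing point concrete, here is a minimal sketch (a plain NumPy array stands in for the backend adapter) showing dask's optimizer fusing a getitem into the underlying chunk-getter task:

```python
import numpy as np
import dask
import dask.array as da

x = da.from_array(np.arange(16).reshape(4, 4), chunks=(2, 2))
y = x[:2, :2]  # a getitem layered on top of the chunk-getter tasks

print(len(dict(y.__dask_graph__())))      # unoptimized: slice tasks are separate
(y_opt,) = dask.optimize(y)
print(len(dict(y_opt.__dask_graph__())))  # smaller: the getitem is fused into the
                                          # getter and unused chunks are culled
```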
Do these objects happen to store any cached results? I'm seeing odd performance issues around these objects and am curious about any ways in which they might be fancy.
Any concerns about recreating these objects for every access?
I don't think there's any caching here. All of these objects are stateless, though.
No, not particularly, though opening a zarr store could potentially be a little expensive. I'm mostly not sure how this would be done. Currently, we open files, create array objects, do some lazy decoding, and then create dask arrays with `from_array`.
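Roughly, that pipeline could be sketched as follows. `LazyDecodingAdapter` is a hypothetical stand-in for the backend's wrapper classes, and the zarr calls assume the v2-style API:

```python
import numpy as np
import dask.array as da
import zarr

# Build a small store to stand in for an existing dataset.
root = zarr.open_group("example.zarr", mode="w")
root.create_dataset("var", data=np.arange(4000, dtype="i2"), chunks=(1000,))


class LazyDecodingAdapter:
    """Hypothetical stand-in for the backend's lazy-decoding wrapper."""

    def __init__(self, zarr_array, scale=0.5):
        self._array = zarr_array
        self._scale = scale  # stands in for CF-style scale/offset decoding

    ndim = property(lambda self: self._array.ndim)
    shape = property(lambda self: self._array.shape)
    dtype = np.dtype("float64")  # dtype after decoding

    def __getitem__(self, key):
        # Decoding runs only when a chunk is actually read.
        return np.asarray(self._array[key], dtype="float64") * self._scale


# 1. open the store, 2. wrap the raw array, 3. hand it to dask:
group = zarr.open_group("example.zarr", mode="r")
arr = da.from_array(LazyDecodingAdapter(group["var"]), chunks=(1000,))
```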
What makes it expensive?
Maybe we add an option to `from_array` to have it inline the array into the task, rather than create an explicit dependency. This does feel like I'm trying to duct-tape over some underlying problem that I can't resolve, though.
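To make the proposal concrete, here is a sketch of the two graph layouts, using `operator.getitem` in place of dask's real getter and an illustrative key name:

```python
import operator
import numpy as np

arr = np.arange(2000)  # stands in for the backend array object

# Current behavior: the array object gets its own key in the graph,
# and every chunk's task depends on it.
dsk_dependency = {
    "array-original-xyz": arr,
    ("x", 0): (operator.getitem, "array-original-xyz", slice(0, 1000)),
    ("x", 1): (operator.getitem, "array-original-xyz", slice(1000, 2000)),
}

# Proposed inlining: embed the object in each task, so data-access
# tasks have no dependencies, at the cost of serializing the object
# once per task rather than once per worker.
dsk_inlined = {
    ("x", 0): (operator.getitem, arr, slice(0, 1000)),
    ("x", 1): (operator.getitem, arr, slice(1000, 2000)),
}
```

For what it's worth, later dask releases did grow exactly this knob as the `inline_array=` keyword on `from_array`.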
Well, presumably opening a zarr file requires a small amount of IO to read out the metadata.
Ah, this may actually require a non-trivial amount of IO: opening a zarr file currently takes a noticeable amount of time. See pangeo-data/pangeo#99 (comment). Are we doing this on each deserialization?
We're unpickling the zarr objects. I don't know if that requires IO (probably not). |
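One way to check this empirically is to round-trip a zarr array through pickle; a minimal sketch, assuming the zarr v2 API and a local directory store:

```python
import pickle

import numpy as np
import zarr

z = zarr.open("pickle-test.zarr", mode="w", shape=(100,), chunks=(10,), dtype="f8")
z[:] = np.random.random(100)

buf = pickle.dumps(z)   # small: array metadata plus a store reference, no chunk data
z2 = pickle.loads(buf)  # whether this re-reads the .zarray metadata (a small IO hit)
                        # depends on the zarr version and store type
print(len(buf))         # roughly a few hundred bytes, independent of array size
z2[:10]                 # chunk data is only read on access
```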
Where did we land here? Is there an action item that came from this discussion? In my view, the benefit of having consistent `getitem` behavior for all of our backends is worth working through potential hiccups in the way dask interacts with xarray.
If things are operational then we're fine. It may be that a lot of this cost was due to other serialization things in gcsfs, zarr, or elsewhere.
Looking at an `open_zarr` computation from @rabernat, I'm coming across intermediate values in the task graph that hold the backend's array adapter objects. Such an object has many dependents, and so will presumably have to float around the network to all of the workers.
In principle this is fine, especially if the object is cheap to serialize, move, and deserialize, but it does introduce a bit of friction. I'm curious how hard it would be to build task graphs that generate these objects on the fly, or remove them altogether. It is slightly more convenient from a task-scheduling perspective for data-access tasks not to have any dependencies.
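For illustration, the shared-dependency structure is visible in the raw graph of any `from_array`-backed dask array (a NumPy array stands in for the zarr-backed object here):

```python
import numpy as np
import dask.array as da

x = da.from_array(np.ones(4000), chunks=(1000,))
graph = dict(x.__dask_graph__())

# One non-tuple key (typically named like "array-original-*") holds the
# backing object itself; every chunk's getter task is a tuple key that
# depends on it.
original = [k for k in graph if not isinstance(k, tuple)]
getters = [k for k in graph if isinstance(k, tuple)]
print(original)      # a single shared key with many dependents
print(len(getters))  # 4 chunk tasks, all referencing that key
```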