Avoid Adapters in task graphs? #1895

Closed
mrocklin opened this issue Feb 7, 2018 · 13 comments · Fixed by #6566

Comments

@mrocklin
Contributor

mrocklin commented Feb 7, 2018

Looking at an open_zarr computation from @rabernat, I'm coming across intermediate values like the following:

>>> Future('zarr-adt-0f90b3f56f247f966e5ef01277f31374').result()
ImplicitToExplicitIndexingAdapter(array=LazilyIndexedArray(array=<xarray.backends.zarr.ZarrArrayWrapper object at 0x7fa921fec278>, key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))))

This object has many dependents, and so will presumably have to float around the network to all of the workers:

>>> len(dep.dependents)
1781

In principle this is fine, especially if this object is cheap to serialize, move, and deserialize. It does introduce a bit of friction, though. I'm curious how hard it would be to build task graphs that generated these objects on the fly, or else removed them altogether. It would be slightly more convenient from a task-scheduling perspective for data-access tasks to have no dependencies.
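
To make the two graph shapes concrete, here is a minimal sketch using raw dask graph dicts. The key name is taken from the future above; `adapter` is a hypothetical stand-in for the wrapper object, not the real xarray code:

```python
import operator

adapter = ...  # stands in for the ImplicitToExplicitIndexingAdapter above

# Today: one shared key with many dependents (1781 in the example above).
key = "zarr-adt-0f90b3f56f247f966e5ef01277f31374"
graph_with_dependency = {
    key: adapter,
    ("chunk", 0): (operator.getitem, key, (slice(0, 10),)),
    ("chunk", 1): (operator.getitem, key, (slice(10, 20),)),
}

# The alternative: embed the adapter in each task, so that data-access tasks
# have no dependencies and the scheduler is free to place them anywhere.
graph_inlined = {
    ("chunk", 0): (operator.getitem, adapter, (slice(0, 10),)),
    ("chunk", 1): (operator.getitem, adapter, (slice(10, 20),)),
}
```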

@mrocklin
Contributor Author

mrocklin commented Feb 7, 2018

@shoyer
Member

shoyer commented Feb 7, 2018

In principle this is fine, especially if this object is cheap to serialize, move, and deserialize.

Yes, that should be the case here. Each of these array objects is very lightweight and should be quickly pickled/unpickled.

On the other hand, once evaluated these do correspond to a large chunk of data (entire arrays). If this future needs to be evaluated before being passed around, that would be a problem. Getitem fusing is pretty essential here for performance.

@mrocklin
Contributor Author

mrocklin commented Feb 7, 2018

Do these objects happen to store any cached results? I'm seeing odd performance issues around these objects and am curious about any ways in which they might be fancy.

@mrocklin
Contributor Author

mrocklin commented Feb 7, 2018

Any concerns about recreating these objects for every access?

@shoyer
Member

shoyer commented Feb 7, 2018

Do these objects happen to store any cached results? I'm seeing odd performance issues around these objects and am curious about any ways in which they might be fancy.

I don't think there's any caching here. All of these objects are stateless, though ZarrArrayWrapper does point back to a ZarrStore object and a zarr.Group object.

Any concerns about recreating these objects for every access?

No, not particularly, though potentially opening a zarr store could be a little expensive. I'm mostly not sure how this would be done. Currently, we open files, create array objects, do some lazy decoding and then create dask arrays with from_array.
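
Roughly, the pipeline described above looks like the following sketch (paraphrased, with a hypothetical store path and variable name; not the exact xarray backend code):

```python
import dask.array as da
import zarr

# Open the store once; this is where the metadata IO happens.
store = zarr.DirectoryStore("example.zarr")   # hypothetical path
group = zarr.open_group(store, mode="r")
wrapper = group["temperature"]                # stands in for ZarrArrayWrapper
                                              # plus the lazy decoding layers

# Hand the lazy wrapper to dask; it becomes a node in the task graph that
# every chunk's getitem task depends on.
arr = da.from_array(wrapper, chunks=wrapper.chunks)
```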

@mrocklin
Contributor Author

mrocklin commented Feb 7, 2018

No, not particularly, though potentially opening a zarr store could be a little expensive

What makes it expensive?

I'm mostly not sure how this would be done. Currently, we open files, create array objects, do some lazy decoding and then create dask arrays with from_array.

Maybe we add an option to from_array to have it inline the array into the task, rather than create an explicit dependency.

This does feel like I'm trying to duct-tape over some underlying problem that I can't resolve, though.
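
For what it's worth, this is exactly the shape of the option dask eventually grew (see the end of this thread). A sketch of the contrast, using a plain NumPy array as a stand-in for the lazy adapter:

```python
import dask.array as da
import numpy as np

x = np.arange(16).reshape(4, 4)   # stand-in for the lazy adapter object

# Default: `x` becomes one shared key that every chunk's task depends on.
a = da.from_array(x, chunks=(2, 2))

# Inlined: `x` is embedded in each getitem task, so the chunk tasks have no
# dependencies. (This keyword landed in dask long after this comment.)
b = da.from_array(x, chunks=(2, 2), inline_array=True)

# The inlined graph has no standalone key for `x` itself.
print(len(dict(a.__dask_graph__())), len(dict(b.__dask_graph__())))
```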

@shoyer
Member

shoyer commented Feb 7, 2018

What makes it expensive?

Well, presumably opening a zarr file requires a small amount of IO to read out the metadata.
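
As a rough illustration of what that IO is (zarr 2.x conventions; an in-memory store used as a hypothetical stand-in for a real remote store):

```python
import zarr

store = zarr.MemoryStore()            # stand-in for a remote store such as GCS
root = zarr.group(store=store)
root.create_dataset("t", shape=(10,), chunks=(5,))

# Opening the group again must fetch these metadata keys, typically one
# request per key on an object store:
print([k for k in store if k.endswith((".zgroup", ".zarray", ".zattrs"))])
```

On a remote store, each of those keys is a separate round trip, which is where consolidated metadata (a single `.zmetadata` key) later helped.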

@mrocklin
Contributor Author

mrocklin commented Feb 7, 2018

Well, presumably opening a zarr file requires a small amount of IO to read out the metadata.

Ah, this may actually require a non-trivial amount of IO. It currently takes a non-trivial amount of time to read a zarr file. See pangeo-data/pangeo#99 (comment). We're doing this on each deserialization?

@shoyer
Member

shoyer commented Feb 7, 2018

Ah, this may actually require a non-trivial amount of IO. It currently takes a non-trivial amount of time to read a zarr file. See pangeo-data/pangeo#99 (comment). We're doing this on each deserialization?

We're unpickling the zarr objects. I don't know if that requires IO (probably not).
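
One hedged way to check empirically, assuming zarr 2.x semantics (`CountingStore` is made up for this test): wrap the store so key reads are counted, round-trip the array through pickle, and see whether reconstruction hits the store.

```python
import pickle
import zarr

READS = {"n": 0}

class CountingStore(zarr.MemoryStore):
    """In-memory store that counts key reads (illustrative only)."""
    def __getitem__(self, key):
        READS["n"] += 1
        return super().__getitem__(key)

store = CountingStore()
z = zarr.open_array(store=store, mode="w", shape=(100,), chunks=(10,))

payload = pickle.dumps(z)
READS["n"] = 0
z2 = pickle.loads(payload)
# A nonzero count here would mean deserialization does hit the store.
print("store reads during unpickle:", READS["n"])
```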

@jhamman
Member

jhamman commented Mar 9, 2018

Where did we land here? Is there an action item that came from this discussion?

In my view, the benefit of having consistent getitem behavior for all of our backends is worth working through potential hiccups in the way dask interacts with xarray.

@mrocklin
Contributor Author

mrocklin commented Mar 9, 2018 via email

@stale

stale bot commented Dec 12, 2020

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

@TomNicholas
Member

TomNicholas commented May 2, 2022

Maybe we add an option to from_array to have it inline the array into the task, rather than create an explicit dependency.

dask.array.from_array does now have an inline_array option, which I've just exposed in open_dataset in #6566. I think that would be a reasonable way to close this issue?
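
For reference, usage after #6566 looks something like this (the store path is hypothetical; `inline_array` defaults to False):

```python
import xarray as xr

# Inline the backend array objects into each task instead of creating a
# shared graph dependency, per the discussion at the top of this thread.
ds = xr.open_dataset("example.zarr", engine="zarr", chunks={}, inline_array=True)
```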

@stale stale bot removed the stale label May 2, 2022