Avoid Adapters in task graphs? #1895

Closed
mrocklin opened this issue Feb 7, 2018 · 13 comments · Fixed by #6566

Comments

@mrocklin
Contributor

mrocklin commented Feb 7, 2018

Looking at an open_zarr computation from @rabernat, I'm coming across intermediate values like the following:

>>> Future('zarr-adt-0f90b3f56f247f966e5ef01277f31374').result()
ImplicitToExplicitIndexingAdapter(array=LazilyIndexedArray(array=<xarray.backends.zarr.ZarrArrayWrapper object at 0x7fa921fec278>, key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))))

This object has many dependents, and so will presumably have to float around the network to all of the workers:

>>> len(dep.dependents)
1781

In principle this is fine, especially if this object is cheap to serialize, move, and deserialize. It does introduce a bit of friction, though. I'm curious how hard it would be to build task graphs that generated these objects on the fly, or else removed them altogether. It would be slightly more convenient from a task-scheduling perspective for data-access tasks to have no dependencies.
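
To make the two graph shapes concrete, here is a minimal sketch using raw dask graph dicts. The key name is taken from the future above; `adapter` is a hypothetical stand-in for the wrapper object, not the real xarray code:

```python
import operator

adapter = ...  # stands in for the ImplicitToExplicitIndexingAdapter above

# Today: one shared key with many dependents (1781 in the example above).
key = "zarr-adt-0f90b3f56f247f966e5ef01277f31374"
graph_with_dependency = {
    key: adapter,
    ("chunk", 0): (operator.getitem, key, (slice(0, 10),)),
    ("chunk", 1): (operator.getitem, key, (slice(10, 20),)),
}

# The alternative: embed the adapter in each task, so that data-access tasks
# have no dependencies and the scheduler is free to place them anywhere.
graph_inlined = {
    ("chunk", 0): (operator.getitem, adapter, (slice(0, 10),)),
    ("chunk", 1): (operator.getitem, adapter, (slice(10, 20),)),
}
```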

@mrocklin
Contributor Author

mrocklin commented Feb 7, 2018

@shoyer
Member

shoyer commented Feb 7, 2018

In principle this is fine, especially if this object is cheap to serialize, move, and deserialize.

Yes, that should be the case here. Each of these array objects is very lightweight and should be quickly pickled/unpickled.

On the other hand, once evaluated these do correspond to a large chunk of data (entire arrays). If this future needs to be evaluated before being passed around, that would be a problem. Getitem fusing is pretty essential here for performance.

@mrocklin
Contributor Author

mrocklin commented Feb 7, 2018

Do these objects happen to store any cached results? I'm seeing odd performance issues around these objects and am curious about any ways in which they might be fancy.

@mrocklin
Contributor Author

mrocklin commented Feb 7, 2018

Any concerns about recreating these objects for every access?

@shoyer
Member

shoyer commented Feb 7, 2018

Do these objects happen to store any cached results? I'm seeing odd performance issues around these objects and am curious about any ways in which they might be fancy.

I don't think there's any caching here. All of these objects are stateless, though ZarrArrayWrapper does point back to a ZarrStore object and a zarr.Group object.

Any concerns about recreating these objects for every access?

No, not particularly, though potentially opening a zarr store could be a little expensive. I'm mostly not sure how this would be done. Currently, we open files, create array objects, do some lazy decoding and then create dask arrays with from_array.
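
Roughly, the pipeline described above looks like the following sketch (paraphrased, with a hypothetical store path and variable name; not the exact xarray backend code):

```python
import dask.array as da
import zarr

# Open the store once; this is where the metadata IO happens.
store = zarr.DirectoryStore("example.zarr")   # hypothetical path
group = zarr.open_group(store, mode="r")
wrapper = group["temperature"]                # stands in for ZarrArrayWrapper
                                              # plus the lazy decoding layers

# Hand the lazy wrapper to dask; it becomes a node in the task graph that
# every chunk's getitem task depends on.
arr = da.from_array(wrapper, chunks=wrapper.chunks)
```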

@mrocklin
Contributor Author

mrocklin commented Feb 7, 2018

No, not particularly, though potentially opening a zarr store could be a little expensive

What makes it expensive?

I'm mostly not sure how this would be done. Currently, we open files, create array objects, do some lazy decoding and then create dask arrays with from_array.

Maybe we add an option to from_array to have it inline the array into the task, rather than create an explicit dependency.

This does feel like I'm trying to duct-tape over some underlying problem that I can't resolve, though.
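
For what it's worth, this is exactly the shape of the option dask eventually grew (see the end of this thread). A sketch of the contrast, using a plain NumPy array as a stand-in for the lazy adapter:

```python
import dask.array as da
import numpy as np

x = np.arange(16).reshape(4, 4)   # stand-in for the lazy adapter object

# Default: `x` becomes one shared key that every chunk's task depends on.
a = da.from_array(x, chunks=(2, 2))

# Inlined: `x` is embedded in each getitem task, so the chunk tasks have no
# dependencies. (This keyword landed in dask long after this comment.)
b = da.from_array(x, chunks=(2, 2), inline_array=True)

# The inlined graph has no standalone key for `x` itself.
print(len(dict(a.__dask_graph__())), len(dict(b.__dask_graph__())))
```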

@shoyer
Member

shoyer commented Feb 7, 2018

What makes it expensive?

Well, presumably opening a zarr file requires a small amount of IO to read out the metadata.
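
As a rough illustration of what that IO is (zarr 2.x conventions; an in-memory store used as a hypothetical stand-in for a real remote store):

```python
import zarr

store = zarr.MemoryStore()            # stand-in for a remote store such as GCS
root = zarr.group(store=store)
root.create_dataset("t", shape=(10,), chunks=(5,))

# Opening the group again must fetch these metadata keys, typically one
# request per key on an object store:
print([k for k in store if k.endswith((".zgroup", ".zarray", ".zattrs"))])
```

On a remote store, each of those keys is a separate round trip, which is where consolidated metadata (a single `.zmetadata` key) later helped.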

@mrocklin
Contributor Author

mrocklin commented Feb 7, 2018

Well, presumably opening a zarr file requires a small amount of IO to read out the metadata.

Ah, this may actually require a non-trivial amount of IO. It currently takes a non-trivial amount of time to read a zarr file. See pangeo-data/pangeo#99 (comment). We're doing this on each deserialization?

@shoyer
Member

shoyer commented Feb 7, 2018

Ah, this may actually require a non-trivial amount of IO. It currently takes a non-trivial amount of time to read a zarr file. See pangeo-data/pangeo#99 (comment). We're doing this on each deserialization?

We're unpickling the zarr objects. I don't know if that requires IO (probably not).
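
One hedged way to check empirically, assuming zarr 2.x semantics (`CountingStore` is made up for this test): wrap the store so key reads are counted, round-trip the array through pickle, and see whether reconstruction hits the store.

```python
import pickle
import zarr

READS = {"n": 0}

class CountingStore(zarr.MemoryStore):
    """In-memory store that counts key reads (illustrative only)."""
    def __getitem__(self, key):
        READS["n"] += 1
        return super().__getitem__(key)

store = CountingStore()
z = zarr.open_array(store=store, mode="w", shape=(100,), chunks=(10,))

payload = pickle.dumps(z)
READS["n"] = 0
z2 = pickle.loads(payload)
# A nonzero count here would mean deserialization does hit the store.
print("store reads during unpickle:", READS["n"])
```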

@jhamman
Member

jhamman commented Mar 9, 2018

Where did we land here? Is there an action item that came from this discussion?

In my view, the benefit of having consistent getitem behavior for all of our backends is worth working through potential hiccups in the way dask interacts with xarray.

@mrocklin
Contributor Author

mrocklin commented Mar 9, 2018 via email

@stale

stale bot commented Dec 12, 2020

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

@TomNicholas
Member

TomNicholas commented May 2, 2022

Maybe we add an option to from_array to have it inline the array into the task, rather than create an explicit dependency.

dask.array.from_array does now have an inline_array option, which I've just exposed in open_dataset in #6566. I think that would be a reasonable way to close this issue?
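
For reference, usage after #6566 looks something like this (the store path is hypothetical; `inline_array` defaults to False):

```python
import xarray as xr

# Inline the backend array objects into each task instead of creating a
# shared graph dependency, per the discussion at the top of this thread.
ds = xr.open_dataset("example.zarr", engine="zarr", chunks={}, inline_array=True)
```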

@stale stale bot removed the stale label May 2, 2022