Icechunk store #633

tomwhite · 2024-12-04T12:23:03Z

Fixes #628

This is a proof of concept/draft for a store_icechunk function that uses Cubed callbacks to do the Icechunk store merging. The tests pass locally, but not in CI yet. I'm not sure where this code should live.

dcherian · 2024-12-04T13:44:23Z

cubed/icechunk.py

+    *,
+    sources: Union["Array", Sequence["Array"]],
+    targets: List[zarr.Array],
+    executor=None,


Could definitely address this later, but `regions` is quite an important kwarg.

dcherian · 2024-12-04T15:10:29Z

cubed/icechunk.py

+            if self.store is None:
+                self.store = store
+            else:
+                self.store.merge(store.change_set_bytes())


@paraseba shall we commit to icechunk.distributed.merge_stores as public API?

https://github.com/earth-mover/icechunk/blob/a8c33b81d329c32a8e34e7151a1a71620967067a/icechunk-python/python/icechunk/distributed.py#L12-L16

@tomwhite It looks like callbacks are "accumulated" on every worker. Is that right?

@dcherian My only concern with it is the implementation of merge_stores is probably very slow, using no parallelism, but I don't see any issues with making it public

Right, we can fix that, but note that dask and presumably cubed are accumulating on remote workers already, so there is already some parallelism in how it is used.

Moreover, it'd be nice not to have store.change_set_bytes be the public API :)

> It looks like callbacks are "accumulated" on every worker. Is that right?

No, they are being accumulated on the client so there is no parallelism. I don't know what Icechunk is doing here, but would it be possible to merge a batch in one go rather than one at a time? Could that be more efficient?
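As an illustration of the question above, here is a minimal sketch in plain Python that contrasts merging change sets one at a time on the client with merging them in batches. `ChangeSet`, `MergingCallback`, `on_task_end`, and `batch_size` are hypothetical stand-ins invented for this sketch; they are not the Icechunk or Cubed API, and whether a batched merge is actually cheaper depends on how Icechunk implements it.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeSet:
    """Hypothetical stand-in for the chunk writes recorded by one task."""

    chunks: dict = field(default_factory=dict)  # chunk key -> written reference

    def merge(self, *others):
        # Fold any number of change sets into this one in a single call.
        for other in others:
            self.chunks.update(other.chunks)
        return self


class MergingCallback:
    """Accumulates change sets on the client as tasks complete (assumed hook)."""

    def __init__(self, batch_size=1):
        self.batch_size = batch_size
        self.merged = None
        self.pending = []
        self.merge_calls = 0  # counts how often the merge operation runs

    def on_task_end(self, change_set):
        self.pending.append(change_set)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Merge everything that is pending with one merge call.
        if not self.pending:
            return
        if self.merged is None:
            self.merged = ChangeSet()
        self.merged.merge(*self.pending)
        self.merge_calls += 1
        self.pending = []


def run(batch_size, n_tasks=8):
    callback = MergingCallback(batch_size)
    for i in range(n_tasks):
        callback.on_task_end(ChangeSet({f"c/{i}": f"ref-{i}"}))
    callback.flush()  # merge any leftover change sets
    return callback


one_at_a_time = run(batch_size=1)
batched = run(batch_size=4)
```

Both runs end with the same merged contents; the batched run simply issues fewer merge calls, which is where any saving would come from if a single call has a fixed overhead.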

> but would it be possible to merge a batch in one go rather than one at a time

we'll have to build some rust API, we'll get to this eventually.

> No, they are being accumulated on the client so there is no parallelism.

Ah, OK. In that case, how about calling reduction on each array that was written? That way you parallelize the merge across blocks for each array, and then the only serial bit is the merging across arrays, which will be a lot smaller. I considered this approach for dask, but then just wrote out a tree reduction across all chunks.

EDIT: or is the reduction approach not viable because you need to serialize to Zarr at some point?
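The tree reduction mentioned above can be sketched as follows. This is only an illustration under stated assumptions: change sets are modelled as plain dicts, and `merge_pair` and `tree_reduce` are hypothetical helpers, not functions from dask, Cubed, or Icechunk.

```python
def merge_pair(a, b):
    """Hypothetical merge of two change sets, modelled here as dicts."""
    merged = dict(a)
    merged.update(b)
    return merged


def tree_reduce(items, merge=merge_pair):
    """Merge items pairwise, level by level; returns None for empty input."""
    level = list(items)
    if not level:
        return None
    while len(level) > 1:
        # The merges within one level are independent of each other, so a
        # distributed executor could run them on different workers.
        next_level = [
            merge(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:
            next_level.append(level[-1])  # odd item carries over unmerged
        level = next_level
    return level[0]


change_sets = [{f"c/{i}": i} for i in range(5)]
result = tree_reduce(change_sets)
```

With n change sets this takes roughly log2(n) levels instead of n - 1 sequential merges on the client, which is the appeal of the approach when the merges can run remotely.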

> or is the reduction approach not viable because you need to serialize to Zarr at some point?

Yes, that's basically it - Cubed also separates the data paths for array manipulations (contents of the blocks) from the metadata operations (block IDs and - for Icechunk - the changesets). So I think merging in batches would be more feasible.

> we'll have to build some rust API, we'll get to this eventually.

+1

> Yes, that's basically it

:( I was afraid so.

tomwhite · 2024-12-13T14:06:31Z

> I'm not sure where this code should live.

After chatting with @rabernat, @jhamman, and @TomNicholas, we think Icechunk integration can live in this repo as an optional dependency. There may be further changes once Xarray supports pluggable writers, but this PR can go in once tests are passing.

rabernat · 2024-12-13T14:11:49Z

This is amazing!

Note that the Python API is getting a major refactor in earth-mover/icechunk#470 which will almost certainly break this.

tomwhite · 2024-12-13T14:18:45Z

> Note that the Python API is getting a major refactor in earth-mover/icechunk#470 which will almost certainly break this.

@rabernat thanks for the heads up. I'll keep an eye on that PR.

I think the best place to run the Icechunk tests here is in the Zarr v3 tests workflow in CI.

tomwhite · 2025-01-16T14:20:17Z

This is working well now with the latest zarr and icechunk releases. I plan to merge it soon.

TomNicholas · 2025-01-16T14:41:07Z

Awesome @tomwhite !

tomwhite mentioned this pull request Dec 4, 2024

Icechunk integration #628

Closed

dcherian reviewed Dec 4, 2024

tomwhite mentioned this pull request Dec 6, 2024

Support region(s) in to_zarr and store #642

Open

tomwhite force-pushed the icechunk-store branch from a4a84c7 to 471fc9e on December 13, 2024 13:51

tomwhite force-pushed the icechunk-store branch from cb7a1c1 to c21a446 on December 13, 2024 14:24

TomNicholas added the icechunk 🧊 label Dec 13, 2024

tomwhite mentioned this pull request Dec 17, 2024

Add open_virtual_mfdataset zarr-developers/VirtualiZarr#349

Merged

8 tasks

tomwhite force-pushed the icechunk-store branch 2 times, most recently from 719f83c to 7cc3b8e on January 11, 2025 11:44

tomwhite marked this pull request as ready for review January 11, 2025 12:26

tomwhite added 6 commits January 15, 2025 16:35

Add option to return stores in apply_blockwise function to support icechunk

5c3fcfd

Add store_icechunk

defc050

Add icechunk optional dependency

887115f

Run icechunk tests in CI

3f8342c

Typing improvements

977bd53

Don't include icechunk in coverage

Update to Icechunk 0.1.0-alpha.10 and Zarr 3.0.0

d5a6351

tomwhite force-pushed the icechunk-store branch from 5fc8970 to d5a6351 on January 15, 2025 16:38

tomwhite added 2 commits January 16, 2025 10:33

Move icechunk CI tests to own workflow

d91812a

Fix mypy

28f6683

tomwhite merged commit c487014 into main Jan 17, 2025
16 checks passed

tomwhite deleted the icechunk-store branch January 17, 2025 09:09
