Skip to content

Icechunk store #633

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jan 17, 2025
Merged

Icechunk store #633

merged 8 commits into from
Jan 17, 2025

Conversation

tomwhite
Copy link
Member

@tomwhite tomwhite commented Dec 4, 2024

Fixes #628

This is a proof of concept/draft for a store_icechunk function that uses Cubed callbacks to do the Icechunk store merging. The tests pass locally, but not in CI yet. I'm not sure where this code should live.

@tomwhite tomwhite mentioned this pull request Dec 4, 2024
*,
sources: Union["Array", Sequence["Array"]],
targets: List[zarr.Array],
executor=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could definitely address this later but regions is quite an important kwarg.

if self.store is None:
self.store = store
else:
self.store.merge(store.change_set_bytes())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomwhite It looks like callbacks are "accumulated" on every worker. Is that right?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dcherian My only concern with it is the implementation of merge_stores is probably very slow, using no parallelism, but I don't see any issues with making it public

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right we can fix that, but note that dask and presumably cubed are accumulating on remote workers already, so there is already some parallelism in how it is used.

Moreover it'd be nice not to have store.change_set_bytes be the public API :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like callbacks are "accumulated" on every worker. Is that right?

No, they are being accumulated on the client so there is no parallelism. I don't know what Icechunk is doing here, but would it be possible to merge a batch in one go rather than one at a time? Could that be more efficient?

Copy link
Contributor

@dcherian dcherian Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but would it be possible to merge a batch in one go rather than one at a time

we'll have to build some rust API, we'll get to this eventually.

No, they are being accumulated on the client so there is no parallelism.

ah ok. In that case, how about calling reduction on each array that was written? That way you parallelize the merge across blocks for each array, and then the only serial bit is the merging across arrays, which will be a lot smaller. I considered this approach for dask, but then just wrote out a tree reduction across all chunks.

EDIT: or is the reduction approach not viable because you need to serialize to Zarr at some point?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or is the reduction approach not viable because you need to serialize to Zarr at some point?

Yes, that's basically it - Cubed also separates the data paths for array manipulations (contents of the blocks) from the metadata operations (block IDs and - for Icechunk - the changesets). So I think merging in batches would be more feasible.

we'll have to build some rust API, we'll get to this eventually.

+1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's basically it

:( I was afraid so.

@tomwhite
Copy link
Member Author

I'm not sure where this code should live.

After chatting to @rabernat, @jhamman, @TomNicholas we think Icechunk integration can live in this repo - as an optional dependency. There may be further changes once Xarray supports pluggable writers, but this PR can go in once tests are passing.

@rabernat
Copy link

This is amazing!

Note that the Python API is getting a major refactor in earth-mover/icechunk#470 which will almost certainly break this.

@tomwhite
Copy link
Member Author

Note that the Python API is getting a major refactor in earth-mover/icechunk#470 which will almost certainly break this.

@rabernat thanks for the heads up. I'll keep an eye on that PR.

I think the best place to run the Icechunk tests here is in the Zarr v3 tests workflow in CI.

@tomwhite tomwhite force-pushed the icechunk-store branch 2 times, most recently from 719f83c to 7cc3b8e Compare January 11, 2025 11:44
@tomwhite tomwhite marked this pull request as ready for review January 11, 2025 12:26
@tomwhite
Copy link
Member Author

This is working well now with the latest zarr and icechunk releases. I plan to merge it soon.

@TomNicholas
Copy link
Member

Awesome @tomwhite !

@tomwhite tomwhite merged commit c487014 into main Jan 17, 2025
16 checks passed
@tomwhite tomwhite deleted the icechunk-store branch January 17, 2025 09:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Icechunk integration
5 participants