
Possible feature request on saving array mechanism #1140


Open
AhmetCanSolak opened this issue Sep 21, 2022 · 1 comment
Labels
enhancement New features or improvements

Comments

@AhmetCanSolak

Hello all,

As far as I understand, the current zarr-python implementation overwrites all chunks by default when an array that was already saved is modified and saved again. (Correct me if I am wrong, but here is what I am looking at:

_create_array(arr, store=_store, overwrite=True, zarr_version=zarr_version, path=path,

for example.) In some cases (such as image-acquisition software that saves more chunks as data arrives and keeps writing chunks over hours or days), it can be wasteful to overwrite all chunks, especially when only the new chunks are different.

A less wordy explanation of the concern:

  • imagine you have an array on disk with 1000 chunks.
  • you want to append, say, 1000 more chunks of data to the array.
  • you want the zarr API to realize that the first 1000 chunks will be identical anyway, skip rewriting them, and only add the new chunks.
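The wished-for behaviour can be sketched without zarr at all. The `ChunkStore` class and its method names below are hypothetical, purely to illustrate skipping writes for chunks whose content is already on disk:

```python
import hashlib


class ChunkStore:
    """Toy chunk store that writes a chunk only if its content changed.

    Hypothetical illustration -- this is not zarr's actual storage layer.
    """

    def __init__(self):
        self._chunks = {}   # chunk index -> bytes
        self._hashes = {}   # chunk index -> content digest
        self.writes = 0     # count of actual write operations

    def save(self, index, data):
        digest = hashlib.sha256(data).hexdigest()
        if self._hashes.get(index) == digest:
            return  # identical chunk already stored: skip the write
        self._chunks[index] = data
        self._hashes[index] = digest
        self.writes += 1


store = ChunkStore()

# first save: 1000 chunks, 1000 writes
for i in range(1000):
    store.save(i, b"chunk-%d" % i)

# second save: the same 1000 chunks plus 1000 new ones
for i in range(2000):
    store.save(i, b"chunk-%d" % i)

print(store.writes)  # 2000, not 3000: the unchanged chunks were skipped
```

The second pass touches all 2000 chunk indices, but only the 1000 new chunks trigger actual writes.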

Here at the opensci2022 meeting I discussed this with @jakirkham, and he suggested that one can resize the array first and then fill only the new chunks with the newly available values/frames. I think that is a valid way to address the concern. I would like to discuss whether we can implement this internally and do it by default if possible. It may or may not change the existing public API (happy to discuss here). A few implementation ideas:

  • there is a require_dataset endpoint:
    def require_dataset(self, name, shape, dtype=None, exact=False, **kwargs):
    maybe we can implement a similar function, say require_chunks, that does the check internally, and call it from the save_array endpoint?
  • there is already an append API here:
    def append(self, data, axis=0):
    but I am not sure whether it would work as described above in all cases? As I understand it, it works on one axis at a time.

Any ideas/comments/discussions welcome!

@joshmoore
Member

Thanks, @AhmetCanSolak. Cross-linking here, as discussed during the community meeting: #1017
