Hello all,
As far as I understand, the current zarr-python implementation overwrites all chunks by default when an array that was already saved has some changes and is saved again (correct me if I am wrong, but here is what I am looking at:
zarr-python/zarr/convenience.py, line 170 in 505810c
for example). In some cases, such as image acquisition software that keeps writing chunks over hours or days as data arrives, it can be wasteful to overwrite all chunks, especially if only the new chunks are different.
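For concreteness, a hedged sketch of the pattern I mean (the file name, shapes, and chunking below are made up for illustration): calling `save_array` a second time on a grown array rewrites every chunk, including the ones that did not change.

```python
import numpy as np
import zarr

frames = np.random.rand(1000, 512, 512)                    # first batch of frames
zarr.save_array("acq.zarr", frames, chunks=(1, 512, 512))  # writes 1000 chunks

more = np.random.rand(1000, 512, 512)                      # new frames arrive later
frames = np.concatenate([frames, more])
zarr.save_array("acq.zarr", frames, chunks=(1, 512, 512))  # rewrites all 2000 chunks,
                                                            # although the first 1000
                                                            # are identical on disk
```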
A less wordy explanation of the concern:

- imagine you have an array on disk with 1000 chunks.
- you want to append, say, 1000 more chunks of data to the array.
- you want the zarr API to realize that the first 1000 chunks will be identical anyway, not spend time overwriting them, and only write the new chunks.
Here at the opensci2022 meeting I have been discussing this with @jakirkham, and he suggested that one can resize the array first and fill only the new chunks with the newly available values/frames. I think this is a valid way to address the concern (see the sketch after the list below). I would like to discuss whether we can implement this internally and do it by default if possible. It may or may not change the existing public API (happy to discuss here). A few implementation ideas:
- Similar to the require_dataset endpoint (zarr-python/zarr/hierarchy.py, line 997 in ce129a5), maybe we can implement a similar function, require_chunks, that does the check internally, and call such a function in the save_array endpoint (zarr-python/zarr/core.py, line 2507 in 43266ee)?
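As a reference point for the discussion, a minimal sketch of the resize-then-fill workaround suggested above (the path and shapes are made up; this is roughly what a require_chunks-style helper could do internally before deciding which chunks to write):

```python
import numpy as np
import zarr

z = zarr.open("acq.zarr", mode="a")              # existing array, e.g. 1000 chunks

more = np.random.rand(1000, 512, 512)            # newly acquired frames

old_len = z.shape[0]
z.resize(old_len + more.shape[0], *z.shape[1:])  # grow along axis 0; existing
                                                 # chunks on disk are left alone
z[old_len:] = more                               # only the new chunks get written
```

Note that `Array.append` already wraps this resize-and-assign pattern for the pure append case; the question here is whether something like `save_array` could avoid rewriting unchanged chunks by default.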
Any ideas/comments/discussions welcome!