Appending performance improvement #1014
Conversation
Hello @hailiangzhang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2022-05-03 20:31:55 UTC |
Thanks, @hailiangzhang! I've launched the tests. |
Codecov Report
@@           Coverage Diff           @@
##           master    #1014   +/-  ##
=======================================
  Coverage   99.94%   99.94%
=======================================
  Files          34       34
  Lines       13710    13719    +9
=======================================
+ Hits        13703    13712    +9
  Misses          7        7
|
Relaunched after update, thanks, @hailiangzhang. Reading the comment and the implementation, this makes sense & all tests are passing. I do wonder, though, if you have any ideas for edge cases which may need extra tests. Also: is it possible to share the benchmark you are evaluating with? |
I think there is another potential opportunity for optimization here (lines 2449 to 2451 in 539755f):
We are deleting chunks synchronously, in a loop. It would be better to first populate the list of keys and then call delitems. Only FSStore implements delitems (line 1372 in 27cf315).
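A minimal sketch of the batching idea (not actual zarr code; the helper name and the fallback loop are illustrative, and delitems is assumed to accept a sequence of keys, as FSStore's implementation does):

```python
def remove_out_of_range_chunks(store, keys):
    """Delete the given chunk keys from ``store``, batching when possible.

    Illustrative helper, not zarr's actual code: ``delitems`` is assumed to
    take a sequence of keys (as on FSStore); other stores fall back to
    deleting one key at a time.
    """
    keys = list(keys)
    if hasattr(store, "delitems"):
        # one batched call instead of a synchronous delete per chunk
        store.delitems(keys)
    else:
        for key in keys:
            del store[key]
```

The point of batching is presumably that a store backed by a remote filesystem can issue the deletes together rather than paying one blocking round trip per chunk.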
|
Thanks for your comment, @joshmoore! The benchmark that I used is based on the script that Christine posted in her issue report, where I found performance degradation when the number of chunks grew to 113,000, and observed a 50% improvement after applying the changes from this PR. However, I did look into the related tests (zarr-python/zarr/tests/test_core.py, lines 631 to 668 in 27cf315), and they don't seem to cover the case of resizing the array by increasing one dimension while decreasing the other.
Maybe I can add a test case of z.resize((1, 55)) at the end. |
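The benchmark script from the linked issue is not reproduced in this thread; a rough sketch of that kind of append benchmark, with made-up shapes, chunk sizes, and iteration counts, might look like this:

```python
import time

import numpy as np
import zarr

# Illustrative only: the real benchmark was based on the script in the linked
# issue report. Shapes and chunk sizes here are made up so that the array has
# on the order of 100,000 chunks, similar to the reported case.
z = zarr.zeros((100_000, 100), chunks=(1, 100), dtype="f8")

block = np.ones((1_000, 100), dtype="f8")
start = time.perf_counter()
for _ in range(10):
    z.append(block, axis=0)  # each append resizes the array and its chunk grid
print(f"10 appends took {time.perf_counter() - start:.2f}s")
```

The idea is simply to time append on an array whose chunk count is in the hundred-thousand range, before and after the change.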
@hailiangzhang, there's a little time before the next release. If you have a chance to add the extra test(s), that'd be wonderful! |
Thanks for your comment, @rabernat! |
@joshmoore, actually I found something interesting when trying to add a new test as I suggested above (adding a test case of z.resize((1, 55)) at the end). Basically I found that by using [...]. Sorry if I didn't explain it well, but I think this is a separate issue, and I have just filed an issue report here. Again, since that issue is not related to this PR, I think we can probably add this test after it is addressed. I would be willing to look into this at a later time and see whether it's easy to fix. |
Hi @joshmoore, as mentioned in my issue report (which is actually a feature request :)), I have added an edge case test for the resize method as we originally planned. This test can be adapted if my feature request is implemented in the future :) |
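The added test itself lives in the PR diff rather than in this thread; a minimal sketch of the kind of edge case discussed (growing one dimension while shrinking the other in a single resize call) might look like the following, with illustrative shapes:

```python
import numpy as np
import zarr

# Illustrative edge case: a single resize call that grows one dimension while
# shrinking the other, so chunks must be dropped along one axis only.
z = zarr.zeros((105, 105), chunks=(10, 10), dtype="i4")
data = np.arange(105 * 105, dtype="i4").reshape(105, 105)
z[:] = data

z.resize((205, 55))  # grow axis 0, shrink axis 1

assert z.shape == (205, 55)
# data inside the retained region should be unchanged
assert (z[:105, :55] == data[:, :55]).all()
# the newly added rows should come back as the fill value (0)
assert (z[105:, :] == 0).all()
```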
Thanks, @hailiangzhang. Merging to get this into a pre-release of 2.12 |
* Release notes for 2.12.0a1
* Minor fixes to release notes
* Extend explanation of #1014
This PR is trying to address the issue reported in #938:
The issue appears to be coming from a method in zarr/core.py when trying to remove any chunks not within range.
Basically, the existing implementation iterates through all the old chunks and removes those that don't exist in the new chunks. This is time-consuming and unnecessary when appending new chunks to the end of an existing array with a large number of chunks.

To address this, this PR iterates through each dimension and only finds and removes the chunk slices that exist in the old but not the new data. It also introduces a mutable list to dynamically adjust the number of chunks along the already-processed dimensions in order to avoid duplicate chunk removal. These details are documented in the comments of my changes.

This should noticeably improve the performance when appending data to a huge zarr array; in our benchmark, it reduced the processing time by half. Moreover, the appending time should stay the same as the array size grows.
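A hedged sketch of the idea described above (not the PR's actual code): for each dimension, only the chunk indices that fall outside the new chunk grid are visited, and a mutable copy of the old per-dimension chunk counts is shrunk as each dimension is processed, so that no chunk is removed twice. The function name and arguments are illustrative.

```python
import itertools


def chunks_to_remove(old_cdata_shape, new_cdata_shape):
    """Yield chunk indices present in the old chunk grid but not the new one.

    Illustrative sketch of the per-dimension strategy, not the PR's code.
    """
    # mutable copy of the old chunk counts; shrunk as each dimension is
    # handled so chunks already scheduled for removal are not visited again
    bounds = list(old_cdata_shape)
    for dim, (old_n, new_n) in enumerate(zip(old_cdata_shape, new_cdata_shape)):
        if new_n < old_n:
            # chunk indices along this dimension that fall outside the new grid
            ranges = [range(n) for n in bounds]
            ranges[dim] = range(new_n, old_n)
            yield from itertools.product(*ranges)
        # along already-processed dimensions, only in-range indices remain
        bounds[dim] = min(old_n, new_n)
```

For example, shrinking a 3x4 chunk grid to 2x2 yields the eight chunk indices with row index >= 2 or column index >= 2, each exactly once, instead of scanning all twelve old chunks.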
TODO: