Appending performance improvement #1014

Conversation

hailiangzhang (Contributor) commented Apr 25, 2022

This PR addresses the issue reported in #938:

When I repeatedly append data to a zarr array in s3, appending takes longer and longer and longer

The issue appears to come from a method in zarr/core.py that removes any chunks not within range.

Basically, the existing implementation iterates through all the old chunks and removes those that no longer exist in the new chunk grid. This is time-consuming and unnecessary when appending new chunks to the end of an existing array that already has a large number of chunks.

Instead, this PR iterates through each dimension and only finds and removes the chunk slices that exist in the old data but not the new. It also introduces a mutable list that dynamically adjusts the number of chunks along the already-processed dimensions, to avoid removing the same chunk twice. These details are documented in the comments of my changes.
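For illustration, here is a rough standalone sketch of that dimension-wise idea (the helper name stale_chunk_indices is only illustrative, not what the code uses; the actual change lives inside the chunk-removal loop in zarr/core.py):

import itertools
import math

def stale_chunk_indices(old_shape, new_shape, chunks):
    # number of chunks along each dimension before and after the resize
    old_grid = [math.ceil(s / c) for s, c in zip(old_shape, chunks)]
    new_grid = [math.ceil(s / c) for s, c in zip(new_shape, chunks)]
    # mutable working copy of the old grid: dimensions already processed are
    # capped at the new size, so the same chunk index is never yielded twice
    working = list(old_grid)
    for dim, (n_old, n_new) in enumerate(zip(old_grid, new_grid)):
        ranges = [
            range(new_grid[d], working[d]) if d == dim else range(working[d])
            for d in range(len(working))
        ]
        # only the chunk slices that exist in the old data but not the new
        # data along this dimension are visited
        yield from itertools.product(*ranges)
        working[dim] = min(n_old, n_new)

# appending never shrinks a dimension, so nothing is scanned or removed,
# regardless of how many chunks the array already has
assert list(stale_chunk_indices((1130000,), (1140000,), (10000,))) == []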

This should noticeably improve performance when appending data to a huge zarr array; in our benchmark it cut the processing time in half. Moreover, the append time should stay constant as the array grows.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

pep8speaks commented Apr 25, 2022

Hello @hailiangzhang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-05-03 20:31:55 UTC

joshmoore (Member)

Thanks, @hailiangzhang! I've launched the tests.

codecov bot commented Apr 26, 2022

Codecov Report

Merging #1014 (9078b57) into master (27cf315) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1014   +/-   ##
=======================================
  Coverage   99.94%   99.94%           
=======================================
  Files          34       34           
  Lines       13710    13719    +9     
=======================================
+ Hits        13703    13712    +9     
  Misses          7        7           
Impacted Files            Coverage Δ
zarr/core.py              100.00% <100.00%> (ø)
zarr/tests/test_core.py   100.00% <100.00%> (ø)

joshmoore (Member)

Relaunched after update, thanks, @hailiangzhang.

Reading the comment and the implementation this makes sense & all tests are passing. I do wonder, though, if you have any ideas for edge cases which may need extra tests.

Also: is it possible to share the benchmark you are evaluating with?

rabernat (Contributor) commented May 2, 2022

I think there is another potential opportunity for optimization here

zarr-python/zarr/core.py

Lines 2449 to 2451 in 539755f

key = self._chunk_key(cidx)
try:
    del chunk_store[key]

We are deleting chunks synchronously, in a loop. It would be better to first populate the list of keys and then call delitems on them. This would potentially dispatch asynchronously leading to much faster delete speed (especially with object storage).

Only FSStore implements delitems.

def delitems(self, keys):
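Roughly, something like this (just a sketch, not tested; it reuses the stale_chunk_indices helper sketched in the PR description and assumes the names from the surrounding method):

# collect every stale chunk key first instead of deleting inside the loop
keys = [self._chunk_key(cidx)
        for cidx in stale_chunk_indices(old_shape, new_shape, self._chunks)]

if hasattr(chunk_store, "delitems"):
    # one bulk call; FSStore can fan these deletes out concurrently,
    # which matters a lot for object storage
    chunk_store.delitems(keys)
else:
    # fall back to the current one-at-a-time behaviour for other stores
    for key in keys:
        try:
            del chunk_store[key]
        except KeyError:
            pass  # chunk was never initialized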

hailiangzhang (Contributor, Author) commented May 2, 2022

Relaunched after update, thanks, @hailiangzhang.

Reading the comment and the implementation this makes sense & all tests are passing. I do wonder, though, if you have any ideas for edge cases which may need extra tests.

Also: is it possible to share the benchmark you are evaluating with?

Thanks for your comment, @joshmoore!

The benchmark I used is based on the script Christine posted in her issue report; with it, I found performance degradation once the number of chunks grew to 113,000, and observed a 50% improvement after applying the changes from this PR (a rough sketch of that kind of benchmark follows below).
Regarding edge cases that may need extra tests: in theory, my implementation should be identical to the original code, so I am not sure whether new tests are necessary to cover the changes.
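For concreteness, the benchmark has roughly this shape (a simplified sketch, not Christine's exact script, with a local DirectoryStore standing in for the s3 store):

import time
import numpy as np
import zarr

store = zarr.DirectoryStore('append_bench.zarr')
z = zarr.open(store, mode='w', shape=(0,), chunks=(10000,), dtype='f8')
block = np.random.random(10000)

for i in range(100):
    t0 = time.perf_counter()
    z.append(block)
    # before this PR the per-append time kept growing with the number of
    # existing chunks; after it, the time should stay roughly flat
    print(f"append {i}: {time.perf_counter() - t0:.4f}s")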

However, I did look into the related tests, and they don't seem to cover the case of resizing an array by increasing one dimension while decreasing the other.

def test_resize_2d(self):
    z = self.create_array(shape=(105, 105), chunks=(10, 10), dtype='i4',
                          fill_value=0)
    a = np.arange(105*105, dtype='i4').reshape((105, 105))
    z[:] = a
    assert (105, 105) == z.shape
    assert (105, 105) == z[:].shape
    assert np.dtype('i4') == z.dtype
    assert np.dtype('i4') == z[:].dtype
    assert (10, 10) == z.chunks
    assert_array_equal(a, z[:])
    z.resize((205, 205))
    assert (205, 205) == z.shape
    assert (205, 205) == z[:].shape
    assert np.dtype('i4') == z.dtype
    assert np.dtype('i4') == z[:].dtype
    assert (10, 10) == z.chunks
    assert_array_equal(a, z[:105, :105])
    assert_array_equal(np.zeros((100, 205), dtype='i4'), z[105:, :])
    assert_array_equal(np.zeros((205, 100), dtype='i4'), z[:, 105:])
    z.resize((55, 55))
    assert (55, 55) == z.shape
    assert (55, 55) == z[:].shape
    assert np.dtype('i4') == z.dtype
    assert np.dtype('i4') == z[:].dtype
    assert (10, 10) == z.chunks
    assert_array_equal(a[:55, :55], z[:])
    z.resize((55, 1))
    assert (55, 1) == z.shape
    assert (55, 1) == z[:].shape
    assert np.dtype('i4') == z.dtype
    assert np.dtype('i4') == z[:].dtype
    assert (10, 10) == z.chunks
    assert_array_equal(a[:55, :1], z[:])

Maybe I can add a z.resize((1, 55)) test case at the end, as sketched below? Again, I don't think it is related to this PR, but please let me know if you think it's necessary and I will be happy to do that. :)
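Following the pattern of the assertions above, the addition would look something like this (only the shape/dtype/chunks checks are shown; the final content assertion is the part I would still need to pin down):

z.resize((1, 55))
assert (1, 55) == z.shape
assert (1, 55) == z[:].shape
assert np.dtype('i4') == z.dtype
assert np.dtype('i4') == z[:].dtype
assert (10, 10) == z.chunks
# plus an assert_array_equal on z[:]'s contents, analogous to the ones above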

joshmoore (Member)

@hailiangzhang, there's a little time before the next release. If you have a chance to add the extra test(s), that'd be wonderful!

hailiangzhang (Contributor, Author)

I think there is another potential opportunity for optimization here

zarr-python/zarr/core.py

Lines 2449 to 2451 in 539755f

key = self._chunk_key(cidx)
try:
    del chunk_store[key]

We are deleting chunks synchronously, in a loop. It would be better to first populate the list of keys and then call delitems on them. This would potentially dispatch asynchronously leading to much faster delete speed (especially with object storage).

Only FSStore implements delitems.

def delitems(self, keys):

Thanks for your comment, @rabernat!
Yes, I can first populate the list of keys and then call delitems on them at the end, but that still seems like a synchronous operation, and I am not sure how it would dispatch asynchronously. Sorry if I have missed something here, but could you explain this a bit more?

hailiangzhang (Contributor, Author)

@hailiangzhang, there's a little time before the next release. If you have a chance to add the extra test(s), that'd be wonderful!

@joshmoore, I actually found something interesting when trying to add the new test I suggested above (a z.resize((1, 55)) case at the end).

Basically, I found that if I use resize to shrink a zarr array to a shape smaller than the chunk size and then grow it again afterward, the values that were supposed to be removed by the first resize are brought back by the second one.
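For example, something along these lines (a hypothetical minimal reproduction, not the exact code from my report):

import numpy as np
import zarr

z = zarr.zeros((10,), chunks=(10,), dtype='i4')
z[:] = np.arange(10)
z.resize((5,))    # shrink below the chunk size: the partially covered chunk is kept as-is
z.resize((10,))   # grow back: the old values 5..9 reappear instead of the fill value
print(z[:])       # [0 1 2 3 4 5 6 7 8 9]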

Sorry if I didn't explain it well, but I think this is a separate issue, and I have just filed an issue report here.

Again, since this issue is not related to this PR, I think we can probably add that test after the issue is addressed. I'd be willing to look into it at a later time and see whether it's easy to fix.

hailiangzhang (Contributor, Author)

Hi @joshmoore, as mentioned in my issue report (actually a feature request :)), I have added an edge-case test for the resize method as we originally planned. The test can be adapted if my feature request is implemented in the future. :)

joshmoore (Member)

Thanks, @hailiangzhang. Merging to get this into a pre-release of 2.12

joshmoore merged commit d1f590d into zarr-developers:master on May 4, 2022
joshmoore added a commit to joshmoore/zarr-python that referenced this pull request May 10, 2022
joshmoore added a commit that referenced this pull request May 10, 2022
* Release notes for 2.12.0a1

* Minor fixes to release notes

* Extend explanation of #1014