Appending performance improvement #1014

Conversation

hailiangzhang (Contributor) commented Apr 25, 2022

This PR addresses the issue reported in #938:

When I repeatedly append data to a zarr array in s3, appending takes longer and longer and longer

The issue appears to come from a method in zarr/core.py that removes any chunks not within range.

Basically, the existing implementation iterates through all the old chunks and removes those that no longer exist in the new chunk grid. This is time-consuming and unnecessary when appending new chunks to the end of an existing array that already has a large number of chunks.

Instead, this PR iterates through each dimension and only finds and removes the chunk slices that exist in the old data but not the new. It also introduces a mutable list that dynamically adjusts the number of chunks along the already-processed dimensions, to avoid removing the same chunk twice. These details are documented in the comments of my changes.
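For illustration, here is a rough standalone sketch of that dimension-wise idea (the helper name stale_chunk_indices is only illustrative, not what the code uses; the actual change lives inside the chunk-removal loop in zarr/core.py):

import itertools
import math

def stale_chunk_indices(old_shape, new_shape, chunks):
    # number of chunks along each dimension before and after the resize
    old_grid = [math.ceil(s / c) for s, c in zip(old_shape, chunks)]
    new_grid = [math.ceil(s / c) for s, c in zip(new_shape, chunks)]
    # mutable working copy of the old grid: dimensions already processed are
    # capped at the new size, so the same chunk index is never yielded twice
    working = list(old_grid)
    for dim, (n_old, n_new) in enumerate(zip(old_grid, new_grid)):
        ranges = [
            range(new_grid[d], working[d]) if d == dim else range(working[d])
            for d in range(len(working))
        ]
        # only the chunk slices that exist in the old data but not the new
        # data along this dimension are visited
        yield from itertools.product(*ranges)
        working[dim] = min(n_old, n_new)

# appending never shrinks a dimension, so nothing is scanned or removed,
# regardless of how many chunks the array already has
assert list(stale_chunk_indices((1130000,), (1140000,), (10000,))) == []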

This should noticeably improve performance when appending data to a huge zarr array; in our benchmark it cut the processing time in half. Moreover, the append time should stay constant as the array grows.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

pep8speaks commented Apr 25, 2022

Hello @hailiangzhang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-05-03 20:31:55 UTC

joshmoore (Member)

Thanks, @hailiangzhang! I've launched the tests.

codecov bot commented Apr 26, 2022

Codecov Report

Merging #1014 (9078b57) into master (27cf315) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1014   +/-   ##
=======================================
  Coverage   99.94%   99.94%           
=======================================
  Files          34       34           
  Lines       13710    13719    +9     
=======================================
+ Hits        13703    13712    +9     
  Misses          7        7           
Impacted Files            Coverage Δ
zarr/core.py              100.00% <100.00%> (ø)
zarr/tests/test_core.py   100.00% <100.00%> (ø)

joshmoore (Member)

Relaunched after update, thanks, @hailiangzhang.

Reading the comment and the implementation this makes sense & all tests are passing. I do wonder, though, if you have any ideas for edge cases which may need extra tests.

Also: is it possible to share the benchmark you are evaluating with?

rabernat (Contributor) commented May 2, 2022

I think there is another potential opportunity for optimization here

zarr-python/zarr/core.py

Lines 2449 to 2451 in 539755f

key = self._chunk_key(cidx)
try:
    del chunk_store[key]

We are deleting chunks synchronously, in a loop. It would be better to first populate the list of keys and then call delitems on them. This would potentially dispatch asynchronously leading to much faster delete speed (especially with object storage).

Only FSStore implements delitems.

def delitems(self, keys):
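Roughly, something like this (just a sketch, not tested; it reuses the stale_chunk_indices helper sketched in the PR description and assumes the names from the surrounding method):

# collect every stale chunk key first instead of deleting inside the loop
keys = [self._chunk_key(cidx)
        for cidx in stale_chunk_indices(old_shape, new_shape, self._chunks)]

if hasattr(chunk_store, "delitems"):
    # one bulk call; FSStore can fan these deletes out concurrently,
    # which matters a lot for object storage
    chunk_store.delitems(keys)
else:
    # fall back to the current one-at-a-time behaviour for other stores
    for key in keys:
        try:
            del chunk_store[key]
        except KeyError:
            pass  # chunk was never initialized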

hailiangzhang (Contributor, Author) commented May 2, 2022

Relaunched after update, thanks, @hailiangzhang.

Reading the comment and the implementation this makes sense & all tests are passing. I do wonder, though, if you have any ideas for edge cases which may need extra tests.

Also: is it possible to share the benchmark you are evaluating with?

Thanks for your comment, @joshmoore!

The benchmark I used is based on the script Christine posted in her issue report; with it, I found performance degradation once the number of chunks grew to 113,000, and observed a 50% improvement after applying the changes from this PR (a rough sketch of that kind of benchmark follows below).
Regarding edge cases that may need extra tests: in theory, my implementation should be identical to the original code, so I am not sure whether new tests are necessary to cover the changes.
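For concreteness, the benchmark has roughly this shape (a simplified sketch, not Christine's exact script, with a local DirectoryStore standing in for the s3 store):

import time
import numpy as np
import zarr

store = zarr.DirectoryStore('append_bench.zarr')
z = zarr.open(store, mode='w', shape=(0,), chunks=(10000,), dtype='f8')
block = np.random.random(10000)

for i in range(100):
    t0 = time.perf_counter()
    z.append(block)
    # before this PR the per-append time kept growing with the number of
    # existing chunks; after it, the time should stay roughly flat
    print(f"append {i}: {time.perf_counter() - t0:.4f}s")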

However, I did look into the related tests, and they don't seem to cover the case of resizing an array by increasing one dimension while decreasing the other.

def test_resize_2d(self):
    z = self.create_array(shape=(105, 105), chunks=(10, 10), dtype='i4',
                          fill_value=0)
    a = np.arange(105*105, dtype='i4').reshape((105, 105))
    z[:] = a
    assert (105, 105) == z.shape
    assert (105, 105) == z[:].shape
    assert np.dtype('i4') == z.dtype
    assert np.dtype('i4') == z[:].dtype
    assert (10, 10) == z.chunks
    assert_array_equal(a, z[:])
    z.resize((205, 205))
    assert (205, 205) == z.shape
    assert (205, 205) == z[:].shape
    assert np.dtype('i4') == z.dtype
    assert np.dtype('i4') == z[:].dtype
    assert (10, 10) == z.chunks
    assert_array_equal(a, z[:105, :105])
    assert_array_equal(np.zeros((100, 205), dtype='i4'), z[105:, :])
    assert_array_equal(np.zeros((205, 100), dtype='i4'), z[:, 105:])
    z.resize((55, 55))
    assert (55, 55) == z.shape
    assert (55, 55) == z[:].shape
    assert np.dtype('i4') == z.dtype
    assert np.dtype('i4') == z[:].dtype
    assert (10, 10) == z.chunks
    assert_array_equal(a[:55, :55], z[:])
    z.resize((55, 1))
    assert (55, 1) == z.shape
    assert (55, 1) == z[:].shape
    assert np.dtype('i4') == z.dtype
    assert np.dtype('i4') == z[:].dtype
    assert (10, 10) == z.chunks
    assert_array_equal(a[:55, :1], z[:])

Maybe I can add a z.resize((1, 55)) test case at the end, as sketched below? Again, I don't think it is related to this PR, but please let me know if you think it's necessary and I will be happy to do that. :)
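Following the pattern of the assertions above, the addition would look something like this (only the shape/dtype/chunks checks are shown; the final content assertion is the part I would still need to pin down):

z.resize((1, 55))
assert (1, 55) == z.shape
assert (1, 55) == z[:].shape
assert np.dtype('i4') == z.dtype
assert np.dtype('i4') == z[:].dtype
assert (10, 10) == z.chunks
# plus an assert_array_equal on z[:]'s contents, analogous to the ones above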

joshmoore (Member)

@hailiangzhang, there's a little time before the next release. If you have a chance to add the extra test(s), that'd be wonderful!

hailiangzhang (Contributor, Author)

I think there is another potential opportunity for optimization here

zarr-python/zarr/core.py

Lines 2449 to 2451 in 539755f

key = self._chunk_key(cidx)
try:
    del chunk_store[key]

We are deleting chunks synchronously, in a loop. It would be better to first populate the list of keys and then call delitems on them. This would potentially dispatch asynchronously leading to much faster delete speed (especially with object storage).

Only FSStore implements delitems.

def delitems(self, keys):

Thanks for your comment, @rabernat!
Yes, I can first populate the list of keys and then call delitems on them at the end, but that still seems like a synchronous operation, and I am not sure how it would dispatch asynchronously. Sorry if I have missed something here, but could you explain this a bit more?

hailiangzhang (Contributor, Author)

@hailiangzhang, there's a little time before the next release. If you have a chance to add the extra test(s), that'd be wonderful!

@joshmoore, I actually found something interesting when trying to add the new test I suggested above (a z.resize((1, 55)) case at the end).

Basically, I found that if I use resize to shrink a zarr array to a shape smaller than the chunk size and then grow it again afterward, the values that were supposed to be removed by the first resize are brought back by the second one.
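For example, something along these lines (a hypothetical minimal reproduction, not the exact code from my report):

import numpy as np
import zarr

z = zarr.zeros((10,), chunks=(10,), dtype='i4')
z[:] = np.arange(10)
z.resize((5,))    # shrink below the chunk size: the partially covered chunk is kept as-is
z.resize((10,))   # grow back: the old values 5..9 reappear instead of the fill value
print(z[:])       # [0 1 2 3 4 5 6 7 8 9]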

Sorry if I didn't explain it well, but I think this is a separate issue, and I have just filed an issue report here.

Again, since this issue is not related to this PR, I think we can probably add that test after the issue is addressed. I'd be willing to look into it at a later time and see whether it's easy to fix.

hailiangzhang (Contributor, Author)

Hi @joshmoore, as mentioned in my issue report (actually a feature request :)), I have added an edge-case test for the resize method as we originally planned. The test can be adapted if my feature request is implemented in the future. :)

joshmoore (Member)

Thanks, @hailiangzhang. Merging to get this into a pre-release of 2.12

joshmoore merged commit d1f590d into zarr-developers:master on May 4, 2022
joshmoore added a commit to joshmoore/zarr-python that referenced this pull request May 10, 2022
joshmoore added a commit that referenced this pull request May 10, 2022
* Release notes for 2.12.0a1

* Minor fixes to release notes

* Extend explanation of #1014