Switch `Buffer`s to `memoryview`s & remove extra copies/allocations #656
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Coverage Diff:

| | main | #656 | +/- |
| --- | --- | --- | --- |
| Coverage | 99.96% | 99.96% | |
| Files | 63 | 63 | |
| Lines | 2771 | 2771 | |
| Hits | 2770 | 2770 | |
| Misses | 1 | 1 | |
When this was written in the code, Python's Buffer Protocol support was inconsistent across Python versions (specifically on Python 2.7). Since Python 2.7 reached EOL and was dropped from Numcodecs, Python Buffer Protocol support has become more consistent. At this stage the `memoryview` object, which Cython also supports, does all the same things that `Buffer` did for us, and it is a Python builtin that behaves similarly in most ways. Given this, switch the code over to `memoryview`s internally and drop `Buffer`.
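As a rough illustration (not code from this PR), a Cython typed memoryview provides the kind of direct, zero-copy byte access that `Buffer` used to wrap:

```cython
# Illustrative sketch only: a Cython typed memoryview gives direct,
# zero-copy access to the bytes of any object supporting the buffer
# protocol (bytes, bytearray, contiguous uint8 arrays, ...).
from libc.stdint cimport uint8_t

def buffer_length(obj):
    # request a read-only, C-contiguous view of the object's bytes
    cdef const uint8_t[::1] mv = obj
    return mv.shape[0]
```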
Planning to merge end of week if no comments.
This seems like a great improvement. Thanks so much @jakirkham!
Unfortunately my Cython isn't really good enough to provide a useful review. But the tests pass! 😆 And we are reducing LoC and simplifying many functions. So seems like a great direction.
Someone who actually groks Cython should review this for real.
@jakirkham any interest in pushing this forward?
Have resolved the conflicts.

Also to avoid making the diff as messy have added in …

We can clean these …

Lastly added a new entry …

Please let me know if anything else is needed.
During encoding, preallocate a `bytes` object for the final result and write everything directly into it. This avoids unnecessary staging and copying of intermediate results. Make use of Cython typed `memoryview`s throughout encode and decode for efficient access to the underlying data. Further leverage the `store_le32` and `load_le32` functions to quickly pack and unpack little-endian 32-bit unsigned integers from buffers when encoding and decoding.
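As a point of reference, here is a minimal sketch of what such little-endian 32-bit pack/unpack helpers do. The names `store_le32_sketch`/`load_le32_sketch` and their exact signatures are illustrative; the helpers in the codebase may be defined differently.

```cython
from libc.stdint cimport uint8_t, uint32_t

cdef inline void store_le32_sketch(uint8_t *buf, uint32_t value) nogil:
    # write the 4 bytes of `value` in little-endian order
    buf[0] = <uint8_t>(value & 0xff)
    buf[1] = <uint8_t>((value >> 8) & 0xff)
    buf[2] = <uint8_t>((value >> 16) & 0xff)
    buf[3] = <uint8_t>((value >> 24) & 0xff)

cdef inline uint32_t load_le32_sketch(const uint8_t *buf) nogil:
    # read 4 little-endian bytes back into a uint32
    return (<uint32_t>buf[0]
            | (<uint32_t>buf[1] << 8)
            | (<uint32_t>buf[2] << 16)
            | (<uint32_t>buf[3] << 24))
```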
numcodecs/zstd.pyx (outdated)
      # resize after compression
    - dest = dest[:compressed_size]
    + dest_objptr = <PyObject*>dest
    + _PyBytes_Resize(&dest_objptr, compressed_size)
Per issue #717, think this should fix the issue in Zstd (have pushed similar changes for the other codecs in this PR).

In particular this code had created a copy of the `bytes` object (trimmed to `compressed_size`):

    dest = dest[:compressed_size]

To fix this we now call `_PyBytes_Resize`, which will resize the existing `bytes` object. This uses `PyObject_Realloc` under the hood. A good `realloc` implementation can detect when the memory requested is less than the size of the original allocation (as is the case here). When that happens, it will shrink the allocation in place (meaning no new memory allocation and no copy), which is a fast operation.
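The `realloc` behaviour being relied on here can be illustrated with the following sketch (purely illustrative, not PR code): shrinking an allocation typically lets the allocator reuse the same block.

```cython
from libc.stdlib cimport malloc, realloc, free

def realloc_shrink_demo():
    # allocate a block, then ask realloc for a smaller size; a good
    # allocator can shrink the block in place, so the pointer often
    # stays the same (though this is not guaranteed by the C standard)
    cdef void* buf = malloc(1024)
    cdef void* smaller
    if buf == NULL:
        raise MemoryError()
    smaller = realloc(buf, 128)
    if smaller == NULL:
        free(buf)
        raise MemoryError()
    shrunk_in_place = (smaller == buf)
    free(smaller)
    return shrunk_in_place
```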
(Title changed from "`Buffer`s to `memoryview`s" to "`Buffer`s to `memoryview`s & remove extra copies/allocations".)
The `_PyBytes_Resize` function is helpful for resizing a `bytes` object after it is allocated. When the underlying `bytes` object only has one reference to it, the function can potentially use `realloc` to shrink or grow the allocation in place. While the function signature of `_PyBytes_Resize` makes sense, it is a little unwieldy when used directly in Cython. To smooth this out a bit, use a macro to wrap calls to `_PyBytes_Resize`. This allows us to work with `PyObject*`s, which Cython handles well, instead of `PyObject**`s, which Cython handles awkwardly. The end result looks like a function from Cython's perspective, which is easy to use, while under the hood the macro simply massages our input arguments into something `_PyBytes_Resize` expects.
Include a function to ensure an object is converted into a contiguous `memoryview` object.
Provide this macro in one place and `cimport` it everywhere else.
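For illustration, a contiguity-checking helper like the one mentioned above might look roughly like this; the real helper's name and exact behaviour in the PR may differ.

```cython
def ensure_contiguous_memoryview_sketch(obj):
    """Return a contiguous memoryview over ``obj`` (sketch only)."""
    # wrap in a memoryview if needed, then reject non-contiguous buffers
    if not isinstance(obj, memoryview):
        obj = memoryview(obj)
    if not obj.contiguous:
        raise BufferError('an object exposing a contiguous buffer is required')
    return obj
```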
* Global `cimport`s first
* Start with core modules `cython`, `libc`, etc.
* Add extensions `cpython` & `numpy`
* Internal extensions
* Then `import`s similarly grouped
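Purely as an illustration of that ordering (module names below are examples, not the PR's actual imports), the top of a `.pyx` file might be laid out like this:

```cython
# core cimports
cimport cython
from libc.stdint cimport uint8_t, uint32_t

# extension cimports
from cpython.bytes cimport PyBytes_FromStringAndSize
cimport numpy as cnp

# internal extension cimports would go here, e.g.
# from numcodecs.compat_ext cimport some_helper   # hypothetical name

# plain imports, grouped the same way
import numpy as np
```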
cdef extern from *:
    """
    #define PyBytes_RESIZE(b, n) _PyBytes_Resize(&b, n)
    """
    int PyBytes_RESIZE(object b, Py_ssize_t n) except -1
Have added this function (macro) to help with resizing `bytes` objects in Cython.

It is a thin wrapper around `_PyBytes_Resize`, which makes it easier to work with in Cython.

This allows us to truncate in place `bytes` allocations that are larger than we end up needing, without copying to a new `bytes` object.
      # check compression was successful
      if compressed_size <= 0:
          raise RuntimeError('LZ4 compression error: %s' % compressed_size)

      # resize after compression
      compressed_size += sizeof(uint32_t)
    - dest = dest[:compressed_size]
    + PyBytes_RESIZE(dest, compressed_size)
Thus when we need to truncate a `bytes` object, we can avoid `[:<n>]`, which copies the `bytes` object.

Instead we just call `PyBytes_RESIZE`, which can do an in-place size reduction of the `bytes` object.

Any excess memory goes back to the memory pool, which Python can use as it sees fit.
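Putting the pieces together, the encode pattern described above might look roughly like the sketch below. It reuses the `PyBytes_RESIZE` macro from this PR; the size bound and the "compression" step are placeholders rather than real codec code.

```cython
from libc.stdint cimport uint8_t
from cpython.bytes cimport PyBytes_FromStringAndSize, PyBytes_AS_STRING

cdef extern from *:
    """
    #define PyBytes_RESIZE(b, n) _PyBytes_Resize(&b, n)
    """
    int PyBytes_RESIZE(object b, Py_ssize_t n) except -1

def encode_sketch(const uint8_t[::1] src):
    # preallocate the worst-case output size up front (placeholder bound)
    cdef Py_ssize_t max_size = 2 * src.shape[0] + 16
    cdef Py_ssize_t used
    cdef Py_ssize_t i
    cdef object dest = PyBytes_FromStringAndSize(NULL, max_size)
    cdef char* dest_ptr = PyBytes_AS_STRING(dest)

    # placeholder "compression": just copy the input into the output
    for i in range(src.shape[0]):
        dest_ptr[i] = <char>src[i]
    used = src.shape[0]

    # shrink the existing allocation in place; unlike dest[:used], no copy
    PyBytes_RESIZE(dest, used)
    return dest
```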
I have just tested this using the Mac 14 wheels from https://github.com/zarr-developers/numcodecs/actions/runs/14105743382 using the memray code in https://github.com/tomwhite/memray-array, and can confirm that it works. I can see the memory saving (peak memory goes from 300 MB to 200 MB). This was using zstd with Zarr v3.
@jakirkham thanks so much for this, it looks like a great improvement. I am not qualified to review this because I don't know Cython; is there anyone you can recommend?
I'm also happy merging after self-review.
Thanks Davis! 🙏 @dstansby would you have time to look within the next week? 🙂
Sorry, I don't think I have enough C/Cython experience to review this.
No worries. In that case will go ahead and merge. Ryan had reviewed it earlier. Also the fact that Davis took a look and Tom tested it successfully provides some additional confidence. Cython knowledge isn't as common around here, so that is likely as good as we can do.

Am AFK until the following week, though happy to follow up on anything at that point. Please do feel free to ping me (otherwise it will likely get buried in the avalanche of notifications in the interim 😅).
Noticed that two of the pointer definitions below should have been marked `const` to match the associated typed memoryviews from which they are taken. However they were not.

As a result, the compiler generates warnings about these (#725). That said, we don't actually change the content the pointers refer to, so this is only a compile-time warning; there is no actual issue at runtime.

To fix the compiler warnings, added these `const`s (as shown below) in PR #728, which fixes this issue.
    cdef const uint8_t[::1] b_mv = buf
    cdef uint8_t* b_ptr = &b_mv[0]
Suggested change:

      cdef const uint8_t[::1] b_mv = buf
    - cdef uint8_t* b_ptr = &b_mv[0]
    + cdef const uint8_t* b_ptr = &b_mv[0]
    cdef const uint8_t[::1] b_mv = b
    cdef uint8_t* b_ptr = &b_mv[0]
Suggested change:

      cdef const uint8_t[::1] b_mv = b
    - cdef uint8_t* b_ptr = &b_mv[0]
    + cdef const uint8_t* b_ptr = &b_mv[0]
cdef extern from *:
    """
    #define PyBytes_RESIZE(b, n) _PyBytes_Resize(&b, n)
    """
This should have an `#ifndef` guard to protect against a redefinition error.

So far it seems Cython doesn't encounter this error, as it defines this once per module. Still, it is a good idea to add the guard.

Handling in PR #732.
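A guarded version might look like the sketch below; the actual change is handled in #732 and may differ in detail.

```cython
cdef extern from *:
    """
    #ifndef PyBytes_RESIZE
    #define PyBytes_RESIZE(b, n) _PyBytes_Resize(&b, n)
    #endif
    """
    int PyBytes_RESIZE(object b, Py_ssize_t n) except -1
```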
Additionally have pushed changes to this PR to improve overall memory usage. This eliminates some unneeded copies that occurred at the end of some codecs. Also have eliminated some temporary allocations used in some codec pipelines by allocating output buffers earlier and changing operations to act in place. This should eliminate some spiky behavior seen recently with codecs.
TODO: