Support for lz4 compression #163 #168
base: main
Conversation
Hi, I added a few lines of code to support lz4. Please let me know if I should make any changes.
```
@@ -808,7 +853,7 @@ def xopen( # noqa: C901
compresslevel is the compression level for writing to gzip, xz and zst files.
This parameter is ignored for the other compression formats.
If set to None, a default depending on the format is used:
gzip: 6, xz: 6, zstd: 3.
```
Note to self: didn't we change the gzip level to 1?
Yes we did.
Looks very good. My only question is whether to benchmark if a BufferedWriter really adds value when returning an lz4 writable file.
My other comment is that it is probably best to make lz4 non-optional, but that needs @marcelm's blessing as well.
src/xopen/__init__.py
```python
f = lz4.frame.LZ4FrameFile(filename, mode, compression_level=compresslevel)
if "r" in mode:
    return f
# Buffer writes on lz4.open to mitigate overhead of small writes
```
For gzip this overhead is present because gzip is written in Python. Did you benchmark this to check if it made a difference for small writes?
No, I just followed what the other compression formats were doing.
For small writes we could use dictionaries for zstd and lz4, which should boost performance for small writes.
Gzip and Bzip2 write calls are implemented in Python. They have massive overhead. If the object you are writing to is implemented in C, that usually is not the case. I recommend benchmarking whether the BufferedWriter helps.

> For small writes we could use dictionaries for zstd and lz4, which should boost performance for small writes

Xopen does not support that. Gzip can also use dictionaries, but xopen does not provide the handles for that. That is more suited for low-level libraries.
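Since the open question is whether `io.BufferedWriter` pays off for many tiny writes, a stdlib-only sketch can measure it. It uses gzip as a stand-in (its `write` path is pure Python, as noted above), so it runs without python-lz4 installed; the same harness pointed at `lz4.frame.LZ4FrameFile` would answer the question for lz4 directly:

```python
import gzip
import io
import time

def small_writes(buffered: bool, n: int = 20_000) -> tuple[float, bytes]:
    """Write n tiny chunks through gzip, optionally via io.BufferedWriter."""
    raw = io.BytesIO()
    gz = gzip.GzipFile(fileobj=raw, mode="wb")
    out = io.BufferedWriter(gz) if buffered else gz
    t0 = time.perf_counter()
    for _ in range(n):
        out.write(b"12345678")  # 8-byte writes: worst case for per-call overhead
    out.close()  # flushes the buffer and finalizes the gzip stream
    return time.perf_counter() - t0, raw.getvalue()

t_buf, data_buf = small_writes(True)
t_raw, data_raw = small_writes(False)
# both paths must produce the same payload
assert gzip.decompress(data_buf) == gzip.decompress(data_raw)
print(f"buffered: {t_buf:.4f}s  unbuffered: {t_raw:.4f}s")
```

The buffered variant batches the 8-byte writes into large chunks before they reach the Python-level `GzipFile.write`, which is where the per-call overhead lives.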
I forgot to mention this earlier.
I have set the default to 1, as that works for both. But the value entered and the underlying compression backend need to be aligned, which is not obvious.

On Debian 11 the lz4 CLI is not that picky.
**Benchmark description**

Benchmark for files from 1 KB to 100 MB, 100 runs per test. Using random data (the compression factor is bad for this scenario) and the default compression level. The benchmarks are not using xopen except for the BufferedWriter comparison; I'm comparing raw Python LZ4 vs the OS lz4 CLI. For Python I'm measuring reads using the lz4 bindings.

**Compression: LZ4 Python vs LZ4 CLI**

Testing sizes: ['1.0KB', '10.0KB', '100.0KB', '1.0MB', '10.0MB', '100.0MB']

Results Compression (times in seconds):
Note: lower times are better. Results show mean ± standard deviation. Python is faster up to 10 MB (included); the CLI is better for big files at 100 MB.

**Decompression: LZ4 Python vs LZ4 CLI**

Testing sizes: ['1.0KB', '10.0KB', '100.0KB', '1.0MB', '10.0MB', '100.0MB']

Results Decompression (times in seconds):
Note: lower times are better. Results show mean ± standard deviation. Mixed results here: the CLI is faster for small files (1 KB), then Python lz4 is faster. Surprisingly, 1 MB and 100 MB are really fast with Python lz4; we might be hitting some block-size sweet spot there. Overall, for decompression Python is better than the CLI.

**xopen benchmark with and without io.BufferedWriter**
| Size | xopen (avg ± std) | xopen no buffer (avg ± std) |
|---|---|---|
| 1.0KB | 0.000191 ± 0.000079 | 0.000105 ± 0.000020 |
| 10.0KB | 0.000164 ± 0.000049 | 0.000168 ± 0.000130 |
| 100.0KB | 0.000216 ± 0.000092 | 0.000157 ± 0.000015 |
| 1.0MB | 0.000719 ± 0.000501 | 0.000596 ± 0.000468 |
| 10.0MB | 0.006572 ± 0.002621 | 0.005854 ± 0.001389 |
| 100.0MB | 0.074405 ± 0.018944 | 0.074006 ± 0.020324 |

Note: lower times are better. Results show mean ± standard deviation.
There are benefits to not using io.BufferedWriter across different file sizes.
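For reference, the 100-run mean ± std protocol used in these benchmarks can be expressed as a small stdlib harness along these lines (a sketch, not the notebook's actual code):

```python
import statistics
import time

def bench(fn, runs: int = 100) -> tuple[float, float]:
    """Call fn() `runs` times and return (mean, stdev) of wall-clock seconds."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return statistics.mean(times), statistics.stdev(times)

# example: time a trivial workload
mean, std = bench(lambda: sum(range(10_000)))
print(f"{mean:.6f} ± {std:.6f} s")
```

With 100 samples, a single scheduling hiccup inflates the standard deviation noticeably, which matches the outlier discussion below.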
**Conclusion**

- I'm taking out io.BufferedWriter for lz4, as it performs better without it. Let me know if I should do the same for zstd.
- Overall it is better to use Python lz4, but there might still be a case for using the OS lz4 CLI. I'm leaving the code as it is.
- The benchmark is in an ipynb file, which I can't upload here: https://www.dropbox.com/scl/fi/ftdm12snz4x5rawsdqvix/lz4_bench.ipynb?rlkey=0ayph5s4qiovgal9nlsc4c5lb&st=si0o7cic&dl=0

Let me know your comments.
The CLI solution consists of the Python process pushing data into the pipe and the lz4 CLI processing it, effectively using 2 cores. So it is using 1 additional thread.
The way you are benchmarking, with 100 iterations, is really sensitive to temporary moments when the CPU is busy doing other things; I think that explains the odd 10 MB result. Except for 1 KB, I see that for decompression Python LZ4 is always better. Given the large stddev on the 1 KB result, I think that is from an outlier too. My conclusion is that Python lz4 is always faster. The decompression speed for lz4 is so enormously high (4 GB/s and higher, if the README is to be believed) that all we are doing is measuring the overhead of getting the data into a file. Using a pipe to another process is much less efficient than doing it directly.
Great! Thanks for benchmarking that. It looks like always going for Python lz4 is the best option, except maybe for compression. Can you compare the 2-threads result with the piped result (using 1 thread on lz4) to have a more apples-to-apples comparison?
The way I set up the benchmark is to use default settings for the lz4 CLI, which means the thread count is chosen automatically. So I don't think this test is possible; at least, I don't know how to do it. But I might be missing something. I agree with your conclusion: Python LZ4 is the best for most scenarios, and when it's not the best, it's within the statistical error.
I just realized I can share a Jupyter notebook using a gist: https://gist.github.com/gnzsnz/048f8c2749f0c73716c69138c157adea
Ah, but if python-lz4 does not support a threaded mode, then it is better to use the piped compression program class instead for scenarios where threading is requested. Python LZ4 drops the GIL, but if the threading library is not used to launch another thread, all the computation still happens on the same thread, so there is no speedup there. Xopen bypasses this with CLI programs. Python-isal is the exception here, because it contains quite some work to be able to escape the GIL and use true threading in Python. So I think the current PR is almost ready. Simply make sure python-lz4 is not optional but required and all the tests pass, and then I think it can be merged.
Regarding tests: all tests are passing on my PC; pytest and tox work fine. tox is failing with pypy, but what fails is the coverage tooling rather than the lz4 code. I have no clue why CI is failing; I would need your support there. I have made only minimal changes to CI.
I just checked the code to see if I could spot anything, but nothing yet. I will have to run this on my own PC to check why the tests are failing, when I have the time.
Hi everyone, I came across this PR by chance and was interested in the discussion. If I understand it correctly, it would always be faster to use python-lz4 for decompression despite it being single-threaded. However, the current work in this PR seems to always call the piped compression program for decompression when threading is requested.
The above comment suggests that in theory, launching a subprocess should be faster because it can use a different process.
Your interpretation seems to be that Python is faster despite using a single thread while the CLI used threads=auto. If we now use the piped compression program when threading is requested, won't we slow down the decompression?

PS: I believe that sns.stripplots might be useful for analyzing benchmarks with potential outliers. Here is a little notebook with an example: https://gist.github.com/cedricdonie/6785314197cee9a539b1936e22b2b982
There are two definitions of "faster"
By using a subprocess, all the decompression is in one process. The Python process only has to read input data from the pipe. It can then use the rest of the time to actually run the program you are interested in. This will decrease wall-clock time.
LZ4 is a bit of a special case because the decompression is extremely fast. It could be that simply decompressing a block of data in LZ4 is much faster than incurring the overhead cost of a pipe while letting another program decompress it. So yes, that is a valid concern.
@rhpvorderman did you manage to find why CI is failing?
Awesome, thanks! I finally got around to reviewing this PR. I just noted a couple of minor things that I think should be changed, but I will let @rhpvorderman make the final decision.
Now pypy is failing:

```
tests/test_xopen.py::test_roundtrip[.lz4-None-t] PASSED [ 34%]
Fatal Python error: Aborted
Stack (most recent call first, approximate line numbers):
  File "/home/runner/work/xopen/xopen/.tox/py/lib/pypy3.9/site-packages/coverage/pytracer.py", line 146 in _trace
tests/test_xopen.py::test_roundtrip[.lz4-0-b]
py: exit -6 (17.65 seconds) /home/runner/work/xopen/xopen> coverage run --branch --source=xopen,tests -m pytest -v --doctest-modules tests pid=2861
py: FAIL code -6 (45.58=setup[27.93]+cmd[17.65] seconds)
evaluation failed :( (45.98 seconds)
Error: Process completed with exit code 250.
```

But the error is coming from coverage, not really from the lz4 changes. I would appreciate your input; I assume this is not new.
python-lz4 is not tested and built for PyPy, so this is a problem with the upstream library or PyPy. I am wondering what the best way is to work around this problem. Make lz4 optional again? Just drop the low-quality bindings and only use the external program?
Please let me know how to move forward; I see the following options:

Personally, I think that lz4 as an optional dependency is the most flexible approach, but it's your call. Please let me know.
Yes, but we could use an environment marker to require
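One way such a marker could look (a guess at what was meant; the exact requirement string is not spelled out in the thread) is a PEP 508 environment marker in the dependency list:

```
# require python-lz4 everywhere except PyPy, where no wheels are available
lz4; platform_python_implementation != "PyPy"
```

With this, `pip install xopen` would pull in python-lz4 on CPython but skip it on PyPy, leaving the CLI fallback for that case.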
The code is a bit more complicated because we need to handle the case that the lz4 bindings are not available. On PyPy, we then need to fall back to the CLI program. A couple of observations regarding the CLI:
So assuming that we do not want to try to detect which version of the lz4 CLI is installed. Hm, what about this behavior:

And also fall back to the lz4 CLI if the bindings are not available. So this is essentially a best-effort approach. What do you think?
Also if threads was explicitly set to 0? I am in doubt about that one. Crashing because a single-threaded option cannot be provided leads to the user not being able to open the file. Defaulting to the CLI will use more threads than expected, which can cause issues on cluster environments.
Hm, you’re right. In general, I would say that as an xopen user, my main interest is in being able to open files that may be compressed, not caring so much about the details. Sure, I can specify compression level and threads, but this is more on a “best effort” level. For example, if we open a bz2 file with threads > 0 but pbzip2 is not available, we open it using the built-in bz2 module. Applied to lz4, I would say we should not fail even when the bindings aren’t available and threads == 0. How about this instead of version detection (which I’d like to avoid): we try to run the lz4 program and fall back if that fails. Pseudocode:
Fully agree on all your points. That seems to be the best option.
**Status**

- `__init__.py`
- `README.md`
- `ci.yml`: add `lz4` to dependencies
- `io.BufferedWriter` -> remove `io.BufferedWriter` for lz4

Let me know your comments.