
Support for lz4 compression #163 #168


Open · wants to merge 10 commits into main

Conversation

@gnzsnz gnzsnz commented Mar 7, 2025

  • changes on __init__.py
  • update README.md
  • update tests
  • update ci.yml
  • add lz4 to dependencies
  • benchmark with and without io.BufferedWriter -> remove io.BufferedWriter for lz4
  • benchmark python lz4 vs lz4 CLI
  • set lz4 compression levels aligned with python lz4 [0-16], as the CLI accepts those values without problems.

gnzsnz added 3 commits March 7, 2025 21:26
- tox.ini to include lz4 tests
- pyproject.toml to include dev dependencies
- .github/workflows/ci.yml to include lz4 tests
- tests to include lz4 tests

gnzsnz commented Mar 8, 2025

Hi,

I added a few lines of code to support lz4; I did my best to follow the existing structure. Tests are passing, except for pypy, but I don't think the pypy issues are related to lz4.

Please let me know if I should make any changes.

```diff
@@ -808,7 +853,7 @@ def xopen(  # noqa: C901
     compresslevel is the compression level for writing to gzip, xz and zst files.
     This parameter is ignored for the other compression formats.
     If set to None, a default depending on the format is used:
     gzip: 6, xz: 6, zstd: 3.
```
Collaborator

Note to self: didn't we change the gzip level to 1?

Collaborator

Yes we did.

Collaborator

@rhpvorderman rhpvorderman left a comment

Looks very good. My only question is whether a BufferedWriter really adds value when returning an lz4 writable file; that should be benchmarked.

My other comment is that it is probably best to make lz4 non-optional, but that needs @marcelm 's blessing as well.

```python
f = lz4.frame.LZ4FrameFile(filename, mode, compression_level=compresslevel)
if "r" in mode:
    return f
# Buffer writes on lz4.open to mitigate overhead of small writes
```
Collaborator

For Gzip this overhead is present because gzip is written in Python. Did you benchmark this to check if it made a difference for small writes?

Author

No, I just followed what the other compression formats were doing.

For small writes we could use dictionaries for zstd and lz4, which should boost performance for small writes.

Collaborator

Gzip and Bzip2 write calls are implemented in Python, so they have massive overhead. If the object you are writing to is implemented in C, that is usually not the case. I recommend benchmarking whether the BufferedWriter helps.

For small writes we could use dictionaries for zstd and lz4, which should boost performance for small writes

Xopen does not support that. Gzip can also use dictionaries, but xopen does not provide the handles for that. That is more suited for low level libraries.
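The small-write overhead in question can be measured with a stdlib-only sketch along these lines (chunk size, write count, and the plain unbuffered file standing in for an lz4 stream are all assumptions for illustration):

```python
import io
import tempfile
import timeit
from pathlib import Path

CHUNK = b"x" * 100   # small write, similar to line-by-line text output
N_WRITES = 10_000

def bench(buffered: bool) -> float:
    """Time many small writes to an unbuffered file, optionally
    wrapped in io.BufferedWriter (a stand-in for an lz4 stream)."""
    path = Path(tempfile.mkdtemp()) / "out.bin"

    def run() -> None:
        with open(path, "wb", buffering=0) as raw:
            f = io.BufferedWriter(raw) if buffered else raw
            for _ in range(N_WRITES):
                f.write(CHUNK)
            f.flush()

    return min(timeit.repeat(run, number=1, repeat=5))

print(f"with BufferedWriter:    {bench(True):.4f}s")
print(f"without BufferedWriter: {bench(False):.4f}s")
```

For a C-backed target the two timings should come out close, which is the hypothesis being tested here.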


gnzsnz commented Mar 10, 2025

I forgot to mention this earlier.

python-lz4 has compression levels 0-16, while CLI lz4 has compression levels 1-12. If you have faced this with another algorithm, we can re-use an existing solution; if not, we can think of a way to scale compression levels up/down.

I have set the default to 1, as it works for both. But the value entered and the underlying compression backend need to be aligned, which is not obvious.
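If the two ranges did have to be reconciled, a linear rescaling would be one option (a hypothetical sketch, not code from this PR; the reply below suggests the CLI is lenient about levels, so this may turn out to be unnecessary):

```python
def scale_level(level: int, src=(0, 16), dst=(1, 12)) -> int:
    """Map a python-lz4 compression level [0-16] onto the lz4 CLI's
    documented range [1-12] by linear interpolation (hypothetical)."""
    (src_lo, src_hi), (dst_lo, dst_hi) = src, dst
    if not src_lo <= level <= src_hi:
        raise ValueError(f"level must be in [{src_lo}, {src_hi}]")
    return round(dst_lo + (level - src_lo) * (dst_hi - dst_lo) / (src_hi - src_lo))

print(scale_level(0), scale_level(16))  # endpoints map to 1 and 12
```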

@rhpvorderman (Collaborator)

On Debian 11 lz4 is not that picky: -0 works and even -1231 works (I assume anything in between works too); --1 does not work. So technically you could use range(0, sys.maxsize) as an allowed range specifier. (Please do not use the tuple call in that case; Python can check whether something is in a range without the whole array being created.)
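The point about `range` is that membership testing is O(1): CPython answers `x in range(...)` arithmetically from start/stop/step, without materializing any values, so even an enormous allowed range costs nothing:

```python
import sys

# Constant-time membership test; nothing is allocated beyond the
# small range object itself.
allowed = range(0, sys.maxsize)

print(12 in allowed)      # True
print(-1 in allowed)      # False
print(10**9 in allowed)   # answered without any iteration
```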


gnzsnz commented Mar 10, 2025

Benchmark description

Benchmark for files from 1KB to 100MB, 100 runs per test, using random data (the compression factor is bad in this scenario) and the default compression level.

The benchmarks do not use xopen except for the BufferedWriter test; I'm comparing raw Python LZ4 vs the OS lz4 CLI.

For Python I'm measuring reads using open from the Python standard library plus LZ4 open to write (and vice versa). This is to compare with the lz4 CLI, which does the same operation.

Compression LZ4 python vs LZ4 CLI

Testing sizes: ['1.0KB', '10.0KB', '100.0KB', '1.0MB', '10.0MB', '100.0MB']
Runs per test: 100

Results Compression (times in seconds):

| Size | Python LZ4 (avg ± std) | OS LZ4 (avg ± std) |
|---|---|---|
| 1.0KB | 0.000222 ± 0.000135 | 0.002831 ± 0.000895 |
| 10.0KB | 0.000153 ± 0.000044 | 0.002343 ± 0.000167 |
| 100.0KB | 0.000140 ± 0.000072 | 0.002443 ± 0.000308 |
| 1.0MB | 0.000696 ± 0.000833 | 0.003184 ± 0.000342 |
| 10.0MB | 0.006929 ± 0.002354 | 0.008113 ± 0.000971 |
| 100.0MB | 0.084589 ± 0.057809 | 0.056928 ± 0.062384 |

Note: Lower times are better. Results show mean ± standard deviation.

Python is faster up to and including 10MB; the CLI is better for big files at 100MB.

Decompression LZ4 python vs LZ4 CLI

Testing sizes: ['1.0KB', '10.0KB', '100.0KB', '1.0MB', '10.0MB', '100.0MB']
Runs per test: 100

Results Decompression (times in seconds):

| Size | Python LZ4 (avg ± std) | OS LZ4 (avg ± std) |
|---|---|---|
| 1.0KB | 0.103372 ± 0.002786 | 0.051380 ± 0.045575 |
| 10.0KB | 0.000817 ± 0.000073 | 0.005209 ± 0.016507 |
| 100.0KB | 0.000167 ± 0.000027 | 0.002811 ± 0.000130 |
| 1.0MB | 0.000096 ± 0.000019 | 0.002655 ± 0.000044 |
| 10.0MB | 0.010892 ± 0.000622 | 0.008884 ± 0.003092 |
| 100.0MB | 0.000113 ± 0.000028 | 0.002750 ± 0.000164 |

Note: Lower times are better. Results show mean ± standard deviation.

Mixed results here: the CLI is faster for small files (1KB), then Python lz4 is faster. Surprisingly, 1MB and 100MB are really fast with Python lz4; we might be hitting some block-size sweet spot there. Overall, for decompression Python is better than the CLI.

xopen benchmark with and without io.BufferedWriter

Testing sizes: ['1.0KB', '10.0KB', '100.0KB', '1.0MB', '10.0MB', '100.0MB']
Runs per test: 100

Results Compression (times in seconds):

| Size | xopen (avg ± std) | xopen no buffer (avg ± std) |
|---|---|---|
| 1.0KB | 0.000191 ± 0.000079 | 0.000105 ± 0.000020 |
| 10.0KB | 0.000164 ± 0.000049 | 0.000168 ± 0.000130 |
| 100.0KB | 0.000216 ± 0.000092 | 0.000157 ± 0.000015 |
| 1.0MB | 0.000719 ± 0.000501 | 0.000596 ± 0.000468 |
| 10.0MB | 0.006572 ± 0.002621 | 0.005854 ± 0.001389 |
| 100.0MB | 0.074405 ± 0.018944 | 0.074006 ± 0.020324 |

Note: Lower times are better. Results show mean ± standard deviation.

There are benefits to not using io.BufferedWriter across different file sizes.

Conclusion

Let me know your comments.

@rhpvorderman (Collaborator)

Compression LZ4 python vs LZ4 CLI
Python is faster up to and including 10MB; the CLI is better for big files at 100MB.

The CLI solution consists of the python process pushing data into the pipe and the LZ4 CLI processing that, effectively using 2 cores. This is using 1 additional thread.
The python lz4 is not given this benefit. What happens when python lz4 is opened with an additional thread (threads=2 or whatever syntax they use)? That is the fair comparison.

Decompression LZ4 python vs LZ4 CLI
Mixed results here: the CLI is faster for small files (1KB), then Python lz4 is faster. Surprisingly, 1MB and 100MB are really fast with Python lz4; we might be hitting some block-size sweet spot there. Overall, for decompression Python is better than the CLI.

The way you are benchmarking with 100 iterations is really sensitive to temporary moments when the CPU is busy doing other things, I think that explains the lacking 10 MB result. Except for 1kb, I see that for decompression Python LZ4 is always better. Given the large stddev on the 1KB result, I think that is from an outlier too.

My conclusion is that python lz4 is always faster. The decompression speed for lz4 is so enormously high (4GB/s and higher if the README is to be believed) that all we are doing is measuring the overhead of getting the data into a file. Using a pipe to another process is much more inefficient than doing it directly.

There are benefits to not using io.BufferedWriter across different file sizes.

Great! Thanks for benchmarking that.

It looks like always going for python lz4 is the best option except for maybe compression. Can you compare the 2threads result with the piped result (using 1 thread on lz4) to have a more apples to apples comparison?


gnzsnz commented Mar 11, 2025

The CLI solution consists of the python process pushing data into the pipe and the LZ4 CLI processing that, effectively using 2 cores. This is using 1 additional thread.
The python lz4 is not given this benefit. What happens when python lz4 is opened with an additional thread (threads=2 or whatever syntax they use)? That is the fair comparison.

The way I set up the benchmark is to use the default settings for the LZ4 CLI, which means number of threads = auto. There is no way to set the number of threads in Python LZ4; it supports multithreading, but there is no parameter to set it (https://python-lz4.readthedocs.io/en/stable/lz4.frame.html). 🤷

So, I don't think this test is possible. At least, I don't know how to do it. But I might be missing something.

I agree with your conclusion, python LZ4 is the best for most of the scenarios. And when it's not the best, it's within the statistical error.


gnzsnz commented Mar 11, 2025

I just realized I can share a Jupyter notebook using a gist:

https://gist.github.com/gnzsnz/048f8c2749f0c73716c69138c157adea

@rhpvorderman (Collaborator)

So, I don't think this test is possible. At least, I don't know how to do it. But I might be missing something.

I agree with your conclusion, python LZ4 is the best for most of the scenarios. And when it's not the best, it's within the statistical error.

Ah, but if python-lz4 does not support a threaded mode, then it is better to use the pipedcompressionprogram class instead for scenarios where threading is requested.

Python LZ4 drops the GIL, but if the threading library is not used to launch another thread, all the computation still happens on the same thread. So there is no speedup there. Xopen bypasses this with CLI programs. Python-isal is the exception here, because that contains quite some work to be able to escape the GIL and use true threading in Python.

So I think the current PR is almost ready. Simply make sure python-lz4 is not optional but required and all the tests pass, and then I think it can be merged.


gnzsnz commented Mar 13, 2025

Regarding tests: all tests are passing on my PC; pytest and tox work fine. tox is failing with pypy, but what fails is coverage, not the tests themselves. mypy is failing on tox, but these are old outstanding issues, nothing new introduced by this PR.

I have no clue why CI is failing; I would need your support there. I have made minimal changes to ci.yml to install the LZ4 CLI on the OSX and Ubuntu runners.

Collaborator

@rhpvorderman rhpvorderman left a comment

I just checked the code to see if I could spot anything, but nothing yet. I will have to run this on my own PC to check why the tests are failing when I have the time.

@cedricdonie

Hi everyone, I came across this PR by chance and was interested in the discussion. If I understand it correctly, it would always be faster to use python-lz4 for decompression despite it being single-threaded. However, the current work in this PR seems to always call the piped compression program for decompression when threading is requested.

Python LZ4 drops the GIL, but if the threading library is not used to launch another thread, all the computation still happens on the same thread. So there is no speedup there. Xopen bypasses this with CLI programs. Python-isal is the exception here, because that contains quite some work to be able to escape the GIL and use true threading in Python.

The above comment suggests that in theory, launching a subprocess should be faster because it can use a different process.

So, I don't think this test is possible. At least, I don't know how to do it. But I might be missing something.

I agree with your conclusion, python LZ4 is the best for most of the scenarios. And when it's not the best, it's within the statistical error.

Ah, but if python-lz4 does not support a threaded mode, then it is better to use the pipedcompressionprogram class instead for scenarios where threading is requested.

Your interpretation seems to be that python is faster despite python using a single thread while the CLI used threads=auto. If we now use pipedcompressionprogram when threading is requested, won't we slow down the decompression?

PS: I believe that sns.stripplots might be useful to analyze benchmarks with potential outliers. Here is a little notebook with an example: https://gist.github.com/cedricdonie/6785314197cee9a539b1936e22b2b982.
PPS: Thanks for the awesome library and for adding LZ4 🙂!

@rhpvorderman (Collaborator)

The above comment suggests that in theory, launching a subprocess should be faster because it can use a different process.

There are two definitions of "faster"

  • Uses less wallclock time. This metric favours heavy threading.
  • Uses less cpu time overall. This metric favours the least possible amount of overhead.

By using a subprocess, all the decompression happens in one process. The Python process only has to read input data from the pipe; it can then use the rest of the time to actually run the program you are interested in. This will decrease wallclock time.

Your interpretation seems to be that python is faster despite python using a single thread while the CLI used threads=auto. If we now use pipedcompressionprogram when threading is requested, won't we slow down the decompression?

LZ4 is a bit of a special case because the decompression is extremely fast. It could be that simply decompressing a block of data in LZ4 is much faster than incurring the overhead cost of a pipe while letting another program decompress it. So yes, that is a valid concern.


gnzsnz commented Mar 21, 2025

@rhpvorderman did you manage to find why CI is failing?

Collaborator

@marcelm marcelm left a comment

Awesome, thanks! I finally got around to reviewing this PR. I just noted a couple of minor things that I think should be changed, but I will let @rhpvorderman make the final decision.


gnzsnz commented Mar 27, 2025

Now pypy is failing:

```
tests/test_xopen.py::test_roundtrip[.lz4-None-t] PASSED                  [ 34%]
Fatal Python error: Aborted

Stack (most recent call first, approximate line numbers):
  File "/home/runner/work/xopen/xopen/.tox/py/lib/pypy3.9/site-packages/coverage/pytracer.py", line 146 in _trace
tests/test_xopen.py::test_roundtrip[.lz4-0-b] py: exit -6 (17.65 seconds) /home/runner/work/xopen/xopen> coverage run --branch --source=xopen,tests -m pytest -v --doctest-modules tests pid=2861
  py: FAIL code -6 (45.58=setup[27.93]+cmd[17.65] seconds)
  evaluation failed :( (45.98 seconds)
Error: Process completed with exit code 250.
```

But the error is coming from coverage, not really from the lz4 changes. I would appreciate your input; I assume this is not new.

@rhpvorderman (Collaborator)

python-lz4 is not tested and built for PyPy. So this is a problem with the upstream library or PyPy.

I am wondering what the best way is to work around this problem. Make lz4 optional again? Just drop the low quality bindings and only use the external program?


gnzsnz commented Mar 31, 2025

python-lz4 is not tested and built for PyPy. So this is a problem with the upstream library or PyPy.

I am wondering what the best way is to work around this problem. Make lz4 optional again? Just drop the low quality bindings and only use the external program?

Please let me know how to move forward, I see the following options:

  • leave the code as it is. It will not work with pypy unless the lz4 CLI is installed and explicitly called: xopen("file.txt.lz4", mode="wb", threads=1). Not sure if this is acceptable; if yes, I would suggest making it clear in the README.md
  • make lz4 an optional dependency, with the same caveats as in the previous point.
  • if I understand your comment correctly, you suggest dropping the lz4 module and using the lz4 CLI.

Personally, I think that lz4 as an optional dependency is the most flexible approach, but it is your call. Please let me know.


marcelm commented Apr 1, 2025

Personally, I think that lz4 as an optional dependency is the most flexible approach, but it is your call. Please let me know.

Yes, but we could use an environment marker to require lz4 only if we’re not on PyPy. It would look like this in pyproject.toml:

  "lz4>4.3.1; platform_python_implementation != 'PyPy'",

The code is a bit more complicated because we need to handle the case that import lz4 fails, but that is already implemented AFAICS.

On PyPy, we then need to fall back to the lz4 CLI.
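The import guard referred to above typically looks like this (a sketch of the pattern, not the PR's exact code):

```python
try:
    import lz4.frame  # real bindings where wheels exist
except ImportError:   # e.g. PyPy, where python-lz4 is not built
    lz4 = None

def lz4_bindings_available() -> bool:
    """True when python-lz4 was importable; callers would fall back
    to the lz4 CLI otherwise."""
    return lz4 is not None

print("python-lz4 available:", lz4_bindings_available())
```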

A couple of observations regarding the CLI:

  • Multithreading in lz4 is only supported for compression, not for decompression.
  • Older versions of lz4 (such as v1.9.4 that comes with Ubuntu 24.04) do not support multithreading. They fail if one tries to pass the -T option.
  • The newer versions use a default of -T0, which chooses the number of threads automatically (presumably the number of available cores).

So assuming that we do not want to try to detect which version of lz4 is installed, we cannot use -T. We can therefore not actually control the number of threads used for compression; it will just depend on which version of lz4 is installed.

Hm, what about this behavior:

| Mode | Threads | What to use | Comment |
|---|---|---|---|
| Compress | 0 | python-lz4 | |
| Compress | != 0 | lz4 CLI | threads parameter is ignored, but if it is a newer binary, it will use all cores |
| Decompress | 0 | python-lz4 | |
| Decompress | != 0 | python-lz4 | threads parameter is ignored |

And also fall back to lz4 CLI if python-lz4 is not available.

So this is essentially: if `lz4 is not None and (mode == 'rb' or (mode in ('ab', 'wb') and threads == 0))`, use the Python bindings; else, use the lz4 CLI.

What do you think?
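The table above reduces to a single predicate; a sketch of it (function name hypothetical, `lz4` being the optionally imported module):

```python
def use_python_lz4(lz4, mode: str, threads: int) -> bool:
    """Decide between the python-lz4 bindings and the lz4 CLI.

    Reading always prefers the bindings; writing uses them only when
    threads == 0. Everything else, including a missing module, goes
    to the CLI.
    """
    if lz4 is None:
        return False
    return mode == "rb" or (mode in ("ab", "wb") and threads == 0)

# Exercise each row of the table; object() stands in for the module.
assert use_python_lz4(object(), "rb", 8) is True   # decompress, any threads
assert use_python_lz4(object(), "wb", 0) is True   # compress, threads == 0
assert use_python_lz4(object(), "wb", 4) is False  # compress, threads != 0
assert use_python_lz4(None, "rb", 0) is False      # bindings unavailable
```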

@rhpvorderman (Collaborator)

And also fall back to lz4 CLI if python-lz4 is not available.

Also if threads was explicitly set to 0? I am in doubt about that one. Crashing because a single-threaded option cannot be provided leads to the user not being able to open the file. Defaulting to the CLI will use more threads than expected, which can cause issues on cluster environments.


marcelm commented Apr 2, 2025

Also if threads was explicitly set to 0? I am in doubt about that one. Crashing because a single-threaded option cannot be provided leads to the user not being able to open the file. Defaulting to the CLI will use more threads than expected, which can cause issues on cluster environments.

Hm, you’re right.

In general, I would say that as an xopen user, my main interest is in being able to open files that may be compressed, not caring so much about the details. Sure I can specify compression level and threads, but this is more on a “best effort” level. For example, if we open a bz2 file with threads > 0 but pbzip2 is not available, we open it using bz2.open anyway (singlethreaded).

Applied to lz4, I would say we should not fail even when the bindings aren’t available and threads == 0.

How about this instead of version detection (which I’d like to avoid): We try to run the lz4 CLI with the appropriate -T option first, and if that fails, we just re-try without the option.

Pseudocode

```
if lz4 is not None and (mode == 'rb' or (mode in ('ab', 'wb') and threads == 0)):
    # use Python bindings
else:
    try:
        run lz4 with option -T max(threads, 1)
    except ...:
        run lz4 without option -T
```
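One way to implement the retry without version detection is to probe once whether the installed binary accepts `-T` and cache the answer (helper names and the `--version` probe are assumptions for illustration; the real lz4 may need different probe arguments):

```python
import shutil
import subprocess
from functools import lru_cache

@lru_cache(maxsize=None)
def accepts_flag(program: str, flag: str) -> bool:
    """Return True if `program` exists and exits cleanly when given
    `flag` alongside --version. Cached, so we probe at most once."""
    if shutil.which(program) is None:
        return False
    result = subprocess.run(
        [program, flag, "--version"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def lz4_argv(threads: int) -> list:
    """Build an lz4 command line, adding -T only when supported."""
    argv = ["lz4"]
    flag = f"-T{max(threads, 1)}"
    if accepts_flag("lz4", flag):
        argv.append(flag)
    return argv

print(lz4_argv(4))
```

On an older lz4 the probe exits nonzero, so the flag is silently dropped and the binary chooses its own (single-threaded) behavior.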

@rhpvorderman (Collaborator)

Fully agree on all your points. That seems to be the best option.


gnzsnz commented Apr 7, 2025

Status

  • lz4 dependency, but not for pypy, in pyproject: "lz4>4.3.1; platform_python_implementation != 'PyPy'"
  • lz4 dependency management. The previous point makes tox fail for pypy. We need to consider a logic where lz4 is still treated as an optional dependency. This was part of the initial commit; I need to search the commit history and apply it.
  • logic to maximize chances of lz4 success Support for lz4 compression #163 #168 (comment)
  • there is a new error on test_xopen.py::test_pass_bytesio_for_reading_and_writing for params [lz4-1], error io.UnsupportedOperation: fileno. After my initial analysis, it is clear that BytesIO will not work for process pipes, as there is no functional fileno method. My understanding is that this test exercises threads on a Python module; for lz4 it works with threads == 0, but with threads == 1 it tries to use the LZ4 CLI, which fails. I can add a check in the code or maybe an xfail marker. Need to look into it.

Let me know your comments.
