Gzip header corruption not properly checked. #89672
The following headers are currently allowed while being wrong:
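For context, the header fields under discussion are defined by RFC 1952. Below is a hypothetical sketch (the helper name `gzip_member_with_fhcrc` is my own, not from the patch) that hand-builds a gzip member with the FHCRC flag set and a *valid* header CRC16, to show exactly which bytes a stricter header check has to verify:

```python
import gzip
import struct
import zlib

def gzip_member_with_fhcrc(data: bytes) -> bytes:
    # Fixed 10-byte header per RFC 1952: magic, CM=8 (deflate),
    # FLG=0x02 (FHCRC set), MTIME=0, XFL=0, OS=255 (unknown).
    header = b"\x1f\x8b\x08\x02" + struct.pack("<IBB", 0, 0, 255)
    # The header CRC16 is the low 16 bits of the CRC-32 computed
    # over all header bytes that precede it.
    crc16 = zlib.crc32(header) & 0xFFFF
    # Raw DEFLATE stream (no zlib wrapper), then the gzip trailer:
    # CRC-32 of the uncompressed data and its length modulo 2**32.
    compressor = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    deflated = compressor.compress(data) + compressor.flush()
    trailer = struct.pack("<II", zlib.crc32(data), len(data) & 0xFFFFFFFF)
    return header + struct.pack("<H", crc16) + deflated + trailer

print(gzip.decompress(gzip_member_with_fhcrc(b"hello")))  # b'hello'
```

A decoder that merely skips the two FHCRC bytes accepts this member just as readily as one that recomputes and compares the CRC16; only a corrupted header tells the two behaviors apart.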
Bump. This bug allows corrupted gzip files to be processed without error, so I am bumping this issue in the hope that someone will review the PR.
Ping
I think it is a good idea, although we may get regression reports for 3.11 once Python starts rejecting gzip files that were previously read without error. But before merging this I want to know:
I tested it for the two most common use cases.

```python
import statistics
import timeit

WITH_FNAME = """
from gzip import GzipFile, decompress
import io
fileobj = io.BytesIO()
g = GzipFile(fileobj=fileobj, mode='wb', filename='compressable_file')
g.write(b'')
g.close()
data = fileobj.getvalue()
"""

WITH_NO_FLAGS = """
from gzip import decompress
import zlib
data = zlib.compress(b'', wbits=31)
"""

def benchmark(name, setup, loops=10000, runs=10):
    print(f"{name}")
    results = [timeit.timeit("decompress(data)", setup, number=loops)
               for _ in range(runs)]
    # Convert total seconds per run into microseconds per call.
    results = [(result / loops) * 1_000_000 for result in results]
    print(f"average: {round(statistics.mean(results), 2)}, "
          f"range: {round(min(results), 2)}-{round(max(results), 2)} "
          f"stdev: {round(statistics.stdev(results), 2)}")

if __name__ == "__main__":
    benchmark("with_fname", WITH_FNAME)
    benchmark("with_noflags", WITH_NO_FLAGS)
```

BEFORE: with_fname AFTER:

That is a dramatic increase in overhead. (Okay, the decompressed data is empty, but still.)
I improved the performance of the patch and added the file used for benchmarking. I also test the FHCRC changes now. The benchmark tests headers with different flags, each concatenated to a DEFLATE block with no data and a gzip trailer, and feeds the result to gzip.decompress. Please note that this is the *worst-case* performance overhead: when there is actual data to decompress the relative overhead shrinks, and it shrinks further when GzipFile is used.

BEFORE (current main branch): AFTER (bpo-45509 PR):

An increase of 0.1 microseconds in the most common use cases, roughly 3%, but now the FNAME field is correctly checked for truncation. With FHCRC the overhead increases by 33%, but that is worth it, because the header is now actually checked, as it should be.
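To reproduce that worst-case input shape (my own sketch of what the comment describes, not the actual benchmark file), one can concatenate a minimal no-flags header, an empty raw DEFLATE stream, and a gzip trailer:

```python
import gzip
import struct
import zlib

# Minimal 10-byte gzip header: magic, CM=8 (deflate), FLG=0,
# MTIME=0, XFL=0, OS=255 (unknown).
header = b"\x1f\x8b\x08\x00" + struct.pack("<IBB", 0, 0, 255)

# An empty raw DEFLATE stream (negative wbits = no zlib wrapper).
compressor = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
empty_block = compressor.compress(b"") + compressor.flush()

# Trailer: CRC-32 and length of the (empty) uncompressed data.
trailer = struct.pack("<II", zlib.crc32(b""), 0)

member = header + empty_block + trailer
assert gzip.decompress(member) == b""
# With no payload, nearly all of the decompression time is spent
# parsing and validating the header, which is the point of the test.
```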
I have found that using the timeit module provides more precise measurements:

For a simple gzip header (as returned by gzip.compress, or zlib.compress with wbits=31): For a gzip header with FNAME (as returned by gzip itself and by Python's GzipFile): For a gzip header with all flags set:

Since performance is most critical for in-memory compression and decompression, I have now optimized for the no-flags case. For the most common case of only FNAME set: For the case where FHCRC is set:

So this PR is now a clear win for decompressing anything that has been compressed with gzip.compress, and it is neutral for normal file decompression. There is a performance cost associated with correctly checking the header, but that is expected, and it is better than the alternative of not checking it.