bpo-45509: Check gzip headers for corrupted fields #29028
This PR is stale because it has been open for 30 days with no activity.
Lib/gzip.py
@@ -430,29 +430,42 @@ def _read_gzip_header(fp):

    if magic != b'\037\213':
        raise BadGzipFile('Not a gzipped file (%r)' % magic)

    (method, flag, last_mtime) = struct.unpack("<BBIxx", _read_exact(fp, 8))
    header_buffer = io.BytesIO()
Try to use a bytearray instead of io.BytesIO. Would it be faster? In what cases?
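One way to answer that question is a small microbenchmark. The sketch below is not taken from the PR: the chunk contents and the function names are made up for illustration, and the timings will vary by machine and Python version. It compares accumulating a few small header chunks in an io.BytesIO, in a bytearray, and in a list that is joined at the end.

```python
import io
import timeit

# Hypothetical header chunks: magic, the 8 fixed bytes, and a file name.
CHUNKS = [b"\x1f\x8b", b"\x08\x08\x00\x00\x00\x00\x00\xff", b"name.txt\x00"]


def with_bytesio():
    """Accumulate the chunks in an in-memory binary stream."""
    buf = io.BytesIO()
    for chunk in CHUNKS:
        buf.write(chunk)
    return buf.getvalue()


def with_bytearray():
    """Accumulate the chunks in a mutable bytearray."""
    buf = bytearray()
    for chunk in CHUNKS:
        buf += chunk
    return bytes(buf)


def with_list_join():
    """Collect references to the chunks and join them once at the end."""
    parts = []
    for chunk in CHUNKS:
        parts.append(chunk)
    return b"".join(parts)


if __name__ == "__main__":
    for fn in (with_bytesio, with_bytearray, with_list_join):
        seconds = timeit.timeit(fn, number=100_000)
        print(f"{fn.__name__}: {seconds:.3f}s")
```

All three functions return the same bytes, so only the accumulation cost differs between them.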
Performance improvement was necessary (see bug report). So I made it better in the following way:
- When the data is read from fp, a bytes object is created. Instead of passing it anonymously to the functions that perform the check, it is saved as a local variable.
- The FHCRC flag is checked and the result is saved as fhcrc, so the check runs only once.
- When fhcrc is truthy, a header_parts list is created.
- Every time fp is read and fhcrc is truthy, the already created bytes objects are appended to the header_parts list. List appending has very low overhead: it simply expands an array of PyObject pointers, and we already have those pointers as the named bytes objects.
- When the header should be checked, b"".join(header_parts) is called to recreate the header.
This means that, for the most common use cases (no flags or only FNAME set), only a few extra truthiness checks are performed. This comes at a very low cost of 3% (see the bug report for results).
This is worth it since it enables correct truncation checking for FNAME and FCOMMENT as well as enabling checking of the header CRC when FHCRC is set.
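The approach described above can be sketched roughly as follows. This is a simplified, standalone illustration and not the code in Lib/gzip.py: the names read_header, read_exact, read_zero_terminated and BadHeader are hypothetical, and the CRC rule (the low 16 bits of the CRC-32 over all header bytes preceding the CRC16 field) follows RFC 1952.

```python
import io
import struct
import zlib

FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT = 1, 2, 4, 8, 16


class BadHeader(Exception):
    """Hypothetical error type for this sketch (gzip uses BadGzipFile)."""


def read_exact(fp, n):
    """Read exactly n bytes or raise EOFError on a truncated header."""
    data = fp.read(n)
    if len(data) < n:
        raise EOFError("truncated gzip header")
    return data


def read_zero_terminated(fp):
    """Read a field up to and including its terminating NUL byte."""
    chunks = []
    while True:
        byte = fp.read(1)
        if not byte:
            raise EOFError("truncated gzip header")
        chunks.append(byte)
        if byte == b"\x00":
            return b"".join(chunks)


def read_header(fp):
    """Parse a gzip header, verify the header CRC if present, return mtime."""
    magic = read_exact(fp, 2)
    if magic != b"\x1f\x8b":
        raise BadHeader("not a gzip file")
    base = read_exact(fp, 8)
    method, flag, mtime = struct.unpack("<BBIxx", base)
    fhcrc = bool(flag & FHCRC)  # evaluated once, reused below
    header_parts = [magic, base] if fhcrc else None

    if flag & FEXTRA:
        xlen_bytes = read_exact(fp, 2)
        extra = read_exact(fp, struct.unpack("<H", xlen_bytes)[0])
        if fhcrc:
            header_parts += (xlen_bytes, extra)
    if flag & FNAME:
        name = read_zero_terminated(fp)
        if fhcrc:
            header_parts.append(name)
    if flag & FCOMMENT:
        comment = read_zero_terminated(fp)
        if fhcrc:
            header_parts.append(comment)
    if fhcrc:
        (stored,) = struct.unpack("<H", read_exact(fp, 2))
        computed = zlib.crc32(b"".join(header_parts)) & 0xFFFF
        if stored != computed:
            raise BadHeader("corrupted gzip header")
    return mtime


# Usage: a hand-built header with FNAME and FHCRC set.
header = b"\x1f\x8b" + struct.pack("<BBIxx", 8, FNAME | FHCRC, 123) + b"a.txt\x00"
header += struct.pack("<H", zlib.crc32(header) & 0xFFFF)
mtime = read_header(io.BytesIO(header))
```

When FHCRC is not set, header_parts stays None and the only extra work per field is one truthiness check on fhcrc, which is the trade-off the comment above describes.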
1f308a4 to 453ba11
@serhiy-storchaka Thank you for taking a look and thank you for your insights. I have updated the PR accordingly. It is much better now in terms of performance.
453ba11 to 617e064
Call the bool method and cache the result for faster truth checking. Do not test for empty bytes but use "not magic" instead for faster truth checking.
I made a small tweak that shaves off another 2 microseconds.
Those are:
+ Only FNAME set. (Created by gzip and Python's GzipFile)
+ No flags set. (Created by gzip.compress and zlib.compress with wbits=31)
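These two cases can be observed by inspecting the FLG byte (the fourth byte) of freshly written gzip data. The snippet below is only an illustration: the helper flag_byte and the file name example.txt are made up, and zlib.compressobj(wbits=31) stands in for zlib.compress with wbits=31 so that it also runs on Python versions where zlib.compress does not accept wbits.

```python
import gzip
import io
import zlib

FNAME = 8


def flag_byte(blob):
    """Return the FLG byte: magic (2 bytes), method (1 byte), then flags."""
    return blob[3]


# No flags set: gzip.compress writes a header without a file name.
plain = gzip.compress(b"payload")

# No flags set: zlib emitting a gzip container (wbits=31).
compressor = zlib.compressobj(wbits=31)
via_zlib = compressor.compress(b"payload") + compressor.flush()

# Only FNAME set: GzipFile given an explicit file name.
buffer = io.BytesIO()
with gzip.GzipFile(filename="example.txt", mode="wb", fileobj=buffer) as f:
    f.write(b"payload")
named = buffer.getvalue()

for label, blob in (("gzip.compress", plain), ("zlib", via_zlib), ("GzipFile", named)):
    print(label, "FNAME set:", bool(flag_byte(blob) & FNAME))
```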
With further performance tweaks, reading the gzip header is now either faster or performance-neutral for the most common flag combinations (no flags, or only FNAME set). See bpo-45509 for more performance details. The performance tweaks are:
@serhiy-storchaka The PR is ready for re-review when you have the time. The performance issue has been addressed.
Check whether the COMMENT, NAME and HCRC fields are correct.
EDIT: Additionally, slightly increase performance for header checking for headers produced by gzip.compress or zlib.compress.
https://bugs.python.org/issue45509