
bpo-45509: Check gzip headers for corrupted fields #29028

Open
wants to merge 4 commits into
base: main
from

Conversation

rhpvorderman
Contributor

@rhpvorderman rhpvorderman commented Oct 18, 2021

Check whether the COMMENT, NAME and HCRC fields are correct.

EDIT: Additionally, slightly increase performance for header checking for headers produced by gzip.compress or zlib.compress.

https://bugs.python.org/issue45509

@rhpvorderman rhpvorderman changed the title bpo: 45509: Check gzip headers for corrupted fields bpo-45509: Check gzip headers for corrupted fields Oct 18, 2021
@github-actions

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Nov 18, 2021
Lib/gzip.py Outdated
@@ -430,29 +430,42 @@ def _read_gzip_header(fp):

if magic != b'\037\213':
raise BadGzipFile('Not a gzipped file (%r)' % magic)

(method, flag, last_mtime) = struct.unpack("<BBIxx", _read_exact(fp, 8))
header_buffer = io.BytesIO()
Member

Try to use a bytearray instead of io.BytesIO. Would it be faster? In what cases?

Contributor Author

@rhpvorderman rhpvorderman Nov 23, 2021

Performance improvement was necessary (see bug report). So I made it better the following way:

  • When data is read from fp, a bytes object is created. Instead of passing it anonymously to the function that performs the check, it is saved as a local variable.
  • The FHCRC flag is checked and the result is saved as fhcrc, so the check runs only once.
  • When fhcrc is truthy, a header_parts list is created.
  • Every time fp is read and fhcrc is truthy, the bytes object that was just created is appended to the header_parts list. List appending has very low overhead: it only extends an array of PyObject pointers, and those pointers already exist as the named bytes objects.
  • When the header has to be checked, b"".join(header_parts) is called to reconstruct the header.

This means for the most common use cases (no flags or only FNAME set) that only a few extra truthy checks are performed. This comes at a very low cost of 3% (see the bug report for results).

This is worth it since it enables correct truncation checking for FNAME and FCOMMENT as well as enabling checking of the header CRC when FHCRC is set.
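The approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: the helper names read_exact, read_zero_terminated and read_header_checked are hypothetical, and the CRC rule (the low 16 bits of the CRC-32 over all header bytes before the CRC16 field) is taken from RFC 1952.

```python
import io
import struct
import zlib
from gzip import BadGzipFile

# Flag bits as defined in RFC 1952.
FHCRC, FEXTRA, FNAME, FCOMMENT = 2, 4, 8, 16


def read_exact(fp, n):
    """Hypothetical helper: read exactly n bytes or raise EOFError."""
    data = fp.read(n)
    if len(data) < n:
        raise EOFError("file ended before the end of the header was reached")
    return data


def read_zero_terminated(fp):
    """Hypothetical helper: read a field up to and including its NUL byte."""
    chunks = []
    while True:
        byte = fp.read(1)
        if not byte:
            raise EOFError("header field is truncated")
        chunks.append(byte)
        if byte == b"\x00":
            return b"".join(chunks)


def read_header_checked(fp):
    """Parse a gzip header and return its mtime, verifying FHCRC if set."""
    magic = read_exact(fp, 2)
    if magic != b"\037\213":
        raise BadGzipFile("Not a gzipped file (%r)" % magic)
    base = read_exact(fp, 8)
    method, flag, mtime = struct.unpack("<BBIxx", base)
    fhcrc = bool(flag & FHCRC)  # evaluated once, reused below
    header_parts = [magic, base] if fhcrc else None
    if flag & FEXTRA:
        raw_len = read_exact(fp, 2)
        extra = read_exact(fp, struct.unpack("<H", raw_len)[0])
        if fhcrc:
            header_parts += [raw_len, extra]
    for bit in (FNAME, FCOMMENT):  # FNAME precedes FCOMMENT in the header
        if flag & bit:
            field = read_zero_terminated(fp)
            if fhcrc:
                header_parts.append(field)
    if fhcrc:
        (stored,) = struct.unpack("<H", read_exact(fp, 2))
        computed = zlib.crc32(b"".join(header_parts)) & 0xFFFF
        if stored != computed:
            raise BadGzipFile("Corrupted gzip header: CRC mismatch")
    return mtime
```

When FHCRC is not set, the only extra work in this sketch is a handful of truth checks on fhcrc, which matches the low overhead described above.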

@rhpvorderman rhpvorderman force-pushed the bpo-45509 branch 2 times, most recently from 1f308a4 to 453ba11 Compare November 23, 2021 06:15
@rhpvorderman
Contributor Author

@serhiy-storchaka Thank you for taking a look and thank you for your insights. I have updated the PR accordingly. It is much better now in terms of performance.

@github-actions github-actions bot removed the stale Stale PR or inactive for long period of time. label Nov 24, 2021
Call the bool method and cache the result for faster truth checking.
Do not test for empty bytes but use "not magic" instead for faster truth checking.
@rhpvorderman
Contributor Author

I made a small tweak that shaves off another 2 microseconds:

  • Use if not magic instead of if magic == b''
  • Use fhcrc = bool(flag & FHCRC) instead of fhcrc = flag & FHCRC
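The two tweaks can be illustrated with a small sketch. The function names below are hypothetical, and the timing harness is only a way to measure the difference locally; it makes no claim about the numbers you will see.

```python
import timeit

FHCRC = 2  # flag bit from RFC 1952


def empty_by_comparison(magic):
    # Original form: compares against an empty bytes literal.
    return magic == b""


def empty_by_truthiness(magic):
    # Tweaked form: for bytes objects this gives the same answer.
    return not magic


flag = FHCRC | 8            # FHCRC and FNAME set
fhcrc_int = flag & FHCRC    # an int (2); truth-tested on each later "if"
fhcrc_bool = bool(flag & FHCRC)  # a bool (True), computed once

if __name__ == "__main__":
    for fn in (empty_by_comparison, empty_by_truthiness):
        elapsed = timeit.timeit(lambda: fn(b"\037\213"), number=1_000_000)
        print(fn.__name__, elapsed)
```

Both forms agree only because magic is always a bytes object here; for other types, not x and x == b"" are not interchangeable.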

Those are:
+ Only FNAME set. (Created by gzip and python's GzipFile)
+ No flags set. (Created by gzip.compress and zlib.compress with wbits=31)
@rhpvorderman
Contributor Author

rhpvorderman commented Nov 24, 2021

With further performance tweaks, reading the gzip header is now either faster or performance-neutral for the most common flag combinations (no flags, or only FNAME set). See bpo-45509 for more performance details.

The performance tweaks are:

  • Check if flags == 0. If so return last_mtime immediately and skip any further processing.
  • Check if flags == FNAME. If so read until the NULL byte terminating the name and return last_mtime. Skip any further processing.
  • In all other cases, use a slower code path where all parts of the header are saved for header CRC checking (even if FHCRC is not set). A single code path for all other cases is used here to keep the code simple(r). Since these cases will be rare, it makes no sense to complicate the code with an optimal path where parts of the header are only saved if FHCRC is set.
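A minimal sketch of the three code paths listed above, assuming the flag values and CRC rule from RFC 1952. The names read_mtime, _read_exact, _read_cstring and _slow_path are hypothetical and are not taken from the PR.

```python
import gzip
import io
import struct
import zlib
from gzip import BadGzipFile

FHCRC, FEXTRA, FNAME, FCOMMENT = 2, 4, 8, 16  # RFC 1952 flag bits


def _read_exact(fp, n):
    data = fp.read(n)
    if len(data) < n:
        raise EOFError("file ended before the end of the header was reached")
    return data


def _read_cstring(fp):
    # Reads up to and including the NUL terminator, one byte at a time.
    out = bytearray()
    while True:
        byte = fp.read(1)
        if not byte:
            raise EOFError("header field is truncated")
        out += byte
        if byte == b"\x00":
            return bytes(out)


def _slow_path(fp, parts, flag, mtime):
    # Rare case: keep every header part, whether or not FHCRC is set.
    if flag & FEXTRA:
        raw_len = _read_exact(fp, 2)
        parts.append(raw_len)
        parts.append(_read_exact(fp, struct.unpack("<H", raw_len)[0]))
    if flag & FNAME:
        parts.append(_read_cstring(fp))
    if flag & FCOMMENT:
        parts.append(_read_cstring(fp))
    if flag & FHCRC:
        (stored,) = struct.unpack("<H", _read_exact(fp, 2))
        if stored != (zlib.crc32(b"".join(parts)) & 0xFFFF):
            raise BadGzipFile("Corrupted gzip header: CRC mismatch")
    return mtime


def read_mtime(fp):
    magic = fp.read(2)
    if not magic:
        return None  # end of file: no further gzip member
    if magic != b"\037\213":
        raise BadGzipFile("Not a gzipped file (%r)" % magic)
    base = _read_exact(fp, 8)
    method, flag, mtime = struct.unpack("<BBIxx", base)
    if flag == 0:        # fast path 1: no optional fields at all
        return mtime
    if flag == FNAME:    # fast path 2: only a file name to skip
        _read_cstring(fp)
        return mtime
    return _slow_path(fp, [magic, base], flag, mtime)
```

For example, read_mtime(io.BytesIO(gzip.compress(b"hello", mtime=42))) goes through this function and returns the mtime that was passed to gzip.compress.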

@rhpvorderman
Contributor Author

@serhiy-storchaka The PR is ready for re-review when you have the time. The performance issue has been addressed.

@serhiy-storchaka serhiy-storchaka self-requested a review January 30, 2023 15:24
@serhiy-storchaka serhiy-storchaka added the type-feature A feature request or enhancement label Jan 30, 2023
@python-cla-bot

python-cla-bot bot commented Apr 18, 2025

All commit authors signed the Contributor License Agreement.

CLA signed

Labels
awaiting review type-feature A feature request or enhancement
Projects
None yet
Development

6 participants