bpo-45509: Check gzip headers for corrupted fields #29028

rhpvorderman · 2021-10-18T11:48:22Z

Check whether the COMMENT, NAME and HCRC fields are correct.

EDIT: Additionally, slightly improve header-checking performance for headers produced by gzip.compress or zlib.compress.

https://bugs.python.org/issue45509

github-actions · 2021-11-18T00:05:46Z

This PR is stale because it has been open for 30 days with no activity.

Lib/test/test_gzip.py

Lib/gzip.py

serhiy-storchaka · 2021-11-22T09:35:27Z

Lib/gzip.py

@@ -430,29 +430,42 @@ def _read_gzip_header(fp):

    if magic != b'\037\213':
        raise BadGzipFile('Not a gzipped file (%r)' % magic)
-
-    (method, flag, last_mtime) = struct.unpack("<BBIxx", _read_exact(fp, 8))
+    header_buffer = io.BytesIO()


Try to use a bytearray instead of io.BytesIO. Would it be faster? In what cases?

A performance improvement was necessary (see the bug report), so I reworked it as follows:

When data is read from fp, a bytes object is created. Instead of passing it anonymously to the function that performs the check, it is saved as a local variable.

The FHCRC flag is checked once and the result is saved as fhcrc.

When fhcrc is truthy, a header_parts list is created.

Every time fp is read while fhcrc is truthy, the bytes object that was just read is appended to header_parts. Appending to a list has very low overhead: it only extends an array of PyObject pointers, and we already hold those pointers as the named bytes objects.

When the header has to be checked, b"".join(header_parts) is called to reconstruct it.

For the most common use cases (no flags, or only FNAME set) this means only a few extra truthiness checks are performed, at a very low cost of 3% (see the bug report for results).

That cost is worth it, since it enables correct truncation checking for FNAME and FCOMMENT as well as checking of the header CRC when FHCRC is set.
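For readers following along, the approach described above could look roughly like the sketch below. It is not the code in this PR: read_exact, read_zero_terminated and read_header_with_parts are hypothetical names, the error types are simplified, and the field layout is taken from RFC 1952.

```python
import io
import struct
import zlib

# Flag bits of the gzip FLG byte, as defined in RFC 1952.
FHCRC, FEXTRA, FNAME, FCOMMENT = 2, 4, 8, 16


def read_exact(fp, n):
    """Hypothetical helper: read exactly n bytes or raise EOFError."""
    data = fp.read(n)
    if len(data) != n:
        raise EOFError("gzip header is truncated")
    return data


def read_zero_terminated(fp):
    """Hypothetical helper: read a field up to and including its NUL byte."""
    field = bytearray()
    while True:
        byte = fp.read(1)
        if not byte:
            raise EOFError("gzip header field is truncated")
        field += byte
        if byte == b"\x00":
            return bytes(field)


def read_header_with_parts(fp):
    """Return the header mtime; verify the CRC16 when FHCRC is set."""
    magic = read_exact(fp, 2)
    if magic != b"\x1f\x8b":
        raise ValueError("not a gzipped file (%r)" % magic)
    base = read_exact(fp, 8)
    method, flag, mtime = struct.unpack("<BBIxx", base)
    fhcrc = bool(flag & FHCRC)  # checked once, reused below
    header_parts = [magic, base] if fhcrc else None

    if flag & FEXTRA:
        raw_len = read_exact(fp, 2)
        extra = read_exact(fp, struct.unpack("<H", raw_len)[0])
        if fhcrc:
            header_parts.extend((raw_len, extra))
    for bit in (FNAME, FCOMMENT):
        if flag & bit:
            field = read_zero_terminated(fp)
            if fhcrc:
                header_parts.append(field)
    if fhcrc:
        (stored,) = struct.unpack("<H", read_exact(fp, 2))
        # RFC 1952: CRC16 is the low 16 bits of the CRC32 of the header.
        computed = zlib.crc32(b"".join(header_parts)) & 0xFFFF
        if stored != computed:
            raise ValueError("corrupted gzip header")
    return mtime


# Usage: a hand-built header with FNAME and FHCRC set passes the check.
header = b"\x1f\x8b" + struct.pack("<BBIxx", 8, FNAME | FHCRC, 0) + b"a.txt\x00"
crc16 = struct.pack("<H", zlib.crc32(header) & 0xFFFF)
read_header_with_parts(io.BytesIO(header + crc16))
```

The list only exists when FHCRC is set, so headers without that flag pay for nothing beyond the extra `if fhcrc:` checks.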

rhpvorderman · 2021-11-23T06:24:08Z

@serhiy-storchaka Thank you for taking a look and thank you for your insights. I have updated the PR accordingly. It is much better now in terms of performance.

rhpvorderman · 2021-11-24T09:58:57Z

I made a small tweak that shaves off another 2 microseconds:

Use if not magic instead of if magic == b''
Use fhcrc = bool(flag & FHCRC) instead of fhcrc = flag & FHCRC
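The two tweaks do not change behaviour as long as magic is a bytes object, which is what fp.read returns for a binary file. A minimal sketch to check that and to time the variants locally is below; the function names are made up for illustration, and the timings depend on the machine and Python version, so no numbers are claimed here.

```python
import timeit

FHCRC = 2  # bit 1 of the gzip FLG byte (RFC 1952)


def eof_by_comparison(magic):
    """Original style: compare against an empty bytes literal."""
    return magic == b""


def eof_by_truthiness(magic):
    """Tweaked style: rely on empty bytes being falsy."""
    return not magic


def fhcrc_as_bool(flag):
    """Convert once to a real bool that later `if fhcrc:` checks can reuse."""
    return bool(flag & FHCRC)


if __name__ == "__main__":
    setup = "magic = bytes([0x1f, 0x8b])"
    for stmt in ("magic == b''", "not magic"):
        seconds = timeit.timeit(stmt, setup=setup, number=1_000_000)
        print(f"{stmt!r}: {seconds:.3f}s per million evaluations")
```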

rhpvorderman · 2021-11-24T11:14:09Z

With further performance tweaks, reading the gzip header is now either faster or performance-neutral for the most common flags (no flags, or only FNAME set). See bpo-45509 for more performance details.

The performance tweaks are:

Check if flags == 0. If so, return last_mtime immediately and skip any further processing.
Check if flags == FNAME. If so, read until the NUL byte terminating the name, return last_mtime and skip any further processing.
In all other cases, use a slower code path where all parts of the header are saved for header CRC checking (even if FHCRC is not set). A single code path handles all of these cases to keep the code simple(r). Since they will be rare, it makes no sense to complicate the code with an optimized path that saves header parts only when FHCRC is set.
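The dispatch described above could be sketched as follows. This is an illustration and not the code in the PR: read_gzip_header_sketch and _slow_path are hypothetical names, and the slow path is left as a stub because only the two fast paths are shown.

```python
import io
import struct

FNAME = 8  # bit 3 of the gzip FLG byte (RFC 1952)


def _slow_path(fp, flag, mtime, header_parts):
    """Stub for the general path that saves every part for CRC checking."""
    raise NotImplementedError("general path omitted from this sketch")


def read_gzip_header_sketch(fp):
    """Return the header mtime, using fast paths for the common flag values."""
    magic = fp.read(2)
    if magic != b"\x1f\x8b":
        raise ValueError("not a gzipped file (%r)" % magic)
    base = fp.read(8)
    if len(base) != 8:
        raise EOFError("gzip header is truncated")
    method, flag, mtime = struct.unpack("<BBIxx", base)
    if method != 8:
        raise ValueError("unknown compression method")

    # Fast path 1: no optional fields at all.
    if flag == 0:
        return mtime

    # Fast path 2: only a file name, which ends at the first NUL byte.
    if flag == FNAME:
        while True:
            byte = fp.read(1)
            if not byte:
                raise EOFError("FNAME field is truncated")
            if byte == b"\x00":
                return mtime

    # Everything else: one shared, slower path.
    return _slow_path(fp, flag, mtime, [magic, base])


# Usage: after the call, fp is positioned at the start of the deflate data.
fp = io.BytesIO(b"\x1f\x8b" + struct.pack("<BBIxx", 8, FNAME, 0) + b"a.txt\x00...")
read_gzip_header_sketch(fp)
```

Comparing the whole flag byte with `==`, instead of testing individual bits, is what keeps the two common cases down to one comparison each.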

rhpvorderman · 2021-11-29T15:30:57Z

@serhiy-storchaka The PR is ready for re-review when you have the time. The performance issue has been addressed.

python-cla-bot · 2025-04-18T09:47:41Z

All commit authors signed the Contributor License Agreement.

bedevere-bot added the awaiting review label Oct 18, 2021

the-knights-who-say-ni added the CLA signed label Oct 18, 2021

rhpvorderman changed the title ~~bpo: 45509: Check gzip headers for corrupted fields~~ bpo-45509: Check gzip headers for corrupted fields Oct 18, 2021

github-actions bot added the stale label Nov 18, 2021

serhiy-storchaka reviewed Nov 22, 2021

rhpvorderman force-pushed the bpo-45509 branch 2 times, most recently from 1f308a4 to 453ba11 November 23, 2021 06:15

Check gzip headers for corrupted fields

617e064

rhpvorderman force-pushed the bpo-45509 branch from 453ba11 to 617e064 November 23, 2021 06:28

github-actions bot removed the stale label Nov 24, 2021

Minor performance tweaks to _read_gzip_header

3d5bb47

Call the bool method and cache the result for faster truth checking. Do not test for empty bytes but use "not magic" instead for faster truth checking.

Optimize _read_gzip_header for the most common code paths

e68e76e

Those are:
+ Only FNAME set. (Created by gzip and Python's GzipFile.)
+ No flags set. (Created by gzip.compress and zlib.compress with wbits=31.)
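The claim about which writers produce which flags can be checked from Python. The snippet below is a small sketch, with header_flags as a made-up helper name; it uses zlib.compressobj(wbits=31) instead of zlib.compress so that it also runs on versions where zlib.compress has no wbits parameter.

```python
import gzip
import io
import zlib

FNAME = 8  # bit 3 of the gzip FLG byte (RFC 1952)


def header_flags(data):
    """Return the FLG byte, which is the fourth byte of a gzip stream."""
    return data[3]


# gzip.compress writes a header without optional fields.
from_gzip_compress = gzip.compress(b"payload")

# zlib with wbits=31 wraps the deflate stream in a minimal gzip header.
compressor = zlib.compressobj(wbits=31)
from_zlib = compressor.compress(b"payload") + compressor.flush()

# GzipFile records the file name it was given, which sets FNAME.
buffer = io.BytesIO()
with gzip.GzipFile(filename="example.txt", mode="wb", fileobj=buffer) as f:
    f.write(b"payload")
from_gzipfile = buffer.getvalue()

for label, data in (
    ("gzip.compress", from_gzip_compress),
    ("zlib, wbits=31", from_zlib),
    ("GzipFile with filename", from_gzipfile),
):
    print(f"{label}: flags = {header_flags(data)}")
```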

rhpvorderman mannequin mentioned this pull request Apr 10, 2022

Gzip header corruption not properly checked. #89672

Open

ezio-melotti removed the CLA signed label Jul 13, 2022

rhpvorderman mentioned this pull request Oct 17, 2022

Give python-isal a mention in the zlib/gzip documentation #98347

Open

Merge branch 'main' into bpo-45509

2dfc96c

serhiy-storchaka self-requested a review January 30, 2023 15:24

serhiy-storchaka added the type-feature label Jan 30, 2023
