Gzip header corruption not properly checked. #89672

Open
rhpvorderman mannequin opened this issue Oct 18, 2021 · 7 comments
Labels
3.11 (only security fixes) · stdlib (Python modules in the Lib dir) · type-bug (An unexpected behavior, bug, or error)

Comments


rhpvorderman mannequin commented Oct 18, 2021

BPO 45509
Nosy @serhiy-storchaka, @rhpvorderman
PRs
  • bpo-45509: Check gzip headers for corrupted fields #29028
Files
  • benchmark_gzip_read_header.py
  • benchmark_gzip_read_header.py
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2021-10-18.10:41:07.747>
    labels = ['type-bug', 'library', '3.11']
    title = 'Gzip header corruption not properly checked.'
    updated_at = <Date 2021-11-24.11:10:50.781>
    user = 'https://github.com/rhpvorderman'

    bugs.python.org fields:

    activity = <Date 2021-11-24.11:10:50.781>
    actor = 'rhpvorderman'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2021-10-18.10:41:07.747>
    creator = 'rhpvorderman'
    dependencies = []
    files = ['50458', '50459']
    hgrepos = []
    issue_num = 45509
    keywords = ['patch']
    message_count = 7.0
    messages = ['404174', '405783', '406747', '406753', '406758', '406815', '406918']
    nosy_count = 2.0
    nosy_names = ['serhiy.storchaka', 'rhpvorderman']
    pr_nums = ['29028']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue45509'
    versions = ['Python 3.11']


    rhpvorderman mannequin commented Oct 18, 2021

    The following corrupt headers are currently accepted without error:

    • Headers with the FCOMMENT flag set, but with incomplete or missing COMMENT bytes.
    • Headers with the FNAME flag set, but with incomplete or missing NAME bytes.
    • Headers with the FHCRC flag set: the CRC is read, but never checked.
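
    None of these cases raises an error on an unpatched interpreter. A minimal sketch reproducing the FNAME and FHCRC bullets (not code from the issue or the PR; it relies on the private gzip._read_gzip_header helper, which the timeit commands later in this thread also use):

    import io
    import gzip
    import zlib

    # FNAME case: the flag byte (0x08) promises a NUL-terminated file name,
    # but the stream ends before the terminator. The unpatched parser breaks
    # out of its read loop at EOF and reports success.
    truncated = b"\x1f\x8b\x08\x08" + b"\x00\x00\x00\x00" + b"\x00\xff" + b"truncated"
    gzip._read_gzip_header(io.BytesIO(truncated))  # no error before the fix

    # FHCRC case: set the flag (0x02) on a valid no-flags stream and splice
    # in a deliberately wrong header CRC16. The unpatched code reads the two
    # bytes and discards them, so decompression still succeeds.
    data = bytearray(zlib.compress(b"hello", wbits=31))  # gzip stream, no flags
    data[3] |= 0x02                                      # claim a header CRC follows
    good_crc16 = zlib.crc32(bytes(data[:10])) & 0xFFFF
    data[10:10] = (good_crc16 ^ 0xFFFF).to_bytes(2, "little")  # guaranteed mismatch
    print(gzip.decompress(bytes(data)))                  # b'hello' before the fix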

    rhpvorderman mannequin added the 3.11 (only security fixes), stdlib (Python modules in the Lib dir), and type-bug (An unexpected behavior, bug, or error) labels on Oct 18, 2021

    rhpvorderman mannequin commented Nov 5, 2021

    Bump. This bug allows corrupted gzip files to be processed without error, so I am bumping the issue in the hope that someone will review the PR.


    rhpvorderman mannequin commented Nov 22, 2021

    Ping

    serhiy-storchaka (Member) commented Nov 22, 2021

    I think it is a good idea, although we may get reports about regressions in 3.11 when Python starts to reject GZIP files that were successfully read before.

    But before merging this I want to know:

    1. How much overhead does it add for reading files with the FHCRC flag clear?
    2. How much overhead does it add for reading files with the FHCRC flag set?
    3. Do GZIP files created by other tools set the FHCRC flag by default?


    rhpvorderman mannequin commented Nov 22, 2021

    1. Quite a lot. I tested it for the two most common use cases:
    import timeit
    import statistics

    WITH_FNAME = """
    from gzip import GzipFile, decompress
    import io
    fileobj = io.BytesIO()
    g = GzipFile(fileobj=fileobj, mode='wb', filename='compressable_file')
    g.write(b'')
    g.close()
    data = fileobj.getvalue()
    """
    WITH_NO_FLAGS = """
    from gzip import decompress
    import zlib
    data = zlib.compress(b'', wbits=31)
    """

    def benchmark(name, setup, loops=10000, runs=10):
        print(f"{name}")
        results = [timeit.timeit("decompress(data)", setup, number=loops)
                   for _ in range(runs)]
        # Convert each run's total time to microseconds per call.
        results = [(result / loops) * 1_000_000 for result in results]
        print(f"average: {round(statistics.mean(results), 2)}, "
              f"range: {round(min(results), 2)}-{round(max(results), 2)} "
              f"stdev: {round(statistics.stdev(results), 2)}")


    if __name__ == "__main__":
        benchmark("with_fname", WITH_FNAME)
        benchmark("with_noflags", WITH_NO_FLAGS)

    BEFORE:

    with_fname
    average: 3.27, range: 3.21-3.36 stdev: 0.05
    with_noflags
    average: 3.24, range: 3.14-3.37 stdev: 0.07

    AFTER:
    with_fname
    average: 4.98, range: 4.85-5.14 stdev: 0.1
    with_noflags
    average: 4.87, range: 4.69-5.05 stdev: 0.1

    That is a dramatic increase in overhead. (Granted, the decompressed data is empty, but still.)

    2. I haven't tested this yet, but the regression is already quite unacceptable.

    3. Not that I know of. But if a tool does set the flag, it is safe to assume its users care about the check. Nevertheless, this is a bit of an edge case.


    rhpvorderman mannequin commented Nov 23, 2021

    I improved the performance of the patch and attached the file used for benchmarking. I also test the FHCRC changes now.

    The benchmark tests headers with different flags concatenated to a DEFLATE block with no data, followed by a gzip trailer. The data is fed to gzip.decompress. Please note that this is the *worst-case* performance overhead: when there is actual data to decompress, the relative overhead shrinks, and it shrinks further when GzipFile is used.
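
    The attached benchmark file itself is not reproduced in this thread. A rough sketch of how such a worst-case payload can be built from the description above (the helper name is an invention for illustration, not taken from the benchmark):

    import gzip
    import struct
    import zlib

    def empty_gzip_stream(header: bytes) -> bytes:
        # An empty raw DEFLATE block plus the 8-byte gzip trailer
        # (CRC32 and ISIZE of the empty payload, both little-endian).
        deflate_block = zlib.compress(b"", wbits=-15)
        trailer = struct.pack("<II", zlib.crc32(b""), 0)
        return header + deflate_block + trailer

    # 10-byte gzip header with no flags set:
    # magic, CM=8 (deflate), FLG=0, MTIME=0, XFL=0, OS=255 (unknown).
    NO_FLAGS = b"\x1f\x8b\x08\x00" + b"\x00\x00\x00\x00" + b"\x00\xff"
    assert gzip.decompress(empty_gzip_stream(NO_FLAGS)) == b""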

    BEFORE (Current main branch):
    $ ./python benchmark_gzip_read_header.py
    with_fname
    average: 3.01, range: 2.9-4.79 stdev: 0.19
    with_noflags
    average: 2.99, range: 2.93-3.04 stdev: 0.02
    All flags (incl FHCRC)
    average: 3.13, range: 3.05-3.16 stdev: 0.02

    AFTER (bpo-45509 PR):
    with_fname
    average: 3.09, range: 3.01-4.63 stdev: 0.16
    with_noflags
    average: 3.1, range: 3.03-3.38 stdev: 0.04
    All flags (incl FHCRC)
    average: 4.09, range: 4.05-4.49 stdev: 0.04

    An increase of 0.1 microseconds in the most common use cases, roughly 3%. In exchange, the FNAME field is now correctly checked for truncation.

    With FHCRC set, the overhead increases by 33%, but that is worth it: the header is now actually checked, as it should be.
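
    For reference, RFC 1952 (section 2.3.1) defines FHCRC as the two least-significant bytes of the CRC32 of all header bytes up to, but not including, the CRC16 field itself. A check along these lines (a sketch of the idea, not the PR's actual code):

    import zlib

    def fhcrc_ok(header_bytes: bytes, stored_crc16: int) -> bool:
        # CRC16 is the least-significant 16 bits of the CRC32 of
        # everything in the header that precedes the CRC16 field.
        return (zlib.crc32(header_bytes) & 0xFFFF) == stored_crc16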


    rhpvorderman mannequin commented Nov 24, 2021

    I have found that using the timeit module provides more precise measurements:

    For a simple gzip header (as returned by gzip.compress or zlib.compress with wbits=31):
    ./python -m timeit -s "import io; data = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'; from gzip import _read_gzip_header" '_read_gzip_header(io.BytesIO(data))'

    For a gzip header with FNAME (as written by the gzip tool itself and by Python's GzipFile):
    ./python -m timeit -s "import io; data = b'\x1f\x8b\x08\x08j\x1a\x9ea\x02\xffcompressable_file\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'; from gzip import _read_gzip_header" '_read_gzip_header(io.BytesIO(data))'

    For a gzip header with all flags set:
    ./python -m timeit -s 'import gzip, io; data = b"\x1f\x8b\x08\x1f\x00\x00\x00\x00\x00\xff\x05\x00extraname\x00comment\x00\xe9T"; from gzip import _read_gzip_header' '_read_gzip_header(io.BytesIO(data))'
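
    For readability, a byte-by-byte reading of that all-flags header, per RFC 1952 (annotation added here, not part of the original comment):

    \x1f\x8b          magic bytes
    \x08              CM: deflate
    \x1f              FLG: FTEXT | FHCRC | FEXTRA | FNAME | FCOMMENT
    \x00\x00\x00\x00  MTIME: 0
    \x00              XFL
    \xff              OS: unknown
    \x05\x00extra     FEXTRA: XLEN = 5, followed by 5 extra bytes
    name\x00          FNAME, NUL-terminated
    comment\x00       FCOMMENT, NUL-terminated
    \xe9T             CRC16 over all preceding header bytes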

    Since performance is most critical for in-memory compression and decompression, I have now optimized for the no-flags case.
    Before (current main): 500000 loops, best of 5: 469 nsec per loop
    After (PR): 1000000 loops, best of 5: 390 nsec per loop

    For the most common case of only FNAME set:
    before: 200000 loops, best of 5: 1.48 usec per loop
    after: 200000 loops, best of 5: 1.45 usec per loop

    For the case where FHCRC is set:
    before: 200000 loops, best of 5: 1.62 usec per loop
    after: 100000 loops, best of 5: 2.43 usec per loop

    So this PR is now a clear win for decompressing anything that was compressed with gzip.compress. It is neutral for normal file decompression. There is a performance cost associated with correctly checking the header, but that is expected; it is better than the alternative of not checking it.
