Conversation

JacobSMoller
Contributor

Rationale for this change

The SnappyCodec.decompress() method has a bug where the CRC32 checksum is extracted from the compressed data after the data has already been truncated to remove the checksum. The wrong 4 bytes are read for checksum validation, so the CRC32 check fails even when the data and checksum are valid.

Root Cause:
In the current implementation:

  1. data = data[0:-4] removes the last 4 bytes (checksum) from the data
  2. checksum = data[-4:] then tries to get the checksum from the already-truncated data
  3. This means checksum contains the wrong bytes (last 4 bytes of compressed data instead of the actual checksum)
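
The ordering mistake can be seen with plain bytes, using stand-in values for the compressed payload and the CRC (no snappy needed):

```python
payload = b"compressed!!"       # stand-in for the snappy-compressed bytes
crc = b"\xde\xad\xbe\xef"       # stand-in for the trailing 4-byte CRC32
block = payload + crc

# Buggy order: truncate first, then slice "the checksum"
data = block[0:-4]              # this already drops the real checksum
wrong = data[-4:]               # b"ed!!" -- payload bytes, not the CRC

# Fixed order: slice the checksum first, then truncate
checksum = block[-4:]
data = block[0:-4]

print(wrong == crc)             # False
print(checksum == crc)          # True
```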

Solution:
Extract the checksum before truncating the data:

checksum = data[-4:]  # store checksum before truncating data
data = data[0:-4]     # remove checksum from the data

This ensures data integrity checks work correctly for snappy-compressed Avro data.
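
Put together, the corrected decompress path looks roughly like the following. This is a sketch, not PyIceberg's actual SnappyCodec source; the decompressor is injected so the example runs on the standard library (zlib) instead of requiring python-snappy:

```python
import binascii
import struct
import zlib
from typing import Callable


def checked_decompress(data: bytes, decompress: Callable[[bytes], bytes]) -> bytes:
    """Decompress a block laid out as <compressed payload><4-byte big-endian CRC32>.

    The CRC32 covers the uncompressed data. `checked_decompress` is
    illustrative only; `decompress` stands in for snappy.decompress.
    """
    checksum = data[-4:]   # read the checksum BEFORE truncating
    data = data[0:-4]      # then strip it from the compressed payload
    uncompressed = decompress(data)
    stored = struct.unpack(">I", checksum)[0]
    computed = binascii.crc32(uncompressed) & 0xFFFFFFFF
    if computed != stored:
        raise ValueError("Checksum failure")
    return uncompressed


# Round trip, zlib standing in for snappy (path below is made up)
original = b'{"manifest_path": "s3://bucket/m.avro"}'
block = zlib.compress(original) + struct.pack(">I", binascii.crc32(original) & 0xFFFFFFFF)
print(checked_decompress(block, zlib.decompress) == original)  # True
```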

Are these changes tested?

The fix resolves the logical error in the checksum extraction order. Existing tests should pass, and any snappy-compressed data with valid checksums will now decompress successfully instead of failing with "Checksum failure" errors.

The change is minimal and only reorders two existing lines of code, making it low-risk.
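
A regression-style check for the reordering can be written with only the standard library (zlib stands in for snappy here; SnappyCodec itself is not imported):

```python
import binascii
import struct
import zlib

original = b"snappy-compressed avro data"
crc = struct.pack(">I", binascii.crc32(original) & 0xFFFFFFFF)
block = zlib.compress(original) + crc

# Fixed order: read the checksum before truncating
assert block[-4:] == crc

# Buggy order: "checksum" is sliced from the already-truncated payload,
# so it picks up trailing compressor bytes instead of the CRC
assert block[0:-4][-4:] != crc
print("reordering check passed")
```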

Are there any user-facing changes?

Yes - This is a bug fix that improves functionality:

  • Before: Snappy-compressed Avro data would fail to decompress with "Checksum failure" errors even when the data and checksum were valid
  • After: Snappy-compressed Avro data with valid checksums will decompress correctly

This fix resolves data integrity validation issues for users working with snappy-compressed Avro files. No API changes are introduced.

@JacobSMoller
Contributor Author

A small script that simulates the issue:

#!/usr/bin/env python3
"""
Simple reproduction of PyIceberg Snappy checksum bug
"""

import struct
import binascii
import snappy

def create_pyiceberg_snappy_block():
    """
    Mimic PyIceberg's Snappy compression format with added checksum
    """
    print("Creating PyIceberg-format Snappy block...")
    
    # Original data that represents an Iceberg manifest entry
    original_data = b'{"manifest_path": "gs://bucket/manifest.avro", "manifest_length": 1234, "added_files_count": 5}'
    
    # Compress with Snappy
    compressed = snappy.compress(original_data)
    
    # Calculate CRC32 checksum of the ORIGINAL data (this is what PyIceberg does)
    crc32 = binascii.crc32(original_data) & 0xFFFFFFFF
    crc32_bytes = struct.pack(">I", crc32)

    print(f"Original data checksum: {crc32:08x}")
    
    # Create the block: compressed data + checksum
    snappy_block = compressed + crc32_bytes
    
    return snappy_block

def demonstrate_pyiceberg_bug(data):
    """
    Issue with extracting the checksum after truncating the data
    """
    print("Demonstrating the PyIceberg Snappy checksum bug")
    print("=" * 70)
    # Step 1: Truncate BEFORE getting checksum (BUG!)
    data = data[0:-4]
    
    # Step 2: Decompress
    uncompressed = snappy.decompress(data)
    print(f"Uncompressed: {uncompressed}")
    
    # Step 3: Get checksum from truncated data (BUG!)
    checksum = data[-4:]
    print(f"Checksum: {checksum.hex()} ({checksum})")
    
    # Step 4: CRC32 validation fails
    computed_crc = binascii.crc32(uncompressed) & 0xFFFFFFFF
    buggy_stored_crc = struct.unpack(">I", checksum)[0]
    print(f"Computed crc32: {computed_crc:08x}")
    print(f"Stored crc32: {buggy_stored_crc:08x}")
    print(f"Match: {computed_crc == buggy_stored_crc}")

def demonstrate_pyiceberg_fix(data):
    """
    Demonstrate the fix for the PyIceberg Snappy checksum bug
    """
    print()
    print("Demonstrating the fix for the PyIceberg Snappy checksum bug")
    print("=" * 70)
    checksum = data[-4:]
    data = data[0:-4]
    uncompressed = snappy.decompress(data)
    computed_crc = binascii.crc32(uncompressed) & 0xFFFFFFFF
    stored_crc = struct.unpack(">I", checksum)[0]
    print(f"Checksum: {checksum.hex()} ({checksum})")
    print(f"Uncompressed: {uncompressed}")
    print(f"Computed crc32: {computed_crc:08x}")
    print(f"Stored crc32: {stored_crc:08x}")
    print(f"Match: {computed_crc == stored_crc}")

def main():
    # Create a realistic Snappy block
    snappy_block = create_pyiceberg_snappy_block()
    
    # Demonstrate the bug and the fix
    demonstrate_pyiceberg_bug(snappy_block)
    demonstrate_pyiceberg_fix(snappy_block)

if __name__ == "__main__":
    main()

Contributor

@kevinjqliu kevinjqliu left a comment

LGTM! Thanks for the repro as well

@kevinjqliu kevinjqliu merged commit 661d75a into apache:main Jul 29, 2025
10 checks passed
gabeiglio pushed a commit to Netflix/iceberg-python that referenced this pull request Aug 13, 2025