Conversation

JacobSMoller
Contributor

Rationale for this change

The SnappyCodec.decompress() method has a bug where the CRC32 checksum is extracted from the compressed data after the data has already been truncated to remove the checksum. The wrong 4 bytes are read for checksum validation, so the CRC32 check fails even when the data and checksum are valid.

Root Cause:
In the current implementation:

  1. data = data[0:-4] removes the last 4 bytes (checksum) from the data
  2. checksum = data[-4:] then tries to get the checksum from the already-truncated data
  3. This means checksum contains the wrong bytes (last 4 bytes of compressed data instead of the actual checksum)
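
The ordering mistake can be seen with plain bytes, using stand-in values for the compressed payload and the CRC (no snappy needed):

```python
payload = b"compressed!!"       # stand-in for the snappy-compressed bytes
crc = b"\xde\xad\xbe\xef"       # stand-in for the trailing 4-byte CRC32
block = payload + crc

# Buggy order: truncate first, then slice "the checksum"
data = block[0:-4]              # this already drops the real checksum
wrong = data[-4:]               # b"ed!!" -- payload bytes, not the CRC

# Fixed order: slice the checksum first, then truncate
checksum = block[-4:]
data = block[0:-4]

print(wrong == crc)             # False
print(checksum == crc)          # True
```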

Solution:
Extract the checksum before truncating the data:

checksum = data[-4:]  # store checksum before truncating data
data = data[0:-4]     # remove checksum from the data

This ensures data integrity checks work correctly for snappy-compressed Avro data.
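
Put together, the corrected decompress path looks roughly like the following. This is a sketch, not PyIceberg's actual SnappyCodec source; the decompressor is injected so the example runs on the standard library (zlib) instead of requiring python-snappy:

```python
import binascii
import struct
import zlib
from typing import Callable


def checked_decompress(data: bytes, decompress: Callable[[bytes], bytes]) -> bytes:
    """Decompress a block laid out as <compressed payload><4-byte big-endian CRC32>.

    The CRC32 covers the uncompressed data. `checked_decompress` is
    illustrative only; `decompress` stands in for snappy.decompress.
    """
    checksum = data[-4:]   # read the checksum BEFORE truncating
    data = data[0:-4]      # then strip it from the compressed payload
    uncompressed = decompress(data)
    stored = struct.unpack(">I", checksum)[0]
    computed = binascii.crc32(uncompressed) & 0xFFFFFFFF
    if computed != stored:
        raise ValueError("Checksum failure")
    return uncompressed


# Round trip, zlib standing in for snappy (path below is made up)
original = b'{"manifest_path": "s3://bucket/m.avro"}'
block = zlib.compress(original) + struct.pack(">I", binascii.crc32(original) & 0xFFFFFFFF)
print(checked_decompress(block, zlib.decompress) == original)  # True
```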

Are these changes tested?

The fix resolves the logical error in the checksum extraction order. Existing tests should pass, and any snappy-compressed data with valid checksums will now decompress successfully instead of failing with "Checksum failure" errors.

The change is minimal and only reorders two existing lines of code, making it low-risk.
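
A regression-style check for the reordering can be written with only the standard library (zlib stands in for snappy here; SnappyCodec itself is not imported):

```python
import binascii
import struct
import zlib

original = b"snappy-compressed avro data"
crc = struct.pack(">I", binascii.crc32(original) & 0xFFFFFFFF)
block = zlib.compress(original) + crc

# Fixed order: read the checksum before truncating
assert block[-4:] == crc

# Buggy order: "checksum" is sliced from the already-truncated payload,
# so it picks up trailing compressor bytes instead of the CRC
assert block[0:-4][-4:] != crc
print("reordering check passed")
```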

Are there any user-facing changes?

Yes - This is a bug fix that improves functionality:

  • Before: Snappy-compressed Avro data would fail to decompress with "Checksum failure" errors even when the data and checksum were valid
  • After: Snappy-compressed Avro data with valid checksums will decompress correctly

This fix resolves data integrity validation issues for users working with snappy-compressed Avro files. No API changes are introduced.

@JacobSMoller
Contributor Author

A small script that simulates the issue:

#!/usr/bin/env python3
"""
Simple reproduction of PyIceberg Snappy checksum bug
"""

import struct
import binascii
import snappy

def create_pyiceberg_snappy_block():
    """
    Mimic PyIceberg's Snappy compression format with added checksum
    """
    print("Creating PyIceberg-format Snappy block...")
    
    # Original data that represents an Iceberg manifest entry
    original_data = b'{"manifest_path": "gs://bucket/manifest.avro", "manifest_length": 1234, "added_files_count": 5}'
    
    # Compress with Snappy
    compressed = snappy.compress(original_data)
    
    # Calculate CRC32 checksum of the ORIGINAL data (this is what PyIceberg does)
    crc32 = binascii.crc32(original_data) & 0xFFFFFFFF
    crc32_bytes = struct.pack(">I", crc32)

    print(f"Original data checksum: {crc32:08x}")
    
    # Create the block: compressed data + checksum
    snappy_block = compressed + crc32_bytes
    
    return snappy_block

def demonstrate_pyiceberg_bug(data):
    """
    Issue with extracting the checksum after truncating the data
    """
    print("Demonstrating the PyIceberg Snappy checksum bug")
    print("=" * 70)
    # Step 1: Truncate BEFORE getting checksum (BUG!)
    data = data[0:-4]
    
    # Step 2: Decompress
    uncompressed = snappy.decompress(data)
    print(f"Uncompressed: {uncompressed}")
    
    # Step 3: Get checksum from truncated data (BUG!)
    checksum = data[-4:]
    print(f"Checksum: {checksum.hex()} ({checksum})")
    
    # Step 4: CRC32 validation fails
    computed_crc = binascii.crc32(uncompressed) & 0xFFFFFFFF
    buggy_stored_crc = struct.unpack(">I", checksum)[0]
    print(f"Computed crc32: {computed_crc:08x}")
    print(f"Stored crc32: {buggy_stored_crc:08x}")
    print(f"Match: {computed_crc == buggy_stored_crc}")

def demonstrate_pyiceberg_fix(data):
    """
    Demonstrate the fix for the PyIceberg Snappy checksum bug
    """
    print()
    print("Demonstrating the fix for the PyIceberg Snappy checksum bug")
    print("=" * 70)
    checksum = data[-4:]
    data = data[0:-4]
    uncompressed = snappy.decompress(data)
    computed_crc = binascii.crc32(uncompressed) & 0xFFFFFFFF
    stored_crc = struct.unpack(">I", checksum)[0]
    print(f"Checksum: {checksum.hex()} ({checksum})")
    print(f"Uncompressed: {uncompressed}")
    print(f"Computed crc32: {computed_crc:08x}")
    print(f"Stored crc32: {stored_crc:08x}")
    print(f"Match: {computed_crc == stored_crc}")

def main():
    # Create a realistic Snappy block
    snappy_block = create_pyiceberg_snappy_block()
    
    # Demonstrate the bug and the fix
    demonstrate_pyiceberg_bug(snappy_block)
    demonstrate_pyiceberg_fix(snappy_block)

if __name__ == "__main__":
    main()

Contributor

@kevinjqliu kevinjqliu left a comment

LGTM! Thanks for the repro as well

@kevinjqliu kevinjqliu merged commit 661d75a into apache:main Jul 29, 2025
10 checks passed
gabeiglio pushed a commit to Netflix/iceberg-python that referenced this pull request Aug 13, 2025