-
Notifications
You must be signed in to change notification settings - Fork 356
BUG: Snappy checksum check #2252
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
kevinjqliu
merged 1 commit into
apache:main
from
JacobSMoller:bug-snappy-codec-checksum
Jul 29, 2025
Merged
BUG: Snappy checksum check #2252
kevinjqliu
merged 1 commit into
apache:main
from
JacobSMoller:bug-snappy-codec-checksum
Jul 29, 2025
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Small script for simulating the issue #!/usr/bin/env python3
"""
Simple reproduction of PyIceberg Snappy checksum bug
"""
import struct
import binascii
import snappy
def create_pyiceberg_snappy_block():
"""
Mimic PyIceberg's Snappy compression format with added checksum
"""
print("Creating PyIceberg-format Snappy block...")
# Original data that represents an Iceberg manifest entry
original_data = b'{"manifest_path": "gs://bucket/manifest.avro", "manifest_length": 1234, "added_files_count": 5}'
# Compress with Snappy
compressed = snappy.compress(original_data)
# Calculate CRC32 checksum of the ORIGINAL data (this is what PyIceberg does)
crc32 = binascii.crc32(original_data) & 0xFFFFFFFF
crc32_bytes = struct.pack(">I", crc32)
print(f"Orginal data checksum: {crc32:08x}")
# Create the block: compressed data + checksum
snappy_block = compressed + crc32_bytes
return snappy_block
def demonstrate_pyiceberg_bug(data):
"""
Issue with extracting the checksum after truncating the data
"""
print("Demonstrating the fix for the PyIceberg Snappy checksum bug")
print("=" * 70)
# Step 1: Truncate BEFORE getting checksum (BUG!)
data = data[0:-4]
# Step 2: Decompress
uncompressed = snappy.decompress(data)
print(f"Uncompressed: {uncompressed}")
# Step 3: Get checksum from truncated data (BUG!)
checksum = data[-4:]
print(f"Checksum: {checksum.hex()} ({checksum})")
# Step 4: CRC32 validation fails
computed_crc = binascii.crc32(uncompressed) & 0xFFFFFFFF
buggy_stored_crc = struct.unpack(">I", checksum)[0]
print(f"Computed crc32: {binascii.crc32(uncompressed) & 0xFFFFFFFF:08x}")
print(f"Stored crc32: {struct.unpack(">I", checksum)[0]:08x}")
print(f"Match: {computed_crc == buggy_stored_crc}")
def demonstrate_pyiceberg_fix(data):
"""
Demonstrate the fix for the PyIceberg Snappy checksum bug
"""
print()
print("Demonstrating the fix for the PyIceberg Snappy checksum bug")
print("=" * 70)
checksum = data[-4:]
data = data[0:-4]
uncompressed = snappy.decompress(data)
computed_crc = binascii.crc32(uncompressed) & 0xFFFFFFFF
stored_crc = struct.unpack(">I", checksum)[0]
print(f"Checksum: {checksum.hex()} ({checksum})")
print(f"Uncompressed: {uncompressed}")
print(f"Computed crc32: {computed_crc:08x}")
print(f"Stored crc32: {stored_crc:08x}")
print(f"Match: {computed_crc == stored_crc}")
def main():
# Create a realistic Snappy block
snappy_block = create_pyiceberg_snappy_block()
# Demonstrate the bug and the fix
demonstrate_pyiceberg_bug(snappy_block)
demonstrate_pyiceberg_fix(snappy_block)
if __name__ == "__main__":
main() |
kevinjqliu
approved these changes
Jul 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for the repro as well
gabeiglio
pushed a commit
to Netflix/iceberg-python
that referenced
this pull request
Aug 13, 2025
<!-- Thanks for opening a pull request! --> # Rationale for this change The `SnappyCodec.decompress()` method has a bug where the CRC32 checksum is extracted from the compressed data **after** the data has already been truncated to remove the checksum. This results in reading the wrong 4 bytes for checksum validation, causing the CRC32 check to fail incorrectly. **Root Cause:** In the current implementation: 1. `data = data[0:-4]` removes the last 4 bytes (checksum) from the data 2. `checksum = data[-4:]` then tries to get the checksum from the already-truncated data 3. This means `checksum` contains the wrong bytes (last 4 bytes of compressed data instead of the actual checksum) **Solution:** Extract the checksum **before** truncating the data: ```python checksum = data[-4:] # store checksum before truncating data data = data[0:-4] # remove checksum from the data ``` This ensures data integrity checks work correctly for snappy-compressed Avro data. # Are these changes tested? The fix resolves the logical error in the checksum extraction order. Existing tests should pass, and any snappy-compressed data with valid checksums will now decompress successfully instead of failing with "Checksum failure" errors. The change is minimal and only reorders two existing lines of code, making it low-risk. # Are there any user-facing changes? **Yes** - This is a bug fix that improves functionality: - **Before:** Snappy-compressed Avro data would fail to decompress with "Checksum failure" errors even when the data and checksum were valid - **After:** Snappy-compressed Avro data with valid checksums will decompress correctly This fix resolves data integrity validation issues for users working with snappy-compressed Avro files. No API changes are introduced.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Rationale for this change
The
SnappyCodec.decompress()
method has a bug where the CRC32 checksum is extracted from the compressed data after the data has already been truncated to remove the checksum. This results in reading the wrong 4 bytes for checksum validation, causing the CRC32 check to fail incorrectly.Root Cause:
In the current implementation:
data = data[0:-4]
removes the last 4 bytes (checksum) from the datachecksum = data[-4:]
then tries to get the checksum from the already-truncated datachecksum
contains the wrong bytes (last 4 bytes of compressed data instead of the actual checksum)Solution:
Extract the checksum before truncating the data:
This ensures data integrity checks work correctly for snappy-compressed Avro data.
Are these changes tested?
The fix resolves the logical error in the checksum extraction order. Existing tests should pass, and any snappy-compressed data with valid checksums will now decompress successfully instead of failing with "Checksum failure" errors.
The change is minimal and only reorders two existing lines of code, making it low-risk.
Are there any user-facing changes?
Yes - This is a bug fix that improves functionality:
This fix resolves data integrity validation issues for users working with snappy-compressed Avro files. No API changes are introduced.