ForeverAngry
Contributor

@ForeverAngry ForeverAngry commented Sep 4, 2025

Closes #2409, and partially closes #2427

Rationale for this change

This PR fixes a critical thread safety issue in the ExpireSnapshots class where concurrent snapshot expiration operations on different tables would share snapshot IDs, causing operations to fail with "snapshot does not exist" errors.

Root Cause:
The ExpireSnapshots class had class-level attributes (_snapshot_ids_to_expire, _updates, _requirements) that were shared across all instances. When multiple threads created different ExpireSnapshots instances, they all shared the same underlying set() object for tracking snapshot IDs.

Impact:

  • Thread 1: table1.expire_snapshots().by_id(1001) adds 1001 to shared set
  • Thread 2: table2.expire_snapshots().by_id(2001) adds 2001 to same shared set
  • Result: Both instances see {1001, 2001}, so the operation on table2 also tries to expire snapshot 1001, which does not exist in that table, and the commit fails
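
The sharing described above can be reproduced without Iceberg at all. The following is a minimal sketch, assuming a hypothetical `SharedExpire` class that only mimics the pre-fix attribute layout; it is not the real `ExpireSnapshots` implementation.

```python
from typing import Set


class SharedExpire:
    """Hypothetical stand-in that mimics the pre-fix attribute layout."""

    # Class-level mutable default: a single set object shared by every instance.
    _snapshot_ids_to_expire: Set[int] = set()

    def by_id(self, snapshot_id: int) -> "SharedExpire":
        # .add() mutates the shared class attribute in place.
        self._snapshot_ids_to_expire.add(snapshot_id)
        return self


table1_op = SharedExpire().by_id(1001)
table2_op = SharedExpire().by_id(2001)

# Both instances resolve the attribute to the same set object.
print(table1_op._snapshot_ids_to_expire is table2_op._snapshot_ids_to_expire)  # True
print(sorted(table2_op._snapshot_ids_to_expire))  # [1001, 2001]
```

Note that no threads are needed to trigger the contamination; concurrency only makes it more likely that two operations are in flight at once.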

Solution:
Moved the shared class-level attributes to instance-level attributes in the __init__ method, ensuring each ExpireSnapshots instance has its own isolated state.
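
As a rough illustration of the fix, here is the same toy pattern with the state created in `__init__`. `IsolatedExpire` is a hypothetical name used only for this sketch, not the actual class in the codebase.

```python
from typing import Set


class IsolatedExpire:
    """Hypothetical stand-in that mimics the post-fix attribute layout."""

    def __init__(self) -> None:
        # A fresh set is created for every instance, so nothing is shared.
        self._snapshot_ids_to_expire: Set[int] = set()

    def by_id(self, snapshot_id: int) -> "IsolatedExpire":
        self._snapshot_ids_to_expire.add(snapshot_id)
        return self


table1_op = IsolatedExpire().by_id(1001)
table2_op = IsolatedExpire().by_id(2001)

print(table1_op._snapshot_ids_to_expire)  # {1001}
print(table2_op._snapshot_ids_to_expire)  # {2001}
```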

Are these changes tested?

📢 🔥 Big shout-out to @QlikFrederic: the testing methodology here was largely derived from their testing and analysis! 🔥 📢

Yes, comprehensive test coverage has been added:

  • test_thread_safety_fix() - Verifies that different ExpireSnapshots instances have separate snapshot sets
  • test_concurrent_operations() - Tests that concurrent operations don't contaminate each other
  • test_concurrent_different_tables_expiration() - Reproduces the exact scenario from GitHub issue #2409 ("commit on expire_snapshot tries to remove snapshot from wrong table")
  • test_concurrent_same_table_different_snapshots() - Tests concurrent operations on the same table
  • test_cross_table_snapshot_id_isolation() - Validates no cross-contamination of snapshot IDs between tables
  • test_batch_expire_snapshots() - Tests batch expiration operations in threaded environments
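
For readers who want a feel for the shape of such a test, below is a hedged sketch of a concurrency isolation check. It uses a hypothetical `FakeExpireSnapshots` test double and a hypothetical `run_concurrently` helper rather than real tables, so it shows the general approach only and is not the test code added in this PR.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List, Set


class FakeExpireSnapshots:
    """Hypothetical test double with instance-level state."""

    def __init__(self) -> None:
        self._snapshot_ids_to_expire: Set[int] = set()

    def by_id(self, snapshot_id: int) -> "FakeExpireSnapshots":
        self._snapshot_ids_to_expire.add(snapshot_id)
        return self


def run_concurrently(snapshot_ids: List[int]) -> List[Set[int]]:
    # The barrier lines up all workers so their by_id() calls overlap in time.
    barrier = threading.Barrier(len(snapshot_ids))

    def worker(snapshot_id: int) -> Set[int]:
        op = FakeExpireSnapshots()
        barrier.wait(timeout=5)
        return op.by_id(snapshot_id)._snapshot_ids_to_expire

    with ThreadPoolExecutor(max_workers=len(snapshot_ids)) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(worker, snapshot_ids))


def test_concurrent_operations_stay_isolated() -> None:
    ids = [1001, 2001, 3001, 4001]
    # Each operation should hold only the ID it was given.
    assert run_concurrently(ids) == [{i} for i in ids]


test_concurrent_operations_stay_isolated()
```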

All existing tests continue to pass, ensuring no regression in functionality.

Are there any user-facing changes?

No breaking changes. The public API remains identical:

  • All existing ExpireSnapshots methods work the same way
  • Method signatures are unchanged
  • Behavior is identical except for the thread safety fix

Behavioral improvement:

  • Concurrent expire_snapshots() operations on different tables now work correctly
  • No more "snapshot does not exist" errors when using ExpireSnapshots in multi-threaded environments

This is a pure bug fix with no user-facing API changes.

@QlikFrederic

Tried this change out in code where we are expiring snapshots from 2 iceberg tables in separate threads and all is working fine now. 👍

@ForeverAngry
Contributor Author

> Tried this change out in code where we are expiring snapshots from 2 iceberg tables in separate threads and all is working fine now. 👍

Thanks for testing it!!! Let me know if you bump into any other issues.

Comment on lines 927 to 932
-    _snapshot_ids_to_expire: Set[int] = set()
-    _updates: Tuple[TableUpdate, ...] = ()
-    _requirements: Tuple[TableRequirement, ...] = ()
+    def __init__(self, transaction: Transaction) -> None:
+        super().__init__(transaction)
+        # Initialize instance-level attributes to avoid sharing state between instances
+        self._snapshot_ids_to_expire: Set[int] = set()
+        self._updates: Tuple[TableUpdate, ...] = ()
+        self._requirements: Tuple[TableRequirement, ...] = ()
Contributor

@smaheshwar-pltr smaheshwar-pltr Sep 7, 2025

Good catch!

Nit: I'd personally keep class-level annotations here (with assignment in the constructor, so state still shouldn't be shared), so the code would look similar to what we have for Transaction:

class Transaction:
    _table: Table
    _autocommit: bool
    _updates: Tuple[TableUpdate, ...]
    _requirements: Tuple[TableRequirement, ...]

    def __init__(self, table: Table, autocommit: bool = False):
        """Open a transaction to stage and commit changes to a table.

        Args:
            table: The table that will be altered.
            autocommit: Option to automatically commit the changes when they are staged.
        """
        self._table = table
        self._autocommit = autocommit
        self._updates = ()
        self._requirements = ()
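
To illustrate why this style is safe: a bare class-level annotation (no `=`) is only recorded in `__annotations__` and does not create a class attribute, so there is no object to share. A small sketch, using a hypothetical `AnnotatedOnly` class with simplified types:

```python
from typing import Set, Tuple


class AnnotatedOnly:
    """Hypothetical class following the annotate-then-assign style."""

    # Bare annotations: type information only, no class attribute is created.
    _snapshot_ids_to_expire: Set[int]
    _updates: Tuple[str, ...]

    def __init__(self) -> None:
        # The actual objects are created per instance.
        self._snapshot_ids_to_expire = set()
        self._updates = ()


a, b = AnnotatedOnly(), AnnotatedOnly()
a._snapshot_ids_to_expire.add(1001)

print(hasattr(AnnotatedOnly, "_snapshot_ids_to_expire"))  # False
print("_snapshot_ids_to_expire" in AnnotatedOnly.__annotations__)  # True
print(b._snapshot_ids_to_expire)  # set()
```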

Contributor

💯

Contributor Author

Understood!

Contributor Author

@smaheshwar-pltr I applied the changes.

Development

Successfully merging this pull request may close these issues.

  • check for class with mutable state as class attributes
  • commit on expire_snapshot tries to remove snapshot from wrong table.
4 participants