-
Notifications
You must be signed in to change notification settings - Fork 356
Expire snapshot thread safety issue 2409 #2430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Expire snapshot thread safety issue 2409 #2430
Conversation
Tried this change out in code where we are expiring snapshots from 2 iceberg tables in separate threads and all is working fine now. 👍 |
Thanks for testing it!!! Let me know if you bump into any other issues. |
pyiceberg/table/update/snapshot.py
Outdated
_snapshot_ids_to_expire: Set[int] = set() | ||
_updates: Tuple[TableUpdate, ...] = () | ||
_requirements: Tuple[TableRequirement, ...] = () | ||
def __init__(self, transaction: Transaction) -> None: | ||
super().__init__(transaction) | ||
# Initialize instance-level attributes to avoid sharing state between instances | ||
self._snapshot_ids_to_expire: Set[int] = set() | ||
self._updates: Tuple[TableUpdate, ...] = () | ||
self._requirements: Tuple[TableRequirement, ...] = () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
Nit: I'd personally keep class-level annotations here (with assignment in the constructor, so state still shouldn't be shared), so the code would look similar to what we have for Transaction
:
iceberg-python/pyiceberg/table/__init__.py
Lines 247 to 263 in 52d810e
class Transaction: | |
_table: Table | |
_autocommit: bool | |
_updates: Tuple[TableUpdate, ...] | |
_requirements: Tuple[TableRequirement, ...] | |
def __init__(self, table: Table, autocommit: bool = False): | |
"""Open a transaction to stage and commit changes to a table. | |
Args: | |
table: The table that will be altered. | |
autocommit: Option to automatically commit the changes when they are staged. | |
""" | |
self._table = table | |
self._autocommit = autocommit | |
self._updates = () | |
self._requirements = () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@smaheshwar-pltr i applied the changes
Closes #2409, and partially closes #2427
Rationale for this change
This PR fixes a critical thread safety issue in the
ExpireSnapshots
class where concurrent snapshot expiration operations on different tables would share snapshot IDs, causing operations to fail with "snapshot does not exist" errors.Root Cause:
The
ExpireSnapshots
class had class-level attributes (_snapshot_ids_to_expire
,_updates
,_requirements
) that were shared across all instances. When multiple threads created differentExpireSnapshots
instances, they all shared the same underlyingset()
object for tracking snapshot IDs.Impact:
table1.expire_snapshots().by_id(1001)
adds1001
to shared settable2.expire_snapshots().by_id(2001)
adds2001
to same shared set{1001, 2001}
and try to expire snapshot1001
fromtable2
, causing failureSolution:
Moved the shared class-level attributes to instance-level attributes in the
__init__
method, ensuring eachExpireSnapshots
instance has its own isolated state.Are these changes tested?
Yes, comprehensive test coverage has been added:
test_thread_safety_fix()
- Verifies that different ExpireSnapshots instances have separate snapshot setstest_concurrent_operations()
- Tests concurrent operations don't contaminate each othertest_concurrent_different_tables_expiration()
- Reproduces the exact scenario from GitHub issue commit on expire_snapshot tries to remove snapshot from wrong table. #2409test_concurrent_same_table_different_snapshots()
- Tests concurrent operations on the same tabletest_cross_table_snapshot_id_isolation()
- Validates no cross-contamination of snapshot IDs between tablestest_batch_expire_snapshots()
- Tests batch expiration operations in threaded environmentsAll existing tests continue to pass, ensuring no regression in functionality.
Are there any user-facing changes?
No breaking changes. The public API remains identical:
ExpireSnapshots
methods work the same wayBehavioral improvement:
expire_snapshots()
operations on different tables now work correctlyThis is a pure bug fix with no user-facing API changes.