refactor: consolidate snapshot expiration into MaintenanceTable #2143
base: main
Conversation
…h a new Expired Snapshot class. Updated tests.
ValueError: Cannot expire snapshot IDs {3051729675574597004} as they are currently referenced by table refs.
Moved expiration-related methods from `ExpireSnapshots` to `ManageSnapshots` for improved organization and clarity. Updated corresponding pytest tests to reflect these changes.
Re-ran the `poetry run pre-commit run --all-files` command on the project.
Moved the functions for expiring snapshots to their own class.
…ng it in a separate issue. Fixed unrelated changes caused by a fork/branch sync issue.
Co-authored-by: Fokko Driesprong <[email protected]>
Implemented logic to protect branch HEAD snapshots and tagged snapshots from being expired by the `expire_snapshot_by_id` method.
@Fokko @jayceslesar let me know if you two would prefer I stack this PR on #1200, or if you would rather I wait until #1200 is merged into …
Great seeing this PR @ForeverAngry, thanks again for working on this! I'm okay with first merging #1200, but we could also merge this first, and adapt the remove orphan files routine to use …
@Fokko did you decide if you wanted me to stay stacked on the delete orphans PR, or go ahead and prepare the PR for this against the main branch?
a6c3b63 to 9937894
(1) apache#2130, with the addition of the new `deduplicate_data_files` function to the `MaintenanceTable` class. (2) apache#2151, with the removal of the errant member variable from the `ManageSnapshots` class. (3) apache#2150, by adding the additional functions to be at parity with the Java API.
- **Duplicate File Remediation apache#2130**
  - Added `deduplicate_data_files` to the `MaintenanceTable` class.
  - Enables detection and removal of duplicate data files, improving table hygiene and storage efficiency.
- **Support `retainLast` and `setMinSnapshotsToKeep` Snapshot Retention Policies apache#2150**
  - Added new snapshot retention methods to `MaintenanceTable` for feature parity with the Java API:
    - `retain_last_n_snapshots(n)`: Retain only the last N snapshots.
    - `expire_snapshots_older_than_with_retention(timestamp_ms, retain_last_n=None, min_snapshots_to_keep=None)`: Expire snapshots older than a timestamp, with additional retention constraints.
    - `expire_snapshots_with_retention_policy(timestamp_ms=None, retain_last_n=None, min_snapshots_to_keep=None)`: Unified retention policy supporting time-based and count-based constraints.
  - All retention logic respects protected snapshots (branches/tags) and includes guardrails to prevent over-aggressive expiration.

### Bug Fixes & Cleanups

- **Remove unrelated instance variable from the `ManageSnapshots` class apache#2151**
  - Removed an errant member variable from the `ManageSnapshots` class, aligning the implementation with the intended design and the Java reference.

### Testing & Documentation

- Consolidated all snapshot expiration and retention tests into a single file (`test_retention_strategies.py`), covering:
  - Basic expiration by ID and timestamp.
  - Protection of branch/tag snapshots.
  - Retention guardrails and combined policies.
  - Deduplication of data files.
- Added and updated documentation to describe all new retention strategies, deduplication, and API parity improvements.
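The combined retention policies above can be sketched as pure selection logic. This is a rough, self-contained approximation based only on the PR description; `select_expirable` and its parameters are hypothetical names, not the actual `MaintenanceTable` API:

```python
from typing import List, Optional, Set, Tuple

def select_expirable(
    snapshots: List[Tuple[int, int]],      # (snapshot_id, timestamp_ms), oldest first
    protected_ids: Set[int],               # branch/tag snapshot ids, never expired
    older_than_ms: Optional[int] = None,   # time-based constraint
    retain_last_n: Optional[int] = None,   # always keep the latest N snapshots
    min_snapshots_to_keep: Optional[int] = None,
) -> List[int]:
    """Return snapshot ids eligible for expiration under the combined policy."""
    keep = set(protected_ids)
    if retain_last_n is not None:
        keep.update(sid for sid, _ in snapshots[-retain_last_n:])
    candidates = [
        sid
        for sid, ts in snapshots
        if sid not in keep and (older_than_ms is None or ts < older_than_ms)
    ]
    # Guardrail: never let the table drop below the minimum snapshot count.
    if min_snapshots_to_keep is not None:
        max_expirable = max(len(snapshots) - min_snapshots_to_keep, 0)
        candidates = candidates[:max_expirable]
    return candidates

history = [(1, 100), (2, 200), (3, 300), (4, 400)]
select_expirable(history, {4}, older_than_ms=350, retain_last_n=2)          # -> [1, 2]
select_expirable(history, {4}, older_than_ms=350, min_snapshots_to_keep=3)  # -> [1]
```

Note how the count-based guardrails only ever shrink the candidate set, so protected snapshots can never be expired regardless of the time constraint.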
…intenance operations
```python
def _get_protected_snapshot_ids(self, table_metadata: TableMetadata) -> Set[int]:
    """Get the IDs of protected snapshots.

    These are the HEAD snapshots of all branches and all tagged snapshots.
    These ids are to be excluded from expiration.

    Args:
        table_metadata: The table metadata to check for protected snapshots.

    Returns:
        Set of protected snapshot IDs to exclude from expiration.
    """
    from pyiceberg.table.refs import SnapshotRefType

    protected_ids: Set[int] = set()
    for ref in table_metadata.refs.values():
        if ref.snapshot_ref_type in [SnapshotRefType.TAG, SnapshotRefType.BRANCH]:
            protected_ids.add(ref.snapshot_id)
    return protected_ids
```
I do not know the answer to this but is this different than just the refs?
I think that's part of it, but there is a bit more validation around what is eligible to be expired. That being said, I don't think your initial intuition is wrong :), I think it all boils down to that.
@jayceslesar I went back and took a closer look at the refs, and wanted to give a slightly better response than my previous one. To me, the refs file seems like an object model and some enums. If I'm missing something, let me know! I really appreciate your responsiveness and input! 🙏 🚀
`protected_ids` is the same as `set(table.inspect.refs()["snapshot_id"].to_pylist())` is what I was trying to say.
Also the same as `{ref.snapshot_id for ref in tbl.metadata.refs.values()}`, I think.
Gotcha, I'm happy to make that change if you like! Let me know!
Yeah I think if we can rely on existing code that is good!
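To make the equivalence discussed above concrete, here is a minimal sketch with toy stand-ins for `SnapshotRef` and `table_metadata.refs` (the real classes live in `pyiceberg.table.refs`). Since every ref is either a branch or a tag, the type filter in the helper is effectively a no-op and the loop collapses to the set comprehension suggested in review:

```python
from dataclasses import dataclass
from enum import Enum

class SnapshotRefType(str, Enum):
    BRANCH = "branch"
    TAG = "tag"

@dataclass(frozen=True)
class SnapshotRef:
    snapshot_id: int
    snapshot_ref_type: SnapshotRefType

# Toy mapping mimicking table_metadata.refs (ref name -> SnapshotRef).
refs = {
    "main": SnapshotRef(3051729675574597004, SnapshotRefType.BRANCH),
    "v1": SnapshotRef(1234, SnapshotRefType.TAG),
}

# The helper's loop...
protected_ids = set()
for ref in refs.values():
    if ref.snapshot_ref_type in (SnapshotRefType.TAG, SnapshotRefType.BRANCH):
        protected_ids.add(ref.snapshot_id)

# ...produces the same set as the one-liner from the review thread.
assert protected_ids == {ref.snapshot_id for ref in refs.values()}
```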
…Table

The `deduplicate_data_files()` method was not properly removing duplicate data file references from Iceberg tables. After deduplication, multiple references to the same data file remained instead of the expected single reference.

Root causes:
1. `_get_all_datafiles()` was scanning ALL snapshots instead of the current one only
2. Incorrect transaction API usage that didn't leverage snapshot updates
3. Missing proper overwrite logic to create clean deduplicated snapshots

Key fixes:
- Modified `_get_all_datafiles()` to scan only current snapshot manifests
- Implemented proper transaction pattern using `update_snapshot().overwrite()`
- Added explicit `delete_data_file()` calls for duplicates plus `append_data_file()` for unique files
- Removed unused helper methods `_get_all_datafiles_with_context()` and `_detect_duplicates()`

Technical details:
- Deduplication now operates on ManifestEntry objects from the current snapshot only
- Files are grouped by basename and the first occurrence is kept as the canonical reference
- The new snapshot atomically replaces the current snapshot with the deduplicated file list
- Proper Iceberg transaction semantics ensure data consistency

Tests: All deduplication tests now pass, including the previously failing `test_deduplicate_data_files_removes_duplicates_in_current_snapshot`

Fixes: Table maintenance deduplication functionality
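The grouping step described in the commit message ("files are grouped by basename and the first occurrence is kept as canonical") can be illustrated in isolation. This is an illustrative sketch, not the PR's code; `dedup_by_basename` is a hypothetical helper operating on plain path strings:

```python
import os
from collections import defaultdict

def dedup_by_basename(file_paths):
    """Group paths by basename; keep the first occurrence of each name as
    canonical and report the rest as duplicates."""
    groups = defaultdict(list)
    for path in file_paths:
        groups[os.path.basename(path)].append(path)
    canonical, duplicates = [], []
    for paths_for_name in groups.values():
        canonical.append(paths_for_name[0])   # first occurrence wins
        duplicates.extend(paths_for_name[1:])  # everything else is removable
    return canonical, duplicates

paths = [
    "s3://bucket/tbl/data/a.parquet",
    "s3://bucket/tbl/data/b.parquet",
    "s3://bucket/tbl/data2/a.parquet",  # same basename, different directory
]
kept, dupes = dedup_by_basename(paths)
# kept  -> first occurrence of each basename
# dupes -> ["s3://bucket/tbl/data2/a.parquet"]
```

Note that keying on basename alone assumes duplicate files always share a file name; two genuinely different files that happen to share a basename would be treated as duplicates under this scheme.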
…ion context in MaintenanceTable
@Fokko @jayceslesar, I wasn't sure when #1200 was going to be merged, and July tends to be pretty busy for me, so I thought I would use the framework for the …
```python
def _get_all_datafiles(self) -> List[DataFile]:
    """Collect all DataFiles in the current snapshot only."""
    datafiles: List[DataFile] = []

    current_snapshot = self.tbl.current_snapshot()
    if not current_snapshot:
        return datafiles

    def process_manifest(manifest: ManifestFile) -> list[DataFile]:
        found: list[DataFile] = []
        for entry in manifest.fetch_manifest_entry(io=self.tbl.io, discard_deleted=True):
            if hasattr(entry, "data_file"):
                found.append(entry.data_file)
        return found

    # Scan only the current snapshot's manifests
    manifests = current_snapshot.manifests(io=self.tbl.io)
    with ThreadPoolExecutor() as executor:
        results = executor.map(process_manifest, manifests)
        for res in results:
            datafiles.extend(res)

    return datafiles
```
Similar to above, why can't you use `iceberg-python/pyiceberg/table/inspect.py`, line 665 in 9c99f32:

```python
def data_files(self, snapshot_id: Optional[int] = None) -> "pa.Table":
```
Hi @jayceslesar, yeah, I agree it's similar. I actually looked at using `inspect` originally, and tried to use `DataFile.from_args()` to go from the JSON object back to a `DataFile`; however, I couldn't seem to find a way to get this to work right after trying a few different approaches. This was the easiest way I could think of. If you have an idea in mind, or know what I was missing, let me know!
…nce deduplication tests
This is gonna be awesome, left a few comments.
One general question: I don't see us making use of the following table properties anywhere, and I think we should favor them when the user doesn't specify values:
history.expire.max-snapshot-age-ms
history.expire.min-snapshots-to-keep
history.expire.max-ref-age-ms
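A sketch of the fallback pattern the reviewer is suggesting: prefer a user-set table property, otherwise fall back to a default. The default values below are my reading of Iceberg's documented defaults (5 days, 1 snapshot, no ref-age limit), and the `resolve` helper is hypothetical, not pyiceberg API:

```python
# Assumed defaults, mirroring the documented Iceberg table properties.
DEFAULTS = {
    "history.expire.max-snapshot-age-ms": 5 * 24 * 60 * 60 * 1000,  # 5 days
    "history.expire.min-snapshots-to-keep": 1,
    "history.expire.max-ref-age-ms": None,  # effectively "forever"
}

def resolve(properties: dict, key: str, caster=int):
    """Prefer a user-set table property; otherwise use the documented default."""
    raw = properties.get(key)
    if raw is None:
        return DEFAULTS[key]
    return caster(raw)  # table properties are stored as strings

props = {"history.expire.min-snapshots-to-keep": "5"}
resolve(props, "history.expire.min-snapshots-to-keep")  # -> 5
resolve(props, "history.expire.max-snapshot-age-ms")    # -> 432000000
```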
```diff
-        executor = ExecutorFactory.get_or_create()
-        results = list(
-            executor.map(
-                lambda manifest_list: self._get_files_from_manifest(manifest_list, data_file_filter), snapshot.manifests(io)
-            )
-        )
-        return pa.concat_tables(results)
+        files_table: list[pa.Table] = []
+        for manifest_list in snapshot.manifests(io):
+            files_table.append(self._get_files_from_manifest(manifest_list, data_file_filter))
+        return pa.concat_tables(files_table)
```
this is de-parallelized? Is that intentional?
Nope, it was not :) Good catch!
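For reference, `Executor.map` preserves input order, so the parallel and sequential forms of this scan produce identical output; a toy demonstration with a stand-in for `_get_files_from_manifest` (names here are illustrative only):

```python
from concurrent.futures import ThreadPoolExecutor

def load(manifest: int) -> list[int]:
    # Stand-in for self._get_files_from_manifest(manifest, ...).
    return [manifest * 10, manifest * 10 + 1]

manifests = [1, 2, 3]

# Sequential form (what the diff under review introduced).
sequential: list[int] = []
for m in manifests:
    sequential.extend(load(m))

# Parallel form (what was there before): executor.map yields results
# in input order, so flattening gives the same list.
with ThreadPoolExecutor() as executor:
    parallel = [item for chunk in executor.map(load, manifests) for item in chunk]

assert sequential == parallel
```

Since the outputs match, restoring the executor version only changes wall-clock time, not results.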
```diff
 def all_manifests(self, snapshots: Optional[Union[list[Snapshot], list[int]]] = None) -> "pa.Table":
     import pyarrow as pa

-    snapshots = self.tbl.snapshots()
+    # coerce into snapshot objects if the user passes in snapshot ids
+    if snapshots is not None:
+        if isinstance(snapshots[0], int):
+            snapshots = [
+                snapshot
+                for snapshot_id in snapshots
+                if (snapshot := self.tbl.metadata.snapshot_by_id(snapshot_id)) is not None
+            ]
+    else:
+        snapshots = self.tbl.snapshots()
```
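The id-to-object coercion above can be exercised standalone with a toy `Snapshot` and lookup table (stand-ins for pyiceberg's real classes; `_SNAPSHOTS` and `coerce` are hypothetical names). Unknown ids are silently dropped by the walrus-guarded comprehension:

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Snapshot:
    snapshot_id: int

# Toy lookup, mimicking tbl.metadata.snapshot_by_id.
_SNAPSHOTS = {1: Snapshot(1), 2: Snapshot(2)}

def coerce(snapshots: Optional[Union[list[Snapshot], list[int]]]) -> list[Snapshot]:
    if snapshots is not None:
        if isinstance(snapshots[0], int):
            # Keep only ids that resolve; the walrus binds the lookup result.
            snapshots = [
                snap for sid in snapshots
                if (snap := _SNAPSHOTS.get(sid)) is not None
            ]
    else:
        snapshots = list(_SNAPSHOTS.values())
    return snapshots

coerce([1, 2, 99])  # -> [Snapshot(1), Snapshot(2)]; unknown id 99 is dropped
coerce(None)        # -> all snapshots
```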
I might have written this and it got cherry-picked in, but I think it's simpler to only allow `Snapshot` objects until there is a need to allow either.
tl;dr: Sounds good, I'll make the change!
I originally forked your branch so I could stack my PR on your “Delete Orphans” PR. As July began, my schedule looked pretty rough, so I converted my draft into a PR against the main pyiceberg branch—since I wasn’t sure how much time I’d have later in the month—but I forgot to rebase. As a result, I inadvertently removed the code for deleting orphans while keeping your MaintenanceTable implementation in a more… manual way 😟, so there may still be some remnants. They say the best form of flattery is imitation 😉.
```python
def expire_snapshots_by_ids(self, snapshot_ids: List[int]) -> None:
    """Expire multiple snapshots by their IDs.
```
Does this signature exist upstream? https://iceberg.apache.org/javadoc/1.9.1/org/apache/iceberg/ExpireSnapshots.html If not, I don't think we should add it here, or at least we should make it private.
I'll check if it exists upstream; if it does, I'll refactor to `expire_snapshots` (I will also check if that exists upstream). Otherwise I will make it private!
```python
try:
    import pyarrow as pa  # noqa
except ModuleNotFoundError as e:
    raise ModuleNotFoundError("For metadata operations PyArrow needs to be installed") from e
```
I don't think we need this without the orphan code included. I'm also thinking of ways we can achieve this without pyarrow, so I think it's safe to remove.
```python
def deduplicate_data_files(self) -> List[DataFile]:
    """
    Remove duplicate data files from an Iceberg table.

    Returns:
        List of removed DataFile objects.
    """
    import os
    from collections import defaultdict
```
maybe logic for this should be added in a different PR?
Yeah, this was one of those selfish things. I was hoping to get it into the next version for an issue I was having. If you feel strongly about it being a separate PR, I can do that :), but if you think we can let it slide, it would solve some headaches for me sooner rather than later.
### Rationale for this change

Consolidates the `ExpireSnapshots` class into the `MaintenanceTable` class for a unified maintenance API.

### Features & Enhancements

- **Duplicate Data File Remediation (#2130)**: Added `deduplicate_data_files` to `MaintenanceTable`.
- **Advanced Snapshot Retention (#2150)**:
  - `retain_last_n_snapshots(n)` — Retain only the latest N snapshots.
  - `expire_snapshots_older_than_with_retention(timestamp_ms, retain_last_n=None, min_snapshots_to_keep=None)` — Expire snapshots older than a given timestamp, with optional retention constraints.
  - `expire_snapshots_with_retention_policy(timestamp_ms=None, retain_last_n=None, min_snapshots_to_keep=None)` — Unified retention policy supporting both time-based and count-based constraints.
  - … `expire_snapshots_older_than` method.

### Bug Fixes & Cleanups

- Removed an errant member variable from the `ManageSnapshots` class, aligning with the Java reference implementation.

### Testing & Documentation

- Consolidated all snapshot expiration and retention tests into `test_retention_strategies.py`, including: …

### Are these changes tested?

Yes. All changes are tested. This work builds on the framework introduced by @jayceslesar in #1200 for the `MaintenanceTable`, with this PR predicated on the final changes from #1200.

### Are there any user-facing changes?

**Breaking Changes:**

- Moved `ExpireSnapshots` functionality to `MaintenanceTable`
- Removed the `ExpireSnapshots` class entirely
- New `table.maintenance.*` API

**API Changes**

Before:

Now:

### Closes

- Support `retainLast` and `setMinSnapshotsToKeep` Snapshot Retention Policies #2150
- Remove unrelated instance variable from the `ManageSnapshots` class #2151