fix(iceberg): Correct test setup to ensure delete files are created #5864
Conversation
Greptile Summary: Fixed the test_overlapping_deletes test setup so that the DELETE operations actually produce delete files. The fix uses coalesce(1) to write the initial data into a single Parquet file, which forces Iceberg to emit delete files and correctly disables count pushdown.
Confidence Score: 5/5
Important Files Changed
Sequence Diagram
```mermaid
sequenceDiagram
    participant Test as Integration Test
    participant Spark as Spark SQL
    participant Iceberg as Iceberg Table
    participant Daft as Daft Reader
    Note over Test,Spark: Test Setup Phase
    Test->>Spark: CREATE TABLE test_overlapping_deletes
    Test->>Spark: createDataFrame(data).coalesce(1)
    Note over Spark: Force single Parquet file
    Spark->>Iceberg: Write 15 rows to single file
    Test->>Spark: DELETE WHERE id <= 5
    Spark->>Iceberg: Create position delete file
    Note over Iceberg: Cannot remove entire file<br/>(has rows 6-15)
    Test->>Spark: DELETE WHERE id <= 3 (overlapping)
    Spark->>Iceberg: Create another delete file
    Test->>Spark: DELETE WHERE id >= 4 AND id <= 8
    Spark->>Iceberg: Create another delete file
    Note over Test,Daft: Test Execution Phase
    Test->>Daft: read_table().count()
    Daft->>Iceberg: Check _has_delete_files()
    Iceberg-->>Daft: True (delete files found)
    Daft->>Daft: Disable count pushdown
    Note over Daft: Regular scan with delete<br/>file processing
    Daft-->>Test: Correct count returned
```
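For context, the execution phase above roughly corresponds to reading the provisioned table with Daft. A minimal sketch, assuming a PyIceberg catalog is already configured; the catalog and table names are illustrative, not taken from the test code:

```python
import daft
from pyiceberg.catalog import load_catalog

# Load the provisioned table through PyIceberg (catalog configuration is assumed).
catalog = load_catalog("default")
table = catalog.load_table("default.test_overlapping_deletes")

# When the table has delete files, Daft must disable the count pushdown and run a
# regular scan that applies them, so the returned count reflects the deleted rows.
df = daft.read_iceberg(table)
print(df.count_rows())
```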
Additional Comments (3)
- daft/io/iceberg/iceberg_scan.py, lines 311-314 (logic): incorrect snapshot summary field used. `deleted-data-files` refers to data files that were removed from the table (e.g., during compaction), not to delete files used for merge-on-read; check `total-delete-files` instead to detect positional/equality delete files.
- daft/io/iceberg/iceberg_scan.py, lines 312-313 (syntax): `current_snapshot.summary` values are strings, not integers; convert them to int before comparison.
- daft/io/iceberg/iceberg_scan.py, lines 309-310 (style): do the snapshot summary check before scanning tasks, since accessing metadata is cheaper than calling `plan_files()`.
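Taken together, the three comments suggest roughly the following shape for the check. This is a sketch of the reviewers' suggestion, not the merged code, and it assumes a PyIceberg table whose snapshot summary behaves like a string-keyed mapping:

```python
def _has_delete_files(iceberg_table) -> bool:
    """Sketch: prefer cheap snapshot metadata over planning scan tasks."""
    snapshot = iceberg_table.current_snapshot()
    if snapshot is None:
        return False

    # Summary values are strings; `total-delete-files` counts live positional/equality
    # delete files, whereas `deleted-data-files` counts data files removed from the table.
    summary = snapshot.summary or {}
    total_delete_files = summary.get("total-delete-files")
    if total_delete_files is not None:
        return int(total_delete_files) > 0

    # Fall back to planning files only when the summary does not carry the field.
    return any(task.delete_files for task in iceberg_table.scan().plan_files())
```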
1 file reviewed, 3 comments
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #5864      +/-   ##
==========================================
- Coverage   72.37%   72.35%   -0.02%
==========================================
  Files         965      965
  Lines      125727   125718       -9
==========================================
- Hits        90992    90969      -23
- Misses      34735    34749      +14
```
In the original provision.py script, the initial data for the test_overlapping_deletes table is written with Spark INSERT statements. Because the data volume is very small (15 rows), Spark spreads it across multiple tiny Parquet data files. When a subsequent delete such as DELETE FROM default.test_overlapping_deletes WHERE id <= 5 is executed, Iceberg/Spark sees that the rows matching the delete predicate are entirely contained in certain data files, so the optimizer takes the cheaper route: it marks those whole data files as removed in the manifest list instead of generating position/equality delete files that record the locations of the deleted rows.
Since this process produces no delete files at all, Daft's _has_delete_files() check naturally finds none and incorrectly concludes that count pushdown is safe. Although the row count computed from metadata happens to be correct in this specific case, this contradicts the intent of the test case, which is to exercise the handling logic when delete files are present.
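For illustration, a condensed sketch of the kind of setup described above; it is not the exact provision.py code, and the schema, SparkSession handling, and table properties are assumptions:

```python
# Hypothetical sketch of the original, problematic provisioning (not the exact script).
# Assumes an existing SparkSession `spark` with an Iceberg catalog configured.
spark.sql("CREATE TABLE default.test_overlapping_deletes (id BIGINT, value STRING) USING iceberg")

# Plain INSERTs of 15 rows can end up spread across several tiny Parquet data files.
for i in range(1, 16):
    spark.sql(f"INSERT INTO default.test_overlapping_deletes VALUES ({i}, 'row-{i}')")

# When every row matching the predicate lives in data files made up only of matching rows,
# Iceberg can satisfy the DELETE by dropping those files in metadata,
# so no position/equality delete files are written.
spark.sql("DELETE FROM default.test_overlapping_deletes WHERE id <= 5")
```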
Changes Made
The integration test TestIcebergCountPushdown.test_count_pushdown_with_delete_files was failing for the test_overlapping_deletes table because it incorrectly enabled count pushdown.
The root cause was that the initial Spark write created multiple small data files. Subsequent DELETE operations were optimized by Iceberg to mark entire data files as removed instead of generating position/equality delete files. As a result, Daft's _has_delete_files() check did not find any delete files and incorrectly allowed the count pushdown optimization.
This PR fixes the test by adding coalesce(1) to the Spark DataFrame before writing the initial data for the test_overlapping_deletes table. This ensures the data is written to a single Parquet file, forcing subsequent DELETE operations to generate actual delete files. This aligns the test's behavior with its intent, correctly disabling count pushdown when delete files are present.
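A minimal sketch of the fix described above, again with an assumed SparkSession and illustrative column names; the real test uses whatever schema provision.py defines:

```python
# Sketch of the fixed provisioning: coalesce(1) forces a single Parquet data file,
# so the later DELETEs cannot be satisfied by dropping whole files and must write
# real delete files instead.
data = [(i, f"row-{i}") for i in range(1, 16)]
df = spark.createDataFrame(data, ["id", "value"]).coalesce(1)
df.writeTo("default.test_overlapping_deletes").append()

# Overlapping deletes against the single data file each produce a delete file,
# which Daft's _has_delete_files() check can then detect.
spark.sql("DELETE FROM default.test_overlapping_deletes WHERE id <= 5")
spark.sql("DELETE FROM default.test_overlapping_deletes WHERE id <= 3")
spark.sql("DELETE FROM default.test_overlapping_deletes WHERE id >= 4 AND id <= 8")
```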
Related Issues
#5863