fix(iceberg): Correct test setup to ensure delete files are created #5864
Conversation
Greptile Summary: Fixed the test_overlapping_deletes test setup so that the DELETE operations actually produce delete files. The fix uses coalesce(1) to write the initial data into a single Parquet file, which forces Iceberg to emit delete files and correctly disables count pushdown.
Confidence Score: 5/5
Important Files Changed
Sequence Diagram
```mermaid
sequenceDiagram
    participant Test as Integration Test
    participant Spark as Spark SQL
    participant Iceberg as Iceberg Table
    participant Daft as Daft Reader
    Note over Test,Spark: Test Setup Phase
    Test->>Spark: CREATE TABLE test_overlapping_deletes
    Test->>Spark: createDataFrame(data).coalesce(1)
    Note over Spark: Force single Parquet file
    Spark->>Iceberg: Write 15 rows to single file
    Test->>Spark: DELETE WHERE id <= 5
    Spark->>Iceberg: Create position delete file
    Note over Iceberg: Cannot remove entire file<br/>(has rows 6-15)
    Test->>Spark: DELETE WHERE id <= 3 (overlapping)
    Spark->>Iceberg: Create another delete file
    Test->>Spark: DELETE WHERE id >= 4 AND id <= 8
    Spark->>Iceberg: Create another delete file
    Note over Test,Daft: Test Execution Phase
    Test->>Daft: read_table().count()
    Daft->>Iceberg: Check _has_delete_files()
    Iceberg-->>Daft: True (delete files found)
    Daft->>Daft: Disable count pushdown
    Note over Daft: Regular scan with delete<br/>file processing
    Daft-->>Test: Correct count returned
```
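For context, the execution phase above roughly corresponds to reading the provisioned table with Daft. A minimal sketch, assuming a PyIceberg catalog is already configured; the catalog and table names are illustrative, not taken from the test code:

```python
import daft
from pyiceberg.catalog import load_catalog

# Load the provisioned table through PyIceberg (catalog configuration is assumed).
catalog = load_catalog("default")
table = catalog.load_table("default.test_overlapping_deletes")

# When the table has delete files, Daft must disable the count pushdown and run a
# regular scan that applies them, so the returned count reflects the deleted rows.
df = daft.read_iceberg(table)
print(df.count_rows())
```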
Additional Comments (3)
- daft/io/iceberg/iceberg_scan.py, lines 311-314 (logic): incorrect snapshot summary field used. `deleted-data-files` refers to data files that were removed from the table (e.g., during compaction), not to delete files used for merge-on-read; check `total-delete-files` instead to detect positional/equality delete files.
- daft/io/iceberg/iceberg_scan.py, lines 312-313 (syntax): `current_snapshot.summary` values are strings, not integers; convert them to int before comparison.
- daft/io/iceberg/iceberg_scan.py, lines 309-310 (style): do the snapshot summary check before scanning tasks, since accessing metadata is cheaper than calling `plan_files()`.
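Taken together, the three comments suggest roughly the following shape for the check. This is a sketch of the reviewers' suggestion, not the merged code, and it assumes a PyIceberg table whose snapshot summary behaves like a string-keyed mapping:

```python
def _has_delete_files(iceberg_table) -> bool:
    """Sketch: prefer cheap snapshot metadata over planning scan tasks."""
    snapshot = iceberg_table.current_snapshot()
    if snapshot is None:
        return False

    # Summary values are strings; `total-delete-files` counts live positional/equality
    # delete files, whereas `deleted-data-files` counts data files removed from the table.
    summary = snapshot.summary or {}
    total_delete_files = summary.get("total-delete-files")
    if total_delete_files is not None:
        return int(total_delete_files) > 0

    # Fall back to planning files only when the summary does not carry the field.
    return any(task.delete_files for task in iceberg_table.scan().plan_files())
```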
1 file reviewed, 3 comments
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #5864      +/-   ##
==========================================
- Coverage   72.37%   72.35%   -0.02%
==========================================
  Files         965      965
  Lines      125727   125718       -9
==========================================
- Hits        90992    90969      -23
- Misses      34735    34749      +14
```
In the original provision.py script, the initial data for the test_overlapping_deletes table is written with Spark INSERT statements. Because the data volume is very small (15 rows), Spark spreads it across multiple tiny Parquet data files. When a subsequent delete such as DELETE FROM default.test_overlapping_deletes WHERE id <= 5 is executed, Iceberg/Spark sees that the rows matching the delete predicate are entirely contained in certain data files, so the optimizer takes the cheaper route: it marks those whole data files as removed in the manifest list instead of generating position/equality delete files that record the locations of the deleted rows.
Since this process produces no delete files at all, Daft's _has_delete_files() check naturally finds none and incorrectly concludes that count pushdown is safe. Although the row count computed from metadata happens to be correct in this specific case, this contradicts the intent of the test case, which is to exercise the handling logic when delete files are present.
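For illustration, a condensed sketch of the kind of setup described above; it is not the exact provision.py code, and the schema, SparkSession handling, and table properties are assumptions:

```python
# Hypothetical sketch of the original, problematic provisioning (not the exact script).
# Assumes an existing SparkSession `spark` with an Iceberg catalog configured.
spark.sql("CREATE TABLE default.test_overlapping_deletes (id BIGINT, value STRING) USING iceberg")

# Plain INSERTs of 15 rows can end up spread across several tiny Parquet data files.
for i in range(1, 16):
    spark.sql(f"INSERT INTO default.test_overlapping_deletes VALUES ({i}, 'row-{i}')")

# When every row matching the predicate lives in data files made up only of matching rows,
# Iceberg can satisfy the DELETE by dropping those files in metadata,
# so no position/equality delete files are written.
spark.sql("DELETE FROM default.test_overlapping_deletes WHERE id <= 5")
```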
Changes Made
The integration test TestIcebergCountPushdown.test_count_pushdown_with_delete_files was failing for the test_overlapping_deletes table because it incorrectly enabled count pushdown.
The root cause was that the initial Spark write created multiple small data files. Subsequent DELETE operations were optimized by Iceberg to mark entire data files as removed instead of generating position/equality delete files. As a result, Daft's _has_delete_files() check did not find any delete files and incorrectly allowed the count pushdown optimization.
This PR fixes the test by adding coalesce(1) to the Spark DataFrame before writing the initial data for the test_overlapping_deletes table. This ensures the data is written to a single Parquet file, forcing subsequent DELETE operations to generate actual delete files. This aligns the test's behavior with its intent, correctly disabling count pushdown when delete files are present.
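A minimal sketch of the fix described above, again with an assumed SparkSession and illustrative column names; the real test uses whatever schema provision.py defines:

```python
# Sketch of the fixed provisioning: coalesce(1) forces a single Parquet data file,
# so the later DELETEs cannot be satisfied by dropping whole files and must write
# real delete files instead.
data = [(i, f"row-{i}") for i in range(1, 16)]
df = spark.createDataFrame(data, ["id", "value"]).coalesce(1)
df.writeTo("default.test_overlapping_deletes").append()

# Overlapping deletes against the single data file each produce a delete file,
# which Daft's _has_delete_files() check can then detect.
spark.sql("DELETE FROM default.test_overlapping_deletes WHERE id <= 5")
spark.sql("DELETE FROM default.test_overlapping_deletes WHERE id <= 3")
spark.sql("DELETE FROM default.test_overlapping_deletes WHERE id >= 4 AND id <= 8")
```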
Related Issues
#5863