
Conversation

@huleilei
Contributor

@huleilei huleilei commented Dec 21, 2025

Changes Made

The integration test TestIcebergCountPushdown.test_count_pushdown_with_delete_files was failing for the test_overlapping_deletes table because Daft incorrectly enabled count pushdown for that table.
The root cause was that the initial Spark write created multiple small data files. Subsequent DELETE operations were optimized by Iceberg to mark entire data files as removed instead of generating position/equality delete files. As a result, Daft's _has_delete_files() check did not find any delete files and incorrectly allowed the count pushdown optimization.
This PR fixes the test by adding coalesce(1) to the Spark DataFrame before writing the initial data for the test_overlapping_deletes table. This ensures the data is written to a single Parquet file, forcing subsequent DELETE operations to generate actual delete files. This aligns the test's behavior with its intent, correctly disabling count pushdown when delete files are present.
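A minimal sketch of the change, assuming a Spark session with an Iceberg catalog and a hypothetical two-column schema; the actual columns, values, and table properties live in provision.py and may differ:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows/schema for illustration; provision.py defines the real ones.
rows = [(i, f"row_{i}") for i in range(1, 16)]  # 15 rows
df = spark.createDataFrame(rows, ["id", "data"])

# coalesce(1) collapses the output to a single Parquet data file, so later
# DELETE statements cannot be satisfied by dropping whole files and must
# write position/equality delete files instead.
df.coalesce(1).writeTo("default.test_overlapping_deletes").append()
```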

Related Issues

#5863

@github-actions github-actions bot added the fix label Dec 21, 2025
@huleilei huleilei changed the title fix(iceberg): fix Iceberg Delete File Detection Logic fix(iceberg): disable count pushdown when delete files are present Dec 21, 2025
@greptile-apps
Contributor

greptile-apps bot commented Dec 21, 2025

Greptile Summary

Fixed the test_overlapping_deletes integration test by ensuring delete files are actually created during test setup. The original SQL INSERT approach created multiple small Parquet files, which allowed Iceberg to optimize DELETE operations by marking entire files as removed instead of generating position/equality delete files. This caused the test to fail because _has_delete_files() returned false, incorrectly enabling count pushdown optimization.

The fix uses spark.createDataFrame().coalesce(1) to write all 15 rows into a single Parquet file, forcing subsequent DELETE operations to generate actual delete files since they can only delete partial file contents. This aligns the test behavior with its intent.

  • Changed data insertion from SQL INSERT to DataFrame API with coalesce(1)
  • Simplified boolean expression in supports_count_pushdown() from True and not self._has_delete_files() to not self._has_delete_files()
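The simplification is purely mechanical; a sketch of the resulting method, with the body inferred from the description above rather than copied from iceberg_scan.py:

```python
def supports_count_pushdown(self) -> bool:
    # Previously: return True and not self._has_delete_files()
    # "True and x" is just "x", so dropping the constant changes nothing.
    return not self._has_delete_files()
```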

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are well-targeted and address a legitimate test setup issue. The test was failing not because of a bug in Daft's logic, but because the test setup wasn't creating the conditions it was designed to test. Using coalesce(1) is a standard Spark technique to control file output, and the simplification of the boolean expression is a straightforward refactoring with no behavioral change.
  • No files require special attention

Important Files Changed

Filename | Overview
daft/io/iceberg/iceberg_scan.py | Simplified boolean logic in supports_count_pushdown() by removing the redundant "True and"
tests/integration/iceberg/docker-compose/provision.py | Changed data insertion to use a DataFrame with coalesce(1) to ensure a single Parquet file, forcing delete file creation

Sequence Diagram

sequenceDiagram
    participant Test as Integration Test
    participant Spark as Spark SQL
    participant Iceberg as Iceberg Table
    participant Daft as Daft Reader
    
    Note over Test,Spark: Test Setup Phase
    Test->>Spark: CREATE TABLE test_overlapping_deletes
    Test->>Spark: createDataFrame(data).coalesce(1)
    Note over Spark: Force single Parquet file
    Spark->>Iceberg: Write 15 rows to single file
    
    Test->>Spark: DELETE WHERE id <= 5
    Spark->>Iceberg: Create position delete file
    Note over Iceberg: Cannot remove entire file<br/>(has rows 6-15)
    
    Test->>Spark: DELETE WHERE id <= 3 (overlapping)
    Spark->>Iceberg: Create another delete file
    
    Test->>Spark: DELETE WHERE id >= 4 AND id <= 8
    Spark->>Iceberg: Create another delete file
    
    Note over Test,Daft: Test Execution Phase
    Test->>Daft: read_table().count()
    Daft->>Iceberg: Check _has_delete_files()
    Iceberg-->>Daft: True (delete files found)
    Daft->>Daft: Disable count pushdown
    Note over Daft: Regular scan with delete<br/>file processing
    Daft-->>Test: Correct count returned

Contributor

@greptile-apps greptile-apps bot left a comment


Additional Comments (3)

  1. daft/io/iceberg/iceberg_scan.py, line 311-314 (link)

    logic: incorrect snapshot summary field used - deleted-data-files refers to data files that were REMOVED from the table (e.g., during compaction), not delete files used for merge-on-read operations. should check total-delete-files instead to detect positional/equality delete files

  2. daft/io/iceberg/iceberg_scan.py, line 312-313 (link)

    syntax: current_snapshot.summary values are strings, not integers. need to convert to int before comparison

  3. daft/io/iceberg/iceberg_scan.py, line 309-310 (link)

    style: snapshot summary check should be done first before scanning tasks since it's more efficient - accessing metadata is cheaper than calling plan_files()

1 file reviewed, 3 comments
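Taken together, the three comments point at checking the snapshot summary with the total-delete-files key, converting its string value before comparing, and doing so before any file planning. A hypothetical helper illustrating that shape (not the actual iceberg_scan.py code, which is not shown in this thread):

```python
def snapshot_reports_delete_files(summary) -> bool:
    """Cheap metadata-only check, assuming `summary` is the Iceberg snapshot
    summary exposed as a mapping of string keys to string values."""
    # "total-delete-files" counts merge-on-read (position/equality) delete files;
    # "deleted-data-files" counts whole data files removed from the table,
    # which is a different event entirely.
    return int(summary.get("total-delete-files", "0")) > 0
```

Consulting this summary first, and only falling back to scanning planned files when the key is missing, keeps the common case to a metadata lookup instead of a plan_files() call.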


@codecov

codecov bot commented Dec 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.35%. Comparing base (38cc5d5) to head (059ddb7).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #5864      +/-   ##
==========================================
- Coverage   72.37%   72.35%   -0.02%     
==========================================
  Files         965      965              
  Lines      125727   125718       -9     
==========================================
- Hits        90992    90969      -23     
- Misses      34735    34749      +14     
Files with missing lines Coverage Δ
daft/io/iceberg/iceberg_scan.py 76.02% <100.00%> (ø)

... and 7 files with indirect coverage changes


@huleilei huleilei changed the title fix(iceberg): disable count pushdown when delete files are present fix(iceberg): Correct test setup to ensure delete files are created Dec 21, 2025
@huleilei huleilei marked this pull request as draft December 21, 2025 11:59
@huleilei huleilei marked this pull request as ready for review December 21, 2025 11:59
@greptile-apps
Contributor

greptile-apps bot commented Dec 21, 2025

Greptile's behavior is changing!

From now on, if a review finishes with no comments, we will not post an additional "statistics" comment to confirm that our review found nothing to comment on. However, you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

Contributor Author

@huleilei huleilei left a comment


In the original provision.py script, the initial data for the test_overlapping_deletes table was written with Spark INSERT statements. Because the data volume is very small (15 rows), Spark wrote it out as multiple tiny Parquet data files. When a subsequent delete such as DELETE FROM default.test_overlapping_deletes WHERE id <= 5 was executed, Iceberg/Spark saw that the rows matching the delete predicate were entirely contained in certain data files, so the optimizer took the cheaper path: it marked those whole data files as deleted in the manifests instead of writing position/equality delete files that record the specific rows removed.

Since this process does not produce any delete files, Daft's _has_delete_files() check naturally finds nothing and incorrectly concludes that count pushdown is safe. In this particular case the row count computed from metadata happens to be correct, but it contradicts the intent of the test case, which is to exercise the handling logic when delete files are present.
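One way to observe the difference described above is to query the table's delete_files metadata table after a DELETE; a sketch, assuming the Spark session and Iceberg catalog configured in provision.py and a recent Iceberg Spark runtime:

```python
# Run one of the overlapping deletes from the test setup.
spark.sql("DELETE FROM default.test_overlapping_deletes WHERE id <= 5")

# With many tiny data files, Iceberg can satisfy the DELETE by dropping whole
# files and this metadata table stays empty; with a single coalesced file it
# should list the newly written position/equality delete files.
spark.sql(
    "SELECT file_path, content FROM default.test_overlapping_deletes.delete_files"
).show(truncate=False)
```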

@colin-ho colin-ho merged commit 050d6e9 into Eventual-Inc:main Dec 22, 2025
39 checks passed