
Commit 050d6e9

huleilei authored
fix(iceberg): Correct test setup to ensure delete files are created (#5864)
## Changes Made The integration test TestIcebergCountPushdown.test_count_pushdown_with_delete_files was failing for the test_overlapping_deletes table because it incorrectly enabled count pushdown. The root cause was that the initial Spark write created multiple small data files. Subsequent DELETE operations were optimized by Iceberg to mark entire data files as removed instead of generating position/equality delete files. As a result, Daft's _has_delete_files() check did not find any delete files and incorrectly allowed the count pushdown optimization. This PR fixes the test by adding coalesce(1) to the Spark DataFrame before writing the initial data for the test_overlapping_deletes table. This ensures the data is written to a single Parquet file, forcing subsequent DELETE operations to generate actual delete files. This aligns the test's behavior with its intent, correctly disabling count pushdown when delete files are present. ## Related Issues #5863 5863 <!-- Link to related GitHub issues, e.g., "Closes #123" --> --------- Co-authored-by: root <root@bytedance> Co-authored-by: huleilei <huleilei@bytedance>
1 parent b200c59 commit 050d6e9
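
For context, below is a minimal sketch of the kind of check behind Daft's _has_delete_files(), written against PyIceberg's scan-planning API purely for illustration. The catalog name and table identifier are placeholders, and Daft's actual implementation may differ:

```python
# Hypothetical sketch, not Daft's real code: detect whether any delete
# files apply to a table's current snapshot using PyIceberg.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")  # placeholder catalog configuration
table = catalog.load_table("default.test_overlapping_deletes")

# Each FileScanTask pairs one data file with the position/equality delete
# files that apply to it; any non-empty list means stored row counts alone
# cannot answer COUNT(*).
has_deletes = any(task.delete_files for task in table.scan().plan_files())
print(has_deletes)
```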

File tree

2 files changed (+22, -22 lines)

daft/io/iceberg/iceberg_scan.py

Lines changed: 1 addition & 1 deletion

@@ -321,7 +321,7 @@ def can_absorb_select(self) -> bool:
         return True
 
     def supports_count_pushdown(self) -> bool:
-        return True and not self._has_delete_files()
+        return not self._has_delete_files()
 
     def supported_count_modes(self) -> list[CountMode]:
         return [CountMode.All]
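
This gate exists because count pushdown answers COUNT(*) from metadata: Iceberg manifests record a row count per data file, so no data needs to be scanned, but deleted rows are still included in those stored counts. A hedged sketch of that reasoning, again using PyIceberg's API for illustration rather than Daft's internals:

```python
# Hedged sketch: answer COUNT(*) from Iceberg metadata alone, which is
# only safe when no delete files exist. record_count is the per-data-file
# row count stored in the manifest.
from pyiceberg.table import Table


def metadata_count(table: Table) -> int:
    tasks = list(table.scan().plan_files())
    if any(task.delete_files for task in tasks):
        # Deleted rows are still included in record_count, so a
        # metadata-only answer would overcount; a real scan is required.
        raise ValueError("delete files present; cannot count from metadata")
    return sum(task.file.record_count for task in tasks)
```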

tests/integration/iceberg/docker-compose/provision.py

Lines changed: 21 additions & 21 deletions

@@ -452,27 +452,27 @@
     """
 )
 
-spark.sql(
-    """
-    INSERT INTO default.test_overlapping_deletes
-    VALUES
-    (1, 'Alice', 100.0, 'A'),
-    (2, 'Bob', 200.0, 'B'),
-    (3, 'Charlie', 300.0, 'A'),
-    (4, 'David', 400.0, 'B'),
-    (5, 'Eve', 500.0, 'A'),
-    (6, 'Frank', 600.0, 'B'),
-    (7, 'Grace', 700.0, 'A'),
-    (8, 'Henry', 800.0, 'B'),
-    (9, 'Ivy', 900.0, 'A'),
-    (10, 'Jack', 1000.0, 'B'),
-    (11, 'Kate', 1100.0, 'A'),
-    (12, 'Leo', 1200.0, 'B'),
-    (13, 'Mary', 1300.0, 'A'),
-    (14, 'Nick', 1400.0, 'B'),
-    (15, 'Olivia', 1500.0, 'A');
-    """
-)
+data = [
+    (1, "Alice", 100.0, "A"),
+    (2, "Bob", 200.0, "B"),
+    (3, "Charlie", 300.0, "A"),
+    (4, "David", 400.0, "B"),
+    (5, "Eve", 500.0, "A"),
+    (6, "Frank", 600.0, "B"),
+    (7, "Grace", 700.0, "A"),
+    (8, "Henry", 800.0, "B"),
+    (9, "Ivy", 900.0, "A"),
+    (10, "Jack", 1000.0, "B"),
+    (11, "Kate", 1100.0, "A"),
+    (12, "Leo", 1200.0, "B"),
+    (13, "Mary", 1300.0, "A"),
+    (14, "Nick", 1400.0, "B"),
+    (15, "Olivia", 1500.0, "A"),
+]
+columns = ["id", "name", "value", "category"]
+df = spark.createDataFrame(data, columns)
+df = df.coalesce(1)
+df.writeTo("default.test_overlapping_deletes").append()
 
 spark.sql(
     """
