Filter function simplification when there are multiple Iceberg equality delete files #8

@yingsu00

Description

Velox can flatten logical expressions, but it cannot yet simplify them automatically. For example, the expression a AND (b AND (c AND d)) is flattened to AND(a,b,c,d), but a AND (a OR b) is not simplified to a. As a result, evaluating a AND (a OR b) evaluates both a and b and performs one AND and one OR operation. While we hope to improve logical expression simplification in the future, we can already make some simple improvements for Iceberg.
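To make the difference between flattening and simplification concrete, here is a minimal Python sketch. It is not Velox code: the `Op` class and the `flatten` and `absorb` helpers are hypothetical names used only for illustration, and leaf expressions are modeled as plain strings.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Op:
    """Hypothetical n-ary logical node; leaves are plain strings."""
    kind: str                      # "AND" or "OR"
    args: Tuple["Expr", ...]


Expr = Union[str, Op]


def flatten(e: Expr) -> Expr:
    """Merge nested nodes of the same kind: a AND (b AND c) -> AND(a, b, c)."""
    if isinstance(e, str):
        return e
    out = []
    for a in e.args:
        a = flatten(a)
        if isinstance(a, Op) and a.kind == e.kind:
            out.extend(a.args)
        else:
            out.append(a)
    return Op(e.kind, tuple(out))


def absorb(e: Expr) -> Expr:
    """Apply the absorption law: x AND (x OR y) -> x, x OR (x AND y) -> x."""
    if isinstance(e, str):
        return e
    # Simplify children first, then flatten this level again.
    e = flatten(Op(e.kind, tuple(absorb(a) for a in e.args)))
    dual = "OR" if e.kind == "AND" else "AND"
    kept = []
    for a in e.args:
        # A dual-kind term is redundant if one of its operands is also a sibling.
        absorbed = (
            isinstance(a, Op)
            and a.kind == dual
            and any(s in a.args for s in e.args if s is not a)
        )
        if not absorbed and a not in kept:
            kept.append(a)
    return kept[0] if len(kept) == 1 else Op(e.kind, tuple(kept))


nested = Op("AND", ("a", Op("AND", ("b", Op("AND", ("c", "d"))))))
flat = flatten(nested)             # AND(a, b, c, d)
simplified = absorb(Op("AND", ("a", Op("OR", ("a", "b")))))  # "a"
```

Flattening only changes the shape of the tree, while absorption removes work: in the second example, b no longer needs to be evaluated at all.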

An Iceberg split can come with multiple equality delete files whose schemas overlap. For example:
Equality delete file 1

equality_ids=[1, 2, 3]
 1: id | 2: category | 3: name
-------|-------------|---------
 1     | mouse       | Micky
 2     | mouse       | Minnie
 3     | bear        | Winnie
 4     | bear        | Betty

Equality delete file 2

equality_ids=[2]
 2: category
-------------
 mouse

Equality delete file 3

equality_ids=[2, 3]
 2: category | 3: name
-------------|---------
 bear        | Winnie

Equality delete file 2 is on the category column alone and removes every tuple whose category is mouse. The first two rows in equality delete file 1 are therefore already covered and don't need to be read or compiled into the filter. Similarly, the single row in file 3 covers row 3 in file 1, so that row doesn't need to be read or compiled either. The simplified delete files are as follows:

equality_ids=[1, 2, 3]
 1: id | 2: category | 3: name
-------|-------------|---------
 4     | bear        | Betty

and

equality_ids=[2]
 2: category
-------------
 mouse

and

equality_ids=[2, 3]
 2: category | 3: name
-------------|---------
 bear        | Winnie

With this simplification, the resulting expression is simpler and cheaper to evaluate.
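The pruning described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Velox implementation: the in-memory representation (a tuple of equality_ids plus a list of rows keyed by field id) and the function name `prune_delete_rows` are hypothetical, and all delete files are assumed to apply to the same data file. A row is dropped when some delete file keyed on a strict subset of its columns contains a row that matches on every one of those columns.

```python
from typing import Dict, List, Tuple

Row = Dict[int, object]                          # field id -> value
DeleteFile = Tuple[Tuple[int, ...], List[Row]]   # (equality_ids, rows)


def prune_delete_rows(files: List[DeleteFile]) -> List[DeleteFile]:
    """Drop delete rows already covered by a delete file on fewer columns."""
    pruned = []
    for ids, rows in files:
        # Only files keyed on a strict subset of this file's columns can cover it.
        narrower = [
            (o_ids, o_rows) for o_ids, o_rows in files if set(o_ids) < set(ids)
        ]
        kept = [
            row
            for row in rows
            if not any(
                all(row[i] == other[i] for i in o_ids)
                for o_ids, o_rows in narrower
                for other in o_rows
            )
        ]
        pruned.append((ids, kept))
    return pruned


# The three equality delete files from the example above.
files: List[DeleteFile] = [
    ((1, 2, 3), [
        {1: 1, 2: "mouse", 3: "Micky"},
        {1: 2, 2: "mouse", 3: "Minnie"},
        {1: 3, 2: "bear", 3: "Winnie"},
        {1: 4, 2: "bear", 3: "Betty"},
    ]),
    ((2,), [{2: "mouse"}]),
    ((2, 3), [{2: "bear", 3: "Winnie"}]),
]

pruned = prune_delete_rows(files)
```

On the example data, only the (4, bear, Betty) row of file 1 survives, and files 2 and 3 are unchanged. Ignoring null handling, the remaining filter is then roughly NOT (id = 4 AND category = 'bear' AND name = 'Betty') AND category <> 'mouse' AND NOT (category = 'bear' AND name = 'Winnie'), instead of a conjunction that also carries the three redundant rows.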
