Description
Velox can flatten logical expressions, but it cannot yet simplify them automatically. For example, the expression a AND (b AND (c AND d)) is flattened to AND(a,b,c,d), but a AND (a OR b) is not simplified to a. To evaluate a AND (a OR b), both a and b are evaluated, and one AND and one OR operation are performed. While we hope to improve logical expression simplification in the future, we can still make some simple improvements for Iceberg now.
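To make the distinction concrete, here is a minimal Python sketch of flattening versus absorption-based simplification. It is not Velox code: the `Var` and `Op` classes and the `flatten` and `simplify` functions are hypothetical names made up for illustration, and only the absorption law (a AND (a OR b) = a, and its dual) is handled.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Op:
    kind: str                    # "AND" or "OR"
    args: Tuple["Expr", ...]


Expr = Union[Var, Op]


def flatten(e: Expr) -> Expr:
    """Merge nested operators of the same kind: a AND (b AND c) -> AND(a, b, c)."""
    if isinstance(e, Var):
        return e
    flat = []
    for arg in e.args:
        arg = flatten(arg)
        if isinstance(arg, Op) and arg.kind == e.kind:
            flat.extend(arg.args)
        else:
            flat.append(arg)
    return Op(e.kind, tuple(flat))


def simplify(e: Expr) -> Expr:
    """Flatten, drop duplicate operands, then apply the absorption law."""
    if isinstance(e, Var):
        return e
    e = flatten(Op(e.kind, tuple(simplify(arg) for arg in e.args)))
    unique = list(dict.fromkeys(e.args))      # order-preserving de-duplication
    dual = "OR" if e.kind == "AND" else "AND"
    kept = [
        arg for arg in unique
        # An operand of the dual kind is absorbed when a sibling appears inside it.
        if not (isinstance(arg, Op) and arg.kind == dual
                and any(sib in arg.args for sib in unique if sib != arg))
    ]
    return kept[0] if len(kept) == 1 else Op(e.kind, tuple(kept))


a, b, c, d = Var("a"), Var("b"), Var("c"), Var("d")
nested = Op("AND", (a, Op("AND", (b, Op("AND", (c, d))))))
absorbable = Op("AND", (a, Op("OR", (a, b))))
flattened = flatten(nested)            # AND(a, b, c, d)
simplified = simplify(absorbable)      # a
```

Flattening alone leaves `absorbable` unchanged, which is why both operands still get evaluated today.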
An Iceberg split can come with multiple equality delete files whose schemas overlap. For example:
Equality delete file 1
equality_ids=[1, 2, 3]
1: id | 2: category | 3: name
-------|-------------|---------
1 | mouse | Micky
2 | mouse | Minnie
3 | bear | Winnie
4 | bear | Betty
Equality delete file 2
equality_ids=[2]
2: category
---------------
mouse
Equality delete file 3
equality_ids=[2, 3]
2: category | 3: name
----------------|-------------
bear | Winnie
Equality delete file 2 is on the category column alone and removes every tuple whose category is mouse. The first two rows in equality delete file 1 are therefore already covered and don't need to be read or compiled. Similarly, the single row in file 3 covers row 3 in file 1, so that row doesn't need to be read or compiled either. The simplified delete files look like this:
equality_ids=[1, 2, 3]
1: id | 2: category | 3: name
-------|-------------|---------
4 | bear | Betty
and
equality_ids=[2]
2: category
---------------
mouse
and
equality_ids=[2, 3]
2: category | 3: name
----------------|-------------
bear | Winnie
With this simplification, the resulting expression is simpler and cheaper to evaluate.
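The pruning described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Velox or Iceberg implementation: `DeleteFile` and `prune_subsumed_rows` are hypothetical names, rows are modeled as dicts keyed by field id, and a row is dropped only when another file with a strict subset of its equality ids contains a row matching on those ids.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class DeleteFile:
    equality_ids: Tuple[int, ...]
    rows: List[Dict[int, object]]      # field id -> value


def prune_subsumed_rows(files: List[DeleteFile]) -> List[DeleteFile]:
    """Drop delete rows already covered by a delete file on fewer columns."""
    # For each file, index its rows as tuples ordered by sorted field id.
    indexed = [
        (frozenset(f.equality_ids),
         {tuple(row[i] for i in sorted(f.equality_ids)) for row in f.rows})
        for f in files
    ]
    result = []
    for f in files:
        own_ids = frozenset(f.equality_ids)
        kept = []
        for row in f.rows:
            covered = any(
                ids < own_ids          # strict subset of this file's columns
                and tuple(row[i] for i in sorted(ids)) in row_set
                for ids, row_set in indexed
            )
            if not covered:
                kept.append(row)
        result.append(DeleteFile(f.equality_ids, kept))
    return result


file1 = DeleteFile((1, 2, 3), [
    {1: 1, 2: "mouse", 3: "Micky"},
    {1: 2, 2: "mouse", 3: "Minnie"},
    {1: 3, 2: "bear", 3: "Winnie"},
    {1: 4, 2: "bear", 3: "Betty"},
])
file2 = DeleteFile((2,), [{2: "mouse"}])
file3 = DeleteFile((2, 3), [{2: "bear", 3: "Winnie"}])

pruned = prune_subsumed_rows([file1, file2, file3])
```

On the example above, only the (4, bear, Betty) row of file 1 survives, while files 2 and 3 are unchanged. Requiring a strict subset keeps a file from pruning its own rows; duplicate rows across files with identical equality ids are deliberately left alone in this sketch.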