Skip to content

Conversation

gianm
Copy link
Contributor

@gianm gianm commented Jun 5, 2025

Introduces new multi-value string functions at the native layer. These are now used for the SQL MV_OVERLAP and MV_CONTAINS functions. The main purpose of the new functions is to have slightly different null and empty-array handling behavior vs array_overlap and array_contains. Behavior is designed to align with the native inType filter, which is especially important for MV_OVERLAP, because the SQL layer can convert MV_OVERLAP into an inType filter.

The SQL layer no longer generates calls to the mv_harmonize_nulls function, because that logic has been moved inside mv_overlap and mv_contains. The function still remains at the native layer for backwards compatibility.

Introduces new multi-value string functions at the native layer. These
are now used for the SQL `MV_OVERLAP` and `MV_CONTAINS` functions. The main
purpose of the new functions is to have slightly different null and empty-array
handling behavior vs `array_overlap` and `array_contains`. Behavior is designed
to align with the native `inType` filter, which is especially important for
`MV_OVERLAP`, because the SQL layer can convert `MV_OVERLAP` into an `inType`
filter.

The SQL layer no longer generates calls to the `mv_harmonize_nulls` function,
because that logic has been moved inside `mv_overlap` and `mv_contains`. The
function still remains at the native layer for backwards compatibility.
@gianm gianm force-pushed the sql-mv-harmonize-more-nulls branch from d1cef64 to e7f9456 Compare June 5, 2025 21:38
),
ImmutableList.of(
new Object[]{"[\"a\",\"b\"]"},
new Object[]{"[\"b\",\"c\"]"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i suppose this test should have also selected dim2 to make it easier to verify heh, though I'm still kind of confused, what does the array constructor end up doing to an empty mvd row? I started to try to find out but then ran into a wierd problem with the grammar, you can't actually write array(array(...)) expression because it gets confused with some other junk in there about array literal parsing.. will look into both of these things later and just assume this is correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually was a bad change. The root cause was:

  • the array(dim2) function, when passed a dim2 that is a single null, would return an array<long> typed array with a single null in it.
  • the mv_overlap function would then cast the first arg to array<long>, which would yield an array of nulls.
  • the mv_overlap function would then return true because an array of nulls (first arg) does contain a null (second arg).

To fix it we'd need to cast for comparison. But I think that is sort of pointless since MV_CONTAINS and MV_OVERLAP really only make sense with strings. So instead, I updated them to cast the second arg to string array, and updated the tests accordingly.


@Override
protected String getFilterExpression(List<DruidExpression> druidExpressions)
public String getDruidFunctionName()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this still extend ArrayContainsOperatorConversion? I guess i see the benefit of overlapping filter conversion, but given that it uses some other native filters, such as ArrayContainsElementFilter, is that ok?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason for extending ArrayContainsOperatorConversion is to get the filter conversion behavior. This patch doesn't change whether ArrayContainsElementFilter could fire for MV_CONTAINS, but just to be safe, I added a check for this. (If the operator is not ARRAY_CONTAINS then it skips the array-element-filter block.)

newArgs
);
}
druidExpressions ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like the default implementation, so can just delete i think, unless i'm missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right, will delete it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, sorry for confusion, i meant of the entire toDruidExpression method override can be dropped i think, since this is effectively the same as DirectOperatorConversion implementation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I had deleted the other one, not this one you commented on. Just deleted this one too. Actually, I did a bigger change:

  • updated MV Contains and Overlap to call a superclass constructor with their operator and function names, rather than overriding calciteOperator() and getDruidFunctionName()
  • removed toDruidExpression and toDruidExpressionWithPostAggOperands from MV Contains and Overlap
  • marked calciteOperator() and getDruidFunctionName() as final in DirectOperatorConversion, so all the other methods on DirectOperatorConversion can be trusted to use the correct operator and function name

}

@Test
public void testMvContains()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some tests for what happens if other types of columns make it in here, like scalar numbers and number arrays and stuff? i think should be fine, just might be nice to have.. see there are tests for scalar strings which is 👍 just would like to see a bit more (and also maybe as bindings since different paths for singleThreaded for constant and dynamic args)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some tests. I didn't use bindings, but I don't think they're needed, because assertExpr tests both with and without single-threaded mode. Using constants should let both modes be exercised.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't use bindings, but I don't think they're needed, because assertExpr tests both with and without single-threaded mode.

oops yea, i forgot it did that 👍

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, though I didn't like validate the quidem test changes.. just going to trust that the changes there are correct

@gianm
Copy link
Contributor Author

gianm commented Jun 6, 2025

lgtm, though I didn't like validate the quidem test changes.. just going to trust that the changes there are correct

I think the quidem tests are OK- looking through them, they appear to mainly be differences in how null/empty cases are handled.

@gianm gianm merged commit a33cac7 into apache:master Jun 6, 2025
74 checks passed
@gianm gianm deleted the sql-mv-harmonize-more-nulls branch June 6, 2025 20:48
@capistrant capistrant added this to the 34.0.0 milestone Jul 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants