-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Enable synthetic source on normalized keyword mappings #126623
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Enable synthetic source on normalized keyword mappings #126623
Conversation
Hi @not-napoleon, I've created a changelog YAML for you. |
…ings-with-normalizers' into generate-mappings-with-normalizers
…ings-with-normalizers' into generate-mappings-with-normalizers
I'd like to see this passing tests with the normalizer always enabled, then I'll push a commit to randomize adding it and mark this ready for review. |
Hi @not-napoleon, I've updated the changelog YAML for you. Note that since this PR is labelled |
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
Hi @not-napoleon, I've updated the changelog YAML for you. Note that since this PR is labelled |
@@ -239,11 +252,15 @@ setup: | |||
- match: { columns.5.name: "non_indexed.raw" } | |||
- match: { columns.5.type: "keyword" } | |||
|
|||
- length: { values: 1 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@luigidellaquila I'd appreciate it if you could take a look at the changes in this test file. This PR changes the "exact subfield" behavior, and I want to make sure the ESQL team is aware of that change. Happy to discuss if you have any questions or concerns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks Mark!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added some questions and comments.
// to store the original value whose doc values would be altered by the normalizer | ||
return SyntheticSourceSupport.FALLBACK; | ||
} | ||
/* NOTE: we allow enabling synthetic source on Keyword fields with a Normalizer, even though the returned synthetic value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we mention why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
/* | ||
* If this is a sub-text field try and return the parent's loader. Text | ||
* fields will always be slow to load and if the parent is exact then we | ||
* should use that instead. | ||
*/ | ||
// TODO: should this be removed? I think SyntheticSourceHelper already does this: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SyntheticSourceHelper
is concerned with multi fields of the text
field. This logic is for when text
field is the multii field of a keyword
field.
{
"t": {
"type": "text",
"fields": {
"k": {
"type": "keyword"
}
}
}
vs
{
"k": {
"type": "keyword",
"fields": {
"t": {
"type": "text"
}
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could move this logic into SyntheticSourceHelper
just to be close to similar things.
@@ -1008,17 +1008,21 @@ protected String delegatingTo() { | |||
} | |||
}; | |||
} | |||
if (isStored()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did this move? There is a comment below that says that it is more efficient to use the parent keyword field. I don't really understand how is that the case if the keyword field is stored though but there must be some reasoning here. Maybe let's check with someone from ESQL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, this addresses the case where both the text field is stored. Without this change, we were falling back to the keyword field, which is "less exact" because of normalization, when we had a non-normalized stored text available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be moved all the way to the top then? Synthetic source delegate is not exact now either due to the change in getKeywordFieldMapperForSyntheticSource
below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stepping back a little bit - are we sure we want to make changes to text
at all? As you point out the idea is indeed to apply these optimizations when "parent is exact". So the check for normalizer here still serves its purpose because it protects from non-exactness.
Map<String, Object> actualMapping, | ||
Map<String, Object> expectedMapping | ||
) { | ||
this.normalizer = (String) FieldSpecificMatcher.getMappingParameter("normalizer", actualMapping, expectedMapping); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should re-implement match
using GenericMappingAwareMatcher
as a base (it's pretty short) instead of adding state here. Matchers are shared for all fields of the specific type and it's hard to reason about this state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've addressed this. I removed the state at least. Let me know if this is not what you had in mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this aligns with my thinking. I would apply normalizer to both actual and expected though for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about doing that, but I think if we expect normalization and it doesn't happen, we should fail the test.
Summary of the current situation with this change: We appear to be at a conflict in requirements for this feature, specifically regards how text fields with associated keyword fields should behave. My understanding is that ESQL strongly assumes that text fields will always be "exact", which is to say they will always be read as exactly what the user sent in the source. However, if we fall back to a normalized keyword field for a text field, that will not be exact, it will be normalized. Adopting proposal 1 leaves us exposed to the same storage issue this PR was intended to address, although on a somewhat more complicated mapping configuration. On the other hand, adopting proposal 2 breaks assumptions in ESQL, and at the very least requires a lot more testing to prove that nothing unexpected happens as result. We know there are some behavior changes from the test failures in 81_text_exact_subfields.yml already. After discussing with the team on 2025-06-11, I'm assigning this PR to @martijnvg to continue, due to upcoming vacation schedules and relative priority of other tasks. |
Hi @not-napoleon, I've updated the changelog YAML for you. Note that since this PR is labelled |
I think this test seed is useful here: This creates a normalized keyword with a text sub-field that is marked as |
This changes the default behavior for Synthetic Source on keyword fields using normalizers. Prior to this change, normalized keywords were always stored to allow returning the non-normalized values. Under this change, such field will NOT be stored (i.e they will be synthesized from the index when returning source, like all other synthetic source fields). This should result in considerable space improvement for this use case.
Users can opt out of this behavior on a per-field basis by setting
synthetic_source_keep
toall
on the field.This has some implications for ES|QL, in that keyword sub-fields of text fields may no longer be "exact". I've updated tests as I felt was correct for this changed assumption, but there may be additional code changes required in ES|QL.
Resolves #121358
Resolves #124369