Skip to content

Enable synthetic source on normalized keyword mappings #126623

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 41 commits into
base: main
Choose a base branch
from

Conversation

not-napoleon
Copy link
Member

@not-napoleon not-napoleon commented Apr 10, 2025

This changes the default behavior for Synthetic Source on keyword fields using normalizers. Prior to this change, normalized keywords were always stored to allow returning the non-normalized values. Under this change, such field will NOT be stored (i.e they will be synthesized from the index when returning source, like all other synthetic source fields). This should result in considerable space improvement for this use case.

Users can opt out of this behavior on a per-field basis by setting synthetic_source_keep to all on the field.

This has some implications for ES|QL, in that keyword sub-fields of text fields may no longer be "exact". I've updated tests as I felt was correct for this changed assumption, but there may be additional code changes required in ES|QL.

Resolves #121358
Resolves #124369

@not-napoleon not-napoleon added >enhancement :StorageEngine/Mapping The storage related side of mappings v8.19.0 v9.1.0 auto-backport Automatically create backport pull requests when merged labels Apr 10, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @not-napoleon, I've created a changelog YAML for you.

@not-napoleon
Copy link
Member Author

I'd like to see this passing tests with the normalizer always enabled, then I'll push a commit to randomize adding it and mark this ready for review.

@elasticsearchmachine
Copy link
Collaborator

Hi @not-napoleon, I've updated the changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine
Copy link
Collaborator

Hi @not-napoleon, I've updated the changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

@@ -239,11 +252,15 @@ setup:
- match: { columns.5.name: "non_indexed.raw" }
- match: { columns.5.type: "keyword" }

- length: { values: 1 }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luigidellaquila I'd appreciate it if you could take a look at the changes in this test file. This PR changes the "exact subfield" behavior, and I want to make sure the ESQL team is aware of that change. Happy to discuss if you have any questions or concerns.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Mark!

Copy link
Contributor

@lkts lkts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added some questions and comments.

// to store the original value whose doc values would be altered by the normalizer
return SyntheticSourceSupport.FALLBACK;
}
/* NOTE: we allow enabling synthetic source on Keyword fields with a Normalizer, even though the returned synthetic value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we mention why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

/*
* If this is a sub-text field try and return the parent's loader. Text
* fields will always be slow to load and if the parent is exact then we
* should use that instead.
*/
// TODO: should this be removed? I think SyntheticSourceHelper already does this:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SyntheticSourceHelper is concerned with multi fields of the text field. This logic is for when text field is the multii field of a keyword field.

{
  "t": {
    "type": "text",
    "fields": {
      "k": {
        "type": "keyword"
      }
  }
}

vs

{
  "k": {
    "type": "keyword",
    "fields": {
      "t": {
        "type": "text"
      }
    }
  }
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could move this logic into SyntheticSourceHelper just to be close to similar things.

@@ -1008,17 +1008,21 @@ protected String delegatingTo() {
}
};
}
if (isStored()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this move? There is a comment below that says that it is more efficient to use the parent keyword field. I don't really understand how is that the case if the keyword field is stored though but there must be some reasoning here. Maybe let's check with someone from ESQL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, this addresses the case where both the text field is stored. Without this change, we were falling back to the keyword field, which is "less exact" because of normalization, when we had a non-normalized stored text available.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be moved all the way to the top then? Synthetic source delegate is not exact now either due to the change in getKeywordFieldMapperForSyntheticSource below.

Copy link
Contributor

@lkts lkts Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stepping back a little bit - are we sure we want to make changes to text at all? As you point out the idea is indeed to apply these optimizations when "parent is exact". So the check for normalizer here still serves its purpose because it protects from non-exactness.

Map<String, Object> actualMapping,
Map<String, Object> expectedMapping
) {
this.normalizer = (String) FieldSpecificMatcher.getMappingParameter("normalizer", actualMapping, expectedMapping);
Copy link
Contributor

@lkts lkts Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should re-implement match using GenericMappingAwareMatcher as a base (it's pretty short) instead of adding state here. Matchers are shared for all fields of the specific type and it's hard to reason about this state.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've addressed this. I removed the state at least. Let me know if this is not what you had in mind.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this aligns with my thinking. I would apply normalizer to both actual and expected though for consistency.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about doing that, but I think if we expect normalization and it doesn't happen, we should fail the test.

@not-napoleon
Copy link
Member Author

Summary of the current situation with this change:

We appear to be at a conflict in requirements for this feature, specifically regards how text fields with associated keyword fields should behave. My understanding is that ESQL strongly assumes that text fields will always be "exact", which is to say they will always be read as exactly what the user sent in the source. However, if we fall back to a normalized keyword field for a text field, that will not be exact, it will be normalized.
There are two proposals:
1 - If there is no exact keyword field, the text field should always be stored. If the user explicitly marks it as store: false, that should be a mapping error. (The comments in TextFieldMapper suggest this is how it works now, although my testing indicates there may be a bug there)
2 - If there is a stored text field, we should use that, and otherwise we should use the keyword field regardless of if it's normalized or not.

Adopting proposal 1 leaves us exposed to the same storage issue this PR was intended to address, although on a somewhat more complicated mapping configuration. On the other hand, adopting proposal 2 breaks assumptions in ESQL, and at the very least requires a lot more testing to prove that nothing unexpected happens as result. We know there are some behavior changes from the test failures in 81_text_exact_subfields.yml already.

After discussing with the team on 2025-06-11, I'm assigning this PR to @martijnvg to continue, due to upcoming vacation schedules and relative priority of other tasks.

@elasticsearchmachine
Copy link
Collaborator

Hi @not-napoleon, I've updated the changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

@not-napoleon
Copy link
Member Author

I think this test seed is useful here: ./gradlew ":server:test" --tests "org.elasticsearch.index.mapper.blockloader.TextFieldWithParentBlockLoaderTests.testBlockLoaderOfParentField {preference=Params[syntheticSource=true, preference=NONE]}" -Dtests.seed=D52BF7A78018C8A1

This creates a normalized keyword with a text sub-field that is marked as store: false, however we still return the normalized value in the test. That's more or less exactly the problematic situation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged >breaking >enhancement :StorageEngine/Mapping The storage related side of mappings Team:StorageEngine v8.19.0 v9.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable native synthetic source for keyword fields with normalizers Support normalizer mapping parameter for keyword fields in test data generation
4 participants