Enable synthetic source on normalized keyword mappings #126623

not-napoleon · 2025-04-10T16:11:23Z

This changes the default behavior for Synthetic Source on keyword fields using normalizers. Prior to this change, normalized keywords were always stored to allow returning the non-normalized values. Under this change, such fields will NOT be stored (i.e., they will be synthesized from the index when returning source, like all other synthetic source fields). This should result in a considerable space improvement for this use case.

Users can opt out of this behavior on a per-field basis by setting synthetic_source_keep to all on the field.
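
To make the trade-off concrete, here is a rough sketch of what a reader might get back for a normalized keyword field before and after this change. It is not Elasticsearch code: `NormalizedKeywordSketch`, `storedSource`, and `synthesizedSource` are hypothetical names, and a lowercase normalizer is assumed. Doc values are modelled as a sorted, de-duplicated set of the normalized values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical model for illustration only; none of these names exist in Elasticsearch.
final class NormalizedKeywordSketch {

    // Assumed normalizer: a simple lowercase filter.
    static String normalize(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    // Previous behavior (and the opt-out): the original values are kept and returned untouched.
    static List<String> storedSource(List<String> input) {
        return new ArrayList<>(input);
    }

    // New default: source is rebuilt from doc values, so values come back
    // normalized, sorted, and de-duplicated.
    static List<String> synthesizedSource(List<String> input) {
        TreeSet<String> docValues = new TreeSet<>();
        for (String value : input) {
            docValues.add(normalize(value));
        }
        return new ArrayList<>(docValues);
    }

    public static void main(String[] args) {
        List<String> input = List.of("Foo", "BAR", "foo");
        System.out.println(storedSource(input));      // prints [Foo, BAR, foo]
        System.out.println(synthesizedSource(input)); // prints [bar, foo]
    }
}
```

The second line of output is the space-saving path: nothing beyond the doc values is kept, at the cost of losing the original casing, order, and duplicates.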

This has some implications for ES|QL, in that keyword sub-fields of text fields may no longer be "exact". I've updated tests as I felt was correct for this changed assumption, but there may be additional code changes required in ES|QL.

Resolves #121358
Resolves #124369

elasticsearchmachine · 2025-04-10T16:12:12Z

Hi @not-napoleon, I've created a changelog YAML for you.

not-napoleon · 2025-04-11T15:21:03Z

I'd like to see this passing tests with the normalizer always enabled, then I'll push a commit to randomize adding it and mark this ready for review.

elasticsearchmachine · 2025-05-27T14:13:16Z

Hi @not-napoleon, I've updated the changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

elasticsearchmachine · 2025-06-06T13:32:41Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

elasticsearchmachine · 2025-06-06T13:32:41Z

Hi @not-napoleon, I've updated the changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

not-napoleon · 2025-06-06T13:34:23Z

x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/esql/81_text_exact_subfields.yml

@@ -239,11 +252,15 @@ setup:
  - match: { columns.5.name: "non_indexed.raw" }
  - match: { columns.5.type: "keyword" }

-  - length: { values: 1 }


@luigidellaquila I'd appreciate it if you could take a look at the changes in this test file. This PR changes the "exact subfield" behavior, and I want to make sure the ESQL team is aware of that change. Happy to discuss if you have any questions or concerns.

martijnvg

LGTM, thanks Mark!

lkts

I have added some questions and comments.

lkts · 2025-06-06T16:38:29Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

-            // to store the original value whose doc values would be altered by the normalizer
-            return SyntheticSourceSupport.FALLBACK;
-        }
+        /* NOTE: we allow enabling synthetic source on Keyword fields with a Normalizer, even though the returned synthetic value


Can we mention why?

lkts · 2025-06-06T16:43:18Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

            /*
             * If this is a sub-text field try and return the parent's loader. Text
             * fields will always be slow to load and if the parent is exact then we
             * should use that instead.
             */
+            // TODO: should this be removed? I think SyntheticSourceHelper already does this:


SyntheticSourceHelper is concerned with multi fields of the text field. This logic is for when the text field is the multi field of a keyword field.

{ "t": { "type": "text", "fields": { "k": { "type": "keyword" } } } }

vs

{ "k": { "type": "keyword", "fields": { "t": { "type": "text" } } } }

You could move this logic into SyntheticSourceHelper just to be close to similar things.

lkts · 2025-06-06T16:45:12Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

@@ -1008,17 +1008,21 @@ protected String delegatingTo() {
                    }
                };
            }
+            if (isStored()) {


Why did this move? There is a comment below that says that it is more efficient to use the parent keyword field. I don't really understand how that is the case if the keyword field is stored, but there must be some reasoning here. Maybe let's check with someone from ESQL.

IIRC, this addresses the case where the text field is stored. Without this change, we were falling back to the keyword field, which is "less exact" because of normalization, when we had a non-normalized stored text available.

Shouldn't this be moved all the way to the top then? Synthetic source delegate is not exact now either due to the change in getKeywordFieldMapperForSyntheticSource below.

Stepping back a little bit - are we sure we want to make changes to text at all? As you point out the idea is indeed to apply these optimizations when "parent is exact". So the check for normalizer here still serves its purpose because it protects from non-exactness.

lkts · 2025-06-06T16:53:05Z

...ork/src/main/java/org/elasticsearch/datageneration/matchers/source/FieldSpecificMatcher.java

+            Map<String, Object> actualMapping,
+            Map<String, Object> expectedMapping
+        ) {
+            this.normalizer = (String) FieldSpecificMatcher.getMappingParameter("normalizer", actualMapping, expectedMapping);


I think you should re-implement match using GenericMappingAwareMatcher as a base (it's pretty short) instead of adding state here. Matchers are shared for all fields of the specific type and it's hard to reason about this state.

I think I've addressed this. I removed the state at least. Let me know if this is not what you had in mind.

Thanks, this aligns with my thinking. I would apply normalizer to both actual and expected though for consistency.

I thought about doing that, but I think if we expect normalization and it doesn't happen, we should fail the test.
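
The position in the last reply could be sketched roughly as follows. This is only an illustration: `NormalizerAwareMatchSketch` and its `matches` method are hypothetical and do not mirror the `FieldSpecificMatcher` or `GenericMappingAwareMatcher` APIs, and a lowercase normalizer is assumed. The normalizer is applied to the expected values only, so an actual value that was never normalized produces a mismatch instead of being silently accepted.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

// Hypothetical sketch; not the test framework's real matcher API.
final class NormalizerAwareMatchSketch {

    // Normalize only the expected side, then compare as sets
    // (doc values are assumed to be sorted and de-duplicated).
    static boolean matches(List<String> actual, List<String> expected, UnaryOperator<String> normalizer) {
        Set<String> normalizedExpected = new TreeSet<>();
        for (String value : expected) {
            normalizedExpected.add(normalizer.apply(value));
        }
        return normalizedExpected.equals(new TreeSet<>(actual));
    }

    public static void main(String[] args) {
        UnaryOperator<String> lowercase = s -> s.toLowerCase(Locale.ROOT);
        // Normalization happened: the values match.
        System.out.println(matches(List.of("foo"), List.of("Foo"), lowercase)); // prints true
        // Normalization was expected but did not happen: the test should fail.
        System.out.println(matches(List.of("Foo"), List.of("Foo"), lowercase)); // prints false
    }
}
```

Applying the normalizer to both sides would make the second call return true, which is the case this reply wants to keep failing.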

not-napoleon · 2025-06-11T17:04:43Z

Summary of the current situation with this change:

We appear to have a conflict in requirements for this feature, specifically regarding how text fields with associated keyword fields should behave. My understanding is that ESQL strongly assumes that text fields will always be "exact", which is to say they will always be read as exactly what the user sent in the source. However, if we fall back to a normalized keyword field for a text field, the value will not be exact; it will be normalized.
There are two proposals:
1 - If there is no exact keyword field, the text field should always be stored. If the user explicitly marks it as store: false, that should be a mapping error. (The comments in TextFieldMapper suggest this is how it works now, although my testing indicates there may be a bug there.)
2 - If there is a stored text field, we should use that, and otherwise we should use the keyword field regardless of whether it's normalized.

Adopting proposal 1 leaves us exposed to the same storage issue this PR was intended to address, although on a somewhat more complicated mapping configuration. On the other hand, adopting proposal 2 breaks assumptions in ESQL, and at the very least requires a lot more testing to prove that nothing unexpected happens as a result. We already know there are some behavior changes from the test failures in 81_text_exact_subfields.yml.
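
For readers comparing the two proposals, the following sketch spells out where a text field's value would come from under each one. It is a simplified model, not code from this PR: `TextSourceProposalsSketch`, `Origin`, and both methods are hypothetical names. It also assumes that under proposal 1 an exact (non-normalized) keyword field is used when one exists, which the summary above implies but does not state.

```java
// Hypothetical decision model; not taken from TextFieldMapper.
final class TextSourceProposalsSketch {

    enum Origin { STORED_TEXT, KEYWORD_DOC_VALUES, MAPPING_ERROR }

    // Proposal 1: without an exact keyword field, the text field must be stored,
    // and an explicit store: false is rejected at mapping time.
    static Origin proposalOne(boolean keywordIsNormalized, boolean storeExplicitlyFalse) {
        if (!keywordIsNormalized) {
            return Origin.KEYWORD_DOC_VALUES; // assumption: an exact keyword is used when available
        }
        return storeExplicitlyFalse ? Origin.MAPPING_ERROR : Origin.STORED_TEXT;
    }

    // Proposal 2: prefer a stored text field; otherwise use the keyword field,
    // even though its values may be normalized (and so not exact).
    static Origin proposalTwo(boolean textIsStored) {
        return textIsStored ? Origin.STORED_TEXT : Origin.KEYWORD_DOC_VALUES;
    }

    public static void main(String[] args) {
        // Normalized keyword with a text field explicitly marked store: false.
        System.out.println(proposalOne(true, true)); // prints MAPPING_ERROR
        System.out.println(proposalTwo(false));      // prints KEYWORD_DOC_VALUES
    }
}
```

The example in `main` is the contested configuration: proposal 1 refuses it (or pays the storage cost when the field is stored), while proposal 2 accepts it and returns normalized values.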

After discussing with the team on 2025-06-11, I'm assigning this PR to @martijnvg to continue, due to upcoming vacation schedules and relative priority of other tasks.

elasticsearchmachine · 2025-06-11T17:07:26Z

Hi @not-napoleon, I've updated the changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

not-napoleon · 2025-06-11T17:23:22Z

I think this test seed is useful here: ./gradlew ":server:test" --tests "org.elasticsearch.index.mapper.blockloader.TextFieldWithParentBlockLoaderTests.testBlockLoaderOfParentField {preference=Params[syntheticSource=true, preference=NONE]}" -Dtests.seed=D52BF7A78018C8A1

This creates a normalized keyword with a text sub-field that is marked as store: false; however, we still return the normalized value in the test. That's more or less exactly the problematic situation.

not-napoleon added 2 commits April 7, 2025 14:59

add tests with normalizer mappings

2a878cf

fix KeywordFieldBlockLoaderTests

0720fe7

not-napoleon added >enhancement :StorageEngine/Mapping The storage related side of mappings v8.19.0 v9.1.0 auto-backport Automatically create backport pull requests when merged labels Apr 10, 2025

Update docs/changelog/126623.yaml

625938f

github-actions bot deployed to docs-preview April 10, 2025 16:12 View deployment

not-napoleon added 3 commits April 10, 2025 16:27

proof of concept

e26e543

better, but not done

c8a0025

Merge remote-tracking branch 'refs/remotes/not-napoleon/generate-mapp…

f8194ae

…ings-with-normalizers' into generate-mappings-with-normalizers

github-actions bot deployed to docs-preview April 10, 2025 21:40 View deployment

[CI] Auto commit changes from spotless

6b20735

github-actions bot deployed to docs-preview April 10, 2025 21:50 View deployment

not-napoleon added 3 commits April 11, 2025 10:38

account for normalizing null values

e829823

remove code I'd only commened out

d2efb0b

Merge remote-tracking branch 'refs/remotes/not-napoleon/generate-mapp…

c2a11e3

…ings-with-normalizers' into generate-mappings-with-normalizers

github-actions bot deployed to docs-preview April 11, 2025 15:09 View deployment

Merge branch 'main' into generate-mappings-with-normalizers

417b084

github-actions bot deployed to docs-preview April 11, 2025 15:11 View deployment

fix yaml tests

949f42f

github-actions bot deployed to docs-preview April 11, 2025 17:18 View deployment

add test for opting out via keep all

0ed6a2b

github-actions bot deployed to docs-preview April 15, 2025 15:51 View deployment

not-napoleon added the >breaking label May 27, 2025

Update docs/changelog/126623.yaml

8d29695

Merge branch 'main' into generate-mappings-with-normalizers

f158c00

not-napoleon requested review from lkts, martijnvg and luigidellaquila and removed request for lkts June 6, 2025 13:32

elasticsearchmachine added the Team:StorageEngine label Jun 6, 2025

Update docs/changelog/126623.yaml

fbe4adb

github-actions bot deployed to docs-preview June 6, 2025 13:33 View deployment

not-napoleon commented Jun 6, 2025

fix changelog, again

0ca54e9

github-actions bot deployed to docs-preview June 6, 2025 13:44 View deployment

martijnvg approved these changes Jun 6, 2025

lkts reviewed Jun 6, 2025

response to PR feedback

86e5b22

github-actions bot deployed to docs-preview June 6, 2025 20:45 View deployment

not-napoleon added 3 commits June 10, 2025 17:02

Revert "Feedback from MVG on fields with both keyword and text subfie…

63bf7d1

…lds" This reverts commit aed9477.

Revert "fix exact subfield tests"

43b1075

This reverts commit 270f066.

leaving a note for future investigation

2b9d384

github-actions bot deployed to docs-preview June 11, 2025 16:55 View deployment

not-napoleon assigned martijnvg Jun 11, 2025

Update docs/changelog/126623.yaml

3b33b15

github-actions bot deployed to docs-preview June 11, 2025 17:08 View deployment

elasticsearchmachine added v9.2.0 and removed v9.1.0 labels Jun 26, 2025
