Skip to content

ESQL: TO_LOWER process all values #124676

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 13, 2025
Merged

ESQL: TO_LOWER process all values #124676

merged 6 commits into from
Mar 13, 2025

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Mar 12, 2025

Make TO_LOWER and TO_UPPER process all values it receives.

This is quite large because it borrows a lot of code from the regular evaluator generator to generate conversions so we can use the Locale. That change propagates to the order of some parameters and to the toString and a few more places.

Closes #124002

nik9000 added 2 commits March 12, 2025 16:06
Make `TO_LOWER` and `TO_UPPER` process all values it received.

This is quite large because it borrows a lot of code from the regular
evaluator generator to generate conversions so we can use the Locale.
That change propagates to the order of some parameters and to the
`toString` and a few more places.

Closes elastic#124002
@nik9000 nik9000 requested a review from bpintea March 12, 2025 20:13
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Mar 12, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Hi @nik9000, I've created a changelog YAML for you.

@nik9000 nik9000 changed the title ESQL: TO_LOWER processes all values ESQL: TO_LOWER process all values Mar 12, 2025
@nik9000
Copy link
Member Author

nik9000 commented Mar 13, 2025

I believe this one doesn't require anything else around push down - we push down TO_LOWER0-ed comparisons to lucene. All of those comparisons need single valued fields though. I'm going to have a look at how that's done to much sure there's nothing to do.

@nik9000
Copy link
Member Author

nik9000 commented Mar 13, 2025

Looks like the PR that enabled push down is #118870.

Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nik9000
Copy link
Member Author

nik9000 commented Mar 13, 2025

This doesn't really change the benchmarks:

(operation)  Mode  Cnt   Score   Error -->  Score   Error Units
   to_lower  avgt    7  29.455 ± 2.330 --> 31.079 ± 1.574 ns/op
   to_upper  avgt    7  32.693 ± 0.344 --> 33.438 ± 0.336 ns/op

Well, it does, but it's pretty within the error.

@nik9000 nik9000 merged commit 1e25a54 into elastic:main Mar 13, 2025
17 checks passed
@nik9000
Copy link
Member Author

nik9000 commented Mar 13, 2025

I double checked push down and it all looks good. The equality still neesd single values.

jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Mar 13, 2025
Make `TO_LOWER` and `TO_UPPER` process all values it received.

This is quite large because it borrows a lot of code from the regular
evaluator generator to generate conversions so we can use the Locale.
That change propagates to the order of some parameters and to the
`toString` and a few more places.

Closes elastic#124002
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Apr 7, 2025
Speed up the TO_IP method by converting directly from utf-8 encoded
strings to the ip encoding. Previously we did:
```
utf-8 -> String -> INetAddress -> ip encoding
```

In a step towards solving elastic#125460 this creates three IP parsing
functions, one the rejects leading zeros, one that interprets leading
zeros as decimal numbers, and one the interprets leading zeros as octal
numbers. IPs have historically been parsed in all three of those ways.

This plugs the "rejects leading zeros" parser into `TO_IP` because
that's the behavior it had before.

Here is the performance:
```
Benchmark               Score    Error  Units
leadingZerosAreDecimal  14.007 ± 0.093  ns/op
leadingZerosAreOctal    15.020 ± 0.373  ns/op
leadingZerosRejected    14.176 ± 3.861  ns/op
original                32.950 ± 1.062  ns/op
```

So this is roughly 45% faster than what we had.

This includes a big chunk of elastic#124676 - but not the behavior change -
just the code that allowed it.
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Apr 7, 2025
Speed up the TO_IP method by converting directly from utf-8 encoded
strings to the ip encoding. Previously we did:
```
utf-8 -> String -> INetAddress -> ip encoding
```

In a step towards solving elastic#125460 this creates three IP parsing
functions, one the rejects leading zeros, one that interprets leading
zeros as decimal numbers, and one the interprets leading zeros as octal
numbers. IPs have historically been parsed in all three of those ways.

This plugs the "rejects leading zeros" parser into `TO_IP` because
that's the behavior it had before.

Here is the performance:
```
Benchmark               Score    Error  Units
leadingZerosAreDecimal  14.007 ± 0.093  ns/op
leadingZerosAreOctal    15.020 ± 0.373  ns/op
leadingZerosRejected    14.176 ± 3.861  ns/op
original                32.950 ± 1.062  ns/op
```

So this is roughly 45% faster than what we had.

This includes a big chunk of elastic#124676 - but not the behavior change -
just the code that allowed it.
elasticsearchmachine pushed a commit that referenced this pull request Apr 7, 2025
Speed up the TO_IP method by converting directly from utf-8 encoded
strings to the ip encoding. Previously we did:
```
utf-8 -> String -> INetAddress -> ip encoding
```

In a step towards solving #125460 this creates three IP parsing
functions, one the rejects leading zeros, one that interprets leading
zeros as decimal numbers, and one the interprets leading zeros as octal
numbers. IPs have historically been parsed in all three of those ways.

This plugs the "rejects leading zeros" parser into `TO_IP` because
that's the behavior it had before.

Here is the performance:
```
Benchmark               Score    Error  Units
leadingZerosAreDecimal  14.007 ± 0.093  ns/op
leadingZerosAreOctal    15.020 ± 0.373  ns/op
leadingZerosRejected    14.176 ± 3.861  ns/op
original                32.950 ± 1.062  ns/op
```

So this is roughly 45% faster than what we had.

This includes a big chunk of #124676 - but not the behavior change -
just the code that allowed it.
elasticsearchmachine pushed a commit that referenced this pull request Apr 7, 2025
Speed up the TO_IP method by converting directly from utf-8 encoded
strings to the ip encoding. Previously we did:
```
utf-8 -> String -> INetAddress -> ip encoding
```

In a step towards solving #125460 this creates three IP parsing
functions, one the rejects leading zeros, one that interprets leading
zeros as decimal numbers, and one the interprets leading zeros as octal
numbers. IPs have historically been parsed in all three of those ways.

This plugs the "rejects leading zeros" parser into `TO_IP` because
that's the behavior it had before.

Here is the performance:
```
Benchmark               Score    Error  Units
leadingZerosAreDecimal  14.007 ± 0.093  ns/op
leadingZerosAreOctal    15.020 ± 0.373  ns/op
leadingZerosRejected    14.176 ± 3.861  ns/op
original                32.950 ± 1.062  ns/op
```

So this is roughly 45% faster than what we had.

This includes a big chunk of #124676 - but not the behavior change -
just the code that allowed it.
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 6, 2025
Make `TO_LOWER` and `TO_UPPER` process all values it received.

This is quite large because it borrows a lot of code from the regular
evaluator generator to generate conversions so we can use the Locale.
That change propagates to the order of some parameters and to the
`toString` and a few more places.

Closes elastic#124002
@nik9000 nik9000 added the v8.19.0 label May 6, 2025
elasticsearchmachine pushed a commit that referenced this pull request May 8, 2025
Make `TO_LOWER` and `TO_UPPER` process all values it received.

This is quite large because it borrows a lot of code from the regular
evaluator generator to generate conversions so we can use the Locale.
That change propagates to the order of some parameters and to the
`toString` and a few more places.

Closes #124002
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.19.0 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ESQL: TO_LOWER and TO_UPPER should keep multivalued fields
3 participants