Skip to content

Handle unavailable MD5 in ES|QL #130158

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jul 3, 2025
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions docs/changelog/130158.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
pr: 130158
summary: Handle unavailable MD5 in ES|QL
area: ES|QL
type: bug
issues: []
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@

import org.elasticsearch.common.io.stream.NamedWriteableRegistry;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.xpack.esql.VerificationException;
import org.elasticsearch.xpack.esql.core.expression.Expression;
import org.elasticsearch.xpack.esql.core.tree.NodeInfo;
import org.elasticsearch.xpack.esql.core.tree.Source;
Expand All @@ -18,13 +19,16 @@
import org.elasticsearch.xpack.esql.expression.function.scalar.string.Hash.HashFunction;

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class Md5 extends AbstractHashFunction {

public static final NamedWriteableRegistry.Entry ENTRY = new NamedWriteableRegistry.Entry(Expression.class, "MD5", Md5::new);

private static final HashFunction MD5 = HashFunction.create("MD5");
private final AtomicReference<HashFunction> MD5 = new AtomicReference<>();

@FunctionInfo(
returnType = "keyword",
Expand All @@ -39,9 +43,23 @@ private Md5(StreamInput in) throws IOException {
super(in);
}

/**
* As of Java 14, it is permissible for a JRE to ship without the {@code MD5} {@link MessageDigest}.
* We want the "md5" function in ES|QL to fail at runtime on such platforms (rather than at startup)
* so we build the {@link HashFunction} lazily.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the same applicable for SHA functions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably worth mentioning that this might not be available in certain JVMs at

Copy link
Contributor Author

@tvernum tvernum Jun 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, as at Java 24, both SHA-1 and SHA-256 are still required

I could change those functions to work similarly, but I opted for the smallest change that solves the known problem. I actually started with a change in HashFunction that would be lenient for any algorithm, but switched to something MD5 specific.

*/
@Override
protected HashFunction getHashFunction() {
return MD5;
HashFunction function = MD5.get();
if (function == null) {
try {
function = new HashFunction("MD5", MessageDigest.getInstance("MD5"));
MD5.compareAndSet(null, function);
} catch (NoSuchAlgorithmException e) {
throw new VerificationException("function 'md5' is not available on this platform: {}", e.getMessage());
}
}
return function;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if we do not try to initialize it for every query even if it is not available. May be something like:

    private static final HashFunction MD5 = HashFunction.tryCreate("MD5");

and in Hash.java:

        public static HashFunction tryCreate(String algorithm) {
            try {
                return new HashFunction(algorithm, MessageDigest.getInstance(algorithm));
            } catch (NoSuchAlgorithmException e) {
                logger.warn("[{}] hash fuinction is not available", algorithm);
                return null;
            }
        }

This would also mean we need to modify org.elasticsearch.xpack.esql.expression.function.scalar.string.AbstractHashFunction#toEvaluator to use different (warning/error only) hash evaluator when getHashFunction returns null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if we do not try to initialize it for every query even if it is not available.

So would I.
We're lacking an appropriate Result (Maybe) type, that has either a value or an error.
That is what I really wanted to use here, but opted for a lazy reference instead.

I can handle it as you describe, the issue is that we lose the original exception (its only in the logs) and have to treat null as a marker for "not available".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly it worth introducing wrapper that captures exception and re-throws it when executed in new or existing constant evaluator.

Copy link
Contributor

@astefan astefan Jun 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is JVM specific, I am wondering if there are scenarios (like from test | eval hash=hash("md5", message) | stats count(hash) by foo where the stats is a pipeline breaker which sends the plan fragment to each data node) the hash is computed at data node level and if data nodes have different JVMs than that of the coordinator, a lazy approach is actually the correct approach (like what @tvernum is suggesting).

Sorry if I am mentioning something obvious, but thought on voicing it loud rather than just assuming this is already considered.

Copy link
Contributor

@idegtiarenko idegtiarenko Jun 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simple from index | eval hash = md5(field) would spread md5 evaluation to each node too.
Since md5 is not going to become available after then node has started there is no need to try to initialize it more than once after the start. In case of nodes running distinct JVMs in this approach we might have nulls and actual hashes from different nodes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simple from index | eval hash = md5(field) would spread md5 evaluation to each node too

It doesn't. The hash is computed on the coordinator after each node replies back with 1000 (the default) rows. The coordinator will pick the final 1000 rows and then compute the hash on them.

This is how a physical plan would look like on the coordinator:


EvalExec[[HASH(md5[KEYWORD],field{f}#119) AS hash#115]]
 \_LimitExec[1000[INTEGER],null]
   \_ExchangeExec[[field{f}#119],false]
     \_FragmentExec[fragment=[<>
													Project[[field{f}#119]]
													\_Limit[1000[INTEGER],false]
                                                       \_EsRelation[index][field{f}#119]<>]]

Copy link
Contributor Author

@tvernum tvernum Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I introduced a Result so that we only try to obtain the message digest once (although that's not strictly true, because message digests aren't thread safe, so we actually obtain a new one each time the HashFunction is copied.)

}

@Override
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/

package org.elasticsearch.xpack.esql.expression.function.scalar.string;

import org.elasticsearch.test.ESTestCase;
import org.elasticsearch.xpack.esql.VerificationException;
import org.elasticsearch.xpack.esql.core.expression.Literal;
import org.elasticsearch.xpack.esql.core.tree.Source;

import java.security.Provider;
import java.security.Security;

import static org.hamcrest.Matchers.notNullValue;

public class Md5UnavailableTests extends ESTestCase {

public void testMd5Unavailable() {
assumeFalse("We run with different security providers in FIPS, and changing them at runtime is more complicated", inFipsJvm());
final Provider sunProvider = Security.getProvider("SUN");
final Md5 md5;
try {
Security.removeProvider("SUN");
md5 = new Md5(Source.EMPTY, Literal.NULL);
expectThrows(VerificationException.class, md5::getHashFunction);
} finally {
Security.addProvider(sunProvider);
}

assertThat(md5.getHashFunction(), notNullValue());
}

}
Loading