File-based settings health indicator #117081

prdoyle · 2024-11-19T21:00:41Z

We have been trying to alert on file-based settings failures by inferring badness from logs. We've made progress there, but ultimately we're having trouble with the alert recovery conditions.

This PR adds a file-based settings Health Indicator, so that we can alert directly on it instead of on the logs.

server/src/main/java/org/elasticsearch/reservedstate/service/FileSettingsService.java

prdoyle · 2024-11-20T14:57:34Z

Ok I've taken a different approach.

* All failures are now YELLOW.
* The details include a field called most_recent_failure, which is the Exception.toString of the exception.
* Our alerting can differentiate based on most_recent_failure.
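To make the approach above concrete, here is a minimal, self-contained sketch of the state such an indicator might track. It is not the code from this PR: the class name, the `failureOccurred(Exception)` signature, the two placeholder symptom strings, and the choice to clear `most_recent_failure` on success are all assumptions made for illustration. Only the YELLOW-on-any-failure rule, the `most_recent_failure` detail, and the no-changes symptom text come from this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the health indicator discussed in this PR; not the real class.
class FileSettingsHealthSketch {
    enum Status { GREEN, YELLOW }

    // Symptom text quoted from the _health_report output shown in this thread.
    static final String NO_CHANGES_SYMPTOM = "No file-based setting changes have occurred";
    // Placeholder symptom texts (assumptions, not the wording used by Elasticsearch).
    static final String SUCCESS_SYMPTOM = "placeholder: most recent file-based settings change succeeded";
    static final String FAILURE_SYMPTOM = "placeholder: most recent file-based settings change failed";

    private Status status = Status.GREEN;
    private String symptom = NO_CHANGES_SYMPTOM;
    private String mostRecentFailure; // null until a failure is recorded

    synchronized void successOccurred() {
        status = Status.GREEN;
        symptom = SUCCESS_SYMPTOM;
        mostRecentFailure = null; // assumption: a success clears the recorded failure
    }

    synchronized void failureOccurred(Exception e) {
        status = Status.YELLOW; // every failure is YELLOW, regardless of its cause
        symptom = FAILURE_SYMPTOM;
        mostRecentFailure = e.toString(); // Exception.toString, as described above
    }

    synchronized Status status() { return status; }

    synchronized String symptom() { return symptom; }

    // Details exposed alongside the status; empty unless a failure has been recorded.
    synchronized Map<String, Object> details() {
        Map<String, Object> details = new LinkedHashMap<>();
        if (mostRecentFailure != null) {
            details.put("most_recent_failure", mostRecentFailure);
        }
        return details;
    }

    public static void main(String[] args) {
        FileSettingsHealthSketch health = new FileSettingsHealthSketch();
        System.out.println(health.status() + " " + health.symptom() + " " + health.details());
        health.failureOccurred(new IllegalStateException("bad settings file"));
        System.out.println(health.status() + " " + health.symptom() + " " + health.details());
        health.successOccurred();
        System.out.println(health.status() + " " + health.symptom() + " " + health.details());
    }
}
```

Because the status is the same for every failure, an alert that needs to tell failures apart would have to inspect the `most_recent_failure` string rather than the status.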

elasticsearchmachine · 2024-11-20T22:20:41Z

Pinging @elastic/es-core-infra (Team:Core/Infra)

dakrone

LGTM, I left one comment. I wanted to do some manual testing, but it's not blocking.

dakrone · 2024-11-20T21:15:39Z

server/src/main/java/org/elasticsearch/reservedstate/service/FileSettingsService.java

+            completion.onResponse(null);
+            healthIndicatorService.successOccurred();


In the failure case you call .failureOccurred prior to the completion invocation, but here the order is switched. Should the .successOccurred() call be before the completion.onResponse(null)?

prdoyle

I didn't want to report a success if onResponse throws. Conversely, I did want to report a failure even if onFailure throws.
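The ordering argument in this exchange can be checked with a small, self-contained sketch. Everything in it is hypothetical: `Completion` is a simplified stand-in for the real listener type, `RecordingHealth` only records which calls were made, and the two helper methods are invented for illustration. Only the call order on each path is taken from the discussion.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the call ordering discussed above; not code from the PR.
class CompletionOrderingSketch {
    // Simplified stand-in for a completion listener.
    interface Completion {
        void onResponse(Void result);
        void onFailure(Exception e);
    }

    // Records which health-indicator calls were made, in order.
    static final class RecordingHealth {
        final List<String> events = new ArrayList<>();
        void successOccurred() { events.add("success"); }
        void failureOccurred(Exception e) { events.add("failure: " + e); }
    }

    // Success path: notify the listener first, so a throwing listener prevents the success report.
    static void completeSuccessfully(Completion completion, RecordingHealth health) {
        completion.onResponse(null);
        health.successOccurred();
    }

    // Failure path: report the failure first, so it is recorded even if the listener throws.
    static void completeWithFailure(Completion completion, RecordingHealth health, Exception cause) {
        health.failureOccurred(cause);
        completion.onFailure(cause);
    }

    public static void main(String[] args) {
        // A listener whose callbacks always throw, to exercise both orderings.
        Completion throwing = new Completion() {
            @Override public void onResponse(Void result) { throw new IllegalStateException("listener failed"); }
            @Override public void onFailure(Exception e) { throw new IllegalStateException("listener failed"); }
        };
        RecordingHealth health = new RecordingHealth();
        try {
            completeSuccessfully(throwing, health);
        } catch (IllegalStateException expected) {
            // onResponse threw, so successOccurred() was never reached
        }
        try {
            completeWithFailure(throwing, health, new IllegalArgumentException("bad settings file"));
        } catch (IllegalStateException expected) {
            // onFailure threw, but failureOccurred() had already run
        }
        System.out.println(health.events);
    }
}
```

With a listener that throws from both callbacks, only the failure ends up recorded, which matches the intent stated in the reply: never report a success that did not complete cleanly, and never lose a failure.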

rjernst

What happens for nodes that don't ever read file settings? They are only used by ECK/serverless, and even then only the master node actually reads them. So on other nodes, the indicator would always be in a yellow state?

prdoyle · 2024-11-21T13:44:40Z

@rjernst that should be covered by the NO_CHANGES_SYMPTOM case. At least, that's the intent.

prdoyle · 2024-11-21T14:09:10Z

I ran it locally, and the response to curl -u elastic-admin:elastic-password http://localhost:9200/_health_report includes this:

"file_settings" : {
   "status" : "green",
   "symptom" : "No file-based setting changes have occurred"
},

* Add FileSettingsService health indicator
* spotless
* YELLOW for any failure, plus most_recent_failure

prdoyle added 2 commits November 19, 2024 15:49

Add FileSettingsService health indicator

d830584

spotless

4b64255

prdoyle added the >non-issue and :Core/Infra/Settings (Settings infrastructure and APIs) labels Nov 19, 2024

prdoyle self-assigned this Nov 19, 2024

elasticsearchmachine added the v9.0.0 label Nov 19, 2024

prdoyle commented Nov 19, 2024


dakrone reviewed Nov 19, 2024


YELLOW for any failure, plus most_recent_failure

f1b1d78

prdoyle added 2 commits November 20, 2024 12:18

Merge branch 'main' into failure-streak-health

8cef373

Merge branch 'main' into failure-streak-health

6537515

prdoyle marked this pull request as ready for review November 20, 2024 22:20

prdoyle requested a review from a team as a code owner November 20, 2024 22:20

elasticsearchmachine added the Team:Core/Infra (Meta label for core/infra team) label Nov 20, 2024

dakrone approved these changes Nov 20, 2024


rjernst reviewed Nov 20, 2024


prdoyle merged commit 1a4b3d3 into elastic:main Nov 21, 2024
16 checks passed

prdoyle deleted the failure-streak-health branch November 21, 2024 14:10

This was referenced Nov 22, 2024

Add FileSettingsIndicator elastic/elasticsearch-specification#3172

Closed

File settings health indicator elastic/elasticsearch-specification#3173

Merged

alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024

File-based settings health indicator (elastic#117081)

1b8afc5

* Add FileSettingsService health indicator
* spotless
* YELLOW for any failure, plus most_recent_failure

ldematte mentioned this pull request Dec 6, 2024

[CI] FileSettingsServiceTests testStopWorksInMiddleOfProcessing failing #117591

Closed
