Skip to content

File-based settings health indicator #117081

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Nov 21, 2024
Merged

Conversation

prdoyle
Copy link
Contributor

@prdoyle prdoyle commented Nov 19, 2024

We have been trying to alert on file-based settings failures by inferring badness from logs. We've made progress there, but ultimately we're having trouble with the alert recovery conditions.

This PR adds a file-based settings Health Indicator, and we can alert directly on that instead of the logs.

@prdoyle prdoyle added >non-issue :Core/Infra/Settings Settings infrastructure and APIs labels Nov 19, 2024
@prdoyle prdoyle self-assigned this Nov 19, 2024
@prdoyle
Copy link
Contributor Author

prdoyle commented Nov 20, 2024

Ok I've taken a different approach.

  • All failures are now YELLOW
  • The details include a field called most_recent_failure which is the Exception.toString of the exception
  • Our alerting can differentiate based on most_recent_failure

@prdoyle prdoyle marked this pull request as ready for review November 20, 2024 22:20
@prdoyle prdoyle requested a review from a team as a code owner November 20, 2024 22:20
@elasticsearchmachine elasticsearchmachine added the Team:Core/Infra Meta label for core/infra team label Nov 20, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I left one comment. I wanted to do some manual testing, but it's not blocking.

Comment on lines +181 to +182
completion.onResponse(null);
healthIndicatorService.successOccurred();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the failure case you call .failureOccurred prior to the completion invocation, but here the order is switched. Should the .successOccurred() call be before the completion.onResponse(null)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to report a success if onResponse throws. Conversely, I did want to report a failure even if onFailure throws.

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens for nodes that don't ever read file settings? They are only used by ECK/serverless, and even then only the master node actually reads them. So on other nodes, the indicator would always be in a yellow state?

@prdoyle
Copy link
Contributor Author

prdoyle commented Nov 21, 2024

@rjernst that should be covered by the NO_CHANGES_SYMPTOM case. At least, that's the intent.

@prdoyle
Copy link
Contributor Author

prdoyle commented Nov 21, 2024

I ran it locally, and the curl -u elastic-admin:elastic-password http://localhost:9200/_health_report response includes this:

"file_settings" : {
   "status" : "green",
   "symptom" : "No file-based setting changes have occurred"
},

@prdoyle prdoyle merged commit 1a4b3d3 into elastic:main Nov 21, 2024
16 checks passed
@prdoyle prdoyle deleted the failure-streak-health branch November 21, 2024 14:10
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
* Add FileSettingsService health indicator

* spotless

* YELLOW for any failure, plus most_recent_failure
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Settings Settings infrastructure and APIs >non-issue Team:Core/Infra Meta label for core/infra team v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants