Skip to content

Add time-based muting to routing tree#2393

Merged
beorn7 merged 47 commits into
prometheus:masterfrom
benridley:dev_time_interval
Mar 1, 2021
Merged

Add time-based muting to routing tree#2393
beorn7 merged 47 commits into
prometheus:masterfrom
benridley:dev_time_interval

Conversation

@benridley

@benridley benridley commented Oct 13, 2020

Copy link
Copy Markdown
Contributor

Addresses #876

This adds the ability to define time intervals as per this design document.

Time intervals are defined in their own configuration section like so:

mute_time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times: 
        - start_time: "09:00"
          end_time: "17:00"

Then they can be referenced in the routing tree like so:

  # The root route on which each incoming alert enters.
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 1s
  group_interval: 30s

  # A default receiver
  receiver: team-X-mail
  routes:
    - match:
        alertname: NodeIsDown
      mute_times:
        - business_hours
      receiver: team-X-mail

Time zones are not supported at the moment because it would break the ability to run Alertmanager on Windows in a reliable way. Once this issue is addressed we can add support fairly easily.

@benridley benridley force-pushed the dev_time_interval branch 3 times, most recently from 7a16adb to 1323d17 Compare October 13, 2020 01:31
@benridley benridley marked this pull request as draft October 13, 2020 02:37
@benridley benridley marked this pull request as ready for review October 13, 2020 02:41
@benridley

Copy link
Copy Markdown
Contributor Author

Just added some additional tests for the config

@benridley

Copy link
Copy Markdown
Contributor Author

@simonpasquier Can you please let me know if there's anything else I can do for this one? Thank you!

@fredrik-j-lindberg

Copy link
Copy Markdown

This would help our team greatly. Hope you get time to look at this sooner rather than later! @simonpasquier

@juliusv

juliusv commented Nov 6, 2020

Copy link
Copy Markdown
Member

@simonpasquier There seems to be good interest in this one... in case you are held up by other stuff, maybe someone else wants to take a look? @roidelapluie @beorn7

I am unfortunately totally swamped at the moment.

Comment thread docs/configuration.md Outdated
Comment thread timeinterval/timeinterval.go Outdated
Comment thread timeinterval/timeinterval.go Outdated
Comment thread notify/notify.go Outdated
Comment thread notify/notify.go Outdated
Comment thread notify/notify.go Outdated
Comment thread notify/notify.go Outdated
Comment thread timeinterval/timeinterval.go Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we check somewhere for empty time intervals? WIth an empty time interval, this would always return true

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this makes sense due to the semantics of the time interval (where not specifying a restriction means all possible times are included), I think an empty time interval is probably a symptom of an error. I'll add a check for this, thanks.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've thought more about this, I think we should leave this as is. This makes sense with the semantics of a time interval and modifying it would complicate the function by possibly returning only a single error case.

Comment thread timeinterval/timeinterval.go Outdated
Comment thread timeinterval/timeinterval.go Outdated
Comment thread timeinterval/timeinterval.go Outdated
Comment thread timeinterval/timeinterval.go Outdated
Comment thread timeinterval/timeinterval.go Outdated
Comment thread timeinterval/timeinterval.go Outdated
Comment thread timeinterval/timeinterval.go Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we set them individualy somewhere? Can we do with one setter function ? Do we need those extra setters, can't we set EndDate directly?

Comment thread timeinterval/timeinterval_test.go Outdated
Comment thread cmd/alertmanager/main.go Outdated
Comment thread cmd/alertmanager/main.go Outdated
Comment thread cmd/alertmanager/main.go Outdated
Signed-off-by: Ben Ridley <benridley29@gmail.com>
Signed-off-by: Ben Ridley <benridley29@gmail.com>
Signed-off-by: Ben Ridley <benridley29@gmail.com>
Signed-off-by: Ben Ridley <benridley29@gmail.com>
…ther error message wording

Signed-off-by: Ben Ridley <benridley29@gmail.com>
Signed-off-by: Ben Ridley <benridley29@gmail.com>
Signed-off-by: Ben Ridley <benridley29@gmail.com>
Signed-off-by: Ben Ridley <benridley29@gmail.com>
…r feedback

Signed-off-by: Ben Ridley <benridley29@gmail.com>
@benridley

Copy link
Copy Markdown
Contributor Author

Thanks for validating that @beorn7. I've rebased and the tests are looking much happier.

Do we want to add some additional words around group_interval in the docs to indicate this behavior? The current documentation is:

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)

which doesn't give much indication of the issue we've discovered. Adding some words here might stop people from being surprised by delays in notification delivery.

@beorn7

beorn7 commented Mar 1, 2021

Copy link
Copy Markdown
Contributor

@benridley I totally agree that the current documentation is incomplete. However, let's first get clarity if the current behavior is actually the intended behavior or a bug (in which case we would document it very differently). I'll file an issue about this topic.

In the meantime, we can merge this PR.

@beorn7 beorn7 merged commit e66c803 into prometheus:master Mar 1, 2021
Comment thread docs/configuration.md
of the form <start_day>:<end_day> and are inclusive on both ends. For example:
`[‘monday:wednesday','saturday', 'sunday']`

`days_of_month_ramge`: A list of numerical days in the month. Days begin at 1.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

days_of_month_range?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want me to fix it?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here #2497

@lopesit0

Copy link
Copy Markdown

Can someone explain to me how to integrate this functionality?

@benridley

benridley commented Apr 19, 2021

Copy link
Copy Markdown
Contributor Author

Hey @lopesit0, no problem!
This has been merged into master now, but isn't part of an official release yet, so if you download Alertmanager using traditional methods (e.g. downloading them from https://prometheus.io/download/), this functionality won't be there. If you want to try it out, you can build the master branch yourself which is very easy thanks to the Golang toolchain. Simply download Go and then clone this repository from GitHub. Once you've done that, on the command line simply run

go build -o alertmanager cmd/alertmanager/main.go

which will give you an alertmanager binary with this included.

You can find the raw documentation also in the master branch which explains how to configure your time intervals.

Just be warned this is new and potentially unstable. If you do decide to run it, please feel free to report back here or open issues against the repository so we can fix then ASAP!

EDIT: Alternatively, I've realized there's a Docker image built from master. So the easiest way to try this out might be using the official docker image built from master.
docker pull prom/alertmanager:master

@lopesit0

Copy link
Copy Markdown

Hey @lopesit0, no problem!
This has been merged into master now, but isn't part of an official release yet, so if you download Alertmanager using traditional methods (e.g. downloading them from https://prometheus.io/download/), this functionality won't be there. If you want to try it out, you can build the master branch yourself which is very easy thanks to the Golang toolchain. Simply download Go and then clone this repository from GitHub. Once you've done that, on the command line simply run

go build -o alertmanager cmd/alertmanager/main.go

which will give you an alertmanager binary with this included.

You can find the raw documentation also in the master branch which explains how to configure your time intervals.

Just be warned this is new and potentially unstable. If you do decide to run it, please feel free to report back here or open issues against the repository so we can fix then ASAP!

EDIT: Alternatively, I've realized there's a Docker image built from master. So the easiest way to try this out might be using the official docker image built from master.
docker pull prom/alertmanager:master

Thanks for your answer :)

@ijungmann

Copy link
Copy Markdown

Is there any way to make use of this in the helm chart at the moment? It seems like the operator does not handle this yet.

@beorn7

beorn7 commented Apr 28, 2021

Copy link
Copy Markdown
Contributor

We need a release of Alertmanager. I think @roidelapluie was working on it, but there were flaky tests blocking it. Perhaps those are resolved now?

@roidelapluie

Copy link
Copy Markdown
Member

See #2557

@bimmerkiev

bimmerkiev commented May 14, 2021

Copy link
Copy Markdown

Hi all.
I'm trying to use that feature with release version 0.22.0-rc.1, but I have no luck with it.
My alertmanager.yml:

mute_time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
        - start_time: "09:00"
          end_time: "11:00"

templates:
- '/etc/alertmanager/templates/*.tmpl'
receivers:
- name: pagerduty
  pagerduty_configs:
  - client_url: http://x.x.x.x/
    description: '{{ if .CommonAnnotations.summary }}{{ .CommonAnnotations.summary
      }}{{ end }}'
    routing_key: xxx
    severity: '{{ if .CommonLabels.severity }}{{ .CommonLabels.severity | toLower
      }}{{ end }}'

route:
  group_by:
  - alertname
  - cluster
  - service
  - env
  group_interval: 5m
  group_wait: 30s
  receiver: pagerduty
  repeat_interval: 4h
  routes:
  - group_wait: 10s
    match:
      severity: low
      mute_time_intervals:
        - business_hours
      receiver: pagerduty

But with that config alertmanager don't want to start.
Part of log:

level=error ts=2021-05-14T13:38:06.667Z caller=coordinator.go:118 component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/alertmanager.yml err="yaml: unmarshal errors:\n  line 44: cannot unmarshal !!seq into string"

Can anyone help me?
Thanks

@bimmerkiev

Copy link
Copy Markdown

the error about line:
" - business_hours"

@brunocriado

Copy link
Copy Markdown

Hi 👋
Before sending an issue I wanted to confirm if I'm doing it right. I have a test configuration where I want to mute the alerts for for 1 minute on Mondays for a specific receiver as you can see in below.

...
  - group_by:
    - alertname
    - severity
    - env
    match:
      type: aws-alarm
      mute_times: test
    receiver: aws-alarm
    continue: true
...
mute_time_intervals:
- name: test
  time_intervals:
  - weekdays: ['monday']
    times:
    - start_time: '09:00'
      end_time: '09:01'

I'm testing with amtool and when I send the alert I didn't receive it. If I remove that time interval it go off as expected. BTW today is not Monday and I'm testing for more than 1 minute.

@brunocriado

brunocriado commented May 20, 2021

Copy link
Copy Markdown

Just found the issue. I should use mute_time_intervals and not inside match

  - group_by:
    - alertname
    - severity
    - env
    match:
      type: aws-alarm
    mute_time_intervals:
      - test
    receiver: aws-alarm

@rschirin

rschirin commented Jun 4, 2021

Copy link
Copy Markdown

sorry team, is there any chance to see this feature within the docker image?

@beorn7

beorn7 commented Jun 7, 2021

Copy link
Copy Markdown
Contributor

It has been released in v0.22. Any Docker image with that or a later release should support it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.