Skip to content

Update alertmanager to upstream v0.15.1 with memberlist #929

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Aug 16, 2019

Conversation

bboreham
Copy link
Contributor

@bboreham bboreham commented Aug 9, 2018

Fixes #793
Fixes #1205
Fixes #343 because that code is removed
Fixes #899 because that code is removed
Fixes #900 because the message is now at debug level upstream

Options like -alertmanager.mesh.peer.service from the previous implementation are removed.

In a Kubernetes deployment, it can be run as a StatefulSet: suppose the members of the set are named a1, a2 and a3, then all can be run as alertmanager -peer a1 -peer a2 -peer a3.

I have tested as individual Docker containers:

docker network create cortex
docker run --name=configs --net=cortex -d quay.io/weaveworks/cortex-configs --database.uri=memory://
docker run --name=a1 --net=cortex -d quay.io/weaveworks/cortex-alertmanager --alertmanager.configs.url=configs -cluster.peer=a2:9094 --alertmanager.web.external-url=/api/prom/alertmanager
docker run --name=a2 --net=cortex -d quay.io/weaveworks/cortex-alertmanager --alertmanager.configs.url=http://configs -cluster.peer=a1:9094 --alertmanager.web.external-url=/api/prom/alertmanager

Then tried various curl commands against the API.

@bboreham bboreham force-pushed the update-alertmanager branch 2 times, most recently from 249336a to d805c18 Compare August 9, 2018 16:11
@bboreham bboreham force-pushed the update-alertmanager branch 2 times, most recently from b811561 to a38a9c2 Compare August 23, 2018 12:08
Copy link
Contributor

@csmarchbanks csmarchbanks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine code wise.

Have you tried deploying it and having it try to discover other members based on the alertmanager service url? I am hoping not to do a stateful set if possible.

@bboreham
Copy link
Contributor Author

No I have not tried via discovery. My vague memory is that it won't work as-is.

Also I have not done the config for statefulset, but in my mind statefulset seems better in every way for a component like this. Interested to hear what makes you avoid them.

@csmarchbanks
Copy link
Contributor

The only issue I have with stateful sets is I think you have to make an entirely new set (or delete the old set) to change values that aren't image or resource constraints. Not a big deal, but a bit of a hassle.

@khaines
Copy link
Contributor

khaines commented Sep 16, 2018

The only issue I have with stateful sets is I think you have to make an entirely new set (or delete the old set) to change values that aren't image or resource constraints. Not a big deal, but a bit of a hassle.

This is really the only pain point we have with stateful sets in our environment. In order to do any changes that are not the few fields that k8s allows you to change, one must remove the entire stateful set first, thus incurring some level of downtime to the component.

@khaines
Copy link
Contributor

khaines commented Sep 16, 2018

if the peer list could have been rendered into a config file and loaded via config map, this might avoid the issue with updating fields in a Statefulset spec.

@bboreham bboreham changed the title Update alertmanager to upstream v0.15.1 with memberlist WIP: Update alertmanager to upstream v0.15.1 with memberlist Jan 16, 2019
@rndstr
Copy link
Contributor

rndstr commented Jan 26, 2019

this would also close #1205

@bboreham bboreham force-pushed the update-alertmanager branch 2 times, most recently from 2810427 to 26d9030 Compare August 5, 2019 16:42
@bboreham
Copy link
Contributor Author

bboreham commented Aug 5, 2019

Rebased to latest master.

I now think a statefulset is not required, because we just need to have each peer find any existing peer and that can be done via regular service discovery.

More notes of commands used in testing:

docker run --name=configs --net=cortex -d quay.io/cortexproject/cortex -target configs --database.uri=memory://
docker run --name=a1 --net=cortex -d quay.io/cortexproject/cortex -target=alertmanager --alertmanager.configs.url=http://configs --alertmanager.web.external-url=/api/prom/alertmanager
docker run --name=a2 --net=cortex -d quay.io/cortexproject/cortex -target=alertmanager --alertmanager.configs.url=http://configs -cluster.peer=a1:9094 --alertmanager.web.external-url=/api/prom/alertmanager

curl http://172.26.0.3/status

curl -X POST -H X-Scope-OrgID:3 http://172.26.0.2/api/prom/configs/alertmanager -d '{"alertmanager_config": "route:\n receiver: foo\nreceivers:\n- name: foo"}'

curl -X POST -H X-Scope-OrgID:3 http://172.26.0.3/api/prom/alertmanager/api/v1/silences -d '{"startsAt": "2019-08-05T16:18:03Z", "endsAt": "2019-08-05T19:18:03Z", "updatedAt": "2019-08-05T16:18:03Z", "comment": "bar", "createdBy": "bryan", "matchers": [{"name": "m1", "value": "v1"}]}'

curl -H X-Scope-OrgID:3 http://172.26.0.3/api/prom/alertmanager/api/v1/silences
curl -H X-Scope-OrgID:3 http://172.26.0.4/api/prom/alertmanager/api/v1/silences


@bboreham bboreham force-pushed the update-alertmanager branch from 26d9030 to d0c0c9b Compare August 5, 2019 16:48
@bboreham bboreham changed the title WIP: Update alertmanager to upstream v0.15.1 with memberlist Update alertmanager to upstream v0.15.1 with memberlist Aug 5, 2019
@bboreham
Copy link
Contributor Author

bboreham commented Aug 5, 2019

I just tested this as a Kubernetes Deployment with -cluster.peer=alertmanager.cortex.svc.cluster.local.:9094 and it worked fine.

May need to check what exactly Memberlist does when you give it a Kubernetes Service address, i.e. one that moves around from call to call.

UPDATE: Prometheus alertmanager takes the name you give it and does a DNS lookup, so a headless service is perfect. It does this once at startup, so each new alertmanager would connect to all existing alertmanagers. Dead ones are removed from the list.

@bboreham
Copy link
Contributor Author

bboreham commented Aug 6, 2019

I'm now running this in a staging cluster. All seems fine, although there is a bit of log noise at startup:

level=warn ts=2019-08-06T13:12:49.059667018Z caller=delegate.go:197 component=cluster received="unknown state key" len=2414 key=sil:1136
level=warn ts=2019-08-06T13:12:49.059676114Z caller=delegate.go:197 component=cluster received="unknown state key" len=2414 key=sil:853
level=warn ts=2019-08-06T13:12:49.059685719Z caller=delegate.go:197 component=cluster received="unknown state key" len=2414 key=sil:1122

I think this is because gossip starts before the configs are all loaded, and we receive updates from already-running alert managers about instances we don't know about yet.

Still, the log noise is down about 100x from the current version.

@csmarchbanks csmarchbanks self-requested a review August 6, 2019 16:25
flag.StringVar(&cfg.clusterAdvertiseAddr, "cluster.advertise-address", "", "Explicit address to advertise in cluster.")
flag.Var(&cfg.peers, "cluster.peer", "Initial peers (may be repeated).")
flag.DurationVar(&cfg.peerTimeout, "cluster.peer-timeout", time.Second*15, "Time to wait between peers to send notifications.")
flag.DurationVar(&cfg.gossipInterval, "cluster.gossip-interval", cluster.DefaultGossipInterval, "Interval between sending gossip messages. By lowering this value (more frequent) gossip messages are propagated across the cluster more quickly at the expense of increased bandwidth.")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we really need to expose all of these "cluster" parameters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably get away with defaults for most of these. Do you plan to actually change any of them?

@bboreham bboreham force-pushed the update-alertmanager branch from 3332bf8 to 9b72c2e Compare August 8, 2019 11:01
@bboreham
Copy link
Contributor Author

bboreham commented Aug 8, 2019

Have rebased against master, and now this PR undoes some of the vendor hacks introduced by #1510 - putting Alertmanager back on mainline.

Copy link
Contributor

@csmarchbanks csmarchbanks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems sane to me, but I don't actually run alertmanager. Perhaps @khaines could take a look as well since I believe he heavily uses alertmanager?

flag.StringVar(&cfg.clusterAdvertiseAddr, "cluster.advertise-address", "", "Explicit address to advertise in cluster.")
flag.Var(&cfg.peers, "cluster.peer", "Initial peers (may be repeated).")
flag.DurationVar(&cfg.peerTimeout, "cluster.peer-timeout", time.Second*15, "Time to wait between peers to send notifications.")
flag.DurationVar(&cfg.gossipInterval, "cluster.gossip-interval", cluster.DefaultGossipInterval, "Interval between sending gossip messages. By lowering this value (more frequent) gossip messages are propagated across the cluster more quickly at the expense of increased bandwidth.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably get away with defaults for most of these. Do you plan to actually change any of them?

@bboreham bboreham force-pushed the update-alertmanager branch from d1643c1 to 1e39f02 Compare August 13, 2019 10:54
@bboreham
Copy link
Contributor Author

I removed most of the new configuration parameters (left the commit in, so they can be retrieved if we do need them).

bboreham and others added 4 commits August 16, 2019 15:07
Don't expect any of these will need to be configured.

Signed-off-by: Bryan Boreham <[email protected]>
@bboreham bboreham force-pushed the update-alertmanager branch from 1e39f02 to f547d62 Compare August 16, 2019 15:09
Copy link
Contributor

@khaines khaines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Glad to see this getting updated!

@bboreham bboreham merged commit 7f3895e into master Aug 16, 2019
@bboreham bboreham deleted the update-alertmanager branch August 16, 2019 15:17
@bboreham bboreham mentioned this pull request Aug 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants