Update alertmanager to upstream v0.15.1 with memberlist #929
Conversation
Force-pushed from 249336a to d805c18
Force-pushed from b811561 to a38a9c2
This seems fine code-wise.
Have you tried deploying it and having it try to discover other members based on the alertmanager service URL? I am hoping not to do a stateful set if possible.
No, I have not tried via discovery. My vague memory is that it won't work as-is. Also, I have not done the config for a StatefulSet, but in my mind a StatefulSet seems better in every way for a component like this. Interested to hear what makes you avoid them.
The only issue I have with stateful sets is that I think you have to make an entirely new set (or delete the old set) to change values that aren't image or resource constraints. Not a big deal, but a bit of a hassle.
This is really the only pain point we have with stateful sets in our environment. To make any change beyond the few fields that k8s allows you to update, you must delete the entire stateful set first, incurring some downtime for that component.
If the peer list could be rendered into a config file and loaded via a ConfigMap, this might avoid the issue with updating fields in a StatefulSet spec.
This would also close #1205.
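A minimal sketch of that ConfigMap idea, assuming the `-cluster.peer` flag added by this PR; the resource names and port are made up for illustration and are not part of this change:

```sh
# Hypothetical sketch only: keep the peer list in a ConfigMap so it can be
# edited without replacing the StatefulSet spec. All names are illustrative.
kubectl create configmap alertmanager-peers \
  --from-literal=PEER_FLAGS='-cluster.peer=alertmanager-0.alertmanager:9094 -cluster.peer=alertmanager-1.alertmanager:9094'

# The pod template would then load PEER_FLAGS from the ConfigMap (e.g. via
# envFrom) and splice it onto the container command line, so changing peers
# only needs a ConfigMap update and a pod restart, not a new StatefulSet.
```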
Force-pushed from 2810427 to 26d9030
Rebased to latest master. I now think a StatefulSet is not required, because we just need each peer to find any existing peer, and that can be done via regular service discovery. More notes of commands used in testing:
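Purely as an illustration (not the commands referred to above), a two-instance local test might look like this; the image name is a placeholder, other required flags are omitted, and 9094 is assumed from upstream Alertmanager's default cluster port:

```sh
# Illustrative sketch only -- not the original test commands from this PR.
docker network create am-test

# Two instances that find each other via the -cluster.peer flag added here.
docker run -d --name am1 --network am-test <cortex-alertmanager-image> \
  -cluster.peer=am2:9094
docker run -d --name am2 --network am-test <cortex-alertmanager-image> \
  -cluster.peer=am1:9094
```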
Force-pushed from 26d9030 to d0c0c9b
I just tested this as a Kubernetes Deployment. May need to check what exactly memberlist does when you give it a Kubernetes Service address. UPDATE: Prometheus Alertmanager takes the name you give it and does a DNS lookup, so a headless service is perfect. It does this once at startup, so each new alertmanager would connect to all existing alertmanagers. Dead ones are removed from the list.
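A minimal sketch of such a headless Service; the service name, selector label, and port are assumptions for illustration, not taken from this PR:

```sh
# Sketch: a headless Service (clusterIP: None) whose DNS name resolves to one
# A record per alertmanager pod, so a single -cluster.peer value is enough.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-gossip
spec:
  clusterIP: None          # headless: DNS lookup returns every ready pod IP
  selector:
    name: alertmanager
  ports:
  - name: gossip
    port: 9094
EOF

# Each replica could then be started with e.g.:
#   -cluster.peer=alertmanager-gossip:9094
```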
I'm now running this in a staging cluster. All seems fine, although there is a bit of log noise at startup:
I think this is because gossip starts before the configs are all loaded, and we receive updates from already-running alertmanagers about instances we don't know about yet. Still, the log noise is down about 100x from the current version.
pkg/alertmanager/multitenant.go
Outdated
flag.StringVar(&cfg.clusterAdvertiseAddr, "cluster.advertise-address", "", "Explicit address to advertise in cluster.")
flag.Var(&cfg.peers, "cluster.peer", "Initial peers (may be repeated).")
flag.DurationVar(&cfg.peerTimeout, "cluster.peer-timeout", time.Second*15, "Time to wait between peers to send notifications.")
flag.DurationVar(&cfg.gossipInterval, "cluster.gossip-interval", cluster.DefaultGossipInterval, "Interval between sending gossip messages. By lowering this value (more frequent) gossip messages are propagated across the cluster more quickly at the expense of increased bandwidth.")
I wonder if we really need to expose all of these "cluster" parameters.
We could probably get away with defaults for most of these. Do you plan to actually change any of them?
Force-pushed from 3332bf8 to 9b72c2e
Have rebased against master, and now this PR undoes some of the vendor hacks introduced by #1510, putting Alertmanager back on mainline.
This seems sane to me, but I don't actually run alertmanager. Perhaps @khaines could take a look as well since I believe he heavily uses alertmanager?
Force-pushed from d1643c1 to 1e39f02
I removed most of the new configuration parameters (left the commit in, so they can be retrieved if we do need them).
Signed-off-by: Bryan Boreham <[email protected]>
Signed-off-by: Bryan Boreham <[email protected]>
Signed-off-by: Bryan Boreham <[email protected]>
Don't expect any of these will need to be configured. Signed-off-by: Bryan Boreham <[email protected]>
Force-pushed from 1e39f02 to f547d62
LGTM. Glad to see this getting updated!
Fixes #793
Fixes #1205
Fixes #343 because that code is removed
Fixes #899 because that code is removed
Fixes #900 because the message is now at debug level upstream
Options like `-alertmanager.mesh.peer.service` from the previous implementation are removed.

In a Kubernetes deployment, it can be run as a StatefulSet: suppose the members of the set are named a1, a2 and a3, then all can be run as `alertmanager -peer a1 -peer a2 -peer a3`.

I have tested as individual Docker containers, then tried various `curl` commands against the API.
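Purely as an illustration (not the actual curl commands used), API checks might look like this; upstream Alertmanager v0.15 serves its API under /api/v1/, while the host, port, and any Cortex multitenant path prefix or X-Scope-OrgID header depend on the deployment and are assumptions here:

```sh
# Illustrative only -- not the original commands used for this PR.
curl -s http://<alertmanager-host>:<port>/api/v1/status
curl -s http://<alertmanager-host>:<port>/api/v1/alerts
```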