Conversation

@siavashs siavashs commented Nov 15, 2025

Add tracing support using otel to the following components:

  • api: extract trace and span IDs from request context
  • provider: mem put
  • dispatch: split logic and use better naming
  • inhibit: source and target traces, mutes, etc.; drop metrics
  • silence: query, expire, mutes
  • notify: add distributed tracing support to stages and all http requests

Note: inhibitor metrics are dropped since we have tracing now and they
are not needed. We have not released any version with these metrics so
we can drop them safely, this is not a breaking change.

This change borrows part of the implementation from #3673
Fixes #3670

Signed-off-by: Dave Henderson [email protected]
Signed-off-by: Siavash Safi [email protected]

TODO list

Demo

Alertmanager receiving alerts from a "sender" and tracing them all the way to an Aggregation Group inside the Dispatcher

Note that the Sender here is a custom Cloudflare proxy which acts as a limiter; it is included as an example to show distributed tracing in action.
[screenshot]

Trace of Dispatcher flushing alerts and notifications failing to be sent to PagerDuty

[screenshot]


siavashs commented Nov 15, 2025

I'll implement the Prometheus notifier changes in a new draft PR based on prometheus/prometheus#16355, so if that PR gets merged it will be much easier to rebase the tracing changes.


@thompson-tomo thompson-tomo left a comment


Some feedback on attribute naming, based on OpenTelemetry exposure in semconv. It might be worthwhile to add an alerting.platform.name attribute to the spans that stores "alertmanager".

Is there interest in having these signals defined in the OpenTelemetry signal registry (semantic conventions)?

config/config.go Outdated
return json.Marshal(result)
}

// TODO: probably move these into prometheus/common since they're copied from
Member


This is probably a good idea, what do you think @ArthurSens?

Contributor Author


I can prepare a PR for common later this week.

Member


I would not be against it :)

@siavashs siavashs force-pushed the feat/tracing branch 4 times, most recently from 14b79bd to 47747ad Compare November 17, 2025 21:16
Comment on lines +92 to +94
trace.WithAttributes(attribute.String("alerting.notify.integration.name", i.name)),
trace.WithAttributes(attribute.Int("alerting.alerts.count", len(alerts))),
trace.WithSpanKind(trace.SpanKindClient),


Is there an easy way to add additional integration details, potentially some from https://opentelemetry.io/docs/specs/semconv/http/http-spans/#http-client-span, in particular server.*? Note I assume this span represents the outbound call to a notification system, hence the client span kind.


@siavashs siavashs Nov 18, 2025


otelhttp generates a bunch of spans for each HTTP request, so I think it is redundant to add more info to the parent span, unless we want to disable those and construct only one custom span:
[screenshot]


If we are going to also have an HTTP span then perhaps this one should be internal.

attribute.String("alerting.alert.name", alert.Name()),
attribute.String("alerting.alert.fingerprint", alert.Fingerprint().String()),
),
trace.WithSpanKind(trace.SpanKindConsumer),


check this one

Suggested change
trace.WithSpanKind(trace.SpanKindConsumer),
trace.WithSpanKind(trace.SpanKindInternal),


Missed that one. Are these outgoing calls, as per the test mentioned in #4745 (comment)?

Contributor Author


No, it is this case:

deferred execution (PRODUCER and CONSUMER spans).


Yes, but it still needs to be outgoing based on other information in the spec. Anyway, I have raised open-telemetry/opentelemetry-specification#4758 to get further clarification.

@siavashs siavashs force-pushed the feat/tracing branch 9 times, most recently from c1482c7 to 5882778 Compare November 21, 2025 16:00
@siavashs
Contributor Author

CI failures will be fixed after #4761 is merged.

@siavashs siavashs marked this pull request as ready for review November 21, 2025 16:28
@siavashs siavashs requested a review from SuperQ November 21, 2025 16:31

@OGKevin OGKevin left a comment


Some minor ✨ comments

@siavashs siavashs force-pushed the feat/tracing branch 2 times, most recently from bea0dba to b65bdbf Compare November 27, 2025 08:37
@siavashs siavashs mentioned this pull request Dec 5, 2025

@ultrotter ultrotter left a comment


Do we have before-and-after benchmark measurements with tracing enabled and disabled?

// notifyFunc is a function that performs notification for the alert
// with the given fingerprint. It aborts on context cancelation.
Suggested change
// Returns false iff notifying failed.
// Returns false if notifying failed.
Contributor


I don't think that was a typo :)

@siavashs siavashs force-pushed the feat/tracing branch 3 times, most recently from f2706f6 to 87706af Compare December 5, 2025 15:30

siavashs commented Dec 5, 2025

Do we have before-and-after benchmark measurements with tracing enabled and disabled?

No, tracing is disabled by default. I have to check how we can enable tracing in benchmarks and then generate a diff.


siavashs commented Dec 5, 2025

Just did a quick diff; for silences there is a ~6% overhead with a tracing sample rate of 1.0:

goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/silence
cpu: Apple M3 Pro
                                                                 │ bench-notrace.txt │        bench-trace.txt        │
                                                                 │      sec/op       │    sec/op      vs base        │
Mutes/0_total,_0_matching-12                                            383.6n ± ∞ ¹   387.5n ± ∞ ¹       ~ (p=1.000 n=1) ²
Mutes/1_total,_1_matching-12                                            916.6n ± ∞ ¹   924.4n ± ∞ ¹       ~ (p=1.000 n=1) ²
Mutes/100_total,_10_matching-12                                         1.581µ ± ∞ ¹   1.620µ ± ∞ ¹       ~ (p=1.000 n=1) ²
Mutes/1000_total,_1_matching-12                                         1.069µ ± ∞ ¹   1.138µ ± ∞ ¹       ~ (p=1.000 n=1) ²
Mutes/1000_total,_10_matching-12                                        1.926µ ± ∞ ¹   1.939µ ± ∞ ¹       ~ (p=1.000 n=1) ²
Mutes/1000_total,_100_matching-12                                       10.83µ ± ∞ ¹   10.60µ ± ∞ ¹       ~ (p=1.000 n=1) ²
Mutes/10000_total,_0_matching-12                                        1.881µ ± ∞ ¹   1.881µ ± ∞ ¹       ~ (p=1.000 n=1) ³
Mutes/10000_total,_10_matching-12                                       1.846µ ± ∞ ¹   1.898µ ± ∞ ¹       ~ (p=1.000 n=1) ²
Mutes/10000_total,_1000_matching-12                                     105.6µ ± ∞ ¹   105.6µ ± ∞ ¹       ~ (p=1.000 n=1) ²
MutesIncremental/1000_base_silences-12                                  105.2µ ± ∞ ¹   105.2µ ± ∞ ¹       ~ (p=1.000 n=1) ²
MutesIncremental/3000_base_silences-12                                  104.2µ ± ∞ ¹   104.1µ ± ∞ ¹       ~ (p=1.000 n=1) ²
MutesIncremental/7000_base_silences-12                                  104.6µ ± ∞ ¹   109.0µ ± ∞ ¹       ~ (p=1.000 n=1) ²
MutesIncremental/10000_base_silences-12                                 103.1µ ± ∞ ¹   113.0µ ± ∞ ¹       ~ (p=1.000 n=1) ²
Query/100_silences-12                                                   16.43µ ± ∞ ¹   17.36µ ± ∞ ¹       ~ (p=1.000 n=1) ²
Query/1000_silences-12                                                  171.6µ ± ∞ ¹   182.1µ ± ∞ ¹       ~ (p=1.000 n=1) ²
Query/10000_silences-12                                                 2.155m ± ∞ ¹   3.462m ± ∞ ¹       ~ (p=1.000 n=1) ²
QueryParallel/100_silences-12                                           3.749µ ± ∞ ¹   4.416µ ± ∞ ¹       ~ (p=1.000 n=1) ²
QueryParallel/1000_silences-12                                          29.15µ ± ∞ ¹   37.42µ ± ∞ ¹       ~ (p=1.000 n=1) ²
QueryParallel/10000_silences-12                                         465.5µ ± ∞ ¹   625.2µ ± ∞ ¹       ~ (p=1.000 n=1) ²
QueryWithConcurrentAdds/1000_initial_silences,_10%_add_rate-12          109.5µ ± ∞ ¹   100.9µ ± ∞ ¹       ~ (p=1.000 n=1) ²
QueryWithConcurrentAdds/1000_initial_silences,_1%_add_rate-12           60.16µ ± ∞ ¹   64.85µ ± ∞ ¹       ~ (p=1.000 n=1) ²
QueryWithConcurrentAdds/1000_initial_silences,_0.1%_add_rate-12         33.90µ ± ∞ ¹   48.51µ ± ∞ ¹       ~ (p=1.000 n=1) ²
QueryWithConcurrentAdds/10000_initial_silences,_1%_add_rate-12          485.8µ ± ∞ ¹   534.1µ ± ∞ ¹       ~ (p=1.000 n=1) ²
QueryWithConcurrentAdds/10000_initial_silences,_0.1%_add_rate-12        470.9µ ± ∞ ¹   477.7µ ± ∞ ¹       ~ (p=1.000 n=1) ²
MutesParallel/100_silences-12                                           7.420µ ± ∞ ¹   7.221µ ± ∞ ¹       ~ (p=1.000 n=1) ²
MutesParallel/1000_silences-12                                          70.76µ ± ∞ ¹   64.93µ ± ∞ ¹       ~ (p=1.000 n=1) ²
MutesParallel/10000_silences-12                                         436.6µ ± ∞ ¹   446.5µ ± ∞ ¹       ~ (p=1.000 n=1) ²
GC/1000_silences,_0%_expired-12                                         39.93µ ± ∞ ¹   40.24µ ± ∞ ¹       ~ (p=1.000 n=1) ²
GC/1000_silences,_30%_expired-12                                        58.45µ ± ∞ ¹   57.40µ ± ∞ ¹       ~ (p=1.000 n=1) ²
GC/1000_silences,_80%_expired-12                                        81.02µ ± ∞ ¹   81.08µ ± ∞ ¹       ~ (p=1.000 n=1) ²
GC/10000_silences,_0%_expired-12                                        367.2µ ± ∞ ¹   374.6µ ± ∞ ¹       ~ (p=1.000 n=1) ²
GC/10000_silences,_10%_expired-12                                       423.0µ ± ∞ ¹   444.7µ ± ∞ ¹       ~ (p=1.000 n=1) ²
GC/10000_silences,_50%_expired-12                                       625.0µ ± ∞ ¹   661.0µ ± ∞ ¹       ~ (p=1.000 n=1) ²
GC/10000_silences,_80%_expired-12                                       756.8µ ± ∞ ¹   764.5µ ± ∞ ¹       ~ (p=1.000 n=1) ²
geomean                                                                 42.69µ         45.34µ        +6.19%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
³ all samples are equal

I think this is acceptable and expected.

@SuperQ SuperQ merged commit 18939ce into prometheus:main Dec 5, 2025
7 checks passed
SoloJacobs added a commit to SoloJacobs/alertmanager that referenced this pull request Dec 7, 2025
Adds changes that have been merged after 2025-11-21 into main. The entry
prometheus#4629 is removed from the log, since it is superseded by prometheus#4745.
@siavashs siavashs deleted the feat/tracing branch December 8, 2025 09:41

Successfully merging this pull request may close these issues.

Feature: Instrument Alertmanager for distributed tracing

8 participants