
Conversation

@cxdy commented Dec 26, 2025

Description

We (Akamai/Linode) upgraded our Alertmanager machines to v0.30.0 on December 18th, 2025, after some testing, and started seeing the process get OOMKilled on December 23rd, 2025.

We investigated and found a memory leak related to the new distributed tracing feature: every outgoing notification request wrapped the HTTP client's transport in a new tracing Transport.

Before

This is the heap from one of the machines we saw this on, running v0.30.0:

alertmanager-pprof go tool pprof -top -cum heap_1gb.pprof | head -60

File: alertmanager
Build ID: 49423fa9419c76d3b878edb2630e31758ab3026e
Type: inuse_space
Time: 2025-12-26 15:33:15 EST
Showing nodes accounting for 201.21MB, 94.35% of 213.26MB total
Dropped 148 nodes (cum <= 1.07MB)
      flat  flat%   sum%        cum   cum%
         0     0%     0%   114.51MB 53.70%  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*Transport).RoundTrip
       2MB  0.94%  0.94%       63MB 29.54%  net/http/httptrace.WithClientTrace
      31MB 14.54% 15.47%       58MB 27.20%  net/http/httptrace.(*ClientTrace).compose
         0     0% 15.47%    28.55MB 13.39%  github.com/prometheus/alertmanager/dispatch.(*Dispatcher).Run
    1.02MB  0.48% 15.95%    28.55MB 13.39%  github.com/prometheus/alertmanager/dispatch.(*Dispatcher).groupAlert
         0     0% 15.95%    28.55MB 13.39%  github.com/prometheus/alertmanager/dispatch.(*Dispatcher).routeAlert
         0     0% 15.95%    28.55MB 13.39%  github.com/prometheus/alertmanager/dispatch.(*Dispatcher).run
      27MB 12.66% 28.61%       27MB 12.66%  reflect.MakeFunc
       9MB  4.22% 32.83%    23.01MB 10.79%  net/http.(*Request).Clone
    2.50MB  1.17% 34.01%    20.03MB  9.39%  github.com/prometheus/alertmanager/dispatch.newAggrGroup
         0     0% 34.01%       20MB  9.38%  github.com/prometheus/alertmanager/tracing.Transport.func1
      20MB  9.38% 43.39%       20MB  9.38%  go.opentelemetry.io/contrib/instrumentation/net/http/httptrace/otelhttptrace.NewClientTrace
         0     0% 43.39%    17.45MB  8.18%  github.com/prometheus/alertmanager/api.(*API).limitHandler.func1
         0     0% 43.39%    17.45MB  8.18%  github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func1
         0     0% 43.39%    17.45MB  8.18%  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP
         0     0% 43.39%    17.45MB  8.18%  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1
         0     0% 43.39%    17.45MB  8.18%  net/http.(*ServeMux).ServeHTTP
         0     0% 43.39%    17.45MB  8.18%  net/http.(*conn).serve
         0     0% 43.39%    17.45MB  8.18%  net/http.HandlerFunc.ServeHTTP
         0     0% 43.39%    17.45MB  8.18%  net/http.serverHandler.ServeHTTP
   17.03MB  7.99% 51.37%    17.03MB  7.99%  runtime.allocm
         0     0% 51.37%    17.03MB  7.99%  runtime.newm
         0     0% 51.37%    17.03MB  7.99%  runtime.resetspinning
         0     0% 51.37%    17.03MB  7.99%  runtime.schedule
         0     0% 51.37%    17.03MB  7.99%  runtime.startm
         0     0% 51.37%    17.03MB  7.99%  runtime.wakep
         0     0% 51.37%       15MB  7.04%  github.com/go-openapi/runtime/middleware.(*Context).RoutesHandler.NewOperationExecutor.func1
         0     0% 51.37%       15MB  7.04%  github.com/go-openapi/runtime/middleware.NewRouter.func1
         0     0% 51.37%       15MB  7.04%  github.com/go-openapi/runtime/middleware.Spec.func1
         0     0% 51.37%       15MB  7.04%  github.com/prometheus/alertmanager/api.(*API).Register.(*API).instrumentHandler.func2
         0     0% 51.37%       15MB  7.04%  github.com/prometheus/alertmanager/api.(*API).Register.StripPrefix.func1
         0     0% 51.37%       15MB  7.04%  github.com/prometheus/alertmanager/api/v2.NewAPI.setResponseHeaders.func2
         0     0% 51.37%       15MB  7.04%  github.com/prometheus/alertmanager/api/v2/restapi/operations/alert.(*PostAlerts).ServeHTTP
         0     0% 51.37%       15MB  7.04%  github.com/rs/cors.(*Cors).Handler-fm.(*Cors).Handler.func1
         0     0% 51.37%       15MB  7.04%  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.WithRouteTag.func1
      11MB  5.16% 56.53%       11MB  5.16%  net/http.Header.Clone (inline)
         0     0% 56.53%    10.52MB  4.93%  runtime.mstart
         0     0% 56.53%    10.52MB  4.93%  runtime.mstart0
         0     0% 56.53%    10.52MB  4.93%  runtime.mstart1
         0     0% 56.53%    10.11MB  4.74%  runtime.main
         0     0% 56.53%     8.61MB  4.04%  main.main
         0     0% 56.53%     8.61MB  4.04%  main.run
    2.55MB  1.19% 57.73%     8.06MB  3.78%  github.com/prometheus/alertmanager/dispatch.(*aggrGroup).run
         0     0% 57.73%     8.03MB  3.77%  github.com/prometheus/alertmanager/notify.MultiStage.Exec
    1.50MB   0.7% 58.43%        8MB  3.75%  encoding/json.(*decodeState).object
         0     0% 58.43%        8MB  3.75%  encoding/json.(*decodeState).unmarshal
         0     0% 58.43%        8MB  3.75%  encoding/json.(*decodeState).value
         0     0% 58.43%        8MB  3.75%  encoding/json.Unmarshal
         0     0% 58.43%     7.53MB  3.53%  github.com/prometheus/alertmanager/notify.FanoutStage.Exec.func1
         0     0% 58.43%     7.50MB  3.52%  log/slog.(*Logger).With
         0     0% 58.43%     7.50MB  3.52%  log/slog.(*TextHandler).WithAttrs
         0     0% 58.43%     7.50MB  3.52%  log/slog.(*commonHandler).withAttrs
          0     0% 58.43%     7.50MB  3.52%  github.com/prometheus/alertmanager/api/v2.(*API).postAlertsHandler

alertmanager-pprof go tool pprof -top -inuse_space heap_1gb.pprof | head -60

File: alertmanager
Build ID: 49423fa9419c76d3b878edb2630e31758ab3026e
Type: inuse_space
Time: 2025-12-26 15:33:15 EST
Showing nodes accounting for 201.21MB, 94.35% of 213.26MB total
Dropped 148 nodes (cum <= 1.07MB)
      flat  flat%   sum%        cum   cum%
      31MB 14.54% 14.54%       58MB 27.20%  net/http/httptrace.(*ClientTrace).compose
      27MB 12.66% 27.20%       27MB 12.66%  reflect.MakeFunc
      20MB  9.38% 36.58%       20MB  9.38%  go.opentelemetry.io/contrib/instrumentation/net/http/httptrace/otelhttptrace.NewClientTrace
   17.03MB  7.99% 44.56%    17.03MB  7.99%  runtime.allocm
      11MB  5.16% 49.72%       11MB  5.16%  net/http.Header.Clone (inline)
       9MB  4.22% 53.94%    23.01MB 10.79%  net/http.(*Request).Clone
       7MB  3.28% 57.23%        7MB  3.28%  runtime.malg
    6.50MB  3.05% 60.28%     6.50MB  3.05%  github.com/prometheus/alertmanager/dispatch.getGroupLabels (inline)
    6.50MB  3.05% 63.33%        7MB  3.28%  go.opentelemetry.io/otel/internal/global.(*tracer).newSpan
       6MB  2.81% 66.14%        6MB  2.81%  encoding/json.(*decodeState).literalStore
    5.50MB  2.58% 68.72%     5.50MB  2.58%  unicode/utf8.AppendRune (inline)
       5MB  2.35% 71.07%        5MB  2.35%  github.com/prometheus/alertmanager/api/v2.APILabelSetToModelLabelSet (inline)
    3.50MB  1.64% 72.71%     3.50MB  1.64%  context.(*cancelCtx).Done
    3.50MB  1.64% 74.35%     3.50MB  1.64%  context.WithValue
       3MB  1.41% 75.76%        3MB  1.41%  github.com/prometheus/alertmanager/nflog/nflogpb.(*Entry).Unmarshal
       3MB  1.41% 77.16%        3MB  1.41%  net/http.cloneURL (inline)
    2.55MB  1.19% 78.36%     8.06MB  3.78%  github.com/prometheus/alertmanager/dispatch.(*aggrGroup).run
    2.53MB  1.18% 79.54%     2.53MB  1.18%  context.(*cancelCtx).propagateCancel
    2.51MB  1.18% 80.72%     2.51MB  1.18%  time.newTimer
    2.50MB  1.17% 81.89%     3.50MB  1.64%  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewTransport
    2.50MB  1.17% 83.06%    20.03MB  9.39%  github.com/prometheus/alertmanager/dispatch.newAggrGroup
       2MB  0.94% 84.00%        2MB  0.94%  log/slog.(*commonHandler).clone (inline)
       2MB  0.94% 84.94%        7MB  3.28%  github.com/prometheus/alertmanager/api/v2.OpenAPIAlertsToAlerts
       2MB  0.94% 85.88%     3.03MB  1.42%  context.withCancel (inline)
       2MB  0.94% 86.82%        2MB  0.94%  github.com/prometheus/alertmanager/store.NewAlerts (inline)
       2MB  0.94% 87.75%       63MB 29.54%  net/http/httptrace.WithClientTrace
    1.51MB  0.71% 88.46%     1.51MB  0.71%  regexp/syntax.(*compiler).inst (inline)
    1.50MB   0.7% 89.16%     4.50MB  2.11%  github.com/prometheus/alertmanager/nflog/nflogpb.(*MeshEntry).Unmarshal
    1.50MB   0.7% 89.87%     1.50MB   0.7%  strings.(*Builder).WriteString (inline)
    1.50MB   0.7% 90.57%     3.50MB  1.64%  time.NewTimer
    1.50MB   0.7% 91.27%     1.50MB   0.7%  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp/internal/request.NewBodyWrapper (inline)
    1.50MB   0.7% 91.98%        8MB  3.75%  encoding/json.(*decodeState).object
    1.07MB   0.5% 92.48%     1.07MB   0.5%  compress/flate.(*compressor).initDeflate (inline)
    1.02MB  0.48% 92.96%    28.55MB 13.39%  github.com/prometheus/alertmanager/dispatch.(*Dispatcher).groupAlert
       1MB  0.47% 93.42%     1.50MB   0.7%  github.com/prometheus/alertmanager/nflog.(*Log).Log
    0.88MB  0.41% 93.84%     1.95MB  0.91%  compress/flate.NewWriter (inline)
    0.59MB  0.28% 94.11%     1.60MB  0.75%  github.com/prometheus/alertmanager/config.LoadFile
    0.50MB  0.23% 94.35%     2.01MB  0.94%  github.com/prometheus/alertmanager/pkg/labels.NewMatcher
         0     0% 94.35%     1.95MB  0.91%  bufio.(*Writer).Flush
         0     0% 94.35%     1.07MB   0.5%  compress/flate.(*compressor).init
         0     0% 94.35%     1.95MB  0.91%  compress/gzip.(*Writer).Write
         0     0% 94.35%     3.03MB  1.42%  context.WithCancel
         0     0% 94.35%     2.01MB  0.94%  context.WithDeadline (inline)
         0     0% 94.35%     2.01MB  0.94%  context.WithDeadlineCause
         0     0% 94.35%     2.01MB  0.94%  context.WithTimeout
         0     0% 94.35%     7.50MB  3.52%  encoding/json.(*Decoder).Decode
         0     0% 94.35%     7.50MB  3.52%  encoding/json.(*decodeState).array
         0     0% 94.35%        8MB  3.75%  encoding/json.(*decodeState).unmarshal
         0     0% 94.35%        8MB  3.75%  encoding/json.(*decodeState).value
         0     0% 94.35%        8MB  3.75%  encoding/json.Unmarshal
         0     0% 94.35%     7.50MB  3.52%  github.com/go-openapi/runtime.ConsumerFunc.Consume
         0     0% 94.35%     7.50MB  3.52%  github.com/go-openapi/runtime/middleware.(*Context).BindValidRequest
         0     0% 94.35%       15MB  7.04%  github.com/go-openapi/runtime/middleware.(*Context).RoutesHandler.NewOperationExecutor.func1

As you can see, go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*Transport).RoundTrip accounts for ~54% of the in-use heap.

We found that "feat: add distributed tracing support" was included in this release, and then noticed in notify/util.go that the request function wraps the client's transport in a new tracing transport on every call:

func request(ctx context.Context, client *http.Client, method, url, bodyType string, body io.Reader) (*http.Response, error) {
	req, err := http.NewRequest(method, url, body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", UserAgentHeader)
	if bodyType != "" {
		req.Header.Set("Content-Type", bodyType)
	}

	// Inject trancing transport
	client.Transport = tracing.Transport(client.Transport)

	return client.Do(req.WithContext(ctx))
}

Because each call re-wraps client.Transport, the chain of tracing transports (and the httptrace.ClientTrace closures they compose) grows with every request, which is what shows up in the heap above. To get around this, in this PR we wrap the client's transport once, when the client is created, and re-use it for every request.
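For reference, here's a minimal sketch of the one-time wrapping approach. The exact body of WrapWithTracing added in this PR isn't reproduced in this description, so treat this as an illustration of the idea rather than the actual diff:

// Sketch only, in package notify: assumes the same tracing.Transport helper
// and UserAgentHeader constant used in notify/util.go above.

// WrapWithTracing wraps the client's transport with the tracing transport
// exactly once, when the client is constructed.
func WrapWithTracing(client *http.Client) {
	client.Transport = tracing.Transport(client.Transport)
}

// request no longer touches client.Transport, so no new transport wrapper
// (and no new httptrace.ClientTrace chain) is allocated per request.
func request(ctx context.Context, client *http.Client, method, url, bodyType string, body io.Reader) (*http.Response, error) {
	req, err := http.NewRequest(method, url, body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", UserAgentHeader)
	if bodyType != "" {
		req.Header.Set("Content-Type", bodyType)
	}
	return client.Do(req.WithContext(ctx))
}

Each notifier then calls notify.WrapWithTracing once on the client it builds, instead of request re-wrapping the transport on every call.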

After

I'm testing it by spamming tens of thousands of alerts, so the memory allocation pattern isn't necessarily what you'd see in a normal production workload, but this is what the heap looks like after my change:

alertmanager-pprof go tool pprof -top -cum heap_new.pprof | head -60
File: alertmanager
Build ID: 49423fa9419c76d3b878edb2630e31758ab3026e
Type: inuse_space
Time: Dec 26, 2025 at 10:16pm (UTC)
Showing nodes accounting for 16937.96kB, 100% of 16937.96kB total
      flat  flat%   sum%        cum   cum%
 8721.01kB 51.49% 51.49%  9233.57kB 54.51%  runtime.allocm
         0     0% 51.49%  9233.57kB 54.51%  runtime.newm
         0     0% 51.49%  9233.57kB 54.51%  runtime.resetspinning
         0     0% 51.49%  9233.57kB 54.51%  runtime.schedule
         0     0% 51.49%  9233.57kB 54.51%  runtime.startm
         0     0% 51.49%  9233.57kB 54.51%  runtime.wakep
         0     0% 51.49%  5129.57kB 30.28%  runtime.mstart
         0     0% 51.49%  5129.57kB 30.28%  runtime.mstart0
         0     0% 51.49%  5129.57kB 30.28%  runtime.mstart1
         0     0% 51.49%  4622.60kB 27.29%  runtime.main
         0     0% 51.49%     4104kB 24.23%  runtime.mcall
         0     0% 51.49%     3591kB 21.20%  runtime.park_m
         0     0% 51.49%  3085.09kB 18.21%  main.main
         0     0% 51.49%  3085.09kB 18.21%  main.run
         0     0% 51.49%  2573.05kB 15.19%  github.com/prometheus/alertmanager/api.New
         0     0% 51.49%  2573.05kB 15.19%  github.com/prometheus/alertmanager/api/v2.NewAPI
         0     0% 51.49%  2573.05kB 15.19%  github.com/prometheus/alertmanager/api/v2.getSwaggerSpec
         0     0% 51.49%  1548.70kB  9.14%  github.com/go-openapi/analysis.New
         0     0% 51.49%  1544.39kB  9.12%  github.com/go-openapi/loads.Analyzed
         0     0% 51.49%  1540.58kB  9.10%  sync.(*Once).Do (inline)
         0     0% 51.49%  1540.58kB  9.10%  sync.(*Once).doSlow
         0     0% 51.49%  1537.52kB  9.08%  runtime.doInit (inline)
         0     0% 51.49%  1537.52kB  9.08%  runtime.doInit1
 1036.68kB  6.12% 57.61%  1036.68kB  6.12%  github.com/go-openapi/analysis.(*Spec).reset
         0     0% 57.61%  1028.57kB  6.07%  crypto/tls.(*Conn).HandshakeContext (inline)
         0     0% 57.61%  1028.57kB  6.07%  crypto/tls.(*Conn).clientHandshake
         0     0% 57.61%  1028.57kB  6.07%  crypto/tls.(*Conn).handshakeContext
         0     0% 57.61%  1028.57kB  6.07%  crypto/tls.(*Conn).verifyServerCertificate
         0     0% 57.61%  1028.57kB  6.07%  crypto/tls.(*clientHandshakeStateTLS13).handshake
         0     0% 57.61%  1028.57kB  6.07%  crypto/tls.(*clientHandshakeStateTLS13).readServerCertificate
         0     0% 57.61%  1028.57kB  6.07%  crypto/x509.(*CertPool).AppendCertsFromPEM
         0     0% 57.61%  1028.57kB  6.07%  crypto/x509.(*Certificate).Verify
         0     0% 57.61%  1028.57kB  6.07%  crypto/x509.initSystemRoots
         0     0% 57.61%  1028.57kB  6.07%  crypto/x509.loadSystemRoots
         0     0% 57.61%  1028.57kB  6.07%  crypto/x509.systemRootsPool
         0     0% 57.61%  1028.57kB  6.07%  net/http.(*persistConn).addTLS.func2
         0     0% 57.61%  1024.64kB  6.05%  regexp.Compile (inline)
         0     0% 57.61%  1024.64kB  6.05%  regexp.MustCompile
  512.08kB  3.02% 60.63%  1024.64kB  6.05%  regexp.compile
 1024.44kB  6.05% 66.68%  1024.44kB  6.05%  runtime.malg
         0     0% 66.68%  1024.44kB  6.05%  runtime.newproc.func1
         0     0% 66.68%  1024.44kB  6.05%  runtime.newproc1
         0     0% 66.68%  1024.44kB  6.05%  runtime.systemstack
         0     0% 66.68%  1024.35kB  6.05%  encoding/json.(*decodeState).object
         0     0% 66.68%  1024.35kB  6.05%  encoding/json.(*decodeState).unmarshal
         0     0% 66.68%  1024.35kB  6.05%  encoding/json.(*decodeState).value
         0     0% 66.68%  1024.35kB  6.05%  encoding/json.Unmarshal
         0     0% 66.68%  1024.35kB  6.05%  github.com/go-openapi/spec.(*Schema).UnmarshalJSON
         0     0% 66.68%  1024.35kB  6.05%  github.com/go-openapi/spec.MustLoadSwagger20Schema (inline)
         0     0% 66.68%  1024.35kB  6.05%  github.com/go-openapi/spec.Swagger20Schema
  516.76kB  3.05% 69.73%   516.76kB  3.05%  runtime.procresize
         0     0% 69.73%   516.76kB  3.05%  runtime.rt0_go
         0     0% 69.73%   516.76kB  3.05%  runtime.schedinit
   516.01kB  3.05% 72.78%   516.01kB  3.05%  crypto/x509.(*CertPool).addCertFunc (inline)

alertmanager-pprof go tool pprof -top -inuse_space heap_new.pprof | head -60
File: alertmanager
Build ID: 49423fa9419c76d3b878edb2630e31758ab3026e
Type: inuse_space
Time: Dec 26, 2025 at 10:16pm (UTC)
Showing nodes accounting for 16937.96kB, 100% of 16937.96kB total
      flat  flat%   sum%        cum   cum%
 8721.01kB 51.49% 51.49%  9233.57kB 54.51%  runtime.allocm
 1036.68kB  6.12% 57.61%  1036.68kB  6.12%  github.com/go-openapi/analysis.(*Spec).reset
 1024.44kB  6.05% 63.66%  1024.44kB  6.05%  runtime.malg
  516.76kB  3.05% 66.71%   516.76kB  3.05%  runtime.procresize
  516.01kB  3.05% 69.75%   516.01kB  3.05%  crypto/x509.(*CertPool).addCertFunc (inline)
  512.88kB  3.03% 72.78%   512.88kB  3.03%  google.golang.org/protobuf/internal/filedesc.(*Message).unmarshalFull
  512.56kB  3.03% 75.81%   512.56kB  3.03%  encoding/pem.Decode
  512.56kB  3.03% 78.83%   512.56kB  3.03%  regexp.onePassCopy
  512.56kB  3.03% 81.86%   512.56kB  3.03%  runtime.makeProfStackFP (inline)
  512.28kB  3.02% 84.88%   512.28kB  3.02%  reflect.mapassign0
  512.08kB  3.02% 87.91%  1024.64kB  6.05%  regexp.compile
  512.07kB  3.02% 90.93%   512.07kB  3.02%  net/url.parse
  512.03kB  3.02% 93.95%   512.03kB  3.02%  text/template/parse.(*ListNode).append (inline)
  512.02kB  3.02% 96.98%   512.02kB  3.02%  github.com/go-openapi/analysis.(*Spec).analyzeSchema
  512.01kB  3.02%   100%   512.01kB  3.02%  mime.setExtensionType
         0     0%   100%  1028.57kB  6.07%  crypto/tls.(*Conn).HandshakeContext (inline)
         0     0%   100%  1028.57kB  6.07%  crypto/tls.(*Conn).clientHandshake
         0     0%   100%  1028.57kB  6.07%  crypto/tls.(*Conn).handshakeContext
         0     0%   100%  1028.57kB  6.07%  crypto/tls.(*Conn).verifyServerCertificate
         0     0%   100%  1028.57kB  6.07%  crypto/tls.(*clientHandshakeStateTLS13).handshake
         0     0%   100%  1028.57kB  6.07%  crypto/tls.(*clientHandshakeStateTLS13).readServerCertificate
         0     0%   100%  1028.57kB  6.07%  crypto/x509.(*CertPool).AppendCertsFromPEM
         0     0%   100%  1028.57kB  6.07%  crypto/x509.(*Certificate).Verify
         0     0%   100%  1028.57kB  6.07%  crypto/x509.initSystemRoots
         0     0%   100%  1028.57kB  6.07%  crypto/x509.loadSystemRoots
         0     0%   100%  1028.57kB  6.07%  crypto/x509.systemRootsPool
         0     0%   100%  1024.35kB  6.05%  encoding/json.(*decodeState).object
         0     0%   100%  1024.35kB  6.05%  encoding/json.(*decodeState).unmarshal
         0     0%   100%  1024.35kB  6.05%  encoding/json.(*decodeState).value
         0     0%   100%  1024.35kB  6.05%  encoding/json.Unmarshal
         0     0%   100%   512.56kB  3.03%  github.com/aws/aws-sdk-go-v2/service/sns/internal/endpoints.init
         0     0%   100%   512.02kB  3.02%  github.com/go-openapi/analysis.(*Spec).initialize
         0     0%   100%  1548.70kB  9.14%  github.com/go-openapi/analysis.New
         0     0%   100%   512.07kB  3.02%  github.com/go-openapi/jsonreference.(*Ref).parse
         0     0%   100%   512.07kB  3.02%  github.com/go-openapi/jsonreference.New (inline)
         0     0%   100%  1544.39kB  9.12%  github.com/go-openapi/loads.Analyzed
         0     0%   100%   512.07kB  3.02%  github.com/go-openapi/spec.(*Ref).fromMap
         0     0%   100%  1024.35kB  6.05%  github.com/go-openapi/spec.(*Schema).UnmarshalJSON
         0     0%   100%  1024.35kB  6.05%  github.com/go-openapi/spec.MustLoadSwagger20Schema (inline)
         0     0%   100%  1024.35kB  6.05%  github.com/go-openapi/spec.Swagger20Schema
         0     0%   100%   512.01kB  3.02%  github.com/julienschmidt/httprouter.(*Router).ServeHTTP
         0     0%   100%   512.01kB  3.02%  github.com/prometheus/alertmanager/api.(*API).limitHandler.func1
         0     0%   100%  2573.05kB 15.19%  github.com/prometheus/alertmanager/api.New
         0     0%   100%  2573.05kB 15.19%  github.com/prometheus/alertmanager/api/v2.NewAPI
         0     0%   100%  2573.05kB 15.19%  github.com/prometheus/alertmanager/api/v2.getSwaggerSpec
         0     0%   100%   512.03kB  3.02%  github.com/prometheus/alertmanager/config.(*Coordinator).Reload
         0     0%   100%   512.03kB  3.02%  github.com/prometheus/alertmanager/config.(*Coordinator).notifySubscribers (inline)
         0     0%   100%   512.03kB  3.02%  github.com/prometheus/alertmanager/template.(*Template).Parse
         0     0%   100%   512.03kB  3.02%  github.com/prometheus/alertmanager/template.FromGlobs
         0     0%   100%   512.01kB  3.02%  github.com/prometheus/alertmanager/ui.Register.func1
         0     0%   100%   512.01kB  3.02%  github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func1
         0     0%   100%   512.01kB  3.02%  github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1
         0     0%   100%   512.01kB  3.02%  github.com/prometheus/common/route.(*Router).ServeHTTP
         0     0%   100%   512.01kB  3.02%  github.com/prometheus/common/route.(*Router).handle.func1

As you can see, much less! For what it's worth, we don't currently collect traces from Alertmanager (yet), but it looks like my change didn't break anything (screenshot attached).

Summary

We now wrap the tracing Transport around the client once and re-use it, instead of creating a new wrapper for each request, resulting in a significant decrease in memory usage:

| Metric | Before Fix | After Fix (31,000+ alerts) | Improvement |
| --- | --- | --- | --- |
| Total Heap | 213.35 MB | 21.2 MB | -90% reduction |
| otelhttp.Transport.RoundTrip | 114.51 MB (53.7%) | 2.05 MB (9.7%) | -98% reduction |
| httptrace.WithClientTrace | 63 MB (29.5%) | 0 MB | Eliminated |
| otelhttptrace.NewClientTrace | 20 MB (9.4%) | 0.512 MB (2.4%) | -97% reduction |

For what it's worth, the machine we tested on did not have any OOMKills within the ~2-3 weeks we've been testing v0.30.0; we only saw this in production, which is under considerably more load.

I'm not sure what the urgency is here for other folks, or whether anyone else is seeing similar behavior, but we've rolled back to v0.28.1 for now, so it's not too big of a deal for us (although we would like to get back up to v0.30.x soon!).

Signed-off-by: Cody Kaczynski [email protected]

@cxdy force-pushed the cxdy/fix-memory-leak branch from 4742b52 to 84f13fa on December 26, 2025 at 22:57

// WrapWithTracing wraps an HTTP client's transport with tracing instrumentation.
// This should be called once when creating the client, not on every request.
func WrapWithTracing(client *http.Client) {
Contributor:
We could maybe use sync.Once here... But I also don't fully love the need to call notify.WrapWithTracing from all notifiers. Is there a way to replace the injection with one that reuses a client, avoiding the memory overuse? I am OOO until next week so I can't try out options until then, and @siavashs should also be back then, so we can look at options. Or we can submit this and then look into improvements later, given the issue.
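One caveat with a single package-level sync.Once: it would only wrap the first client that passes through request, since each notifier builds its own *http.Client, so a per-client guard would be needed. A rough, untested sketch of that idea (ensureTracingTransport is a hypothetical name, not code from this PR):

// Hypothetical per-client guard in notify/util.go: wrap each client's
// transport at most once, instead of on every call to request.
var tracedClients sync.Map // set of *http.Client values that are already wrapped

func ensureTracingTransport(client *http.Client) {
	if _, alreadyWrapped := tracedClients.LoadOrStore(client, struct{}{}); !alreadyWrapped {
		client.Transport = tracing.Transport(client.Transport)
	}
}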


// Inject trancing transport
client.Transport = tracing.Transport(client.Transport)

Contributor:
Would you consider trying to wrap this call with a sync.Once (we could have one at the notify/util level?) and seeing whether tracing still works, without the leak, while also avoiding having to make the call in each notifier? Or, alternatively, should the wrapping happen in httpclient, err := commoncfg.NewClientFromConfig(*conf.HTTPConfig, "telegram", httpOpts...), or in a function wrapping that one, so we avoid forgetting to call it when adding a new notifier? See the sketch below.
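Wrapping the client factory could look roughly like this (a sketch only; newTracedClientFromConfig is a hypothetical helper name, commoncfg is github.com/prometheus/common/config, and tracing.Transport is the wrapper from this repo's tracing package):

// Hypothetical helper that builds a notifier's HTTP client and injects the
// tracing transport in one place, so individual notifiers can't forget it.
func newTracedClientFromConfig(cfg commoncfg.HTTPClientConfig, name string, opts ...commoncfg.HTTPClientOption) (*http.Client, error) {
	client, err := commoncfg.NewClientFromConfig(cfg, name, opts...)
	if err != nil {
		return nil, err
	}
	// Wrap exactly once, at construction time, rather than per request.
	client.Transport = tracing.Transport(client.Transport)
	return client, nil
}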
