
Conversation

@cxdy commented Dec 26, 2025

Description

We (Akamai/Linode) upgraded our Alertmanager machines to v0.30.0 on December 18th, 2025, after some testing, and started seeing the process get OOMKilled on December 23rd, 2025.

We investigated and found a memory leak related to the new distributed tracing feature: every outgoing notification request wrapped the HTTP client's transport in a new tracing Transport.

Before

This is the heap from one of the machines we saw this on, running v0.30.0:

alertmanager-pprof go tool pprof -top -cum heap_1gb.pprof | head -60

File: alertmanager
Build ID: 49423fa9419c76d3b878edb2630e31758ab3026e
Type: inuse_space
Time: 2025-12-26 15:33:15 EST
Showing nodes accounting for 201.21MB, 94.35% of 213.26MB total
Dropped 148 nodes (cum <= 1.07MB)
      flat  flat%   sum%        cum   cum%
         0     0%     0%   114.51MB 53.70%  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*Transport).RoundTrip
       2MB  0.94%  0.94%       63MB 29.54%  net/http/httptrace.WithClientTrace
      31MB 14.54% 15.47%       58MB 27.20%  net/http/httptrace.(*ClientTrace).compose
         0     0% 15.47%    28.55MB 13.39%  github.com/prometheus/alertmanager/dispatch.(*Dispatcher).Run
    1.02MB  0.48% 15.95%    28.55MB 13.39%  github.com/prometheus/alertmanager/dispatch.(*Dispatcher).groupAlert
         0     0% 15.95%    28.55MB 13.39%  github.com/prometheus/alertmanager/dispatch.(*Dispatcher).routeAlert
         0     0% 15.95%    28.55MB 13.39%  github.com/prometheus/alertmanager/dispatch.(*Dispatcher).run
      27MB 12.66% 28.61%       27MB 12.66%  reflect.MakeFunc
       9MB  4.22% 32.83%    23.01MB 10.79%  net/http.(*Request).Clone
    2.50MB  1.17% 34.01%    20.03MB  9.39%  github.com/prometheus/alertmanager/dispatch.newAggrGroup
         0     0% 34.01%       20MB  9.38%  github.com/prometheus/alertmanager/tracing.Transport.func1
      20MB  9.38% 43.39%       20MB  9.38%  go.opentelemetry.io/contrib/instrumentation/net/http/httptrace/otelhttptrace.NewClientTrace
         0     0% 43.39%    17.45MB  8.18%  github.com/prometheus/alertmanager/api.(*API).limitHandler.func1
         0     0% 43.39%    17.45MB  8.18%  github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func1
         0     0% 43.39%    17.45MB  8.18%  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP
         0     0% 43.39%    17.45MB  8.18%  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1
         0     0% 43.39%    17.45MB  8.18%  net/http.(*ServeMux).ServeHTTP
         0     0% 43.39%    17.45MB  8.18%  net/http.(*conn).serve
         0     0% 43.39%    17.45MB  8.18%  net/http.HandlerFunc.ServeHTTP
         0     0% 43.39%    17.45MB  8.18%  net/http.serverHandler.ServeHTTP
   17.03MB  7.99% 51.37%    17.03MB  7.99%  runtime.allocm
         0     0% 51.37%    17.03MB  7.99%  runtime.newm
         0     0% 51.37%    17.03MB  7.99%  runtime.resetspinning
         0     0% 51.37%    17.03MB  7.99%  runtime.schedule
         0     0% 51.37%    17.03MB  7.99%  runtime.startm
         0     0% 51.37%    17.03MB  7.99%  runtime.wakep
         0     0% 51.37%       15MB  7.04%  github.com/go-openapi/runtime/middleware.(*Context).RoutesHandler.NewOperationExecutor.func1
         0     0% 51.37%       15MB  7.04%  github.com/go-openapi/runtime/middleware.NewRouter.func1
         0     0% 51.37%       15MB  7.04%  github.com/go-openapi/runtime/middleware.Spec.func1
         0     0% 51.37%       15MB  7.04%  github.com/prometheus/alertmanager/api.(*API).Register.(*API).instrumentHandler.func2
         0     0% 51.37%       15MB  7.04%  github.com/prometheus/alertmanager/api.(*API).Register.StripPrefix.func1
         0     0% 51.37%       15MB  7.04%  github.com/prometheus/alertmanager/api/v2.NewAPI.setResponseHeaders.func2
         0     0% 51.37%       15MB  7.04%  github.com/prometheus/alertmanager/api/v2/restapi/operations/alert.(*PostAlerts).ServeHTTP
         0     0% 51.37%       15MB  7.04%  github.com/rs/cors.(*Cors).Handler-fm.(*Cors).Handler.func1
         0     0% 51.37%       15MB  7.04%  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.WithRouteTag.func1
      11MB  5.16% 56.53%       11MB  5.16%  net/http.Header.Clone (inline)
         0     0% 56.53%    10.52MB  4.93%  runtime.mstart
         0     0% 56.53%    10.52MB  4.93%  runtime.mstart0
         0     0% 56.53%    10.52MB  4.93%  runtime.mstart1
         0     0% 56.53%    10.11MB  4.74%  runtime.main
         0     0% 56.53%     8.61MB  4.04%  main.main
         0     0% 56.53%     8.61MB  4.04%  main.run
    2.55MB  1.19% 57.73%     8.06MB  3.78%  github.com/prometheus/alertmanager/dispatch.(*aggrGroup).run
         0     0% 57.73%     8.03MB  3.77%  github.com/prometheus/alertmanager/notify.MultiStage.Exec
    1.50MB   0.7% 58.43%        8MB  3.75%  encoding/json.(*decodeState).object
         0     0% 58.43%        8MB  3.75%  encoding/json.(*decodeState).unmarshal
         0     0% 58.43%        8MB  3.75%  encoding/json.(*decodeState).value
         0     0% 58.43%        8MB  3.75%  encoding/json.Unmarshal
         0     0% 58.43%     7.53MB  3.53%  github.com/prometheus/alertmanager/notify.FanoutStage.Exec.func1
         0     0% 58.43%     7.50MB  3.52%  log/slog.(*Logger).With
         0     0% 58.43%     7.50MB  3.52%  log/slog.(*TextHandler).WithAttrs
         0     0% 58.43%     7.50MB  3.52%  log/slog.(*commonHandler).withAttrs
          0     0% 58.43%     7.50MB  3.52%  github.com/prometheus/alertmanager/api/v2.(*API).postAlertsHandler

alertmanager-pprof go tool pprof -top -inuse_space heap_1gb.pprof | head -60

File: alertmanager
Build ID: 49423fa9419c76d3b878edb2630e31758ab3026e
Type: inuse_space
Time: 2025-12-26 15:33:15 EST
Showing nodes accounting for 201.21MB, 94.35% of 213.26MB total
Dropped 148 nodes (cum <= 1.07MB)
      flat  flat%   sum%        cum   cum%
      31MB 14.54% 14.54%       58MB 27.20%  net/http/httptrace.(*ClientTrace).compose
      27MB 12.66% 27.20%       27MB 12.66%  reflect.MakeFunc
      20MB  9.38% 36.58%       20MB  9.38%  go.opentelemetry.io/contrib/instrumentation/net/http/httptrace/otelhttptrace.NewClientTrace
   17.03MB  7.99% 44.56%    17.03MB  7.99%  runtime.allocm
      11MB  5.16% 49.72%       11MB  5.16%  net/http.Header.Clone (inline)
       9MB  4.22% 53.94%    23.01MB 10.79%  net/http.(*Request).Clone
       7MB  3.28% 57.23%        7MB  3.28%  runtime.malg
    6.50MB  3.05% 60.28%     6.50MB  3.05%  github.com/prometheus/alertmanager/dispatch.getGroupLabels (inline)
    6.50MB  3.05% 63.33%        7MB  3.28%  go.opentelemetry.io/otel/internal/global.(*tracer).newSpan
       6MB  2.81% 66.14%        6MB  2.81%  encoding/json.(*decodeState).literalStore
    5.50MB  2.58% 68.72%     5.50MB  2.58%  unicode/utf8.AppendRune (inline)
       5MB  2.35% 71.07%        5MB  2.35%  github.com/prometheus/alertmanager/api/v2.APILabelSetToModelLabelSet (inline)
    3.50MB  1.64% 72.71%     3.50MB  1.64%  context.(*cancelCtx).Done
    3.50MB  1.64% 74.35%     3.50MB  1.64%  context.WithValue
       3MB  1.41% 75.76%        3MB  1.41%  github.com/prometheus/alertmanager/nflog/nflogpb.(*Entry).Unmarshal
       3MB  1.41% 77.16%        3MB  1.41%  net/http.cloneURL (inline)
    2.55MB  1.19% 78.36%     8.06MB  3.78%  github.com/prometheus/alertmanager/dispatch.(*aggrGroup).run
    2.53MB  1.18% 79.54%     2.53MB  1.18%  context.(*cancelCtx).propagateCancel
    2.51MB  1.18% 80.72%     2.51MB  1.18%  time.newTimer
    2.50MB  1.17% 81.89%     3.50MB  1.64%  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewTransport
    2.50MB  1.17% 83.06%    20.03MB  9.39%  github.com/prometheus/alertmanager/dispatch.newAggrGroup
       2MB  0.94% 84.00%        2MB  0.94%  log/slog.(*commonHandler).clone (inline)
       2MB  0.94% 84.94%        7MB  3.28%  github.com/prometheus/alertmanager/api/v2.OpenAPIAlertsToAlerts
       2MB  0.94% 85.88%     3.03MB  1.42%  context.withCancel (inline)
       2MB  0.94% 86.82%        2MB  0.94%  github.com/prometheus/alertmanager/store.NewAlerts (inline)
       2MB  0.94% 87.75%       63MB 29.54%  net/http/httptrace.WithClientTrace
    1.51MB  0.71% 88.46%     1.51MB  0.71%  regexp/syntax.(*compiler).inst (inline)
    1.50MB   0.7% 89.16%     4.50MB  2.11%  github.com/prometheus/alertmanager/nflog/nflogpb.(*MeshEntry).Unmarshal
    1.50MB   0.7% 89.87%     1.50MB   0.7%  strings.(*Builder).WriteString (inline)
    1.50MB   0.7% 90.57%     3.50MB  1.64%  time.NewTimer
    1.50MB   0.7% 91.27%     1.50MB   0.7%  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp/internal/request.NewBodyWrapper (inline)
    1.50MB   0.7% 91.98%        8MB  3.75%  encoding/json.(*decodeState).object
    1.07MB   0.5% 92.48%     1.07MB   0.5%  compress/flate.(*compressor).initDeflate (inline)
    1.02MB  0.48% 92.96%    28.55MB 13.39%  github.com/prometheus/alertmanager/dispatch.(*Dispatcher).groupAlert
       1MB  0.47% 93.42%     1.50MB   0.7%  github.com/prometheus/alertmanager/nflog.(*Log).Log
    0.88MB  0.41% 93.84%     1.95MB  0.91%  compress/flate.NewWriter (inline)
    0.59MB  0.28% 94.11%     1.60MB  0.75%  github.com/prometheus/alertmanager/config.LoadFile
    0.50MB  0.23% 94.35%     2.01MB  0.94%  github.com/prometheus/alertmanager/pkg/labels.NewMatcher
         0     0% 94.35%     1.95MB  0.91%  bufio.(*Writer).Flush
         0     0% 94.35%     1.07MB   0.5%  compress/flate.(*compressor).init
         0     0% 94.35%     1.95MB  0.91%  compress/gzip.(*Writer).Write
         0     0% 94.35%     3.03MB  1.42%  context.WithCancel
         0     0% 94.35%     2.01MB  0.94%  context.WithDeadline (inline)
         0     0% 94.35%     2.01MB  0.94%  context.WithDeadlineCause
         0     0% 94.35%     2.01MB  0.94%  context.WithTimeout
         0     0% 94.35%     7.50MB  3.52%  encoding/json.(*Decoder).Decode
         0     0% 94.35%     7.50MB  3.52%  encoding/json.(*decodeState).array
         0     0% 94.35%        8MB  3.75%  encoding/json.(*decodeState).unmarshal
         0     0% 94.35%        8MB  3.75%  encoding/json.(*decodeState).value
         0     0% 94.35%        8MB  3.75%  encoding/json.Unmarshal
         0     0% 94.35%     7.50MB  3.52%  github.com/go-openapi/runtime.ConsumerFunc.Consume
         0     0% 94.35%     7.50MB  3.52%  github.com/go-openapi/runtime/middleware.(*Context).BindValidRequest
         0     0% 94.35%       15MB  7.04%  github.com/go-openapi/runtime/middleware.(*Context).RoutesHandler.NewOperationExecutor.func1

As you can see, go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*Transport).RoundTrip accounts for ~54% of the in-use heap.

We found that "feat: add distributed tracing support" was included in this release, and then noticed in notify/util.go that the request function wraps the client's transport in a new tracing transport on every call:

func request(ctx context.Context, client *http.Client, method, url, bodyType string, body io.Reader) (*http.Response, error) {
	req, err := http.NewRequest(method, url, body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", UserAgentHeader)
	if bodyType != "" {
		req.Header.Set("Content-Type", bodyType)
	}

	// Inject trancing transport
	client.Transport = tracing.Transport(client.Transport)

	return client.Do(req.WithContext(ctx))
}

Because each call re-wraps client.Transport, the chain of tracing transports (and the httptrace.ClientTrace closures they compose) grows with every request, which is what shows up in the heap above. To get around this, in this PR we wrap the client's transport once, when the client is created, and re-use it for every request.
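For reference, here's a minimal sketch of the one-time wrapping approach. The exact body of WrapWithTracing added in this PR isn't reproduced in this description, so treat this as an illustration of the idea rather than the actual diff:

// Sketch only, in package notify: assumes the same tracing.Transport helper
// and UserAgentHeader constant used in notify/util.go above.

// WrapWithTracing wraps the client's transport with the tracing transport
// exactly once, when the client is constructed.
func WrapWithTracing(client *http.Client) {
	client.Transport = tracing.Transport(client.Transport)
}

// request no longer touches client.Transport, so no new transport wrapper
// (and no new httptrace.ClientTrace chain) is allocated per request.
func request(ctx context.Context, client *http.Client, method, url, bodyType string, body io.Reader) (*http.Response, error) {
	req, err := http.NewRequest(method, url, body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", UserAgentHeader)
	if bodyType != "" {
		req.Header.Set("Content-Type", bodyType)
	}
	return client.Do(req.WithContext(ctx))
}

Each notifier then calls notify.WrapWithTracing once on the client it builds, instead of request re-wrapping the transport on every call.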

After

I'm testing it by spamming tens of thousands of alerts, so the memory allocation pattern isn't necessarily what you'd see in a normal production workload, but this is what the heap looks like after my change:

alertmanager-pprof go tool pprof -top -cum heap_new.pprof | head -60
File: alertmanager
Build ID: 49423fa9419c76d3b878edb2630e31758ab3026e
Type: inuse_space
Time: Dec 26, 2025 at 10:16pm (UTC)
Showing nodes accounting for 16937.96kB, 100% of 16937.96kB total
      flat  flat%   sum%        cum   cum%
 8721.01kB 51.49% 51.49%  9233.57kB 54.51%  runtime.allocm
         0     0% 51.49%  9233.57kB 54.51%  runtime.newm
         0     0% 51.49%  9233.57kB 54.51%  runtime.resetspinning
         0     0% 51.49%  9233.57kB 54.51%  runtime.schedule
         0     0% 51.49%  9233.57kB 54.51%  runtime.startm
         0     0% 51.49%  9233.57kB 54.51%  runtime.wakep
         0     0% 51.49%  5129.57kB 30.28%  runtime.mstart
         0     0% 51.49%  5129.57kB 30.28%  runtime.mstart0
         0     0% 51.49%  5129.57kB 30.28%  runtime.mstart1
         0     0% 51.49%  4622.60kB 27.29%  runtime.main
         0     0% 51.49%     4104kB 24.23%  runtime.mcall
         0     0% 51.49%     3591kB 21.20%  runtime.park_m
         0     0% 51.49%  3085.09kB 18.21%  main.main
         0     0% 51.49%  3085.09kB 18.21%  main.run
         0     0% 51.49%  2573.05kB 15.19%  github.com/prometheus/alertmanager/api.New
         0     0% 51.49%  2573.05kB 15.19%  github.com/prometheus/alertmanager/api/v2.NewAPI
         0     0% 51.49%  2573.05kB 15.19%  github.com/prometheus/alertmanager/api/v2.getSwaggerSpec
         0     0% 51.49%  1548.70kB  9.14%  github.com/go-openapi/analysis.New
         0     0% 51.49%  1544.39kB  9.12%  github.com/go-openapi/loads.Analyzed
         0     0% 51.49%  1540.58kB  9.10%  sync.(*Once).Do (inline)
         0     0% 51.49%  1540.58kB  9.10%  sync.(*Once).doSlow
         0     0% 51.49%  1537.52kB  9.08%  runtime.doInit (inline)
         0     0% 51.49%  1537.52kB  9.08%  runtime.doInit1
 1036.68kB  6.12% 57.61%  1036.68kB  6.12%  github.com/go-openapi/analysis.(*Spec).reset
         0     0% 57.61%  1028.57kB  6.07%  crypto/tls.(*Conn).HandshakeContext (inline)
         0     0% 57.61%  1028.57kB  6.07%  crypto/tls.(*Conn).clientHandshake
         0     0% 57.61%  1028.57kB  6.07%  crypto/tls.(*Conn).handshakeContext
         0     0% 57.61%  1028.57kB  6.07%  crypto/tls.(*Conn).verifyServerCertificate
         0     0% 57.61%  1028.57kB  6.07%  crypto/tls.(*clientHandshakeStateTLS13).handshake
         0     0% 57.61%  1028.57kB  6.07%  crypto/tls.(*clientHandshakeStateTLS13).readServerCertificate
         0     0% 57.61%  1028.57kB  6.07%  crypto/x509.(*CertPool).AppendCertsFromPEM
         0     0% 57.61%  1028.57kB  6.07%  crypto/x509.(*Certificate).Verify
         0     0% 57.61%  1028.57kB  6.07%  crypto/x509.initSystemRoots
         0     0% 57.61%  1028.57kB  6.07%  crypto/x509.loadSystemRoots
         0     0% 57.61%  1028.57kB  6.07%  crypto/x509.systemRootsPool
         0     0% 57.61%  1028.57kB  6.07%  net/http.(*persistConn).addTLS.func2
         0     0% 57.61%  1024.64kB  6.05%  regexp.Compile (inline)
         0     0% 57.61%  1024.64kB  6.05%  regexp.MustCompile
  512.08kB  3.02% 60.63%  1024.64kB  6.05%  regexp.compile
 1024.44kB  6.05% 66.68%  1024.44kB  6.05%  runtime.malg
         0     0% 66.68%  1024.44kB  6.05%  runtime.newproc.func1
         0     0% 66.68%  1024.44kB  6.05%  runtime.newproc1
         0     0% 66.68%  1024.44kB  6.05%  runtime.systemstack
         0     0% 66.68%  1024.35kB  6.05%  encoding/json.(*decodeState).object
         0     0% 66.68%  1024.35kB  6.05%  encoding/json.(*decodeState).unmarshal
         0     0% 66.68%  1024.35kB  6.05%  encoding/json.(*decodeState).value
         0     0% 66.68%  1024.35kB  6.05%  encoding/json.Unmarshal
         0     0% 66.68%  1024.35kB  6.05%  github.com/go-openapi/spec.(*Schema).UnmarshalJSON
         0     0% 66.68%  1024.35kB  6.05%  github.com/go-openapi/spec.MustLoadSwagger20Schema (inline)
         0     0% 66.68%  1024.35kB  6.05%  github.com/go-openapi/spec.Swagger20Schema
  516.76kB  3.05% 69.73%   516.76kB  3.05%  runtime.procresize
         0     0% 69.73%   516.76kB  3.05%  runtime.rt0_go
         0     0% 69.73%   516.76kB  3.05%  runtime.schedinit
   516.01kB  3.05% 72.78%   516.01kB  3.05%  crypto/x509.(*CertPool).addCertFunc (inline)

alertmanager-pprof go tool pprof -top -inuse_space heap_new.pprof | head -60
File: alertmanager
Build ID: 49423fa9419c76d3b878edb2630e31758ab3026e
Type: inuse_space
Time: Dec 26, 2025 at 10:16pm (UTC)
Showing nodes accounting for 16937.96kB, 100% of 16937.96kB total
      flat  flat%   sum%        cum   cum%
 8721.01kB 51.49% 51.49%  9233.57kB 54.51%  runtime.allocm
 1036.68kB  6.12% 57.61%  1036.68kB  6.12%  github.com/go-openapi/analysis.(*Spec).reset
 1024.44kB  6.05% 63.66%  1024.44kB  6.05%  runtime.malg
  516.76kB  3.05% 66.71%   516.76kB  3.05%  runtime.procresize
  516.01kB  3.05% 69.75%   516.01kB  3.05%  crypto/x509.(*CertPool).addCertFunc (inline)
  512.88kB  3.03% 72.78%   512.88kB  3.03%  google.golang.org/protobuf/internal/filedesc.(*Message).unmarshalFull
  512.56kB  3.03% 75.81%   512.56kB  3.03%  encoding/pem.Decode
  512.56kB  3.03% 78.83%   512.56kB  3.03%  regexp.onePassCopy
  512.56kB  3.03% 81.86%   512.56kB  3.03%  runtime.makeProfStackFP (inline)
  512.28kB  3.02% 84.88%   512.28kB  3.02%  reflect.mapassign0
  512.08kB  3.02% 87.91%  1024.64kB  6.05%  regexp.compile
  512.07kB  3.02% 90.93%   512.07kB  3.02%  net/url.parse
  512.03kB  3.02% 93.95%   512.03kB  3.02%  text/template/parse.(*ListNode).append (inline)
  512.02kB  3.02% 96.98%   512.02kB  3.02%  github.com/go-openapi/analysis.(*Spec).analyzeSchema
  512.01kB  3.02%   100%   512.01kB  3.02%  mime.setExtensionType
         0     0%   100%  1028.57kB  6.07%  crypto/tls.(*Conn).HandshakeContext (inline)
         0     0%   100%  1028.57kB  6.07%  crypto/tls.(*Conn).clientHandshake
         0     0%   100%  1028.57kB  6.07%  crypto/tls.(*Conn).handshakeContext
         0     0%   100%  1028.57kB  6.07%  crypto/tls.(*Conn).verifyServerCertificate
         0     0%   100%  1028.57kB  6.07%  crypto/tls.(*clientHandshakeStateTLS13).handshake
         0     0%   100%  1028.57kB  6.07%  crypto/tls.(*clientHandshakeStateTLS13).readServerCertificate
         0     0%   100%  1028.57kB  6.07%  crypto/x509.(*CertPool).AppendCertsFromPEM
         0     0%   100%  1028.57kB  6.07%  crypto/x509.(*Certificate).Verify
         0     0%   100%  1028.57kB  6.07%  crypto/x509.initSystemRoots
         0     0%   100%  1028.57kB  6.07%  crypto/x509.loadSystemRoots
         0     0%   100%  1028.57kB  6.07%  crypto/x509.systemRootsPool
         0     0%   100%  1024.35kB  6.05%  encoding/json.(*decodeState).object
         0     0%   100%  1024.35kB  6.05%  encoding/json.(*decodeState).unmarshal
         0     0%   100%  1024.35kB  6.05%  encoding/json.(*decodeState).value
         0     0%   100%  1024.35kB  6.05%  encoding/json.Unmarshal
         0     0%   100%   512.56kB  3.03%  github.com/aws/aws-sdk-go-v2/service/sns/internal/endpoints.init
         0     0%   100%   512.02kB  3.02%  github.com/go-openapi/analysis.(*Spec).initialize
         0     0%   100%  1548.70kB  9.14%  github.com/go-openapi/analysis.New
         0     0%   100%   512.07kB  3.02%  github.com/go-openapi/jsonreference.(*Ref).parse
         0     0%   100%   512.07kB  3.02%  github.com/go-openapi/jsonreference.New (inline)
         0     0%   100%  1544.39kB  9.12%  github.com/go-openapi/loads.Analyzed
         0     0%   100%   512.07kB  3.02%  github.com/go-openapi/spec.(*Ref).fromMap
         0     0%   100%  1024.35kB  6.05%  github.com/go-openapi/spec.(*Schema).UnmarshalJSON
         0     0%   100%  1024.35kB  6.05%  github.com/go-openapi/spec.MustLoadSwagger20Schema (inline)
         0     0%   100%  1024.35kB  6.05%  github.com/go-openapi/spec.Swagger20Schema
         0     0%   100%   512.01kB  3.02%  github.com/julienschmidt/httprouter.(*Router).ServeHTTP
         0     0%   100%   512.01kB  3.02%  github.com/prometheus/alertmanager/api.(*API).limitHandler.func1
         0     0%   100%  2573.05kB 15.19%  github.com/prometheus/alertmanager/api.New
         0     0%   100%  2573.05kB 15.19%  github.com/prometheus/alertmanager/api/v2.NewAPI
         0     0%   100%  2573.05kB 15.19%  github.com/prometheus/alertmanager/api/v2.getSwaggerSpec
         0     0%   100%   512.03kB  3.02%  github.com/prometheus/alertmanager/config.(*Coordinator).Reload
         0     0%   100%   512.03kB  3.02%  github.com/prometheus/alertmanager/config.(*Coordinator).notifySubscribers (inline)
         0     0%   100%   512.03kB  3.02%  github.com/prometheus/alertmanager/template.(*Template).Parse
         0     0%   100%   512.03kB  3.02%  github.com/prometheus/alertmanager/template.FromGlobs
         0     0%   100%   512.01kB  3.02%  github.com/prometheus/alertmanager/ui.Register.func1
         0     0%   100%   512.01kB  3.02%  github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func1
         0     0%   100%   512.01kB  3.02%  github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1
         0     0%   100%   512.01kB  3.02%  github.com/prometheus/common/route.(*Router).ServeHTTP
         0     0%   100%   512.01kB  3.02%  github.com/prometheus/common/route.(*Router).handle.func1

As you can see, much less! For what it's worth, we don't currently collect traces from Alertmanager (yet), but it looks like my change didn't break anything (screenshot attached).

Summary

We now wrap the tracing Transport around the client once and re-use it, instead of creating a new wrapper for each request, resulting in a significant decrease in memory usage:

| Metric | Before Fix | After Fix (31,000+ alerts) | Improvement |
| --- | --- | --- | --- |
| Total Heap | 213.35 MB | 21.2 MB | -90% reduction |
| otelhttp.Transport.RoundTrip | 114.51 MB (53.7%) | 2.05 MB (9.7%) | -98% reduction |
| httptrace.WithClientTrace | 63 MB (29.5%) | 0 MB | Eliminated |
| otelhttptrace.NewClientTrace | 20 MB (9.4%) | 0.512 MB (2.4%) | -97% reduction |

For what it's worth, the machine we tested on did not have any OOMKills within the ~2-3 weeks we've been testing v0.30.0; we only saw this in production, which is under considerably more load.

I'm not sure what the urgency is here for other folks, or whether anyone else is seeing similar behavior, but we've rolled back to v0.28.1 for now, so it's not too big of a deal for us (although we would like to get back up to v0.30.x soon!).

Signed-off-by: Cody Kaczynski [email protected]

@cxdy force-pushed the cxdy/fix-memory-leak branch from 4742b52 to 84f13fa on December 26, 2025 at 22:57

// WrapWithTracing wraps an HTTP client's transport with tracing instrumentation.
// This should be called once when creating the client, not on every request.
func WrapWithTracing(client *http.Client) {
Contributor:
We could maybe use sync.Once here... But I also don't fully love the need to call notify.WrapWithTracing from all notifiers. Is there a way to replace the injection with one that reuses a client, avoiding the memory overuse? I am OOO until next week so I can't try out options until then, and @siavashs should also be back then, so we can look at options. Or we can submit this and then look into improvements later, given the issue.
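One caveat with a single package-level sync.Once: it would only wrap the first client that passes through request, since each notifier builds its own *http.Client, so a per-client guard would be needed. A rough, untested sketch of that idea (ensureTracingTransport is a hypothetical name, not code from this PR):

// Hypothetical per-client guard in notify/util.go: wrap each client's
// transport at most once, instead of on every call to request.
var tracedClients sync.Map // set of *http.Client values that are already wrapped

func ensureTracingTransport(client *http.Client) {
	if _, alreadyWrapped := tracedClients.LoadOrStore(client, struct{}{}); !alreadyWrapped {
		client.Transport = tracing.Transport(client.Transport)
	}
}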


// Inject trancing transport
client.Transport = tracing.Transport(client.Transport)

Contributor:
Would you consider trying to wrap this call with a sync.Once (we could have one at the notify/util level?) and seeing whether tracing still works, without the leak, while also avoiding having to make the call in each notifier? Or, alternatively, should the wrapping happen in httpclient, err := commoncfg.NewClientFromConfig(*conf.HTTPConfig, "telegram", httpOpts...), or in a function wrapping that one, so we avoid forgetting to call it when adding a new notifier? See the sketch below.
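Wrapping the client factory could look roughly like this (a sketch only; newTracedClientFromConfig is a hypothetical helper name, commoncfg is github.com/prometheus/common/config, and tracing.Transport is the wrapper from this repo's tracing package):

// Hypothetical helper that builds a notifier's HTTP client and injects the
// tracing transport in one place, so individual notifiers can't forget it.
func newTracedClientFromConfig(cfg commoncfg.HTTPClientConfig, name string, opts ...commoncfg.HTTPClientOption) (*http.Client, error) {
	client, err := commoncfg.NewClientFromConfig(cfg, name, opts...)
	if err != nil {
		return nil, err
	}
	// Wrap exactly once, at construction time, rather than per request.
	client.Transport = tracing.Transport(client.Transport)
	return client, nil
}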
