Description
What is the issue?
I've noticed the linkerd destination and proxy injector control plane components restart every now and then due to an OOMKilled error.
I am running linkerd w/ the recommended production level configs (e.g. 3 instances of each control plane component).
The destination and injector components have been assigned a 250Mi memory limit.
All three replicas of each component restart at about the same time (give or take a few minutes), exiting w/ the same OOMKilled error (exit code 137).
Here are some resource usage charts. The first one is linkerd destination's resource usage over the past month:
And this one shows the proxy injector's resource usage over the past month:
Why do these spikes occur? Perhaps they are associated w/ the rollout of a lot of pods? But that doesn't explain all of the spikes, because I know for sure we didn't do any major rollout around some of them.
The linkerd identity component does not show the same behavior.
The cluster that linkerd is running on has several hundred pods running. Could linkerd be running into issues w/ handling that many pods? How many pods can linkerd handle w/ the production level configuration?
Thank you for the help.
How can it be reproduced?
N/A
Logs, error output, etc
This is what the pod state shows for all of the linkerd destination and injector replicas (the times vary by a few minutes):
    State:          Running
      Started:      Fri, 15 Apr 2022 05:41:27 -0400
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 04 Apr 2022 19:17:12 -0400
      Finished:     Fri, 15 Apr 2022 05:41:26 -0400
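For anyone who wants to collect the same information across all replicas at once, here is a rough sketch in Python. It reads the `lastState.terminated` field of each container status from `kubectl get pods -o json` and reports how long each container ran before it was OOMKilled. The `linkerd` namespace and the helper names `oom_killed_containers` and `fetch_pods` are assumptions for illustration, and `kubectl` needs to be on `$PATH`.

```python
import json
import subprocess
from datetime import datetime


def _parse_ts(value):
    # Kubernetes timestamps are RFC 3339 with a trailing "Z" (UTC).
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def oom_killed_containers(pod_list):
    """Return one record per container whose last termination was OOMKilled."""
    records = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for status in statuses:
            terminated = status.get("lastState", {}).get("terminated")
            if not terminated or terminated.get("reason") != "OOMKilled":
                continue
            started = _parse_ts(terminated["startedAt"])
            finished = _parse_ts(terminated["finishedAt"])
            records.append({
                "pod": pod_name,
                "container": status["name"],
                "exit_code": terminated.get("exitCode"),
                "finished": finished,
                "ran_for": finished - started,  # uptime before the kill
            })
    return records


def fetch_pods(namespace="linkerd"):
    # Assumption: the control plane runs in the "linkerd" namespace.
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    for rec in oom_killed_containers(fetch_pods()):
        print(rec["pod"], rec["container"], rec["exit_code"],
              rec["finished"].isoformat(), rec["ran_for"])
```

Applied to the state shown above, the `ran_for` value would be a little over 10 days (04 Apr 19:17:12 to 15 Apr 05:41:26), which may help when lining the restarts up against the memory charts.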
output of linkerd check -o short
Linkerd core checks
===================
kubernetes-version
------------------
× is running the minimum kubectl version
exec: "kubectl": executable file not found in $PATH
see https://linkerd.io/2.11/checks/#kubectl-version for hints
linkerd-webhooks-and-apisvc-tls
-------------------------------
‼ proxy-injector cert is valid for at least 60 days
certificate will expire on 2022-04-16T10:19:53Z
see https://linkerd.io/2.11/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints
‼ sp-validator cert is valid for at least 60 days
certificate will expire on 2022-04-16T10:19:27Z
see https://linkerd.io/2.11/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints
Status check results are ×
Linkerd extensions checks
=========================
linkerd-viz
-----------
‼ tap API server cert is valid for at least 60 days
certificate will expire on 2022-06-08T17:09:25Z
see https://linkerd.io/2.11/checks/#l5d-tap-cert-not-expiring-soon for hints
‼ linkerd-viz pods are injected
could not find proxy container for prometheus-797c7d558b-hrfqc pod
see https://linkerd.io/2.11/checks/#l5d-viz-pods-injection for hints
‼ viz extension proxies and cli versions match
prometheus-797c7d558b-hrfqc running but cli running stable-2.11.1
see https://linkerd.io/2.11/checks/#l5d-viz-proxy-cli-version for hints
Status check results are √
Environment
- Kubernetes Version: 1.20.15-gke.2500
- Cluster Environment: GKE
- Host OS: cos_containerd
- Linkerd version: 2.11.1
Possible solution
N/A
Additional context
N/A
Would you like to work on fixing this bug?
No response