Bug Report
What is the issue?
linkerd-destination got OOM-killed and failed all traffic when two pods were assigned the same IP, one of which was still in the "Terminating" phase.
This is caused by an AWS CNI behavior: the IP of a pod that is still terminating can already be reassigned to a new pod, as the listing below shows (a diagnostic sketch follows it).
prod personio--de-web-personio-web-6fbcd49f86-x79ch 6/6 Running 0 4h 10.250.187.3 ip-10-250-180-130.eu-central-1.compute.internal <none> <none>
prod queues-work-sqs-reminders-automatic-5b8cff7756-dzlr2 1/3 Terminating 0 4h40m 10.250.187.3 ip-10-250-180-130.eu-central-1.compute.internal <none> <none>
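For reference, one way to spot this condition is to list all pods and flag any IP owned by more than one pod. The following is a minimal diagnostic sketch (not part of Linkerd) using client-go; the kubeconfig location and the host-network filter are assumptions:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: kubeconfig at the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// List pods in all namespaces.
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Group pod names by their assigned IP, marking terminating pods.
	byIP := map[string][]string{}
	for _, p := range pods.Items {
		if p.Status.PodIP == "" || p.Spec.HostNetwork {
			// Skip pods without an IP yet, and host-network pods, which
			// legitimately share the node IP.
			continue
		}
		name := p.Namespace + "/" + p.Name
		if p.DeletionTimestamp != nil {
			name += " (Terminating)"
		}
		byIP[p.Status.PodIP] = append(byIP[p.Status.PodIP], name)
	}

	// Any IP owned by more than one pod reproduces the condition above.
	for ip, names := range byIP {
		if len(names) > 1 {
			fmt.Printf("duplicate IP %s: %v\n", ip, names)
		}
	}
}
```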
Logs, error output, etc
Statuses of the linkerd-destination pods:
linkerd-destination-795d6d557c-4xk8l 1/2 CrashLoopBackOff 12 21h
linkerd-destination-795d6d557c-97zq9 1/2 CrashLoopBackOff 12 5d8h
linkerd-destination-795d6d557c-9plvm 1/2 CrashLoopBackOff 34 162m
linkerd-destination-795d6d557c-g9wh8 1/2 CrashLoopBackOff 20 12h
linkerd-destination-795d6d557c-jb59j 1/2 CrashLoopBackOff 15 6d15h
linkerd-destination-795d6d557c-wwpqp 1/2 CrashLoopBackOff 23 18h
The state of the linkerd-destination container in those pods:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Tue, 16 Mar 2021 07:36:12 +0100
Finished: Tue, 16 Mar 2021 07:36:30 +0100
Ready: False
Restart Count: 12
Meanwhile, a huge amount of logs like the following is emitted from the linkerd-proxy of linkerd-prometheus:
[301.841267s] WARN ThreadId(01) outbound:accept{client.addr=10.250.156.233:58946
target.addr=10.250.187.3:4191}: linkerd_service_profiles::client: Could not fetch
profile error=status: Unknown, message: "http2 error: protocol error: unexpected
internal error encountered", details: [], metadata: MetadataMap { headers: {} }
linkerd check output
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
issuer certificate will expire on 2021-05-07T08:57:23Z
see https://linkerd.io/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running
linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
is running version 2.9.2 but the latest stable version is 2.10.0
see https://linkerd.io/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 2.9.2 but the latest stable version is 2.10.0
see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match
linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
linkerd-prometheus
------------------
√ prometheus add-on service account exists
√ prometheus add-on config map exists
√ prometheus pod is running
linkerd-grafana
---------------
√ grafana add-on service account exists
√ grafana add-on config map exists
√ grafana pod is running
Status check results are √
Environment
- Kubernetes Version: v1.18.9
- Cluster Environment: EKS
- Linkerd version: 2.9.2
Possible solution
The issue can be fixed by making the IPWatcher intentionally ignore pods that are in the "Terminating" phase, similar to #5412.
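As a rough illustration of what that could look like (a hypothetical sketch, not Linkerd's actual IPWatcher code; `podByIP` and its signature are made up for the example):

```go
package watcher

import corev1 "k8s.io/api/core/v1"

// podByIP is a hypothetical lookup: given the pods currently indexed and
// an IP, return the pod that owns it, skipping any pod whose
// DeletionTimestamp is set (i.e. a pod in the "Terminating" phase), so a
// Running pod that reuses the IP is the only match.
func podByIP(pods []corev1.Pod, ip string) *corev1.Pod {
	for i := range pods {
		p := &pods[i]
		if p.DeletionTimestamp != nil {
			// The pod is terminating; with the AWS CNI its IP may already
			// belong to another pod, so ignore it here.
			continue
		}
		if p.Status.PodIP == ip {
			return p
		}
	}
	return nil
}
```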