
Linkerd-destination OOM caused by pods with duplicate IP address #5939

@Wenliang-CHEN

Description


Bug Report

What is the issue?

Linkerd-destination gets OOM-killed and fails all traffic when 2 pods are assigned the same IP, one of which is still in the "Terminating" phase.

This is caused by an AWS CNI behavior: the terminating pod's IP can be reassigned to a new pod while the old pod is still shown as "Terminating", as in the listing below.

prod                           personio--de-web-personio-web-6fbcd49f86-x79ch                    6/6     Running            0          4h      10.250.187.3     ip-10-250-180-130.eu-central-1.compute.internal   <none>           <none>
prod                           queues-work-sqs-reminders-automatic-5b8cff7756-dzlr2              1/3     Terminating        0          4h40m   10.250.187.3     ip-10-250-180-130.eu-central-1.compute.internal   <none>           <none>
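For reference, a listing like the one above can be reproduced by filtering a wide pod listing on the reused IP (the IP here is the one from this incident):

kubectl get pods --all-namespaces -o wide | grep 10.250.187.3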

Logs, error output, etc

Statuses of the linkerd-destination pods:

linkerd-destination-795d6d557c-4xk8l      1/2     CrashLoopBackOff   12         21h
linkerd-destination-795d6d557c-97zq9      1/2     CrashLoopBackOff   12         5d8h
linkerd-destination-795d6d557c-9plvm      1/2     CrashLoopBackOff   34         162m
linkerd-destination-795d6d557c-g9wh8      1/2     CrashLoopBackOff   20         12h
linkerd-destination-795d6d557c-jb59j      1/2     CrashLoopBackOff   15         6d15h
linkerd-destination-795d6d557c-wwpqp      1/2     CrashLoopBackOff   23         18h

The state of the linkerd-destination container in those pods:

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 16 Mar 2021 07:36:12 +0100
      Finished:     Tue, 16 Mar 2021 07:36:30 +0100
    Ready:          False
    Restart Count:  12

Meanwhile, a huge number of log lines like the one below are emitted by the linkerd-proxy container of linkerd-prometheus:

[301.841267s]  WARN ThreadId(01) outbound:accept{client.addr=10.250.156.233:58946
 target.addr=10.250.187.3:4191}: linkerd_service_profiles::client: Could not fetch 
 profile error=status: Unknown, message: "http2 error: protocol error: unexpected 
 internal error encountered", details: [], metadata: MetadataMap { headers: {} }

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2021-05-07T08:57:23Z
    see https://linkerd.io/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running
linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 2.9.2 but the latest stable version is 2.10.0
    see https://linkerd.io/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.9.2 but the latest stable version is 2.10.0
    see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match
linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
linkerd-prometheus
------------------
√ prometheus add-on service account exists
√ prometheus add-on config map exists
√ prometheus pod is running
linkerd-grafana
---------------
√ grafana add-on service account exists
√ grafana add-on config map exists
√ grafana pod is running
Status check results are √

Environment

  • Kubernetes Version: v1.18.9
  • Cluster Environment: EKS
  • Linkerd version: 2.9.2

Possible solution

The issue could be fixed by making the IPWatcher intentionally ignore pods that are in the "Terminating" phase, similar to #5412. A sketch of the idea follows.
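
A minimal sketch of that check, assuming a small helper in the destination watcher package (the function name and how it would be wired into the IPWatcher's pod handlers are illustrative, not the actual Linkerd code):

package watcher

import (
	corev1 "k8s.io/api/core/v1"
)

// podTerminating reports whether a pod should be excluded from IP-based
// lookups. kubectl shows "Terminating" when metadata.deletionTimestamp is
// set, so that field is the signal to check; pods that have already reached
// a terminal phase are excluded as well.
func podTerminating(pod *corev1.Pod) bool {
	if pod.DeletionTimestamp != nil {
		return true
	}
	return pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed
}

With a check like this in the pod add/update handlers, a pod whose IP has already been handed to a new pod would not be kept in the IP index, so a lookup for 10.250.187.3 would resolve only to the Running pod.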
