
Linkerd-destination OOM caused by pods with duplicate IP address #5939

@Wenliang-CHEN

Description


Bug Report

What is the issue?

Linkerd-destination gets OOM-killed and fails all traffic when 2 pods are assigned the same IP, one of which is still in the "Terminating" phase.

This is caused by an AWS CNI behavior: the terminating pod's IP can be reassigned to a new pod while the old pod is still shown as "Terminating", as in the listing below.

prod                           personio--de-web-personio-web-6fbcd49f86-x79ch                    6/6     Running            0          4h      10.250.187.3     ip-10-250-180-130.eu-central-1.compute.internal   <none>           <none>
prod                           queues-work-sqs-reminders-automatic-5b8cff7756-dzlr2              1/3     Terminating        0          4h40m   10.250.187.3     ip-10-250-180-130.eu-central-1.compute.internal   <none>           <none>
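For reference, a listing like the one above can be reproduced by filtering a wide pod listing on the reused IP (the IP here is the one from this incident):

kubectl get pods --all-namespaces -o wide | grep 10.250.187.3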

Logs, error output, etc

Statuses of the linkerd-destination pods:

linkerd-destination-795d6d557c-4xk8l      1/2     CrashLoopBackOff   12         21h
linkerd-destination-795d6d557c-97zq9      1/2     CrashLoopBackOff   12         5d8h
linkerd-destination-795d6d557c-9plvm      1/2     CrashLoopBackOff   34         162m
linkerd-destination-795d6d557c-g9wh8      1/2     CrashLoopBackOff   20         12h
linkerd-destination-795d6d557c-jb59j      1/2     CrashLoopBackOff   15         6d15h
linkerd-destination-795d6d557c-wwpqp      1/2     CrashLoopBackOff   23         18h

The state of the linkerd-destination container in those pods:

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 16 Mar 2021 07:36:12 +0100
      Finished:     Tue, 16 Mar 2021 07:36:30 +0100
    Ready:          False
    Restart Count:  12

Meanwhile, a huge number of log lines like the one below are emitted by the linkerd-proxy container of linkerd-prometheus:

[301.841267s]  WARN ThreadId(01) outbound:accept{client.addr=10.250.156.233:58946
 target.addr=10.250.187.3:4191}: linkerd_service_profiles::client: Could not fetch 
 profile error=status: Unknown, message: "http2 error: protocol error: unexpected 
 internal error encountered", details: [], metadata: MetadataMap { headers: {} }

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2021-05-07T08:57:23Z
    see https://linkerd.io/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running
linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 2.9.2 but the latest stable version is 2.10.0
    see https://linkerd.io/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.9.2 but the latest stable version is 2.10.0
    see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match
linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
linkerd-prometheus
------------------
√ prometheus add-on service account exists
√ prometheus add-on config map exists
√ prometheus pod is running
linkerd-grafana
---------------
√ grafana add-on service account exists
√ grafana add-on config map exists
√ grafana pod is running
Status check results are √

Environment

  • Kubernetes Version: v1.18.9
  • Cluster Environment: EKS
  • Linkerd version: 2.9.2

Possible solution

The issue could be fixed by making the IPWatcher intentionally ignore pods that are in the "Terminating" phase, similar to #5412. A sketch of the idea follows.
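
A minimal sketch of that check, assuming a small helper in the destination watcher package (the function name and how it would be wired into the IPWatcher's pod handlers are illustrative, not the actual Linkerd code):

package watcher

import (
	corev1 "k8s.io/api/core/v1"
)

// podTerminating reports whether a pod should be excluded from IP-based
// lookups. kubectl shows "Terminating" when metadata.deletionTimestamp is
// set, so that field is the signal to check; pods that have already reached
// a terminal phase are excluded as well.
func podTerminating(pod *corev1.Pod) bool {
	if pod.DeletionTimestamp != nil {
		return true
	}
	return pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed
}

With a check like this in the pod add/update handlers, a pod whose IP has already been handed to a new pod would not be kept in the IP index, so a lookup for 10.250.187.3 would resolve only to the Running pod.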
