Receiving ssl handshake error from kubernetes APIserver and opentelemetry webhook  #2956

Open
@hesamhamdarsi

Component(s)

No response

Describe the issue you're reporting

Description:

We are observing intermittent TLS handshake errors on the otel operator's webhook server (default port 9443). The failing connections come from the internal IPs of the Kubernetes API server.

Steps to reproduce:

Deploy the operator Helm chart with only a few changes for our use case:

admissionWebhooks:
  namespaceSelector:
    matchLabels:
      otel-injection: enabled
manager:
  podAnnotations:
    sidecar.istio.io/inject: "false"
  resources:
    limits:
      memory: 256Mi

Expected Result:

The OpenTelemetry operator and collectors work fine, but the operator pod logs show TLS handshake errors occurring from time to time between the API server and the operator's webhook server. We could not find any problem with the validating webhook or the mutating webhook; both seem to be working fine.
10.40.76.248 is the internal service IP of the Kubernetes API server.
10.40.99.143 is the pod IP of the OpenTelemetry operator.

2024/05/14 09:10:01 http: TLS handshake error from 10.40.76.248:55276: read tcp 10.40.99.143:9443->10.40.76.248:55276: read: connection reset by peer
2024/05/14 09:15:00 http: TLS handshake error from 10.40.76.248:36546: read tcp 10.40.99.143:9443->10.40.76.248:36546: read: connection reset by peer
2024/05/14 09:15:00 http: TLS handshake error from 10.40.76.248:36562: read tcp 10.40.99.143:9443->10.40.76.248:36562: read: connection reset by peer
2024/05/14 09:18:00 http: TLS handshake error from 10.40.76.248:39346: read tcp 10.40.99.143:9443->10.40.76.248:39346: read: connection reset by peer
2024/05/14 09:18:00 http: TLS handshake error from 10.40.76.248:39360: EOF
2024/05/14 09:18:00 http: TLS handshake error from 10.40.76.248:39370: read tcp 10.40.99.143:9443->10.40.76.248:39370: read: connection reset by peer
2024/05/14 09:25:00 http: TLS handshake error from 10.40.76.248:42412: EOF
2024/05/14 09:28:00 http: TLS handshake error from 10.40.76.248:50632: EOF
2024/05/14 09:35:00 http: TLS handshake error from 10.40.76.248:34974: read tcp 10.40.99.143:9443->10.40.76.248:34974: read: connection reset by peer
2024/05/14 09:40:00 http: TLS handshake error from 10.40.76.248:53388: read tcp 10.40.99.143:9443->10.40.76.248:53388: read: connection reset by peer
2024/05/14 09:45:00 http: TLS handshake error from 10.40.76.248:50526: read tcp 10.40.99.143:9443->10.40.76.248:50526: read: connection reset by peer
2024/05/14 09:45:00 http: TLS handshake error from 10.40.76.248:50534: read tcp 10.40.99.143:9443->10.40.76.248:50534: read: connection reset by peer
2024/05/14 09:48:00 http: TLS handshake error from 10.40.76.248:39272: EOF
2024/05/14 09:50:00 http: TLS handshake error from 10.40.76.248:33666: read tcp 10.40.99.143:9443->10.40.76.248:33666: read: connection reset by peer
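To make the pattern in these log lines easier to see, here is a minimal Python sketch that groups the errors by failure reason and by minute. The sample lines are copied from the log above; the function name `summarize` and the regular expression are illustrative assumptions, not part of the operator or any Kubernetes tooling.

```python
import re
from collections import Counter
from datetime import datetime

# A few lines copied verbatim from the operator log above.
LOG_LINES = """\
2024/05/14 09:10:01 http: TLS handshake error from 10.40.76.248:55276: read tcp 10.40.99.143:9443->10.40.76.248:55276: read: connection reset by peer
2024/05/14 09:15:00 http: TLS handshake error from 10.40.76.248:36546: read tcp 10.40.99.143:9443->10.40.76.248:36546: read: connection reset by peer
2024/05/14 09:18:00 http: TLS handshake error from 10.40.76.248:39360: EOF
2024/05/14 09:25:00 http: TLS handshake error from 10.40.76.248:42412: EOF
""".splitlines()

# Hypothetical pattern for the Go net/http "TLS handshake error" log format
# as it appears in this issue.
PATTERN = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) http: TLS handshake error "
    r"from (?P<ip>[\d.]+):(?P<port>\d+): (?P<reason>.*)$"
)


def summarize(lines):
    """Return (errors per reason, errors per HH:MM) for matching log lines."""
    by_reason = Counter()
    by_minute = Counter()
    for line in lines:
        match = PATTERN.match(line.strip())
        if match is None:
            continue  # skip anything that is not a handshake error line
        reason = match["reason"]
        if reason == "EOF":
            label = "EOF"
        elif reason.endswith("connection reset by peer"):
            label = "connection reset by peer"
        else:
            label = "other"
        ts = datetime.strptime(match["ts"], "%Y/%m/%d %H:%M:%S")
        by_reason[label] += 1
        by_minute[ts.strftime("%H:%M")] += 1
    return by_reason, by_minute


if __name__ == "__main__":
    reasons, minutes = summarize(LOG_LINES)
    print(dict(reasons))
    print(sorted(minutes))
```

Fed with the full operator log, a summary like this shows whether the errors cluster at fixed times, which can help correlate them with periodic activity elsewhere in the cluster.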

Troubleshooting steps:

To rule out rate limiting between the API server and the otel operator, we checked the API server logs as well as the API Priority and Fairness configuration used by the API server to handle requests, and we did not observe any suspicious behaviour there:

kubectl get flowschemas                                                                                                                                                                                                
NAME                           PRIORITYLEVEL     MATCHINGPRECEDENCE   DISTINGUISHERMETHOD   AGE     MISSINGPL
exempt                         exempt            1                    <none>                2y87d   False
eks-exempt                     exempt            2                    <none>                262d    False
probes                         exempt            2                    <none>                2y87d   False
system-leader-election         leader-election   100                  ByUser                2y87d   False
endpoint-controller            workload-high     150                  ByUser                200d    False
workload-leader-election       leader-election   200                  ByUser                2y87d   False
system-node-high               node-high         400                  ByUser                455d    False
system-nodes                   system            500                  ByUser                2y87d   False
kube-controller-manager        workload-high     800                  ByNamespace           2y87d   False
kube-scheduler                 workload-high     800                  ByNamespace           2y87d   False
kube-system-service-accounts   workload-high     900                  ByNamespace           2y87d   False
eks-workload-high              workload-high     1000                 ByUser                172d    False
service-accounts               workload-low      9000                 ByUser                2y87d   False
global-default                 global-default    9900                 ByUser                2y87d   False
catch-all                      catch-all         10000                ByUser                2y87d   False

kubectl get prioritylevelconfiguration 
NAME              TYPE      NOMINALCONCURRENCYSHARES   QUEUES   HANDSIZE   QUEUELENGTHLIMIT   AGE
catch-all         Limited   5                          <none>   <none>     <none>             2y87d
exempt            Exempt    <none>                     <none>   <none>     <none>             2y87d
global-default    Limited   20                         128      6          50                 2y87d
leader-election   Limited   10                         16       4          50                 2y87d
node-high         Limited   40                         64       6          50                 455d
system            Limited   30                         64       6          50                 2y87d
workload-high     Limited   40                         128      6          50                 2y87d
workload-low      Limited   100                        128      6          50                 2y87d

kubectl get --raw /metrics | grep 'apiserver_flowcontrol_request_concurrency_in_use.*workload-low' 
apiserver_flowcontrol_request_concurrency_in_use{flow_schema="service-accounts",priority_level="workload-low"} 0            # current

kubectl get --raw /metrics | grep 'apiserver_flowcontrol_current_inqueue_requests.*workload-low' 
apiserver_flowcontrol_current_inqueue_requests{flow_schema="service-accounts",priority_level="workload-low"} 0              # queue
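The same check can be scripted. The sketch below parses metric lines in the Prometheus text format and reports any priority level with a non-zero value. The two sample lines are the ones shown above (without the trailing annotations); the helper names `parse_samples` and `busy_levels` are hypothetical, and the parser only handles simple labelled samples like these.

```python
import re

# Sample output copied from the two kubectl commands above.
METRICS_TEXT = """\
apiserver_flowcontrol_request_concurrency_in_use{flow_schema="service-accounts",priority_level="workload-low"} 0
apiserver_flowcontrol_current_inqueue_requests{flow_schema="service-accounts",priority_level="workload-low"} 0
"""

# Matches: metric_name{label="value",...} <number>
SAMPLE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)"
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Parse labelled samples into (name, labels dict, float value) tuples."""
    samples = []
    for line in text.splitlines():
        match = SAMPLE.match(line.strip())
        if match is None:
            continue  # skip comments, blank lines, and unlabelled samples
        labels = dict(LABEL.findall(match["labels"]))
        samples.append((match["name"], labels, float(match["value"])))
    return samples


def busy_levels(samples):
    """Return (metric, priority_level, value) for every non-zero sample."""
    return [
        (name, labels.get("priority_level", ""), value)
        for name, labels, value in samples
        if value > 0
    ]


if __name__ == "__main__":
    print(busy_levels(parse_samples(METRICS_TEXT)))
```

An empty result matches the observation above: no requests in use or queued for `workload-low` at the time of the check. A single snapshot can still miss short bursts, so sampling repeatedly around the error timestamps would be more conclusive.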

We also checked the certificate generated for the otel operator, and it is valid:

kubectl get certificate -n monitoring                                                                                                      
NAME                                                READY   SECRET                                                                 AGE
otel-operator-opentelemetry-operator-serving-cert   True    otel-operator-opentelemetry-operator-controller-manager-service-cert   26d

kubectl get secret otel-operator-opentelemetry-operator-controller-manager-service-cert -n monitoring -o jsonpath="{.data['tls\.crt']}" | base64 --decode > cert.crt
openssl x509 -in cert.crt -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            9d:c0:73:fe:ab:4f:b1:1f:a8:24:ee:73:49:23:59:91
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: OU=otel-operator-opentelemetry-operator
        Validity
            Not Before: Apr  2 15:25:31 2024 GMT
            Not After : Jul  1 15:25:31 2024 GMT
        Subject: OU=otel-operator-opentelemetry-operator
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                # removed to reduce the message size 
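As a quick cross-check of the validity window printed above, the following Python sketch parses the `Not Before` and `Not After` values and tests them against the timestamp of the first logged error. The helper names are hypothetical, and the check covers only the date range; it says nothing about the trust chain or the certificate's subject alternative names.

```python
from datetime import datetime, timezone

# Date format used by `openssl x509 -text` (month names assume an English locale).
OPENSSL_FMT = "%b %d %H:%M:%S %Y GMT"

# Values copied from the certificate dump above.
NOT_BEFORE = "Apr  2 15:25:31 2024 GMT"
NOT_AFTER = "Jul  1 15:25:31 2024 GMT"


def parse_openssl_time(text):
    """Parse an OpenSSL validity timestamp into an aware UTC datetime."""
    return datetime.strptime(text, OPENSSL_FMT).replace(tzinfo=timezone.utc)


def is_valid_at(moment, not_before, not_after):
    """Return True if `moment` falls inside the certificate validity window."""
    return not_before <= moment <= not_after


if __name__ == "__main__":
    start = parse_openssl_time(NOT_BEFORE)
    end = parse_openssl_time(NOT_AFTER)
    # Timestamp of the first handshake error in the log, assumed to be UTC.
    first_error = datetime(2024, 5, 14, 9, 10, 1, tzinfo=timezone.utc)
    print("lifetime in days:", (end - start).days)
    print("valid at first error:", is_valid_at(first_error, start, end))
    print("days remaining:", (end - first_error).days)
```

The certificate has a 90-day lifetime and was well inside its validity window when the errors were logged, so expiry alone does not explain the handshake failures.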

Test environment:

Kubernetes version: v1.27.13-eks-3af4770
Provider: EKS
Operator version: 0.96.0

Labels: question (Further information is requested)
