Description
Component(s)
No response
Describe the issue you're reporting
Description:
We are observing intermittent TLS handshake errors between the OpenTelemetry operator's webhook server (default port 9443) and internal IPs of the Kubernetes API server.
Steps to reproduce:
Deploying the operator Helm chart with only a few changes (for our use case), including:
admissionWebhooks:
  namespaceSelector:
    matchLabels:
      otel-injection: enabled
manager:
  podAnnotations:
    sidecar.istio.io/inject: "false"
  resources:
    limits:
      memory: 256Mi
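For reference, the chart was deployed roughly as follows; the release name, namespace, and values file name below are illustrative placeholders rather than the exact command:

# Illustrative install command (release name, namespace and values file are placeholders)
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm upgrade --install otel-operator open-telemetry/opentelemetry-operator \
  --namespace monitoring \
  --values values.yaml   # values.yaml carries the overrides shown above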
Expected Result:
The OpenTelemetry operator and collectors work fine, but we are receiving the following logs from the operator pod showing that a TLS handshake error happens from time to time between the API server and the operator's webhook server. We couldn't see any issue with the ValidatingWebhookConfiguration and MutatingWebhookConfiguration though, and they both seem to be working fine.
10.40.76.248 is the internal service IP of the Kubernetes API server
10.40.99.143 is the pod IP of the OpenTelemetry operator
2024/05/14 09:10:01 http: TLS handshake error from 10.40.76.248:55276: read tcp 10.40.99.143:9443->10.40.76.248:55276: read: connection reset by peer
2024/05/14 09:15:00 http: TLS handshake error from 10.40.76.248:36546: read tcp 10.40.99.143:9443->10.40.76.248:36546: read: connection reset by peer
2024/05/14 09:15:00 http: TLS handshake error from 10.40.76.248:36562: read tcp 10.40.99.143:9443->10.40.76.248:36562: read: connection reset by peer
2024/05/14 09:18:00 http: TLS handshake error from 10.40.76.248:39346: read tcp 10.40.99.143:9443->10.40.76.248:39346: read: connection reset by peer
2024/05/14 09:18:00 http: TLS handshake error from 10.40.76.248:39360: EOF
2024/05/14 09:18:00 http: TLS handshake error from 10.40.76.248:39370: read tcp 10.40.99.143:9443->10.40.76.248:39370: read: connection reset by peer
2024/05/14 09:25:00 http: TLS handshake error from 10.40.76.248:42412: EOF
2024/05/14 09:28:00 http: TLS handshake error from 10.40.76.248:50632: EOF
2024/05/14 09:35:00 http: TLS handshake error from 10.40.76.248:34974: read tcp 10.40.99.143:9443->10.40.76.248:34974: read: connection reset by peer
2024/05/14 09:40:00 http: TLS handshake error from 10.40.76.248:53388: read tcp 10.40.99.143:9443->10.40.76.248:53388: read: connection reset by peer
2024/05/14 09:45:00 http: TLS handshake error from 10.40.76.248:50526: read tcp 10.40.99.143:9443->10.40.76.248:50526: read: connection reset by peer
2024/05/14 09:45:00 http: TLS handshake error from 10.40.76.248:50534: read tcp 10.40.99.143:9443->10.40.76.248:50534: read: connection reset by peer
2024/05/14 09:48:00 http: TLS handshake error from 10.40.76.248:39272: EOF
2024/05/14 09:50:00 http: TLS handshake error from 10.40.76.248:33666: read tcp 10.40.99.143:9443->10.40.76.248:33666: read: connection reset by peer
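For context, the webhook objects themselves look healthy; they can be inspected along these lines (the object and Service names depend on the Helm release name, so treat them as assumptions):

# List the operator's webhook configurations and the Endpoints backing the webhook Service
kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration | grep -i opentelemetry
kubectl get endpoints -n monitoring | grep -i webhook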
Troubleshooting steps:
To make sure there is no rate limiting happening between the API server and the operator, we checked the API server logs as well as the API Priority and Fairness configuration for handling requests, and we didn't observe any suspicious behaviour there:
kubectl get flowschemas
NAME                           PRIORITYLEVEL      MATCHINGPRECEDENCE   DISTINGUISHERMETHOD   AGE     MISSINGPL
exempt                         exempt             1                    <none>                2y87d   False
eks-exempt                     exempt             2                    <none>                262d    False
probes                         exempt             2                    <none>                2y87d   False
system-leader-election         leader-election    100                  ByUser                2y87d   False
endpoint-controller            workload-high      150                  ByUser                200d    False
workload-leader-election       leader-election    200                  ByUser                2y87d   False
system-node-high               node-high          400                  ByUser                455d    False
system-nodes                   system             500                  ByUser                2y87d   False
kube-controller-manager        workload-high      800                  ByNamespace           2y87d   False
kube-scheduler                 workload-high      800                  ByNamespace           2y87d   False
kube-system-service-accounts   workload-high      900                  ByNamespace           2y87d   False
eks-workload-high              workload-high      1000                 ByUser                172d    False
service-accounts               workload-low       9000                 ByUser                2y87d   False
global-default                 global-default     9900                 ByUser                2y87d   False
catch-all                      catch-all          10000                ByUser                2y87d   False
kubectl get prioritylevelconfiguration
NAME              TYPE      NOMINALCONCURRENCYSHARES   QUEUES   HANDSIZE   QUEUELENGTHLIMIT   AGE
catch-all         Limited   5                          <none>   <none>     <none>             2y87d
exempt            Exempt    <none>                     <none>   <none>     <none>             2y87d
global-default    Limited   20                         128      6          50                 2y87d
leader-election   Limited   10                         16       4          50                 2y87d
node-high         Limited   40                         64       6          50                 455d
system            Limited   30                         64       6          50                 2y87d
workload-high     Limited   40                         128      6          50                 2y87d
workload-low      Limited   100                        128      6          50                 2y87d
kubectl get --raw /metrics | grep 'apiserver_flowcontrol_request_concurrency_in_use.*workload-low'
apiserver_flowcontrol_request_concurrency_in_use{flow_schema="service-accounts",priority_level="workload-low"} 0 # current
kubectl get --raw /metrics | grep 'apiserver_flowcontrol_current_inqueue_requests.*workload-low'
apiserver_flowcontrol_current_inqueue_requests{flow_schema="service-accounts",priority_level="workload-low"} 0 # queue
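An additional signal that can be checked (assuming the standard API Priority and Fairness metrics are exposed) is whether any requests at this priority level are actually being rejected:

kubectl get --raw /metrics | grep 'apiserver_flowcontrol_rejected_requests_total.*workload-low'
# no output, or counters at 0, means no requests were rejected for this priority level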
The certificate generated for the operator was also checked and it is valid:
kubectl get certificate -n monitoring
NAME                                                READY   SECRET                                                                  AGE
otel-operator-opentelemetry-operator-serving-cert   True    otel-operator-opentelemetry-operator-controller-manager-service-cert   26d
kubectl get secret otel-operator-opentelemetry-operator-controller-manager-service-cert -n monitoring -o jsonpath="{.data['tls\.crt']}" | base64 --decode > cert.crt
openssl x509 -in cert.crt -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            9d:c0:73:fe:ab:4f:b1:1f:a8:24:ee:73:49:23:59:91
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: OU=otel-operator-opentelemetry-operator
        Validity
            Not Before: Apr 2 15:25:31 2024 GMT
            Not After : Jul 1 15:25:31 2024 GMT
        Subject: OU=otel-operator-opentelemetry-operator
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
        # removed to reduce the message size
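A related check is that the certificate's Subject Alternative Names cover the DNS name of the webhook Service that the API server connects to; the exact Service name depends on the Helm release, so the name in the comment is only an assumption:

openssl x509 -in cert.crt -noout -text | grep -A1 'Subject Alternative Name'
# should list DNS names of the webhook Service, e.g. <webhook-service>.monitoring.svc
# and <webhook-service>.monitoring.svc.cluster.local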
Test environment:
Kubernetes version: v1.27.13-eks-3af4770
Provider: EKS
Operator version: 0.96.0