Version
3.6.2
What Kubernetes platforms are you running on?
EKS Amazon
Steps to reproduce
k8s EKS version: 1.31
Describe the bug:
Sometimes the nginx-ingress-controller restarts the process without cleaning up the socket files.
We first hit this problem during a mass node restart in the k8s cluster.
Since then it has happened randomly on weekends.
We first noticed the problem in version 3.6.2. Before that we ran app version 3.0.2 and never had it.
Deleting the Pod manually fixes it, but the problem can recur.
Here is the Deployment YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    meta.helm.sh/release-name: nginx-inc-ingress-controller
    meta.helm.sh/release-namespace: nginx-ingress
  labels:
    app: nginx-inc-ingress-controller
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: 3.6.1
    helm.sh/chart: nginx-ingress-1.3.1
  name: nginx-inc-ingress-controller
  namespace: nginx-ingress
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nginx-inc-ingress-controller
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        logs.improvado.io/app: nginx-ingress
        logs.improvado.io/format: json
        logs.improvado.io/ingress-class: nginx-stable
        prometheus.io/port: "9113"
        prometheus.io/scheme: http
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: nginx-inc-ingress-controller
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: nginx-inc-ingress-controller
            topologyKey: kubernetes.io/hostname
      automountServiceAccountToken: true
      containers:
      - args:
        - -nginx-plus=false
        - -nginx-reload-timeout=60000
        - -enable-app-protect=false
        - -enable-app-protect-dos=false
        - -nginx-configmaps=$(POD_NAMESPACE)/nginx-inc-ingress-controller
        - -default-server-tls-secret=$(POD_NAMESPACE)/nginx-inc-ingress-controller-default-server-tls
        - -ingress-class=nginx-stable
        - -health-status=true
        - -health-status-uri=/-/health/lb
        - -nginx-debug=false
        - -v=1
        - -nginx-status=true
        - -nginx-status-port=8080
        - -nginx-status-allow-cidrs=127.0.0.1
        - -report-ingress-status
        - -enable-leader-election=true
        - -leader-election-lock-name=nginx-inc-ingress-controller-leader
        - -enable-prometheus-metrics=true
        - -prometheus-metrics-listen-port=9113
        - -prometheus-tls-secret=
        - -enable-service-insight=false
        - -service-insight-listen-port=9114
        - -service-insight-tls-secret=
        - -enable-custom-resources=true
        - -enable-snippets=true
        - -include-year=false
        - -disable-ipv6=false
        - -enable-tls-passthrough=false
        - -enable-cert-manager=false
        - -enable-oidc=false
        - -enable-external-dns=false
        - -default-http-listener-port=80
        - -default-https-listener-port=443
        - -ready-status=true
        - -ready-status-port=8081
        - -enable-latency-metrics=false
        - -ssl-dynamic-reload=true
        - -enable-telemetry-reporting=false
        - -weight-changes-dynamic-reload=false
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: 627003544259.dkr.ecr.us-east-1.amazonaws.com/nginx-inc-ingress:master-3.6.2-2-1
        imagePullPolicy: IfNotPresent
        name: ingress-controller
        ports:
        - containerPort: 80
          name: http
          protocol: TCP
        - containerPort: 443
          name: https
          protocol: TCP
        - containerPort: 9113
          name: prometheus
          protocol: TCP
        - containerPort: 8081
          name: readiness-port
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /nginx-ready
            port: readiness-port
            scheme: HTTP
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 1500m
            memory: 1500Mi
          requests:
            cpu: 100m
            memory: 1500Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 101
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/nginx
          name: nginx-etc
        - mountPath: /var/cache/nginx
          name: nginx-cache
        - mountPath: /var/lib/nginx
          name: nginx-lib
        - mountPath: /var/log/nginx
          name: nginx-log
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - cp
        - -vdR
        - /etc/nginx/.
        - /mnt/etc
        image: 627003544259.dkr.ecr.us-east-1.amazonaws.com/nginx-inc-ingress:master-3.6.2-2-1
        imagePullPolicy: IfNotPresent
        name: init-ingress-controller
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 101
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt/etc
          name: nginx-etc
      nodeSelector:
        kubernetes.io/arch: amd64
      priorityClassName: cluster-application-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      serviceAccount: nginx-inc-ingress-controller
      serviceAccountName: nginx-inc-ingress-controller
      terminationGracePeriodSeconds: 60
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 300
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 300
      - effect: NoSchedule
        key: node.kubernetes.io/memory-pressure
        operator: Exists
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: nginx-inc-ingress-controller
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
      volumes:
      - emptyDir: {}
        name: nginx-etc
      - emptyDir: {}
        name: nginx-cache
      - emptyDir: {}
        name: nginx-lib
      - emptyDir: {}
        name: nginx-log
Logs with error:
2024-11-02 10:02:58.543 2024/11/02 06:02:56 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-status.sock failed (98: Address already in use)
2024-11-02 10:02:58.543 2024/11/02 06:02:56 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
2024-11-02 10:02:58.543 2024/11/02 06:02:56 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
2024-11-02 10:02:58.543 2024/11/02 06:02:56 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
2024-11-02 10:02:58.543 2024/11/02 06:02:56 [notice] 18#18: try again to bind() after 500ms
2024-11-02 10:02:59.043 2024/11/02 06:02:56 [emerg] 18#18: still could not bind()
The attached logs contain one reconfiguration signal followed by a crash loop with the socket-busy error:
Explore-logs-2024-11-05 18_40_57.txt
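The `bind()` failures in the log excerpt are consistent with socket files outliving the process that created them. Below is a minimal Python sketch of that failure mode, not taken from the controller's code: closing a Unix socket leaves its file on disk, so the next `bind()` to the same path fails with `EADDRINUSE` (errno 98 on Linux) until the file is unlinked. The helper names `bind_unix`, `remove_stale_sockets`, and `demo` are hypothetical, and a temporary directory stands in for `/var/lib/nginx`.

```python
import errno
import os
import socket
import stat
import tempfile


def bind_unix(path):
    """Bind a listening Unix stream socket at `path` and return it."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
    except OSError:
        sock.close()
        raise
    sock.listen(1)
    return sock


def remove_stale_sockets(directory, suffix=".sock"):
    """Unlink leftover socket files in `directory`; return the names removed.

    Hypothetical helper: only safe when no live process owns the sockets.
    """
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(suffix) and stat.S_ISSOCK(os.lstat(path).st_mode):
            os.unlink(path)
            removed.append(name)
    return removed


def demo():
    """Reproduce the stale-socket bind failure, then recover from it."""
    with tempfile.TemporaryDirectory() as state_dir:
        path = os.path.join(state_dir, "nginx-status.sock")

        # "First process": bind, then exit without unlinking the socket file.
        bind_unix(path).close()

        # "Restarted process": the leftover file makes bind() fail.
        first_errno = None
        try:
            bind_unix(path).close()
        except OSError as exc:
            first_errno = exc.errno

        # After the stale file is removed, the same bind() succeeds.
        removed = remove_stale_sockets(state_dir)
        bind_unix(path).close()
        return first_errno, removed


if __name__ == "__main__":
    first_errno, removed = demo()
    print("bind failed with:", errno.errorcode.get(first_errno))
    print("removed stale sockets:", removed)
```

In the Deployment above, `/var/lib/nginx` is an `emptyDir` volume. An `emptyDir` survives container restarts and is removed only when the Pod is deleted, which may explain why deleting the Pod clears the error. This is an assumption, not something confirmed in this report.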
Expected behavior
The nginx-ingress controller Pod keeps working.