VSO fails to rotate some certs even when in renewal window #1064

@tks98

Describe the bug
The VSO sometimes loses track of vault-pki-secret certs. It will say a cert is in the renewal window, but won't actually rotate it unless the operator is restarted. Once restarted, it rotates the cert successfully.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy VSO and a vault-pki-secret resource
  2. Wait for the cert to expire
  3. Notice the cert was not rotated
  4. Restart VSO
  5. The cert will be rotated

Application deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '4'
  creationTimestamp: '2024-10-16T19:19:42Z'
  generation: 5
  labels:
    app.kubernetes.io/component: controller-manager
    app.kubernetes.io/instance: sf-vault-secrets-operator
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: vault-secrets-operator
    app.kubernetes.io/version: 0.8.1
    argocd.argoproj.io/instance: vault-prod_sf-vault-secrets-operator
    control-plane: controller-manager
    helm.sh/chart: vault-secrets-operator-0.8.6
  name: sf-vault-secrets-operator-controller-manager
  namespace: vault-prod
  resourceVersion: '3696531985'
  uid: 30e8be8f-99e7-46e1-b927-f97772239224
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: sf-vault-secrets-operator
      app.kubernetes.io/name: vault-secrets-operator
      control-plane: controller-manager
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: manager
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: sf-vault-secrets-operator
        app.kubernetes.io/name: vault-secrets-operator
        control-plane: controller-manager
    spec:
      containers:
        - args:
            - '--secure-listen-address=0.0.0.0:8443'
            - '--upstream=http://127.0.0.1:8080/'
            - '--logtostderr=true'
            - '--v=0'
          env:
            - name: KUBERNETES_CLUSTER_DOMAIN
              value: cluster.local
          image: my-registry/vault/kube-rbac-proxy:v0.18.1
          imagePullPolicy: IfNotPresent
          name: kube-rbac-proxy
          ports:
            - containerPort: 8443
              name: https
              protocol: TCP
          resources:
            limits:
              memory: 1Gi
            requests:
              cpu: 5m
              memory: 500Mi
          securityContext:
            allowPrivilegeEscalation: false
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        - args:
            - '--health-probe-bind-address=:8081'
            - '--metrics-bind-address=127.0.0.1:8080'
            - '--leader-elect'
            - '--global-vault-auth-options=allow-default-globals'
            - '--backoff-initial-interval=5s'
            - '--backoff-max-interval=60s'
            - '--backoff-max-elapsed-time=0s'
            - '--backoff-multiplier=1.50'
            - '--backoff-randomization-factor=0.50'
            - '--zap-log-level=info'
            - '--zap-time-encoding=rfc3339'
            - '--zap-stacktrace-level=panic'
          command:
            - /vault-secrets-operator
          env:
            - name: OPERATOR_POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: OPERATOR_POD_UID
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.uid
            - name: KUBERNETES_CLUSTER_DOMAIN
              value: cluster.local
          image: my-registry/vault/vault-secrets-operator:0.10.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /healthz
              port: 8081
              scheme: HTTP
            initialDelaySeconds: 15
            periodSeconds: 20
            successThreshold: 1
            timeoutSeconds: 1
          name: manager
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /readyz
              port: 8081
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              memory: 1Gi
            requests:
              cpu: 10m
              memory: 500Mi
          securityContext:
            allowPrivilegeEscalation: false
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /var/run/podinfo
              name: podinfo
      dnsPolicy: ClusterFirst
      nodeSelector:
        node-role.kubernetes.io/infrastructure-new: ''
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsNonRoot: true
      serviceAccount: sf-vault-secrets-operator-controller-manager
      serviceAccountName: sf-vault-secrets-operator-controller-manager
      terminationGracePeriodSeconds: 120
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/infrastructure-new
          operator: Exists
      volumes:
        - downwardAPI:
            defaultMode: 420
            items:
              - fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
                path: name
              - fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.uid
                path: uid
          name: podinfo
status:
  availableReplicas: 1
  conditions:
    - lastTransitionTime: '2024-10-16T19:19:42Z'
      lastUpdateTime: '2025-03-17T17:01:14Z'
      message: >-
        ReplicaSet "sf-vault-secrets-operator-controller-manager-5599cb9f75" has
        successfully progressed.
      reason: NewReplicaSetAvailable
      status: 'True'
      type: Progressing
    - lastTransitionTime: '2025-05-08T14:55:39Z'
      lastUpdateTime: '2025-05-08T14:55:39Z'
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: 'True'
      type: Available
  observedGeneration: 5
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

The following are logs from today just before VSO was restarted. It knows the grafana cert is in the renewal window, but it will not rotate it. There are no error logs after this. The cert just doesn't get rotated.

{"level":"info","ts":"2025-05-08T14:55:48Z","msg":"Must sync","controller":"vaultpkisecret","controllerGroup":"secrets.hashicorp.com","controllerKind":"VaultPKISecret","VaultPKISecret":{"name":"grafana","namespace":"grafana"},"namespace":"grafana","name":"grafana","reconcileID":"22c5590e-c366-4f8f-a4a1-83a7d6138e5d","reason":"InRenewalWindow"}
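To check how often the operator reports this state without acting on it, the manager logs can be filtered for entries like the one above. This is only a rough sketch: it relies solely on the fields visible in that log line (`msg`, `reason`, `ts`, `VaultPKISecret`), and `renewal_window_events` is a hypothetical helper name, not part of VSO.

```python
import json

# Log line copied verbatim from the issue (JSON output of the manager container).
LOG_LINE = '{"level":"info","ts":"2025-05-08T14:55:48Z","msg":"Must sync","controller":"vaultpkisecret","controllerGroup":"secrets.hashicorp.com","controllerKind":"VaultPKISecret","VaultPKISecret":{"name":"grafana","namespace":"grafana"},"namespace":"grafana","name":"grafana","reconcileID":"22c5590e-c366-4f8f-a4a1-83a7d6138e5d","reason":"InRenewalWindow"}'


def renewal_window_events(lines):
    """Yield (ts, namespace, name) for each "Must sync" entry with reason InRenewalWindow."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        if entry.get("msg") == "Must sync" and entry.get("reason") == "InRenewalWindow":
            ref = entry.get("VaultPKISecret", {})
            yield entry.get("ts"), ref.get("namespace"), ref.get("name")


events = list(renewal_window_events([LOG_LINE]))
print(events)
```

Feeding the full log stream through the same function (for example, line by line from a saved log file) would show whether a given resource keeps reporting `InRenewalWindow` across reconciles while `status.lastRotation` stays unchanged.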

vault-pki-secret resource

kc get vaultpkisecret --context sf-prod -n grafana -o yaml
apiVersion: v1
items:
- apiVersion: secrets.hashicorp.com/v1beta1
  kind: VaultPKISecret
  metadata:
    annotations:
    creationTimestamp: "2025-02-06T22:41:25Z"
    finalizers:
    - vaultpkisecrets.secrets.hashicorp.com/finalizer
    generation: 1
    labels:
      argocd.argoproj.io/instance: grafana_grafana
    name: grafana
    namespace: grafana
    resourceVersion: "3638517376"
    uid: f5a3bbbb-72ad-46df-bf14-72044d418cd7
  spec:
    commonName: <common_name>
    destination:
      create: true
      name: grafana-pki-secret
      overwrite: false
      transformation: {}
    expiryOffset: 3600s
    format: pem
    mount: pki
    role: postgres
    rolloutRestartTargets:
    - kind: Deployment
      name: grafana
    ttl: 2592000s
    vaultAuthRef: grafana
  status:
    error: ""
    expiration: 1746677241 (Thu May 08 2025 04:07:21 GMT+0000)
    lastGeneration: 1
    lastRotation: 1744085242 (Tue Apr 08 2025 04:07:22 GMT+0000)
    secretMAC: K36WwBka5fODwB41pxiRulmrmxyacg4mSfFwl9zn2tY=
    serialNumber: 35:17:fb:18:8a:68:11:c1:4c:f3:9f:71:d4:94:86:f3:d9:60:fe:84
    valid: true
kind: List
metadata:
  resourceVersion: ""
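
For reference, the timing implied by the resource above can be worked out from `status.expiration`, `spec.expiryOffset`, and the timestamp of the "Must sync" log entry. The sketch below assumes a simple model in which renewal becomes due once the current time reaches `expiration - expiryOffset`; that is an assumption for illustration, not a copy of VSO's internal logic, and `in_renewal_window` is a hypothetical helper.

```python
from datetime import datetime, timezone

# Values copied from the issue.
EXPIRATION = 1746677241          # status.expiration (Unix seconds)
EXPIRY_OFFSET_SECONDS = 3600     # spec.expiryOffset: 3600s
LOG_TS = "2025-05-08T14:55:48Z"  # ts of the "Must sync" log entry


def in_renewal_window(now, expiration, offset_seconds):
    # Assumed model: renewal is due once now >= expiration - offset.
    return now >= expiration - offset_seconds


log_time = datetime.strptime(LOG_TS, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
now = int(log_time.timestamp())
window_start = EXPIRATION - EXPIRY_OFFSET_SECONDS

# Start of the renewal window in UTC
print(datetime.fromtimestamp(window_start, tz=timezone.utc).isoformat())
# Whether the log entry falls inside the window under the assumed model
print(in_renewal_window(now, EXPIRATION, EXPIRY_OFFSET_SECONDS))
# Seconds elapsed between certificate expiration and the log entry
print(now - EXPIRATION)
```

Under this model the renewal window opened at 03:07:21 UTC on May 8, and the log entry was written 38907 seconds (about 10 hours 48 minutes) after the certificate had already expired, which matches the report that the cert expired without being rotated.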

Expected behavior
When a cert is in the renewal window, VSO successfully rotates it.

Environment

  • Kubernetes version: v1.30.9
    • Distribution or cloud vendor (OpenShift, EKS, GKE, AKS, etc.): Vanilla on-prem
  • vault-secrets-operator version: 0.10.0
