Operator crashes when reconciling ClickHouseKeeperInstallation in terminating namespace #1871

@pkirby-neptune

Description

The ClickHouse operator (versions 0.25.2 and 0.25.5) crashes with a panic when attempting to reconcile a ClickHouseKeeperInstallation resource in a namespace that is currently being terminated. This causes operator restarts and delays reconciliation of other ClickHouse installations in the cluster.

Environment

  • Operator Version: 0.25.5 (also reproducible in 0.25.2)
  • Kubernetes Platform: Azure AKS
  • Watch Configuration: All namespaces (watch.namespaces: [.*])
  • Frequency: Occurs every time a namespace containing a ClickHouseKeeperInstallation is deleted

Steps to Reproduce

  1. Deploy a ClickHouseKeeperInstallation in a namespace (e.g., e2e-test-namespace)
  2. Delete the namespace: kubectl delete namespace e2e-test-namespace
  3. Operator receives reconciliation event for the keeper installation
  4. By the time operator processes the event, the namespace/CR is already terminated/deleted
  5. Operator attempts to update status on a nil CR
  6. Operator crashes with exit code 2

Expected Behavior

The operator should gracefully handle the case where:

  • A namespace is being terminated
  • A CR is deleted during reconciliation
  • A CR is nil when attempting status updates

The operator should either:

  1. Skip reconciliation for resources in terminating namespaces, or
  2. Check if the CR is nil before attempting type assertions/status updates, or
  3. Handle the terminating namespace errors without attempting status updates

Actual Behavior

The operator panics and crashes with the following error:

panic: interface conversion: v1.ICustomResource is nil, not v1.ClickHouseKeeperInstallation [recovered, repanicked]

Full Stack Trace:

E1124 12:31:53.311112 1 worker-config-map.go:115] createConfigMap():unknown:Create ConfigMap xxx/chk-xxx-common-configd failed with error configmaps "chk-neptune-common-configd" is forbidden: unable to create new content in namespace xxx because it is being terminated
E1124 12:31:53.311130 1 worker-config-map.go:62] reconcileConfigMap():unknown:FAILED to reconcile ConfigMap: chk-xxx-common-configd CHI: xxx
E1124 12:31:53.314616 1 worker-config-map.go:115] createConfigMap():unknown:Create ConfigMap xxx/chk-xxx-deploy-confd-keeper-0-0 failed with error configmaps "chk-xxx-deploy-confd-keeper-0-0" is forbidden: unable to create new content in namespace xxx because it is being terminated
E1124 12:31:53.314634 1 worker-config-map.go:62] reconcileConfigMap():unknown:FAILED to reconcile ConfigMap: chk-xxx-deploy-confd-keeper-0-0 CHI: xxx
W1124 12:31:53.314650 1 worker-reconciler-chk.go:640] reconcileHostMain():Host:0-0[0/0]:xxx/xxx:Reconcile Host Main - unable to reconcile ConfigMap. Host: 0-0 Err: configmaps "chk-xxx-deploy-confd-keeper-0-0" is forbidden: unable to create new content in namespace xxx because it is being terminated
I1124 12:31:53.314936 1 worker-service.go:46] reconcileService():unknown:Service: xxxkeeper-xxx not found. err: Service "keeper-xxx" not found
I1124 12:31:53.314953 1 controller-service.go:52] deleteServiceIfExists():xxx/keeper-xxx:Not Found Service: xxx/keeper-xxx err: Service "keeper-xxx" not found
E1124 12:31:53.318177 1 worker-service.go:214] createService():unknown:FAILED Create Service: xxx/keeper-xxx err: services "keeper-xxx" is forbidden: unable to create new content in namespace xxx because it is being terminated
E1124 12:31:53.318195 1 worker-service.go:66] reconcileService():unknown:FAILED to reconcile Service: xxx/keeper-xxx CHI: xxx
W1124 12:31:53.318208 1 worker-reconciler-chk.go:481] first shard failed, skipping rest of shards due to an error: configmaps "chk-xxx-deploy-confd-keeper-0-0" is forbidden: unable to create new content in namespace xxx because it is being terminated
I1124 12:31:53.318217 1 worker-reconciler-chk.go:482] worker-reconciler-chk.go:465:reconcileShardsAndHosts():end:reconcileShardsAndHosts end
E1124 12:31:53.318232 1 worker-reconciler-chk.go:73] reconcileCR():unknown:FAILED to reconcile CR xxx/xxx, err: configmaps "chk-xxx-deploy-confd-keeper-0-0" is forbidden: unable to create new content in namespace xxx because it is being terminated
I1124 12:31:53.318264 1 panic.go:783] worker-reconciler-chk.go:50:reconcileCR():end:unknown
2025-11-24T12:31:53Z INFO Observed a panic in reconciler: interface conversion: v1.ICustomResource is nil, not v1.ClickHouseKeeperInstallation {"controller": "clickhousekeeperinstallation", "controllerGroup": "clickhouse-keeper.altinity.com", "controllerKind": "ClickHouseKeeperInstallation", "ClickHouseKeeperInstallation": {"name":"xxx","namespace":"xxx"}, "namespace": "xxx", "name": "xxx", "reconcileID": "5db980c5-62d6-470e-af60-b04a168e8140"}
panic: interface conversion: v1.ICustomResource is nil, not v1.ClickHouseKeeperInstallation [recovered, repanicked]
goroutine 300 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Reconcile.func1()
/clickhouse-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:115 +0x1d4
panic({0x1c5dac0?, 0xc0016c7710?})
/usr/local/go/src/runtime/panic.go:783 +0x132
github.com/altinity/clickhouse-operator/pkg/controller/chk/kube.(CR).statusUpdateProcess(0xc00296cfc0, {0x2239c70, 0x32906a0}, {0x2258fb0?, 0xc002579980}, {{{0x0, 0x0, 0x0, 0x1, 0x0, ...}, ...}, ...})
/clickhouse-operator/pkg/controller/chk/kube/cr.go:98 +0xe2a
github.com/altinity/clickhouse-operator/pkg/controller/chk/kube.(CR).statusUpdateRetry(0xc00296cfc0, {0x2239c70, 0x32906a0}, {0x2258fb0, 0xc002579980}, {{{0x0, 0x0, 0x0, 0x1, 0x0, ...}, ...}, ...})
/clickhouse-operator/pkg/controller/chk/kube/cr.go:71 +0x10e
github.com/altinity/clickhouse-operator/pkg/controller/chk/kube.(CR).StatusUpdate(0xc00296cfc0?, {0x2239c70?, 0x32906a0?}, {0x2258fb0?, 0xc002579980?}, {{{0x0, 0x0, 0x0, 0x1, 0x0, ...}, ...}, ...})
/clickhouse-operator/pkg/controller/chk/kube/cr.go:62 +0x1c5
github.com/altinity/clickhouse-operator/pkg/controller/chk.(Controller).updateCRObjectStatus(0x0?, {0x2239c70, 0x32906a0}, {0x2258fb0, 0xc002579980}, {{{0x0, 0x0, 0x0, 0x1, 0x0, ...}, ...}, ...})
/clickhouse-operator/pkg/controller/chk/controller-status.go:26 +0xab
github.com/altinity/clickhouse-operator/pkg/controller/chk.(worker).markReconcileCompletedUnsuccessfully(0xc001a63950, {0x2239c70, 0x32906a0}, 0xc002579980, {0x2219a60, 0xc00068a520})
/clickhouse-operator/pkg/controller/chk/worker.go:251 +0x28c
github.com/altinity/clickhouse-operator/pkg/controller/chk.(worker).reconcileCR(0xc001a63950, {0x2239c70, 0x32906a0}, 0x0?, 0xc000791800)
/clickhouse-operator/pkg/controller/chk/worker-reconciler-chk.go:75 +0x1091
github.com/altinity/clickhouse-operator/pkg/controller/chk.(Controller).Reconcile(0xc0005ff480, {0x2239b90, 0xc002f39440}, {{{0xc002032830?, 0x1de4fc0?}, {0xc002032820?, 0xc00296ce50?}}})
/clickhouse-operator/pkg/controller/chk/controller.go:80 +0x3c5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Reconcile(0x2239b90?, {0x2239b90?, 0xc002f39440?}, {{{0xc002032830?, 0x1b18e60?}, {0xc002032820?, 0x0?}}})
/clickhouse-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118 +0xa5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler(0xc000998d20, {0x2239bc8, 0xc000389ae0}, {0x1d4f020, 0xc002e87b20})
/clickhouse-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314 +0x31c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem(0xc000998d20, {0x2239bc8, 0xc000389ae0})
/clickhouse-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265 +0x197
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func2.2()
/clickhouse-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226 +0x73
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func2 in goroutine 305
/clickhouse-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:222 +0x509

Root Cause

The panic occurs at pkg/controller/chk/kube/cr.go:98, on the failure path of reconciliation. The code attempts to:

  1. Mark reconciliation as completed unsuccessfully
  2. Update the CR status
  3. Type assert an interface to *v1.ClickHouseKeeperInstallation

At step 3 the interface is nil because the CR has already been deleted, so the assertion panics.

The reconciliation fails with "unable to create new content in namespace because it is being terminated" errors, but the error handling path doesn't check if the CR still exists before attempting the status update.
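The mechanism can be illustrated in isolation. The sketch below uses local stand-in types (ICustomResource and ClickHouseKeeperInstallation here are simplified placeholders, not the operator's definitions) to show that a single-value type assertion on a nil interface panics, while the comma-ok form does not. The helper names uncheckedAssert and checkedAssert are hypothetical.

```go
package main

import "fmt"

// ICustomResource is a simplified placeholder for the operator's interface.
type ICustomResource interface {
	GetName() string
}

// ClickHouseKeeperInstallation is a simplified placeholder for the CR type.
type ClickHouseKeeperInstallation struct{ Name string }

func (c *ClickHouseKeeperInstallation) GetName() string { return c.Name }

// uncheckedAssert uses a single-value type assertion, which panics when the
// interface value is nil. The panic is recovered here only so the message
// can be returned and printed.
func uncheckedAssert(cr ICustomResource) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	_ = cr.(*ClickHouseKeeperInstallation)
	return "no panic"
}

// checkedAssert uses the comma-ok form, which never panics. It also rejects
// a typed nil pointer so callers get a usable value or false.
func checkedAssert(cr ICustomResource) (*ClickHouseKeeperInstallation, bool) {
	chk, ok := cr.(*ClickHouseKeeperInstallation)
	return chk, ok && chk != nil
}

func main() {
	// Prints the recovered "interface conversion" panic message.
	fmt.Println(uncheckedAssert(nil))

	// The checked form reports failure instead of panicking.
	_, ok := checkedAssert(nil)
	fmt.Println("checked assertion ok:", ok)
}
```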

Impact

Severity: Medium-High

  1. Operator instability: Each namespace deletion causes an operator crash (exit code 2)
  2. Reconciliation delays: After restart, the operator needs ~4-5 minutes to rebuild its cache and resume operations
  3. Cascading effect: In environments with frequent namespace creation/deletion (e.g. testing), this causes frequent operator restarts
  4. Observable metric: In our cluster, we observed 12 operator restarts in 10 hours during a period of heavy namespace churn from testing

When a new ClickHouseInstallation is created shortly after an operator crash, reconciliation is delayed by the cache rebuild time, causing deployments to time out or take significantly longer.

Labels

  • hold: This issue has been put on hold
  • research required: This issue requires additional research
