Issue: Operator crashes when reconciling ClickHouseKeeperInstallation in terminating namespace
Description:
The ClickHouse operator (versions 0.25.2 and 0.25.5) crashes with a panic when attempting to reconcile a ClickHouseKeeperInstallation resource in a namespace that is currently being terminated. This causes operator restarts and delays reconciliation of other ClickHouse installations in the cluster.
Environment
- Operator Version: 0.25.5 (also reproducible in 0.25.2)
- Kubernetes Version: Azure AKS
- Watch Configuration: All namespaces (watch.namespaces: [.*])
- Frequency: Occurs every time a namespace containing a ClickHouseKeeperInstallation is deleted
Steps to Reproduce
- Deploy a ClickHouseKeeperInstallation in a namespace (e.g., e2e-test-namespace)
- Delete the namespace: kubectl delete namespace e2e-test-namespace
- Operator receives reconciliation event for the keeper installation
- By the time operator processes the event, the namespace/CR is already terminated/deleted
- Operator attempts to update status on a nil CR
- Operator crashes with exit code 2
Expected Behavior
The operator should gracefully handle the case where:
- A namespace is being terminated
- A CR is deleted during reconciliation
- A CR is nil when attempting status updates
The operator should either:
- Skip reconciliation for resources in terminating namespaces, or
- Check if the CR is nil before attempting type assertions/status updates, or
- Handle the terminating namespace errors without attempting status updates
Actual Behavior
The operator panics and crashes with the following error:
panic: interface conversion: v1.ICustomResource is nil, not *v1.ClickHouseKeeperInstallation [recovered, repanicked]
**Full Stack Trace:**
E1124 12:31:53.311112 1 worker-config-map.go:115] createConfigMap():unknown:Create ConfigMap xxx/chk-xxx-common-configd failed with error configmaps "chk-neptune-common-configd" is forbidden: unable to create new content in namespace xxx because it is being terminated
E1124 12:31:53.311130 1 worker-config-map.go:62] reconcileConfigMap():unknown:FAILED to reconcile ConfigMap: chk-xxx-common-configd CHI: xxx
E1124 12:31:53.314616 1 worker-config-map.go:115] createConfigMap():unknown:Create ConfigMap xxx/chk-xxx-deploy-confd-keeper-0-0 failed with error configmaps "chk-xxx-deploy-confd-keeper-0-0" is forbidden: unable to create new content in namespace xxx because it is being terminated
E1124 12:31:53.314634 1 worker-config-map.go:62] reconcileConfigMap():unknown:FAILED to reconcile ConfigMap: chk-xxx-deploy-confd-keeper-0-0 CHI: xxx
W1124 12:31:53.314650 1 worker-reconciler-chk.go:640] reconcileHostMain():Host:0-0[0/0]:xxx/xxx:Reconcile Host Main - unable to reconcile ConfigMap. Host: 0-0 Err: configmaps "chk-xxx-deploy-confd-keeper-0-0" is forbidden: unable to create new content in namespace xxx because it is being terminated
I1124 12:31:53.314936 1 worker-service.go:46] reconcileService():unknown:Service: xxxkeeper-xxx not found. err: Service "keeper-xxx" not found
I1124 12:31:53.314953 1 controller-service.go:52] deleteServiceIfExists():xxx/keeper-xxx:Not Found Service: xxx/keeper-xxx err: Service "keeper-xxx" not found
E1124 12:31:53.318177 1 worker-service.go:214] createService():unknown:FAILED Create Service: xxx/keeper-xxx err: services "keeper-xxx" is forbidden: unable to create new content in namespace xxx because it is being terminated
E1124 12:31:53.318195 1 worker-service.go:66] reconcileService():unknown:FAILED to reconcile Service: xxx/keeper-xxx CHI: xxx
W1124 12:31:53.318208 1 worker-reconciler-chk.go:481] first shard failed, skipping rest of shards due to an error: configmaps "chk-xxx-deploy-confd-keeper-0-0" is forbidden: unable to create new content in namespace xxx because it is being terminated
I1124 12:31:53.318217 1 worker-reconciler-chk.go:482] worker-reconciler-chk.go:465:reconcileShardsAndHosts():end:reconcileShardsAndHosts end
E1124 12:31:53.318232 1 worker-reconciler-chk.go:73] reconcileCR():unknown:FAILED to reconcile CR xxx/xxx, err: configmaps "chk-xxx-deploy-confd-keeper-0-0" is forbidden: unable to create new content in namespace xxx because it is being terminated
I1124 12:31:53.318264 1 panic.go:783] worker-reconciler-chk.go:50:reconcileCR():end:unknown
2025-11-24T12:31:53Z INFO Observed a panic in reconciler: interface conversion: v1.ICustomResource is nil, not *v1.ClickHouseKeeperInstallation {"controller": "clickhousekeeperinstallation", "controllerGroup": "clickhouse-keeper.altinity.com", "controllerKind": "ClickHouseKeeperInstallation", "ClickHouseKeeperInstallation": {"name":"xxx","namespace":"xxx"}, "namespace": "xxx", "name": "xxx", "reconcileID": "5db980c5-62d6-470e-af60-b04a168e8140"}
panic: interface conversion: v1.ICustomResource is nil, not *v1.ClickHouseKeeperInstallation [recovered, repanicked]
goroutine 300 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
/clickhouse-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:115 +0x1d4
panic({0x1c5dac0?, 0xc0016c7710?})
/usr/local/go/src/runtime/panic.go:783 +0x132
github.com/altinity/clickhouse-operator/pkg/controller/chk/kube.(*CR).statusUpdateProcess(0xc00296cfc0, {0x2239c70, 0x32906a0}, {0x2258fb0?, 0xc002579980}, {{{0x0, 0x0, 0x0, 0x1, 0x0, ...}, ...}, ...})
/clickhouse-operator/pkg/controller/chk/kube/cr.go:98 +0xe2a
github.com/altinity/clickhouse-operator/pkg/controller/chk/kube.(*CR).statusUpdateRetry(0xc00296cfc0, {0x2239c70, 0x32906a0}, {0x2258fb0, 0xc002579980}, {{{0x0, 0x0, 0x0, 0x1, 0x0, ...}, ...}, ...})
/clickhouse-operator/pkg/controller/chk/kube/cr.go:71 +0x10e
github.com/altinity/clickhouse-operator/pkg/controller/chk/kube.(*CR).StatusUpdate(0xc00296cfc0?, {0x2239c70?, 0x32906a0?}, {0x2258fb0?, 0xc002579980?}, {{{0x0, 0x0, 0x0, 0x1, 0x0, ...}, ...}, ...})
/clickhouse-operator/pkg/controller/chk/kube/cr.go:62 +0x1c5
github.com/altinity/clickhouse-operator/pkg/controller/chk.(*Controller).updateCRObjectStatus(0x0?, {0x2239c70, 0x32906a0}, {0x2258fb0, 0xc002579980}, {{{0x0, 0x0, 0x0, 0x1, 0x0, ...}, ...}, ...})
/clickhouse-operator/pkg/controller/chk/controller-status.go:26 +0xab
github.com/altinity/clickhouse-operator/pkg/controller/chk.(*worker).markReconcileCompletedUnsuccessfully(0xc001a63950, {0x2239c70, 0x32906a0}, 0xc002579980, {0x2219a60, 0xc00068a520})
/clickhouse-operator/pkg/controller/chk/worker.go:251 +0x28c
github.com/altinity/clickhouse-operator/pkg/controller/chk.(*worker).reconcileCR(0xc001a63950, {0x2239c70, 0x32906a0}, 0x0?, 0xc000791800)
/clickhouse-operator/pkg/controller/chk/worker-reconciler-chk.go:75 +0x1091
github.com/altinity/clickhouse-operator/pkg/controller/chk.(*Controller).Reconcile(0xc0005ff480, {0x2239b90, 0xc002f39440}, {{{0xc002032830?, 0x1de4fc0?}, {0xc002032820?, 0xc00296ce50?}}})
/clickhouse-operator/pkg/controller/chk/controller.go:80 +0x3c5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x2239b90?, {0x2239b90?, 0xc002f39440?}, {{{0xc002032830?, 0x1b18e60?}, {0xc002032820?, 0x0?}}})
/clickhouse-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118 +0xa5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000998d20, {0x2239bc8, 0xc000389ae0}, {0x1d4f020, 0xc002e87b20})
/clickhouse-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314 +0x31c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000998d20, {0x2239bc8, 0xc000389ae0})
/clickhouse-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265 +0x197
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
/clickhouse-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226 +0x73
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 305
/clickhouse-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:222 +0x509
Root Cause
The panic occurs at pkg/controller/chk/kube/cr.go:98 when the code attempts to:
- Mark reconciliation as completed unsuccessfully
- Update the CR status
- Type assert an interface to *v1.ClickHouseKeeperInstallation
- The interface is nil because the CR has already been deleted
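The failing step can be reproduced in isolation. The sketch below uses hypothetical stand-in types (`ICustomResource`, `ClickHouseKeeperInstallation`) and a hypothetical helper `assertKeeper`, not the operator's packages. It shows that a single-value type assertion on a nil interface raises the same kind of runtime panic reported above.

```go
package main

import "fmt"

// ICustomResource and ClickHouseKeeperInstallation are hypothetical stand-ins
// for the operator's v1.ICustomResource and v1.ClickHouseKeeperInstallation.
type ICustomResource interface {
	GetName() string
}

type ClickHouseKeeperInstallation struct{ Name string }

func (c *ClickHouseKeeperInstallation) GetName() string { return c.Name }

// assertKeeper performs the unchecked assertion and converts any resulting
// runtime panic into a string so the message can be inspected.
func assertKeeper(cr ICustomResource) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	_ = cr.(*ClickHouseKeeperInstallation) // panics when cr is a nil interface
	return "no panic"
}

func main() {
	var deleted ICustomResource // nil, as when the CR is already gone
	fmt.Println(assertKeeper(deleted))
	fmt.Println(assertKeeper(&ClickHouseKeeperInstallation{Name: "keeper"}))
}
```

The first call reports an "interface conversion ... is nil" message with this example's package and type names in place of the operator's. The second call completes without a panic.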
The reconciliation fails with "unable to create new content in namespace because it is being terminated" errors, but the error handling path doesn't check if the CR still exists before attempting the status update.
Impact
Severity: Medium-High
- Operator instability: Each namespace deletion causes an operator crash (exit code 2)
- Reconciliation delays: After restart, the operator needs ~4-5 minutes to rebuild its cache and resume operations
- Cascading effect: In environments with frequent namespace creation/deletion (e.g. testing), this causes frequent operator restarts
- Observable metric: In our cluster, we observed 12 restarts in 10 hours while there was a significant amount of namespace churn during testing
When a new ClickHouseInstallation is created shortly after an operator crash, reconciliation is delayed by the cache rebuild time, causing deployments to time out or take significantly longer.