Summary
Following PR #1848, we need to enhance the finalizer removal mechanism to provide better diagnostics and a safer approach when HCP cleanup times out during E2E tests.
Context
PR #1848 introduced NukeHostedCluster(), which blindly removes ALL finalizers when deletion times out. While this unblocks E2E tests, it bypasses critical cleanup logic managed by various HyperShift components.
Current Issues
The current implementation removes all finalizers without understanding:
- Why cleanup is failing or taking too long
- Which specific finalizer is blocking deletion
- What resources might be left behind
Proposed Enhancements
1. Enhanced Diagnostics Before Forceful Removal
- Log which finalizers are still present
- Query and log the status of resources each finalizer is protecting
- Attempt to identify the specific blocker
2. Graduated Finalizer Removal
- Instead of removing all finalizers at once, remove them individually
- Log what each finalizer was protecting before removal
- Allow configuration of which finalizers can be safely force-removed
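The individual-removal step could be sketched as a pure helper that drops one finalizer at a time, so each removal can be logged and patched separately. The helper name and shape below are illustrative, not an existing HyperShift API:

```go
package main

import "fmt"

// removeFinalizer returns a copy of finalizers with target removed,
// leaving the input slice untouched so each removal can be logged
// and applied as its own patch.
func removeFinalizer(finalizers []string, target string) []string {
	out := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != target {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	fs := []string{"openshift.io/destroy-cluster", "hypershift.openshift.io/finalizer"}
	fs = removeFinalizer(fs, "openshift.io/destroy-cluster")
	fmt.Println(fs) // remaining finalizers after one graduated removal
}
```

Operating on a copy keeps the diff between "before" and "after" explicit for the per-finalizer log line.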
3. Timeout Configuration
- Make cleanup timeout configurable per finalizer type
- Different finalizers may need different grace periods
4. Post-Removal Report
- Generate a report of potentially orphaned resources
- Include cloud provider resources that may incur costs
Implementation Suggestions
```go
// Example enhancement to NukeHostedCluster. The helpers referenced here
// (diagnoseFinalizerBlockage, removeFinalizers, getFinalizerImpact,
// generateOrphanedResourcesReport) are sketches to be implemented.
func NukeHostedCluster(h *helper.H, hc *hyperv1.HostedCluster) error {
	// First, diagnose why deletion is blocked
	diagnostics := diagnoseFinalizerBlockage(h, hc)
	h.Logger.Info("Finalizer diagnostics", "report", diagnostics)

	// Attempt graceful removal of known safe finalizers first
	safeFinalizers := []string{
		"openshift.io/destroy-cluster",
		// Add other known safe finalizers
	}
	if err := removeFinalizers(h, hc, safeFinalizers); err != nil {
		return err
	}

	// Remove remaining finalizers individually with logging
	for _, finalizer := range hc.GetFinalizers() {
		h.Logger.Warn("Force removing finalizer",
			"finalizer", finalizer,
			"potentialImpact", getFinalizerImpact(finalizer))
		// Remove individual finalizer
	}

	// Generate orphaned resources report
	report := generateOrphanedResourcesReport(h, hc)
	h.Logger.Error("Potential orphaned resources after forced cleanup", "report", report)
	return nil
}
```
Identified Finalizers and Their Risks
Based on HyperShift codebase analysis, removing these finalizers can cause:
- `hypershift.openshift.io/finalizer`: main cleanup orchestration skipped - may leave cloud resources
- `hypershift.io/aws-oidc-discovery`: AWS OIDC documents remain
- `hypershift.openshift.io/karpenter-finalizer`: running EC2 instances may be orphaned
- `hypershift.openshift.io/control-plane-operator-finalizer`: AWS PrivateLink endpoints remain
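The list above could back a getFinalizerImpact helper like the one referenced in the code sketch. The map and function below are hypothetical, with descriptions taken from the risks listed above:

```go
package main

import "fmt"

// finalizerImpacts records, for each identified HyperShift finalizer,
// what forced removal may leave behind.
var finalizerImpacts = map[string]string{
	"hypershift.openshift.io/finalizer":                        "main cleanup orchestration skipped; cloud resources may remain",
	"hypershift.io/aws-oidc-discovery":                         "AWS OIDC documents remain",
	"hypershift.openshift.io/karpenter-finalizer":              "running EC2 instances may be orphaned",
	"hypershift.openshift.io/control-plane-operator-finalizer": "AWS PrivateLink endpoints remain",
}

// getFinalizerImpact returns the known impact of force-removing a
// finalizer, or a generic warning for unrecognized ones.
func getFinalizerImpact(finalizer string) string {
	if impact, ok := finalizerImpacts[finalizer]; ok {
		return impact
	}
	return "unknown finalizer; impact of forced removal not documented"
}

func main() {
	fmt.Println(getFinalizerImpact("hypershift.io/aws-oidc-discovery"))
}
```

Centralizing the impact descriptions keeps the per-finalizer warning logs consistent with this documented risk list.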
Expected Benefits
- Better understanding of cleanup failures
- Reduced risk of orphaned resources
- Improved debugging capabilities for E2E test failures
- Cost savings by identifying orphaned cloud resources
Related Issues/PRs
- PR #1848 ("E2E Fix: remove the finalizers on HCP to allow force delete after timeout expires"): original implementation of NukeHostedCluster()
- Related HyperShift finalizer handling