
Enhancement: Improve finalizer removal diagnostics and provide safer override mechanism for HCP cleanup timeouts #1852

@kaovilai

Summary

Following PR #1848, we need to enhance the finalizer removal mechanism to provide better diagnostics and a safer approach when HCP cleanup times out during E2E tests.

Context

PR #1848 introduced NukeHostedCluster(), which blindly removes all finalizers from the HostedCluster when deletion times out. While this unblocks E2E tests, it bypasses critical cleanup logic managed by various HyperShift components.

Current Issues

The current implementation removes all finalizers without understanding:

  • Why cleanup is failing or taking too long
  • Which specific finalizer is blocking deletion
  • What resources might be left behind

Proposed Enhancements

  1. Enhanced Diagnostics Before Forceful Removal

    • Log which finalizers are still present
    • Query and log the status of resources each finalizer is protecting
    • Attempt to identify the specific blocker
  2. Graduated Finalizer Removal

    • Instead of removing all finalizers at once, remove them individually
    • Log what each finalizer was protecting before removal
    • Allow configuration of which finalizers can be safely force-removed
  3. Timeout Configuration

    • Make cleanup timeout configurable per finalizer type
    • Different finalizers may need different grace periods (a configuration sketch follows this list)
  4. Post-Removal Report

    • Generate a report of potentially orphaned resources
    • Include cloud provider resources that may incur costs
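
For the timeout configuration point above, a minimal sketch of what per-finalizer grace periods could look like. The map contents, the defaultCleanupTimeout value, and the cleanupTimeoutFor helper are illustrative assumptions, not existing osde2e or HyperShift code:

import "time"

// Illustrative defaults only; real values would be tuned per finalizer type.
var defaultCleanupTimeout = 15 * time.Minute

var finalizerTimeouts = map[string]time.Duration{
    "hypershift.openshift.io/finalizer":                        30 * time.Minute,
    "hypershift.openshift.io/karpenter-finalizer":              20 * time.Minute,
    "hypershift.openshift.io/control-plane-operator-finalizer": 20 * time.Minute,
}

// cleanupTimeoutFor returns how long to wait for a finalizer's cleanup
// before considering forceful removal.
func cleanupTimeoutFor(finalizer string) time.Duration {
    if d, ok := finalizerTimeouts[finalizer]; ok {
        return d
    }
    return defaultCleanupTimeout
}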

Implementation Suggestions

// Example enhancement to NukeHostedCluster
func NukeHostedCluster(h *helper.H, hc *hyperv1.HostedCluster) error {
    // First, diagnose why deletion is blocked
    diagnostics := diagnoseFinalizerBlockage(h, hc)
    h.Logger.Info("Finalizer diagnostics", "report", diagnostics)

    // Finalizers known to be safe to force-remove without leaking resources
    safeFinalizers := map[string]bool{
        "openshift.io/destroy-cluster": true,
        // Add other known safe finalizers
    }

    // Remove finalizers individually, logging the potential impact of any
    // finalizer that is not on the known-safe list
    for _, finalizer := range hc.GetFinalizers() {
        if !safeFinalizers[finalizer] {
            h.Logger.Warn("Force removing finalizer",
                "finalizer", finalizer,
                "potentialImpact", getFinalizerImpact(finalizer))
        }
        // Remove the individual finalizer and persist the change
        // (see the removal sketch below)
    }

    // Generate orphaned resources report
    report := generateOrphanedResourcesReport(h, hc)
    h.Logger.Error("Potential orphaned resources after forced cleanup", "report", report)
    return nil
}
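
The per-finalizer removal step could be backed by controller-runtime's controllerutil helpers. A minimal sketch, assuming a controller-runtime client is available to the test harness (how it is obtained from helper.H is not shown, and the hyperv1 import path should match whatever the test suite already uses):

import (
    "context"

    // Adjust the API package path/version to match the test suite's existing import.
    hyperv1 "github.com/openshift/hypershift/api/hypershift/v1beta1"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// removeFinalizer force-removes a single finalizer from the HostedCluster and
// persists the change. It is a no-op if the finalizer is not present.
func removeFinalizer(ctx context.Context, c client.Client, hc *hyperv1.HostedCluster, finalizer string) error {
    if !controllerutil.RemoveFinalizer(hc, finalizer) {
        return nil // finalizer was not on the object
    }
    return c.Update(ctx, hc)
}

In practice each removal would likely be wrapped in a conflict-retry loop (for example client-go's retry.RetryOnConflict) to tolerate concurrent updates from the HyperShift operators.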

Identified Finalizers and Their Risks

Based on analysis of the HyperShift codebase, force-removing these finalizers carries the following risks:

  • hypershift.openshift.io/finalizer: main cleanup orchestration is skipped; cloud resources may be left behind
  • hypershift.io/aws-oidc-discovery: AWS OIDC discovery documents remain
  • hypershift.openshift.io/karpenter-finalizer: running EC2 instances may be orphaned
  • hypershift.openshift.io/control-plane-operator-finalizer: AWS PrivateLink endpoints remain
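
The getFinalizerImpact helper referenced in the implementation suggestion does not exist yet; one possible starting point is a lookup table that mirrors this list:

// getFinalizerImpact maps known HyperShift finalizers to the resources that
// may be orphaned if the finalizer is force-removed. Unknown finalizers get a
// generic warning.
func getFinalizerImpact(finalizer string) string {
    impacts := map[string]string{
        "hypershift.openshift.io/finalizer":                        "main cleanup orchestration skipped; cloud resources may be left behind",
        "hypershift.io/aws-oidc-discovery":                         "AWS OIDC discovery documents remain",
        "hypershift.openshift.io/karpenter-finalizer":              "running EC2 instances may be orphaned",
        "hypershift.openshift.io/control-plane-operator-finalizer": "AWS PrivateLink endpoints remain",
    }
    if impact, ok := impacts[finalizer]; ok {
        return impact
    }
    return "unknown finalizer; impact of forced removal not characterized"
}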

Expected Benefits

  1. Better understanding of cleanup failures
  2. Reduced risk of orphaned resources
  3. Improved debugging capabilities for E2E test failures
  4. Cost savings by identifying orphaned cloud resources

Related Issues/PRs

  • PR #1848: introduced NukeHostedCluster() to force-remove finalizers on cleanup timeout
