Prefect-GCP: Implement Retry Logic for Transient Errors with Cloud Run V2 #16448

@IsaacDayan

Describe the current behavior

When a Prefect flow is run through the Cloud Run V2 worker and the job submission hits a transient error, such as an HTTP 503 response, the entire flow run is marked as crashed. The submission is never retried, which leads to false alerts and unnecessary manual intervention.

Describe the proposed behavior

It would be beneficial to add automatic retries around the callers of JobV2.create in the Prefect-GCP integration, for example _create_job_and_wait_for_registration. Transient errors should be handled by retrying the submission a configurable number of times with exponential backoff. This is the usual practice when calling services that are prone to transient errors.
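
To make the request concrete, here is a minimal sketch of the retry behavior described above. It is not prefect-gcp code: `HttpStatusError`, `TRANSIENT_STATUSES`, and `retry_with_backoff` are hypothetical names invented for illustration, and the exception actually raised by the Cloud Run client may look different. The set of status codes treated as transient is also an assumption.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class HttpStatusError(Exception):
    """Hypothetical stand-in for the HTTP error raised on job submission."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


# Assumption: these status codes are worth retrying; adjust as needed.
TRANSIENT_STATUSES = frozenset({429, 500, 502, 503, 504})


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient HTTP errors with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except HttpStatusError as exc:
            # Re-raise non-transient errors, and the last transient one.
            if exc.status not in TRANSIENT_STATUSES or attempt == max_attempts:
                raise
            # Delay doubles on each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 10% jitter so concurrent workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

In the integration, the wrapped callable would be whatever performs the submission, and `max_attempts` and `base_delay` would presumably be exposed as worker or work-pool configuration. Errors outside the transient set still propagate immediately, so genuine job failures would continue to crash the flow run as they do today.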

Example Use

No response

Additional context

We are experiencing a high number of flow crashes caused by 503 errors. Our current workaround is an "Automation" that restarts crashed flows, but it cannot differentiate between actual job failures and these infrastructure-related interruptions. It also creates false alert notifications for crashed jobs. Implementing this feature would significantly reduce false alarms, improve the robustness of our automated workflows, and help us avoid "Negative data engineering"!

Labels: enhancement, good first issue, integrations
