
Conversation


@Nayil97 Nayil97 commented Nov 5, 2025

Description

Implements automatic retry logic with exponential backoff for transient HTTP errors when creating Cloud Run V2 jobs. This prevents unnecessary flow crashes due to temporary GCP infrastructure issues.

Problem

Currently, when Cloud Run job creation encounters a transient error (HTTP 503, 500, 429), the entire flow run is marked as crashed without any retry attempts. This leads to:

  • False alerts for temporary infrastructure issues
  • Manual intervention required to restart flows
  • Difficulty distinguishing between actual job failures and transient infrastructure problems

Solution

Added retry logic to the _create_job_and_wait_for_registration method; the key changes are listed below.

Key Changes

  • Retry Loop: Wraps job creation in a loop with max 3 attempts (configurable)
  • Transient Error Detection: Identifies HTTP status codes 500, 503, and 429 as transient
  • Exponential Backoff: Implements 1s, 2s, 4s wait times between retries
  • Detailed Logging: Logs retry attempts with status codes and wait times
  • Smart Error Handling:
    • Retries only on transient errors
    • Fails immediately on non-transient errors
    • Uses existing error handling for final failures

Code Changes

File: src/integrations/prefect-gcp/prefect_gcp/workers/cloud_run_v2.py

max_retries = 3

for attempt in range(max_retries):
    try:
        JobV2.create(...)
        break  # Success
    except HttpError as exc:
        is_transient = exc.status_code in [500, 503, 429]
        is_last_attempt = attempt == max_retries - 1

        if is_transient and not is_last_attempt:
            wait_time = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s
            logger.warning(f"Retrying in {wait_time}s...")
            time.sleep(wait_time)
        else:
            # Non-transient error, or retries exhausted: fall back to the
            # existing error handling
            self._create_job_error(exc, configuration)
            break

Benefits

✅ Reduces false alerts from transient infrastructure issues
✅ Improves flow reliability without user intervention
✅ Maintains existing error handling for genuine failures
✅ Provides visibility through detailed retry logging
✅ Follows industry best practices for cloud API interactions

Testing

  • Verified retry logic structure
  • Confirmed exponential backoff timing of 1s, 2s, 4s (see the test sketch after this list)
  • Validated transient error detection (500, 503, 429)
  • Ensured existing error handling preserved
  • Checked logging messages
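
For illustration, here is a minimal test sketch (not part of this PR's test suite), assuming pytest, unittest.mock, httplib2, and googleapiclient are available. The _create_with_retries helper is a stand-in that mirrors the retry loop from the Code Changes section, so the backoff sequence and transient-error detection can be exercised without calling GCP:

import time
from unittest import mock

import httplib2
import pytest
from googleapiclient.errors import HttpError

TRANSIENT_STATUS_CODES = {429, 500, 503}


def _create_with_retries(create, max_retries=3):
    """Mirror of the PR's retry loop: retry transient HttpErrors, re-raise others."""
    for attempt in range(max_retries):
        try:
            create()
            return
        except HttpError as exc:
            if exc.status_code in TRANSIENT_STATUS_CODES and attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # 1s, 2s, 4s
            else:
                raise


def _http_error(status):
    """Build an HttpError carrying the given HTTP status code."""
    return HttpError(resp=httplib2.Response({"status": str(status)}), content=b"")


def test_transient_errors_retry_with_exponential_backoff():
    # Fail twice with 503, then succeed; sleep is patched so no real waiting happens.
    create = mock.Mock(side_effect=[_http_error(503), _http_error(503), None])
    with mock.patch("time.sleep") as sleep:
        _create_with_retries(create)
    assert create.call_count == 3
    assert [call.args[0] for call in sleep.call_args_list] == [1, 2]


def test_non_transient_errors_fail_immediately():
    # A 404 is not transient, so the error propagates without any retry or sleep.
    create = mock.Mock(side_effect=_http_error(404))
    with mock.patch("time.sleep") as sleep, pytest.raises(HttpError):
        _create_with_retries(create)
    assert create.call_count == 1
    assert sleep.call_count == 0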

Related Issue

Closes #16448

Checklist

  • Code follows Prefect style guidelines
  • Maintains backward compatibility
  • Error handling preserved for non-transient errors
  • Logging added for retry attempts
  • No breaking changes to existing functionality

Implements exponential backoff retry mechanism for transient HTTP errors
(500, 503, 429) when creating Cloud Run jobs. This prevents flow crashes
due to temporary infrastructure issues.

- Added retry loop with configurable max attempts (default: 3)
- Implements exponential backoff (1s, 2s, 4s)
- Added detailed logging for retry attempts
- Only retries on transient errors, fails immediately on others
- Maintains existing error handling for non-transient errors

Closes PrefectHQ#16448
Copilot AI review requested due to automatic review settings November 5, 2025 06:46
@github-actions bot added the enhancement (An improvement of an existing feature) and integrations (Related to integrations with other services) labels on Nov 5, 2025
Copilot AI left a comment

Pull Request Overview

This pull request adds retry logic with exponential backoff to handle transient errors when creating Cloud Run V2 jobs. The implementation addresses HTTP 500, 503, and 429 errors that may occur due to temporary service issues.

  • Added retry mechanism with exponential backoff (1s, 2s, 4s) for up to 3 attempts
  • Wrapped job creation in a retry loop that catches HttpError exceptions
  • Enhanced error handling to distinguish between transient and permanent errors


        is_last_attempt = attempt == max_retries - 1

        if is_transient and not is_last_attempt:
            wait_time = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s
Copilot AI commented Nov 5, 2025

The exponential backoff calculation is incorrect. For attempt=0, this produces 2**0 = 1 second, for attempt=1 it produces 2**1 = 2 seconds, and for attempt=2 it produces 2**2 = 4 seconds. However, the comment suggests the first wait should be 1s, but this occurs after the first failure (attempt=0), meaning the first attempt happens immediately. If the intention is to wait before the second attempt, the current logic is correct. Otherwise, consider using 2 ** (attempt + 1) for wait times of 2s, 4s, 8s.

Suggested change:
- wait_time = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s
+ wait_time = 2 ** (attempt + 1)  # Exponential backoff: 2s, 4s, 8s

                exc=exc,
                configuration=configuration,
            )
            break
Copilot AI commented Nov 5, 2025

The _create_job_error method raises an exception, but the code continues with break afterward. While the break is unreachable due to the raise, it creates confusion about control flow. Consider adding a comment indicating that _create_job_error always raises, or refactor to make the control flow more explicit by removing the break statement since it's unreachable.

Suggested change:
- break

@zzstoatzz (Collaborator)

@Nayil97 you may want to check out existing utilities in src/prefect/_internal/retries etc

@Nayil97 (Author) commented Nov 5, 2025

@zzstoatzz Thank you for the feedback! You're absolutely right - I should use Prefect's existing retry utilities instead of implementing custom retry logic.

I'll refactor this PR to use src/prefect/_internal/retries utilities. This will:

  • Follow Prefect's established patterns
  • Reduce code duplication
  • Leverage battle-tested retry mechanisms

I'll also address the Copilot feedback about:

  • Fixing the exponential backoff calculation
  • Removing the unreachable break statement after _create_job_error()

Will push an update shortly. Thanks for pointing me to the right utilities!

- Replace custom exponential backoff (2**attempt) with exponential_backoff_with_jitter from prefect._internal.retries
- Add proper exception re-raising instead of unreachable break
- Improve retry timing with jitter to avoid thundering herd
- Base delay: 2s, max delay: 10s with Poisson-distributed jitter

Addresses feedback from @zzstoatzz in PR review
@Nayil97 (Author) commented Nov 6, 2025

Thanks for the feedback! I see the implementation has been updated to use Prefect's built-in exponential_backoff_with_jitter function, which is much cleaner than my original approach. The unreachable break has also been properly replaced with a raise statement. Nice improvements!
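
For reference, a minimal sketch of what the refactored loop might look like when built on exponential_backoff_with_jitter. This is an illustration under assumptions, not the merged code: the (attempt, base_delay, max_delay) signature is inferred from the commit message above, and create_with_retries, create_job, and logger are illustrative placeholders for the worker context.

import logging
import time
from typing import Callable

from googleapiclient.errors import HttpError
from prefect._internal.retries import exponential_backoff_with_jitter

logger = logging.getLogger(__name__)

TRANSIENT_STATUS_CODES = {429, 500, 503}


def create_with_retries(create_job: Callable[[], None], max_attempts: int = 3) -> None:
    """Call create_job, retrying transient HttpErrors with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            create_job()
            return
        except HttpError as exc:
            transient = exc.status_code in TRANSIENT_STATUS_CODES
            if transient and attempt < max_attempts - 1:
                # Assumed signature: returns a wait in seconds with
                # Poisson-distributed jitter (base 2s, max 10s per the commit message).
                wait_time = exponential_backoff_with_jitter(
                    attempt, base_delay=2, max_delay=10
                )
                logger.warning(
                    "Transient HTTP %s while creating Cloud Run job; "
                    "retrying in %.1fs (attempt %d/%d)",
                    exc.status_code,
                    wait_time,
                    attempt + 1,
                    max_attempts,
                )
                time.sleep(wait_time)
            else:
                # Non-transient error or retries exhausted: re-raise so the
                # worker's existing error handling (_create_job_error) takes over.
                raise

In the actual worker the equivalent logic sits inside _create_job_and_wait_for_registration rather than a standalone helper.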


Labels

enhancement (An improvement of an existing feature), integrations (Related to integrations with other services)

Development

Successfully merging this pull request may close these issues: Prefect-GCP: Implement Retry Logic for Transient Errors with Cloud Run V2

2 participants