feat: Add retry logic for Cloud Run V2 transient errors #19345
base: main
Conversation
Implements an exponential backoff retry mechanism for transient HTTP errors (500, 503, 429) when creating Cloud Run jobs. This prevents flow crashes due to temporary infrastructure issues.
- Added retry loop with configurable max attempts (default: 3)
- Implements exponential backoff (1s, 2s, 4s)
- Added detailed logging for retry attempts
- Only retries on transient errors, fails immediately on others
- Maintains existing error handling for non-transient errors

Closes PrefectHQ#16448
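For context, a minimal sketch of the retry loop described above (illustrative only, not the PR's exact diff): `create_job` stands in for the worker's job-creation call, and the attribute used to read the `HttpError` status code is an assumption.

```python
import time

from googleapiclient.errors import HttpError

TRANSIENT_STATUS_CODES = {429, 500, 503}
MAX_RETRIES = 3  # default number of attempts


def create_job_with_retries(create_job, logger):
    """Attempt `create_job()` up to MAX_RETRIES times, backing off on transient errors."""
    for attempt in range(MAX_RETRIES):
        try:
            return create_job()
        except HttpError as exc:
            # Attribute name is an assumption; older client versions expose exc.resp.status.
            status = getattr(exc, "status_code", None) or exc.resp.status
            is_transient = status in TRANSIENT_STATUS_CODES
            is_last_attempt = attempt == MAX_RETRIES - 1
            if is_transient and not is_last_attempt:
                wait_time = 2**attempt  # exponential backoff: 1s, 2s, 4s
                logger.warning(
                    "Transient error %s creating Cloud Run job; retrying in %ss "
                    "(attempt %s of %s)",
                    status, wait_time, attempt + 1, MAX_RETRIES,
                )
                time.sleep(wait_time)
            else:
                # Non-transient error or retries exhausted: keep existing error handling.
                raise
```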
Pull Request Overview
This pull request adds retry logic with exponential backoff to handle transient errors when creating Cloud Run V2 jobs. The implementation addresses HTTP 500, 503, and 429 errors that may occur due to temporary service issues.
- Added retry mechanism with exponential backoff (1s, 2s, 4s) for up to 3 attempts
- Wrapped job creation in a retry loop that catches `HttpError` exceptions
- Enhanced error handling to distinguish between transient and permanent errors
```python
is_last_attempt = attempt == max_retries - 1

if is_transient and not is_last_attempt:
    wait_time = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s
```
Copilot AI · Nov 5, 2025
The exponential backoff calculation is incorrect. For attempt=0, this produces 2**0 = 1 second, for attempt=1 it produces 2**1 = 2 seconds, and for attempt=2 it produces 2**2 = 4 seconds. However, the comment suggests the first wait should be 1s, but this occurs after the first failure (attempt=0), meaning the first attempt happens immediately. If the intention is to wait before the second attempt, the current logic is correct. Otherwise, consider using 2 ** (attempt + 1) for wait times of 2s, 4s, 8s.
Suggested change:
```diff
- wait_time = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s
+ wait_time = 2 ** (attempt + 1)  # Exponential backoff: 2s, 4s, 8s
```
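For concreteness, the two formulas produce these wait schedules over three attempts:

```python
# Current formula: the first attempt runs immediately; waits apply before each retry.
print([2**attempt for attempt in range(3)])        # [1, 2, 4]

# Suggested alternative: shifts the whole schedule up by one step.
print([2**(attempt + 1) for attempt in range(3)])  # [2, 4, 8]
```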
```python
    exc=exc,
    configuration=configuration,
)
break
```
Copilot AI · Nov 5, 2025
The `_create_job_error` method raises an exception, but the code continues with `break` afterward. The `break` is unreachable because of the raise, which makes the control flow confusing. Consider adding a comment indicating that `_create_job_error` always raises, or remove the unreachable `break` to make the control flow explicit.
Suggested change:
```diff
- break
```
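One way to make the "always raises" contract explicit (a sketch; the real body of `_create_job_error` is not shown in this diff, and the `RuntimeError` here is an assumption) is to annotate the helper with `typing.NoReturn`, so type checkers flag anything placed after the call as dead code:

```python
from typing import NoReturn


def _create_job_error(exc: Exception, configuration: object) -> NoReturn:
    """Translate a failed job-creation call into a descriptive error (sketch only)."""
    # The NoReturn annotation documents that this helper never returns normally,
    # which makes a trailing `break` after calling it visibly unreachable.
    raise RuntimeError(f"Failed to create Cloud Run job: {exc}") from exc
```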
zzstoatzz: @Nayil97 you may want to check out the existing utilities in `src/prefect/_internal/retries` etc.
Nayil97: @zzstoatzz Thank you for the feedback! You're absolutely right - I should use Prefect's existing retry utilities instead of implementing custom retry logic. I'll refactor this PR to use the helpers in `prefect._internal.retries`.
I'll also address the Copilot feedback about:
- the exponential backoff wait-time calculation
- the unreachable `break` after `_create_job_error`

Will push an update shortly. Thanks for pointing me to the right utilities!
- Replace custom exponential backoff (2**attempt) with `exponential_backoff_with_jitter` from `prefect._internal.retries`
- Add proper exception re-raising instead of unreachable `break`
- Improve retry timing with jitter to avoid thundering herd
- Base delay: 2s, max delay: 10s with Poisson-distributed jitter

Addresses feedback from @zzstoatzz in PR review
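Roughly what that refactor could look like (a sketch only: `create_job` and `is_transient` are hypothetical callables, and the signature of `exponential_backoff_with_jitter` is assumed to be `(attempt, base_delay, max_delay) -> float`; check `prefect._internal.retries` before copying):

```python
import time

from prefect._internal.retries import exponential_backoff_with_jitter

MAX_ATTEMPTS = 3
BASE_DELAY = 2   # seconds, per the commit message above
MAX_DELAY = 10   # seconds, per the commit message above


def create_job_with_jittered_backoff(create_job, is_transient, logger):
    """Retry `create_job()` on transient errors using Prefect's jittered backoff helper."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return create_job()
        except Exception as exc:
            if not is_transient(exc) or attempt == MAX_ATTEMPTS - 1:
                raise  # re-raise instead of an unreachable `break`
            delay = exponential_backoff_with_jitter(attempt, BASE_DELAY, MAX_DELAY)
            logger.warning("Transient error creating Cloud Run job; retrying in %.1fs", delay)
            time.sleep(delay)
```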
Thanks for the feedback! I see the implementation has been updated to use Prefect's built-in retry utilities (`exponential_backoff_with_jitter`).
Description
Implements automatic retry logic with exponential backoff for transient HTTP errors when creating Cloud Run V2 jobs. This prevents unnecessary flow crashes due to temporary GCP infrastructure issues.
Problem
Currently, when Cloud Run job creation encounters a transient error (HTTP 503, 500, 429), the entire flow run is marked as crashed without any retry attempts. This leads to unnecessary crashed flow runs and false alerts for failures that would likely succeed on retry.
Solution
Added retry logic to the `_create_job_and_wait_for_registration` method that retries transient errors (HTTP 500, 503, 429) with exponential backoff and fails immediately on all other errors.
Key Changes
Code Changes
File: `src/integrations/prefect-gcp/prefect_gcp/workers/cloud_run_v2.py`
Benefits
✅ Reduces false alerts from transient infrastructure issues
✅ Improves flow reliability without user intervention
✅ Maintains existing error handling for genuine failures
✅ Provides visibility through detailed retry logging
✅ Follows industry best practices for cloud API interactions
Testing
Related Issue
Closes #16448
Checklist