Client Retry Policy: Adds HTTP timeouts with request-level cross-region retry #32450
FabianMeiswinkel
left a comment
LGTM except for two minor suggestions.
FabianMeiswinkel
left a comment
Please also add the changelog entries
Co-authored-by: Fabian Meiswinkel <[email protected]>
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
Failed test: conflictCustomLWW (known flaky test)
    }

    //Data Plane Read & Write
    if(!isMetaDataRequest && !request.isAddressRefresh()) {
Are write operations also retriable? (How do we know the request did not reach the server?)
Yes, data plane write operations are retriable (metadata writes are not) and should be handled in this case. I believe that because we are getting an error code, the request never reached the server.
Good catch Annie - I missed this
@NaluTripician - no, write operations are not retriable in general, only when we can be sure that the request has never been processed. So, when we get certain error codes from the service (410/0 sent from the service) or when we get a timeout trying to establish a connection, we know that the request has never been processed and a retry is idempotent/safe. But a request timeout after we sent the request on the wire means we simply don't know whether the request was ever processed in the service, so a retry would not be safe. Usually we capture the state of whether the request has been flushed to the network yet, and also whether a 410 comes from the service or is client-side generated, to make the call on whether writes can be retried. Sounds like the above logic might be too aggressive with retries - can you please double-check?
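To make the rule above concrete, here is a minimal sketch of the decision it describes. Every name in it (`WriteRetrySafety`, `FailureKind`, `flushedToNetwork`) is hypothetical and is not part of the SDK. It also assumes that reads are always safe to retry, and that a client-generated 410 or a response timeout is safe only when nothing had been flushed to the network.

```java
// Illustrative sketch only: these types are hypothetical and are not the Azure Cosmos SDK API.
final class WriteRetrySafety {

    enum FailureKind {
        CONNECT_TIMEOUT,        // timed out while establishing the connection
        GONE_FROM_SERVICE,      // 410 returned by the service itself
        GONE_CLIENT_GENERATED,  // 410 synthesized on the client side
        RESPONSE_TIMEOUT        // timed out waiting for a response
    }

    /**
     * Returns true when a retry cannot cause the operation to be applied twice.
     * Assumption: reads are idempotent, so they are always retriable.
     */
    static boolean isRetrySafe(boolean isReadOnly, FailureKind kind, boolean flushedToNetwork) {
        if (isReadOnly) {
            return true;
        }
        switch (kind) {
            case CONNECT_TIMEOUT:
            case GONE_FROM_SERVICE:
                // The request was never processed by the service.
                return true;
            case GONE_CLIENT_GENERATED:
            case RESPONSE_TIMEOUT:
                // Outcome is unknown once bytes may have left the client.
                return !flushedToNetwork;
            default:
                return false;
        }
    }
}
```

Under these assumptions, a write that timed out after being flushed is never retried, which is the case the comment above flags as unsafe.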
      //Data Plane Read & Write
    - if(!isMetaDataRequest && !request.isAddressRefresh()) {
    + if(!isMetaDataRequest && !request.isAddressRefresh()
    +     && (request.isReadOnly() || !BridgeInternal.hasSendingRequestStarted(clientException))) {
I see two options here:
- An HTTP timeout exception usually indicates that the SDK has already sent the request; otherwise the SDK would translate the exception into GATEWAY_ENDPOINT_UNAVAILABLE. So I think we could remove the write operation completely here.
- Use the hasSendingRequestStarted flag. It has only been wired up in the direct RNTBD layer, not for gateway requests, so we would need to wire it up for gateway requests as well. That is also why the tests you have now still succeeded after the change.
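For the second option, one possible shape for wiring a "sending started" flag through a gateway call is sketched below. This is an assumption-laden illustration, not the SDK's gateway code: `Transport`, `TrackedFailure`, `execute`, and `failingIn` are all hypothetical names, and the real transport does not necessarily expose separate connect and send phases like this.

```java
// Illustrative sketch only: hypothetical types, not the SDK's gateway or RNTBD code.
final class GatewaySendTracking {

    /** Minimal stand-in for a transport with separate connect and send phases. */
    interface Transport {
        void connect() throws Exception;
        String send(String payload) throws Exception;
    }

    /** Failure that remembers whether the send phase had begun. */
    static final class TrackedFailure extends RuntimeException {
        final boolean sendingRequestStarted;

        TrackedFailure(Throwable cause, boolean sendingRequestStarted) {
            super(cause);
            this.sendingRequestStarted = sendingRequestStarted;
        }
    }

    /** Runs the request and tags any failure with the phase it occurred in. */
    static String execute(Transport transport, String payload) {
        boolean sendingStarted = false;
        try {
            transport.connect();
            sendingStarted = true; // set before any bytes can leave the client
            return transport.send(payload);
        } catch (Exception e) {
            throw new TrackedFailure(e, sendingStarted);
        }
    }

    /** Fake transport for demonstration; fails in the named phase ("connect" or "send"). */
    static Transport failingIn(String phase) {
        return new Transport() {
            public void connect() throws Exception {
                if ("connect".equals(phase)) {
                    throw new java.io.IOException("connect timeout");
                }
            }

            public String send(String payload) throws Exception {
                if ("send".equals(phase)) {
                    throw new java.net.SocketTimeoutException("response timeout");
                }
                return "ok:" + payload;
            }
        };
    }
}
```

A retry policy could then treat a write as retriable only when `sendingRequestStarted` is false, which is the check the suggested diff relies on.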
xinlian12
left a comment
LGTM, thanks
…ntation/ClientRetryPolicy.java Co-authored-by: Kushagra Thapar <[email protected]>
…/github.com/Azure/azure-sdk-for-java into users/nalutripician/HttpTimeoutGatewayRetry
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
        return this.throttlingRetry.shouldRetry(e);
    }

    private boolean canGatewayRequestFailoverOnTimeout(RxDocumentServiceRequest request, CosmosException clientException) {
I see the method takes in clientException but doesn't use it anywhere - is this intended?
Description
Based on issue created here.
Java V4 needs to add cross-region retries for the following combination of request types:
Data plane Write
This is a mirror of this issue and PR on the .NET SDK and resolves Issue #31367.
On the Java side, after retries are exhausted on the WebRetryPolicy, the request will be retried through the ClientRetryPolicy. In the shouldRetry method, upon receiving a network failure where the gateway endpoint timed out (Sub-Status Code 10002), the SDK will now detect whether the request is one of the above cases and attempt to fail over to another region.
Changes also include behavior for Query Plan operations. Because query plan retries on endpoint timeouts are now handled with data plane reads~+writes~ and metadata reads, this behavior change needed to be reflected in the tests.
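The shouldRetry behavior described above can be pictured with the rough sketch below. It is not the actual ClientRetryPolicy code: `TimeoutFailoverSketch`, `canFailoverOnTimeout`, and `nextRegion` are hypothetical helpers, and the sketch assumes that only read-only requests (data plane reads and metadata reads) qualify for failover and that regions are tried in preferred-list order.

```java
// Illustrative sketch only: hypothetical helper, not the actual ClientRetryPolicy code.
final class TimeoutFailoverSketch {

    // Sub-status code given in the PR description for a gateway endpoint timeout.
    static final int GATEWAY_ENDPOINT_TIMEOUT_SUB_STATUS = 10002;

    /**
     * Decides whether a timed-out gateway request may be retried in another region.
     * Assumption: only read-only requests qualify; writes are excluded.
     */
    static boolean canFailoverOnTimeout(int subStatusCode, boolean isReadOnly) {
        return subStatusCode == GATEWAY_ENDPOINT_TIMEOUT_SUB_STATUS && isReadOnly;
    }

    /**
     * Picks the region after the current one in the preferred list,
     * or returns null when no further region is available.
     */
    static String nextRegion(String[] preferredRegions, int currentIndex) {
        int next = currentIndex + 1;
        return next < preferredRegions.length ? preferredRegions[next] : null;
    }
}
```

For example, with a hypothetical preferred list of `{"East US", "West US"}`, a read that times out in the first region would be retried in the second, while a write would surface the timeout to the caller.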
Handling of data plane writes is not included in this PR due to concerns about the safety of retrying them. Once more investigation has been conducted, a new PR will be issued with those changes or this PR will be updated with the results.
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines