Skip to content

Conversation

@NaluTripician
Copy link
Contributor

@NaluTripician NaluTripician commented Dec 5, 2022

Description

Based on issue created here.

Java V4 needs to add cross-region retries for the following combination of request types:

Type Read/Write
Data plane Read
Data plane Write
Metadata Read
Query Plan Read (it is always read)

This is a mirror of this issue and PR on the .NET SDK and resolves Issue #31367.

On the Java side, after retries are expended on the WebRetryPolicy, It will retry through the ClientRetryPolicy. In the shouldRetry method upon receiving a Network Failure where the Gateway Endpoint timed out (Sub-Status Code 10002), the SDK will now detect if it is one of the above cases and attempt to see if it can failover to another region.

Changes also include behavior for Query Plan operations. Because query plan retries on endpoint timeouts are now handled with data plane reads~+writes~ and metadata reads, this behavior change needed to be reflected in the tests.

Handling of dataplane writes are not included in this PR due to concern with the safety of retrying them. Once more investigation is conducted a new PR will be issued with this changes or this PR will be updated with the results.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Copy link
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for two minor suggestions.

Copy link
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add the changelog entries

@xinlian12
Copy link
Member

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Copy link
Member

Failed test: conflictCustomLWW (known flaky test)

}

//Data Plane Read & Write
if(!isMetaDataRequest && !request.isAddressRefresh()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write operation also retriable? (how we know the request is not reached to server?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, data plane writes (meta data writes are not) operations are retriable and should be handled in this case. I believe that because we are getting an error code the request never reached the server.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch Annie - I missed this

@NaluTripician - no, write operations are not in general retriable - only when we can be sure that the request has never been processed. So, when we get certain error codes form the service (410/0 sent form the service)or when we get a timeout trying to establish a connection we know that the request has never been processed and retry is idempotent/safe. But a request timeout after we sent the request on the wire means we simply don't know whether the request was ever processed in the service - so, a retry would not be safe. Usually we capture state of whether request has been flushed to the network yet and also whether a 410 comes form the service or is clinet-side generated to make the call whether writes can be retried. Sounds like above logic might be too aggressive with retries - can you please double-check?

//Data Plane Read & Write
if(!isMetaDataRequest && !request.isAddressRefresh()) {
if(!isMetaDataRequest && !request.isAddressRefresh()
&& (request.isReadOnly() || !BridgeInternal.hasSendingRequestStarted(clientException))) {
Copy link
Member

@xinlian12 xinlian12 Dec 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see two options here:

  1. for http timeout exception, usually it is an indicated that SDK has sent the request, else SDK will translate those exceptions into GATEWAY_ENDPOINT_UNAVAILABLE. so we could remove the write operation completely here I think
  2. Using the hasSendingRequestStarted - this flag has only been wired up in direct rntbd layer, but not for gateway requests, so we will need to wire it up for gateway requests as well. -- that is also why the tests you have now still succeeded after the change

Copy link
Member

@xinlian12 xinlian12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

@NaluTripician
Copy link
Contributor Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@kushagraThapar kushagraThapar merged commit fa0869d into main Dec 14, 2022
@kushagraThapar kushagraThapar deleted the users/nalutripician/HttpTimeoutGatewayRetry branch December 14, 2022 16:50
return this.throttlingRetry.shouldRetry(e);
}

private boolean canGatewayRequestFailoverOnTimeout(RxDocumentServiceRequest request, CosmosException clientException) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the method takes in clientException but doesn't use it anywhere - is this intended?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Client Retry Policy: Handle HTTP timeouts with request-level cross-region retry

6 participants