Use newer AWS API for paginated queries #2452
Conversation
Force-pushed from 0ace13c to aebddc3
I'm not super familiar with the AWS workings, but this LGTM with a nit!
Good job @bboreham. I definitely agree the new code is much cleaner. I left a few minor comments and a major one in the retryer (unless I'm missing something, I believe it never retries).
}, retryer.withRetrys, withErrorHandler(query.TableName, "DynamoDB.QueryPages"))
})
if err != nil {
    return fmt.Errorf("QueryPage error: table=%v, err=%v", query.TableName, err)
Two things:
- QueryPage -> QueryPages (the function is called QueryPages, so fix the name in the error message).
- I would suggest using errors.Wrap() to wrap the error with the extra info (the parent chain is quite long and we sometimes unwrap to check, for example, whether the root error is a context cancellation or so. I haven't checked if this is the case here, but as a rule of thumb Wrap() should be safer).
Suggested change:
- return fmt.Errorf("QueryPage error: table=%v, err=%v", query.TableName, err)
+ return errors.Wrapf(err, "QueryPages error: table=%v", query.TableName)
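To illustrate why wrapping matters, here is a small standalone sketch (not code from this PR; the table name is made up) showing the difference with github.com/pkg/errors:

package main

import (
    "context"
    "fmt"

    "github.com/pkg/errors"
)

func main() {
    // Pretend the SDK call failed because the caller's context was cancelled.
    cause := context.Canceled

    // fmt.Errorf with %v flattens the error into a string: the cause is lost.
    flattened := fmt.Errorf("QueryPages error: table=%v, err=%v", "chunks_v9", cause)

    // errors.Wrapf keeps the cause in the chain, so callers can still inspect it.
    wrapped := errors.Wrapf(cause, "QueryPages error: table=%v", "chunks_v9")

    fmt.Println(errors.Cause(flattened) == context.Canceled) // false
    fmt.Println(errors.Cause(wrapped) == context.Canceled)   // true
}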
Maybe that would be better, but note I didn't change this error in this PR; that's what it did before.
pkg/chunk/aws/retryer.go (Outdated)
// ShouldRetry returns if the failed request is retryable.
func (r *retryer) ShouldRetry(req *request.Request) bool {
    var d client.DefaultRetryer
The default max number of retries is 0, so the following call to ShouldRetry() always returns false (it's checked inside). I think this retryer doesn't work as expected.
Thanks - hadn't spotted that. Now that I look back at this code, I was trying to re-use the AWS-SDK behaviour, but it's probably clearer just to copy-paste it.
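For context, here is a minimal sketch of what copy-pasting the SDK's retryability checks into a custom request.Retryer could look like, assuming aws-sdk-go v1. It is an illustration rather than the exact code merged in this PR, and the fixed RetryRules delay stands in for the Cortex backoff:

package aws

import (
    "time"

    "github.com/aws/aws-sdk-go/aws/request"
)

// retryer implements the aws-sdk-go request.Retryer interface without
// delegating to a zero-valued client.DefaultRetryer (whose ShouldRetry
// always returns false when NumMaxRetries is 0).
type retryer struct {
    maxRetries int
}

// MaxRetries returns the maximum number of retries this retryer allows.
func (r *retryer) MaxRetries() int {
    return r.maxRetries
}

// RetryRules returns how long to wait before the next attempt; the real code
// would derive this from the Cortex backoff rather than a fixed delay.
func (r *retryer) RetryRules(req *request.Request) time.Duration {
    return 500 * time.Millisecond
}

// ShouldRetry reports whether the failed request is retryable, applying the
// same checks as the SDK's DefaultRetryer.
func (r *retryer) ShouldRetry(req *request.Request) bool {
    // A handler may already have decided for us.
    if req.Retryable != nil {
        return *req.Retryable
    }
    return req.IsErrorRetryable() || req.IsErrorThrottle()
}

// withRetries is a request.Option that installs this retryer on a request,
// e.g. when passed to QueryPagesWithContext.
func (r *retryer) withRetries(req *request.Request) {
    req.Retryer = r
}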
Backoff:    util.NewBackoff(ctx, cfg),
maxRetries: cfg.MaxRetries,
Thinking out loud. If, for any reason, the Cortex backoff max retries is 0, it leads to the following edge case:
- 0 for the Cortex backoff means "infinite"
- 0 for the AWS retries means "do not retry"
I think the AWS semantics are better for this use case (I'm quite sure we don't want infinite retries). Looking at the code this shouldn't be a problem, but please double-check it too.
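To make the two conventions concrete, here is a rough, simplified illustration of the condition each side uses to decide whether to keep going; these are stand-ins for the purpose of the example, not the actual util.Backoff or AWS SDK code:

// Cortex-style backoff: MaxRetries == 0 is treated as "no limit".
func cortexKeepsRetrying(numRetries, maxRetries int) bool {
    return maxRetries == 0 || numRetries < maxRetries
}

// AWS-style retryer: MaxRetries() == 0 means the request is never retried.
func awsKeepsRetrying(retryCount, maxRetries int) bool {
    return retryCount < maxRetries
}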
pkg/chunk/aws/retryer.go (Outdated)
func (r *retryer) withRetrys(req *request.Request) { |
Did you mean withRetries()?
Force-pushed from 907fdaa to 9ca7610
I believe I've fixed all the points made. I have also added a check that the context hasn't been cancelled.
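As an illustration only (not necessarily the exact check added here), bailing out of the retry decision when the caller's context is done can look roughly like this:

import "context"

// shouldKeepRetrying is a hypothetical helper, not the PR's code: it stops
// retrying as soon as the caller's context is cancelled or past its deadline,
// and otherwise falls back to the usual retry-count check.
func shouldKeepRetrying(ctx context.Context, retryCount, maxRetries int) bool {
    if ctx.Err() != nil {
        return false
    }
    return retryCount < maxRetries
}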
Good job @bboreham! I have no other comments (and thanks for addressing my feedback). If you could rebase to fix the conflict, then we can merge it.
Force-pushed from 2d2b232 to ae51b18
Rebased.
Signed-off-by: Bryan Boreham <[email protected]>
This is less code, and more robust when retrying requests. We don't need an indirection on the request object for testing now. Signed-off-by: Bryan Boreham <[email protected]>
Now that we are calling QueryPagesWithContext directly, we don't need the paging interface and we never re-use request objects. Signed-off-by: Bryan Boreham <[email protected]>
Ingester.flushUserSeries() puts a timeout on the context, so don't retry for longer than that. Signed-off-by: Bryan Boreham <[email protected]>
Force-pushed from ae51b18 to 0443a18
What this PR does:
- Call QueryPagesWithContext() instead of iterating through NextPage() (a sketch of the call shape follows below).
- Use AWS-SDK Handlers for tracing each retry and reporting errors - trace output will be different, with just one span for the whole query and span-log entries for each page and retried operation.

This is less code, and more robust when retrying requests. I strongly suspect that before this change it could get into a state on network errors where it would fail every time until the max number of retries was hit - error message looks like:
We don't need so much from the request object for testing now.
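For readers less familiar with the DynamoDB API in aws-sdk-go v1, the shape of the new call is roughly the following. This is a generic sketch, not the Cortex code: the table name, key condition and handler are placeholders, and the request.Option shows where per-attempt tracing or error reporting (the Handlers point above) can hook in:

package main

import (
    "context"
    "fmt"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/request"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/dynamodb"
)

func main() {
    sess := session.Must(session.NewSession())
    svc := dynamodb.New(sess)

    input := &dynamodb.QueryInput{
        TableName:              aws.String("example_table"), // placeholder
        KeyConditionExpression: aws.String("h = :hash"),     // placeholder
        ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
            ":hash": {S: aws.String("some-hash-value")},
        },
    }

    // QueryPagesWithContext drives the pagination itself, calling the callback
    // once per page; returning false from the callback stops iteration early.
    err := svc.QueryPagesWithContext(context.Background(), input,
        func(page *dynamodb.QueryOutput, lastPage bool) bool {
            fmt.Println("items in page:", len(page.Items))
            return true // keep paging
        },
        // Extra request.Options can install a retryer or handlers; this one
        // logs every failed attempt, including retried ones.
        func(req *request.Request) {
            req.Handlers.CompleteAttempt.PushBack(func(r *request.Request) {
                if r.Error != nil {
                    fmt.Printf("attempt %d failed: %v\n", r.RetryCount, r.Error)
                }
            })
        },
    )
    if err != nil {
        fmt.Println("query failed:", err)
    }
}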
There are still two operations where we use request.Send(), but they do not use pagination, so there is no benefit to rewriting them in a similar way.

Which issue(s) this PR fixes:
Fixes #403
Part of #1152
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]