Massive number of S3 API calls and 4xx errors #3759

Closed
agardiman opened this issue Jan 29, 2021 · 3 comments
Labels: stale, storage/blocks (Blocks storage engine)

Comments


agardiman commented Jan 29, 2021

Describe the bug
Using blocks storage on AWS with S3:

  1. Almost all the S3 API calls made by Cortex fail with 4xx error codes (see the diagnostic sketch after this list).
  2. All the calls go to the same single S3 endpoint (IP) over time, so Cortex does not appear to honor DNS TTLs and may be caching DNS results.
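To help pin down which requests come back as 4xx (and with which error codes), here is a small client-side diagnostic sketch. It is purely illustrative and not part of Cortex: it wraps an http.RoundTripper so every response is tallied by method and status class. The unauthenticated request to the S3 endpoint is only a placeholder and is expected to be rejected with a 4xx.

// statuscount.go - tally HTTP responses by method and status class, to see
// which kinds of requests come back as 4xx. Illustrative only, not Cortex code.
package main

import (
    "fmt"
    "net/http"
    "sync"
)

type countingTransport struct {
    next   http.RoundTripper
    mu     sync.Mutex
    counts map[string]int // e.g. "GET 4xx" -> 17
}

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
    resp, err := t.next.RoundTrip(req)
    if err == nil {
        key := fmt.Sprintf("%s %dxx", req.Method, resp.StatusCode/100)
        t.mu.Lock()
        t.counts[key]++
        t.mu.Unlock()
    }
    return resp, err
}

func main() {
    ct := &countingTransport{next: http.DefaultTransport, counts: map[string]int{}}
    client := &http.Client{Transport: ct}

    // Placeholder request: an unauthenticated call to the S3 endpoint,
    // expected to be rejected with a 4xx status.
    if resp, err := client.Get("https://s3.us-west-2.amazonaws.com/"); err == nil {
        resp.Body.Close()
    }

    for k, v := range ct.counts {
        fmt.Println(k, v)
    }
}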

To Reproduce
Steps to reproduce the behavior:

  1. Ingest 380M active time series at roughly 6M samples per second, with replication factor 3.
  2. The problem occurs even when no reads hit Cortex, only writes; it gets worse while queries are running.

Expected behavior
There are no 4xx errors from the S3 service.

Environment:

  • Kubernetes on AWS
  • jsonnet
  • block storage, S3
  • 50 store gateways
  • 260 ingesters
  • 12 compactors
  • 30 queriers

Storage Engine

  • [x] Blocks
  • [ ] Chunks

Additional Context
As shown in the screenshot below, the GET requests per second line up with the 4xx errors in the AWS console.
[screenshot: GET requests per second vs. 4xx errors, AWS console]

This might be the cause of #3753

Regarding point 2, consecutive nslookup calls from the same Cortex pod return different IP addresses for the S3 endpoint:

/ # nslookup s3.us-west-2.amazonaws.com | grep 'Address: '
Address: 52.218.136.120
/ # nslookup s3.us-west-2.amazonaws.com | grep 'Address: '
Address: 52.218.236.16
/ # nslookup s3.us-west-2.amazonaws.com | grep 'Address: '
Address: 52.218.228.144
/ # nslookup s3.us-west-2.amazonaws.com | grep 'Address: '
Address: 52.218.209.120
/ # nslookup s3.us-west-2.amazonaws.com | grep 'Address: '
Address: 52.218.152.104

So DNS resolution works properly in the Kubernetes cluster. If the IP hit by Cortex's S3 calls never changes over time, it seems to be due to a DNS caching issue inside Cortex.
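For completeness, here is a minimal Go sketch (illustrative only, not Cortex code) that mirrors the nslookup loop above. Go's built-in resolver performs no caching, so if Cortex keeps talking to a single IP, one assumption worth checking is that long-lived keep-alive connections are being reused rather than DNS being re-resolved.

// resolveloop.go - repeatedly resolve the S3 endpoint, mirroring the nslookup
// loop above, to verify that the returned IPs rotate over time.
package main

import (
    "context"
    "fmt"
    "net"
    "time"
)

func main() {
    const host = "s3.us-west-2.amazonaws.com"
    resolver := &net.Resolver{} // Go's default resolver; it does not cache answers

    for i := 1; i <= 5; i++ {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        ips, err := resolver.LookupHost(ctx, host)
        cancel()
        if err != nil {
            fmt.Println("lookup error:", err)
            continue
        }
        fmt.Printf("attempt %d: %v\n", i, ips)
        time.Sleep(2 * time.Second)
    }
}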

@pracucci (Contributor) commented:

> Using blocks storage on AWS with S3

Could you share a config snippet to see how S3 is configured (hiding sensitive info), please?

> Almost all the S3 API calls made by Cortex fail with 4xx error codes.

Which specific error are you getting? Could you share some logs, please?

pracucci added the storage/blocks (Blocks storage engine) label on Jan 29, 2021

stale bot commented Apr 30, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Apr 30, 2021
@agardiman (Author) commented:

The number of S3 calls was reduced dramatically by the introduction of the bucket index.
The high number of GET errors was because Cortex, for performance reasons, issues a GET on objects directly instead of checking their existence first, which saves a round trip. These 404s were also reported as errors in the metrics exposed by Cortex, but that case was later removed from the count, so the dashboards no longer report it as a problem.
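As a rough illustration of the pattern described above (an assumption based on the minio-go client, which Cortex's S3 backend uses through Thanos objstore; the bucket and object names are placeholders): the object is fetched directly, and a NoSuchKey (HTTP 404) response is treated as an expected miss rather than a failure, saving the extra existence check.

// getormiss.go - fetch an object directly and treat NoSuchKey (HTTP 404) as an
// expected miss instead of an error. Bucket and object names are placeholders.
package main

import (
    "context"
    "fmt"
    "io"

    "github.com/minio/minio-go/v7"
    "github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
    ctx := context.Background()

    client, err := minio.New("s3.us-west-2.amazonaws.com", &minio.Options{
        Creds:  credentials.NewEnvAWS(), // reads the AWS_* environment variables
        Secure: true,
    })
    if err != nil {
        panic(err)
    }

    // GET directly, without a prior Exists/HEAD call: one request instead of two.
    obj, err := client.GetObject(ctx, "example-bucket", "tenant-1/meta.json", minio.GetObjectOptions{})
    if err != nil {
        panic(err)
    }
    defer obj.Close()

    data, err := io.ReadAll(obj) // the request is performed lazily, on first read
    if err != nil {
        if minio.ToErrorResponse(err).Code == "NoSuchKey" {
            fmt.Println("object does not exist: an expected miss, not a failure")
            return
        }
        panic(err)
    }
    fmt.Printf("fetched %d bytes\n", len(data))
}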

As for all the calls going to the same single S3 endpoint, that concern was raised by the S3 team at AWS itself, but we could not confirm on our side that all the S3 calls were hitting the same IP.

Considering the above points, and after deploying newer versions of Cortex, the issues reported here no longer seem to be a problem.
Closing this.
