If I restart the query-frontend while queriers are running then we can't achieve -querier.max-concurrent #4391

Closed
1 of 2 tasks
alvinlin123 opened this issue Jul 31, 2021 · 0 comments · Fixed by #4417

Comments

alvinlin123 (Contributor) commented Jul 31, 2021

Describe the bug
If query-frontend and querier are restarted at the same time, or query-frontend is restarted while queriers are running, then -querier.max-concurrent cannot be achieved.

To Reproduce

  1. Restart just the queriers with a rollout restart; do not restart the query-frontend.
  2. Make sure your system is in a steady state and you can achieve -querier.max-concurrent.
  3. Restart the query-frontend.
  4. Hammer all your query-frontends with expensive queries and observe that -querier.max-concurrent is no longer achievable.

Expected behavior
Should still be able to achieve -querier.max-concurrent.

Environment:
We are running on k8s.

Storage Engine

  • Blocks
  • Chunks

Additional Context
My suspicion is that AddressRemoved in worker.go does not call resetConcurrency().

Imagine the following case:

  • You have 1 querier and 3 query-frontends (fe1, fe2, and fe3).
  • Your -querier.max-concurrent is set to 8.
  • So each query-frontend gets at least 2 connections from the querier. Because 8 is not divisible by 3 and 8 modulo 3 is 2, fe1 and fe2 each get one extra connection (see the sketch after this list).
  • So fe1 has 3 connections to the querier, fe2 has 3, and fe3 has 2.
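
For illustration, here is a minimal, self-contained sketch of that distribution math. The function name and frontend labels are made up for this example; this is not the actual worker.go code, just the arithmetic described above.

```go
package main

import "fmt"

// distributeConcurrency spreads maxConcurrent across the known frontend
// addresses: every frontend gets maxConcurrent/n connections, and the first
// maxConcurrent%n frontends get one extra so the shares add up to maxConcurrent.
func distributeConcurrency(maxConcurrent int, frontends []string) map[string]int {
	out := make(map[string]int, len(frontends))
	n := len(frontends)
	if n == 0 {
		return out
	}
	base, extra := maxConcurrent/n, maxConcurrent%n
	for i, fe := range frontends {
		c := base
		if i < extra {
			c++
		}
		out[fe] = c
	}
	return out
}

func main() {
	// 8 connections over 3 frontends: prints map[fe1:3 fe2:3 fe3:2]
	fmt.Println(distributeConcurrency(8, []string{"fe1", "fe2", "fe3"}))
}
```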

Now we restart the query-frontends, and the DNS watch on the querier (worker.go) gets to work, adding and removing addresses.

  • During the rollout we will have 6 query-frontends, fe1 to fe6, because new pods are spun up first.
  • So you get into a state where fe1 has 2 connections to the querier, fe2 has 2, fe3 has 1, fe4 has 1, fe5 has 1, and fe6 has 1.
  • Then the old pods, fe1 to fe3, are spun down.
  • Because the AddressRemoved method does not call resetConcurrency() to recalculate the load distribution, we end up with fe4 having 1 connection to the querier, fe5 having 1, and fe6 having 1, which is just 3 connections instead of 8 (see the sketch after this list).
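
To make the effect concrete, here is a toy, self-contained Go sketch of the scenario above. The toyWorker type and its methods are hypothetical stand-ins for the real worker.go logic, not the actual Cortex implementation; the point is only to show what happens when address removal skips the redistribution step.

```go
package main

import (
	"fmt"
	"sort"
)

// toyWorker is an illustrative stand-in for the querier-side worker: it
// tracks how many connections the querier keeps open to each frontend.
type toyWorker struct {
	maxConcurrent int
	connections   map[string]int
}

// AddressAdded registers a new frontend and redistributes maxConcurrent,
// mirroring the behaviour described above for new addresses.
func (w *toyWorker) AddressAdded(addr string) {
	w.connections[addr] = 0
	w.resetConcurrency()
}

// AddressRemoved models the suspected bug: the frontend is dropped, but the
// remaining frontends keep their old share, so the total shrinks.
func (w *toyWorker) AddressRemoved(addr string) {
	delete(w.connections, addr)
	// Missing: w.resetConcurrency() -- this is the suspected fix.
}

// resetConcurrency gives every frontend maxConcurrent/n connections, with the
// first maxConcurrent%n frontends (in sorted order) getting one extra.
func (w *toyWorker) resetConcurrency() {
	addrs := make([]string, 0, len(w.connections))
	for a := range w.connections {
		addrs = append(addrs, a)
	}
	sort.Strings(addrs)
	base, extra := w.maxConcurrent/len(addrs), w.maxConcurrent%len(addrs)
	for i, a := range addrs {
		c := base
		if i < extra {
			c++
		}
		w.connections[a] = c
	}
}

func (w *toyWorker) total() int {
	t := 0
	for _, c := range w.connections {
		t += c
	}
	return t
}

func main() {
	w := &toyWorker{maxConcurrent: 8, connections: map[string]int{}}

	// Rollout: old (fe1-fe3) and new (fe4-fe6) frontends coexist.
	for _, fe := range []string{"fe1", "fe2", "fe3", "fe4", "fe5", "fe6"} {
		w.AddressAdded(fe)
	}
	fmt.Println("during rollout:", w.connections, "total =", w.total()) // total = 8

	// Old pods terminate; without a reset the total drops to 3.
	for _, fe := range []string{"fe1", "fe2", "fe3"} {
		w.AddressRemoved(fe)
	}
	fmt.Println("after removals:", w.connections, "total =", w.total()) // total = 3

	// What calling resetConcurrency() on removal would restore.
	w.resetConcurrency()
	fmt.Println("after reset:", w.connections, "total =", w.total()) // total = 8
}
```

With the reset added to the removal path, as the suspected fix would do, the three surviving frontends are bumped back to 3, 3, and 2 connections, restoring the configured total of 8.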

Below is a graph showing achievement of -querier.max-concurrent=8 during different phases.

[Grafana graph: -querier.max-concurrent=8 achievement across the restart phases]
