If I restart the query-frontend while queriers are running then we can't achieve -querier.max-concurrent #4391

Closed
1 of 2 tasks
alvinlin123 opened this issue Jul 31, 2021 · 0 comments · Fixed by #4417

Comments

alvinlin123 (Contributor) commented Jul 31, 2021

Describe the bug
If query-frontend and querier are restarted at the same time, or query-frontend is restarted while queriers are running, then -querier.max-concurrent cannot be achieved.

To Reproduce

  1. Restart just the queriers with a rollout restart; do not restart the query-frontend.
  2. Make sure your system is in a steady state and you can achieve -querier.max-concurrent.
  3. Restart the query-frontend.
  4. Hammer all your query-frontends with expensive queries and observe that -querier.max-concurrent is no longer achievable.

Expected behavior
Should still be able to achieve -querier.max-concurrent.

Environment:
We are running on k8s.

Storage Engine

  • Blocks
  • Chunks

Additional Context
My suspicion is that AddressRemoved in worker.go does not call resetConcurrency().

Imagine the following case:

  • You have 1 querier and 3 query-frontends (fe1, fe2, and fe3).
  • Your -querier.max-concurrent is set to 8.
  • So each query-frontend gets at least 2 connections from the querier. Because 8 is not divisible by 3 and 8 modulo 3 is 2, fe1 and fe2 each get one extra connection (see the sketch after this list).
  • So fe1 has 3 connections to the querier, fe2 has 3, and fe3 has 2.
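
For illustration, here is a minimal, self-contained sketch of that distribution math. The function name and frontend labels are made up for this example; this is not the actual worker.go code, just the arithmetic described above.

```go
package main

import "fmt"

// distributeConcurrency spreads maxConcurrent across the known frontend
// addresses: every frontend gets maxConcurrent/n connections, and the first
// maxConcurrent%n frontends get one extra so the shares add up to maxConcurrent.
func distributeConcurrency(maxConcurrent int, frontends []string) map[string]int {
	out := make(map[string]int, len(frontends))
	n := len(frontends)
	if n == 0 {
		return out
	}
	base, extra := maxConcurrent/n, maxConcurrent%n
	for i, fe := range frontends {
		c := base
		if i < extra {
			c++
		}
		out[fe] = c
	}
	return out
}

func main() {
	// 8 connections over 3 frontends: prints map[fe1:3 fe2:3 fe3:2]
	fmt.Println(distributeConcurrency(8, []string{"fe1", "fe2", "fe3"}))
}
```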

Now we restart the query-frontends, and the DNS watch on the querier (worker.go) gets to work, adding and removing addresses.

  • During the rollout we will have 6 query-frontends, fe1 to fe6, because new pods are spun up first.
  • So you get into a state where fe1 has 2 connections to the querier, fe2 has 2, fe3 has 1, fe4 has 1, fe5 has 1, and fe6 has 1.
  • Then the old pods, fe1 to fe3, are spun down.
  • Because the AddressRemoved method does not call resetConcurrency() to recalculate the load distribution, we end up with fe4 having 1 connection to the querier, fe5 having 1, and fe6 having 1, which is just 3 connections instead of 8 (see the sketch after this list).
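
To make the effect concrete, here is a toy, self-contained Go sketch of the scenario above. The toyWorker type and its methods are hypothetical stand-ins for the real worker.go logic, not the actual Cortex implementation; the point is only to show what happens when address removal skips the redistribution step.

```go
package main

import (
	"fmt"
	"sort"
)

// toyWorker is an illustrative stand-in for the querier-side worker: it
// tracks how many connections the querier keeps open to each frontend.
type toyWorker struct {
	maxConcurrent int
	connections   map[string]int
}

// AddressAdded registers a new frontend and redistributes maxConcurrent,
// mirroring the behaviour described above for new addresses.
func (w *toyWorker) AddressAdded(addr string) {
	w.connections[addr] = 0
	w.resetConcurrency()
}

// AddressRemoved models the suspected bug: the frontend is dropped, but the
// remaining frontends keep their old share, so the total shrinks.
func (w *toyWorker) AddressRemoved(addr string) {
	delete(w.connections, addr)
	// Missing: w.resetConcurrency() -- this is the suspected fix.
}

// resetConcurrency gives every frontend maxConcurrent/n connections, with the
// first maxConcurrent%n frontends (in sorted order) getting one extra.
func (w *toyWorker) resetConcurrency() {
	addrs := make([]string, 0, len(w.connections))
	for a := range w.connections {
		addrs = append(addrs, a)
	}
	sort.Strings(addrs)
	base, extra := w.maxConcurrent/len(addrs), w.maxConcurrent%len(addrs)
	for i, a := range addrs {
		c := base
		if i < extra {
			c++
		}
		w.connections[a] = c
	}
}

func (w *toyWorker) total() int {
	t := 0
	for _, c := range w.connections {
		t += c
	}
	return t
}

func main() {
	w := &toyWorker{maxConcurrent: 8, connections: map[string]int{}}

	// Rollout: old (fe1-fe3) and new (fe4-fe6) frontends coexist.
	for _, fe := range []string{"fe1", "fe2", "fe3", "fe4", "fe5", "fe6"} {
		w.AddressAdded(fe)
	}
	fmt.Println("during rollout:", w.connections, "total =", w.total()) // total = 8

	// Old pods terminate; without a reset the total drops to 3.
	for _, fe := range []string{"fe1", "fe2", "fe3"} {
		w.AddressRemoved(fe)
	}
	fmt.Println("after removals:", w.connections, "total =", w.total()) // total = 3

	// What calling resetConcurrency() on removal would restore.
	w.resetConcurrency()
	fmt.Println("after reset:", w.connections, "total =", w.total()) // total = 8
}
```

With the reset added to the removal path, as the suspected fix would do, the three surviving frontends are bumped back to 3, 3, and 2 connections, restoring the configured total of 8.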

Below is a graph showing achievement of -querier.max-concurrent=8 during different phases.

[Grafana graph: -querier.max-concurrent=8 achievement across the restart phases]
