Cortex returns 5xx due to a single ingester outage #4381
Comments
Can you say a bit more about the 5xx error? The code you changed in #4388 is counting errors per-series, whereas I would expect a 5xx to be more of a whole-process error. |
This code counts the errors per series... If I get more errors than expected, it returns the error (globally). So let's say I have replication_set=3 and one series returns error X and then error Y. As Y was the second error (the one that breached the threshold), it is the one returned. So basically we return the error that breached the threshold. So, if you have one series that for some reason returned a 4xx error and afterwards a 5xx, the distributor will return the 5xx (assuming replication factor = 2). This change only counts the error by "family", so if one series gets a 4xx and then a 5xx, it will wait for the third ingester to decide whether it needs to return 5xx or 4xx. |
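For illustration, the behaviour described above works roughly like the sketch below (simplified, hypothetical names; not the actual DoBatch code): errors are counted per series, and the error that breaches the tolerated maximum is the one propagated for the whole request, whatever its class.

```go
package distributor

// maxFailures is the number of failed ingesters tolerated per series,
// e.g. 1 with a replication factor of 3.
const maxFailures = 1

type seriesTracker struct {
	failures int // errors seen so far for this series
}

// recordError returns a non-nil error once this series has failed too often.
// Note that it is the *breaching* error that gets propagated, so a 4xx
// followed by a 5xx makes the whole request fail with the 5xx.
func (s *seriesTracker) recordError(err error) error {
	s.failures++
	if s.failures > maxFailures {
		return err
	}
	return nil
}
```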
OK but how can you get a 5xx error for a single series? What's an example where this happens? |
That 5xx is saying the whole process is unavailable, not a single series. If you don't need to count 5xx's per-series, that might make the fix simpler. |
So I guess the proposal is to return immediately on the first 4xx error (so we don't need to keep track of 5xx and 4xx separately). The only thing to consider in this case is: we can have samples accepted by 2 ingesters and still return 4xx. Let's consider this scenario:
As discussed with @bboreham on Slack, in this case it may be ok to fail it back to the caller because (a) nothing changes - 2 and 3 still captured the data and (b) it's a bad state to be in - you have reduced redundancy - so it's ok to have alarms going off. Besides that, it will behave similarly to the case where 2 ingesters return 4xx and 1 returns 2xx, where we return 4xx to the caller and the data may still be available (but unstable) for querying: #731 What do the others think? |
Thinking some more about this, there can be a genuine mix of outcomes across the request. So now I'm thinking it should be a "latching" behaviour: for each series:
Across the whole request:
|
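One plausible reading of that latching idea, as a sketch only (the exact per-series and per-request rules from the comment above are not reproduced here): a worse outcome for a series replaces a better one and never goes back, and the worst latched outcome across all series decides the status returned for the request.

```go
package distributor

// outcome is a hypothetical "latched" result for one series: a worse outcome
// replaces a better one and never goes back.
type outcome int

const (
	outcomeOK  outcome = iota // 2xx
	outcome4xx                // client-side error (limits, bad data)
	outcome5xx                // server-side error (unhealthy ingester, timeout)
)

// latch keeps the worst outcome seen so far for a series.
func latch(current, next outcome) outcome {
	if next > current {
		return next
	}
	return current
}

// requestOutcome combines the per-series results across the whole request:
// the worst latched outcome decides the status returned to the caller.
func requestOutcome(series []outcome) outcome {
	worst := outcomeOK
	for _, o := range series {
		worst = latch(worst, o)
	}
	return worst
}
```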
It is sub-optimal that in the "mixed result" case the sender will retry everything. The remote write protocol could be enhanced to return which subset needs retrying, though I suspect this is more work than it deserves. |
Yeah... I'm just a little concerned about the change in behaviour here. Nowadays, if 1 ingester (for a given series) returns 4xx and the other 2 return 2xx, Cortex will return 2xx. With the proposed change, it will return 4xx, right? |
Yes. For all the cases I can imagine (e.g. limit is 1000 and one ingester has 1001 series while the other two have 999), it makes no practical difference. However I could be failing to imagine something important. |
@pracucci @tomwilkie Any thoughts about this? |
Following this example, I think the quorum is reached if the other 2 return 2xx, and so Cortex should return 2xx. |
@pracucci For sure, that's the case with both solutions... |
Just as information, these are the possible 4xx errors that can happen on the Push (ingesterV2):
|
Thanks Alan. For all of those cases, ingester A can accept the sample while ingester B rejects it, depending on whether A missed some earlier sends. I argue that it is not important whether we send 400 back to the user in these cases, when just one ingester rejects a sample:
Given that the exact state of each ingester depends on which previous sends have arrived, I don't think changing the behaviour to depend on which ingester responds first with 4xx is material. |
Makes sense! I just updated my PR with the proposed change. It's important to note that DoBatch is used by the Alertmanager distributor as well, but as far as I can tell this logic should be fine there too. |
In my opinion, we shouldn't return 4xx on the first 4xx we get from ingesters. If we get one 4xx and two 2xx then we should return 2xx (why not?). If we get one 4xx and two 5xx, what should we return? My take is that we should return 5xx, because the two ingesters returning 5xx could have potentially ingested it if they were healthy. The counter-argument I've heard is that the case I'm describing above cannot happen in practice. However, I'm thinking about the per-metric series limit. You may get 4xx because you've hit the per-metric series limit on a specific metric, but other ingesters (currently unhealthy because they returned 5xx) would have happily ingested the other metrics in the push request if the sender retries (but if we return 4xx then the sender will not retry). Am I missing anything? |
My argument is if one ingester is at the limit then the other two will be very close to the limit, because they will have received all the same series apart from random glitches. So the benefit from retrying such an operation is very marginal, but it complicates the code a lot. |
My argument is that this doesn't apply to the per-metric series limit. |
I'm not seeing this. Our resiliency story is that each series is captured 3 times; if you lose 1 then 2 will still respond. |
The scenario I'm thinking is:
We receive a push request with N series for different metrics. 1 of these metrics has already hit the per-metric limit (in all ingesters). The distributor pushes the request to the 3 ingesters and we get one 4xx (from the healthy ingester, because of the per-metric limit) and two 5xx (from the unhealthy ingesters).
What should the distributor return? IMO 5xx. Why? If it returns 4xx, the request will not be retried. It was successfully ingested in 1 of the ingesters, but that's not enough for our quorum. The request contains valid samples which would be accepted by the unhealthy ingesters once they become healthy. The metric over the limit will still be rejected by those ingesters once they become healthy, but the other metrics will be accepted. |
This must be in a brief period before the distributor decides it doesn't have enough healthy ingesters to write to and rejects everything with 500. I'm not convinced. I do think we should wait for 2 results of some kind.
For me it turns on how complicated it is to wait for the 3rd result. |
Well, we set |
Nowadays we can return 5xx or 4xx in this case. It is not deterministic.
I don't see how this is making the logic simpler... as in the first case (400 + timeout + timeout -> return 500) we will need to wait for the third result anyway. |
My thinking was that we don't have to count timeouts or other 5xx per-series, because they apply to an entire ingester. Per my other argument we also don't have to count 4xx per-series, because a single one is a good enough reason to fail. However I am now unclear whether the combination of these things comes out simpler. |
I looked at 08ef41b again. One problem with my argument above is that in Given that, my current thoughts:
Supposing we have counters
|
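Assuming the counters are roughly "succeeded", "failed with 4xx" and "failed with 5xx" per series (the counters discussed above are not listed here, so the names are guesses, not the ones in the PR), the decision could be sketched like this:

```go
package distributor

import "sync/atomic"

// itemTracker is an illustrative per-series tracker; field names are guesses.
type itemTracker struct {
	succeeded atomic.Int32
	failed4xx atomic.Int32
	failed5xx atomic.Int32
}

// decide returns the HTTP status class for one series, assuming replication
// factor 3 (quorum of 2) and that undecided cases wait for more responses.
func (it *itemTracker) decide() int {
	switch {
	case it.succeeded.Load() >= 2:
		return 200 // quorum of successful writes
	case it.failed4xx.Load() >= 2:
		return 400 // two ingesters agree the data is invalid or over limits
	case it.failed4xx.Load()+it.failed5xx.Load() >= 2:
		return 500 // mixed or server-side failures: let the sender retry
	default:
		return 0 // undecided: wait for the remaining ingester(s)
	}
}
```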
So add documentation in the DoBatch function? Or do you want to change the DoBatch callback to return a well-defined type? Something like:
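The snippet that originally followed is not preserved; purely as an illustration of what a well-defined callback result could look like (hypothetical type, not the actual ring API, whose DoBatch callback returns a plain error):

```go
package ring

// PushResult is a hypothetical, well-defined outcome a DoBatch callback could
// return instead of a bare error, so the caller doesn't have to inspect
// gRPC/HTTP status codes to classify failures.
type PushResult int

const (
	PushOK          PushResult = iota
	PushClientError            // 4xx-class: bad samples, limits exceeded
	PushServerError            // 5xx-class: unhealthy ingester, timeout
)
```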
We could add another counter for "Unknown" - this is the bucket that timeouts would fall into. But this would make the code even more complex with a third counter. I'm not sure I see the point, as timeouts are translated to 5xx anyway at the end of the day.
Fair.
Makes sense, I just kept what was being done before... |
Yes, a comment. |
Hi @bboreham, I tried to make the code a little cleaner in 2c3ccce but unfortunately I cannot get rid of the "remaining" field because of concurrency.
In other words, if I derive the total (or remaining) from the other counters, I can read an already-modified value from another goroutine, causing inconsistent results. I can make a commit without that and we can see that the tests fail. About the "rpcsFailed" counter, I don't really see why we need it, but it has been there forever... maybe I'm not seeing something, so I would prefer to keep it for now. (The error channel is used here, so it seems fine.) |
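A small sketch of that concurrency point (illustrative only, not the PR's code): deriving the remaining count from the other counters is racy, while decrementing a dedicated counter exactly once per response gives each goroutine a consistent view.

```go
package distributor

import "sync/atomic"

type tracker struct {
	succeeded atomic.Int32
	failed    atomic.Int32
	remaining atomic.Int32 // initialised to the number of expected responses
}

// remainingDerived is racy: succeeded and failed can each change between the
// two Loads, so the derived value may never have been true at any instant.
func (t *tracker) remainingDerived(total int32) int32 {
	return total - t.succeeded.Load() - t.failed.Load()
}

// responseArrived updates exactly one counter and decrements remaining once;
// the goroutine that sees remaining reach zero knows it handled the last reply.
func (t *tracker) responseArrived(ok bool) (last bool) {
	if ok {
		t.succeeded.Add(1)
	} else {
		t.failed.Add(1)
	}
	return t.remaining.Add(-1) == 0
}
```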
Hi @bboreham, other than the comment, are you OK with the PR? |
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions. |
Describe the bug
Cortex can return 5xx due to a single ingester failure when a tenant is being throttled (4xx). In this case, the distributor can return the error from the bad ingester (5xx) even though the other 2 returned 4xx. See this.
Looking at this code, it seems that if we have replication factor = 3, 1 ingester down and the other 2 returning 4xx, we can have for example:
4xx + 5xx + 4xx = 5xx
or
5xx + 4xx + 4xx = 4xx
etc
To Reproduce
Steps to reproduce the behavior:
I could create a unit test that reproduces the behavior:
alanprot@fd36d97
a4bf103
Write
Expected behavior
Cortex should return the error respecting the quorum of the responses from the ingesters.
So, if 2 ingesters return 4xx and one returns 5xx, Cortex should return 4xx. This means that if the distributor receives one 4xx and one 5xx, it needs to wait for the response from the third ingester.
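As a worked example of that rule (a sketch only; the thread does not settle every corner case, such as receiving one response of each class):

```go
package distributor

// expectedStatus returns the status class that at least 2 of the 3 ingesters
// agree on; when the first 4xx and 5xx disagree, the third response decides.
func expectedStatus(codes [3]int) int {
	count := map[int]int{}
	for _, c := range codes {
		count[c/100]++ // group by class: 2xx, 4xx, 5xx
	}
	for class, n := range count {
		if n >= 2 {
			return class * 100
		}
	}
	return 500 // no class reached quorum (e.g. 2xx+4xx+5xx): treat as retryable
}
```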
Environment:
Kubernetes
Helm
Storage Engine
Additional Context