
Cortex returns 5xx due to a single ingester outage #4381

Closed
alanprot opened this issue Jul 26, 2021 · 31 comments · Fixed by #4388
Labels
keepalive Skipped by stale bot

Comments

@alanprot
Member

alanprot commented Jul 26, 2021

Describe the bug
Cortex can return 5xx due to a single ingester failure when a tenant is being throttled (4xx). In this case, the distributor can return the error from the bad ingester (5xx) even though the other 2 returned 4xx. See this.

Looking at this code, it seems that if we have replication factor = 2, 1 ingester down and the other 2 returning 4xx, we can have, for example:

4xx + 5xx + 4xx = 5xx
or
5xx + 4xx + 4xx = 4xx
etc

To Reproduce
Steps to reproduce the behavior:
I could create a unit test that reproduces the behavior:
alanprot@fd36d97

  1. Start Cortex (SHA or version): a4bf103
  2. Perform Operations (Read/Write/Others): Write

Expected behavior
Cortex should return the error respecting the quorum of the responses from ingesters.
So, if 2 ingesters return 4xx and one returns 5xx, Cortex should return 4xx. This means that if the distributor receives one 4xx and one 5xx, it needs to wait for the response of the third ingester.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Helm

Storage Engine

  • Blocks
  • Chunks

Additional Context

@bboreham
Contributor

bboreham commented Aug 3, 2021

Can you say a bit more about the 5xx error? The code you changed in #4388 is counting errors per-series, whereas I would expect a 5xx to be more of a whole-process error.

@alanprot
Member Author

alanprot commented Aug 3, 2021

This code counts the errors per series... If it gets more errors than expected, it returns the error (globally).

So let's say I have replication_set = 3 and one series returns errors X and Y. As the Y error was the second one (the one that breached the threshold), it will be the one returned.

So basically we return the error that breached maxFailures on the first series that breached it.

So, if you have one series that for some reason returned a 4xx error and then a 5xx, the distributor will return a 5xx (assuming replication factor = 2).

This change counts the errors by "family", so if a series gets a 4xx from one ingester and a 5xx from another, the distributor will wait for the third ingester to decide whether it needs to return 5xx or 4xx.
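
For illustration, here is a rough sketch of the per-series bookkeeping described above, with hypothetical names (this is not the code in #4388). With replication factor 3 the series fails only once a single error family breaches maxFailures:

```go
package main

import "fmt"

const maxFailures = 1 // replication factor 3 -> tolerate at most 1 failed ingester

// seriesTracker holds hypothetical per-series counters, split by error family.
type seriesTracker struct {
	failed4xx, failed5xx int
	last4xx, last5xx     error
}

// record returns the error to surface for this series, or nil if we should
// keep waiting for another ingester before deciding. (A complete version also
// has to handle the case where neither family breaches maxFailures but the
// series still misses its success quorum.)
func (s *seriesTracker) record(err error, is4xx bool) error {
	if is4xx {
		s.failed4xx++
		s.last4xx = err
	} else {
		s.failed5xx++
		s.last5xx = err
	}
	// One 4xx plus one 5xx is not enough to decide: wait for the third
	// ingester to break the tie, then return the family that breached the limit.
	if s.failed4xx > maxFailures {
		return s.last4xx
	}
	if s.failed5xx > maxFailures {
		return s.last5xx
	}
	return nil
}

func main() {
	var s seriesTracker
	fmt.Println(s.record(fmt.Errorf("429 throttled"), true))      // <nil>: wait
	fmt.Println(s.record(fmt.Errorf("500 ingester down"), false)) // <nil>: still tied, wait for the 3rd response
	fmt.Println(s.record(fmt.Errorf("429 throttled"), true))      // 429 throttled: the 4xx family breached maxFailures
}
```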

@bboreham
Contributor

bboreham commented Aug 4, 2021

OK but how can you get a 5xx error for a single series? What's an example where this happens?

@bboreham
Contributor

bboreham commented Aug 4, 2021

That 5xx is saying the whole process is unavailable, not a single series.

If you don't need to count 5xx's per-series, that might make the fix simpler.

@alanprot
Member Author

alanprot commented Aug 4, 2021

So I guess the proposal is to return immediately on the first 4xx error (so we don't need to keep track of 5xx and 4xx separately).

The only thing to consider in this case is: we can have samples accepted by 2 ingesters and still return 4xx.

Let's consider this scenario:

  • ReplicationFactor = 3
  • ShardSize = 15
  • Ingester 7 is running hot for a given tenant (this can happen in normal operation, as the sharding does not guarantee a 100% even distribution across ingesters)
  • Push TimeSeries A that goes to ingester 1, 5 and 7. This request will return 4xx even though it got ingested by 2 ingesters

As discussed with @bboreham on Slack, in this case it may be ok to fail it back to the caller because (a) nothing changes - the other two ingesters still captured the data and (b) it's a bad state to be in - you have reduced redundancy - so it's ok to have alarms going off.

Besides that, it will behave similarly to the case where 2 ingesters return 4xx and 1 returns 2xx, where we return 4xx to the caller and the data may still be available (but unstable) for querying: #731

What do the others think?

@bboreham
Contributor

bboreham commented Aug 5, 2021

Thinking some more about this, there can be a genuine mix of outcomes across the request.
E.g. suppose you receive 100 samples for 100 series; 50 of them are errors that receive a 4xx outcome, and 50 go to ingesters that have crashed, so receive 5xx outcomes.
We should send a 5xx back to the caller so they can retry.

So now I'm thinking it should be a "latching" behaviour (sketched after the lists below): for each series:

  • a single 4xx is valid to return for that series
  • fewer than [quorum] successful results means a 5xx for that series

Across the whole request:

  • all series succeeded - return 2xx
  • one or more 5xx - return 5xx
  • one or more 4xx and no 5xx - return 4xx
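
A minimal sketch of the request-level aggregation just described (hypothetical types, not the actual DoBatch code; per-series outcomes are assumed to have already been resolved by the per-series rules above):

```go
package main

import "fmt"

// Outcome is the per-series result after applying the per-series rules above.
type Outcome int

const (
	Ok2xx Outcome = iota
	Err4xx
	Err5xx
)

// requestStatus folds per-series outcomes into one response class:
// any 5xx wins, otherwise any 4xx wins, otherwise the whole request succeeded.
func requestStatus(series []Outcome) Outcome {
	result := Ok2xx
	for _, o := range series {
		switch o {
		case Err5xx:
			return Err5xx // one or more 5xx - return 5xx
		case Err4xx:
			result = Err4xx // remember the 4xx, but keep looking for a 5xx
		}
	}
	return result
}

func main() {
	fmt.Println(requestStatus([]Outcome{Ok2xx, Err4xx, Ok2xx})) // prints 1 (Err4xx)
	fmt.Println(requestStatus([]Outcome{Err4xx, Err5xx}))       // prints 2 (Err5xx)
}
```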

@bboreham
Contributor

bboreham commented Aug 5, 2021

It is sub-optimal that in the "mixed result" case the sender will retry everything. The remote write protocol could be enhanced to return which subset needs retrying, though I suspect this is more work than it deserves.

@alanprot
Member Author

alanprot commented Aug 5, 2021

Yeah... I'm just a little concerned about the change in behaviour here.

Nowadays if 1 ingester (for a given series) returns 4xx and the other 2 return 2xx, Cortex will return 2xx.

With the proposed change, it will return 4xx, right?

@bboreham
Contributor

bboreham commented Aug 6, 2021

Nowadays if 1 ingester (for a given series) returns 4xx and the other 2 return 2xx, Cortex will return 2xx.
With the proposed change, it will return 4xx, right?

Yes. For all the cases I can imagine (e.g. limit is 1000 and one ingester has 1001 series while the other two have 999), it makes no practical difference.

However I could be failing to imagine something important.

@alanprot
Member Author

@pracucci @tomwilkie Any thoughts about this?

@pracucci
Contributor

Nowadays if 1 ingester (for a given series) returns 4xx and the other 2 return 2xx, Cortex will return 2xx.

Following this example, I think that the quorum is reached if the other 2 return 2xx, and so Cortex should return 2xx.

@alanprot
Member Author

@pracucci For sure, that's the case in both solutions...
but the question is, should we return 4xx on the first 4xx from ingesters or should we wait for a quorum of 4xx?

@alanprot
Member Author

Just for information, these are the possible 4xx errors that can happen on the Push (ingesterV2):

storage.ErrOutOfBounds
storage.ErrOutOfOrderSample
storage.ErrDuplicateSampleForTimestamp
errMaxSeriesPerUserLimitExceeded
errMaxSeriesPerMetricLimitExceeded
errExemplarRef

@bboreham
Contributor

Thanks Alan. For all of those cases, ingester A can accept the sample while ingester B rejects it, depending on whether A missed some earlier sends.

I argue that it is not important whether we send 400 back to the user in these cases, when just one ingester rejects a sample:

  • Nothing different happens in Cortex - the two ingesters will persist the data and return it for queries
  • The sender will output a log message. This is fine - the user must be close to some limit or problematic scenario.

Given that the exact state of each ingester depends on which previous sends have arrived, I don't think changing the behaviour to depend on which ingester responds first with 4xx is material.

@alanprot
Member Author

Makes sense!

I just updated my PR with the proposed change. It's important to note that DoBatch is used by the AlertManager distributor as well, but as far as I can tell this logic should be fine there too.

@pracucci
Contributor

pracucci commented Aug 26, 2021

but the question is, should we return 4xx on the first 4xx from ingesters or should we wait for a quorum of 4xx?

In my opinion, we shouldn't return 4xx on the first 4xx we get from ingesters. If we get one 4xx and two 2xx then we should return 2xx (why not?).

If we get one 4xx and two 5xx, what should we return? My take is that we should return 5xx, because the two ingesters returning 5xx could have potentially ingested it if they were healthy.

The counter argument I've heard is that the case I'm describing above cannot happen in practice. However, I'm thinking about the per-metric series limit. You may get 4xx because you've hit the per-metric series limit on a specific metric, but other ingesters (currently unhealthy, returning 5xx) would have happily ingested the other metrics in the push request if the sender retries (but if we return 4xx then the sender will not retry).

Am I missing anything?

@bboreham
Contributor

My argument is if one ingester is at the limit then the other two will be very close to the limit, because they will have received all the same series apart from random glitches. So the benefit from retrying such an operation is very marginal, but it complicates the code a lot.

@pracucci
Contributor

My argument is if one ingester is at the limit then the other two will be very close to the limit, because they will have received all the same series apart from random glitches. So the benefit from retrying such an operation is very marginal, but it complicates the code a lot.

My argument is that this doesn't apply to the per-metric series limit.

@bboreham
Contributor

I'm not seeing this.

Our resiliency story is that each series is captured 3 times; if you lose 1 then 2 will still respond.
If on a regular basis you reject from 1 ingester because it is over-limit then you don't have resiliency any more.
So alarms should be going off at least; we should not be saying "2 accepted it; all fine".

@pracucci
Contributor

pracucci commented Aug 27, 2021

The scenario I'm thinking is:

  • 3 ingesters in total
  • 2 ingesters are unhealthy

We receive a push request with N series for different metrics. 1 of these metrics already hit the per-metric limit (in all ingesters). The distributor pushes the request to the 3 ingesters and we get:

  • one 4xx (from the healthy one)
  • two 5xx (from the unhealthy ones)

What should the distributor return? IMO 5xx. Why? If it returns 4xx, the request will not be retried. It was successfully ingested in 1 of the ingesters but that's not enough for our quorum. The request contains valid samples which would be accepted by the unhealthy ingesters once they become healthy. The metric over the limit will be rejected by the unhealthy ingesters once they become healthy, but the other metrics will be accepted.

@bboreham
Contributor

This must be in a brief period before the distributor decides it doesn't have enough healthy ingesters to write to and rejects everything with 500.

I'm not convinced.

I do think we should wait for 2 results of some kind.

  • distributor sees 400 + timeout + timeout - return 500
  • distributor sees 400 + 200 - return 400 (don't care what the 3rd one returns)
  • distributor sees 400 + 500 - I say return 400, you say wait for the 3rd result and return 500 if it is 500 or timeout

For me it turns on how complicated it is to wait for the 3rd result.

@pracucci
Contributor

This must be in a brief period before the distributor decides it doesn't have enough healthy ingesters to write to and rejects everything with 500.

Well, we set -ring.heartbeat-timeout=10m. Also, it never happens if you enable the new experimental support for disabling heartbeats entirely.

@alanprot
Member Author

alanprot commented Aug 27, 2021

one 4xx (from the healthy one)
two 5xx (from the unhealthy ones)
What should the distributor return? IMO 5xx. Why? If it returns 4xx, the request will not be retried.

Nowadays we can return 5xx or 4xx in this case... it is not deterministic.

I say return 400, you say wait for the 3rd result and return 500 if it is 500 or timeout

I don't see how this is making the logic simpler... as in the first case (400 + timeout + timeout - return 500) we will need to wait for the third result anyway.

@bboreham
Contributor

My thinking was that we don't have to count timeouts or other 5xx per-series, because they apply to an entire ingester.

Per my other argument we also don't have to count 4xx per-series, because a single one is a good enough reason to fail.

However I am now unclear whether the combination of these things comes out simpler.

@bboreham
Contributor

I looked at 08ef41b again.

One problem with my argument above is that in DoBatch we don't have visibility of what's a timeout, what's an ingester failure, what's a 4xx error, etc.
You called status.FromError(), which is probably ok given current usage, but it would need to go into the description of DoBatch that error results from the callback must conform to such an interface.

Given that, my current thoughts:

  • Can we distinguish a solid return from a timeout / failed to talk ?
  • Two error counters don't seem too bad (e.g. failed4xx and failed5xx); I can't see why we would need a total if that is the sum of the others.
  • We don't need rpcsFailed; can do a select / err <- / default so we use the chan size to ensure we push max one error.

Supposing we have counters succeeded, failed4xx and failed5xx, we can do:

  • 4xx + 200 - wait for 3rd response and return 200 if 200, otherwise 4xx
  • 4xx + 5xx - wait for 3rd response and return 5xx if 5xx, otherwise 400
    • 4xx + 5xx + 4xx = 4xx
    • 5xx + 4xx + 4xx = 4xx
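
For illustration, here is a sketch of the kind of per-item counters and status.FromError classification discussed in this comment, assuming (as in the PR) that callback errors carry an HTTP-style status code in their gRPC status, as httpgrpc-style errors do. Names like failed4xx/failed5xx are the hypothetical counters mentioned above, not the exact PR code:

```go
package main

import (
	"errors"
	"fmt"

	"go.uber.org/atomic"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// itemTracker sketches the two per-item error counters mentioned above.
type itemTracker struct {
	failed4xx atomic.Int32
	failed5xx atomic.Int32
}

// recordError bumps the counter for the error's "family" and returns the new
// count so the caller can compare it against maxFailures. Errors that carry no
// HTTP-style status (plain timeouts, transport failures) land in the 5xx bucket.
func (i *itemTracker) recordError(err error) int32 {
	if s, ok := status.FromError(err); ok && s.Code()/100 == 4 {
		return i.failed4xx.Inc()
	}
	return i.failed5xx.Inc()
}

func main() {
	var t itemTracker
	// An error whose gRPC status code encodes an HTTP 400.
	fmt.Println(t.recordError(status.Error(codes.Code(400), "per-user series limit reached"))) // 1 (counted as 4xx)
	fmt.Println(t.recordError(errors.New("context deadline exceeded")))                        // 1 (counted as 5xx)
}
```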

@alanprot
Member Author

You called status.FromError(), which is probably ok given current usage, but it would need to go into the description of DoBatch that error results from the callback must conform to such an interface.

So add documentation to the DoBatch function? Or do you want to change the DoBatch callback to return a well-defined type? Something like:

callback func(InstanceDesc, []int) error TO callback func(InstanceDesc, []int) Status

Can we distinguish a solid return from a timeout / failed to talk ?

We could add another counter for "Unknown" - this is the bucket that timeouts would fall into. But this would make the code even more complex with a third counter. I'm not sure I see the point, as timeouts are translated to 5xx at the end of the day anyway.

Two error counters doesn't seem too bad (e.g. failed4xx and failed5xx); can't see why we would need a total if that is the sum of the others.

Fair.

We don't need rpcsFailed; can do a select / err <- / default so we use the chan size to ensure we push max one error.

Makes sense, I just kept what was being done before...
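
For reference, a minimal sketch of the select / err <- / default pattern quoted above, using a hypothetical buffered channel of size one:

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	// Buffered channel of size 1: it can hold at most one error.
	errCh := make(chan error, 1)

	push := func(err error) {
		// Non-blocking send: the first error is stored; any later error hits
		// the default branch and is dropped, so the channel capacity alone
		// guarantees we push at most one error.
		select {
		case errCh <- err:
		default:
		}
	}

	push(errors.New("first error"))
	push(errors.New("second error")) // dropped

	fmt.Println(<-errCh) // prints "first error"
}
```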

@bboreham
Contributor

So add a documentation in the DoBatch function?

Yes, a comment.

@alanprot
Member Author

alanprot commented Oct 4, 2021

Hi @bboreham

I tried to make the code a little cleaner in 2c3ccce, but unfortunately I cannot get rid of the "remaining" field because of concurrency.

can't see why we would need a total if that is the sum of the others.

In other words, if I derive the total (or remaining) from the other counters, I can read a value that has already been modified by another goroutine, causing an inconsistent result. I can make a commit without it and we can see that the tests fail.

About "rpcsFailed": I don't really see why we need it, but it has been there forever... maybe I'm not seeing something, so I would prefer to keep it for now. (The error channel is used here, so it seems fine.)
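
A small illustration of the race being described, with hypothetical counter names: decrementing a dedicated atomic remaining counter lets exactly one goroutine observe zero, whereas deriving it from the other counters can be observed inconsistently across goroutines:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	const rpcs = 3

	var succeeded, failed int32
	// remaining is decremented exactly once per finished RPC, so exactly one
	// goroutine observes it reach zero and emits the final result.
	remaining := int32(rpcs)

	var wg sync.WaitGroup
	for i := 0; i < rpcs; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if i == 0 {
				atomic.AddInt32(&failed, 1)
			} else {
				atomic.AddInt32(&succeeded, 1)
			}
			// Deriving remaining from the other counters here
			// (rpcs - succeeded - failed) would be racy: two goroutines could
			// both observe zero and both try to emit the final result, which
			// is why a dedicated counter is kept.
			if atomic.AddInt32(&remaining, -1) == 0 {
				fmt.Printf("done: %d ok, %d failed\n", atomic.LoadInt32(&succeeded), atomic.LoadInt32(&failed))
			}
		}(i)
	}
	wg.Wait()
}
```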

@alanprot
Member Author

Hi @bboreham.. Other than the comment, are you OK with the PR?
Thanks

@stale

stale bot commented Jan 22, 2022

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 22, 2022
@alvinlin123 alvinlin123 added keepalive Skipped by stale bot and removed stale labels Jan 22, 2022