
Cortex returns 5xx due to a single ingester outage #4381

Closed
alanprot opened this issue Jul 26, 2021 · 31 comments · Fixed by #4388
Labels
keepalive Skipped by stale bot

Comments

@alanprot
Member

alanprot commented Jul 26, 2021

Describe the bug
Cortex can return 5xx due to a single ingester failure when a tenant is being throttled (4xx). In this case, the distributor can return the error from the bad ingester (5xx) even though the other 2 returned 4xx. See this.

Looking at this code, it seems that if we have replication factor = 2, 1 ingester down and the other 2 returning 4xx, we can have, for example:

4xx + 5xx + 4xx = 5xx
or
5xx + 4xx + 4xx = 4xx
etc

To Reproduce
Steps to reproduce the behavior:
I could create a unit test that reproduces the behavior:
alanprot@fd36d97

  1. Start Cortex (SHA or version): a4bf103
  2. Perform Operations (Read/Write/Others): Write

Expected behavior
Cortex should return the error respecting the quorum of the responses from ingesters.
So, if 2 ingesters return 4xx and one returns 5xx, Cortex should return 4xx. This means that if the distributor receives one 4xx and one 5xx, it needs to wait for the response of the third ingester.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Helm

Storage Engine

  • Blocks
  • Chunks

Additional Context

@bboreham
Contributor

bboreham commented Aug 3, 2021

Can you say a bit more about the 5xx error? The code you changed in #4388 is counting errors per-series, whereas I would expect a 5xx to be more of a whole-process error.

@alanprot
Member Author

alanprot commented Aug 3, 2021

This code counts the errors per series... If it gets more errors than expected, it returns the error (globally).

So let's say I have replication_set = 3 and one series returns errors X and Y. As the Y error was the second one (the one that breached the threshold), it will be the one returned.

So basically we return the error that breached maxFailures on the first series that breached it.

So, if you have one series that for some reason returned a 4xx error and then a 5xx, the distributor will return a 5xx (assuming replication factor = 2).

This change counts the errors by "family", so if a series gets a 4xx from one ingester and a 5xx from another, the distributor will wait for the third ingester to decide whether it needs to return 5xx or 4xx.
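
For illustration, here is a rough sketch of the per-series bookkeeping described above, with hypothetical names (this is not the code in #4388). With replication factor 3 the series fails only once a single error family breaches maxFailures:

```go
package main

import "fmt"

const maxFailures = 1 // replication factor 3 -> tolerate at most 1 failed ingester

// seriesTracker holds hypothetical per-series counters, split by error family.
type seriesTracker struct {
	failed4xx, failed5xx int
	last4xx, last5xx     error
}

// record returns the error to surface for this series, or nil if we should
// keep waiting for another ingester before deciding. (A complete version also
// has to handle the case where neither family breaches maxFailures but the
// series still misses its success quorum.)
func (s *seriesTracker) record(err error, is4xx bool) error {
	if is4xx {
		s.failed4xx++
		s.last4xx = err
	} else {
		s.failed5xx++
		s.last5xx = err
	}
	// One 4xx plus one 5xx is not enough to decide: wait for the third
	// ingester to break the tie, then return the family that breached the limit.
	if s.failed4xx > maxFailures {
		return s.last4xx
	}
	if s.failed5xx > maxFailures {
		return s.last5xx
	}
	return nil
}

func main() {
	var s seriesTracker
	fmt.Println(s.record(fmt.Errorf("429 throttled"), true))      // <nil>: wait
	fmt.Println(s.record(fmt.Errorf("500 ingester down"), false)) // <nil>: still tied, wait for the 3rd response
	fmt.Println(s.record(fmt.Errorf("429 throttled"), true))      // 429 throttled: the 4xx family breached maxFailures
}
```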

@bboreham
Contributor

bboreham commented Aug 4, 2021

OK but how can you get a 5xx error for a single series? What's an example where this happens?

@bboreham
Contributor

bboreham commented Aug 4, 2021

That 5xx is saying the whole process is unavailable, not a single series.

If you don't need to count 5xx's per-series, that might make the fix simpler.

@alanprot
Member Author

alanprot commented Aug 4, 2021

So I guess the proposal is to return immediately on the first 4xx error (so we don't need to keep track of 5xx and 4xx separately).

The only thing to consider in this case is: we can have samples accepted by 2 ingesters and still return 4xx.

Let's consider this scenario:

  • ReplicationFactor = 3
  • ShardSize = 15
  • Ingester 7 is running hot for a given tenant (this can happen in normal operation, as the sharding does not guarantee a 100% even distribution across ingesters)
  • Push TimeSeries A that goes to ingester 1, 5 and 7. This request will return 4xx even though it got ingested by 2 ingesters

As discussed with @bboreham on Slack, in this case it may be ok to fail it back to the caller because (a) nothing changes - the other two ingesters still captured the data and (b) it's a bad state to be in - you have reduced redundancy - so it's ok to have alarms going off.

Besides that, it will behave similarly to the case where 2 ingesters return 4xx and 1 returns 2xx, where we return 4xx to the caller and the data may still be available (but unstable) for querying: #731

What do the others think?

@bboreham
Contributor

bboreham commented Aug 5, 2021

Thinking some more about this, there can be a genuine mix of outcomes across the request.
E.g. suppose you receive 100 samples for 100 series; 50 of them are errors that receive a 4xx outcome, and 50 go to ingesters that have crashed, so receive 5xx outcomes.
We should send a 5xx back to the caller so they can retry.

So now I'm thinking it should be a "latching" behaviour (sketched after the lists below): for each series:

  • a single 4xx is valid to return for that series
  • fewer than [quorum] successful results means a 5xx for that series

Across the whole request:

  • all series succeeded - return 2xx
  • one or more 5xx - return 5xx
  • one or more 4xx and no 5xx - return 4xx
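
A minimal sketch of the request-level aggregation just described (hypothetical types, not the actual DoBatch code; per-series outcomes are assumed to have already been resolved by the per-series rules above):

```go
package main

import "fmt"

// Outcome is the per-series result after applying the per-series rules above.
type Outcome int

const (
	Ok2xx Outcome = iota
	Err4xx
	Err5xx
)

// requestStatus folds per-series outcomes into one response class:
// any 5xx wins, otherwise any 4xx wins, otherwise the whole request succeeded.
func requestStatus(series []Outcome) Outcome {
	result := Ok2xx
	for _, o := range series {
		switch o {
		case Err5xx:
			return Err5xx // one or more 5xx - return 5xx
		case Err4xx:
			result = Err4xx // remember the 4xx, but keep looking for a 5xx
		}
	}
	return result
}

func main() {
	fmt.Println(requestStatus([]Outcome{Ok2xx, Err4xx, Ok2xx})) // prints 1 (Err4xx)
	fmt.Println(requestStatus([]Outcome{Err4xx, Err5xx}))       // prints 2 (Err5xx)
}
```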

@bboreham
Contributor

bboreham commented Aug 5, 2021

It is sub-optimal that in the "mixed result" case the sender will retry everything. The remote write protocol could be enhanced to return which subset needs retrying, though I suspect this is more work than it deserves.

@alanprot
Member Author

alanprot commented Aug 5, 2021

Yeah... I'm just a little concerned about the change in behaviour here.

Nowadays if 1 ingester (for a given series) returns 4xx and the other 2 return 2xx, Cortex will return 2xx.

With the proposed change, it will return 4xx, right?

@bboreham
Contributor

bboreham commented Aug 6, 2021

Nowadays if 1 ingester (for a given series) returns 4xx and the other 2 return 2xx, Cortex will return 2xx.
With the proposed change, it will return 4xx, right?

Yes. For all the cases I can imagine (e.g. limit is 1000 and one ingester has 1001 series while the other two have 999), it makes no practical difference.

However I could be failing to imagine something important.

@alanprot
Member Author

@pracucci @tomwilkie Any thoughts about this?

@pracucci
Contributor

Nowadays if 1 ingester (for a given series) returns 4xx and the other 2 return 2xx, Cortex will return 2xx.

Following this example, I think that the quorum is reached if the other 2 return 2xx, and so Cortex should return 2xx.

@alanprot
Member Author

@pracucci For sure, that's the case in both solutions...
but the question is, should we return 4xx on the first 4xx from ingesters or should we wait for a quorum of 4xx?

@alanprot
Member Author

Just for information, these are the possible 4xx errors that can happen on the Push (ingesterV2):

storage.ErrOutOfBounds
storage.ErrOutOfOrderSample
storage.ErrDuplicateSampleForTimestamp
errMaxSeriesPerUserLimitExceeded
errMaxSeriesPerMetricLimitExceeded
errExemplarRef

@bboreham
Contributor

Thanks Alan. For all of those cases, ingester A can accept the sample while ingester B rejects it, depending on whether A missed some earlier sends.

I argue that it is not important whether we send 400 back to the user in these cases, when just one ingester rejects a sample:

  • Nothing different happens in Cortex - the two ingesters will persist the data and return it for queries
  • The sender will output a log message. This is fine - the user must be close to some limit or problematic scenario.

Given that the exact state of each ingester depends on which previous sends have arrived, I don't think changing the behaviour to depend on which ingester responds first with 4xx is material.

@alanprot
Member Author

Makes sense!

I just updated my PR with the proposed change. It's important to note that DoBatch is used by the AlertManager distributor as well, but as far as I can tell this logic should be fine there too.

@pracucci
Contributor

pracucci commented Aug 26, 2021

but the question is, should we return 4xx on the first 4xx from ingesters or should we wait for a quorum of 4xx?

In my opinion, we shouldn't return 4xx on the first 4xx we get from ingesters. If we get one 4xx and two 2xx then we should return 2xx (why not?).

If we get one 4xx and two 5xx, what should we return? My take is that we should return 5xx, because the two ingesters returning 5xx could have potentially ingested it if they were healthy.

The counter argument I've heard is that the case I'm describing above cannot happen in practice. However, I'm thinking about the per-metric series limit. You may get 4xx because you've hit the per-metric series limit on a specific metric, but other ingesters (currently unhealthy, returning 5xx) would have happily ingested the other metrics in the push request if the sender retries (but if we return 4xx then the sender will not retry).

Am I missing anything?

@bboreham
Contributor

My argument is if one ingester is at the limit then the other two will be very close to the limit, because they will have received all the same series apart from random glitches. So the benefit from retrying such an operation is very marginal, but it complicates the code a lot.

@pracucci
Contributor

My argument is if one ingester is at the limit then the other two will be very close to the limit, because they will have received all the same series apart from random glitches. So the benefit from retrying such an operation is very marginal, but it complicates the code a lot.

My argument is that this doesn't apply to the per-metric series limit.

@bboreham
Contributor

I'm not seeing this.

Our resiliency story is that each series is captured 3 times; if you lose 1 then 2 will still respond.
If on a regular basis you reject from 1 ingester because it is over-limit then you don't have resiliency any more.
So alarms should be going off at least; we should not be saying "2 accepted it; all fine".

@pracucci
Contributor

pracucci commented Aug 27, 2021

The scenario I'm thinking is:

  • 3 ingesters in total
  • 2 ingesters are unhealthy

We receive a push request with N series for different metrics. 1 of these metrics already hit the per-metric limit (in all ingesters). The distributor pushes the request to the 3 ingesters and we get:

  • one 4xx (from the healthy one)
  • two 5xx (from the unhealthy ones)

What should the distributor return? IMO 5xx. Why? If it returns 4xx, the request will not be retried. It was successfully ingested in 1 of the ingesters but that's not enough for our quorum. The request contains valid samples which would be accepted by the unhealthy ingesters once they become healthy. The metric over the limit will be rejected by the unhealthy ingesters once they become healthy, but the other metrics will be accepted.

@bboreham
Contributor

This must be in a brief period before the distributor decides it doesn't have enough healthy ingesters to write to and rejects everything with 500.

I'm not convinced.

I do think we should wait for 2 results of some kind.

  • distributor sees 400 + timeout + timeout - return 500
  • distributor sees 400 + 200 - return 400 (don't care what the 3rd one returns)
  • distributor sees 400 + 500 - I say return 400, you say wait for the 3rd result and return 500 if it is 500 or timeout

For me it turns on how complicated it is to wait for the 3rd result.

@pracucci
Contributor

This must be in a brief period before the distributor decides it doesn't have enough healthy ingesters to write to and rejects everything with 500.

Well, we set -ring.heartbeat-timeout=10m. Also, it never happens if you enable the new experimental support for disabling heartbeats entirely.

@alanprot
Member Author

alanprot commented Aug 27, 2021

one 4xx (from the healthy one)
two 5xx (from the unhealthy ones)
What should the distributor return? IMO 5xx. Why? If it returns 4xx, the request will not be retried.

Nowadays we can return 5xx or 4xx in this case... it is not deterministic.

I say return 400, you say wait for the 3rd result and return 500 if it is 500 or timeout

I don't see how this is making the logic simpler... as in the first case (400 + timeout + timeout - return 500) we will need to wait for the third result anyway.

@bboreham
Contributor

My thinking was that we don't have to count timeouts or other 5xx per-series, because they apply to an entire ingester.

Per my other argument we also don't have to count 4xx per-series, because a single one is a good enough reason to fail.

However I am now unclear whether the combination of these things comes out simpler.

@bboreham
Contributor

I looked at 08ef41b again.

One problem with my argument above is that in DoBatch we don't have visibility of what's a timeout, what's an ingester failure, what's a 4xx error, etc.
You called status.FromError(), which is probably ok given current usage, but it would need to go into the description of DoBatch that error results from the callback must conform to such an interface.

Given that, my current thoughts:

  • Can we distinguish a solid return from a timeout / failed to talk ?
  • Two error counters don't seem too bad (e.g. failed4xx and failed5xx); I can't see why we would need a total if that is the sum of the others.
  • We don't need rpcsFailed; can do a select / err <- / default so we use the chan size to ensure we push max one error.

Supposing we have counters succeeded, failed4xx and failed5xx, we can do:

  • 4xx + 200 - wait for 3rd response and return 200 if 200, otherwise 4xx
  • 4xx + 5xx - wait for 3rd response and return 5xx if 5xx, otherwise 400
    • 4xx + 5xx + 4xx = 4xx
    • 5xx + 4xx + 4xx = 4xx
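
For illustration, here is a sketch of the kind of per-item counters and status.FromError classification discussed in this comment, assuming (as in the PR) that callback errors carry an HTTP-style status code in their gRPC status, as httpgrpc-style errors do. Names like failed4xx/failed5xx are the hypothetical counters mentioned above, not the exact PR code:

```go
package main

import (
	"errors"
	"fmt"

	"go.uber.org/atomic"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// itemTracker sketches the two per-item error counters mentioned above.
type itemTracker struct {
	failed4xx atomic.Int32
	failed5xx atomic.Int32
}

// recordError bumps the counter for the error's "family" and returns the new
// count so the caller can compare it against maxFailures. Errors that carry no
// HTTP-style status (plain timeouts, transport failures) land in the 5xx bucket.
func (i *itemTracker) recordError(err error) int32 {
	if s, ok := status.FromError(err); ok && s.Code()/100 == 4 {
		return i.failed4xx.Inc()
	}
	return i.failed5xx.Inc()
}

func main() {
	var t itemTracker
	// An error whose gRPC status code encodes an HTTP 400.
	fmt.Println(t.recordError(status.Error(codes.Code(400), "per-user series limit reached"))) // 1 (counted as 4xx)
	fmt.Println(t.recordError(errors.New("context deadline exceeded")))                        // 1 (counted as 5xx)
}
```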

@alanprot
Member Author

You called status.FromError(), which is probably ok given current usage, but it would need to go into the description of DoBatch that error results from the callback must conform to such an interface.

So add documentation to the DoBatch function? Or do you want to change the DoBatch callback to return a well-defined type? Something like:

callback func(InstanceDesc, []int) error TO callback func(InstanceDesc, []int) Status

Can we distinguish a solid return from a timeout / failed to talk ?

We could add another counter for "Unknown" - this is the bucket that timeouts would fall into. But this would make the code even more complex with a third counter. I'm not sure I see the point, as timeouts are translated to 5xx at the end of the day anyway.

Two error counters doesn't seem too bad (e.g. failed4xx and failed5xx); can't see why we would need a total if that is the sum of the others.

Fair.

We don't need rpcsFailed; can do a select / err <- / default so we use the chan size to ensure we push max one error.

Makes sense, I just kept what was being done before...
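
For reference, a minimal sketch of the select / err <- / default pattern quoted above, using a hypothetical buffered channel of size one:

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	// Buffered channel of size 1: it can hold at most one error.
	errCh := make(chan error, 1)

	push := func(err error) {
		// Non-blocking send: the first error is stored; any later error hits
		// the default branch and is dropped, so the channel capacity alone
		// guarantees we push at most one error.
		select {
		case errCh <- err:
		default:
		}
	}

	push(errors.New("first error"))
	push(errors.New("second error")) // dropped

	fmt.Println(<-errCh) // prints "first error"
}
```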

@bboreham
Contributor

So add a documentation in the DoBatch function?

Yes, a comment.

@alanprot
Member Author

alanprot commented Oct 4, 2021

Hi @bboreham

I tried to make the code a little cleaner in 2c3ccce, but unfortunately I cannot get rid of the "remaining" field because of concurrency.

can't see why we would need a total if that is the sum of the others.

In other words, if I derive the total (or remaining) from the other counters, I can read a value that has already been modified by another goroutine, causing an inconsistent result. I can make a commit without it and we can see that the tests fail.

About "rpcsFailed": I don't really see why we need it, but it has been there forever... maybe I'm not seeing something, so I would prefer to keep it for now. (The error channel is used here, so it seems fine.)
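
A small illustration of the race being described, with hypothetical counter names: decrementing a dedicated atomic remaining counter lets exactly one goroutine observe zero, whereas deriving it from the other counters can be observed inconsistently across goroutines:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	const rpcs = 3

	var succeeded, failed int32
	// remaining is decremented exactly once per finished RPC, so exactly one
	// goroutine observes it reach zero and emits the final result.
	remaining := int32(rpcs)

	var wg sync.WaitGroup
	for i := 0; i < rpcs; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if i == 0 {
				atomic.AddInt32(&failed, 1)
			} else {
				atomic.AddInt32(&succeeded, 1)
			}
			// Deriving remaining from the other counters here
			// (rpcs - succeeded - failed) would be racy: two goroutines could
			// both observe zero and both try to emit the final result, which
			// is why a dedicated counter is kept.
			if atomic.AddInt32(&remaining, -1) == 0 {
				fmt.Printf("done: %d ok, %d failed\n", atomic.LoadInt32(&succeeded), atomic.LoadInt32(&failed))
			}
		}(i)
	}
	wg.Wait()
}
```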

@alanprot
Member Author

Hi @bboreham.. Other than the comment, are you OK with the PR?
Thanks

@stale

stale bot commented Jan 22, 2022

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 22, 2022
@alvinlin123 alvinlin123 added keepalive Skipped by stale bot and removed stale labels Jan 22, 2022