After N-1 ingester crashes, query results are unstable #731
Comments
This should be fixed by #732, correct?
Different problem - #732 is about dropping samples on writes, while this is about reads from ingesters that have crashed and restarted.
To improve the user experience, we could get the remaining ingesters to flush those chunks for which they hold the only remaining copy. So, if I'm ingester X, and I know ingesters A and B have restarted, I iterate over all series in memory, run the distributor hash, and flush every series that maps to (A, B, X). How do I find out that A and B have restarted? That could be written as metadata into the ring, or we could just have a human type it in via an admin endpoint. Slight wrinkle: when using all-labels sharding we will get the wrong hash for series with a blank label value, since we discard that information on entry to the ingester.
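A rough sketch of that approach in Go (all of the types, fields, and helpers below are hypothetical stand-ins, not the actual Cortex API): iterate over the series held in memory, look up each series' replica set in the ring, and flush any series whose other replicas have all restarted.

```go
// Sketch only, under assumed names; none of this is the real Cortex code.
type Series interface {
	Labels() map[string]string
}

type Ring interface {
	// Get returns the addresses of the n ingesters owning this token.
	Get(token uint32, n int) []string
}

type Ingester struct {
	addr   string
	ring   Ring
	series []Series // all series currently held in memory
}

// flushSoleCopies flushes every series for which every other replica has
// restarted (and so has lost its in-memory copy), leaving us as the only holder.
func (i *Ingester) flushSoleCopies(restarted map[string]bool, replicationFactor int,
	hash func(labels map[string]string) uint32, flush func(Series)) {
	for _, s := range i.series {
		// Run the same hash the distributor uses to find the replica set.
		replicas := i.ring.Get(hash(s.Labels()), replicationFactor)

		soleCopy := true
		for _, addr := range replicas {
			if addr != i.addr && !restarted[addr] {
				soleCopy = false // another healthy replica still holds the data
				break
			}
		}
		if soleCopy {
			flush(s) // we hold the only remaining copy; write it to the store
		}
	}
}
```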
How about: we have ingesters remember the time their data starts, and hand that time over on a transfer.
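A minimal sketch of that idea (the type and methods below are made up, not the real transfer code): the ingester tracks the earliest time from which its data is complete, hands that time over on transfer, and a querier could use it to tell whether an answer covers the full query range.

```go
import "time"

// Hypothetical sketch, not the real hand-over protocol.
type IngesterState struct {
	DataStart time.Time // earliest time from which this ingester's data is complete
}

// ReceiveTransfer is called on the ingester taking over: it adopts the leaving
// ingester's DataStart rather than resetting it to "now".
func (s *IngesterState) ReceiveTransfer(from IngesterState) {
	s.DataStart = from.DataStart
}

// Covers reports whether this ingester can answer a query starting at `from`
// without silently missing older samples.
func (s IngesterState) Covers(from time.Time) bool {
	return !s.DataStart.After(from)
}
```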
I think we can get into the same bad situation if you add a new ingester and then one old ingester restarts - there are now two places to get an incomplete answer. And my suggestion at #731 (comment) helps there too.
Fixed by #1103
If you have N ingesters and two of them go down and come back, different queries can ignore a different ingester (we only wait for two replies), and hence the results are unstable.
It's not a trivial fix; it can only be fixed by rebalancing data once the ingesters are back up.
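A toy illustration of why the results flip around (not Cortex code; `Replica` and `Sample` are made up): with replication factor 3 and a read quorum of 2, the query returns as soon as any two replicas answer, so if two replicas restarted and lost their in-memory data, the result depends on which two happen to reply first.

```go
type Sample struct {
	TimestampMs int64
	Value       float64
}

type Replica interface {
	Query() []Sample
}

// queryWithQuorum fans a query out to all replicas and returns once `quorum`
// of them have replied; the slowest replies are simply ignored.
func queryWithQuorum(replicas []Replica, quorum int) []Sample {
	results := make(chan []Sample, len(replicas))
	for _, r := range replicas {
		go func(r Replica) { results <- r.Query() }(r)
	}

	var merged []Sample
	for i := 0; i < quorum; i++ {
		merged = append(merged, <-results...)
	}
	// With quorum=2 and RF=3, the one replica that still holds the older data
	// may be the reply we ignore, so repeated queries can disagree.
	return merged
}
```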
@gouthamve Was this issue fixed only for the blocks store?
The WAL can be used for chunks, but I think Goutham's subsequent comment is saying it isn't fixed.
Even the WAL can't prevent this 100% of the time, because the WAL can only recover the samples a given ingester has received. While an ingester is offline or otherwise not receiving data, nothing is being written to its WAL, even though its replication partners may still be receiving data.
We replicate to three ingesters, but after losing two of them in an Unfortunate Incident, queries would sometimes contain results from just the two restarted instances, which didn't have recent history.
It rights itself once those chunks age out of the remaining ingester into the store, but it's not great in the meantime.