Read from all ingesters, introduce 2 rings, fix deregistration #43
Conversation
This will break query correctness.
@tomwilkie Damn, good point. Options:
Any other ideas?
Force-pushed from 43418ed to 349af6a
Force-pushed from 349af6a to 884fa9f
@jml and I talked for a long time today about the general lifecycle problem and concluded that there's no feasible way (that we could think of) to make queries work with a consistent hashing approach, as one would need to keep around an arbitrary number of old ring states to find all possible ingesters that may still have data for a given query (and then query all of them). We think the correct long-term solution would be to run a separate in-memory indexing service which would tell us which time series resides in which ingester, but building that (especially in a horizontally scalable way) will be a longer-term effort.

For the near term, we've decided that querying all ingesters and merging the results is the least-bad stop-gap. It comes with obvious downsides: if one ingester is broken/unreachable, the entire query fails. Also, queries will be somewhat slower (although ingesters which have no results for a given query will return an empty result really fast). So this approach will get worse and worse the more ingesters we have (in both reliability and speed).

Since we're now always querying all ingesters, that does give us some benefits though:
I pushed that change into this PR.
Why? Why not just ignore it?
I think there might be a bug.
// Allow some time for the last appends to this ingester to
// complete after unregistering (looking up an ingester and appending
// to it is not atomic).
time.Sleep(100 * time.Millisecond)
Flag maybe?
Will do.
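A minimal sketch of what that flag could look like, assuming the standard Go flag package; the flag name and helper function are hypothetical, not from this PR:

package main

import (
	"flag"
	"time"
)

// shutdownGracePeriod would replace the hard-coded 100ms sleep above.
var shutdownGracePeriod = flag.Duration(
	"ingester.shutdown-grace-period",
	100*time.Millisecond,
	"How long to wait after deregistering before shutting down, so in-flight appends can complete.",
)

// waitForInflightAppends is a hypothetical helper showing where the flag
// value would be used instead of the literal sleep.
func waitForInflightAppends() {
	time.Sleep(*shutdownGracePeriod)
}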
fpToSampleStream := map[model.Fingerprint]*model.SampleStream{}

// Fetch samples from all ingesters and group them by fingerprint (unsorted
// and with overlap).
I had assumed you were going to do this in parallel. Any reason why not?
We can, but it's more complexity - it's probably not going to be really noticeable at current ingester scale. Reevaluate later?
Later is fine. (Although I'll note that moving some of this to a separate function would perhaps reduce complexity).
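For later reference, a rough sketch of what the parallel fan-out could look like; the IngesterClient interface and function names below are simplified stand-ins, not this repo's actual querier API:

package main

import (
	"context"
	"sync"

	"github.com/prometheus/common/model"
)

// IngesterClient is a simplified stand-in for the querier's per-ingester
// client; the real interface also takes label matchers.
type IngesterClient interface {
	Query(ctx context.Context, from, through model.Time) (model.Matrix, error)
}

// queryAllIngesters fans the query out to every ingester concurrently and
// merges the returned sample streams by fingerprint. As in the sequential
// version, any single ingester error fails the whole query, and values within
// a merged stream remain unsorted and possibly overlapping.
func queryAllIngesters(ctx context.Context, clients []IngesterClient, from, through model.Time) (map[model.Fingerprint]*model.SampleStream, error) {
	var (
		mtx    sync.Mutex
		wg     sync.WaitGroup
		errs   = make(chan error, len(clients))
		fpToSS = map[model.Fingerprint]*model.SampleStream{}
	)
	for _, client := range clients {
		wg.Add(1)
		go func(client IngesterClient) {
			defer wg.Done()
			matrix, err := client.Query(ctx, from, through)
			if err != nil {
				errs <- err
				return
			}
			mtx.Lock()
			defer mtx.Unlock()
			for _, ss := range matrix {
				fp := ss.Metric.Fingerprint()
				if existing, ok := fpToSS[fp]; ok {
					existing.Values = append(existing.Values, ss.Values...)
				} else {
					fpToSS[fp] = ss
				}
			}
		}(client)
	}
	wg.Wait()
	close(errs)
	if err := <-errs; err != nil { // nil when no ingester reported an error
		return nil, err
	}
	return fpToSS, nil
}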
return err
matrix := make(model.Matrix, 0, len(fpToSampleStream))
for _, ss := range fpToSampleStream {
	matrix = append(matrix, ss)
I don't follow. How does matrix get returned to the user?
Thanks, this was meant to append to the result defined at the top. Pushing fix.
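The fix presumably ends up looking roughly like the following; result and its type are assumptions based on the discussion, not code visible in this thread:

package main

import "github.com/prometheus/common/model"

// appendToResult sketches the fix described above: the merged sample streams
// are turned into a matrix and appended to the result accumulated by the
// caller, instead of being built into a local matrix that is never returned.
func appendToResult(result model.Matrix, fpToSampleStream map[model.Fingerprint]*model.SampleStream) model.Matrix {
	matrix := make(model.Matrix, 0, len(fpToSampleStream))
	for _, ss := range fpToSampleStream {
		matrix = append(matrix, ss)
	}
	return append(result, matrix...)
}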
Because it's better to fail outright than to lie.
@rade If we fail to query one ingester, we don't know if the query is incomplete or not, and silently returning incomplete results seems worse than failing.
Force-pushed from 884fa9f to 17e6dd1
How does the querier recover from such an error?
It doesn't and it can't, so it has to fail the entire query to the user. That's the problem with the current architecture. We don't know in which ingester the data we want to query lives (at least not after the ring has changed a couple of times), so we either need to query all of them or have an architecturally better way of finding out which ingesters we really need (that separate index service).
Force-pushed from 17e6dd1 to 8415383
@jml As discussed, removed the sleep between deregistering and actually shutting down for now, for simplicity.
I don't get it. The querier has to get the list of all ingesters from somewhere. That list can always be out of date by the time the querier performs the query - there may be fewer or more ingesters in existence at the time. If query correctness depends on querying all ingesters then either case should result in a query failure. But it seems like only the "fewer" case would cause an error in the proposed design.
@rade With an index service, this could be avoided in the orderly shutdown case: when an ingester shuts down, it:
In the case of a crash, of course, you're screwed for a short while between the crash of the ingester and the time when the index service removes the ingester from its index (due to lack of heartbeats or similar). To solve that correctly without returning errors or incorrect query results to the user, you need to spread ingested data redundantly over multiple ingesters and then try each of the redundant ingesters in turn, without failing if only one of them is down.
I was not questioning the eventual fix, but the supposed temporary fix here. In particular, I am questioning why "an ingester got removed" should be treated as an error when "an ingester got added" is not.
@rade Ah right, sorry. But a completely new ingester would realistically not have a chance to accumulate any significant data between the first samples being sent to it and the first queries hitting it. The race condition there would only be a couple of milliseconds (and we could even register for queries first and only a second later for appends, if this was a problem), so from the user's point of view it just looks like the samples haven't quite made it to storage yet, which is fine. On the other hand, if an ingester with hours' worth of data goes away and we can't query its data anymore, we have a big problem and lots of (potentially all) missing data.
Aren't we redundantly sending data to multiple ingesters, precisely to counter that? Also...
So this query will fail forever?
Not yet.
No. The user can retry, and then the result might be successful.
LGTM. (Sorry for the delay in pushing the button.)
It will never succeed after an ingester has failed though, will it?
Depends on the failure.
Retrying might result in success depending on the kind of error. We could do this automatically but that often leads to cascading failures. |
Yes of course. My question was specifically about an actual ingester failure, i.e. if a query failed because an ingester failed catastrophically (i.e. without removing itself from consul), then that query will never succeed on retry, at least not until some external intervention cleans up consul - true or false? And since with this PR every query is sent to all ingesters, the catastrophic failure of a single ingester breaks all queries until external intervention.
which is why I didn't suggest that ;)
We spoke today with @tomwilkie, who mentioned that the original plan was to have a field in the ring for tracking token (ingester) state, which was meant to go in https://github.com/weaveworks/prism/blob/master/ring/model.go#L33. This contrasts with the two-ring approach. The other element is that replication is the solution to our problems. I'm going to start making some notes on that in the ingester lifecycle design doc.
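For illustration, a single ring with a per-entry state field might look something like the sketch below; the type and constant names are hypothetical, not taken from ring/model.go:

package main

import "time"

// IngesterState is a hypothetical per-entry state that would let one ring
// serve both writes and reads, instead of maintaining two rings.
type IngesterState int

const (
	Active  IngesterState = iota // accepts appends and is queried
	Leaving                      // no longer accepts appends, but is still queried
)

// IngesterDesc is an illustrative ring entry carrying that state.
type IngesterDesc struct {
	Hostname  string
	Timestamp time.Time // last heartbeat
	State     IngesterState
	Tokens    []uint32
}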
Superseded by #49?
Yes, thanks!
Probably not related to the consul issues we're seeing, but an ingester
should unregister itself before shutting down, because it rejects new
samples during shutdown.
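A sketch of the intended ordering, with illustrative types and method names rather than this repo's actual API:

package main

import (
	"fmt"
	"log"
)

// Ring is an illustrative stand-in for the consul-backed ring client.
type Ring interface {
	Deregister(hostname string) error
}

// Ingester is an illustrative stand-in for the real ingester type.
type Ingester struct {
	ring     Ring
	hostname string
}

// Shutdown deregisters from the ring first, so distributors stop routing new
// samples here, and only then flushes and exits, because appends are rejected
// once shutdown has begun.
func (i *Ingester) Shutdown() error {
	if err := i.ring.Deregister(i.hostname); err != nil {
		return fmt.Errorf("deregistering from ring: %v", err)
	}
	i.flushAllSeries()
	return nil
}

// flushAllSeries stands in for persisting in-memory chunks before exit.
func (i *Ingester) flushAllSeries() {
	log.Println("flushing in-memory series to the chunk store")
}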