Read from all ingesters, introduce 2 rings, fix deregistration #43
Conversation
This will break query correctness.
@tomwilkie Damn, good point. Options:
Any other ideas?
Force-pushed from 43418ed to 349af6a
Force-pushed from 349af6a to 884fa9f
@jml and I talked for a long time today about the general lifecycle problem and concluded that there's no feasible way (that we could think of) to make queries work with a consistent hashing approach, as one would need to keep around an arbitrary number of old ring states to find all possible ingesters that may still have data for a given query (and then query all of them). We think the correct long-term solution would be to run a separate in-memory indexing service which would tell us which time series resides in which ingester, but building that (especially in a horizontally scalable way) will be a longer-term effort.

For the near term, we've decided that querying all ingesters and merging the results is the least-bad stop-gap. It comes with obvious downsides: if one ingester is broken/unreachable, the entire query fails. Also, queries will be somewhat slower (although ingesters which have no results for a given query will return an empty result really fast). So this approach will get worse and worse the more ingesters we have (in both reliability and speed).

Since we're now always querying all ingesters, that does give us some benefits though:
I pushed that change into this PR.
Why? Why not just ignore it?
I think there might be a bug.
// Allow some time for the last appends to this ingester to
// complete after unregistering (looking up an ingester and appending
// to it is not atomic).
time.Sleep(100 * time.Millisecond)
Flag maybe?
Will do.
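A minimal sketch of what that flag could look like, assuming the standard Go flag package; the flag name and helper function are hypothetical, not from this PR:

package main

import (
	"flag"
	"time"
)

// shutdownGracePeriod would replace the hard-coded 100ms sleep above.
var shutdownGracePeriod = flag.Duration(
	"ingester.shutdown-grace-period",
	100*time.Millisecond,
	"How long to wait after deregistering before shutting down, so in-flight appends can complete.",
)

// waitForInflightAppends is a hypothetical helper showing where the flag
// value would be used instead of the literal sleep.
func waitForInflightAppends() {
	time.Sleep(*shutdownGracePeriod)
}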
fpToSampleStream := map[model.Fingerprint]*model.SampleStream{}

// Fetch samples from all ingesters and group them by fingerprint (unsorted
// and with overlap).
I had assumed you were going to do this in parallel. Any reason why not?
We can, but it's more complexity - it's probably not going to be really noticeable at current ingester scale. Reevaluate later?
Later is fine. (Although I'll note that moving some of this to a separate function would perhaps reduce complexity).
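For later reference, a rough sketch of what the parallel fan-out could look like; the IngesterClient interface and function names below are simplified stand-ins, not this repo's actual querier API:

package main

import (
	"context"
	"sync"

	"github.com/prometheus/common/model"
)

// IngesterClient is a simplified stand-in for the querier's per-ingester
// client; the real interface also takes label matchers.
type IngesterClient interface {
	Query(ctx context.Context, from, through model.Time) (model.Matrix, error)
}

// queryAllIngesters fans the query out to every ingester concurrently and
// merges the returned sample streams by fingerprint. As in the sequential
// version, any single ingester error fails the whole query, and values within
// a merged stream remain unsorted and possibly overlapping.
func queryAllIngesters(ctx context.Context, clients []IngesterClient, from, through model.Time) (map[model.Fingerprint]*model.SampleStream, error) {
	var (
		mtx    sync.Mutex
		wg     sync.WaitGroup
		errs   = make(chan error, len(clients))
		fpToSS = map[model.Fingerprint]*model.SampleStream{}
	)
	for _, client := range clients {
		wg.Add(1)
		go func(client IngesterClient) {
			defer wg.Done()
			matrix, err := client.Query(ctx, from, through)
			if err != nil {
				errs <- err
				return
			}
			mtx.Lock()
			defer mtx.Unlock()
			for _, ss := range matrix {
				fp := ss.Metric.Fingerprint()
				if existing, ok := fpToSS[fp]; ok {
					existing.Values = append(existing.Values, ss.Values...)
				} else {
					fpToSS[fp] = ss
				}
			}
		}(client)
	}
	wg.Wait()
	close(errs)
	if err := <-errs; err != nil { // nil when no ingester reported an error
		return nil, err
	}
	return fpToSS, nil
}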
return err
matrix := make(model.Matrix, 0, len(fpToSampleStream))
for _, ss := range fpToSampleStream {
	matrix = append(matrix, ss)
I don't follow. How does matrix get returned to the user?
Thanks, this was meant to append to the result defined at the top. Pushing fix.
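The fix presumably ends up looking roughly like the following; result and its type are assumptions based on the discussion, not code visible in this thread:

package main

import "github.com/prometheus/common/model"

// appendToResult sketches the fix described above: the merged sample streams
// are turned into a matrix and appended to the result accumulated by the
// caller, instead of being built into a local matrix that is never returned.
func appendToResult(result model.Matrix, fpToSampleStream map[model.Fingerprint]*model.SampleStream) model.Matrix {
	matrix := make(model.Matrix, 0, len(fpToSampleStream))
	for _, ss := range fpToSampleStream {
		matrix = append(matrix, ss)
	}
	return append(result, matrix...)
}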
Because it's better to fail outright than to lie.
@rade If we fail to query one ingester, we don't know if the query is incomplete or not, and silently returning incomplete results seems worse than failing.
Force-pushed from 884fa9f to 17e6dd1
How does the querier recover from such an error?
It doesn't and it can't, so it has to fail the entire query to the user. That's the problem with the current architecture. We don't know in which ingester the data we want to query lives (at least not after the ring has changed a couple of times), so we either need to query all of them or have an architecturally better way of finding out which ingesters we really need (that separate index service).
Force-pushed from 17e6dd1 to 8415383
@jml As discussed, removed the sleep between deregistering and actually shutting down for now, for simplicity.
I don't get it. The querier has to get the list of all ingesters from somewhere. That list can always be out of date by the time the querier performs the query - there may be fewer or more ingesters in existence at the time. If query correctness depends on querying all ingesters then either case should result in a query failure. But it seems like only the "fewer" case would cause an error in the proposed design.
@rade With an index service, this could be avoided in the orderly shutdown case: when an ingester shuts down, it:
In the case of a crash, of course, you're screwed for a short while between the crash of the ingester and the time when the index service removes the ingester from its index (due to lack of heartbeats or similar). To solve that correctly without returning errors or incorrect query results to the user, you need to spread ingested data redundantly over multiple ingesters and then try each of the redundant ingesters in turn, without failing if only one of them is down.
I was not questioning the eventual fix, but the supposed temporary fix here. In particular, I am questioning why "an ingester got removed" should be treated as an error when "an ingester got added" is not.
@rade Ah right, sorry. But a completely new ingester would realistically not have a chance to accumulate any significant data between the first samples being sent to it and the first queries hitting it. The race condition there would only be a couple of milliseconds (and we could even register for queries first and only a second later for appends, if this was a problem), so from the user's point of view it just looks like the samples haven't quite made it to storage yet, which is fine. On the other hand, if an ingester with hours' worth of data goes away and we can't query its data anymore, we have a big problem and lots of (potentially all) missing data.
Aren't we redundantly sending data to multiple ingesters, precisely to counter that? Also...
So this query will fail forever?
Not yet.
No. The user can retry, and then the result might be successful.
LGTM. (Sorry for the delay in pushing the button.)
It will never succeed after an ingester has failed though, will it?
Depends on the failure.
Retrying might result in success depending on the kind of error. We could do this automatically but that often leads to cascading failures. |
Yes of course. My question was specifically about an actual ingester failure, i.e. if a query failed because an ingester failed catastrophically (i.e. without removing itself from consul), then that query will never succeed on retry, at least not until some external intervention cleans up consul - true or false? And since with this PR every query is sent to all ingesters, the catastrophic failure of a single ingester breaks all queries until external intervention.
which is why I didn't suggest that ;)
We spoke today with @tomwilkie, who mentioned that the original plan was to have a field in the ring for tracking token (ingester) state, which was meant to go in https://github.com/weaveworks/prism/blob/master/ring/model.go#L33. This contrasts with the two-ring approach. The other element is that replication is the solution to our problems. I'm going to start making some notes on that in the ingester lifecycle design doc.
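For illustration, a single ring with a per-entry state field might look something like the sketch below; the type and constant names are hypothetical, not taken from ring/model.go:

package main

import "time"

// IngesterState is a hypothetical per-entry state that would let one ring
// serve both writes and reads, instead of maintaining two rings.
type IngesterState int

const (
	Active  IngesterState = iota // accepts appends and is queried
	Leaving                      // no longer accepts appends, but is still queried
)

// IngesterDesc is an illustrative ring entry carrying that state.
type IngesterDesc struct {
	Hostname  string
	Timestamp time.Time // last heartbeat
	State     IngesterState
	Tokens    []uint32
}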
Superseded by #49?
Yes, thanks!
Probably not related to the consul issues we're seeing, but an ingester
should unregister itself before shutting down, because it rejects new
samples during shutdown.
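A sketch of the intended ordering, with illustrative types and method names rather than this repo's actual API:

package main

import (
	"fmt"
	"log"
)

// Ring is an illustrative stand-in for the consul-backed ring client.
type Ring interface {
	Deregister(hostname string) error
}

// Ingester is an illustrative stand-in for the real ingester type.
type Ingester struct {
	ring     Ring
	hostname string
}

// Shutdown deregisters from the ring first, so distributors stop routing new
// samples here, and only then flushes and exits, because appends are rejected
// once shutdown has begun.
func (i *Ingester) Shutdown() error {
	if err := i.ring.Deregister(i.hostname); err != nil {
		return fmt.Errorf("deregistering from ring: %v", err)
	}
	i.flushAllSeries()
	return nil
}

// flushAllSeries stands in for persisting in-memory chunks before exit.
func (i *Ingester) flushAllSeries() {
	log.Println("flushing in-memory series to the chunk store")
}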