
Incrementally transfer chunks per token to improve handover #1764


Closed

Conversation

rfratto
Contributor

@rfratto rfratto commented Oct 28, 2019

Design document: https://docs.google.com/document/d/1y2TdfEQ9ZKh6CpBVB4o6BYjCr-plNRL9jGD6fJ9bMW0/edit#

This PR introduces two incremental chunk transfer processes, used by the lifecycler when joining and leaving the ring, to reduce spillover and enable dynamic scaling of ingesters. The incremental transfer process takes precedence over the old handover mechanism.

To migrate a cluster to use incremental transfers, two rollouts must be done:

  1. Roll out ingesters with -ingester.leave-incremental-transfer=true
  2. Roll out ingesters with -ingester.join-incremental-transfer=true
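For reference, here is a minimal sketch of how these two options map onto the lifecycler configuration. The field and YAML names are taken from the commit description further down; the trimmed-down struct is only illustrative, not the actual Cortex type.

```go
package configsketch

// Illustrative only: a trimmed-down view of the two new lifecycler options.
// The real fields are added to Cortex's ring.LifecyclerConfig by this PR.
type LifecyclerConfig struct {
	// Step 2 of the rollout: -ingester.join-incremental-transfer (YAML: join_incremental_transfer)
	JoinIncrementalTransfer bool `yaml:"join_incremental_transfer"`

	// Step 1 of the rollout: -ingester.leave-incremental-transfer (YAML: leave_incremental_transfer)
	LeaveIncrementalTransfer bool `yaml:"leave_incremental_transfer"`
}
```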

I recognize this is a large PR and I have attempted (to the best of my ability) to split it into smaller, independent commits. It's not perfect, but hopefully the commits I have make it easier to review.

Fixes #1277.

/cc @gouthamve @pstibrany @tomwilkie

@rfratto rfratto force-pushed the incremental-chunk-transfers branch from 3db8e47 to 0737ed0 on October 28, 2019 19:17
@pstibrany
Contributor

There are multiple test failures and reported race conditions. Would be nice to fix those.

@rfratto rfratto force-pushed the incremental-chunk-transfers branch 7 times, most recently from f67be16 to b7b9ee6 on October 30, 2019 13:58
Contributor

@pstibrany pstibrany left a comment


Comments after reviewing first commit (thanks for splitting your work into logical steps!)

@rfratto rfratto force-pushed the incremental-chunk-transfers branch from 5370565 to 03b4dc9 on October 31, 2019 13:40
Contributor

@pstibrany pstibrany left a comment


Another round of comments. I still need to better understand and review the real meat of the PR (like the entire incremental_transfer.go).

My initial impression is that this is a very complex piece of code that will be hard to find and fix bugs in :-(

@rfratto rfratto force-pushed the incremental-chunk-transfers branch from 875afd7 to cbc9007 on October 31, 2019 21:46
@pstibrany pstibrany mentioned this pull request Nov 1, 2019
Contributor

@pstibrany pstibrany left a comment


Another pass, this time mostly around TokenChecker.

@rfratto rfratto force-pushed the incremental-chunk-transfers branch from ce1e47e to a364ceb on November 1, 2019 17:39
@rfratto
Contributor Author

rfratto commented Nov 1, 2019

I've addressed most of the review feedback so far. I want to rebase against latest and fix the merge conflicts before I continue addressing feedback. This may take a little bit of time; both the TSDB blocks work and the gossip work are going to change bits and pieces of the current implementation.

@rfratto rfratto force-pushed the incremental-chunk-transfers branch 3 times, most recently from d9009f2 to dd5ac36 on November 4, 2019 17:47
Contributor

@pstibrany pstibrany left a comment


Yet another round of comments.

The main issue I have is with the requirement that we need to move all the replicated data around, not just the primary data for a token. This is further complicated by adjacent tokens belonging to the same ingester, dealing with unhealthy ingesters, and token states.

@rfratto
Contributor Author

rfratto commented Nov 5, 2019

The main issue I have is with the requirement that we need to move all the replicated data around, not just the primary data for a token. This is further complicated by adjacent tokens belonging to the same ingester, dealing with unhealthy ingesters, and token states.

Adjacent tokens belonging to the same ingester are, I find, the easier half of the problem. The main complexity in dealing with the ring comes when tokens belonging to the same ingester are near, but not next to, each other. For example, with a ring A1 B A2 C, we have to keep the tokens unmerged to know which subset of ranges A will be handling. Dealing with tokens that are next to each other is handled implicitly by dealing with tokens that are near each other.

Another issue with simplifying how we deal with the ring is the risk of introducing minor differences between what the distributor does and what the ingesters do when moving data around. If this were to happen, then incrementally joining and leaving would stop working properly: spillover may be introduced if the ingester doesn't request data from the proper ingester, and chunks may go untransferred and be flushed before their capacity is reached.

I don't think we can loosen the requirement of moving all the replicated data around. If we didn't do this, every time an ingester leaves, we would lose 1/replicationFactor of our data. If you did a complete rollout, you would lose all replicas outside of the original owner. This would subtly break a lot of things, including querying, which uses the quorum to return correct results.
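To make the A1 B A2 C example above concrete, here is a self-contained sketch (illustrative only, not the actual Cortex ring code) of why A's two ranges have to stay separate:

```go
package main

import (
	"fmt"
	"sort"
)

// tokenDesc is a simplified ring entry: a token value plus the ingester that registered it.
type tokenDesc struct {
	Token    uint32
	Ingester string
}

// rangesFor returns the token ranges (start exclusive, end inclusive) an ingester is
// responsible for, using the usual consistent-hashing rule that a token owns everything
// after the previous token on the ring, up to and including itself.
func rangesFor(ring []tokenDesc, ingester string) [][2]uint32 {
	sort.Slice(ring, func(i, j int) bool { return ring[i].Token < ring[j].Token })

	var out [][2]uint32
	for i, t := range ring {
		if t.Ingester != ingester {
			continue
		}
		prev := ring[(i+len(ring)-1)%len(ring)].Token // previous token, wrapping around
		out = append(out, [2]uint32{prev, t.Token})
	}
	return out
}

func main() {
	// The "A1 B A2 C" ring from the comment above: A owns two tokens that are
	// near, but not adjacent to, each other.
	ring := []tokenDesc{
		{Token: 10, Ingester: "A"}, // A1
		{Token: 20, Ingester: "B"},
		{Token: 30, Ingester: "A"}, // A2
		{Token: 40, Ingester: "C"},
	}
	// Prints [[40 10] [20 30]]: A handles (40, 10] (wrapping) and (20, 30].
	// Merging A's tokens would hide the fact that (10, 20] belongs to B.
	fmt.Println(rangesFor(ring, "A"))
}
```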

@pstibrany
Contributor

I don't think we can loosen the requirement of moving all the replicated data around. If we didn't do this, every time an ingester leaves, we would lose 1/replicationFactor of our data. If you did a complete rollout, you would lose all replicas outside of the original owner. This would subtly break a lot of things, including querying, which uses the quorum to return correct results.

Can you elaborate on how we would lose data? In the complete rollout scenario, we can use the same mechanism as we do today. Perhaps the solution could be to warn admin about not leaving too many ingesters at once?

@rfratto
Contributor Author

rfratto commented Nov 5, 2019

Can you elaborate on how we would lose data? In the complete rollout scenario, we can use the same mechanism as we do today. Perhaps the solution could be to warn admin about not leaving too many ingesters at once?

We would lose data because each ingester holds data for replicationFactor total ingesters, including itself. My fallback is to flush anything that didn't get transferred, but we generally have queriers configured to only query the store for stuff that isn't in memory anymore. That means there will be some period where an ingester gets a query for some data that it should be a replica of, but it never received that data during the transfer.

@pstibrany
Contributor

pstibrany commented Nov 5, 2019 via email

@rfratto
Contributor Author

rfratto commented Nov 5, 2019

My understanding is that queriers ask all ingesters, so any ingester with data will reply.

I may be slightly wrong, but I think queriers ask all ingesters and stop once they receive responses from a quorum number of those ingesters. If none of the quorum had any data, then the query results would show no data. Again, unsure, but I believe this is how it works.
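As a rough illustration of the concern (hypothetical types, not the actual Cortex querier code): with quorum-style fan-out, replicas that never received the transferred data can satisfy the quorum with empty results.

```go
package querysketch

// Minimal stand-ins so the sketch compiles; the real types live in Cortex.
type Query struct{ Matchers string }
type Series struct{ Labels string }

type Ingester interface {
	Query(q Query) []Series
}

// queryWithQuorum fans a query out to every ingester but returns as soon as a
// quorum of them have responded. If that quorum happens to consist of replicas
// that never received the series (e.g. because it was not transferred when an
// ingester left), the merged result is empty even though another ingester
// still holds the data.
func queryWithQuorum(ingesters []Ingester, quorum int, q Query) []Series {
	results := make(chan []Series, len(ingesters))
	for _, ing := range ingesters {
		go func(ing Ingester) { results <- ing.Query(q) }(ing)
	}

	var merged []Series
	for i := 0; i < quorum; i++ { // stop after quorum responses; ignore the rest
		merged = append(merged, <-results...)
	}
	return merged
}
```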

@pstibrany
Contributor

pstibrany commented Nov 5, 2019 via email

}

// Target ingester may not have had any streams to send.
if seriesReceived == 0 {
Contributor


This if seriesReceived == 0 check seems unnecessary as you're returning nil regardless.

// when ranges should be unblocked: we should continue to reject writes for as long as we may
// receive them. When the joining token has been completely inserted into the ring, it will
// be safe to remove the blocks.
i.BlockRanges(req.Ranges)
Contributor


shouldn't the block occur before copying user states in the unlikely case there's a new series appended during i.userStates.cp()?

Contributor Author

@rfratto rfratto Jan 10, 2020


This shouldn't happen because the new token isn't in the ring yet, but it's better to be safe here; I'll move the block before the copy.

}

// UnblockRanges manually removes blocks for the provided ranges.
func (i *Ingester) UnblockRanges(ctx context.Context, in *client.UnblockRangesRequest) (*client.UnblockRangesResponse, error) {
Contributor


should probably be _ context.Context since the arg is unused


// TransferChunksSubset accepts chunks from a client and moves them into the local Ingester.
func (i *Ingester) TransferChunksSubset(stream client.Ingester_TransferChunksSubsetServer) error {
i.userStatesMtx.Lock()
Contributor


This blocks ingestion https://github.com/cortexproject/cortex/blob/master/pkg/ingester/ingester.go#L329-L335

Can we not take a lock during the entire transfer?

Contributor Author


My understanding is that the lock is on the joining ingester so there wouldn't be any data to ingest anyway, right?

return err
}

userStatesCopy := i.userStates.cp()
Contributor


Yeah, there is a race condition: a State that is empty gets removed on userStates.gc(), but we use the State here and add data to it. That way the data is lost.

@@ -654,3 +760,22 @@ func (i *Ingester) ReadinessHandler(w http.ResponseWriter, r *http.Request) {
http.Error(w, "Not ready: "+err.Error(), http.StatusServiceUnavailable)
}
}

func (i *Ingester) unexpectedStreamsHandler(tokens []uint32) {
Contributor


If we are taking a function as an argument just to log the bad tokens, let's not do it? It makes things harder to read, as I need to figure out which unexpectedStreamsHandler function was passed and what it does.

Contributor Author


would passing the unexpected series metric to the token checker be acceptable instead? I don't want to remove the logging and metric completely.

if err != nil {
state = nil // don't want to unlock the fp if there is an error
return err
}

if sstate == seriesCreated && i.cfg.CheckOnCreate {
Contributor


TODO(gouthamve): Check if this entire check and metric can be skipped.

@@ -283,15 +351,15 @@ func (i *Ingester) Push(ctx old_ctx.Context, req *client.WriteRequest) (*client.

for _, ts := range req.Timeseries {
for _, s := range ts.Samples {
err := i.append(ctx, userID, ts.Labels, model.Time(s.TimestampMs), model.SampleValue(s.Value), req.Source)
err := i.append(ctx, userID, ts.Token, ts.Labels, model.Time(s.TimestampMs), model.SampleValue(s.Value), req.Source)
Contributor


I'm curious what will happen if there is a change to the hashing scheme and the tokens change?

Contributor Author


My understanding is that one of two things will happen:

  1. The tokens change, sharding to new ingesters. The old ingesters don't receive appends anymore and their data eventually gets flushed as underutilized chunks.
  2. The tokens change, sharding to the same ingesters (coincidentally). The ingesters update the token in the existing memory series and log a warning.
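A rough sketch of outcome 2 follows; the memorySeries/checkToken names and the logging wiring are assumptions, but the behavior (warn and adopt the new token) follows the description here and in the commit message further down.

```go
package ingestersketch

import (
	"github.com/go-kit/kit/log"
	"github.com/go-kit/kit/log/level"
)

// memorySeries is a hypothetical, trimmed-down stand-in for the ingester's
// in-memory series, which remembers the shard token it was created with.
type memorySeries struct {
	token uint32
}

// checkToken handles an append whose token differs from the one the series was
// created with (e.g. after a hashing-scheme change that coincidentally shards
// to the same ingester): log a warning and adopt the new token.
func (s *memorySeries) checkToken(logger log.Logger, newToken uint32) {
	if s.token == newToken {
		return
	}
	level.Warn(logger).Log(
		"msg", "series appended with a different token than it was created with",
		"old_token", s.token,
		"new_token", newToken,
	)
	s.token = newToken // keep using the token distributors now send for this series
}
```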

// is increased for the key. Dead ingesters will be filtered later by
// replication_strategy.go. Filtering later means that we can calculate
// a healthiness quorum.
if !ingester.IsHealthyState(op) {
Contributor


This is not the exact logic we had before. For a Read operation, we previously considered a PENDING ingester not valid, while now IsHealthyState() considers a PENDING one OK. However, don't change the logic in IsHealthyState() too easily, because I've tried to do the same and received valuable feedback here.

Contributor Author


Right, I'll change this back to its original logic but add in another check for ingester.Incremental, allowing every non-PENDING state to be valid.

This check was also initially in IsHealthyState but got lost in a rebase. I'll have to add it back in on Monday.

@bboreham
Contributor

bboreham commented Feb 3, 2020

Does this fix #467 and #775?

@rfratto
Contributor Author

rfratto commented Feb 3, 2020

Does this fix #467 and #775?

I'm not sure about #775, but it should fix #467.

@rfratto rfratto force-pushed the incremental-chunk-transfers branch from d41e927 to 2df440d on February 3, 2020 18:30
@rfratto rfratto force-pushed the incremental-chunk-transfers branch 2 times, most recently from 2711ba9 to 73817fd on February 11, 2020 14:08
This commit introduces several changes which together comprise a new
"incremental chunk transfer" feature:

- pkg/ring: incremental transfer management

  Introduces managing incremental transfers between ingesters
  when a lifecycler joins a ring and when it leaves the ring. The
  implementation of the IncrementalTransferer interface will be done in
  a future commit.

  The LifecyclerConfig has been updated with JoinIncrementalTransfer and
  LeaveIncrementalTransfer, available as join_incremental_transfer and
  leave_incremental_transfer using the YAML config, and
  join-incremental-transfer and leave-incremental-transfer using command
  line flags.

  When JoinIncrementalTransfer is used, the lifecycler will join the
  ring immediately. Tokens will be inserted into the ring one by one,
  first in the JOINING state and then in the ACTIVE state, after
  requesting chunks for the token ranges they should have data for
  from neighboring ingesters in the ring.

  When LeaveIncrementalTransfer is used, the lifecycler will
  incrementally move tokens into the LEAVING state after sending their
  ranges to the neighboring ingesters that should now have the data.
  Enabling LeaveIncrementalTransfer disables the handoff process;
  flushing non-transferred data always happens at the end.

- pkg/distributor: push shard token to ingesters

  This modifies the ingesters to be aware of the shard token used by
  the distributors to send traffic to ingesters. This is a requirement
  for incremental transfers, where the shard token is used to determine
  which memory series need to be moved.

  This assumes that all distributors are using the same sharding
  mechanism and always use the same token for a specific series. If the
  memory series is appended to with a different token from the one it
  was created with, a warning will be logged and the new token will be
  used.

- pkg/ingester: implement IncrementalTransferer interface

  This implements the IncrementalTransferer interface used by
  lifecyclers to move memory series around the ring as ingesters join
  and leave.

- pkg/ring: add TokenChecker

  This introduces a TokenChecker component which runs in the
  background to support reporting metrics on unexpected tokens pushed to
  ingesters. It supports checking on an interval, checking when a new
  stream is pushed, checking when an existing stream is appended to,
  and checking when a stream is transferred.
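As a hedged illustration of what such a TokenChecker could look like (names, signatures, and metric wiring here are assumptions, not the PR's actual API):

```go
package tokencheckersketch

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// TokenChecker is a hypothetical sketch based only on the description above:
// it periodically verifies that the tokens of in-memory streams fall inside
// the ranges this ingester is expected to own, and counts the ones that do not.
type TokenChecker struct {
	interval       time.Duration
	expectedRanges func() [][2]uint32 // ranges this ingester currently owns
	streamTokens   func() []uint32    // tokens of streams currently held in memory
	unexpected     prometheus.Counter // metric reporting unexpected tokens
}

// run performs the interval-based checking; the push/append/transfer hooks the
// commit message mentions could call CheckAll (or a per-token variant) directly.
func (c *TokenChecker) run(ctx context.Context) {
	t := time.NewTicker(c.interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			c.CheckAll()
		}
	}
}

// CheckAll counts every in-memory stream whose token is outside the expected ranges.
func (c *TokenChecker) CheckAll() {
	ranges := c.expectedRanges()
	for _, tok := range c.streamTokens() {
		if !inAnyRange(tok, ranges) {
			c.unexpected.Inc()
		}
	}
}

// inAnyRange treats each range as (start, end], wrapping around zero when start >= end.
func inAnyRange(tok uint32, ranges [][2]uint32) bool {
	for _, r := range ranges {
		if (r[0] < r[1] && tok > r[0] && tok <= r[1]) ||
			(r[0] >= r[1] && (tok > r[0] || tok <= r[1])) {
			return true
		}
	}
	return false
}
```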

Signed-off-by: Robert Fratto <[email protected]>
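Going back to the "pkg/ring: incremental transfer management" section of the commit message above, here is a pseudocode-level sketch of the join and leave flows it describes; all names (incrementalTransferer, ringOps, and the helpers) are assumptions for illustration, not the PR's actual interfaces.

```go
package lifecyclersketch

import "context"

// Token states as described above; values are illustrative.
type tokenState int

const (
	JOINING tokenState = iota
	ACTIVE
	LEAVING
)

// incrementalTransferer stands in for the IncrementalTransferer interface the
// commit message mentions: something that can pull or push chunks for a token range.
type incrementalTransferer interface {
	RequestChunksRange(ctx context.Context, from, to uint32) error
	SendChunksRange(ctx context.Context, from, to uint32) error
}

// ringOps bundles the ring operations the sketch needs.
type ringOps struct {
	insertToken   func(tok uint32, st tokenState)
	setTokenState func(tok uint32, st tokenState)
	rangeFor      func(tok uint32) (from, to uint32)
}

// joinIncrementally inserts tokens one by one: JOINING first, then ACTIVE once
// the chunks for the newly claimed range have been requested from the
// neighboring ingester that currently owns it.
func joinIncrementally(ctx context.Context, tokens []uint32, ring ringOps, xfer incrementalTransferer) error {
	for _, tok := range tokens {
		ring.insertToken(tok, JOINING)
		from, to := ring.rangeFor(tok)
		if err := xfer.RequestChunksRange(ctx, from, to); err != nil {
			return err
		}
		ring.setTokenState(tok, ACTIVE)
	}
	return nil
}

// leaveIncrementally is the mirror image: send each range to the ingester that
// should own it next, mark the token LEAVING, and flush whatever could not be
// transferred at the very end.
func leaveIncrementally(ctx context.Context, tokens []uint32, ring ringOps, xfer incrementalTransferer, flushRemaining func()) error {
	for _, tok := range tokens {
		from, to := ring.rangeFor(tok)
		if err := xfer.SendChunksRange(ctx, from, to); err != nil {
			return err
		}
		ring.setTokenState(tok, LEAVING)
	}
	flushRemaining()
	return nil
}
```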
When rolling out code that uses incremental transfers, the warning
messages can be annoying. This commit reduces them to debug-level log
lines, or logs the warning only if the token checking flag is enabled.

Signed-off-by: Robert Fratto <[email protected]>
This was a problem caused by the refactoring; moving the check inside
the previous if statement (which only happens once per transfer) is
equivalent to what was happening before.

Signed-off-by: Robert Fratto <[email protected]>
@rfratto rfratto force-pushed the incremental-chunk-transfers branch from 55a8a51 to 80249b4 on February 19, 2020 20:37
@rfratto
Contributor Author

rfratto commented Mar 5, 2020

Unfortunately, I think I need to close this PR and take another shot at this. I've rebased it to fix conflicts so many times now that I've lost confidence in its correctness, and I understand that its size has been a pain point for everyone all around.

Rather than one giant PR, I'm going to start work on the first of a smaller set of PRs to eventually build up to this feature. Hopefully starting fresh will allow me to reduce the code complexity introduced here, although I still intend to be copying and pasting at least some of the existing code.

Huge thank you to everyone who took time tackling this beast, and I hope the second attempt is a smoother experience for everyone 🙂

@rfratto rfratto closed this Mar 5, 2020