
Conversation

@tomwilkie
Contributor

Fixes #1184

Signed-off-by: Tom Wilkie [email protected]

@tomwilkie
Contributor Author

Doesn't seem to fix it:

level=info ts=2019-01-12T09:47:01.33001897Z caller=transfer.go:197 msg="sending chunks" to_ingester=10.52.27.27:9095
INFO: 2019/01/12 09:47:01 parsed scheme: ""
INFO: 2019/01/12 09:47:01 scheme "" not registered, fallback to default scheme
INFO: 2019/01/12 09:47:01 ccResolverWrapper: sending new addresses to cc: [{10.52.27.27:9095 0  <nil>}]
INFO: 2019/01/12 09:47:01 balancerWrapper: got update addr from Notify: [{10.52.27.27:9095 <nil>}]
level=error ts=2019-01-12T09:47:01.330539903Z caller=lifecycler.go:450 msg="Failed to transfer chunks to another ingester" err="rpc error: code = Unavailable desc = there is no connection available"

@tomwilkie
Contributor Author

Scratch that - I was never actually checking the err. That's why late-night Friday coding is not a good idea.

@tomwilkie
Contributor Author

Yeah works nicely with this:

level=info ts=2019-01-12T10:47:36.740986791Z caller=gokit.go:36 msg="=== received SIGINT/SIGTERM ===\n*** exiting"
level=info ts=2019-01-12T10:47:36.741103067Z caller=lifecycler.go:441 msg="changing ingester state from" old_state=ACTIVE new_state=LEAVING
INFO: 2019/01/12 10:48:06 parsed scheme: ""
INFO: 2019/01/12 10:48:06 scheme "" not registered, fallback to default scheme
level=info ts=2019-01-12T10:48:06.767099053Z caller=transfer.go:198 msg="sending chunks" to_ingester=10.52.9.218:9095
INFO: 2019/01/12 10:48:06 parsed scheme: ""
INFO: 2019/01/12 10:48:06 scheme "" not registered, fallback to default scheme
INFO: 2019/01/12 10:48:06 ccResolverWrapper: sending new addresses to cc: [{10.52.9.218:9095 0  <nil>}]
INFO: 2019/01/12 10:48:06 ClientConn switching balancer to "pick_first"
INFO: 2019/01/12 10:48:06 pickfirstBalancer: HandleSubConnStateChange: 0xc4b05ba990, CONNECTING
INFO: 2019/01/12 10:48:06 pickfirstBalancer: HandleSubConnStateChange: 0xc4b05ba990, READY
level=info ts=2019-01-12T10:48:16.499741057Z caller=transfer.go:252 msg="successfully sent chunks" to_ingester=10.52.9.218:9095
level=info ts=2019-01-12T10:48:16.523740024Z caller=lifecycler.go:339 msg="ingester removed from consul"
level=info ts=2019-01-12T10:48:16.523836343Z caller=lifecycler.go:263 msg="Ingester.loop() exited gracefully"

@tomwilkie tomwilkie changed the title [WIP] When searching for an ingester to send chunks to, ensure we can actually connect to it. When searching for an ingester to send chunks to, ensure we can actually connect to it. Jan 12, 2019
@tomwilkie tomwilkie requested a review from gouthamve January 12, 2019 12:06
@bboreham
Contributor

Seems like this could create a future TOCTTOU issue. Why not look for another ingester if we get a failure to transfer?

@tomwilkie
Contributor Author

Seems like this could create a future TOCTTOU issue. Why not look for another ingester if we get a failure to transfer?

That issue already exists, right? This change slightly mitigates it, but I agree it is not perfect. We currently fall back to flushing, and I'd prefer to implement the "proper" migration ideas than drastically change the current approach, which I think is a bit of a dead end.

@bboreham
Contributor

bboreham commented Feb 5, 2019

That issue already exists, right?

Not the one I meant. Currently we pick an ingester based on the list in Consul and try to transfer, and if that fails we're done. There is no 'check' so no time-of-check issue.

You think it's way more complicated to put a loop round the pick+transfer set?

@tomwilkie
Contributor Author

Ah okay I see, sorry. I'll see what a retry loop looks like.
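(For context, the retry loop being discussed could look roughly like this. This is a minimal sketch, not the PR's actual code: pickPendingIngester and transferChunksTo are hypothetical stand-ins for the ring lookup and the gRPC TransferChunks call, and here the transfer is made to fail twice just to exercise the retry path.)

```go
package main

import (
	"errors"
	"fmt"
)

var attempts int

// pickPendingIngester stands in for looking up a PENDING ingester in the ring.
func pickPendingIngester() (string, error) {
	return "10.52.9.218:9095", nil
}

// transferChunksTo stands in for the TransferChunks RPC; it fails twice
// before succeeding, to exercise the retry path.
func transferChunksTo(addr string) error {
	attempts++
	if attempts < 3 {
		return errors.New("rpc error: code = Unavailable")
	}
	return nil
}

// transferOut wraps pick+transfer in a retry loop, so a stale ring entry
// (the TOCTTOU concern above) only costs one failed attempt.
func transferOut(maxRetries int) error {
	var lastErr error
	for i := 0; i < maxRetries; i++ {
		addr, err := pickPendingIngester()
		if err != nil {
			lastErr = err
			continue
		}
		if err := transferChunksTo(addr); err != nil {
			lastErr = err
			continue // re-pick a (possibly different) ingester and try again
		}
		return nil
	}
	return fmt.Errorf("failed after %d retries: %w", maxRetries, lastErr)
}

func main() {
	fmt.Println(transferOut(10)) // <nil> once a transfer succeeds
}
```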

@tomwilkie
Contributor Author

@bboreham PTAL

@gouthamve
Contributor

LGTM from me!

@bboreham
Contributor

left a comment:

Couple of thoughts.

if time.Now().Before(deadline) {
	time.Sleep(i.cfg.SearchPendingFor / pendingSearchIterations)
	select {
	case <-ticker.C:
@bboreham
Contributor

If we have an outer loop doing retries, do we still want this inner loop?

@tomwilkie
Contributor Author

I've also deprecated one of the flags that was used here (ingester.search-pending-for), which is replaced by ingester.max-transfer-retries. I made the max backoff 5s, so with 10 retries this should approximate the default 10s search time from the previous flag.

@tomwilkie tomwilkie force-pushed the 1184-transfer-fail branch from a71fbad to 9c97fb0 Compare March 11, 2019 14:19
@tomwilkie tomwilkie merged commit 1f84d68 into cortexproject:master Mar 14, 2019
@tomwilkie tomwilkie deleted the 1184-transfer-fail branch March 14, 2019 10:13
@bboreham
Copy link
Contributor

PLEASE PUT FLAG MEANING CHANGES LOUDLY IN THE TITLE
