-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Snapshot restore progress #490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
dhiaayachi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really cool feature @rboyer!! I had a small comment.
| return nil | ||
| } | ||
|
|
||
| func (r *Raft) tryRestoreSingleSnapshot(snapshot *SnapshotMeta) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't be better to make tryRestoreSingleSnapshot return an error and create the logger outside of it and log the error when it's returned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe? I wanted to have a derived logger here (Logger.With()) so that all of the progress logs and the errors got the same set of bonus KV data logged.
Given that, I'd end up creating the logger outside of this method to pass in for progress in fsmRestoreAndMeasure, but returning an error that is immediately logged by the caller, which seemed differently strange.
Also the two current logs are slightly different today:
snapLogger.Error("failed to open snapshot", "error", err)
snapLogger.Error("failed to restore snapshot", "error", err)
and are carried over from the existing code. If these were changed to fmt.Errorf return values there'd be a slight change of output as we'd end up having to do snapLogger.Error(err.Error()) instead, and we'd lose the "error" hclog attribute.
I don't mind inverting that logic if you think it is warranted.
mkeeler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the one comment about what looks to be dead code. Otherwise LGTM
mkeeler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
When restoring a snapshot (on startup, installed from the leader, or during recovery) the logs are extremely terse. There are typically bookend messages indicating that a restore is going to happen, and that it is complete, but there's a big dead space in the middle.
For small snapshots this is probably fine, but for larger multi-GB snapshots this can stretch out and can be unnerving as an operator to know if it's stuck or still making progress.
This PR adjusts the logging to indicate a simple progress log message every 10s about overall completion in bytes-consumed.
Example of it in use while loading a large snapshot:
TODO: need to figure out a test