
backend/local: Periodically persist intermediate state snapshots #32680

Merged

apparentlymart merged 1 commit into main from f-persiststate-periodic on Feb 14, 2023
Conversation

@apparentlymart (Contributor) commented Feb 14, 2023

Terraform Core emits a hook event every time it writes a change into the in-memory state. Previously the local backend would just copy that into the transient storage of the state manager, but for most state storage implementations that doesn't really do anything useful because it just makes another copy of the state in memory.

We originally added this hook mechanism with the intent of making Terraform persist the state each time, but we backed that out after finding that it was a bit too aggressive and was making the state snapshot history much harder to use in storage systems that can preserve historical snapshots. It was also hitting rate limits in some cases, because state changes can arrive in quick succession when making lots of small changes.

However, sometimes Terraform gets killed mid-apply for whatever reason and in our previous implementation that meant always losing that transient state, forcing the user to edit the state manually (or use "import") to recover a useful state.

In an attempt to find a sweet spot between these extremes, here we change the rule so that if an apply runs for longer than 20 seconds we'll try to persist the state to the backend on a state update that arrives at least 20 seconds after the first update, and then again for each additional 20-second period as long as Terraform Core keeps announcing new state snapshots.

This also introduces a special interruption mode: if the apply phase is interrupted by SIGINT (or the equivalent on Windows), the local backend will try to persist the state immediately in anticipation of a possibly-imminent SIGKILL, and will then immediately persist any subsequent state update that arrives until the apply phase is complete. After interruption Terraform will not start any new operations; it will instead let any already-running operations run to completion, so this persists the state once per resource instance that is able to complete before being killed. As long as there is sufficient time between the initial SIGINT and any subsequent SIGKILL, and assuming that the session with the state storage backend remains valid, the data lost should be limited to operations that were still in progress or very recently completed at the time of the SIGKILL.

With this change in place, the correct way to run Terraform in a transient automation environment would be to set both a soft and a hard deadline, where the soft deadline sends SIGINT and then the hard deadline sends SIGKILL some reasonable amount of time later if Terraform hasn't yet been able to gracefully exit.

Long-running applies will now generate intermediate state snapshots where they wouldn't before, but there should still be considerably fewer snapshots than were created when we were persisting each individual state change. For apply runs that complete in less than 20 seconds there will be no additional state snapshots compared to today's Terraform, and Terraform will also skip creating new snapshots if nothing has changed in the state. We can adjust the 20-second interval in future commits if we find that this spot isn't as sweet as first assumed.

I have intentionally not made the interval end-user-configurable here: we prefer good defaults over configurable knobs wherever possible, and we also want to keep the freedom to change how this works in future releases, possibly based on something other than the passage of time, at which point an explicit time interval setting would become redundant.

Closes #32658, closes #24276, and closes #20718.


If we move forward with this then a good place to discuss the soft/hard deadline behavior would be the Running Terraform in Automation guide, but the guides no longer live in this repository so we'll need to update that separately at a later time.

I'm anticipating releasing this for the first time in Terraform v1.5 so that it'll have plenty of time to bake in alpha and beta releases in case we want to tweak it before final, but that does mean that we should probably hold the guide updates until the v1.5.0 final release so that the guide isn't promising something that isn't yet available in any stable release.

@apparentlymart (Contributor, Author) commented Feb 14, 2023

Apparently, despite my best efforts to ensure only one execution path in the two concurrent threads of the Terraform Core cancellation test, there's still something missing here that the race detector run has found. I've run out of time for today, so I'll look into it more soon.

I'd still appreciate review and feedback on the parts of the code other than that failing test, since if we significantly change the design before landing this then that test might need to behave quite differently anyway! 😀

Terraform Core emits a hook event every time it writes a change into the
in-memory state. Previously the local backend would just copy that into
the transient storage of the state manager, but for most state storage
implementations that doesn't really do anything useful because it just
makes another copy of the state in memory.

We originally added this hook mechanism with the intent of making
Terraform _persist_ the state each time, but we backed that out after
finding that it was a bit too aggressive and was making the state snapshot
history much harder to use in storage systems that can preserve historical
snapshots.

However, sometimes Terraform gets killed mid-apply for whatever reason and
in our previous implementation that meant always losing that transient
state, forcing the user to edit the state manually (or use "import") to
recover a useful state.

In an attempt at finding a sweet spot between these extremes, here we
change the rule so that if an apply runs for longer than 20 seconds then
we'll try to persist the state to the backend in an update that arrives
at least 20 seconds after the first update, and then again for each
additional 20 second period as long as Terraform keeps announcing new
state snapshots.

This also introduces a special interruption mode where if the apply phase
gets interrupted by SIGINT (or equivalent) then the local backend will
try to persist the state immediately in anticipation of a
possibly-imminent SIGKILL, and will then immediately persist any
subsequent state update that arrives until the apply phase is complete.
After interruption Terraform will not start any new operations and will
instead just let any already-running operations run to completion, and so
this will persist the state once per resource instance that is able to
complete before being killed.

This does mean that now long-running applies will generate intermediate
state snapshots where they wouldn't before, but there should still be
considerably fewer snapshots than were created when we were persisting
for each individual state change. We can adjust the 20 second interval
in future commits if we find that this spot isn't as sweet as first
assumed.
@apparentlymart (Contributor, Author) commented Feb 14, 2023

After reviewing the code a little more I realized that the test couldn't really be made to reliably pass as previously written because it's trying to rely on something that Terraform Core can't normally guarantee: the timing of the "Stopping" hook in relation to the other hooks.

I think I've resolved this both by adding some extra synchronization to the mock provider in the test (making it serialize its ApplyResourceChange and Stop control flow in a way that a normal provider wouldn't) and by slightly changing the order of operations in Context.Stop to an order that is still correct but that now makes the extra synchronization in the test effective for forcing a fixed delivery order of the hook events. As a bonus, this also gives the local backend's state hook a marginally earlier warning about the stopping event, though I don't expect that to be material.

In real code we don't actually need to coordinate the order of events so tightly, but ensuring a consistent order makes this test's implementation simpler.
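The serialization trick can be illustrated with a toy example. The types below are hypothetical stand-ins (the real mock provider implements the full provider interface); the point is that the mock's apply blocks until it has observed the stop call, so the events are delivered in a fixed order:

```go
package main

import (
	"fmt"
	"sync"
)

// mockProvider sketches the extra test synchronization described above:
// ApplyResourceChange waits until Stop has run, forcing a deterministic
// ordering that a normal provider would not impose.
type mockProvider struct {
	stopCalled chan struct{}
	once       sync.Once
}

func (p *mockProvider) Stop() {
	p.once.Do(func() { close(p.stopCalled) })
}

func (p *mockProvider) ApplyResourceChange() string {
	<-p.stopCalled // serialize: apply completes only after Stop has been observed
	return "applied after stop"
}

func main() {
	p := &mockProvider{stopCalled: make(chan struct{})}
	results := make(chan string, 1)
	go func() { results <- p.ApplyResourceChange() }()
	p.Stop()
	fmt.Println(<-results)
}
```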

@apparentlymart apparentlymart merged commit f0de9b6 into main Feb 14, 2023
@apparentlymart apparentlymart deleted the f-persiststate-periodic branch February 14, 2023 23:18
@github-actions (bot)
Reminder for the merging maintainer: if this is a user-visible change, please update the changelog on the appropriate release branch.

@LanceXuanLi

Cool, I have been waiting for this feature for a long time.

@stephanpelikan

Does this mechanism also apply to resources in modules? That is, will a resource successfully created by a module be part of the state snapshot?

I have faced this problem when using community-provided modules whose apply could not complete for some external reason. I had to read and understand the module in order to clean up partially deployed resources. This improvement should also cover that "module" situation, so that users of external modules aren't forced to dig into a module's details on failure. Is this covered?

@apparentlymart (Contributor, Author)

State snapshots always cover the entire configuration; module boundaries are not really relevant in the state aside from providing a namespace prefix to avoid collisions between similarly-named objects in different modules.

@github-actions (bot) commented Apr 1, 2023

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 1, 2023
4 participants