backend/local: Periodically persist intermediate state snapshots #32680

apparentlymart merged 1 commit into main
Conversation
Apparently, despite my best efforts to ensure only one execution path in the two concurrent threads of the Terraform Core cancellation test, there's still something missing here which the race detector run has found. I've run out of time for today, so I'll look into that more soon. I'd still appreciate review and feedback on the parts of the code other than that failing test, since if we significantly change the design before landing this then that test might need to behave quite differently anyway! 😀
Force-pushed from c4d1ffc to 6935898.
After reviewing the code a little more, I realized that the test couldn't really be made to reliably pass as previously written, because it's trying to rely on something that Terraform Core can't normally guarantee: the timing of the "Stopping" hook in relation to the other hooks. I think I've resolved this by adding some extra synchronization to the mock provider in the test. In real code we don't actually need to coordinate the order of events so tightly, but ensuring a consistent order makes this test's implementation simpler.
Reminder for the merging maintainer: if this is a user-visible change, please update the changelog on the appropriate release branch.
Cool, I have been waiting for this feature for a long time.
Does this mechanism also apply to resources inside modules? That is, will a resource successfully created by a module be part of the state snapshot? I've faced this problem when using community-provided modules which could not complete for some external reason: I had to read and understand the module in order to clean up partially deployed resources. This improvement should also cover that "module" situation, so that users of external modules aren't forced to dig into a module's details on failure. Is this covered?
State snapshots always cover the entire configuration; module boundaries are not really relevant in the state, aside from providing a namespace prefix to avoid collisions between similarly-named objects in different modules.
I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
Terraform Core emits a hook event every time it writes a change into the in-memory state. Previously the local backend would just copy that into the transient storage of the state manager, but for most state storage implementations that doesn't really do anything useful because it just makes another copy of the state in memory.
We originally added this hook mechanism with the intent of making Terraform persist the state each time, but we backed that out after finding that it was a bit too aggressive and was making the state snapshot history much harder to use in storage systems that can preserve historical snapshots. It was also hitting rate limits in some cases, because state changes can arrive in quick succession when making lots of small changes.
However, sometimes Terraform gets killed mid-apply for whatever reason and in our previous implementation that meant always losing that transient state, forcing the user to edit the state manually (or use "import") to recover a useful state.
In an attempt at finding a sweet spot between these extremes, here we change the rule so that if an apply runs for longer than 20 seconds then we'll try to persist the state to the backend when an update arrives at least 20 seconds after the first update, and then again for each additional 20-second period as long as Terraform Core keeps announcing new state snapshots.
This also introduces a special interruption mode where if the apply phase gets interrupted by SIGINT (or equivalent on Windows) then the local backend will try to persist the state immediately in anticipation of a possibly-imminent SIGKILL, and will then immediately persist any subsequent state update that arrives until the apply phase is complete. After interruption Terraform will not start any new operations and will instead just let any already-running operations run to completion, and so this will persist the state once per resource instance that is able to complete before being killed. As long as there's sufficient time between the initial `SIGINT` and the subsequent `SIGKILL` -- and assuming that the session with the state storage backend remains valid -- the data lost should be limited only to operations that are still in progress or very recently completed at the time of `SIGKILL`.

With this change in place, the correct way to run Terraform in a transient automation environment would be to set both a soft and a hard deadline, where the soft deadline sends `SIGINT` and then the hard deadline sends `SIGKILL` some reasonable amount of time later if Terraform hasn't yet been able to gracefully exit.

Long-running applies will now generate intermediate state snapshots where they wouldn't before, but there should still be considerably fewer snapshots than were created when we were persisting for each individual state change. For apply runs that complete in less than 20 seconds there will be no additional state snapshots compared to today's Terraform, and Terraform will also skip creating new snapshots if there's nothing changed in the state. We can adjust the 20 second interval in future commits if we find that this spot isn't as sweet as first assumed.
I have intentionally not made the interval end-user-configurable here, because we prefer good defaults over configurable knobs wherever possible, and also because we want to keep the freedom to change the details of how this works in future releases in a way that might be based on something other than the passage of time, at which point an explicit time interval setting would become redundant.
Closes #32658, closes #24276, and closes #20718.
If we move forward with this then a good place to discuss the soft/hard deadline behavior would be the Running Terraform in Automation guide, but the guides no longer live in this repository so we'll need to update that separately at a later time.
I'm anticipating releasing this for the first time in Terraform v1.5 so that it'll have plenty of time to bake in alpha and beta releases in case we want to tweak it before final, but that does mean that we should probably hold the guide updates until the v1.5.0 final release so that the guide isn't promising something that isn't yet available in any stable release.