
backend/local: Periodically persist intermediate state snapshots #32680

Merged

apparentlymart merged 1 commit into main from f-persiststate-periodic on Feb 14, 2023
Conversation

@apparentlymart (Contributor) commented Feb 14, 2023

Terraform Core emits a hook event every time it writes a change into the in-memory state. Previously the local backend would just copy that into the transient storage of the state manager, but for most state storage implementations that doesn't really do anything useful because it just makes another copy of the state in memory.

We originally added this hook mechanism with the intent of making Terraform persist the state each time, but we backed that out after finding that it was a bit too aggressive and was making the state snapshot history much harder to use in storage systems that can preserve historical snapshots. It was also hitting rate limits in some cases, because state changes can arrive in quick succession when making lots of small changes.

However, sometimes Terraform gets killed mid-apply for whatever reason and in our previous implementation that meant always losing that transient state, forcing the user to edit the state manually (or use "import") to recover a useful state.

In an attempt to find a sweet spot between these extremes, here we change the rule so that if an apply runs for longer than 20 seconds we'll try to persist the state to the backend on a state update that arrives at least 20 seconds after the first update, and then again for each additional 20-second period as long as Terraform Core keeps announcing new state snapshots.

This also introduces a special interruption mode: if the apply phase is interrupted by SIGINT (or the equivalent on Windows), the local backend will try to persist the state immediately in anticipation of a possibly-imminent SIGKILL, and will then immediately persist any subsequent state update that arrives until the apply phase is complete. After interruption Terraform will not start any new operations; it will instead let any already-running operations run to completion, so this persists the state once per resource instance that is able to complete before being killed. As long as there is sufficient time between the initial SIGINT and any subsequent SIGKILL, and assuming that the session with the state storage backend remains valid, the data lost should be limited to operations that were still in progress or very recently completed at the time of the SIGKILL.

With this change in place, the correct way to run Terraform in a transient automation environment would be to set both a soft and a hard deadline, where the soft deadline sends SIGINT and then the hard deadline sends SIGKILL some reasonable amount of time later if Terraform hasn't yet been able to gracefully exit.

Long-running applies will now generate intermediate state snapshots where they wouldn't before, but there should still be considerably fewer snapshots than were created when we were persisting each individual state change. For apply runs that complete in less than 20 seconds there will be no additional state snapshots compared to today's Terraform, and Terraform will also skip creating new snapshots if nothing has changed in the state. We can adjust the 20-second interval in future commits if we find that this spot isn't as sweet as first assumed.

I have intentionally not made the interval end-user-configurable here: we prefer good defaults over configurable knobs wherever possible, and we also want to keep the freedom to change how this works in future releases, possibly based on something other than the passage of time, at which point an explicit time interval setting would become redundant.

Closes #32658, closes #24276, and closes #20718.


If we move forward with this then a good place to discuss the soft/hard deadline behavior would be the Running Terraform in Automation guide, but the guides no longer live in this repository so we'll need to update that separately at a later time.

I'm anticipating releasing this for the first time in Terraform v1.5 so that it'll have plenty of time to bake in alpha and beta releases in case we want to tweak it before final, but that does mean that we should probably hold the guide updates until the v1.5.0 final release so that the guide isn't promising something that isn't yet available in any stable release.

@apparentlymart (Contributor, Author) commented Feb 14, 2023

Apparently, despite my best efforts to ensure only one execution path in the two concurrent threads of the Terraform Core cancellation test, there's still something missing here that the race detector run has found. I've run out of time for today, so I'll look into it more soon.

I'd still appreciate review and feedback on the parts of the code other than that failing test, since if we significantly change the design before landing this then that test might need to behave quite differently anyway! 😀

Terraform Core emits a hook event every time it writes a change into the
in-memory state. Previously the local backend would just copy that into
the transient storage of the state manager, but for most state storage
implementations that doesn't really do anything useful because it just
makes another copy of the state in memory.

We originally added this hook mechanism with the intent of making
Terraform _persist_ the state each time, but we backed that out after
finding that it was a bit too aggressive and was making the state snapshot
history much harder to use in storage systems that can preserve historical
snapshots.

However, sometimes Terraform gets killed mid-apply for whatever reason and
in our previous implementation that meant always losing that transient
state, forcing the user to edit the state manually (or use "import") to
recover a useful state.

In an attempt at finding a sweet spot between these extremes, here we
change the rule so that if an apply runs for longer than 20 seconds then
we'll try to persist the state to the backend in an update that arrives
at least 20 seconds after the first update, and then again for each
additional 20 second period as long as Terraform keeps announcing new
state snapshots.

This also introduces a special interruption mode where if the apply phase
gets interrupted by SIGINT (or equivalent) then the local backend will
try to persist the state immediately in anticipation of a
possibly-imminent SIGKILL, and will then immediately persist any
subsequent state update that arrives until the apply phase is complete.
After interruption Terraform will not start any new operations and will
instead just let any already-running operations run to completion, and so
this will persist the state once per resource instance that is able to
complete before being killed.

This does mean that now long-running applies will generate intermediate
state snapshots where they wouldn't before, but there should still be
considerably fewer snapshots than were created when we were persisting
for each individual state change. We can adjust the 20 second interval
in future commits if we find that this spot isn't as sweet as first
assumed.
@apparentlymart (Contributor, Author) commented Feb 14, 2023

After reviewing the code a little more I realized that the test couldn't really be made to reliably pass as previously written because it's trying to rely on something that Terraform Core can't normally guarantee: the timing of the "Stopping" hook in relation to the other hooks.

I think I've resolved this both by adding some extra synchronization to the mock provider in the test (making it serialize its ApplyResourceChange and Stop control flow in a way that a normal provider wouldn't) and by slightly changing the order of operations in Context.Stop to an order that is still correct but that now makes the extra synchronization in the test effective for forcing a fixed delivery order of the hook events. As a bonus, this also gives the local backend's state hook a marginally earlier warning about the stopping event, though I don't expect that to be material.

In real code we don't actually need to coordinate the order of events so tightly, but ensuring a consistent order makes this test's implementation simpler.
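The serialization trick can be illustrated with a toy example. The types below are hypothetical stand-ins (the real mock provider implements the full provider interface); the point is that the mock's apply blocks until it has observed the stop call, so the events are delivered in a fixed order:

```go
package main

import (
	"fmt"
	"sync"
)

// mockProvider sketches the extra test synchronization described above:
// ApplyResourceChange waits until Stop has run, forcing a deterministic
// ordering that a normal provider would not impose.
type mockProvider struct {
	stopCalled chan struct{}
	once       sync.Once
}

func (p *mockProvider) Stop() {
	p.once.Do(func() { close(p.stopCalled) })
}

func (p *mockProvider) ApplyResourceChange() string {
	<-p.stopCalled // serialize: apply completes only after Stop has been observed
	return "applied after stop"
}

func main() {
	p := &mockProvider{stopCalled: make(chan struct{})}
	results := make(chan string, 1)
	go func() { results <- p.ApplyResourceChange() }()
	p.Stop()
	fmt.Println(<-results)
}
```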

@apparentlymart apparentlymart merged commit f0de9b6 into main Feb 14, 2023
@apparentlymart apparentlymart deleted the f-persiststate-periodic branch February 14, 2023 23:18
@github-actions (bot)
Reminder for the merging maintainer: if this is a user-visible change, please update the changelog on the appropriate release branch.

@LanceXuanLi

Cool, I have been waiting for this feature for a long time.

@stephanpelikan

Does this mechanism also apply to resources in modules? That is, will a resource successfully created by a module be part of the state snapshot?

I have faced this problem when using community-provided modules whose apply could not complete for some external reason. I had to read and understand the module in order to clean up partially deployed resources. This improvement should also cover that "module" situation, so that users of external modules aren't forced to dig into a module's details on failure. Is this covered?

@apparentlymart (Contributor, Author)

State snapshots always cover the entire configuration; module boundaries are not really relevant in the state aside from providing a namespace prefix to avoid collisions between similarly-named objects in different modules.

@github-actions (bot) commented Apr 1, 2023

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 1, 2023
4 participants