
Conversation


@shubham-pampattiwar shubham-pampattiwar commented Feb 1, 2022

This PR adds an initial design document to add support for ExistingResourcePolicy to Velero's restore API.
Fixes: #4842

@github-actions github-actions bot requested a review from a-mccarthy February 1, 2022 16:17
@github-actions github-actions bot added the Area/Design Design Documents label Feb 1, 2022

sseago commented Feb 1, 2022

This relates to #4066


birsanv commented Feb 1, 2022

@shubham-pampattiwar I want to include this scenario in your design.

We want to consider delta resources here (resources restored by a previous backup but no longer present in the latest backup). We need to be able to identify those resources so that they can be cleaned up.

The scenario:

  1. Cluster A has Policy1, Policy2, Policy3
  2. Back up the policies in backup1
  3. Restore backup1 on a brand new cluster, Cluster B: all policies are created, OK
  4. Now, on Cluster A, delete Policy1 and update Policy2
  5. On Cluster A, create a second backup, backup2 (backup2 contains only Policy2 and Policy3)
  6. Restore backup2 on Cluster B; with the fix we are discussing now:
  • Policy1 is still on Cluster B, untouched; annotated with velero.io/backup-name=backup1
  • Policy2 gets updated; annotated with velero.io/backup-name=backup2
  • Policy3 is not updated, untouched; annotated with velero.io/backup-name=backup1

Result:

Cluster B is in an inconsistent state; Policy1 should no longer be running on this cluster.
There is no differentiation between these policies to tell that:

  • Policy1 was not part of the latest restore (it is gone and should be removed)
  • Policy3 was part of the restore but nothing changed (it is OK and should be kept)

Proposed solution:

When the restore runs and finds a resource in the backup which is identical to the one on the hub, it should still update the backup annotation to velero.io/backup-name=backup2.
In this way, we can distinguish restored resources not touched by the new restore because they were no longer in the backup from the ones not touched by the current restore because they were unchanged.
I can find which resources should be removed from the cluster (Policy1) by running: get resources with the velero.io/backup-name label AND velero.io/backup-name != backup2

And if we can identify the restored delta, we can remove those resources after the restore is completed. That can be done in a separate enhancement.
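For illustration, that query could be expressed as a kubectl label selector along these lines (a sketch: `policies` stands in for whatever resource kind was restored, and this assumes Velero stamps restored resources with the velero.io/backup-name label):

```shell
# Resources that carry a velero.io/backup-name label but were not
# re-labeled by the latest restore (backup2) are cleanup candidates,
# e.g. Policy1 in the scenario above.
kubectl get policies --all-namespaces \
  -l 'velero.io/backup-name,velero.io/backup-name!=backup2'
```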


sseago commented Feb 1, 2022

@birsanv So it sounds like for this use case we don't want the restore to delete things that aren't in the backup -- we just want the restore to update annotations on items that would have been patched but aren't because they haven't changed -- am I understanding that correctly?

As for where it fits in the design, assuming the above is correct, I would think we wouldn't do this in the "none" option (i.e. preserving legacy behavior); also, it doesn't make sense to update the annotation on unchanged resources when we're simply warning on changed resources. Also, we may not want to always do this when updating, as it's extra work which may not matter for some use cases. As such, I'd propose adding a new option to the policy list:

  • none: info log for unchanged, warn on changed (same as proposed earlier)
  • update: info log for unchanged, patch on changed, warn if patch failed (same as proposed earlier)
  • updateAll: info log and update annotation on unchanged, patch on changed, warn if patch failed (this is the new option).

We should also call out that any removal of dangling resources from old backups is out-of-scope for velero -- this would be done (possibly manually) by the user after restore using a delete operation with label selector or something similar.
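On the Restore spec, the proposal might look like the following (a sketch assuming the `existingResourcePolicy` field from approach 1 below; `updateAll` is the new option under discussion here, not a committed value):

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-backup2
  namespace: velero
spec:
  backupName: backup2
  # proposed values: none | update | updateAll
  existingResourcePolicy: updateAll
```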


birsanv commented Feb 1, 2022

@sseago your proposal makes sense.
As long as we have an option that allows updating the annotation on unchanged resources (updateAll in the above), that should cover the delta scenario I mentioned.
Agreed that removing dangling resources can be done outside of Velero, as long as we have a way to identify them.

@shubham-pampattiwar shubham-pampattiwar force-pushed the add-design-existing-rs-policy branch from 3c75638 to de5a188 on February 2, 2022 23:01
@a-mccarthy a-mccarthy removed their request for review February 14, 2022 20:17

@sseago sseago left a comment


Looks good overall -- minor typo fix identified by the Codespell GitHub action, though.

sseago previously approved these changes Feb 22, 2022
skip restoration.
2. `update` or `merge` or `patch`: In this behaviour Velero will attempt to patch the resource with the backed-up version;
if the patch fails, log it as a warning and continue the restore process; the user may wish to specify the merge type.
3. `updateAll`: Similar to `update` option but Velero will update labels on unchanged resources too.
Collaborator

Thoughts?

Suggested change
3. `updateAll`: Similar to `update` option but Velero will update labels on unchanged resources too.
3. `updateOverrideLabels`: Similar to `update` option but Velero will update labels on unchanged resources too.

Collaborator

Suggested change
3. `updateAll`: Similar to `update` option but Velero will update labels on unchanged resources too.
3. `updateForceUnchangedLabels`: Similar to `update` option but Velero will update labels on unchanged resources too.

Collaborator

It's not so much "unchanged" labels we're forcing, it's the labels that velero adds on restore -- these labels aren't in the backup. updateForceVeleroLabels is more accurate, but I think it's a bit clumsy as a config option since it's so long. Perhaps just updateVeleroLabels -- that's less clumsy, but it might confuse users into thinking we're only updating velero labels. I'm thinking that updateAll may be best, as long as the documentation is clear on what this means: "Update all resources, including unchanged ones, in order to ensure that the velero labels related to the backup and restore are appropriately updated."

@kaovilai kaovilai Feb 23, 2022

When was the last time verbosity caused problems? It accelerated Java adoption. You'd most likely type it once in JSON/YAML, or the velero CLI autocompletes it, or it could also have an accompanying shorthand flag.

Collaborator

I prefer to cater flags to the "I haven't read the doc" folks more than the "I type this out every day so it has to be short" folks, since support escalations come from the former.

Collaborator

updateIncludingVeleroLabelsForAll would probably be the most accurate. Or perhaps updateAndForceVeleroLabels. I'd still vote for updateAll, but if we want something longer, one of these would be clearer than the others I think.

Reference: https://github.com/vmware-tanzu/velero/pull/4613#issuecomment-1027260446

## High-Level Design
### Approach 1: Add a new spec field `existingResourcePolicy` to the Restore API
Contributor

What about calling this "ResourceOverwritePolicy"? This name indicates what we want to do with this policy, that is, to decide how to overwrite the (existing) resource.

```yaml
existingResourcePolicy: update
```

### Approach 2: Add a new spec field `existingResourcePolicyConfig` to the Restore API
Contributor

What about using the name "ResourceOverwriteExceptions" instead, for the same reason: to make the intent clearer (they are the exceptions to ResourceOverwritePolicy)?

## Non Goals
- Add support for `ExistingResourcePolicy` to restore API for Non-Kubernetes resources.
- Add support for `ExistingResourcePolicy` to restore API for `PersistentVolume` data.
- Change existing restore workflow for `ServiceAccount` objects
Contributor

Do we have a plan to support the above "Non Goal" objects? If not, it is important to write the scope of the policy into the doc in a clear way.

Collaborator

I wonder whether it makes sense to distinguish "non-goals" that are desired but out-of-scope for now from "non-goals" that are not considered part of this at all (i.e. they're unrelated things that would be dealt with in unrelated proposals). For non-kube and PV data, I'd put them in the category of "this is completely different functionality". For service accounts, it's similar -- basically, service accounts are a separate use case that is hard-coded with type-specific functionality, so we completely ignore them so as not to break that type-specific outcome.

Collaborator Author

@Lyndon-Li @sseago I will add the non-kube and PV data parts in a new section titled Unrelated Proposals.

Collaborator Author

| `none` | Existing Velero behavior. | `info` log for unchanged resource, `warn` on changed resource |
| `update` | Velero attempts a `patch` on a changed resource | `info` log for unchanged, `warn` if patch failed |
| `updateAll` | Velero updates labels on unchanged resources and attempts a `patch` on changed resources | `info` log for unchanged and `warn` if patch failed |

Contributor

Deleting a resource is very critical to users, so let's consider the case below:
A resource is indeed in the current restore, but "update" or "updateAll" failed to update it to the new backup version; per our suggestion, users would still pick it up as a to-be-deleted resource by comparing the version label.

Suggestion: actually, during the restore, we are able to detect the to-be-deleted resources accurately, so we can print them centrally in the log; still, as a reference, it is up to users to decide whether to delete these resources.

Collaborator

I don't think we are able to detect the "to-be-deleted" during restore, since we won't touch them at all. We're only processing resources that are in the current backup that's being restored. The previously-restored resources that are not in the current backup become "to-be-deleted" by default. These resources don't actually get considered/processed by the restore operation at all, so we are not in a position to log anything about them.

Collaborator

Is the "patch labels" operation something that is likely to fail? I know for modified resources, failure to patch is a real possibility, if some immutable field is different in the backup vs. the cluster, but if all we're patching are the velero labels, is failure here likely? If it's really unlikely, should "unchanged resource only patching label, failed patch" be considered a restore error rather than a restore warning?

Contributor

As far as I understand, we want users to decide which resources are out-of-date and thus should be deleted. So how do users tell? They compare the version labels we put on the resources during restore. Ultimately, it is our label that causes the deletion of a resource. If my above suggestion doesn't work, we should find another way to cope with this problem.

For "patch labels", personally, I think we should consider it case by case: if the result of the patching is critical, we definitely need to fail the restore; otherwise, we can let it go with a warning.

Collaborator

So for unchanged resources, if we're patching labels for the purpose of allowing users to delete things (i.e. the "updateAll" functionality), we may want to make it an error since, as you point out, the unlabeled resource would otherwise look like it's out-of-date and ready to be deleted. For modified resources, it's trickier. If it's not too much trouble in the implementation, I'd propose the following:

For modified resources, if the full patch fails and the policy is "update", just warn and move on.
For modified resources, if the full patch fails and the policy is "updateAll", we have a couple of options:

  1. error on patch failure, just as for unmodified resources
  2. attempt to patch just the label; if this succeeds, add the usual "couldn't modify resource" warning, and if label patching fails, add the "label update failed" error and the "couldn't update resource" warning.

skip restoration.
2. `update` or `merge` or `patch`: In this behaviour Velero will attempt to patch the resource with the backed-up version;
if the patch fails, log it as a warning and continue the restore process; the user may wish to specify the merge type.
3. `updateAll`: Similar to `update` option but Velero will update labels on unchanged resources too.
Contributor

I think this comment makes sense:
#4613 (comment)
Therefore, why not just make it the default behavior of the update/patch option? I.e., if the user chooses this option, the annotation velero.io/backup-name is always updated. Are there benefits to NOT updating it?

Collaborator

The only real benefit of not updating it would be a (probably minor) performance increase. I wouldn't be completely opposed to just making the "update" policy behave like "updateAll" described above, simplifying the configuration options a bit.


## Detailed Design
### Approach 1: Add a new spec field `existingResourcePolicy` to the Restore API
Contributor

What's your preference in terms of approach 1 vs 2 vs 3?
Approach 1 is probably good enough? Are we good not to make it any more complicated?
I want to point out that it's not flexible for introducing more fine-grained control, but I'm totally fine with that.

@sseago sseago Mar 1, 2022

My thinking is that approach 1 is probably good enough for a first-release implementation. It leaves the door open to expanding in the future to approach 3. So for now, we just make one policy for the whole restore. In the future, we could add the policy override by resource type.

Collaborator Author

@reasonerjt @sseago I vote for approach 1 as well. Easy to implement and scale.

- `patch:`
- `includedResources:` [ ]string
- `recreate:` [ ]string
- `includedResources:` [ ]string
Contributor

What if the values of `includedResources` for different actions overlap?

Collaborator

We would need to be explicit in documenting this, but I think "first match" is good enough. Ideally there would be no overlap here, but if there is, take the first one in the list.
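For instance (a sketch reusing the shape quoted above; the resource types and the `existingResourcePolicyConfig` field name borrowed from approach 2 are illustrative), a type listed under two actions would resolve to the first match:

```yaml
existingResourcePolicyConfig:
  patch:
    includedResources: [secrets, configmaps]
  recreate:
    includedResources: [secrets]  # overlaps with patch; first match (patch) wins
```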

Collaborator

Or error out so the user can fix it in the next run?

@reasonerjt

A more general question: have we discussed whether this option should be exposed via the CLI? I think the answer is yes, and this design should add one more section to describe how the user should set the policy via the CLI and what changes in the code should be made.


sseago commented Mar 1, 2022

Regarding the CLI, I think "approach 1" should be relatively easy to add to the CLI -- and yes, that should probably be added to the design. I think the more complex configuration of approach 3 (the resource-specific overrides) may be OK to leave out of the CLI. If the user needs complex policy like that, they need to create the CR directly, but if a single per-restore policy is sufficient, the CLI should have a flag for that.
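As a sketch of what that per-restore flag might look like (the flag name is illustrative at this point in the discussion, not a committed interface):

```shell
# Hypothetical CLI flag applying one policy to the whole restore;
# backup2 from the scenario discussed above.
velero restore create restore-backup2 \
  --from-backup backup2 \
  --existing-resource-policy=update
```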

@shubham-pampattiwar shubham-pampattiwar force-pushed the add-design-existing-rs-policy branch from e730f5b to 06fee54 on March 3, 2022 08:03

reasonerjt commented Mar 8, 2022

I would vote for approach 1, and my only concern is the potentially confusing difference between update and updateAll:
https://github.com/vmware-tanzu/velero/pull/4613/files#r816560921

@shubham-pampattiwar

@reasonerjt Let's say we dump the updateAll option altogether. We keep only 2 policy options:

  1. none: This option preserves the existing Velero restore workflow.
  2. update: This option would provide the following behavior:
    • Unchanged resources: Velero would update the backup/restore labels on the unchanged resources; if the label patch fails, Velero adds a restore error.
    • Changed resources: Velero will first try to patch the changed resource. Now, if the patch:
      • succeeds: the in-cluster resource gets updated with the labels as well as the resource diff
      • fails: Velero adds a restore warning and tries to just update the backup/restore labels on the resource; if the label patch also fails, we add a restore error.

Does the above sound reasonable?

@shubham-pampattiwar

> A more general question: have we discussed whether this option should be exposed via the CLI? I think the answer is yes, and this design should add one more section to describe how the user should set the policy via the CLI and what changes in the code should be made.

Added details for the CLI changes: https://github.com/vmware-tanzu/velero/pull/4613/files#diff-472925e9329f9cf29ef5d1c15a734e96587e5e975d67b7b7fec53936a280c28dR251


sseago commented Mar 8, 2022

@shubham-pampattiwar Yes, the proposed change to update/updateAll seems reasonable to me.


codecov-commenter commented Mar 8, 2022

Codecov Report

❗ No coverage uploaded for pull request base (main@574baeb).
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main    #4613   +/-   ##
=======================================
  Coverage        ?   41.44%           
=======================================
  Files           ?      204           
  Lines           ?    18094           
  Branches        ?        0           
=======================================
  Hits            ?     7499           
  Misses          ?    10046           
  Partials        ?      549           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@shubham-pampattiwar shubham-pampattiwar force-pushed the add-design-existing-rs-policy branch from 06fee54 to 3137ad6 on March 9, 2022 23:31
@shubham-pampattiwar

I have updated the design proposal, merging the update and updateAll policy options (dumping the updateAll option altogether). Going ahead with the implementation of approach 1 as discussed in the community call. I will work on feature documentation as well.
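To summarize the merged behavior in config form (a sketch; values per the final approach 1 design, where `update` also refreshes the backup/restore labels on unchanged resources):

```yaml
spec:
  backupName: backup2
  # "none" preserves legacy behavior; "update" patches changed resources
  # and updates labels on unchanged ones.
  existingResourcePolicy: update
```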

add design doc for existing resource policy

Signed-off-by: Shubham Pampattiwar <[email protected]>

add use-cases and update non-goals

Signed-off-by: Shubham Pampattiwar <[email protected]>

update approach-1 and add policy-action table

Signed-off-by: Shubham Pampattiwar <[email protected]>

minor updates

Signed-off-by: Shubham Pampattiwar <[email protected]>

fix typos

Signed-off-by: Shubham Pampattiwar <[email protected]>

add CLI details

Signed-off-by: Shubham Pampattiwar <[email protected]>

dump updateAll option

Signed-off-by: Shubham Pampattiwar <[email protected]>

add implementation decision

Signed-off-by: Shubham Pampattiwar <[email protected]>
@shubham-pampattiwar shubham-pampattiwar force-pushed the add-design-existing-rs-policy branch from 3137ad6 to d449ca0 on April 27, 2022 19:24
@shubham-pampattiwar

@reasonerjt @sseago I have updated the PR with Implementation decision section.

@shubham-pampattiwar

/kind changelog-not-required

@github-actions github-actions bot added the kind/changelog-not-required PR does not require a user changelog. Often for docs, website, or build changes label Apr 27, 2022
@reasonerjt reasonerjt merged commit ad7a940 into vmware-tanzu:main May 4, 2022
sseago added a commit to sseago/velero that referenced this pull request May 19, 2022

Shubham Pampattiwar has made several contributions to Velero,
most recently in designing and implementing two v1.9
features, including the following:
- [Add design for enabling multiple label support](vmware-tanzu#4619)
- [Add multiple label selector support to Velero Backup and Restore APIs](vmware-tanzu#4650)
- [Add design for enabling support for ExistingResourcePolicy to restore API](vmware-tanzu#4613)
- [Add existingResourcePolicy to Restore API](vmware-tanzu#4628)

Shubham has also been driving forward the data mover requirements and design discussions for Velero 1.10:
- [Add datamover design](vmware-tanzu#4768)

Signed-off-by: Scott Seago <[email protected]>

Lyndon-Li pushed a commit to Lyndon-Li/velero that referenced this pull request May 27, 2022
danfengliu pushed a commit to danfengliu/velero that referenced this pull request Sep 13, 2022