
Conversation


@shubham-pampattiwar shubham-pampattiwar commented Feb 1, 2022

This PR adds an initial design document to add support for ExistingResourcePolicy to Velero's restore API.
Fixes: #4842

@github-actions github-actions bot requested a review from a-mccarthy February 1, 2022 16:17
@github-actions github-actions bot added the Area/Design Design Documents label Feb 1, 2022

sseago commented Feb 1, 2022

This relates to #4066


birsanv commented Feb 1, 2022

@shubham-pampattiwar I want to include this scenario in your design.

We want to consider delta resources here (resources restored by a previous backup but no longer present in the latest backup). We need to be able to identify those resources so that they can be cleaned up.

The scenario:

  1. Cluster A has Policy1, Policy2, Policy3
  2. Back up the policies in backup1
  3. Restore backup1 on a brand new cluster, Cluster B: all policies are created, OK
  4. Now, on Cluster A, delete Policy1 and update Policy2
  5. On Cluster A, create a second backup, backup2 (backup2 contains only Policy2 and Policy3)
  6. Restore backup2 on Cluster B; with the fix we are discussing now:
  • Policy1 is still on Cluster B, untouched; annotated with velero.io/backup-name=backup1
  • Policy2 gets updated; annotated with velero.io/backup-name=backup2
  • Policy3 is not updated, untouched; annotated with velero.io/backup-name=backup1

Result:

Cluster B is in an inconsistent state; Policy1 should no longer be running on this cluster.
There is no differentiation between these policies to tell that:

  • Policy1 was not part of the latest restore (it is gone and should be removed)
  • Policy3 was part of the restore but nothing changed (it is OK and should be kept)

Proposed solution:

When the restore runs and finds a resource in the backup which is identical to the one on the hub, it should still update the backup annotation to velero.io/backup-name=backup2.
In this way, we can distinguish restored resources not touched by the new restore because they were no longer in the backup from the ones not touched by the current restore because they were unchanged.
I can find which resources should be removed from the cluster (Policy1) by running: get resources with the velero.io/backup-name label AND velero.io/backup-name != backup2

And if we can identify the restored delta, we can remove those resources after the restore is completed. That can be done in a separate enhancement.
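For illustration, that query could be expressed as a kubectl label selector along these lines (a sketch: `policies` stands in for whatever resource kind was restored, and this assumes Velero stamps restored resources with the velero.io/backup-name label):

```shell
# Resources that carry a velero.io/backup-name label but were not
# re-labeled by the latest restore (backup2) are cleanup candidates,
# e.g. Policy1 in the scenario above.
kubectl get policies --all-namespaces \
  -l 'velero.io/backup-name,velero.io/backup-name!=backup2'
```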


sseago commented Feb 1, 2022

@birsanv So it sounds like for this use case we don't want the restore to delete things that aren't in the backup -- we just want the restore to update annotations on items that would have been patched but aren't because they haven't changed -- am I understanding that correctly?

As for where it fits in the design, assuming the above is correct, I would think we wouldn't do this in the "none" option (i.e. preserving legacy behavior); also, it doesn't make sense to update the annotation on unchanged resources when we're simply warning on changed resources. Also, we may not want to always do this when updating, as it's extra work which may not matter for some use cases. As such, I'd propose adding a new option to the policy list:

  • none: info log for unchanged, warn on changed (same as proposed earlier)
  • update: info log for unchanged, patch on changed, warn if patch failed (same as proposed earlier)
  • updateAll: info log and update annotation on unchanged, patch on changed, warn if patch failed (this is the new option).

We should also call out that any removal of dangling resources from old backups is out-of-scope for velero -- this would be done (possibly manually) by the user after restore using a delete operation with label selector or something similar.
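On the Restore spec, the proposal might look like the following (a sketch assuming the `existingResourcePolicy` field from approach 1 below; `updateAll` is the new option under discussion here, not a committed value):

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-backup2
  namespace: velero
spec:
  backupName: backup2
  # proposed values: none | update | updateAll
  existingResourcePolicy: updateAll
```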


birsanv commented Feb 1, 2022

@sseago your proposal makes sense.
As long as we have an option that allows updating the annotation on unchanged resources (updateAll in the above), that should cover the delta scenario I mentioned.
Agreed that removing dangling resources can be done outside of Velero, as long as we have a way to identify them.

@shubham-pampattiwar shubham-pampattiwar force-pushed the add-design-existing-rs-policy branch from 3c75638 to de5a188 on February 2, 2022 23:01
@a-mccarthy a-mccarthy removed their request for review February 14, 2022 20:17

@sseago sseago left a comment


Looks good overall -- minor typo fix identified by the Codespell GitHub action, though.

sseago previously approved these changes Feb 22, 2022
skip restoration.
2. `update` or `merge` or `patch`: In this behaviour Velero will attempt to patch the resource with the backed-up version;
if the patch fails, log it as a warning and continue the restore process; the user may wish to specify the merge type.
3. `updateAll`: Similar to `update` option but Velero will update labels on unchanged resources too.
Collaborator

Thoughts?

Suggested change
3. `updateAll`: Similar to `update` option but Velero will update labels on unchanged resources too.
3. `updateOverrideLabels`: Similar to `update` option but Velero will update labels on unchanged resources too.

Collaborator

Suggested change
3. `updateAll`: Similar to `update` option but Velero will update labels on unchanged resources too.
3. `updateForceUnchangedLabels`: Similar to `update` option but Velero will update labels on unchanged resources too.

Collaborator

It's not so much "unchanged" labels we're forcing, it's the labels that velero adds on restore -- these labels aren't in the backup. updateForceVeleroLabels is more accurate, but I think it's a bit clumsy as a config option since it's so long. Perhaps just updateVeleroLabels -- that's less clumsy, but it might confuse users into thinking we're only updating velero labels. I'm thinking that updateAll may be best, as long as the documentation is clear on what this means: "Update all resources, including unchanged ones, in order to ensure that the velero labels related to the backup and restore are appropriately updated."

@kaovilai kaovilai Feb 23, 2022

When was the last time verbosity caused problems? It accelerated Java adoption. You'd most likely type it once in JSON/YAML, or the velero CLI autocompletes it, or it could also have an accompanying shorthand flag.

Collaborator

I prefer to cater flags to the "I haven't read the doc" folks more than the "I type this out every day so it has to be short" folks, since support escalations come from the former.

Collaborator

updateIncludingVeleroLabelsForAll would probably be the most accurate. Or perhaps updateAndForceVeleroLabels. I'd still vote for updateAll, but if we want something longer, one of these would be clearer than the others I think.

Reference: https://github.com/vmware-tanzu/velero/pull/4613#issuecomment-1027260446

## High-Level Design
### Approach 1: Add a new spec field `existingResourcePolicy` to the Restore API
Contributor

What about calling this "ResourceOverwritePolicy"? This name indicates what we want to do with this policy, that is, to decide how to overwrite the (existing) resource.

```yaml
existingResourcePolicy: update
```

### Approach 2: Add a new spec field `existingResourcePolicyConfig` to the Restore API
Contributor

What about using the name "ResourceOverwriteExceptions" instead, for the same reason: to make the intent clearer (they are the exceptions to ResourceOverwritePolicy)?

## Non Goals
- Add support for `ExistingResourcePolicy` to restore API for Non-Kubernetes resources.
- Add support for `ExistingResourcePolicy` to restore API for `PersistentVolume` data.
- Change existing restore workflow for `ServiceAccount` objects
Contributor

Do we have a plan to support the above "Non Goal" objects? If not, it is important to write the scope of the policy into the doc in a clear way.

Collaborator

I wonder whether it makes sense to distinguish "non-goals" that are desired but out-of-scope for now from "non-goals" that are not considered part of this at all (i.e. they're unrelated things that would be dealt with in unrelated proposals). For non-kube and PV data, I'd put them in the category of "this is completely different functionality". For service accounts, it's similar -- basically, service accounts are a separate use case that is hard-coded with type-specific functionality, so we completely ignore them so as not to break that type-specific outcome.

Collaborator Author

@Lyndon-Li @sseago I will add the non-kube and PV data parts in a new section titled Unrelated Proposals.

Collaborator Author

| `none` | Existing Velero behavior. | `info` log for unchanged resource, `warn` on changed resource |
| `update` | Velero attempts a `patch` on a changed resource | `info` log for unchanged, `warn` if patch failed |
| `updateAll` | Velero updates labels on unchanged resources and attempts a `patch` on changed resources | `info` log for unchanged and `warn` if patch failed |

Contributor

Deleting a resource is very critical to users, so let's consider the case below:
A resource is indeed in the current restore, but "update" or "updateAll" failed to update it to the new backup version; per our suggestion, users would still pick it up as a to-be-deleted resource by comparing the version label.

Suggestion: actually, during the restore, we are able to detect the to-be-deleted resources accurately, so we can print them centrally in the log; still, as a reference, it is up to users to decide whether to delete these resources.

Collaborator

I don't think we are able to detect the "to-be-deleted" during restore, since we won't touch them at all. We're only processing resources that are in the current backup that's being restored. The previously-restored resources that are not in the current backup become "to-be-deleted" by default. These resources don't actually get considered/processed by the restore operation at all, so we are not in a position to log anything about them.

Collaborator

Is the "patch labels" operation something that is likely to fail? I know for modified resources, failure to patch is a real possibility, if some immutable field is different in the backup vs. the cluster, but if all we're patching are the velero labels, is failure here likely? If it's really unlikely, should "unchanged resource only patching label, failed patch" be considered a restore error rather than a restore warning?

Contributor

As far as I understand, we want users to decide which resources are out-of-date and thus should be deleted. So how do users tell? They compare the version labels we put on the resources during restore. Ultimately, it is our label that causes the deletion of a resource. If my above suggestion doesn't work, we should find another way to cope with this problem.

For "patch labels", personally, I think we should consider it case by case: if the result of the patching is critical, we definitely need to fail the restore; otherwise, we can let it go with a warning.

Collaborator

So for unchanged resources, if we're patching labels for the purpose of allowing users to delete things (i.e. the "updateAll" functionality), we may want to make it an error since, as you point out, the unlabeled resource would otherwise look like it's out-of-date and ready to be deleted. For modified resources, it's trickier. If it's not too much trouble in the implementation, I'd propose the following:

For modified resources, if the full patch fails and the policy is "update", just warn and move on.
For modified resources, if the full patch fails and the policy is "updateAll", we have a couple of options:

  1. error on patch failure, just as for unmodified resources
  2. attempt to patch just the label; if this succeeds, add the usual "couldn't modify resource" warning, and if label patching fails, add the "label update failed" error and the "couldn't update resource" warning.

skip restoration.
2. `update` or `merge` or `patch`: In this behaviour Velero will attempt to patch the resource with the backed-up version;
if the patch fails, log it as a warning and continue the restore process; the user may wish to specify the merge type.
3. `updateAll`: Similar to `update` option but Velero will update labels on unchanged resources too.
Contributor

I think this comment makes sense:
#4613 (comment)
Therefore, why not just make it the default behavior of the update/patch option? I.e., if the user chooses this option, the annotation velero.io/backup-name is always updated. Are there benefits to NOT updating it?

Collaborator

The only real benefit of not updating it would be a (probably minor) performance increase. I wouldn't be completely opposed to just making the "update" policy behave like "updateAll" described above, simplifying the configuration options a bit.


## Detailed Design
### Approach 1: Add a new spec field `existingResourcePolicy` to the Restore API
Contributor

What's your preference in terms of approach 1 vs 2 vs 3?
Approach 1 is probably good enough? Are we good not to make it any more complicated?
I want to point out that it's not flexible for introducing more fine-grained control, but I'm totally fine with that.

@sseago sseago Mar 1, 2022

My thinking is that approach 1 is probably good enough for a first-release implementation. It leaves the door open to expanding in the future to approach 3. So for now, we just make one policy for the whole restore. In the future, we could add the policy override by resource type.

Collaborator Author

@reasonerjt @sseago I vote for approach 1 as well. Easy to implement and scale.

- `patch:`
- `includedResources:` [ ]string
- `recreate:` [ ]string
- `includedResources:` [ ]string
Contributor

What if the values of `includedResources` for different actions overlap?

Collaborator

We would need to be explicit in documenting this, but I think "first match" is good enough. Ideally there would be no overlap here, but if there is, take the first one in the list.
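For instance (a sketch reusing the shape quoted above; the resource types and the `existingResourcePolicyConfig` field name borrowed from approach 2 are illustrative), a type listed under two actions would resolve to the first match:

```yaml
existingResourcePolicyConfig:
  patch:
    includedResources: [secrets, configmaps]
  recreate:
    includedResources: [secrets]  # overlaps with patch; first match (patch) wins
```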

Collaborator

Or error out so the user can fix it in the next run?

@reasonerjt

A more general question: have we discussed whether this option should be exposed via the CLI? I think the answer is yes, and this design should add one more section to describe how the user should set the policy via the CLI and what changes in the code should be made.


sseago commented Mar 1, 2022

Regarding the CLI, I think "approach 1" should be relatively easy to add to the CLI -- and yes, that should probably be added to the design. I think the more complex configuration of approach 3 (the resource-specific overrides) may be OK to leave out of the CLI. If the user needs complex policy like that, they need to create the CR directly, but if a single per-restore policy is sufficient, the CLI should have a flag for that.
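As a sketch of what that per-restore flag might look like (the flag name is illustrative at this point in the discussion, not a committed interface):

```shell
# Hypothetical CLI flag applying one policy to the whole restore;
# backup2 from the scenario discussed above.
velero restore create restore-backup2 \
  --from-backup backup2 \
  --existing-resource-policy=update
```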

@shubham-pampattiwar shubham-pampattiwar force-pushed the add-design-existing-rs-policy branch from e730f5b to 06fee54 on March 3, 2022 08:03

reasonerjt commented Mar 8, 2022

I would vote for approach 1, and my only concern is the potentially confusing difference between update and updateAll:
https://github.com/vmware-tanzu/velero/pull/4613/files#r816560921

@shubham-pampattiwar

@reasonerjt Let's say we dump the updateAll option altogether. We keep only 2 policy options:

  1. none: This option preserves the existing Velero restore workflow.
  2. update: This option would provide the following behavior:
    • Unchanged resources: Velero would update the backup/restore labels on the unchanged resources; if the label patch fails, Velero adds a restore error.
    • Changed resources: Velero will first try to patch the changed resource. Now, if the patch:
      • succeeds: the in-cluster resource gets updated with the labels as well as the resource diff
      • fails: Velero adds a restore warning and tries to just update the backup/restore labels on the resource; if the label patch also fails, we add a restore error.

Does the above sound reasonable?

@shubham-pampattiwar

> A more general question: have we discussed whether this option should be exposed via the CLI? I think the answer is yes, and this design should add one more section to describe how the user should set the policy via the CLI and what changes in the code should be made.

Added details for the CLI changes: https://github.com/vmware-tanzu/velero/pull/4613/files#diff-472925e9329f9cf29ef5d1c15a734e96587e5e975d67b7b7fec53936a280c28dR251


sseago commented Mar 8, 2022

@shubham-pampattiwar Yes, the proposed change to update/updateAll seems reasonable to me.


codecov-commenter commented Mar 8, 2022

Codecov Report

❗ No coverage uploaded for pull request base (main@574baeb).
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main    #4613   +/-   ##
=======================================
  Coverage        ?   41.44%           
=======================================
  Files           ?      204           
  Lines           ?    18094           
  Branches        ?        0           
=======================================
  Hits            ?     7499           
  Misses          ?    10046           
  Partials        ?      549           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@shubham-pampattiwar shubham-pampattiwar force-pushed the add-design-existing-rs-policy branch from 06fee54 to 3137ad6 on March 9, 2022 23:31
@shubham-pampattiwar

I have updated the design proposal, merging the update and updateAll policy options (dumping the updateAll option altogether). Going ahead with the implementation of approach 1 as discussed in the community call. I will work on feature documentation as well.
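To summarize the merged behavior in config form (a sketch; values per the final approach 1 design, where `update` also refreshes the backup/restore labels on unchanged resources):

```yaml
spec:
  backupName: backup2
  # "none" preserves legacy behavior; "update" patches changed resources
  # and updates labels on unchanged ones.
  existingResourcePolicy: update
```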

add design doc for existing resource policy

Signed-off-by: Shubham Pampattiwar <[email protected]>

add use-cases and update non-goals

Signed-off-by: Shubham Pampattiwar <[email protected]>

update approach-1 and add policy-action table

Signed-off-by: Shubham Pampattiwar <[email protected]>

minor updates

Signed-off-by: Shubham Pampattiwar <[email protected]>

fix typos

Signed-off-by: Shubham Pampattiwar <[email protected]>

add CLI details

Signed-off-by: Shubham Pampattiwar <[email protected]>

dump updateAll option

Signed-off-by: Shubham Pampattiwar <[email protected]>

add implementation decision

Signed-off-by: Shubham Pampattiwar <[email protected]>
@shubham-pampattiwar shubham-pampattiwar force-pushed the add-design-existing-rs-policy branch from 3137ad6 to d449ca0 on April 27, 2022 19:24
@shubham-pampattiwar

@reasonerjt @sseago I have updated the PR with Implementation decision section.

@shubham-pampattiwar

/kind changelog-not-required

@github-actions github-actions bot added the kind/changelog-not-required PR does not require a user changelog. Often for docs, website, or build changes label Apr 27, 2022
@reasonerjt reasonerjt merged commit ad7a940 into vmware-tanzu:main May 4, 2022
sseago added a commit to sseago/velero that referenced this pull request May 19, 2022

Shubham Pampattiwar has made several contributions to Velero,
most recently in designing and implementing two v1.9
features, including the following:
- [Add design for enabling multiple label support](vmware-tanzu#4619)
- [Add multiple label selector support to Velero Backup and Restore APIs](vmware-tanzu#4650)
- [Add design for enabling support for ExistingResourcePolicy to restore API](vmware-tanzu#4613)
- [Add existingResourcePolicy to Restore API](vmware-tanzu#4628)

Shubham has also been driving forward the data mover requirements and design discussions for Velero 1.10:
- [Add datamover design](vmware-tanzu#4768)

Signed-off-by: Scott Seago <[email protected]>

Lyndon-Li pushed a commit to Lyndon-Li/velero that referenced this pull request May 27, 2022
danfengliu pushed a commit to danfengliu/velero that referenced this pull request Sep 13, 2022