
Conversation

savitharaghunathan
Member

Failed backups can cause must-gather to be slow. This PR adds a timeout that can be passed as an argument to the gather command.

To test:
Build and push the must-gather image to your quay repo, then run:
oc adm must-gather --image=<your repo path> -- /usr/bin/gather_with_timeout <timeout_value_in_seconds>
For example:
oc adm must-gather --image=quay.io/sraghuna/must-gather1:43 -- /usr/bin/gather_with_timeout 3s
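For reference, below is a minimal sketch of how such a timeout wrapper could be structured. This is not the actual gather_with_timeout script from this PR; it assumes the coreutils timeout command is available in the must-gather image, and the namespace and collection commands are only illustrative.

#!/bin/bash
# Illustrative sketch only; the real gather_with_timeout may be organized differently.
# Assumes the coreutils `timeout` binary is present in the must-gather image.
TIMEOUT="${1:?usage: gather_with_timeout <duration, e.g. 3s>}"

run_with_timeout() {
    # Give up on a single collection step after $TIMEOUT so a hung backup
    # (e.g. one stuck on a failed BSL validation) cannot stall the whole gather.
    timeout "${TIMEOUT}" "$@" || echo "skipped (timed out or failed): $*" >&2
}

# Example collection steps (namespace and resources are placeholders):
run_with_timeout oc get backups -n openshift-adp -o yaml
run_with_timeout oc get backupstoragelocations -n openshift-adp -o yaml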

/assign @dymurray

@openshift-ci bot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Dec 6, 2021
@openshift-ci bot removed the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Dec 6, 2021
@kaovilai
Member

kaovilai commented Dec 6, 2021

Do you think there should be a default timeout, and only customize it if that's not enough?

@savitharaghunathan
Member Author

savitharaghunathan commented Dec 6, 2021

I am not sure we need a default value. In most of my test runs, the command finished quickly without specifying a timeout value. The only reason we are doing this is to avoid must-gather running for hours when there is a FailedValidation error due to the BSL location.

@kaovilai
Member

kaovilai commented Dec 6, 2021

It would be cool if you could make it print info about this function when must-gather has been running for x minutes (15/30/60?).
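If that direction is pursued, one rough way to do it (purely a hypothetical sketch; the 15-minute delay and message text are assumptions) would be a background hint in the gather script:

# Hypothetical sketch: print a hint if the gather is still running after 15 minutes.
( sleep 900 && echo "must-gather has been running for 15 minutes;" \
  "consider rerunning with /usr/bin/gather_with_timeout <duration>" >&2 ) &
HINT_PID=$!

# ... main gather work runs here ...

kill "$HINT_PID" 2>/dev/null   # drop the hint if the gather finishes early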

@savitharaghunathan
Member Author

savitharaghunathan commented Dec 6, 2021

It would be cool if you could make it print info about this function when must-gather has been running for x minutes (15/30/60?).

That will definitely be useful when the tool is actually hanging due to failed backups. But if the cluster has a lot of valid resources and simply takes a long time to process them, I am afraid printing this command would prompt the user to rerun the gather. Rerunning it with a timeout will not improve the time in that case, since all the resources are valid and the process is not hanging.
I am torn between having an arbitrary timeout irrespective of the state of the cluster's resources and giving the user multiple options. If we decide on an arbitrary timeout, what should that value be? In all the test runs I did, it ranged anywhere between 2s and 7s.

@savitharaghunathan
Member Author

Another implementation could check for failed resources and apply a timeout only to those, letting the other resources run normally. This would avoid prompting the user and giving them multiple options.
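For illustration, a rough sketch of that alternative might look like the following (not code from this PR; the namespace, phase names, and the 30s value are assumptions):

#!/bin/bash
# Hypothetical sketch of the per-resource approach; namespace, phases, and
# the 30s timeout are illustrative assumptions, not part of this PR.
for name in $(oc get backups -n openshift-adp -o name | cut -d/ -f2); do
    phase=$(oc get backup "$name" -n openshift-adp -o jsonpath='{.status.phase}')
    if [[ "$phase" == "Failed" || "$phase" == "FailedValidation" ]]; then
        # Only backups that already look broken get a short timeout,
        # so they cannot hang the overall gather.
        timeout 30s oc get backup "$name" -n openshift-adp -o yaml > "backup-${name}.yaml"
    else
        oc get backup "$name" -n openshift-adp -o yaml > "backup-${name}.yaml"
    fi
done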

@kaovilai
Member

kaovilai commented Dec 9, 2021

I am afraid printing this command would prompt the user to rerun the gather.

How about printing it out at the beginning? Then it is not an indication of hanging, and they can decide at 0 minutes which route they want to take.

@openshift-ci

openshift-ci bot commented Dec 13, 2021

@savitharaghunathan: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dymurray dymurray merged commit cc449b9 into openshift:master Dec 13, 2021
@savitharaghunathan savitharaghunathan deleted the timeout_mustgather branch December 15, 2021 16:38
kaovilai added a commit to kaovilai/oadp-operator that referenced this pull request Dec 22, 2021
* api v1 upgrade CR typo (openshift#501)

* Removing kube-rbac-proxy from the containers needed (openshift#504)

* fix indentation (openshift#505)

* turn OCP versions from int to string (openshift#502)

* Ensure velero is deleted (openshift#510)

* Timeout  for must-gather failed backups (openshift#497)

* Adding known limitations to must-gather

* Adding implementation details for timeout

* remove VSPHERE env configs for csv (openshift#512)

* update troubleshooting doc (openshift#509)

* Adding support for service principal credential for Azure plugin (openshift#507)

* Adding validation for azure creds

* Adding account key check

* Adding checks to validate SP if storage key is not present

* Adding unit test#1

* Adding a test case for service principal

* Adding review comments

* Remove: logging sensitive info

* make deploy velero namespace fix (openshift#506)

* make deploy velero namespace fix

* add changes for undeploy

* add deploy-tmp-cleanup

* fix aws registry env vars (openshift#515)

* Azure SP docs (openshift#514)

* Adding Azure SP related doc

* Fixing title

* Fixing nit

* Registry should not be deployed when Azure SP is used (openshift#518)

* Registry should not be deployed when Azure SP is used

* Fixing unit tests

* Adding review comments

* Fixing typos

* Adding registry label to BSL

* Updating azure credentials documentation (openshift#519)

* AWS plugin config: BSL Region not required when s3ForcePathStyle is false and BackupImages is false (openshift#517)

* OADP-153, Close openshift#424

* Nil restic Config should delete previous restic daemonset

* only check restic config if it is not nil

* installCase wantError implement

* Make err more verbose

* commit metav1

* Changes for BackupImages considerations

* fake client fix

* removed vsphere from source manager config (openshift#520)

* Update README.md

* badge relocate (openshift#521)

Co-authored-by: Tiger Kaovilai <[email protected]>
Co-authored-by: Shawn Hurley <[email protected]>
Co-authored-by: Emily McMullan <[email protected]>
Co-authored-by: Savitha Raghunathan <[email protected]>
Co-authored-by: Wesley Hayutin <[email protected]>
Co-authored-by: Dylan Murray <[email protected]>