Timeout for must-gather failed backups #497
Conversation
Do you think there should be a default timeout, with customization only if that's not enough?
I am not sure we need a default value. In most of my test runs, the command finished quickly even without specifying a timeout. The only reason we are adding this is to avoid must-gather running for hours when there is a FailedValidation error due to the BSL location.
It would be cool if you could make it spit out info about this option when must-gather has been running for x minutes (15/30/60?).
It will definitely be useful when the tool is actually hanging due to failed backups. But if the cluster has a lot of valid resources and simply takes a long time to process them, I am afraid printing that message would prompt the user to rerun the gather. Rerunning it with a timeout will not improve the time in that case, since all the resources are valid and the process is not hanging.
Another implementation could check for failed resources and apply a timeout only to those, letting the other resources run normally; see the sketch below. This would avoid prompting the user with multiple options.
How about printing it at the beginning? Then it is not an indication of hanging, and they can decide at minute zero which route they want to take.
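For illustration, a minimal sketch of that per-resource alternative in bash, assuming a hypothetical gather_backup helper and an OADP namespace; NS, BACKUP_TIMEOUT, and gather_backup are placeholders, not code from this PR:

#!/bin/bash
# Hypothetical sketch of the per-resource idea above; NS, BACKUP_TIMEOUT,
# and gather_backup are illustrative names, not code from this PR.

NS=${NS:-oadp-operator}                # assumed operator namespace
BACKUP_TIMEOUT=${BACKUP_TIMEOUT:-300}  # assumed per-backup cap, in seconds

for backup in $(oc get backups.velero.io -n "$NS" -o name); do
  phase=$(oc get "$backup" -n "$NS" -o jsonpath='{.status.phase}')
  if [[ "$phase" == "Failed" || "$phase" == "PartiallyFailed" ]]; then
    # Only the backups already known to be failed get a bounded gather.
    timeout "$BACKUP_TIMEOUT" gather_backup "$backup"
  else
    gather_backup "$backup"            # healthy backups gather normally
  fi
done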
@savitharaghunathan: all tests passed!
* api v1 upgrade CR typo (openshift#501)
* Removing kube-rbac-proxy from the containers needed (openshift#504)
* fix indentation (openshift#505)
* turn OCP versions from int to string (openshift#502)
* Ensure velero is deleted (openshift#510)
* Timeout for must-gather failed backups (openshift#497)
* Adding known limitations to must-gather
* Adding implementation details for timeout
* remove VSPHERE env configs for csv (openshift#512)
* update troubleshooting doc (openshift#509)
* Adding support for service principal credential for Azure plugin (openshift#507)
* Adding validation for azure creds
* Adding account key check
* Adding checks to validate SP if storage key is not present
* Adding unit test#1
* Adding a test case for service principal
* Adding review comments
* Remove: logging sensitive info
* make deploy velero namespace fix (openshift#506)
* make deploy velero namespace fix
* add changes for undeploy
* add deploy-tmp-cleanup
* fix aws registry env vars (openshift#515)
* Azure SP docs (openshift#514)
* Adding Azure SP related doc
* Fixing title
* Fixing nit
* Registry should not be deployed when Azure SP is used (openshift#518)
* Registry should not be deployed when Azure SP is used
* Fixing unit tests
* Adding review comments
* Fixing typos
* Adding registry label to BSL
* Updating azure credentials documentation (openshift#519)
* AWS plugin config: BSL Region not required when s3ForcePathStyle is false and BackupImages is false (openshift#517)
* OADP-153, Close openshift#424
* Nil restic Config should delete previous restic daemonset
* only check restic config if it is not nil
* installCase wantError implement
* Make err more verbose
* commit metav1
* Changes for BackupImages considerations
* fake client fix
* removed vsphere from source manager config (openshift#520)
* Update README.md
* badge relocate (openshift#521)

Co-authored-by: Tiger Kaovilai <[email protected]>
Co-authored-by: Shawn Hurley <[email protected]>
Co-authored-by: Emily McMullan <[email protected]>
Co-authored-by: Savitha Raghunathan <[email protected]>
Co-authored-by: Wesley Hayutin <[email protected]>
Co-authored-by: Dylan Murray <[email protected]>
Failed backups can cause must-gather to be slow. This PR implements a timeout that can be passed as an argument to the gather command.
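For context, a minimal sketch of what such an entrypoint could look like, assuming it simply wraps the standard /usr/bin/gather script with coreutils timeout; the actual gather_with_timeout script in this PR may differ:

#!/bin/bash
# Minimal sketch, assuming the wrapper hands the gather script to coreutils
# timeout; the actual gather_with_timeout in this PR may differ.

TIMEOUT="$1"   # e.g. "3s" or "300"; required, since the discussion above settled on no default

# timeout sends SIGTERM once the duration elapses and exits with status 124.
timeout "$TIMEOUT" /usr/bin/gather
if [ $? -eq 124 ]; then
  echo "Gather timed out after ${TIMEOUT}; collected data may be partial." >&2
fi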
To test:
Build and push the must-gather image to your quay repo, then run:
oc adm must-gather --image=<your repo path> -- /usr/bin/gather_with_timeout <timeout_value_in_seconds>
e.g.:
oc adm must-gather --image=quay.io/sraghuna/must-gather1:43 -- /usr/bin/gather_with_timeout 3s
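Note: assuming the value is passed straight to coreutils timeout, a bare number is read as seconds and the suffixes s, m, h, and d are also accepted, which is why 3s works alongside the <timeout_value_in_seconds> placeholder above.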
/assign @dymurray