
Conversation

savitharaghunathan
Member

Failed backups can cause must-gather to be slow. This PR adds a timeout that can be passed as an argument to the gather command.

To test:
Build and push the must-gather image to your quay repo, then run:
oc adm must-gather --image=<your repo path> -- /usr/bin/gather_with_timeout <timeout_value_in_seconds>
For example:
oc adm must-gather --image=quay.io/sraghuna/must-gather1:43 -- /usr/bin/gather_with_timeout 3s
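For reference, below is a minimal sketch of how such a timeout wrapper could be structured. This is not the actual gather_with_timeout script from this PR; it assumes the coreutils timeout command is available in the must-gather image, and the namespace and collection commands are only illustrative.

#!/bin/bash
# Illustrative sketch only; the real gather_with_timeout may be organized differently.
# Assumes the coreutils `timeout` binary is present in the must-gather image.
TIMEOUT="${1:?usage: gather_with_timeout <duration, e.g. 3s>}"

run_with_timeout() {
    # Give up on a single collection step after $TIMEOUT so a hung backup
    # (e.g. one stuck on a failed BSL validation) cannot stall the whole gather.
    timeout "${TIMEOUT}" "$@" || echo "skipped (timed out or failed): $*" >&2
}

# Example collection steps (namespace and resources are placeholders):
run_with_timeout oc get backups -n openshift-adp -o yaml
run_with_timeout oc get backupstoragelocations -n openshift-adp -o yaml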

/assign @dymurray

@openshift-ci bot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Dec 6, 2021
@openshift-ci bot removed the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Dec 6, 2021
@kaovilai
Member

kaovilai commented Dec 6, 2021

Do you think there should be a default timeout, and only customize it if that's not enough?

@savitharaghunathan
Member Author

savitharaghunathan commented Dec 6, 2021

I am not sure we need a default value. In most of my test runs, the command finished quickly without specifying a timeout value. The only reason we are doing this is to avoid must-gather running for hours when there is a FailedValidation error due to the BSL location.

@kaovilai
Member

kaovilai commented Dec 6, 2021

It would be cool if you could make it print info about this function when must-gather has been running for x minutes (15/30/60?).
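If that direction is pursued, one rough way to do it (purely a hypothetical sketch; the 15-minute delay and message text are assumptions) would be a background hint in the gather script:

# Hypothetical sketch: print a hint if the gather is still running after 15 minutes.
( sleep 900 && echo "must-gather has been running for 15 minutes;" \
  "consider rerunning with /usr/bin/gather_with_timeout <duration>" >&2 ) &
HINT_PID=$!

# ... main gather work runs here ...

kill "$HINT_PID" 2>/dev/null   # drop the hint if the gather finishes early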

@savitharaghunathan
Member Author

savitharaghunathan commented Dec 6, 2021

It would be cool if you could make it print info about this function when must-gather has been running for x minutes (15/30/60?).

That will definitely be useful when the tool is actually hanging due to failed backups. But if the cluster has a lot of valid resources and simply takes a long time to process them, I am afraid printing this command would prompt the user to rerun the gather. Rerunning it with a timeout will not improve the time in that case, since all the resources are valid and the process is not hanging.
I am torn between having an arbitrary timeout irrespective of the state of the cluster's resources and giving the user multiple options. If we decide on an arbitrary timeout, what should that value be? In all the test runs I did, it ranged anywhere between 2s and 7s.

@savitharaghunathan
Member Author

Another implementation could check for failed resources and apply a timeout only to those, letting the other resources run normally. This would avoid prompting the user and giving them multiple options.
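For illustration, a rough sketch of that alternative might look like the following (not code from this PR; the namespace, phase names, and the 30s value are assumptions):

#!/bin/bash
# Hypothetical sketch of the per-resource approach; namespace, phases, and
# the 30s timeout are illustrative assumptions, not part of this PR.
for name in $(oc get backups -n openshift-adp -o name | cut -d/ -f2); do
    phase=$(oc get backup "$name" -n openshift-adp -o jsonpath='{.status.phase}')
    if [[ "$phase" == "Failed" || "$phase" == "FailedValidation" ]]; then
        # Only backups that already look broken get a short timeout,
        # so they cannot hang the overall gather.
        timeout 30s oc get backup "$name" -n openshift-adp -o yaml > "backup-${name}.yaml"
    else
        oc get backup "$name" -n openshift-adp -o yaml > "backup-${name}.yaml"
    fi
done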

@kaovilai
Member

kaovilai commented Dec 9, 2021

I am afraid printing this command would prompt the user to rerun the gather.

How about printing it out at the beginning? Then it is not an indication of hanging, and they can decide at 0 minutes which route they want to take.

@openshift-ci

openshift-ci bot commented Dec 13, 2021

@savitharaghunathan: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dymurray dymurray merged commit cc449b9 into openshift:master Dec 13, 2021
@savitharaghunathan savitharaghunathan deleted the timeout_mustgather branch December 15, 2021 16:38
kaovilai added a commit to kaovilai/oadp-operator that referenced this pull request Dec 22, 2021
* api v1 upgrade CR typo (openshift#501)

* Removing kube-rbac-proxy from the containers needed (openshift#504)

* fix indentation (openshift#505)

* turn OCP versions from int to string (openshift#502)

* Ensure velero is deleted (openshift#510)

* Timeout  for must-gather failed backups (openshift#497)

* Adding known limitations to must-gather

* Adding implementation details for timeout

* remove VSPHERE env configs for csv (openshift#512)

* update troubleshooting doc (openshift#509)

* Adding support for service principal credential for Azure plugin (openshift#507)

* Adding validation for azure creds

* Adding account key check

* Adding checks to validate SP if storage key is not present

* Adding unit test#1

* Adding a test case for service principal

* Adding review comments

* Remove: logging sensitive info

* make deploy velero namespace fix (openshift#506)

* make deploy velero namespace fix

* add changes for undeploy

* add deploy-tmp-cleanup

* fix aws registry env vars (openshift#515)

* Azure SP docs (openshift#514)

* Adding Azure SP related doc

* Fixing title

* Fixing nit

* Registry should not be deployed when Azure SP is used (openshift#518)

* Registry should not be deployed when Azure SP is used

* Fixing unit tests

* Adding review comments

* Fixing typos

* Adding registry label to BSL

* Updating azure credentials documentation (openshift#519)

* AWS plugin config: BSL Region not required when s3ForcePathStyle is false and BackupImages is false (openshift#517)

* OADP-153, Close openshift#424

* Nil restic Config should delete previous restic daemonset

* only check restic config if it is not nil

* installCase wantError implement

* Make err more verbose

* commit metav1

* Changes for BackupImages considerations

* fake client fix

* removed vsphere from source manager config (openshift#520)

* Update README.md

* badge relocate (openshift#521)

Co-authored-by: Tiger Kaovilai <[email protected]>
Co-authored-by: Shawn Hurley <[email protected]>
Co-authored-by: Emily McMullan <[email protected]>
Co-authored-by: Savitha Raghunathan <[email protected]>
Co-authored-by: Wesley Hayutin <[email protected]>
Co-authored-by: Dylan Murray <[email protected]>