-
Notifications
You must be signed in to change notification settings - Fork 61
Check devworkspace pod events for "Unrecoverable" events #549
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Add an indexer to the internal controller cache to track events indexed by involvedObject.Name. This allows the controller to list events with the fieldSelector "involvedObject.name=<pod-name>" to query events specific to a workspace Pod. Signed-off-by: Angel Misevski <[email protected]>
8588a62
to
86340c5
Compare
"CrashLoopBackOff", | ||
"ImagePullBackOff", | ||
"CreateContainerError", | ||
"RunContainerError", | ||
} | ||
|
||
var unrecoverablePodEventReasons = []string{ | ||
"FailedMount", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On some infrastructure, FailedMount or FailedScheduling can be an usual situation when you just need to be more patient and then on Che we suggest users to drop them from unrecoverable events list.
We should have it configurable in some way or let users annotate particular workspace, maybe get it from namespaced configmap settings if present.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like an annotation system might complicate things too much -- we should just do what we think is best (whether that's check or don't check).
One issue I've seen in Che before is that there can be a bit of a thundering herd problem -- if I've got 10 workspaces waiting for the common PVC to be mountable, it can be very confusing because they'll randomly start up and have to be stopped until it gets to the one I want to actually use.
One option that just occurred to me from looking at the Event struct a bit more -- what if we have a threshold for "number of times the event has occurred"? We could fail workspace start if e.g. we've seen the FailedMount
even three times.
edit: tested checking the count, and for e.g. a non-existent PVC, it never goes past 1 :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created #550 to follow up
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: amisevsk, JPinkney, sleshchenko The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
New changes are detected. LGTM label has been removed. |
Add functionality to check pod events during reconciles, in order to detect unrecoverable states in workspace startup that aren't reflected in the pod's status. Signed-off-by: Angel Misevski <[email protected]>
Signed-off-by: Angel Misevski <[email protected]>
7f265b1
to
05f102f
Compare
/test v7-devworkspaces-operator-e2e, v7-devworkspace-happy-path |
@amisevsk: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
What does this PR do?
Checks events related to the DevWorkspace pod for what we would consider "unrecoverable" events (list stolen from Che 😛)
This requires adding an index to the manager, to allow us to list events with the appropriate fieldSelector.
Also adds
get/list/watch
Events RBAC to the controller.What issues does this PR fix or reference?
Closes #465
Is it tested? How?
It's hard to reproduce these events manually; the easiest way I found was to patch
commonStorage.go
to readand trying to start a non-ephemeral DevWorkspace.
PR Checklist
/test v7-devworkspaces-operator-e2e, v7-devworkspace-happy-path
to trigger)v7-devworkspaces-operator-e2e
: DevWorkspace e2e testv7-devworkspace-happy-path
: DevWorkspace e2e test