Generate fatal error when cri plugin fail to start.#794
Conversation
|
Log when fail to start streaming server: Log for graceful stop: |
|
Hey @Random-Liu thanks so much for the quick fix! This change will definitely improve my setup and hopefully it helps others too. As always thanks to the team for the work on |
| errCh := make(chan error) | ||
| if em.ch == nil || em.errCh == nil { | ||
| return nil, errors.New("event channel is nil") | ||
| logrus.Fatal("event channel is nil") |
There was a problem hiding this comment.
If this case is a programmer error, not initializing channels, then you should panic() not log fatal
There was a problem hiding this comment.
Will do. Thought panic and fatal are same thing.
There was a problem hiding this comment.
Same outcome yes, but panic means "I messed up in the code" and a log message with an exit(1) usually just means an application/runtime error, not developer error
| close(closeCh) | ||
| if err != nil { | ||
| logrus.WithError(err).Errorf("Failed to handle event stream") | ||
| errCh <- err |
There was a problem hiding this comment.
How do we know the difference between an bootup error and a normal runtime error? Do we care?
There was a problem hiding this comment.
Since there is no grpc in between now, Subscribe either generates a real error https://github.com/containerd/containerd/blob/master/events/exchange/exchange.go#L153, https://github.com/containerd/containerd/blob/master/events/exchange/exchange.go#L194, or an error which should never happen https://github.com/containerd/containerd/blob/master/events/exchange/exchange.go#L178.
| // a channel for the caller to wait for the event monitor to stop. start must be called after | ||
| // subscribe. | ||
| func (em *eventMonitor) start() (<-chan struct{}, error) { | ||
| func (em *eventMonitor) start() <-chan error { |
There was a problem hiding this comment.
cool.. Should probably change the function declaration comments to reflect the change in return value. For example, "It returns an error return channel to the caller. Callers should wait for errors to be returned over the error return channel, such as Failed to handle event stream."
| case err := <-em.errCh: | ||
| logrus.WithError(err).Error("Failed to handle event stream") | ||
| close(closeCh) | ||
| if err != nil { |
There was a problem hiding this comment.
comment here would be nice // if errCH is nil just return (and close errCh) do not report error with event stream
| if err != nil { | ||
| return errors.Wrap(err, "failed to start event monitor") | ||
| } | ||
| eventMonitorErrCh := c.eventMonitor.start() |
There was a problem hiding this comment.
you can still check if eventMonitorErrCh == nil and leave the old return message if it does, though the make chan error should always work...
There was a problem hiding this comment.
Yeah, don't think we need to check. :) If it is nil, it is our problem.
| } | ||
|
|
||
| <-eventMonitorCloseCh | ||
| if err := <-eventMonitorErrCh; err != nil { |
There was a problem hiding this comment.
? not sure what's going on here. Should we be overwriting eventMonitorErr if it was set in the above select?
There was a problem hiding this comment.
The above select only waits for one of event monitor and stream server to stop, we don't know which one.
| case <-streamServerCloseCh: | ||
| case err := <-streamServerErrCh: | ||
| if err != nil { | ||
| streamServerErr = err |
There was a problem hiding this comment.
ditto.
Actually, I plan to move the channel wait logic into the stop function, which should make things more clear.
Basically what we are doing is:
- Wait for either event monitor and stream server to stop;
- Event monitor and stream server are both important system component. If one of them stops, we gracefully stop CRI plugin and report an error.
014e38e to
8b51a8e
Compare
Signed-off-by: Lantao Liu <lantaol@google.com>
8b51a8e to
b870ee7
Compare
|
LGTM |
Generate fatal error when cri plugin fail to start.
|
Hey how would I find out when this PR will make it into [EDIT] - I want to note that I've solved this by including these requirements in the Up until now I was only using It's cut off but it was something like |
|
@Random-Liu The VM booted without a NIC and I added one later. After restarting the containerd service everything started to work. Log from containerd: Versions: |
Fixes containerd/containerd#2371.
This PR:
Runreturn error if an error is received from the error channels of event monitor or streaming server.http.ErrServerClosedwhich is a normal error generated when http server is Shutdown/Closed.@t3hmrman @crosbymichael
Signed-off-by: Lantao Liu lantaol@google.com