This repository was archived by the owner on Mar 9, 2022. It is now read-only.

Generate fatal error when cri plugin fail to start. #794

Merged
Random-Liu merged 1 commit into containerd:master from Random-Liu:panic-for-cri-start-failure
May 31, 2018

@Random-Liu Random-Liu commented May 31, 2018

Member

Fixes containerd/containerd#2371.

This PR:

  1. Lets the event monitor and the streaming server each return an error channel.
  2. Lets the CRI plugin's Run return an error if one is received from the error channel of the event monitor or the streaming server.
  3. Ignores http.ErrServerClosed, which is the normal error an HTTP server returns after Shutdown/Close.

@t3hmrman @crosbymichael
Signed-off-by: Lantao Liu <lantaol@google.com>

Random-Liu commented May 31, 2018

Member Author

Log when the streaming server fails to start:

INFO[0000] Connect containerd service                   
INFO[0000] Get image filesystem path "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs" 
INFO[0000] loading plugin "io.containerd.grpc.v1.introspection"...  type=io.containerd.grpc.v1
INFO[0000] Start subscribing containerd event           
INFO[0000] Start recovering state                       
INFO[0000] serving...                                    address="/run/containerd/containerd.sock"
INFO[0000] containerd successfully booted in 0.005218s  
INFO[0000] Start event monitor                          
INFO[0000] Start snapshots syncer                       
INFO[0000] Start streaming server                       
ERRO[0000] Failed to start streaming server              error="stayUp=false is not yet implemented"
INFO[0000] Stop CRI service                             
INFO[0000] Event monitor stopped                        
INFO[0000] Stream server stopped                        
FATA[0000] Failed to run CRI service                     error="stream server error: stayUp=false is not yet implemented"

Log for graceful stop:

INFO[0000] Connect containerd service                   
INFO[0000] Get image filesystem path "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs" 
INFO[0000] loading plugin "io.containerd.grpc.v1.introspection"...  type=io.containerd.grpc.v1
INFO[0000] Start subscribing containerd event           
INFO[0000] Start recovering state                       
INFO[0000] serving...                                    address="/run/containerd/containerd.sock"
INFO[0000] containerd successfully booted in 0.005665s  
INFO[0000] Start event monitor                          
INFO[0000] Start snapshots syncer                       
INFO[0000] Start streaming server                       
^CINFO[0002] Stop CRI service                             
INFO[0002] Stop CRI service                             
INFO[0002] Event monitor stopped                        
INFO[0002] Stream server stopped

@t3hmrman


Hey @Random-Liu thanks so much for the quick fix! This change will definitely improve my setup and hopefully it helps others too.

As always thanks to the team for the work on containerd (and of course cri-containerd)

Comment thread pkg/server/events.go Outdated
+errCh := make(chan error)
 if em.ch == nil || em.errCh == nil {
-return nil, errors.New("event channel is nil")
+logrus.Fatal("event channel is nil")

Member

If this case is a programmer error (not initializing channels), then you should panic(), not log fatal.

Member Author

Will do. I thought panic and fatal were the same thing.

Member

Same outcome, yes, but panic means "I messed up in the code", while a log message with an exit(1) usually just means an application/runtime error, not a developer error.
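
The distinction above can be sketched roughly as follows. The `monitor` type, `listenAddr`, and `startPanics` are hypothetical names made up for this example, and the standard library stands in for logrus so the snippet stays self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// monitor is a hypothetical stand-in for a component whose channel must be
// initialized before start is called.
type monitor struct {
	ch chan string
}

// start panics if ch was never initialized. Only a bug in the calling code
// can cause that, so there is nothing useful for a caller to handle.
func (m *monitor) start() {
	if m.ch == nil {
		panic("event channel is nil")
	}
}

// listenAddr fails for reasons outside the program's control (for example a
// missing configuration value), so it returns an error for the caller to log.
func listenAddr(addr string) (string, error) {
	if addr == "" {
		return "", errors.New("no stream server address configured")
	}
	return addr, nil
}

// startPanics reports whether m.start() panicked. It exists only to make the
// demonstration below observable without crashing the program.
func startPanics(m *monitor) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	m.start()
	return false
}

func main() {
	fmt.Println("uninitialized monitor panics:", startPanics(&monitor{}))

	// A real program would log this at fatal level and exit non-zero.
	_, err := listenAddr("")
	fmt.Println("runtime error to report before exiting:", err)
}
```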

Member Author

Thanks, :D Got it.

Member Author

Done.

Comment thread pkg/server/events.go
close(closeCh)
if err != nil {
logrus.WithError(err).Errorf("Failed to handle event stream")
errCh <- err

Member

How do we know the difference between a bootup error and a normal runtime error? Do we care?

@Random-Liu Random-Liu May 31, 2018

Member Author

@mikebrow mikebrow left a comment

Member

See comments / questions.

Comment thread pkg/server/events.go
// a channel for the caller to wait for the event monitor to stop. start must be called after
// subscribe.
-func (em *eventMonitor) start() (<-chan struct{}, error) {
+func (em *eventMonitor) start() <-chan error {

Member

cool.. Should probably change the function declaration comments to reflect the change in return value. For example, "It returns an error return channel to the caller. Callers should wait for errors to be returned over the error return channel, such as Failed to handle event stream."

Member Author

Done.

Comment thread pkg/server/events.go
case err := <-em.errCh:
logrus.WithError(err).Error("Failed to handle event stream")
close(closeCh)
if err != nil {

Member

A comment here would be nice, e.g. // if errCh is nil, just return (and close errCh); do not report an error with the event stream

Member Author

Done

Comment thread pkg/server/service.go
if err != nil {
return errors.Wrap(err, "failed to start event monitor")
}
eventMonitorErrCh := c.eventMonitor.start()

Member

You can still check if eventMonitorErrCh == nil and leave the old return message if it is, though the make(chan error) should always work...

Member Author

Yeah, don't think we need to check. :) If it is nil, it is our problem.

Comment thread pkg/server/service.go
}

<-eventMonitorCloseCh
if err := <-eventMonitorErrCh; err != nil {

Member

? not sure what's going on here. Should we be overwriting eventMonitorErr if it was set in the above select?

Member Author

The above select only waits for either the event monitor or the stream server to stop; we don't know which one.

Comment thread pkg/server/service.go
case <-streamServerCloseCh:
case err := <-streamServerErrCh:
if err != nil {
streamServerErr = err

Member

same comment as above ..

@Random-Liu Random-Liu May 31, 2018

Member Author

ditto.

Actually, I plan to move the channel wait logic into the stop function, which should make things clearer.

Basically what we are doing is:

  1. Wait for either the event monitor or the stream server to stop;
  2. The event monitor and the stream server are both important system components. If one of them stops, we gracefully stop the CRI plugin and report an error.

Member

kk cool

@Random-Liu Random-Liu force-pushed the panic-for-cri-start-failure branch from 014e38e to 8b51a8e Compare May 31, 2018 17:32
Signed-off-by: Lantao Liu <lantaol@google.com>
@Random-Liu Random-Liu force-pushed the panic-for-cri-start-failure branch from 8b51a8e to b870ee7 Compare May 31, 2018 17:49
@crosbymichael

Member

LGTM

@mikebrow mikebrow left a comment

Member

LGTM

@mikebrow mikebrow added the lgtm label May 31, 2018
@Random-Liu Random-Liu merged commit 578b34f into containerd:master May 31, 2018
@Random-Liu Random-Liu deleted the panic-for-cri-start-failure branch May 31, 2018 20:21
Random-Liu referenced this pull request May 31, 2018
Generate fatal error when cri plugin fail to start.

t3hmrman commented Jul 30, 2018


Hey how would I find out when this PR will make it into containerd? I just ran into this issue today on containerd 1.1.2 so I'm assuming it's not in yet?

[EDIT] - I want to note that I've solved this by including these requirements in the [Unit] section of my /etc/systemd/system/containerd.service file which manages starting containerd:

Wants=network-online.target
Requires=network-online.target
After=network-online.target

Up until now I was only using Requires and After, and was pointing them at network.target (so two main differences). Changing to the code above seemed to ensure that containerd starts properly without any intervention. Again the error I was getting was:

level=warning msg="failed to load plugin io.containerd.grpc.v1.cri" error="failed to create CRI service: failed to create stream server: failed to get stream server address: No

It's cut off but it was something like no route IIRC -- the error seemed network related so I figured maybe containerd was trying to start too early or something. The problem was consistently happening @ startup but I've restarted twice now and containerd has started correctly!

yvespp commented Aug 2, 2018


@Random-Liu
I ran into this as well. Error from kubelet and crictl:

root@master1:~# crictl version
FATA[0000] getting the runtime version failed: rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.RuntimeService

The VM booted without a NIC and I added one later. After restarting the containerd service everything started to work.

Log from containerd:

Aug 02 13:39:14 master1.localdomain containerd[917]: time="2018-08-02T13:39:14Z" level=info msg="Start cri plugin with config {PluginConfig:{ContainerdConfig:{Snapshotter:overlayfs DefaultRuntime:{Type:io.containerd.runtime.v1.linux Engine: Root:} UntrustedWorkloadRuntime:{Type: Engine: Root:}} CniConfig:{NetworkPluginBinDir:/opt/cni/bin NetworkPluginConfDir:/etc/cni/net.d NetworkPluginConfTemplate:} Registry:{Mirrors:map[docker.io:{Endpoints:[https://registry-1.docker.io]}]} StreamServerAddress: StreamServerPort:10010 EnableSelinux:false SandboxImage:docker-registry.mobicorp.ch/gcr/pause:3.1 StatsCollectPeriod:10 SystemdCgroup:false EnableTLSStreaming:false MaxContainerLogLineSize:16384} ContainerdRootDir:/var/lib/containerd ContainerdEndpoint:/run/containerd/containerd.sock RootDir:/var/lib/containerd/io.containerd.grpc.v1.cri StateDir:/run/containerd/io.containerd.grpc.v1.cri}"
Aug 02 13:39:14 master1.localdomain containerd[917]: time="2018-08-02T13:39:14Z" level=info msg="Connect containerd service"
Aug 02 13:39:14 master1.localdomain containerd[917]: time="2018-08-02T13:39:14Z" level=info msg="Get image filesystem path "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs""
Aug 02 13:39:14 master1.localdomain containerd[917]: time="2018-08-02T13:39:14Z" level=error msg="Failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
Aug 02 13:39:14 master1.localdomain containerd[917]: time="2018-08-02T13:39:14Z" level=warning msg="failed to load plugin io.containerd.grpc.v1.cri" error="failed to create CRI service: failed to create stream server: failed to get stream server address: No default routes."
Aug 02 13:39:14 master1.localdomain containerd[917]: time="2018-08-02T13:39:14Z" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." type=io.containerd.grpc.v1
Aug 02 13:39:14 master1.localdomain containerd[917]: time="2018-08-02T13:39:14Z" level=info msg=serving... address="/run/containerd/containerd.sock"
Aug 02 13:39:14 master1.localdomain containerd[917]: time="2018-08-02T13:39:14Z" level=info msg="containerd successfully booted in 0.157800s"

Versions:

root@master1:~# crictl version
Version:  0.1.0
RuntimeName:  containerd
RuntimeVersion:  v1.1.2
RuntimeApiVersion:  v1alpha2

root@master1:~# kubelet --version
Kubernetes v1.11.1

root@master1:~# containerd --version
containerd github.com/containerd/containerd v1.1.2 468a545b9edcd5932818eb9de8e72413e616e86e
