Device Cache initialization slowing down csi-node when --node-name isn't specified #2200

@Fricounet

PR #2141 introduced a new deviceCache that relies on the --node-name flag to work. However, nothing prevents the user from leaving this flag unset, and when that happens, initialization of the node pods is delayed significantly because the driver has to wait out the full retry backoff:

gcp-pd-csi-driver-node-vbl4c csi-plugin I1014 07:32:24.784762       1 main.go:125] Operating compute environment set to: production and computeEndpoint is set to: <nil>
gcp-pd-csi-driver-node-vbl4c csi-plugin I1014 07:32:24.785339       1 main.go:134] Sys info: NumCPU: 15 MAXPROC: 1
gcp-pd-csi-driver-node-vbl4c csi-plugin I1014 07:32:24.785357       1 main.go:139] Driver vendor version v1.21.4-dd.202541
gcp-pd-csi-driver-node-vbl4c csi-plugin I1014 07:32:24.786456       1 mount_linux.go:316] Cannot create temp dir to detect safe 'not mounted' behavior: mkdir /tmp/kubelet-detect-safe-umount3379456297: read-only file system
...
gcp-pd-csi-driver-node-vbl4c csi-plugin I1014 07:32:24.790026       1 request.go:1178] Error in request: resource name may not be empty
gcp-pd-csi-driver-node-vbl4c csi-plugin W1014 07:32:24.790071       1 node.go:37] Error getting node : resource name may not be empty, retrying...
gcp-pd-csi-driver-node-vbl4c csi-plugin I1014 07:32:25.790198       1 request.go:1178] Error in request: resource name may not be empty
gcp-pd-csi-driver-node-vbl4c csi-plugin W1014 07:32:25.790245       1 node.go:37] Error getting node : resource name may not be empty, retrying...
gcp-pd-csi-driver-node-vbl4c csi-plugin I1014 07:32:27.791374       1 request.go:1178] Error in request: resource name may not be empty
gcp-pd-csi-driver-node-vbl4c csi-plugin W1014 07:32:27.791414       1 node.go:37] Error getting node : resource name may not be empty, retrying...
gcp-pd-csi-driver-node-vbl4c csi-plugin I1014 07:32:31.793285       1 request.go:1178] Error in request: resource name may not be empty
gcp-pd-csi-driver-node-vbl4c csi-plugin W1014 07:32:31.793325       1 node.go:37] Error getting node : resource name may not be empty, retrying...
gcp-pd-csi-driver-node-vbl4c csi-plugin I1014 07:32:39.793657       1 request.go:1178] Error in request: resource name may not be empty
gcp-pd-csi-driver-node-vbl4c csi-plugin W1014 07:32:39.793724       1 node.go:37] Error getting node : resource name may not be empty, retrying...
gcp-pd-csi-driver-node-vbl4c csi-plugin E1014 07:32:39.793731       1 node.go:46] Failed to get node  after retries: timed out waiting for the condition
gcp-pd-csi-driver-node-vbl4c csi-plugin W1014 07:32:39.793767       1 main.go:283] Failed to create device cache: failed to get node : timed out waiting for the condition
...
gcp-pd-csi-driver-node-vbl4c csi-plugin I1014 07:32:39.794316       1 gce-pd-driver.go:187] Driver: pd.csi.storage.gke.io

That's 15s lost (07:32:24 to 07:32:39 in the logs above) that could be avoided, and it significantly slows down the rollout of the driver on large clusters.
I know this can be avoided by configuring the flag, but I think the situation could also be improved with a saner default.
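
For reference, the 15s figure matches what the timestamps suggest: five attempts separated by waits of 1s, 2s, 4s and 8s. The sketch below only reproduces that arithmetic. The initial delay, factor and attempt count are inferred from the log, not read from the driver's source, and `totalWait` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// totalWait returns the time spent sleeping between `attempts` tries of an
// exponential backoff that starts at `initial` and multiplies the wait by
// `factor` after each sleep. There is no sleep after the last attempt.
// The values used in main are ASSUMED from the log timestamps.
func totalWait(initial time.Duration, factor float64, attempts int) time.Duration {
	var total time.Duration
	wait := initial
	for i := 1; i < attempts; i++ {
		total += wait
		wait = time.Duration(float64(wait) * factor)
	}
	return total
}

func main() {
	// 1s + 2s + 4s + 8s between the five "Error getting node" attempts.
	fmt.Println(totalWait(time.Second, 2, 5)) // prints "15s"
}
```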

I have 2 ideas:

  1. Check that nodeName isn't an empty string before running NewDeviceCacheForNode.
  2. Or, if the deviceCache needs to run in all cases, make the flag mandatory.
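
A minimal sketch of the first idea, to make the proposal concrete. I haven't copied the real signature of NewDeviceCacheForNode, so `newDeviceCacheForNode`, `deviceCache`, `maybeNewDeviceCache` and `errEmptyNodeName` below are hypothetical stand-ins. The only point is the early return before any API call or retry loop is started.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// deviceCache is a hypothetical stand-in for the driver's cache type.
type deviceCache struct {
	nodeName string
}

// errEmptyNodeName is a hypothetical sentinel error for an unset --node-name.
var errEmptyNodeName = errors.New("node name is empty, skipping device cache")

// newDeviceCacheForNode stands in for the driver's NewDeviceCacheForNode.
// The real function looks the node up (with retries); this stub does not.
func newDeviceCacheForNode(nodeName string) (*deviceCache, error) {
	return &deviceCache{nodeName: nodeName}, nil
}

// maybeNewDeviceCache fails fast when no node name was configured, so the
// node lookup and its backoff are never started.
func maybeNewDeviceCache(nodeName string) (*deviceCache, error) {
	if strings.TrimSpace(nodeName) == "" {
		return nil, errEmptyNodeName
	}
	return newDeviceCacheForNode(nodeName)
}

func main() {
	if _, err := maybeNewDeviceCache(""); err != nil {
		fmt.Println("device cache disabled:", err)
	}
}
```

With a guard like this, the existing "Failed to create device cache" warning could still be logged, just without the 15s wait in front of it.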

What do you think?