
NodeUnpublishVolume hangs indefinitely when umount(2) blocks, permanently holding the operation lock #495

@jspiewak


Describe the bug

When a pod using an FSx for Lustre volume is deleted, the CSI node plugin calls umount(2) on
the volume's target path as part of NodeUnpublishVolume. If the umount(2) syscall blocks
(e.g., due to a temporarily unresponsive FSx endpoint or an in-progress I/O flush that never
completes), the goroutine executing NodeUnpublishVolume never returns. Because the driver
acquires a per-volume operation lock before beginning the unmount, and holds it for the duration,
that lock is never released.

All subsequent NodePublishVolume, NodeUnpublishVolume, NodeStageVolume, and
NodeUnstageVolume calls on the same volume fail immediately with:

operation for volume <volume-id> is already in progress

The result is that the original pod stays in Terminating forever. On our cluster we have
observed pods stuck in this state for 44+ days. There is no self-healing path: the driver has
no lock timeout, and kubelet has no way to interrupt the in-flight syscall. The only recovery is
to restart the CSI node DaemonSet pod on the affected node, which clears the in-memory lock
(using kubectl delete pod --force --grace-period=0 if the CSI node pod itself is also stuck).

Steps to reproduce / how we detected it

We do not have a reliable reproduction recipe, but the failure is detectable through the following
observable state:

  1. Pod is in Terminating with a deletion timestamp set weeks or months ago.
  2. Describing the pod shows it is waiting on a CSI volume to be unmounted.
  3. Logs from the fsx-csi-node pod on the same node contain a stream of entries like:
    An error occurred in NodeUnpublishVolume, operation for volume <id> is already in progress
    
  4. No corresponding "operation complete" log entry exists — the original operation never finished.
  5. Restarting the fsx-csi-node pod (force-delete if also stuck) unblocks the termination
    immediately.

Environment

  • aws-fsx-csi-driver version: v1.8.0 (v1.9.0 appears affected too, based on a review of its changelog)
  • Helm chart version: 1.15.0
  • Kubernetes: Amazon EKS
  • FSx filesystem type: FSx for Lustre

Root cause

The NodeUnpublishVolume gRPC handler acquires a per-volume lock, then calls mounter.Unmount(),
which wraps the umount(2) syscall. The gRPC context (which carries the kubelet-imposed deadline)
is not propagated to the mount operation, so the syscall runs without any timeout or cancellation
mechanism. If the kernel call blocks, the lock is held until the node plugin is restarted.

Suggested fix

One or more of the following approaches would address this:

  1. Context-aware unmount: Run the umount(2) call in a goroutine, select on the gRPC context's
    Done channel and a completion channel, and return an error (releasing the lock) if the
    context is cancelled or its deadline expires before the syscall completes.

  2. Forced detach on timeout: If umount does not complete within a configurable timeout
    (e.g., 2 minutes), retry with umount2 using MNT_DETACH (MNT_FORCE may not be supported
    on Lustre). This detaches the mount from the namespace immediately even if I/O is outstanding,
    allowing the lock to be released.

  3. Lock TTL / watchdog: Implement a maximum hold time for the operation lock. If the lock has
    been held beyond a threshold, log a warning and release it so other operations can proceed.

Option 1 or 2 (or both combined) seems most consistent with the gRPC contract: the kubelet
expects the RPC to return within its deadline.

Impact

Pods using FSx for Lustre volumes that encounter this bug are stuck in Terminating with no
automated recovery. On busy clusters this accumulates: we have observed 6 stuck pods at once
after a single blocked unmount event.
