Describe the bug
When a pod using an FSx for Lustre volume is deleted, the CSI node plugin calls umount(2) on
the volume's target path as part of NodeUnpublishVolume. If the umount(2) syscall blocks
(e.g., due to a temporarily unresponsive FSx endpoint or an in-progress I/O flush that never
completes), the goroutine executing NodeUnpublishVolume never returns. Because the driver
acquires a per-volume operation lock before beginning the unmount, and holds it for the duration,
that lock is never released.
All subsequent NodePublishVolume, NodeUnpublishVolume, NodeStageVolume, and
NodeUnstageVolume calls on the same volume fail immediately with:
operation for volume <volume-id> is already in progress
The result is that the original pod stays in Terminating forever. On our cluster we have
observed pods stuck in this state for 44+ days. There is no self-healing path: the driver has
no lock timeout, and kubelet has no way to interrupt the in-flight syscall. The only recovery is
to restart the CSI node DaemonSet pod on the affected node, which clears the in-memory lock
(force-deleting it with kubectl delete pod --force --grace-period=0 if the CSI node pod itself is
also stuck).
Steps to reproduce / how we detected it
We do not have a reliable reproduction recipe, but the failure is detectable through the following
observable state:
- Pod is in Terminating with a deletion timestamp set weeks or months ago.
- Describing the pod shows it is waiting on a CSI volume to be unmounted.
- Logs from the fsx-csi-node pod on the same node contain a stream of entries like:
An error occurred in NodeUnpublishVolume, operation for volume <id> is already in progress
- No corresponding "operation complete" log entry exists — the original operation never finished.
- Restarting the fsx-csi-node pod (force-delete if also stuck) unblocks the termination
immediately.
Environment
- aws-fsx-csi-driver version: v1.8.0 (also present in v1.9.0 based on changelog review)
- Helm chart version: 1.15.0
- Kubernetes: Amazon EKS
- FSx filesystem type: FSx for Lustre
Root cause
The NodeUnpublishVolume gRPC handler acquires a per-volume lock, then calls mounter.Unmount(),
which wraps the umount(2) syscall. The gRPC context (which carries the kubelet-imposed deadline)
is not propagated to the unmount operation, so the syscall runs without any timeout or
cancellation mechanism. If the kernel call blocks, the lock is held until the CSI node pod is
restarted.
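To make the failure mode concrete, here is a minimal sketch of the lock-then-unmount pattern described above. It is not the driver's actual code: the inFlight type, the nodeUnpublish function, and the volume ID "fs-example" are hypothetical stand-ins, and the blocked umount(2) is simulated with a function that never returns.

```go
package main

import (
	"fmt"
	"sync"
)

// inFlight is a hypothetical stand-in for the driver's per-volume operation lock.
type inFlight struct {
	mu  sync.Mutex
	ids map[string]struct{}
}

func newInFlight() *inFlight { return &inFlight{ids: make(map[string]struct{})} }

// tryAcquire returns false if an operation on volumeID is already running.
func (f *inFlight) tryAcquire(volumeID string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, busy := f.ids[volumeID]; busy {
		return false
	}
	f.ids[volumeID] = struct{}{}
	return true
}

// release marks volumeID as free again.
func (f *inFlight) release(volumeID string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	delete(f.ids, volumeID)
}

// nodeUnpublish mimics the handler shape: acquire, unmount, release.
// If unmount never returns, the deferred release never runs.
func nodeUnpublish(f *inFlight, volumeID string, unmount func() error) error {
	if !f.tryAcquire(volumeID) {
		return fmt.Errorf("operation for volume %s is already in progress", volumeID)
	}
	defer f.release(volumeID)
	return unmount()
}

func main() {
	f := newInFlight()
	started := make(chan struct{})

	// First call: the simulated umount(2) blocks forever while holding the lock.
	go nodeUnpublish(f, "fs-example", func() error {
		close(started)
		select {}
	})
	<-started

	// Retry from kubelet: rejected immediately because the lock is still held.
	err := nodeUnpublish(f, "fs-example", func() error { return nil })
	fmt.Println(err) // operation for volume fs-example is already in progress
}
```

Every retry takes the second path, which matches the stream of "already in progress" log entries we see on the affected node.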
Suggested fix
One or more of the following approaches would address this:
1. Context-aware unmount: Run the umount(2) call in a goroutine, select on the gRPC context
   deadline and a done channel, and return an error (releasing the lock) if the context is
   cancelled before the syscall completes.
2. Forced detach on timeout: If umount does not complete within a configurable timeout
   (e.g., 2 minutes), retry with umount2 using MNT_DETACH (MNT_FORCE may not be supported
   on Lustre). This detaches the mount from the namespace immediately even if I/O is outstanding,
   allowing the lock to be released.
3. Lock TTL / watchdog: Implement a maximum hold time for the operation lock. If the lock has
   been held beyond a threshold, log a warning and release it so other operations can proceed.
Option 1 or 2 (or both combined) seems most consistent with the gRPC contract: the kubelet
expects the RPC to return within its deadline.
Impact
Pods using FSx for Lustre volumes that encounter this bug are stuck in Terminating with no
automated recovery. On busy clusters this accumulates: we have observed 6 stuck pods at once
after a single blocked unmount event.