NodeUnpublishVolume hangs indefinitely when umount(2) blocks, permanently holding the operation lock #495

### Describe the bug

When a pod using an FSx for Lustre volume is deleted, the CSI node plugin calls `umount(2)` on
the volume's target path as part of `NodeUnpublishVolume`. If the `umount(2)` syscall blocks
(e.g., due to a temporarily unresponsive FSx endpoint or an in-progress I/O flush that never
completes), the goroutine executing `NodeUnpublishVolume` never returns. Because the driver
acquires a per-volume operation lock before beginning the unmount, and holds it for the duration,
that lock is never released.

All subsequent `NodePublishVolume`, `NodeUnpublishVolume`, `NodeStageVolume`, and
`NodeUnstageVolume` calls on the same volume fail immediately with:

```
operation for volume <volume-id> is already in progress
```

As a result, the original pod stays in `Terminating` indefinitely; on our cluster we have
observed pods stuck in this state for **44+ days**. There is no self-healing path: the driver has
no lock timeout, and kubelet cannot interrupt the in-flight syscall. The only recovery is to
restart the CSI node DaemonSet pod on the affected node, which clears the in-memory lock. If the
CSI node pod is itself stuck, it has to be force-deleted with
`kubectl delete pod --force --grace-period=0`.

### Steps to reproduce / how we detected it

We do not have a reliable reproduction recipe, but the failure is detectable through the following
observable state:

1. Pod is in `Terminating` with a deletion timestamp set weeks or months ago.
2. Describing the pod shows it is waiting on a CSI volume to be unmounted.
3. Logs from the `fsx-csi-node` pod on the same node contain a stream of entries like:
   ```
   An error occurred in NodeUnpublishVolume, operation for volume <id> is already in progress
   ```
4. No corresponding "operation complete" log entry exists — the original operation never finished.
5. Restarting the `fsx-csi-node` pod (force-delete if also stuck) unblocks the termination
   immediately.

### Environment

- **aws-fsx-csi-driver version**: v1.8.0 (also present in v1.9.0 based on changelog review)
- **Helm chart version**: 1.15.0
- **Kubernetes**: Amazon EKS
- **FSx filesystem type**: FSx for Lustre

### Root cause

The `NodeUnpublishVolume` gRPC handler acquires a per-volume lock, then calls `mounter.Unmount()`,
which wraps the `umount(2)` syscall. The gRPC context (which carries the kubelet-imposed deadline)
is not propagated to the mount operation, so the syscall runs without any timeout or cancellation
mechanism. If the kernel call blocks, the handler never reaches the point where it releases the
lock, and the lock is held until the process exits.
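
To make the failure mode concrete, here is a minimal sketch of the pattern described above. It is not the driver's code: `volumeLocks`, `TryAcquire`, `Release`, and `nodeUnpublish` are hypothetical names chosen for illustration, and the `unmount` callback stands in for the call that wraps `umount(2)`.

```go
package main

import (
	"fmt"
	"sync"
)

// volumeLocks is a hypothetical stand-in for the driver's in-memory,
// per-volume operation lock: a set of volume IDs with an operation in flight.
type volumeLocks struct {
	mu       sync.Mutex
	inFlight map[string]struct{}
}

func newVolumeLocks() *volumeLocks {
	return &volumeLocks{inFlight: make(map[string]struct{})}
}

// TryAcquire reports whether the caller now owns the lock for volumeID.
func (l *volumeLocks) TryAcquire(volumeID string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, busy := l.inFlight[volumeID]; busy {
		return false
	}
	l.inFlight[volumeID] = struct{}{}
	return true
}

// Release drops the lock for volumeID.
func (l *volumeLocks) Release(volumeID string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.inFlight, volumeID)
}

// nodeUnpublish mimics the handler shape described in this issue. unmount
// receives no context, so nothing in this function can interrupt it.
func nodeUnpublish(locks *volumeLocks, volumeID string, unmount func() error) error {
	if !locks.TryAcquire(volumeID) {
		return fmt.Errorf("operation for volume %s is already in progress", volumeID)
	}
	// The deferred release only runs once unmount returns. If unmount blocks
	// forever, volumeID stays in the in-flight set for the life of the process.
	defer locks.Release(volumeID)
	return unmount()
}
```

In this sketch, a second call to `nodeUnpublish` for the same volume ID fails immediately for as long as the first call's `unmount` callback has not returned, which matches the log stream we see on affected nodes.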

### Suggested fix

One or more of the following approaches would address this:

1. **Context-aware unmount**: Run the `umount(2)` call in a goroutine, select on the gRPC context
   deadline and a done channel, and return an error (releasing the lock) if the context is
   cancelled before the syscall completes.

2. **Forced detach on timeout**: If `umount(2)` does not complete within a configurable timeout
   (e.g., 2 minutes), retry with `umount2(2)` and the `MNT_DETACH` flag (`MNT_FORCE` may not be
   supported on Lustre). This detaches the mount from the namespace immediately even if I/O is
   outstanding, allowing the lock to be released.

3. **Lock TTL / watchdog**: Implement a maximum hold time for the operation lock. If the lock has
   been held beyond a threshold, log a warning and release it so other operations can proceed.

Option 1 or 2 (or both combined) seems most consistent with how the gRPC contract is designed:
the kubelet expects the RPC to return within its deadline.

### Impact

Pods using FSx for Lustre volumes that encounter this bug are stuck in `Terminating` with no
automated recovery. On busy clusters this accumulates: we have observed 6 stuck pods at once
after a single blocked unmount event.

