Skip to content

Conversation

@Madhu-1
Copy link
Collaborator

@Madhu-1 Madhu-1 commented Nov 20, 2024

This PR includes series for commits for the following actions

  • Introduce in memory lock for NodePublish and UnPublish (we should not be dependent on the CO and make assumption that CO makes the calls serial for the same volID and the targetPath)
  • Use os.Remove instead of os.Removeall to remove the empty directory, os.Removeall ends up removing everything in the path

Thanks to @martin-helmich @hensur and @Lucaber for the detailed analysis

fixes: #4966

@Rakshith-R
Copy link
Contributor

/test ci/centos/mini-e2e/k8s-1.31

@Madhu-1 Madhu-1 force-pushed the fix-4966 branch 2 times, most recently from cafad3a to 8c455bd Compare November 20, 2024 10:25
@Madhu-1
Copy link
Collaborator Author

Madhu-1 commented Nov 20, 2024

/test ci/centos/mini-e2e/k8s-1.31

Copy link

@martin-helmich martin-helmich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, that was quick, thanks. 🤩🚀 This is also pretty much the fix that we had in mind for this. 👍

defer ns.VolumeLocks.Release(targetPath)

// Check if that target path exists properly
notMnt, err := ns.createTargetMountPath(ctx, targetPath, isBlock)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I see it correctly, createTargetMountPath still uses the (possibly unreliable?) IsLikelyNotMountPoint under the hood; should this be adjusted to the same IsMountPoint that the cephfs implementation uses? Or can this remain as-is, as discussed in #4966 (comment)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes thats correct, it checks the sources, The main problem is with lock and the os.RemoveAll, if os.Remove fails we will get another RPC for unpublish which will trigger umount if its actually a mount. For now lets keep it simple with what we have today.

@nixpanic nixpanic changed the title Fix data lose due to the concurrent RPC calls (occurrence is very low) Prevent dataloss due to the concurrent RPC calls (occurrence is very low) Nov 21, 2024
Copy link
Member

@nixpanic nixpanic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

It would be good to backport this to active releases, I think?

@Madhu-1 Madhu-1 added the backport-to-release-v3.12 Label to backport from devel to release-v3.12 branch label Nov 21, 2024
@Madhu-1
Copy link
Collaborator Author

Madhu-1 commented Nov 21, 2024

@Mergifyio queue

@mergify
Copy link
Contributor

mergify bot commented Nov 21, 2024

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at 00d252e

We should not be dependent on the CO to ensure
that it will serialize the request instead of
that we need to have own internal locks to ensure
that we dont do concurrent operations for same
request.

Signed-off-by: Madhu Rajanna <[email protected]>
We should not be dependent on the CO to ensure
that it will serialize the request instead of
that we need to have own internal locks to ensure
that we dont do concurrent operations for same
request.

Signed-off-by: Madhu Rajanna <[email protected]>
using os.RemoveAll will remove everything
in the director after the Umount we should
be using os.Remove only to remove the empty
directory

Signed-off-by: Madhu Rajanna <[email protected]>
using os.RemoveAll will remove everything
in the director after the Umount we should
be using os.Remove only to remove the empty
directory

Signed-off-by: Madhu Rajanna <[email protected]>
@mergify mergify bot added the ok-to-test Label to trigger E2E tests label Nov 21, 2024
@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.30

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.30

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.31

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.30

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/upgrade-tests-cephfs

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.31

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.29

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.31

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/upgrade-tests-rbd

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.29

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.29

@ceph-csi-bot ceph-csi-bot removed the ok-to-test Label to trigger E2E tests label Nov 21, 2024
@mergify mergify bot merged commit 00d252e into ceph:devel Nov 21, 2024
38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-to-release-v3.12 Label to backport from devel to release-v3.12 branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race condition in RBD node plugin may lead to total data loss while unmounting

5 participants