Skip to content

Batch Attach race condition fix#3985

Closed
skogta wants to merge 1 commit into
kubernetes-sigs:masterfrom
skogta:topic/skogta/batchattachRace
Closed

Batch Attach race condition fix#3985
skogta wants to merge 1 commit into
kubernetes-sigs:masterfrom
skogta:topic/skogta/batchattachRace

Conversation

@skogta
Copy link
Copy Markdown
Contributor

@skogta skogta commented Apr 22, 2026

What this PR does / why we need it:

During VM import, there is a race condition where there may be a delay from VM operator in adding a volume to batch attach spec.

In the meantime, CSI might incorrectly interpret that as a detach request (since volume is not there in batchattach spec but is attached to the VM on VM inventory).

In order to fix this, before adding a volume to detach list, it is important that CSI also validates that the volume is not being referenced in the VM spec. If it is being referenced, then skip adding that volume to detach list.

If VM object is not found, then fail the reconciliation.

As discussed on private chat, we should ignore safety check if PVC VM object is not found k8s cluster.

Testing done:
IN PROGESS:
WCP precheckin: https://jenkins-vcf-csifvt.devops.broadcom.net/job/wcp-instapp-e2e-pre-checkin/1307/
VKS precheckin: https://jenkins-vcf-csifvt.devops.broadcom.net/job/vks-instapp-e2e-pre-checkin/1050/

  1. Successfully attached a PVC to a VM.
  2. Sucessfully detached a PVC from a VM.
  3. Removed PVC from batchattach spec while it was still there in VM spec. Observed that detach was deined:
{"level":"info","time":"2026-04-29T17:09:21.219126053Z","caller":"cnsnodevmbatchattachment/cnsnodevmbatchattachment_helper.go:171","msg":"Skipping detach for PVC test/pvc-2 with FCD 2474f3b8-e404-4710-aeb5-575a78f5fa5a from VM Instance UUID 15294371-1935-4a95-96cb-56e2862bf2e2 because it is still referenced in VirtualMachine testvm-1 spec. This indicates the PVC is actively used by the VM.","TraceId":"bff7b482-a75d-4527-afb2-678f14515b25"}

@skogta skogta marked this pull request as draft April 22, 2026 17:47
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: skogta
Once this PR has been reviewed and has the lgtm label, please assign raunakshah for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 22, 2026
@skogta skogta force-pushed the topic/skogta/batchattachRace branch 3 times, most recently from ad5d9cb to b7bb98b Compare April 24, 2026 03:08
@skogta skogta marked this pull request as ready for review April 24, 2026 03:08
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 24, 2026
@skogta skogta force-pushed the topic/skogta/batchattachRace branch 2 times, most recently from 30b6146 to dc31ae6 Compare April 24, 2026 04:23
@deepakkinni
Copy link
Copy Markdown
Collaborator

Triggering CSI-WCP Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
Jenkins Build #1299

@deepakkinni
Copy link
Copy Markdown
Collaborator

Triggering CSI-TKG Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
Jenkins Build #1039

@skogta skogta force-pushed the topic/skogta/batchattachRace branch from dc31ae6 to 1b2354b Compare April 24, 2026 04:38
@deepakkinni
Copy link
Copy Markdown
Collaborator

SUCCESS --- Jenkins Build #1299

@skogta skogta force-pushed the topic/skogta/batchattachRace branch 5 times, most recently from 9ebcf7c to 42382d0 Compare April 24, 2026 10:24
@deepakkinni
Copy link
Copy Markdown
Collaborator

Triggering CSI-TKG Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
Jenkins Build #1048

@deepakkinni
Copy link
Copy Markdown
Collaborator

SUCCESS --- Jenkins Build #1048

@deepakkinni
Copy link
Copy Markdown
Collaborator

Triggering CSI-WCP Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
Jenkins Build #1305

@deepakkinni
Copy link
Copy Markdown
Collaborator

SUCCESS --- Jenkins Build #1305

@skogta skogta force-pushed the topic/skogta/batchattachRace branch from 42382d0 to 71a4eac Compare April 24, 2026 18:07
@skogta skogta force-pushed the topic/skogta/batchattachRace branch from 71a4eac to 8424166 Compare April 24, 2026 18:17
@deepakkinni
Copy link
Copy Markdown
Collaborator

Triggering CSI-TKG Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
Jenkins Build #1050

@deepakkinni
Copy link
Copy Markdown
Collaborator

FAILED --- Jenkins Build #1050

@deepakkinni
Copy link
Copy Markdown
Collaborator

Triggering CSI-WCP Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
Jenkins Build #1307

@deepakkinni
Copy link
Copy Markdown
Collaborator

Triggering CSI-TKG Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
Jenkins Build #1051

@deepakkinni
Copy link
Copy Markdown
Collaborator

SUCCESS --- Jenkins Build #1051

@deepakkinni
Copy link
Copy Markdown
Collaborator

FAILED --- Jenkins Build #1307

@akutz
Copy link
Copy Markdown
Contributor

akutz commented Apr 28, 2026

@skogta ,

I was surprised to see this patch. We discussed it Th/Friday, and then Bryan, Manoj, and @deepakkinni and I met on Friday to discuss the different solutions, and I thought we agreed that VM Op had to make the change because of the need for the handshake?

I was working on a fix as well at vmware-tanzu/vm-operator#1572. Have you validated your change yet? I am happy to use yours if it works.

Copy link
Copy Markdown
Contributor

@akutz akutz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this @skogta . It looks great! I left a few comments.

jsonpatch "github.com/evanphx/json-patch/v5"
vmoperatorv1alpha1 "github.com/vmware-tanzu/vm-operator/api/v1alpha1"
vmoperatortypes "github.com/vmware-tanzu/vm-operator/api/v1alpha2"
vmoperatorv1alpha5 "github.com/vmware-tanzu/vm-operator/api/v1alpha5"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason v1a5 is used? Would it not make more sense to use an earlier version on Supervisor? @deepakkinni , thoughts on this?

for _, vmVol := range vm.Spec.Volumes {
if vmVol.PersistentVolumeClaim != nil &&
vmVol.PersistentVolumeClaim.ClaimName == pvcName {
log.Infof("Skipping detach for PVC %s/%s with FCD %s from VM %s because it is still "+
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VM %s is confusing. Folks won't know if it is BIOS ID or Instance UUID. Maybe say VM instance UUID %s instead?

@deepakkinni
Copy link
Copy Markdown
Collaborator

@skogta ,

I was surprised to see this patch. We discussed it Th/Friday, and then Bryan, Manoj, and @deepakkinni and I met on Friday to discuss the different solutions, and I thought we agreed that VM Op had to make the change because of the need for the handshake?

I was working on a fix as well at vmware-tanzu/vm-operator#1572. Have you validated your change yet? I am happy to use yours if it works.

This is supposed to be dropped. I think @skogta forgot to close it.

@skogta
Copy link
Copy Markdown
Contributor Author

skogta commented Apr 29, 2026

@skogta ,
I was surprised to see this patch. We discussed it Th/Friday, and then Bryan, Manoj, and @deepakkinni and I met on Friday to discuss the different solutions, and I thought we agreed that VM Op had to make the change because of the need for the handshake?
I was working on a fix as well at vmware-tanzu/vm-operator#1572. Have you validated your change yet? I am happy to use yours if it works.

This is supposed to be dropped. I think @skogta forgot to close it.

Yes, I meant to close it on Friday.

@skogta skogta closed this Apr 29, 2026
@skogta
Copy link
Copy Markdown
Contributor Author

skogta commented Apr 30, 2026

In case we go ahead with this fix, refer to this PR:
#4003

It contains better optimization by creating a cache of all volumes present in VM spec.
It also uses v1alpha2 instead of v1alpha5 as I can see all helper functions in CSI use v1alpha2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants