Batch Attach race condition fix#3985
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: skogta. It still needs approval from an approver in each of the affected files.
ad5d9cb to b7bb98b
30b6146 to dc31ae6
Triggering CSI-WCP Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
Triggering CSI-TKG Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
dc31ae6 to 1b2354b
SUCCESS --- Jenkins Build #1299
9ebcf7c to 42382d0
Triggering CSI-TKG Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
SUCCESS --- Jenkins Build #1048
Triggering CSI-WCP Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
SUCCESS --- Jenkins Build #1305
42382d0 to 71a4eac
71a4eac to 8424166
Triggering CSI-TKG Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
FAILED --- Jenkins Build #1050
Triggering CSI-WCP Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
Triggering CSI-TKG Pre-checkin Pipeline for this PR... Job takes approximately an hour to complete
SUCCESS --- Jenkins Build #1051
FAILED --- Jenkins Build #1307
@skogta, I was surprised to see this patch. We discussed it Thursday/Friday, and then Bryan, Manoj, @deepakkinni, and I met on Friday to discuss the different solutions. I thought we agreed that VM Operator had to make the change because of the need for the handshake? I was working on a fix as well at vmware-tanzu/vm-operator#1572. Have you validated your change yet? I am happy to use yours if it works.
jsonpatch "github.com/evanphx/json-patch/v5"
vmoperatorv1alpha1 "github.com/vmware-tanzu/vm-operator/api/v1alpha1"
vmoperatortypes "github.com/vmware-tanzu/vm-operator/api/v1alpha2"
vmoperatorv1alpha5 "github.com/vmware-tanzu/vm-operator/api/v1alpha5"
Is there a reason v1alpha5 is used? Would it not make more sense to use an earlier version on Supervisor? @deepakkinni, thoughts on this?
for _, vmVol := range vm.Spec.Volumes {
	if vmVol.PersistentVolumeClaim != nil &&
		vmVol.PersistentVolumeClaim.ClaimName == pvcName {
		log.Infof("Skipping detach for PVC %s/%s with FCD %s from VM %s because it is still "+
"VM %s" is confusing. Readers won't know whether it is the BIOS ID or the instance UUID. Maybe say "VM instance UUID %s" instead?
This is supposed to be dropped. I think @skogta forgot to close it.
Yes, I meant to close it on Friday.
If we go ahead with this fix, refer to this PR: It contains a better optimization that builds a cache of all volumes present in the VM spec.
What this PR does / why we need it:
During VM import, there is a race condition: VM Operator may be delayed in adding a volume to the batch attach spec.
In the meantime, CSI might incorrectly interpret the missing entry as a detach request, since the volume is absent from the batch attach spec but is attached to the VM in the VM inventory.
To fix this, before adding a volume to the detach list, CSI must also validate that the volume is not referenced in the VM spec. If it is referenced, CSI skips adding that volume to the detach list.
If the VM object is not found, then fail the reconciliation.
As discussed in private chat, we should skip the safety check if the PVC's VM object is not found in the k8s cluster.
Testing done:
IN PROGRESS:
WCP precheckin: https://jenkins-vcf-csifvt.devops.broadcom.net/job/wcp-instapp-e2e-pre-checkin/1307/
VKS precheckin: https://jenkins-vcf-csifvt.devops.broadcom.net/job/vks-instapp-e2e-pre-checkin/1050/