Update cleanup job node affinity logic #1455

Merged: 4 commits, Jun 25, 2025

dkwon17 (Collaborator) commented Jun 23, 2025

What does this PR do?

Update the cleanup job node affinity logic.

Old logic:

  • Schedule the cleanup job pod on the node specified in the claim-devworkspace PVC's volume.kubernetes.io/selected-node annotation (if the annotation exists)

This logic can still cause the multi-attach error, since there is no guarantee that the PVC is mounted by a pod on the node specified in volume.kubernetes.io/selected-node.

New logic:

  • Check whether any devworkspace pods are running in the namespace of the devworkspace being deleted.
    • If so, check whether any of them mounts the claim-devworkspace PVC.
      • If one does, schedule the cleanup job pod on the same node as that devworkspace pod.
    • Otherwise, don't add a nodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution rule to the cleanup job pod.
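
As a rough illustration of the new logic, here is a minimal Go sketch. It is not the code from this PR: the Pod, NodeSelectorRequirement and NodeAffinity types are hypothetical, flattened stand-ins for the real k8s.io/api/core/v1 types, the function names are invented for the example, and using the kubernetes.io/hostname label to pin the pod is an assumption.

```go
package main

// Pod is a hypothetical, trimmed-down stand-in for a Kubernetes pod.
// The operator itself works with the real k8s.io/api/core/v1 types.
type Pod struct {
	Name           string
	NodeName       string   // node the pod was scheduled on
	Running        bool     // true when the pod is running
	IsDevWorkspace bool     // true for devworkspace pods (how they are identified is not modeled)
	PVCNames       []string // claim names of the PVCs the pod mounts
}

// NodeSelectorRequirement and NodeAffinity loosely mirror the node affinity
// API, flattened to the fields this sketch needs.
type NodeSelectorRequirement struct {
	Key      string
	Operator string
	Values   []string
}

type NodeAffinity struct {
	RequiredDuringSchedulingIgnoredDuringExecution []NodeSelectorRequirement
}

// nodeForCleanupJob returns the node of the first running devworkspace pod
// that mounts pvcName, or "" if no such pod exists.
func nodeForCleanupJob(pods []Pod, pvcName string) string {
	for _, p := range pods {
		if !p.IsDevWorkspace || !p.Running {
			continue
		}
		for _, claim := range p.PVCNames {
			if claim == pvcName {
				return p.NodeName
			}
		}
	}
	return ""
}

// cleanupJobAffinity returns a required node affinity that pins the cleanup
// job pod to the node found above, or nil when no rule should be added.
func cleanupJobAffinity(pods []Pod, pvcName string) *NodeAffinity {
	node := nodeForCleanupJob(pods, pvcName)
	if node == "" {
		return nil // no pod mounts the PVC: leave the job pod unpinned
	}
	return &NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: []NodeSelectorRequirement{{
			Key:      "kubernetes.io/hostname", // assumed label; the PR may select the node differently
			Operator: "In",
			Values:   []string{node},
		}},
	}
}
```

Calling cleanupJobAffinity(pods, "claim-devworkspace") with no matching pod returns nil, which corresponds to the "don't add a rule" branch above.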

What issues does this PR fix or reference?

Fix #1453

Is it tested? How?

These steps require a multi-node cluster (more than one worker node).

  1. Install DWO with this image: quay.io/dkwon17/devworkspace-controller:fix-pvc-node:
     export NAMESPACE=openshift-operators
     export DWO_IMG="quay.io/dkwon17/devworkspace-controller:fix-pvc-node"
     make install
  2. Switch to the openshift-operators namespace and create a workspace:
     oc project openshift-operators && oc apply -f https://raw.githubusercontent.com/devfile/devworkspace-operator/refs/heads/main/samples/code-latest.yaml
  3. After the workspace is running (run oc get dw to check), stop the first workspace:
     oc patch dw code-latest --type merge -p '{"spec": {"started": false}}'
  4. Create a second workspace:
     curl https://raw.githubusercontent.com/devfile/devworkspace-operator/refs/heads/main/samples/code-latest.yaml | yq '.metadata.name = "code-latest-2"' | oc apply -f -
  5. After the second workspace is running, stop it:
     oc patch dw code-latest-2 --type merge -p '{"spec": {"started": false}}'
  6. Get the node specified in the volume.kubernetes.io/selected-node annotation on the claim-devworkspace PVC:
     oc get pvc claim-devworkspace -o jsonpath='{.metadata.annotations.volume\.kubernetes\.io\/selected-node}'
  7. Cordon the node, making it unavailable for pod scheduling:
     oc adm cordon <output of previous command>
  8. Delete code-latest:
     oc delete dw code-latest

With the changes in this PR, the deletion should succeed after about 10 seconds.

  9. Create another workspace:
     oc apply -f https://raw.githubusercontent.com/devfile/devworkspace-operator/refs/heads/main/samples/code-latest.yaml
  10. Wait until it is running, and delete it:
     oc delete dw code-latest
  11. Uncordon the node from step 7:
     oc adm uncordon <node name>

The deletion should succeed after about 10 seconds.
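
One way to read why these steps exercise the fix is the small Go model below. It is only a sketch of the scenario, not operator code: the Node type and all function names are hypothetical, and the scheduler is reduced to "a cordoned node accepts no new pods" (tolerations are ignored).

```go
package main

// Annotation named in the PR description.
const selectedNodeAnnotation = "volume.kubernetes.io/selected-node"

// Node is a hypothetical stand-in for a cluster node.
type Node struct {
	Name     string
	Cordoned bool // true after `oc adm cordon`
}

// oldTargetNode models the old logic: pin the cleanup job pod to the node
// named in the PVC annotation, if the annotation exists ("" otherwise).
func oldTargetNode(pvcAnnotations map[string]string) string {
	return pvcAnnotations[selectedNodeAnnotation]
}

// newTargetNode models the new logic: pin the cleanup job pod to the node of
// a devworkspace pod that mounts the PVC, or return "" when there is none.
func newTargetNode(nodesOfPodsMountingPVC []string) string {
	if len(nodesOfPodsMountingPVC) == 0 {
		return ""
	}
	return nodesOfPodsMountingPVC[0]
}

// canSchedule reports whether a pod pinned to target ("" means unpinned)
// can be placed on some node that is not cordoned.
func canSchedule(nodes []Node, target string) bool {
	for _, n := range nodes {
		if n.Cordoned {
			continue
		}
		if target == "" || n.Name == target {
			return true
		}
	}
	return false
}
```

In this model, with the annotated node cordoned and both workspaces stopped (step 8), the old logic pins the job pod to the cordoned node and it cannot be scheduled, while the new logic leaves it unpinned. In step 10 a workspace pod is still running on a schedulable node, so the new logic pins the job pod there.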

PR Checklist

  • E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
    • v8-devworkspace-operator-e2e: DevWorkspace e2e test
    • v8-che-happy-path: Happy path for verification integration with Che

@rohanKanojia
Collaborator

@dkwon17 : Thanks for fixing this! Would you mind adding some unit tests to cover the changed behavior?

@olkornii
olkornii commented Jun 24, 2025

@dkwon17 Tried with ROSA 4.18 and oc 4.18.9.

Step 2 failed with the following error:

oc project openshift-operators && oc apply -f https://raw.githubusercontent.com/devfile/devworkspace-operator/refs/heads/main/samples/code-latest.yaml
Now using project "openshift-operators" on server "https://api.mibm8-drrcy-uos.1u73.p3.openshiftapps.com:443".
Error from server (InternalError): error when creating "https://raw.githubusercontent.com/devfile/devworkspace-operator/refs/heads/main/samples/code-latest.yaml": Internal error occurred: failed calling webhook "mutate.devworkspace-controller.svc": failed to call webhook: Post "https://devworkspace-webhookserver.openshift-operators.svc:443/mutate?timeout=10s": no endpoints available for service "devworkspace-webhookserver"

Running it as two separate commands:

oc project openshift-operators

and

oc apply -f https://raw.githubusercontent.com/devfile/devworkspace-operator/refs/heads/main/samples/code-latest.yaml

passed.

Result: after the workaround for step 2, everything works as expected.

@rohanKanojia
Collaborator

@olkornii : It seems that the webhook pod is not running. Could you please check whether the devworkspace-controller-manager and devworkspace-webhook-server pods are running in the openshift-operators namespace:

devworkspace-operator : $ oc get pods -nopenshift-operators
NAME                                               READY   STATUS      RESTARTS   AGE
devworkspace-controller-manager-68fc476665-tzlhj   2/2     Running     0          3h16m
devworkspace-webhook-server-6976b5d46d-c598k       2/2     Running     0          3h12m
devworkspace-webhook-server-6976b5d46d-vsqfw       2/2     Running     0          3h12m

@rohanKanojia
Collaborator

I tested this PR with the steps above and it seems to be working ✔️ . Thanks a lot 👍

openshift-ci bot commented Jun 24, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dkwon17, ibuziuk, rohanKanojia

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@olkornii

@rohanKanojia

NAME                                               READY   STATUS      RESTARTS   AGE
cleanup-workspaced86095cae49842d3-gx6ks            0/1     Completed   0          119m
devworkspace-controller-manager-68fc476665-qzfrl   2/2     Running     0          154m
devworkspace-webhook-server-67d4bc5fdb-6pv9x       2/2     Running     0          154m
devworkspace-webhook-server-67d4bc5fdb-k5c9r       2/2     Running     0          154m

But as I said, entering those commands one by one works well.

@dkwon17
Collaborator Author

dkwon17 commented Jun 24, 2025

Thank you for checking,

@olkornii this error:

Error from server (InternalError): error when creating "https://raw.githubusercontent.com/devfile/devworkspace-operator/refs/heads/main/samples/code-latest.yaml": Internal error occurred: failed calling webhook "mutate.devworkspace-controller.svc": failed to call webhook: Post "https://devworkspace-webhookserver.openshift-operators.svc:443/mutate?timeout=10s": no endpoints available for service "devworkspace-webhookserver"

occurs because the devworkspace-webhookserver pods were not yet running when the devworkspace was created, hence the message no endpoints available for service "devworkspace-webhookserver". So it is not an issue with this PR.

Signed-off-by: David Kwon
@openshift-ci openshift-ci bot removed the lgtm label Jun 24, 2025

openshift-ci bot commented Jun 24, 2025

New changes are detected. LGTM label has been removed.

@dkwon17
Collaborator Author

dkwon17 commented Jun 24, 2025

@rohanKanojia I'm still working on the unit tests; I will create a new PR for them.

dkwon17 added 2 commits June 24, 2025 19:50
Signed-off-by: David Kwon
Signed-off-by: David Kwon
@dkwon17 dkwon17 merged commit 01dda08 into devfile:main Jun 25, 2025
10 checks passed
dkwon17 added a commit to dkwon17/devworkspace-operator that referenced this pull request Jun 25, 2025
@dkwon17 dkwon17 mentioned this pull request Jun 25, 2025
3 tasks
dkwon17 added a commit that referenced this pull request Jun 25, 2025
dkwon17 added a commit that referenced this pull request Jun 25, 2025
Successfully merging this pull request may close these issues.

Cleanup job pod can be assigned to incorrect node
4 participants