What happened:
When a task policy maps the PodFailed event to the RestartJob action, all the pods are deleted on the last retry, even though no further retries will follow. As a result, there is no way to access the logs to find out why the pods failed and to debug the job.
# success case w/ PodFailed:RestartJob
sh-5tqgv-sh-0-0 0/1 Completed 0 16s
# failure case without PodFailed:RestartJob
sh-jpfdn-sh-0-0 0/1 Error 0 16s
# failure case with PodFailed:RestartJob
<nothing>
What you expected to happen:
The pods from the last retry should be left in place so that their logs remain available for debugging.
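To make the expected behavior concrete, here is a minimal sketch of the decision I have in mind. It is not Volcano's actual controller code: the function `on_pod_failed` and the `RestartDecision` type are hypothetical names, and the assumption that the retry counter is incremented before being compared with `maxRetry` is inferred only from the job status shown in this report (`retryCount: 1` with `maxRetry: 1`).

```python
from dataclasses import dataclass


@dataclass
class RestartDecision:
    """Hypothetical result of handling a PodFailed event with RestartJob."""
    retry_count: int   # retry counter after accounting for this failure
    restart: bool      # whether the job gets restarted
    delete_pods: bool  # whether the failed pods are cleaned up


def on_pod_failed(retry_count: int, max_retry: int) -> RestartDecision:
    """Decide what to do with the pods when a pod fails.

    Assumption: the counter is incremented first, and the job becomes
    terminal once the counter reaches max_retry.
    """
    new_count = retry_count + 1
    if new_count >= max_retry:
        # No retry will follow, so keep the pods and their logs around.
        return RestartDecision(new_count, restart=False, delete_pods=False)
    # Another attempt follows, so the old pods can be removed.
    return RestartDecision(new_count, restart=True, delete_pods=True)
```

With the job from this report (`max_retry=1`, first failure), the sketch keeps the pods; with a larger `max_retry` the earlier attempts are still cleaned up as they are today.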
How to reproduce it (as minimally and precisely as possible):
Run the following job:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  creationTimestamp: "2021-07-30T18:28:15Z"
  generateName: sh-
  generation: 1
  name: sh-kbnmb
  namespace: default
  resourceVersion: "96243182"
  selfLink: /apis/batch.volcano.sh/v1alpha1/namespaces/default/jobs/sh-kbnmb
  uid: 13eaaf6e-66f3-434c-a021-623af82771d7
spec:
  maxRetry: 1
  minAvailable: 1
  plugins:
    env: []
    svc: []
  queue: test
  schedulerName: volcano
  tasks:
  - maxRetry: 1
    minAvailable: 1
    name: sh-0
    policies:
    - action: RestartJob
      event: PodEvicted
    - action: RestartJob
      event: PodFailed
    replicas: 1
    template:
      metadata: {}
      spec:
        containers:
        - command:
          - /bin/sh
          - -c
          - exit 1
          image: alpine:latest
          name: sh-0
          resources: {}
        restartPolicy: Never
status:
  minAvailable: 1
  retryCount: 1
  runningDuration: 6.652806914s
  state:
    lastTransitionTime: "2021-07-30T18:28:21Z"
    phase: Failed
  version: 3
Anything else we need to know?:
Environment:
- Volcano Version:
  image: volcanosh/vc-controller-manager:latest
  imageID: docker-pullable://volcanosh/vc-controller-manager@sha256:897793abe922641b47acc2c872209078346e855a3453a0d35dfb573812ae64b9
- Kubernetes version (use kubectl version):
  Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"archive", BuildDate:"2021-07-16T17:16:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
  Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.16-eks-7737de", GitCommit:"7737de131e58a68dda49cdd0ad821b4cb3665ae8", GitTreeState:"clean", BuildDate:"2021-03-10T21:33:25Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: AWS EKS
- OS (e.g. from /etc/os-release): Arch Linux
- Kernel (e.g. uname -a): Linux tristanr-arch2 5.12.15-arch1-1 #1 SMP PREEMPT Wed, 07 Jul 2021 23:35:29 +0000 x86_64 GNU/Linux
- Install tools:
- Others: