
RestartJob deletes failed pods on last try losing logs #1651

@d4l3k

Description


What happened:

When you specify the RestartJob action for the PodFailed event, the controller deletes all of the pods on the last retry even though there will be no more retries. This means there is no way to access the logs to figure out why the pods failed and debug the job.

# success case w/ PodFailed:RestartJob
sh-5tqgv-sh-0-0                               0/1     Completed          0          16s
# failure case without PodFailed:RestartJob
sh-jpfdn-sh-0-0                               0/1     Error              0          16s

# failure case with PodFailed:RestartJob
<nothing>

What you expected to happen:

The pods from the last retry should be left in place so their logs remain available for debugging.
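One way to express the expected behavior: pod cleanup triggered by a RestartJob action should be skipped once the retry budget is exhausted, because no further restart will actually happen. A minimal sketch in Go of that check (the type and function names here are hypothetical illustrations, not Volcano's actual controller code):

```go
package main

import "fmt"

// jobState is a hypothetical stand-in for the controller's view of a job.
type jobState struct {
	retryCount int32 // retries already consumed (cf. job.status.retryCount)
	maxRetry   int32 // retry budget (cf. job.spec.maxRetry)
}

// shouldDeletePodsOnRestart models the proposed check: only delete the
// failed pods if another retry will actually be attempted. On the final
// retry the pods are kept so their logs stay available for debugging.
func shouldDeletePodsOnRestart(j jobState) bool {
	return j.retryCount < j.maxRetry
}

func main() {
	// With maxRetry: 1 (as in the job spec below), the first failure
	// consumes the only retry, so the failed pods should be preserved.
	fmt.Println(shouldDeletePodsOnRestart(jobState{retryCount: 1, maxRetry: 1})) // false: keep pods
	fmt.Println(shouldDeletePodsOnRestart(jobState{retryCount: 0, maxRetry: 1})) // true: restarting, delete pods
}
```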

How to reproduce it (as minimally and precisely as possible):

Run the following job:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  creationTimestamp: "2021-07-30T18:28:15Z"
  generateName: sh-
  generation: 1
  name: sh-kbnmb
  namespace: default
  resourceVersion: "96243182"
  selfLink: /apis/batch.volcano.sh/v1alpha1/namespaces/default/jobs/sh-kbnmb
  uid: 13eaaf6e-66f3-434c-a021-623af82771d7
spec:
  maxRetry: 1
  minAvailable: 1
  plugins:
    env: []
    svc: []
  queue: test
  schedulerName: volcano
  tasks:
  - maxRetry: 1
    minAvailable: 1
    name: sh-0
    policies:
    - action: RestartJob
      event: PodEvicted
    - action: RestartJob
      event: PodFailed
    replicas: 1
    template:
      metadata: {}
      spec:
        containers:
        - command:
          - /bin/sh
          - -c
          - exit 1
          image: alpine:latest
          name: sh-0
          resources: {}
        restartPolicy: Never
status:
  minAvailable: 1
  retryCount: 1
  runningDuration: 6.652806914s
  state:
    lastTransitionTime: "2021-07-30T18:28:21Z"
    phase: Failed
  version: 3

Anything else we need to know?:

Environment:

  • Volcano Version:
    image: volcanosh/vc-controller-manager:latest
    imageID: docker-pullable://volcanosh/vc-controller-manager@sha256:897793abe922641b47acc2c872209078346e855a3453a0d35dfb573812ae64b9

  • Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"archive", BuildDate:"2021-07-16T17:16:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.16-eks-7737de", GitCommit:"7737de131e58a68dda49cdd0ad821b4cb3665ae8", GitTreeState:"clean", BuildDate:"2021-03-10T21:33:25Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: AWS EKS
  • OS (e.g. from /etc/os-release): Arch Linux
  • Kernel (e.g. uname -a): Linux tristanr-arch2 5.12.15-arch1-1 #1 SMP PREEMPT Wed, 07 Jul 2021 23:35:29 +0000 x86_64 GNU/Linux
  • Install tools:
  • Others:

Metadata


Labels

    area/controllers
    help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
    kind/bug: Categorizes issue or PR as related to a bug.
    lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
