
RestartJob deletes failed pods on last try losing logs #1651

@d4l3k

Description


What happened:

When you specify the RestartJob action for the PodFailed event, the controller deletes all of the pods on the last retry even though there will be no more retries. This means there is no way to access the logs to figure out why the pods failed and debug the job.

# success case w/ PodFailed:RestartJob
sh-5tqgv-sh-0-0                               0/1     Completed          0          16s
# failure case without PodFailed:RestartJob
sh-jpfdn-sh-0-0                               0/1     Error              0          16s

# failure case with PodFailed:RestartJob
<nothing>

What you expected to happen:

The pods from the last retry should be left in place so their logs remain available for debugging.
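One way to express the expected behavior: pod cleanup triggered by a RestartJob action should be skipped once the retry budget is exhausted, because no further restart will actually happen. A minimal sketch in Go of that check (the type and function names here are hypothetical illustrations, not Volcano's actual controller code):

```go
package main

import "fmt"

// jobState is a hypothetical stand-in for the controller's view of a job.
type jobState struct {
	retryCount int32 // retries already consumed (cf. job.status.retryCount)
	maxRetry   int32 // retry budget (cf. job.spec.maxRetry)
}

// shouldDeletePodsOnRestart models the proposed check: only delete the
// failed pods if another retry will actually be attempted. On the final
// retry the pods are kept so their logs stay available for debugging.
func shouldDeletePodsOnRestart(j jobState) bool {
	return j.retryCount < j.maxRetry
}

func main() {
	// With maxRetry: 1 (as in the job spec below), the first failure
	// consumes the only retry, so the failed pods should be preserved.
	fmt.Println(shouldDeletePodsOnRestart(jobState{retryCount: 1, maxRetry: 1})) // false: keep pods
	fmt.Println(shouldDeletePodsOnRestart(jobState{retryCount: 0, maxRetry: 1})) // true: restarting, delete pods
}
```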

How to reproduce it (as minimally and precisely as possible):

Run the following job:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  creationTimestamp: "2021-07-30T18:28:15Z"
  generateName: sh-
  generation: 1
  name: sh-kbnmb
  namespace: default
  resourceVersion: "96243182"
  selfLink: /apis/batch.volcano.sh/v1alpha1/namespaces/default/jobs/sh-kbnmb
  uid: 13eaaf6e-66f3-434c-a021-623af82771d7
spec:
  maxRetry: 1
  minAvailable: 1
  plugins:
    env: []
    svc: []
  queue: test
  schedulerName: volcano
  tasks:
  - maxRetry: 1
    minAvailable: 1
    name: sh-0
    policies:
    - action: RestartJob
      event: PodEvicted
    - action: RestartJob
      event: PodFailed
    replicas: 1
    template:
      metadata: {}
      spec:
        containers:
        - command:
          - /bin/sh
          - -c
          - exit 1
          image: alpine:latest
          name: sh-0
          resources: {}
        restartPolicy: Never
status:
  minAvailable: 1
  retryCount: 1
  runningDuration: 6.652806914s
  state:
    lastTransitionTime: "2021-07-30T18:28:21Z"
    phase: Failed
  version: 3

Anything else we need to know?:

Environment:

  • Volcano Version:
    image: volcanosh/vc-controller-manager:latest
    imageID: docker-pullable://volcanosh/vc-controller-manager@sha256:897793abe922641b47acc2c872209078346e855a3453a0d35dfb573812ae64b9

  • Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"archive", BuildDate:"2021-07-16T17:16:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.16-eks-7737de", GitCommit:"7737de131e58a68dda49cdd0ad821b4cb3665ae8", GitTreeState:"clean", BuildDate:"2021-03-10T21:33:25Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: AWS EKS
  • OS (e.g. from /etc/os-release): Arch Linux
  • Kernel (e.g. uname -a): Linux tristanr-arch2 5.12.15-arch1-1 #1 SMP PREEMPT Wed, 07 Jul 2021 23:35:29 +0000 x86_64 GNU/Linux
  • Install tools:
  • Others:

Metadata


Labels

    area/controllers
    help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
    kind/bug: Categorizes issue or PR as related to a bug.
    lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
