When a pod is in CrashLoopBackOff, the job status should be Failed #436

@davidstack

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

/kind feature

What happened:
I created the tf-sample job as a test and got the following pod status:

tensorflow-benchmark-ps-0       0/1     CrashLoopBackOff   5          3m46s
tensorflow-benchmark-worker-0   1/1     Running            0          3m46s
tensorflow-benchmark-worker-1   1/1     Running            0          3m46s

But the job status is still Running:

[root@node1 tf-sample]# kubectl describe jobs.batch.volcano.sh tensorflow-benchmark
Name:         tensorflow-benchmark
Namespace:    default
Labels:       volcano.sh/job-type=Tensorflow
Annotations:  <none>
API Version:  batch.volcano.sh/v1alpha1
Kind:         Job
Metadata:
  Creation Timestamp:  2019-09-05T03:24:41Z
  Generation:          1
  Resource Version:    17317553
  Self Link:           /apis/batch.volcano.sh/v1alpha1/namespaces/default/jobs/tensorflow-benchmark
  UID:                 b3189fdd-cf8c-11e9-84e3-6c92bf8b7a92
Spec:
  Min Available:  3
  Plugins:
    Env:
    Svc:
  Policies:
    Action:        RestartJob
    Event:         PodEvicted
  Queue:           default
  Scheduler Name:  volcano
  Tasks:
    Name:      ps
    Replicas:  1
    Template:
      Spec:
        Containers:
          Command:
            sh
            -c
            PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | tr "\n" ","`;
WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | tr "\n" ","`;
python tf_cnn_benchmarks1.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=1 --local_parameter_device=cpu --device=cpu --data_format=NHWC --job_name=ps --task_index=${VK_TASK_INDEX} --ps_hosts=${PS_HOST} --worker_hosts=${WORKER_HOST}

          Image:  volcanosh/example-tf:0.0.1
          Name:   tensorflow
          Ports:
            Container Port:  2222
            Name:            tfjob-port
          Resources:
            Limits:
              Cpu:     1000m
              Memory:  2048Mi
            Requests:
              Cpu:      1000m
              Memory:   2048Mi
          Working Dir:  /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
        Image Pull Secrets:
          Name:          default-secret
        Restart Policy:  OnFailure
    Name:                worker
    Policies:
      Action:  CompleteJob
      Event:   TaskCompleted
    Replicas:  2
    Template:
      Spec:
        Containers:
          Command:
            sh
            -c
            PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | tr "\n" ","`;
WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | tr "\n" ","`;
python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=1 --local_parameter_device=cpu --device=cpu --data_format=NHWC --job_name=worker --task_index=${VK_TASK_INDEX} --ps_hosts=${PS_HOST} --worker_hosts=${WORKER_HOST}

          Image:  volcanosh/example-tf:0.0.1
          Name:   tensorflow
          Ports:
            Container Port:  2222
            Name:            tfjob-port
          Resources:
            Limits:
              Cpu:     2000m
              Memory:  4096Mi
            Requests:
              Cpu:      2000m
              Memory:   2048Mi
          Working Dir:  /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
        Image Pull Secrets:
          Name:          default-secret
        Restart Policy:  OnFailure
Status:
  Controlled Resources:
    Plugin - Env:  env
    Plugin - Svc:  svc
  Min Available:   3
  Running:         3
  State:
    Last Transition Time:  2019-09-05T03:24:44Z
    Phase:                 Running
Events:                    <none>

What you expected to happen:

When one task fails, the job should move to the Failed status.

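To make the expectation concrete, here is a minimal sketch of the rule I have in mind, applied to the pod listing above. It is not Volcano's controller logic; the helper names (`parse_pods`, `expected_job_phase`) and the set of statuses treated as failed are my own assumptions for illustration.

```python
# Hypothetical helpers for illustration only; not part of Volcano or kubectl.

# Pod rows copied from the `kubectl get pods` output in this report.
POD_LISTING = """\
tensorflow-benchmark-ps-0       0/1     CrashLoopBackOff   5          3m46s
tensorflow-benchmark-worker-0   1/1     Running            0          3m46s
tensorflow-benchmark-worker-1   1/1     Running            0          3m46s
"""

# Statuses counted as a failed task in this sketch (an assumption).
FAILED_STATUSES = {"CrashLoopBackOff", "Error"}


def parse_pods(listing):
    """Return (name, status) pairs from `kubectl get pods`-style rows."""
    pods = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        pods.append((fields[0], fields[2]))
    return pods


def expected_job_phase(pods):
    """Phase this issue asks for: Failed as soon as any pod has failed."""
    if any(status in FAILED_STATUSES for _, status in pods):
        return "Failed"
    return "Running"


pods = parse_pods(POD_LISTING)
print(expected_job_phase(pods))  # prints "Failed"
```

With the listing from this report, the sketch yields Failed because of tensorflow-benchmark-ps-0, while the job itself currently reports Phase: Running.
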
How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Volcano Version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Labels: kind/bug, lifecycle/stale
