Is this a BUG REPORT or FEATURE REQUEST?:
Uncomment only one, leave it on its own line:
/kind bug
What happened:
I created the tf-sample job as a test and got the following pod status:
tensorflow-benchmark-ps-0 0/1 CrashLoopBackOff 5 3m46s
tensorflow-benchmark-worker-0 1/1 Running 0 3m46s
tensorflow-benchmark-worker-1 1/1 Running 0 3m46s
However, the job status is still Running:
[root@node1 tf-sample]# kubectl describe jobs.batch.volcano.sh tensorflow-benchmark
Name: tensorflow-benchmark
Namespace: default
Labels: volcano.sh/job-type=Tensorflow
Annotations: <none>
API Version: batch.volcano.sh/v1alpha1
Kind: Job
Metadata:
Creation Timestamp: 2019-09-05T03:24:41Z
Generation: 1
Resource Version: 17317553
Self Link: /apis/batch.volcano.sh/v1alpha1/namespaces/default/jobs/tensorflow-benchmark
UID: b3189fdd-cf8c-11e9-84e3-6c92bf8b7a92
Spec:
Min Available: 3
Plugins:
Env:
Svc:
Policies:
Action: RestartJob
Event: PodEvicted
Queue: default
Scheduler Name: volcano
Tasks:
Name: ps
Replicas: 1
Template:
Spec:
Containers:
Command:
sh
-c
PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | tr "\n" ","`;
WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | tr "\n" ","`;
python tf_cnn_benchmarks1.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=1 --local_parameter_device=cpu --device=cpu --data_format=NHWC --job_name=ps --task_index=${VK_TASK_INDEX} --ps_hosts=${PS_HOST} --worker_hosts=${WORKER_HOST}
Image: volcanosh/example-tf:0.0.1
Name: tensorflow
Ports:
Container Port: 2222
Name: tfjob-port
Resources:
Limits:
Cpu: 1000m
Memory: 2048Mi
Requests:
Cpu: 1000m
Memory: 2048Mi
Working Dir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
Image Pull Secrets:
Name: default-secret
Restart Policy: OnFailure
Name: worker
Policies:
Action: CompleteJob
Event: TaskCompleted
Replicas: 2
Template:
Spec:
Containers:
Command:
sh
-c
PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | tr "\n" ","`;
WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | tr "\n" ","`;
python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=1 --local_parameter_device=cpu --device=cpu --data_format=NHWC --job_name=worker --task_index=${VK_TASK_INDEX} --ps_hosts=${PS_HOST} --worker_hosts=${WORKER_HOST}
Image: volcanosh/example-tf:0.0.1
Name: tensorflow
Ports:
Container Port: 2222
Name: tfjob-port
Resources:
Limits:
Cpu: 2000m
Memory: 4096Mi
Requests:
Cpu: 2000m
Memory: 2048Mi
Working Dir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
Image Pull Secrets:
Name: default-secret
Restart Policy: OnFailure
Status:
Controlled Resources:
Plugin - Env: env
Plugin - Svc: svc
Min Available: 3
Running: 3
State:
Last Transition Time: 2019-09-05T03:24:44Z
Phase: Running
Events: <none>
What you expected to happen:
When one task fails, the job should move to the Failed status.
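To make the expectation concrete, here is a minimal sketch of the check I have in mind. It is not Volcano's controller logic; `failing_pods`, `expected_job_phase`, and the set of statuses treated as failing are hypothetical names and assumptions of mine, and the input is simply the pod listing pasted above.

```python
from typing import List

# Assumption: these STATUS values from `kubectl get pods` count as "failing".
FAILING_STATUSES = {"CrashLoopBackOff", "Error"}


def failing_pods(kubectl_output: str) -> List[str]:
    """Return the names of pods whose STATUS column is a failing status."""
    failing = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Expected columns: NAME READY STATUS RESTARTS AGE
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[2]
        if status in FAILING_STATUSES:
            failing.append(name)
    return failing


def expected_job_phase(kubectl_output: str) -> str:
    """Phase expected in this report: Failed if any task pod is failing."""
    return "Failed" if failing_pods(kubectl_output) else "Running"


# Pod listing copied from the "What happened" section of this report.
PODS = """\
tensorflow-benchmark-ps-0 0/1 CrashLoopBackOff 5 3m46s
tensorflow-benchmark-worker-0 1/1 Running 0 3m46s
tensorflow-benchmark-worker-1 1/1 Running 0 3m46s
"""

if __name__ == "__main__":
    print(failing_pods(PODS))
    print(expected_job_phase(PODS))
```

With the listing above, the sketch reports the ps pod as failing, which is why I expected the job phase to be Failed rather than Running.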
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Volcano Version:
- Kubernetes version (use kubectl version):
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others: