Description
【Task node】
http://10.73.37.43:8900/fileview.html?path=/home/disk1/normandy/maybach/657903/
receiver: yq01-idl-gpu-offline14.yq01.baidu.com:8091 # receiver for the V2 version
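【Questions】
1. server.log and train.log disagree: server.log reports that paddle_pserver2 was killed, while train.log keeps printing new passes and batches. Is the job actually running normally?
2. In V2, can the intermediate parameters be pulled the same way as in V1, with something like wget ${main_node_ip}:8099/output/xxx? Fetching them this way currently fails.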
【server.log】
Thu Mar 8 14:34:22 2018[1,28]:./start_server.sh: line 34: 28618 Killed GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_pserver2 --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --nics=${nics} ${server_arg} --rdma_tcp=${rdma_tcp} --use_gpu=0 --comment=$comment
Thu Mar 8 14:34:22 2018[1,28]:+ check_return 'paddle_pserver2 failed'
Thu Mar 8 14:34:22 2018[1,28]:+ '[' 137 -ne 0 ']'
Thu Mar 8 14:34:22 2018[1,28]:+ echo '[./start_server.sh : 35] [main]'
Thu Mar 8 14:34:22 2018[1,28]:[./start_server.sh : 35] [main]
Thu Mar 8 14:34:22 2018[1,28]:+ echo '[FATAL]: paddle_pserver2 failed'
Thu Mar 8 14:34:22 2018[1,28]:[FATAL]: paddle_pserver2 failed
Thu Mar 8 14:34:22 2018[1,28]:+ get_stack
Thu Mar 8 14:34:22 2018[1,28]:+ set +x
Thu Mar 8 14:34:22 2018[1,28]:
Thu Mar 8 14:34:22 2018[1,28]:*****
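A note on the `137` tested by `check_return` above: by shell convention, an exit status above 128 means the process was terminated by signal number `status - 128`, so 137 corresponds to signal 9 (SIGKILL), which matches the `Killed` message. The sketch below only decodes that number; `describe_exit_status` is a hypothetical helper, not part of start_server.sh or Paddle. A SIGKILL is often sent by the kernel OOM killer or by a cluster scheduler, but the log above does not say which, so that part remains an assumption.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Decode a shell-style exit status (hypothetical helper for illustration)."""
    if status > 128:
        # Shell convention: 128 + N means "terminated by signal N".
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with code {status}"


# 137 is the status checked in the server.log excerpt above.
print(describe_exit_status(137))
```

If the kill came from the OOM killer, the kernel log on the affected node would normally record it; checking it there is a suggestion, not something shown in this issue.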
【train.log】
Thu Mar 8 15:15:49 2018[1,4]:Pass 48 Batch 5 Cost 0.422954053
Thu Mar 8 15:15:59 2018[1,4]:Pass 48 Batch 6 Cost 0.356798267
Thu Mar 8 15:15:59 2018[1,10]:Pass 48 Batch 10 Cost 0.402026196
Thu Mar 8 15:16:00 2018[1,29]:Pass 49 Batch 4 Cost 0.420187109
Thu Mar 8 15:16:01 2018[1,28]:Pass 48 Batch 5 Cost 0.402949292
Thu Mar 8 15:16:08 2018[1,9]:Pass 49 Batch 2 Cost 0.440044434
Thu Mar 8 15:16:14 2018[1,17]:Pass 49 Batch 8 Cost 0.427096777
Thu Mar 8 15:16:16 2018[1,13]:Pass 49 Batch 4 Cost 0.394396509
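To make the mismatch between the two logs concrete, here is a minimal sketch that parses one line from each excerpt and compares the timestamps. It assumes the `[1,28]` prefix is an mpirun-style `[job,rank]` tag and that the day and month names are in English; `parse_line` and `LINE_RE` are hypothetical names introduced for illustration.

```python
import re
from datetime import datetime

# Timestamp, then "[job,rank]:" tag, then the message (format assumed from the excerpts).
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})"
    r"\[(?P<job>\d+),(?P<rank>\d+)\]:(?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, rank, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
    return ts, int(m.group("rank")), m.group("msg")


# One line copied from each log excerpt in this issue.
server_line = "Thu Mar 8 14:34:22 2018[1,28]:[FATAL]: paddle_pserver2 failed"
train_line = "Thu Mar 8 15:16:01 2018[1,28]:Pass 48 Batch 5 Cost 0.402949292"

killed_at, server_rank, _ = parse_line(server_line)
seen_at, train_rank, msg = parse_line(train_line)
gap = seen_at - killed_at

print(f"rank {train_rank} logged '{msg}' {gap} after the pserver on rank {server_rank} failed")
```

In these excerpts the trainer tagged with rank 28 is still reporting batches about 41 minutes after the pserver with the same tag was killed, which is the inconsistency question 1 asks about.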