paddle_pserver2 Connection reset  #8876

@adrianhust

Hi, the Paddle parameter server (paddle_pserver2) fails with the error "Connection reset by peer". The launch command and the job log are below:

Thu Mar 8 16:54:29 2018[1,77]:+ ./paddle_pserver2 --num_gradient_servers=100 --nics=xgbe0 --port=7165 --ports_num=1 --ports_num_for_sparse=1 --rdma_tcp=tcp --comment=paddle_cluster_job
Thu Mar 8 17:17:08 2018[1,69]:F0308 17:17:08.399897 46342 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,73]:F0308 17:17:08.404299 36768 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,70]:F0308 17:17:08.399475 8739 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,72]:F0308 17:17:08.403358 10075 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,71]:F0308 17:17:08.402631 7730 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,76]:F0308 17:17:08.409656 16576 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,83]:F0308 17:17:08.398970 10447 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,87]:F0308 17:17:08.399874 1616 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,88]:F0308 17:17:08.406509 16449 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,86]:F0308 17:17:08.401118 27746 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,75]:F0308 17:17:08.399859 11851 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,91]:F0308 17:17:08.401727 18870 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,84]:F0308 17:17:08.396749 11437 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,80]:F0308 17:17:08.401507 32596 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,90]:F0308 17:17:08.405221 27163 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,99]:F0308 17:17:08.402868 16889 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,79]:F0308 17:17:08.400950 35561 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,85]:F0308 17:17:08.403419 46104 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,74]:F0308 17:17:08.402575 24915 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,93]:F0308 17:17:08.401983 8600 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,89]:F0308 17:17:08.401999 36127 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,82]:F0308 17:17:08.402421 14384 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,94]:F0308 17:17:08.400952 3120 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,97]:F0308 17:17:08.401190 17058 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,95]:F0308 17:17:08.404366 37113 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,81]:F0308 17:17:08.403667 41279 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,77]:F0308 17:17:08.408119 22789 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,78]:F0308 17:17:08.403462 47509 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,98]:F0308 17:17:08.403429 12925 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,96]:F0308 17:17:08.402117 17866 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,73]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,73]:F0308 17:17:08.411231 41691 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,73]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,76]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,76]:F0308 17:17:08.416602 21110 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,76]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,72]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,72]:F0308 17:17:08.410295 16444 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,72]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,99]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,99]:F0308 17:17:08.409775 18855 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,99]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,84]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,84]:F0308 17:17:08.403645 18169 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,84]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,80]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,80]:F0308 17:17:08.408404 37644 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,80]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,71]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,71]:F0308 17:17:08.409564 15520 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,71]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,82]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,82]:F0308 17:17:08.409303 18967 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,82]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,73]: @ 0x8d19fd google::LogMessage::Fail()
Thu Mar 8 17:17:08 2018[1,73]: @ 0x8d19fd google::LogMessage::Fail()
Thu Mar 8 17:17:08 2018[1,88]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,88]:F0308 17:17:08.413429 24016 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,88]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,73]: @ 0x8d54ac google::LogMessage::SendToLog()
Thu Mar 8 17:17:08 2018[1,73]: @ 0x8d54ac google::LogMessage::SendToLog()
Thu Mar 8 17:17:08 2018[1,24]:./start_server.sh: line 33: 39014 Killed GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_pserver2 --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --nics=${nics} ${server_arg} --rdma_tcp=${rdma_tcp} --comment=$comment
Thu Mar 8 17:17:08 2018[1,24]:+ check_return 'paddle_pserver2 failed'
Thu Mar 8 17:17:08 2018[1,24]:+ '[' 137 -ne 0 ']'
Thu Mar 8 17:17:08 2018[1,72]: @ 0x8d19fd google::LogMessage::Fail()
Thu Mar 8 17:17:08 2018[1,72]: @ 0x8d19fd google::LogMessage::Fail()
Thu Mar 8 17:17:08 2018[1,24]:+ echo '[./start_server.sh : 34] [main]'
Thu Mar 8 17:17:08 2018[1,24]:[./start_server.sh : 34] [main]
Thu Mar 8 17:17:08 2018[1,24]:+ echo '[FATAL]: paddle_pserver2 failed'
Thu Mar 8 17:17:08 2018[1,24]:[FATAL]: paddle_pserver2 failed
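
To make the log above easier to read, here is a minimal sketch that groups the fatal `Connection reset by peer` lines by peer address. It assumes the bracketed prefix (for example `[1,69]`) is Open MPI's tagged-output `[job,rank]` marker; the function name `summarize_resets` and the regular expression are illustrative and not part of Paddle.

```python
import re
from collections import defaultdict

# Assumption: "[1,69]:" is an Open MPI tagged-output prefix of the form
# "[job,rank]:", followed by a glog FATAL line (starting with "F") that
# names the peer whose connection was reset.
PATTERN = re.compile(
    r"\[\d+,(\d+)\]:F\d{4} .*peer=([\d.]+): Connection reset by peer"
)


def summarize_resets(lines):
    """Map each peer IP to the sorted list of ranks that reported a reset from it."""
    ranks_by_peer = defaultdict(set)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            rank, peer = int(match.group(1)), match.group(2)
            ranks_by_peer[peer].add(rank)
    return {peer: sorted(ranks) for peer, ranks in ranks_by_peer.items()}


# Three fatal lines copied from the log, plus one stack-trace banner
# that the pattern is expected to ignore.
sample = [
    "Thu Mar 8 17:17:08 2018[1,69]:F0308 17:17:08.399897 46342 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]",
    "Thu Mar 8 17:17:08 2018[1,73]:F0308 17:17:08.404299 36768 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]",
    "Thu Mar 8 17:17:08 2018[1,73]:F0308 17:17:08.411231 41691 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]",
    "Thu Mar 8 17:17:08 2018[1,73]:*** Check failure stack trace: ***",
]

print(summarize_resets(sample))
```

Run over the full log, this kind of summary shows at a glance that the resets involve only two peer addresses, 10.89.199.23 and 10.89.196.16, which may be a useful starting point for checking those hosts.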

[Attached image: mem2018-03-08 17-37-10]
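
The log also shows the `paddle_pserver2` process on rank 24 reported as `Killed`, with `check_return` seeing status 137. By shell convention, an exit status above 128 means the process was terminated by signal number `status - 128`. The sketch below decodes that value; the helper name `describe_exit_status` is hypothetical.

```python
import signal


def describe_exit_status(status):
    """Interpret a shell exit status.

    Shells report "killed by signal N" as 128 + N, so values above 128
    are decoded back into a signal name where one exists.
    """
    if status > 128:
        try:
            return f"terminated by {signal.Signals(status - 128).name}"
        except ValueError:
            # Not a signal number known on this platform.
            return f"terminated by unknown signal {status - 128}"
    return f"exited with code {status}"


print(describe_exit_status(137))  # 137 - 128 = 9
```

On Linux, signal 9 is SIGKILL. One common source of SIGKILL is the kernel's out-of-memory killer, although the log alone does not confirm that this is what happened here.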

Labels: User (used to mark user questions)
