net/rds: RDS connection does not reconnect after CQ access violation error
The sequence that leads to this state is as follows.
1) First, a CQ error is logged.
Sep 29 22:32:33 dm54cel14 kernel: [471472.784371] mlx4_core
0000:46:00.0: CQ access violation on CQN 000419 syndrome=0x2
vendor_error_syndrome=0x0
2) That is followed by the drop of the associated RDS connection.
Sep 29 22:32:33 dm54cel14 kernel: [471472.784403] RDS/IB: connection
<192.168.54.43,192.168.54.1,0> dropped due to 'qp event'
3) After that, no WR_FLUSH_ERRs arrive for the posted receive buffers.
4) RDS is stuck in rds_ib_conn_shutdown while shutting down that connection.
crash64> bt 62577
PID: 62577 TASK: ffff88143f045400 CPU: 4 COMMAND: "kworker/u224:1"
#0 [ffff8813663bbb58] __schedule at ffffffff816ab68b
#1 [ffff8813663bbbb0] schedule at ffffffff816abca7
#2 [ffff8813663bbbd0] schedule_timeout at ffffffff816aee71
#3 [ffff8813663bbc80] rds_ib_conn_shutdown at ffffffffa041f7d1 [rds_rdma]
#4 [ffff8813663bbd10] rds_conn_shutdown at ffffffffa03dc6e2 [rds]
#5 [ffff8813663bbdb0] rds_shutdown_worker at ffffffffa03e2699 [rds]
#6 [ffff8813663bbe00] process_one_work at ffffffff8109cda1
#7 [ffff8813663bbe50] worker_thread at ffffffff8109d92b
#8 [ffff8813663bbec0] kthread at ffffffff810a304b
#9 [ffff8813663bbf50] ret_from_fork at ffffffff816b0752
crash64>
It was stuck forever in this loop in rds_ib_conn_shutdown:
	/* quiesce tx and rx completion before tearing down */
	while (!wait_event_timeout(rds_ib_ring_empty_wait,
				   rds_ib_ring_empty(&ic->i_recv_ring) &&
				   (atomic_read(&ic->i_signaled_sends) == 0),
				   msecs_to_jiffies(5000))) {
		/* Try to reap pending RX completions every 5 secs */
		if (!rds_ib_ring_empty(&ic->i_recv_ring)) {
			spin_lock_bh(&ic->i_rx_lock);
			rds_ib_rx(ic);
			spin_unlock_bh(&ic->i_rx_lock);
		}
	}
The recv ring was not empty, so the wait condition never became true:
w_alloc_ptr = 560
w_free_ptr = 256
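As a rough illustration of what those two pointers imply, here is a
minimal userspace sketch of ring occupancy. It is not the kernel's ring
code: struct ring_model, its fields, and the ring size used in the note
below are assumptions made for the example, and the sketch treats equal
pointers as "empty", which a pointer-only model cannot distinguish from
"full".

```c
#include <assert.h>

/* Hypothetical userspace model of a work ring; not the kernel struct. */
struct ring_model {
	unsigned int nr;        /* total number of slots (assumed value) */
	unsigned int alloc_ptr; /* next slot to be handed out for posting */
	unsigned int free_ptr;  /* next slot expected to complete */
};

/* Slots posted but not yet completed, accounting for wrap-around. */
static unsigned int ring_model_used(const struct ring_model *r)
{
	assert(r->nr > 0);
	assert(r->alloc_ptr < r->nr && r->free_ptr < r->nr);
	return (r->alloc_ptr + r->nr - r->free_ptr) % r->nr;
}

/* Equal pointers are treated as empty in this simplified model. */
static int ring_model_empty(const struct ring_model *r)
{
	return ring_model_used(r) == 0;
}
```

With alloc_ptr = 560 and free_ptr = 256, and assuming a ring larger than
560 slots (1024, say), ring_model_used() gives 304 receives that are
still posted and waiting for a completion.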
This is what Mellanox had to say:
When CQ moves to error (e.g. due to CQ Overrun, CQ Access violation) FW will
generate Async event to notify this error, also the QPs that tries to access
this CQ will be put to error state but will not be flushed since we must not
post CQEs to a broken CQ. The QP that tries to access will also issue an
Async catas event.
In summary, no more WR_FLUSH_ERRs will arrive in that state, so we cannot wait for them.
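The idea can be sketched in userspace as a quiesce loop that stops
waiting once the CQ is known to be in error. This is only an
illustration under stated assumptions, not the change made by this
commit: struct conn_model, the cq_in_error flag, reap_completions() and
quiesce() are all hypothetical names, and the round and budget limits
are arbitrary.

```c
#include <stdbool.h>

/* Hypothetical model of per-connection state; names are illustrative. */
struct conn_model {
	unsigned int recv_outstanding; /* posted receives not yet completed */
	unsigned int signaled_sends;   /* signaled sends not yet completed */
	bool cq_in_error;              /* assumed flag set on a CQ error event */
};

enum quiesce_result { QUIESCE_DRAINED, QUIESCE_ABANDONED, QUIESCE_TIMED_OUT };

static bool conn_model_drained(const struct conn_model *c)
{
	return c->recv_outstanding == 0 && c->signaled_sends == 0;
}

/* Hypothetical reaper: retires up to 'budget' completions in total.
 * A CQ in error delivers none, so nothing is retired in that case. */
static void reap_completions(struct conn_model *c, unsigned int budget)
{
	unsigned int i;

	if (c->cq_in_error)
		return;
	for (i = 0; i < budget && c->recv_outstanding > 0; i++)
		c->recv_outstanding--;
	for (; i < budget && c->signaled_sends > 0; i++)
		c->signaled_sends--;
}

/* Wait for outstanding work to drain, but give up as soon as the CQ is
 * known to be broken instead of retrying indefinitely. */
static enum quiesce_result quiesce(struct conn_model *c,
				   unsigned int max_rounds,
				   unsigned int budget)
{
	unsigned int round;

	for (round = 0; round < max_rounds; round++) {
		if (conn_model_drained(c))
			return QUIESCE_DRAINED;
		if (c->cq_in_error)
			return QUIESCE_ABANDONED; /* no flushes will ever arrive */
		reap_completions(c, budget);
	}
	return conn_model_drained(c) ? QUIESCE_DRAINED : QUIESCE_TIMED_OUT;
}
```

In this model a connection with 304 outstanding receives and
cq_in_error set returns QUIESCE_ABANDONED on the first round with its
counters untouched, while the same connection on a healthy CQ drains
normally.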
Orabug: 29180452
Reviewed-by: Rama Nichanamatlu <[email protected]>
Signed-off-by: Venkat Venkatsubra <[email protected]>