[TE] fix deadlock in handshake connection setup #1762
alogfans wants to merge 4 commits into kvcache-ai:main
Release lock before calling sendHandshake RPC and getSegmentDescByName to avoid deadlock when RPC framework needs to access the same endpoint. Re-acquire lock after RPC call and check connection status again to prevent concurrent connection establishment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code Review
This pull request refactors the connection setup logic in both EFA and RDMA transports to reduce the duration for which the RWSpinlock is held, specifically moving blocking handshake calls outside the critical section. It also introduces double-checked locking to prevent race conditions during connection finalization. Feedback indicates that these changes introduced potential data races on peer_nic_path_ because it is accessed outside the lock scope; the reviewer suggests using local copies of this member variable to ensure thread safety.
Copy peer_nic_path_ to local variable inside lock scope to avoid data race when setPeerNicPath is called concurrently. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
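The local-copy fix described in this commit can be sketched as follows. This is a minimal illustration, not the Mooncake code: `PeerPathSketch` and `handshakeWithSnapshot` are hypothetical names, and `std::mutex` stands in for the RWSpinlock used by the real transports. Only `peer_nic_path_` and `setPeerNicPath` are taken from the PR text.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical stand-in for an endpoint that owns peer_nic_path_.
class PeerPathSketch {
public:
    void setPeerNicPath(const std::string& path) {
        std::lock_guard<std::mutex> guard(lock_);
        peer_nic_path_ = path;
    }

    // Copies peer_nic_path_ while holding the lock, then runs the blocking
    // call (`rpc`, a placeholder for the handshake) on the copy only.
    template <typename Rpc>
    std::string handshakeWithSnapshot(Rpc&& rpc) {
        std::string local_path;
        {
            std::lock_guard<std::mutex> guard(lock_);
            local_path = peer_nic_path_;  // snapshot taken under the lock
        }
        // Lock is released here; the member is never read unlocked.
        rpc(local_path);
        return local_path;
    }

private:
    std::mutex lock_;
    std::string peer_nic_path_;
};
```

Because the unlocked section touches only `local_path`, a concurrent `setPeerNicPath` call can no longer race with the handshake's read of the member.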
Closed as #1733 merged.
Description
Release lock before calling sendHandshake RPC and getSegmentDescByName to avoid deadlock when RPC framework needs to access the same endpoint. Re-acquire lock after RPC call and check connection status again to prevent concurrent connection establishment.
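The pattern described above (drop the lock, make the blocking call, re-acquire, re-check) might look roughly like the sketch below. It is written under stated assumptions: `EndpointSketch`, `setupConnection`, `connected_`, and `finalize_count_` are hypothetical, the `handshake_` callback stands in for `sendHandshake` and `getSegmentDescByName`, and `std::mutex` replaces the project's RWSpinlock.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical endpoint; illustrates the locking pattern only.
class EndpointSketch {
public:
    explicit EndpointSketch(std::function<bool(const std::string&)> handshake)
        : handshake_(std::move(handshake)) {}

    bool setupConnection(const std::string& peer) {
        {
            std::lock_guard<std::mutex> guard(lock_);
            if (connected_) return true;  // already connected, nothing to do
        }  // lock released before the blocking call

        // The handshake runs unlocked, so a handler that re-enters this
        // endpoint (e.g. to query its state) cannot deadlock on lock_.
        if (!handshake_(peer)) return false;

        std::lock_guard<std::mutex> guard(lock_);
        if (connected_) return true;  // another thread finished first
        connected_ = true;
        ++finalize_count_;
        return true;
    }

    bool connected() const {
        std::lock_guard<std::mutex> guard(lock_);
        return connected_;
    }

    int finalizeCount() const {
        std::lock_guard<std::mutex> guard(lock_);
        return finalize_count_;
    }

private:
    mutable std::mutex lock_;
    std::function<bool(const std::string&)> handshake_;
    bool connected_ = false;
    int finalize_count_ = 0;
};
```

The second check after re-acquiring the lock is what keeps two threads that both passed the first check from finalizing the same connection twice.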
Module
- mooncake-transfer-engine
- mooncake-store
- mooncake-ep
- mooncake-integration
- mooncake-p2p-store
- mooncake-wheel
- mooncake-pg
- mooncake-rl

Type of Change
How Has This Been Tested?
Checklist
./scripts/code_format.sh before submitting.