Skip to content

[TE] fix deadlock in handshake connection setup#1762

Closed
alogfans wants to merge 4 commits intokvcache-ai:mainfrom
alogfans:fix-handshake-deadlock
Closed

[TE] fix deadlock in handshake connection setup#1762
alogfans wants to merge 4 commits intokvcache-ai:mainfrom
alogfans:fix-handshake-deadlock

Conversation

@alogfans
Copy link
Copy Markdown
Collaborator

Description

Release lock before calling sendHandshake RPC and getSegmentDescByName to avoid deadlock when RPC framework needs to access the same endpoint. Re-acquire lock after RPC call and check connection status again to prevent concurrent connection establishment.

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

alogfans and others added 2 commits March 29, 2026 03:28
Release lock before calling sendHandshake RPC and getSegmentDescByName
to avoid deadlock when RPC framework needs to access the same endpoint.
Re-acquire lock after RPC call and check connection status again to
prevent concurrent connection establishment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the connection setup logic in both EFA and RDMA transports to reduce the duration for which the RWSpinlock is held, specifically moving blocking handshake calls outside the critical section. It also introduces double-checked locking to prevent race conditions during connection finalization. Feedback indicates that these changes introduced potential data races on peer_nic_path_ because it is accessed outside the lock scope; the reviewer suggests using local copies of this member variable to ensure thread safety.

alogfans and others added 2 commits March 29, 2026 03:33
Copy peer_nic_path_ to local variable inside lock scope to avoid
data race when setPeerNicPath is called concurrently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 0% with 60 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ine/src/transport/rdma_transport/rdma_endpoint.cpp 0.00% 60 Missing ⚠️

📢 Thoughts on this report? Let us know!

@alogfans
Copy link
Copy Markdown
Collaborator Author

Closed as #1733 merged.

@alogfans alogfans closed this Mar 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants