-
-
Notifications
You must be signed in to change notification settings - Fork 364
Fix Several Sources of Frequent Lock Contention #1688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
WalkthroughThrottled banned-node cleanup and longer idle sleep added to quorum signing shares; networking socket handling refactored to collect disconnected nodes under lock and perform cleanup phases outside the critical section, and message handling now iterates nodes without snapshot copying. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant TSH as ThreadSocketHandler
participant Mtx as cs_vNodes (lock)
participant V as vNodes
participant N as Node
participant OS as Socket/OS
rect rgba(230,240,255,0.5)
note over TSH: Quickly identify disconnected nodes under lock
TSH->>Mtx: Lock
TSH->>V: Scan & collect disconnected nodes
TSH-->>Mtx: Unlock
end
loop Disconnect processing (outside lock)
TSH->>V: Remove from vNodes
TSH->>N: Release outbound grants / state
TSH->>OS: CloseSocketDisconnect
TSH->>V: Add to vNodesDisconnected
end
loop Final cleanup
TSH->>N: Wait for refcount == 0 & locks free
TSH->>N: DeleteNode
end
sequenceDiagram
autonumber
participant TMH as ThreadMessageHandler
participant FN as ForEachNode(...)
participant N as Node
participant INT as interruptNet
loop Main loop
TMH->>FN: Iterate nodes (no snapshot)
FN->>N: ProcessMessages(N)
FN->>N: SendMessages(N)
TMH->>INT: sleep_for(...)
end
sequenceDiagram
autonumber
participant W as SigningSharesWorker
participant T as GetTimeMillis()
participant C as RemoveBannedNodeStates()
loop Idle loop
W->>T: now = GetTimeMillis()
alt now - lastRemoveBannedNodeStatesTime >= 30s
W->>C: RemoveBannedNodeStates()
W->>W: lastRemoveBannedNodeStatesTime = now
else
Note over W: Skip cleanup this iteration
end
W->>W: Sleep 1000ms when no work
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (2)
🔇 Additional comments (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
This breaks RPC tests in Github Actions (Linux Cmake/Autotools) and Jenkins. |
…ever 100ms, and call RemoveBannedNodesStates() only once every 30s. This fixes near constant contention of cs_main.
735a854 to
fbcf197
Compare
| // TODO Wakeup when pending signing is needed? | ||
| if (!didWork) { | ||
| if (!workInterrupt.sleep_for(std::chrono::milliseconds(100))) { | ||
| if (!workInterrupt.sleep_for(std::chrono::milliseconds(1000))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, PR looks correct to me. But I suspect that 1000ms is too high for this sleep_for. Maybe something like 250ms is more reasonable. Just a suggestion. Since this is not going to make a huge difference.
Overall, I don't see any specific issues with the PR.
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
In the first commit, banned nodes were being removed from our internal database 10 times per second. In each instance, it acquired cs_main, caused near constant contention for that. It is not at all necessary to do it that frequently, this is just for internal data structures and such, and actual validation logic is not checked based on this watchdog. This changes that time to 30s.
The first part of the second commit only copies relevant codes, usually none, instead of copying thousands of nodes. This is run every several seconds, so those allocations were extremely expensive and happened while cs_vNodes was held no less.
The second part of the second commit undoes premature optimisation that copied out vNodes in a hot loop. The copying causes thousands of allocations and deallocations and is run continuously. The actual operations don't take nearly as long as the allocations do. This simplifies the code to simply hold the lock the whole time.
I haven't extensively tested these, but the changes are straightforward and do not change behaviour of the functions.
These three changes reduce lock contention from something happening multiple times per second to something happening every ten seconds or so, in other places which are a bit harder to fix.