@cassandras-lies (Contributor) commented Oct 14, 2025

The first commit addresses banned-node cleanup. Banned nodes were being removed from our internal database 10 times per second, and each pass acquired cs_main, causing near-constant contention on that lock. It is not necessary to run this so frequently: the cleanup only maintains internal data structures, and no validation logic depends on this watchdog. This commit changes the interval to 30s.
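A minimal sketch of the throttling idea, not the actual diff. `CleanupThrottle` and `ShouldRun` are hypothetical names used only for illustration; the caller is assumed to pass in a millisecond timestamp such as the one returned by GetTimeMillis().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decides whether a periodic cleanup is due.
// The caller supplies the current time in milliseconds.
class CleanupThrottle {
public:
    explicit CleanupThrottle(int64_t intervalMs) : intervalMs_(intervalMs) {
        assert(intervalMs > 0);
    }

    // Returns true, and records nowMs as the last run, only if at least
    // intervalMs_ has elapsed since the previous run. Otherwise returns
    // false so the caller can skip the expensive, lock-taking cleanup.
    bool ShouldRun(int64_t nowMs) {
        if (nowMs - lastRunMs_ < intervalMs_) {
            return false;
        }
        lastRunMs_ = nowMs;
        return true;
    }

private:
    int64_t intervalMs_;
    int64_t lastRunMs_{0};  // 0 means "never ran", so the first check passes
};
```

In a worker loop, the cleanup call would be wrapped as `if (throttle.ShouldRun(now)) { /* cleanup */ }`, so the lock is taken at most once per interval regardless of how often the loop spins.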

The first part of the second commit copies only the relevant nodes, usually none, instead of copying thousands of nodes. This code runs every several seconds, so those allocations were extremely expensive, and they happened while cs_vNodes was held, no less.
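The pattern can be sketched as follows. This is an illustration under simplifying assumptions, not the code in src/net.cpp: `Node`, `NodeTable`, and `TakeDisconnected` are hypothetical names, and `std::shared_ptr` stands in for whatever reference counting the real node type uses.

```cpp
#include <algorithm>
#include <iterator>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a peer; only the disconnect flag matters here.
struct Node {
    int id;
    bool fDisconnect;
};

struct NodeTable {
    std::mutex cs;                             // plays the role of cs_vNodes
    std::vector<std::shared_ptr<Node>> nodes;  // plays the role of vNodes

    // Under the lock, copy out only the nodes flagged for disconnect and
    // erase them from the live list. In the common case nothing is flagged,
    // so the returned vector stays empty and never allocates.
    std::vector<std::shared_ptr<Node>> TakeDisconnected() {
        auto flagged = [](const std::shared_ptr<Node>& n) { return n->fDisconnect; };
        std::vector<std::shared_ptr<Node>> out;
        std::lock_guard<std::mutex> lock(cs);
        std::copy_if(nodes.begin(), nodes.end(), std::back_inserter(out), flagged);
        nodes.erase(std::remove_if(nodes.begin(), nodes.end(), flagged), nodes.end());
        return out;
    }
};
```

The caller can then close sockets and release the returned nodes after the lock has been dropped, which keeps the critical section proportional to a scan rather than to a full copy.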

The second part of the second commit undoes a premature optimisation that copied out vNodes in a hot loop. The copy causes thousands of allocations and deallocations and runs continuously, and the actual operations don't take nearly as long as the allocations do. The code now simply holds the lock the whole time.
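A rough sketch of iterating in place under the lock instead of snapshotting the list. `Peer`, `PeerSet`, and `ForEach` are hypothetical names for illustration only; a plain `std::mutex` is assumed, so the callback must not try to take the same lock again.

```cpp
#include <mutex>
#include <vector>

// Hypothetical stand-in for a peer with a counter of handled messages.
struct Peer {
    int id;
    int processed;
};

class PeerSet {
public:
    void Add(int id) {
        std::lock_guard<std::mutex> lock(cs_);
        peers_.push_back(Peer{id, 0});
    }

    // Visits every peer while holding the lock. No snapshot vector is
    // built, so each pass costs zero allocations; the trade-off is that
    // the lock is held for the duration of all callbacks.
    template <typename Fn>
    void ForEach(Fn&& fn) {
        std::lock_guard<std::mutex> lock(cs_);
        for (Peer& p : peers_) {
            fn(p);
        }
    }

private:
    std::mutex cs_;
    std::vector<Peer> peers_;
};
```

This only pays off when the per-node work is cheap relative to the copy, which is the claim made above for this loop.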

I haven't tested these extensively, but the changes are straightforward and do not change the behaviour of the functions.

Together, these three changes reduce lock contention from something that happens multiple times per second to something that happens every ten seconds or so. What remains comes from other places that are a bit harder to fix.

@cassandras-lies changed the title from "Fix Several Sources Frequent Lock Contention" to "Fix Several Sources of Frequent Lock Contention" on Oct 14, 2025
@coderabbitai (bot, Contributor) commented Oct 14, 2025

Walkthrough

Throttled banned-node cleanup and longer idle sleep added to quorum signing shares; networking socket handling refactored to collect disconnected nodes under lock and perform cleanup phases outside the critical section, and message handling now iterates nodes without snapshot copying.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Quorum signing shares idle/cleanup throttling: src/llmq/quorums_signing_shares.cpp | Added lastRemoveBannedNodeStatesTime and guard so RemoveBannedNodeStates() runs at most every 30s (checked via GetTimeMillis()); increased idle no-work sleep from 100ms to 1000ms; added explanatory comments. |
| Networking disconnect/refactor: src/net.cpp | Reworked ThreadSocketHandler to collect only actually disconnected nodes while holding cs_vNodes, then perform removal, grant release, socket close, Release, and move to vNodesDisconnected outside the lock. Added a subsequent cleanup phase that waits for zero refs and calls DeleteNode when safe; reduced lock duration and changed lifecycle handling of disconnected nodes. Also altered message handler to iterate nodes via ForEachNode(...) instead of copying snapshots and adjusted sleep pacing. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant TSH as ThreadSocketHandler
  participant Mtx as cs_vNodes (lock)
  participant V as vNodes
  participant N as Node
  participant OS as Socket/OS

  rect rgba(230,240,255,0.5)
  note over TSH: Quickly identify disconnected nodes under lock
  TSH->>Mtx: Lock
  TSH->>V: Scan & collect disconnected nodes
  TSH-->>Mtx: Unlock
  end

  loop Disconnect processing (outside lock)
    TSH->>V: Remove from vNodes
    TSH->>N: Release outbound grants / state
    TSH->>OS: CloseSocketDisconnect
    TSH->>V: Add to vNodesDisconnected
  end

  loop Final cleanup
    TSH->>N: Wait for refcount == 0 & locks free
    TSH->>N: DeleteNode
  end
sequenceDiagram
  autonumber
  participant TMH as ThreadMessageHandler
  participant FN as ForEachNode(...)
  participant N as Node
  participant INT as interruptNet

  loop Main loop
    TMH->>FN: Iterate nodes (no snapshot)
    FN->>N: ProcessMessages(N)
    FN->>N: SendMessages(N)
    TMH->>INT: sleep_for(...)
  end
sequenceDiagram
  autonumber
  participant W as SigningSharesWorker
  participant T as GetTimeMillis()
  participant C as RemoveBannedNodeStates()

  loop Idle loop
    W->>T: now = GetTimeMillis()
    alt now - lastRemoveBannedNodeStatesTime >= 30s
      W->>C: RemoveBannedNodeStates()
      W->>W: lastRemoveBannedNodeStatesTime = now
    else
      Note over W: Skip cleanup this iteration
    end
    W->>W: Sleep 1000ms when no work
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Pay attention to locking boundaries and invariants in src/net.cpp (especially cs_vNodes, vNodesDisconnected handling).
  • Verify reference-counting/wait-for-zero logic before DeleteNode.
  • Check timing/ordering assumptions around CloseSocketDisconnect, Release, and outbound grants.
  • Confirm GetTimeMillis() usage and 30s throttle logic in src/llmq/quorums_signing_shares.cpp.

Suggested reviewers

  • levonpetrosyan93

Poem

I twitch my whiskers, guard the net by night,
Bans are swept but only when the time is right.
Longer naps between each busy hop,
Locks are light, the nodes can drop.
A rabbit nods — connections hum and stop. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The pull request title 'Fix Several Sources of Frequent Lock Contention' is concise, clear, and directly summarizes the main objective of the changes. It accurately reflects the primary goal of the PR (reducing lock contention in multiple parts of the codebase), which aligns well with the substantial changes made to quorums_signing_shares.cpp and net.cpp. |
| Description check | ✅ Passed | The pull request description adequately covers the PR intention by explaining what each commit addresses: the banned-node cleanup throttling (commit 1), the optimization of node copying to reduce allocations (commit 2 part 1), and the removal of premature optimization in the hot loop (commit 2 part 2). The description provides architectural context about why these changes reduce lock contention and includes a note about testing and behavior preservation. While the description does not strictly follow the provided template sections ('PR intention' and 'Code changes brief'), it contains all the essential information needed to understand the purpose and impact of the changes. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 22bd437 and fbcf197.

📒 Files selected for processing (2)
  • src/llmq/quorums_signing_shares.cpp (3 hunks)
  • src/net.cpp (1 hunks)
🔇 Additional comments (1)
src/net.cpp (1)

1363-1389: Disconnect refactor keeps semantics intact

Copying just the fDisconnect nodes before erasing them from vNodes trims the time under cs_vNodes while preserving the release/cleanup sequence. Looks good.


@justanwar (Member)

This breaks RPC tests in GitHub Actions (Linux CMake/Autotools) and Jenkins.

…ever 100ms, and call RemoveBannedNodesStates() only once every 30s. This fixes near constant contention of cs_main.
  // TODO Wakeup when pending signing is needed?
  if (!didWork) {
-     if (!workInterrupt.sleep_for(std::chrono::milliseconds(100))) {
+     if (!workInterrupt.sleep_for(std::chrono::milliseconds(1000))) {
@aleflm (Contributor) commented Oct 23, 2025

Overall, the PR looks correct to me, but I suspect 1000ms is too high for this sleep_for. Something like 250ms may be more reasonable. Just a suggestion, since this is not going to make a huge difference.

Overall, I don't see any specific issues with the PR.
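For context on what the sleep interval affects, here is a sketch of an interruptible sleep built on a condition variable. `InterruptibleSleeper` is a hypothetical class, not the project's workInterrupt type; it only assumes the same convention as the `if (!workInterrupt.sleep_for(...))` check in the diff, where a false return means the sleep was interrupted.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical interruptible sleeper for illustration.
class InterruptibleSleeper {
public:
    // Returns true if the full duration elapsed, false if Interrupt()
    // was called before or during the wait.
    bool SleepFor(std::chrono::milliseconds duration) {
        std::unique_lock<std::mutex> lock(m_);
        // wait_for with a predicate returns the predicate's final value,
        // so a true result here means "interrupted".
        return !cv_.wait_for(lock, duration, [this] { return interrupted_; });
    }

    // Wakes any sleeper immediately and makes later sleeps return at once.
    void Interrupt() {
        {
            std::lock_guard<std::mutex> lock(m_);
            interrupted_ = true;
        }
        cv_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    bool interrupted_ = false;
};
```

Under this assumption, a longer interval does not delay shutdown, because an interrupt ends the wait early. It only delays how quickly an idle worker notices new work, which is the trade-off between 250ms and 1000ms.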

@reubenyap (Member)

@coderabbitai full review

@coderabbitai (bot, Contributor) commented Nov 3, 2025

✅ Actions performed

Full review triggered.
