Fix Several Sources of Frequent Lock Contention #1688

cassandras-lies · 2025-10-14T11:48:05Z

The first commit addresses banned nodes being removed from our internal database 10 times per second. Each removal acquired cs_main, causing near-constant contention on it. It is not at all necessary to do this that frequently: the cleanup only maintains internal data structures, and no actual validation logic depends on this watchdog. This commit changes the interval to 30s.
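
For illustration, the throttling idea can be sketched as a small time guard. This is a minimal sketch, not the code in the PR: the `Throttle` class and its `ShouldRun` method are hypothetical names, and the caller is assumed to pass in a millisecond timestamp such as the one `GetTimeMillis()` returns.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the PR): lets an action through at most
// once per interval, based on timestamps supplied by the caller.
class Throttle {
public:
    explicit Throttle(int64_t intervalMs) : intervalMs_(intervalMs) {}

    // Returns true if at least intervalMs_ has passed since the last
    // accepted call, and records nowMs as the new reference point.
    bool ShouldRun(int64_t nowMs) {
        if (nowMs - lastRunMs_ < intervalMs_) {
            return false;  // too soon: skip the expensive, lock-taking work
        }
        lastRunMs_ = nowMs;
        return true;
    }

private:
    int64_t intervalMs_;
    int64_t lastRunMs_ = 0;  // assumption: 0 means "never ran yet"
};
```

A worker loop would call `ShouldRun(now)` on each pass and only take the heavy lock when it returns true. Driven by a loop that ticks every 100ms, a 30000ms interval lets the cleanup through only a couple of times per minute instead of hundreds.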

The first part of the second commit copies only the relevant nodes, usually none, instead of copying thousands of nodes. This code runs every several seconds, so those allocations were extremely expensive, and they happened while cs_vNodes was held, no less.
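
A rough sketch of the "copy only what is relevant" idea follows. The `Node` struct and `TakeDisconnectedNodes` function are hypothetical stand-ins rather than the project's `CNode` or the PR's code, and the sketch erases the flagged entries while still holding the lock, which may differ from what the PR does.

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a peer object; only the flag matters here.
struct Node {
    int id;
    bool fDisconnect;
};

// Collects only the nodes flagged for disconnect and removes them from
// `nodes`. In the common case (nothing flagged) no allocation happens
// while the lock is held, because nothing is pushed into the result.
std::vector<Node*> TakeDisconnectedNodes(std::mutex& lock, std::vector<Node*>& nodes) {
    std::vector<Node*> disconnected;
    std::lock_guard<std::mutex> guard(lock);
    for (Node* node : nodes) {
        if (node->fDisconnect) {
            disconnected.push_back(node);
        }
    }
    if (!disconnected.empty()) {
        nodes.erase(std::remove_if(nodes.begin(), nodes.end(),
                                   [](const Node* node) { return node->fDisconnect; }),
                    nodes.end());
    }
    return disconnected;
}
```

The slower cleanup work (closing sockets, releasing references) can then run on the returned vector after the lock has been released.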

The second part of the second commit undoes a premature optimisation that copied out vNodes in a hot loop. The copying causes thousands of allocations and deallocations and runs continuously. The actual operations don't take nearly as long as the allocations do. This change simplifies the code to hold the lock the whole time.
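
As a sketch of iterating under the lock instead of snapshotting, consider the following. `Peer` and `PeerSet` are hypothetical names used only for illustration; the sketch uses a plain `std::mutex`, so the callback must not call back into the same `PeerSet`.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a connected peer.
struct Peer {
    int id;
    int handled;
};

class PeerSet {
public:
    void Add(Peer* peer) {
        std::lock_guard<std::mutex> guard(mutex_);
        peers_.push_back(peer);
    }

    // Calls fn on every peer while holding the lock. No snapshot vector is
    // allocated, so each pass costs no heap allocation or deallocation.
    // fn must not re-enter this PeerSet, because std::mutex is not recursive.
    template <typename Fn>
    void ForEach(Fn&& fn) {
        std::lock_guard<std::mutex> guard(mutex_);
        for (Peer* peer : peers_) {
            fn(*peer);
        }
    }

private:
    std::mutex mutex_;
    std::vector<Peer*> peers_;
};
```

The trade-off is that the lock is held for the whole pass, which is only a win when the per-peer work is short compared with the cost of copying, as the description argues is the case here.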

I haven't extensively tested these, but the changes are straightforward and do not change behaviour of the functions.

These three changes reduce lock contention from something that happens multiple times per second to something that happens every ten seconds or so; what remains comes from other places that are a bit harder to fix.

coderabbitai · 2025-10-14T11:48:37Z

Walkthrough

Throttled banned-node cleanup and longer idle sleep added to quorum signing shares; networking socket handling refactored to collect disconnected nodes under lock and perform cleanup phases outside the critical section, and message handling now iterates nodes without snapshot copying.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Quorum signing shares idle/cleanup throttling `src/llmq/quorums_signing_shares.cpp` | Added `lastRemoveBannedNodeStatesTime` and guard so `RemoveBannedNodeStates()` runs at most every 30s (checked via `GetTimeMillis()`); increased idle no-work sleep from 100ms to 1000ms; added explanatory comments. |
| Networking disconnect/refactor `src/net.cpp` | Reworked `ThreadSocketHandler` to collect only actually disconnected nodes while holding `cs_vNodes`, then perform removal, grant release, socket close, `Release`, and move to `vNodesDisconnected` outside the lock. Added a subsequent cleanup phase that waits for zero refs and calls `DeleteNode` when safe; reduced lock duration and changed lifecycle handling of disconnected nodes. Also altered message handler to iterate nodes via `ForEachNode(...)` instead of copying snapshots and adjusted sleep pacing. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant TSH as ThreadSocketHandler
  participant Mtx as cs_vNodes (lock)
  participant V as vNodes
  participant N as Node
  participant OS as Socket/OS

  rect rgba(230,240,255,0.5)
  note over TSH: Quickly identify disconnected nodes under lock
  TSH->>Mtx: Lock
  TSH->>V: Scan & collect disconnected nodes
  TSH-->>Mtx: Unlock
  end

  loop Disconnect processing (outside lock)
    TSH->>V: Remove from vNodes
    TSH->>N: Release outbound grants / state
    TSH->>OS: CloseSocketDisconnect
    TSH->>V: Add to vNodesDisconnected
  end

  loop Final cleanup
    TSH->>N: Wait for refcount == 0 & locks free
    TSH->>N: DeleteNode
  end

sequenceDiagram
  autonumber
  participant TMH as ThreadMessageHandler
  participant FN as ForEachNode(...)
  participant N as Node
  participant INT as interruptNet

  loop Main loop
    TMH->>FN: Iterate nodes (no snapshot)
    FN->>N: ProcessMessages(N)
    FN->>N: SendMessages(N)
    TMH->>INT: sleep_for(...)
  end

sequenceDiagram
  autonumber
  participant W as SigningSharesWorker
  participant T as GetTimeMillis()
  participant C as RemoveBannedNodeStates()

  loop Idle loop
    W->>T: now = GetTimeMillis()
    alt now - lastRemoveBannedNodeStatesTime >= 30s
      W->>C: RemoveBannedNodeStates()
      W->>W: lastRemoveBannedNodeStatesTime = now
    else
      Note over W: Skip cleanup this iteration
    end
    W->>W: Sleep 1000ms when no work
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pay attention to locking boundaries and invariants in src/net.cpp (especially cs_vNodes, vNodesDisconnected handling).
Verify reference-counting/wait-for-zero logic before DeleteNode.
Check timing/ordering assumptions around CloseSocketDisconnect, Release, and outbound grants.
Confirm GetTimeMillis() usage and 30s throttle logic in src/llmq/quorums_signing_shares.cpp.
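
To make the wait-for-zero check concrete, here is a minimal sketch of deleting disconnected peers only once nothing references them. `PeerRecord` and `DeleteUnreferenced` are hypothetical names, not the project's `CNode` or `DeleteNode`, and the sketch covers only the reference count, not the additional "locks free" condition.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <list>
#include <memory>

// Hypothetical stand-in for a disconnected peer awaiting deletion.
struct PeerRecord {
    explicit PeerRecord(int initialRefs) : refCount(initialRefs) {}
    std::atomic<int> refCount;
};

// Destroys every record whose reference count has dropped to zero and
// returns how many were destroyed. Records still in use stay in the list
// and are reconsidered on the next pass.
std::size_t DeleteUnreferenced(std::list<std::unique_ptr<PeerRecord>>& disconnected) {
    const std::size_t before = disconnected.size();
    disconnected.remove_if([](const std::unique_ptr<PeerRecord>& peer) {
        return peer->refCount.load() <= 0;
    });
    return before - disconnected.size();
}
```

A reviewer checking the real code would want to confirm that no thread can take a new reference between the zero check and the deletion.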

Suggested reviewers

levonpetrosyan93

Poem

I twitch my whiskers, guard the net by night,
Bans are swept but only when the time is right.
Longer naps between each busy hop,
Locks are light, the nodes can drop.
A rabbit nods — connections hum and stop. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title 'Fix Several Sources of Frequent Lock Contention' is concise, clear, and directly summarizes the main objective of the changes. It accurately reflects the primary goal of the PR—reducing lock contention in multiple parts of the codebase—which aligns well with the substantial changes made to quorums_signing_shares.cpp and net.cpp. |
| Description check | ✅ Passed | The pull request description adequately covers the PR intention by explaining what each commit addresses: the banned-node cleanup throttling (commit 1), the optimization of node copying to reduce allocations (commit 2 part 1), and the removal of premature optimization in the hot loop (commit 2 part 2). The description provides architectural context about why these changes reduce lock contention and includes a note about testing and behavior preservation. While the description does not strictly follow the provided template sections ('PR intention' and 'Code changes brief'), it contains all the essential information needed to understand the purpose and impact of the changes. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 22bd437 and fbcf197.

📒 Files selected for processing (2)

src/llmq/quorums_signing_shares.cpp (3 hunks)
src/net.cpp (1 hunks)

🔇 Additional comments (1)

src/net.cpp (1)

1363-1389: Disconnect refactor keeps semantics intact

Copying just the fDisconnect nodes before erasing them from vNodes trims the time under cs_vNodes while preserving the release/cleanup sequence. Looks good.

justanwar · 2025-10-14T18:02:35Z

This breaks RPC tests in GitHub Actions (Linux CMake/Autotools) and Jenkins.

aleflm · 2025-10-23T17:44:35Z

src/llmq/quorums_signing_shares.cpp

        // TODO Wakeup when pending signing is needed?
        if (!didWork) {
-            if (!workInterrupt.sleep_for(std::chrono::milliseconds(100))) {
+            if (!workInterrupt.sleep_for(std::chrono::milliseconds(1000))) {


Overall, the PR looks correct to me, but I suspect that 1000ms is too high for this sleep_for. Maybe something like 250ms would be more reasonable. This is just a suggestion, since it is not going to make a huge difference.

Overall, I don't see any specific issues with the PR.
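
For context on the trade-off being discussed, an interruptible sleep can be sketched as below. This is a hedged illustration, not the project's `workInterrupt` implementation: `InterruptibleSleeper` is a hypothetical name, and the return convention (true when the full duration elapsed, false when interrupted) is an assumption based on the `if (!workInterrupt.sleep_for(...))` check in the diff above.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of a sleep that another thread can cut short.
class InterruptibleSleeper {
public:
    // Returns true if the full duration elapsed, false if Interrupt()
    // was called before or during the wait.
    bool SleepFor(std::chrono::milliseconds duration) {
        std::unique_lock<std::mutex> lock(mutex_);
        return !cv_.wait_for(lock, duration, [this] { return interrupted_; });
    }

    // Wakes any sleeper and makes all later sleeps return immediately.
    void Interrupt() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            interrupted_ = true;
        }
        cv_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool interrupted_ = false;
};
```

With a design like this, the sleep length bounds how long newly arrived work can wait when nothing wakes the thread early, which is why the choice between 250ms and 1000ms is a latency-versus-wakeups trade-off rather than a correctness issue.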

reubenyap · 2025-11-03T10:12:22Z

@coderabbitai full review

coderabbitai · 2025-11-03T10:12:35Z

✅ Actions performed

Full review triggered.

cassandras-lies changed the title ~~Fix Several Sources Frequent Lock Contention~~ Fix Several Sources of Frequent Lock Contention Oct 14, 2025

cassandras-lies added 2 commits October 23, 2025 07:57

Stop copying vNodes dozens of times per second in CConnman::ThreadSoc…

158d43a

…ketHandler

Make CSigSharesManager::WorkThreadMain run only every 1s rather than …

fbcf197

…ever 100ms, and call RemoveBannedNodesStates() only once every 30s. This fixes near constant contention of cs_main.

cassandras-lies force-pushed the lock-contention branch from 735a854 to fbcf197 Compare October 23, 2025 09:02

reubenyap requested review from aleflm, levonpetrosyan93 and psolstice October 23, 2025 17:13

aleflm reviewed Oct 23, 2025

4 participants
