Fix data races triggered by functional tests. #4247

Conversation

gabriel-bjg

Thread sanitizer triggers data races on functional tests. Steps to reproduce:

./configure --enable-debug --prefix=`pwd`/depends/x86_64-pc-linux-gnu/ --with-sanitizers=thread
make clean
make -j8
python3 test/functional/test_runner.py -j 4

Fixes:

  • Function CWallet::KeepKey requires locking as it has concurrent access to database and member nKeysLeftSinceAutoBackup.
  • Avoid data race when reading setInventoryTxToSend size by locking the read. If locking happens after the read, the size may change.
  • Lock cs_mnauth when reading verifiedProRegTxHash.
  • Make fRPCRunning atomic as it can be read/written from different threads simultaneously.
  • Make m_masternode_iqr_connection atomic as it can be read/written from different threads simultaneously.
  • Use a recursive mutex to synchronize concurrent access to quorumVvec.
  • Make m_masternode_connection atomic as it can be read/written from different threads simultaneously.
  • Make m_masternode_probe_connection atomic as it can be read/written from different threads simultaneously.
  • Use a recursive mutex for synchronized access to activeMasterNode.

@gabriel-bjg gabriel-bjg force-pushed the thread-sanitizer-fixes-functional-tests branch from ecb339b to 0a34a4b Compare July 13, 2021 18:43
@gabriel-bjg gabriel-bjg force-pushed the thread-sanitizer-fixes-functional-tests branch 2 times, most recently from 323dc8d to fe7bba9 Compare July 13, 2021 20:43
Member

@PastaPastaPasta PastaPastaPasta left a comment

utACK

@UdjinM6 UdjinM6 added this to the 18 milestone Jul 13, 2021

@UdjinM6 UdjinM6 left a comment

Thanks for finding and fixing these issues! 👍

A few notes:

  • it's better to avoid holding locks for too long to lower the risk of introducing deadlocks at some point (d876781)
  • we should always lock quorumVvecCs whenever we access quorumVvec now that it's GUARDED_BY(quorumVvecCs) (099909a)
  • running p2p_quorum_data.py revealed a similar issue on my machine, this time for skShare (da44228)
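For context, the GUARDED_BY annotation mentioned above ties a member to its mutex so that Clang's thread-safety analysis (`-Wthread-safety`) can flag unlocked accesses at compile time. A minimal sketch with illustrative names (in Dash Core the macro expands to Clang's `guarded_by` attribute; it is stubbed out here so the snippet compiles with any compiler):

```cpp
#include <mutex>

// In Dash Core, GUARDED_BY(x) expands to Clang's guarded_by attribute; with
// the real macro and -Wthread-safety, reading vvecValid without quorumVvecCs
// held becomes a compile-time warning. Stubbed out here for portability.
#define GUARDED_BY(x)

class Quorum {
public:
    bool HasVvec() {
        std::lock_guard<std::recursive_mutex> lock(quorumVvecCs); // always lock before access
        return vvecValid;
    }
    void SetVvec() {
        std::lock_guard<std::recursive_mutex> lock(quorumVvecCs);
        vvecValid = true;
    }
private:
    mutable std::recursive_mutex quorumVvecCs;
    bool vvecValid GUARDED_BY(quorumVvecCs) = false;
};
```

The point of the second note above is that once a member carries the annotation, every access path should take the lock, not just the ones tsan happened to flag.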

@gabriel-bjg gabriel-bjg force-pushed the thread-sanitizer-fixes-functional-tests branch 2 times, most recently from 3531516 to a28e1f5 Compare July 15, 2021 14:28
Member

@PastaPastaPasta PastaPastaPasta left a comment

re-utACK

@PastaPastaPasta PastaPastaPasta requested a review from UdjinM6 July 15, 2021 17:01
@PastaPastaPasta
Member

Needs rebase

@gabriel-bjg gabriel-bjg force-pushed the thread-sanitizer-fixes-functional-tests branch 4 times, most recently from 9540a56 to 018a7aa Compare July 16, 2021 09:38
UdjinM6
UdjinM6 previously approved these changes Jul 16, 2021

@UdjinM6 UdjinM6 left a comment

utACK

@UdjinM6 UdjinM6 dismissed their stale review July 16, 2021 23:08

more fixes coming?

@github-actions

This pull request has conflicts, please rebase.

@gabriel-bjg gabriel-bjg force-pushed the thread-sanitizer-fixes-functional-tests branch 2 times, most recently from 8b82d6a to be928d9 Compare July 17, 2021 07:26
@gabriel-bjg gabriel-bjg force-pushed the thread-sanitizer-fixes-functional-tests branch from da89dcd to 7955fb3 Compare July 17, 2021 12:46
@github-actions

This pull request has conflicts, please rebase.

@gabriel-bjg gabriel-bjg force-pushed the thread-sanitizer-fixes-functional-tests branch 2 times, most recently from 4543fab to 3778e3f Compare July 17, 2021 12:54
@gabriel-bjg
Author

gabriel-bjg commented Jul 17, 2021

more fixes coming?

Yes, I have finished all the fixes now; waiting for the builds.

@gabriel-bjg gabriel-bjg force-pushed the thread-sanitizer-fixes-functional-tests branch from 3778e3f to 9047608 Compare July 17, 2021 13:29
@PastaPastaPasta
Member

There was an additional data race / deadlock I detected on a run of feature_block.py, which I forwarded to @gabriel-bjg

My understanding is he's planning to fix that and push

@github-actions

This pull request has conflicts, please rebase.

@gabriel-bjg gabriel-bjg force-pushed the thread-sanitizer-fixes-functional-tests branch from 9047608 to d7501e0 Compare July 22, 2021 18:09
@gabriel-bjg
Author

There was an additional data race / deadlock I detected on a run of feature_block.py, which I forwarded to @gabriel-bjg

My understanding is he's planning to fix that and push

Regarding the data races, they are all triggered by zmq and libzmq. According to the Bitcoin tsan suppressions list, zmq and libzmq data races are all suppressed (https://github.com/bitcoin/bitcoin/blob/master/test/sanitizer_suppressions/tsan). As agreed with @PastaPastaPasta, we will ignore them for now and won't fix anything related to them in this PR.
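For reference, ThreadSanitizer suppressions use one `<report-type>:<pattern>` entry per line; illustrative entries for zmq races (the exact patterns in Bitcoin's file may differ) look like:

```
# Suppress data races reported inside ZeroMQ internals.
race:zmq::*
called_from_lib:libzmq
```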

The possible deadlock triggered by tsan is also a false positive. I present below the scenario which triggers the lock order inversion:

Thread19:
    LOCK2(cs_main, mempool.cs) (miner.cpp:235) => scriptcheckqueue->m_control_mutex (validation.cpp:2160) => cs_main, cgovernancemanager.cs (governance/governance.cpp:352)

Thread18:
    LOCK(cs_main) (validation.cpp:984) => LOCK(mempool.cs) (txmempool.cpp:1232)

In short, tsan sees T19 lock cs_main (first acquisition), mempool.cs, m_control_mutex, cs_main again (a re-entry, since it is a recursive mutex), and cgovernancemanager.cs, while T18 locks cs_main and then mempool.cs. The false positive arises because tsan does not track that cs_main was already held before mempool.cs was taken on Thread19. It therefore sees Thread19 lock mempool.cs, m_control_mutex, then cs_main, while Thread18 locks cs_main and then mempool.cs, and reports a lock order inversion.

Function CWallet::KeepKey requires locking as it has concurrent access to database and member nKeysLeftSinceAutoBackup.

Avoid data race when reading setInventoryTxToSend size by locking the read. If locking happens after the read, the size may change.

Lock cs_mnauth when reading verifiedProRegTxHash.

Make fRPCRunning atomic as it can be read/written from different threads simultaneously.

Make m_masternode_iqr_connection atomic as it can be read/written from different threads simultaneously.

Use a recursive mutex to synchronize concurrent access to quorumVvec.

Make m_masternode_connection atomic as it can be read/written from different threads simultaneously.

Make m_masternode_probe_connection atomic as it can be read/written from different threads simultaneously.

Use a recursive mutex in order to lock access to activeMasterNode.

Use a recursive mutex to synchronize concurrent access to skShare.

Guarded all mnauth fields of a CNode.
@gabriel-bjg gabriel-bjg force-pushed the thread-sanitizer-fixes-functional-tests branch from d7501e0 to b85fc7a Compare July 23, 2021 13:03

@UdjinM6 UdjinM6 left a comment

utACK

Member

@PastaPastaPasta PastaPastaPasta left a comment

utACK for squash merge

@PastaPastaPasta PastaPastaPasta merged commit 41190e9 into dashpay:develop Jul 26, 2021
christiancfifi pushed a commit to christiancfifi/dash that referenced this pull request Aug 24, 2021
Already fixed here:
41190e9 ("Fix data races triggered by functional tests. (dashpay#4247)", 2021-07-26)

6c10037 rpc: Fix data race (UB) in InterruptRPC() (practicalswift)

Pull request description:

  Fix data race (UB) in `InterruptRPC()`.

  Before:

  ```
  $ ./configure --with-sanitizers=thread
  $ make
  $ test/functional/test_runner.py feature_shutdown.py
  …
  SUMMARY: ThreadSanitizer: data race rpc/server.cpp:314 in InterruptRPC()
  …
  ALL                 | ✖ Failed  | 2 s (accumulated)
  ```

  After:

  ```
  $ ./configure --with-sanitizers=thread
  $ make
  $ test/functional/test_runner.py feature_shutdown.py
  …
  ALL                 | ✓ Passed  | 3 s (accumulated)
  ```

Tree-SHA512: b139ca1a0480258f8caa7730cabd7783a821d906630f51487750a6b15b7842675ed679747e1ff1bdade77d248807e9d77bae7bb88da54d1df84a179cd9b9b987
gades pushed a commit to cosanta/cosanta-core that referenced this pull request May 9, 2022
Function CWallet::KeepKey requires locking as it has concurrent access to database and member nKeysLeftSinceAutoBackup.

Avoid data race when reading setInventoryTxToSend size by locking the read. If locking happens after the read, the size may change.

Lock cs_mnauth when reading verifiedProRegTxHash.

Make fRPCRunning atomic as it can be read/written from different threads simultaneously.

Make m_masternode_iqr_connection atomic as it can be read/written from different threads simultaneously.

Use a recursive mutex to synchronize concurrent access to quorumVvec.

Make m_masternode_connection atomic as it can be read/written from different threads simultaneously.

Make m_masternode_probe_connection atomic as it can be read/written from different threads simultaneously.

Use a recursive mutex in order to lock access to activeMasterNode.

Use a recursive mutex to synchronize concurrent access to skShare.

Guarded all mnauth fields of a CNode.

Co-authored-by: UdjinM6 <[email protected]>