test_multiprocessing_spawn.test_manager: _TestCondition hung (20 min timeout) on AMD64 RHEL8 3.x #110206


Closed
vstinner opened this issue Oct 2, 2023 · 5 comments
Labels: tests (Tests in the Lib/test dir), topic-multiprocessing

Comments


vstinner commented Oct 2, 2023

  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/test/_test_multiprocessing.py", line 1426, in f
    woken.release()

This frame comes from _TestCondition.
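
For context, the worker at that line looks roughly like the sketch below (a paraphrase of the _TestCondition helper in Lib/test/_test_multiprocessing.py, not the verbatim code; details vary between versions). Each worker announces that it is about to wait, blocks on the managed Condition, then announces that it was woken:

```python
# Rough paraphrase of _TestCondition.f, not the verbatim test code.
def f(cond, sleeping, woken, timeout=None):
    cond.acquire()
    sleeping.release()   # tell the test this worker is about to wait
    cond.wait(timeout)   # block until notify()/notify_all()
    woken.release()      # line 1426: the RemoteError below is raised here
    cond.release()
```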

AMD64 RHEL8 3.x:

0:04:43 load avg: 7.07 [464/467] test_zipfile passed -- running (2): test.test_multiprocessing_spawn.test_manager (1 min 12 sec), test_math (1 min 22 sec)
0:04:52 load avg: 6.14 [465/467] test_xmlrpc passed -- running (2): test.test_multiprocessing_spawn.test_manager (1 min 21 sec), test_math (1 min 30 sec)
0:05:19 load avg: 4.38 [466/467] test_math passed (1 min 57 sec) -- running (1): test.test_multiprocessing_spawn.test_manager (1 min 48 sec)
0:05:49 load avg: 2.66 running (1): test.test_multiprocessing_spawn.test_manager (2 min 18 sec)
0:06:19 load avg: 1.61 running (1): test.test_multiprocessing_spawn.test_manager (2 min 48 sec)
0:06:49 load avg: 0.97 running (1): test.test_multiprocessing_spawn.test_manager (3 min 18 sec)
(...)
0:22:19 load avg: 0.00 running (1): test.test_multiprocessing_spawn.test_manager (18 min 48 sec)
0:22:49 load avg: 0.00 running (1): test.test_multiprocessing_spawn.test_manager (19 min 18 sec)
0:23:19 load avg: 0.00 running (1): test.test_multiprocessing_spawn.test_manager (19 min 48 sec)
0:23:31 load avg: 0.00 [467/467/1] test.test_multiprocessing_spawn.test_manager worker non-zero exit code (Exit code 1)
Process Process-44:
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/test/_test_multiprocessing.py", line 1426, in f
    woken.release()
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/multiprocessing/managers.py", line 1059, in release
    return self._callmethod('release')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/multiprocessing/managers.py", line 840, in _callmethod
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError: 
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/multiprocessing/managers.py", line 263, in serve_client
    self.id_to_local_proxy_obj[ident]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
KeyError: '7f528ad142e0'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/multiprocessing/managers.py", line 265, in serve_client
    raise ke
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/multiprocessing/managers.py", line 259, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
                              ~~~~~~~~~^^^^^^^
KeyError: '7f528ad142e0'
---------------------------------------------------------------------------
Timeout (0:20:00)!
Thread 0x00007f3635005740 (most recent call first):
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/multiprocessing/popen_fork.py", line 27 in poll
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/multiprocessing/popen_fork.py", line 43 in wait
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/multiprocessing/process.py", line 149 in join
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/unittest/case.py", line 597 in _callCleanup
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/unittest/case.py", line 673 in doCleanups
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/unittest/case.py", line 640 in run
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/unittest/case.py", line 692 in __call__
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/unittest/suite.py", line 122 in run
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/unittest/suite.py", line 122 in run
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/unittest/suite.py", line 122 in run
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/unittest/runner.py", line 240 in run
  File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-x86_64/build/Lib/test/support/__init__.py", line 1155 in _run_suite

test_multiprocessing_spawn.test_manager passed when re-run.

build: https://buildbot.python.org/all/#/builders/185/builds/5160
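
The KeyError in serve_client is consistent with the shared semaphore being unregistered on the manager side while a worker still held a proxy to it. A minimal, timing-dependent sketch of that failure mode (illustrative only, not a guaranteed reproducer; whether it triggers depends on whether the child's incref loses the race against the parent's decref):

```python
# Illustrative race sketch, not a guaranteed reproducer.
import multiprocessing as mp

def late_release(sem):
    sem.release()   # can raise RemoteError wrapping KeyError if the
                    # manager already unregistered the semaphore

if __name__ == "__main__":
    mp.set_start_method("spawn")
    with mp.Manager() as manager:
        sem = manager.Semaphore(0)
        p = mp.Process(target=late_release, args=(sem,))
        p.start()
        del sem     # drop the parent's only proxy; depending on timing,
                    # the manager may discard the object before the
                    # child has increfed its copy of the proxy
        p.join()
```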

vstinner added the tests and topic-multiprocessing labels Oct 2, 2023

vstinner commented Oct 5, 2023

Similar error on PPC64LE RHEL8 Refleaks 3.x: https://buildbot.python.org/all/#/builders/384/builds/892


vstinner commented Oct 5, 2023

I failed to reproduce the issue on my Fedora 38 laptop (12 logical CPUs) by stressing it with:

./python -m test test_multiprocessing_spawn.test_manager -m WithManagerTestCondition -F -j70 -W

I ran the test for 3 min 30 sec.
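
For anyone trying this outside of regrtest, a standalone stress of the same pattern might look like the sketch below (hypothetical, not taken from the test suite; it drives a managed Condition and Semaphores the way test_notify_all does):

```python
# Hypothetical standalone stress of the managed-Condition pattern.
import multiprocessing as mp

def worker(cond, sleeping, woken):
    with cond:
        sleeping.release()        # announce that we are about to wait
        cond.wait()
    woken.release()               # announce that we were woken

if __name__ == "__main__":
    mp.set_start_method("spawn")
    for _ in range(100):          # repeat to widen any race window
        with mp.Manager() as manager:
            cond = manager.Condition()
            sleeping = manager.Semaphore(0)
            woken = manager.Semaphore(0)
            procs = [mp.Process(target=worker, args=(cond, sleeping, woken))
                     for _ in range(6)]
            for p in procs:
                p.start()
            for _ in procs:
                sleeping.acquire()    # wait until every worker is waiting
            with cond:
                cond.notify_all()
            for _ in procs:
                woken.acquire()       # every worker ran woken.release()
            for p in procs:
                p.join()
```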


vstinner commented Nov 8, 2023

I haven't seen this failure recently, so I'm closing the issue.

vstinner closed this as completed Nov 8, 2023
@colesbury
Contributor

I've seen this a few times recently. For example on the GH ubuntu-24.04-arm runner: https://github.com/python/cpython/actions/runs/13642688178/job/38135745891?pr=130811

test_notify_all (test.test_multiprocessing_spawn.test_manager.WithManagerTestCondition.test_notify_all) ... Process Process-43:
Traceback (most recent call last):
  File "/home/runner/work/cpython/cpython-ro-srcdir/Lib/multiprocessing/process.py", line 313, in _bootstrap
    self.run()
    ~~~~~~~~^^
  File "/home/runner/work/cpython/cpython-ro-srcdir/Lib/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/cpython/cpython-ro-srcdir/Lib/test/_test_multiprocessing.py", line 1621, in f
    woken.release()
    ~~~~~~~~~~~~~^^
  File "/home/runner/work/cpython/cpython-ro-srcdir/Lib/multiprocessing/managers.py", line 1067, in release
    return self._callmethod('release')
           ~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/runner/work/cpython/cpython-ro-srcdir/Lib/multiprocessing/managers.py", line 848, in _callmethod
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError: 
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/cpython/cpython-ro-srcdir/Lib/multiprocessing/managers.py", line 264, in serve_client
    self.id_to_local_proxy_obj[ident]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
KeyError: 'ff241728daa0'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/work/cpython/cpython-ro-srcdir/Lib/multiprocessing/managers.py", line 266, in serve_client
    raise ke
  File "/home/runner/work/cpython/cpython-ro-srcdir/Lib/multiprocessing/managers.py", line 260, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
                              ~~~~~~~~~^^^^^^^
KeyError: 'ff241728daa0'
---------------------------------------------------------------------------
Timeout (0:10:00)!

colesbury reopened this Mar 4, 2025
colesbury added a commit to colesbury/cpython that referenced this issue Mar 6, 2025
The test could deadlock trying to join the worker processes, due to a
combination of behaviors:

* The use of `assertReachesEventually` did not ensure that the workers
  had actually called woken.release(), because the SyncManager's
  Semaphore does not implement get_value.

* This meant that the test could finish, at which point the variable
  "sleeping" would go out of scope and be collected. That unregisters
  the proxy, leading to failures in the worker or possibly the manager.

* The subsequent call to `p.join()` during cleanup therefore never
  finished.

This takes two approaches to fix it:

1) Use woken.acquire() to ensure that the workers actually finish
   calling woken.release() (see the sketch after this message).

2) At the end of the test, wait until the workers are finished, while
   `cond`, `sleeping`, and `woken` are still valid.
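
In test code, the fixed pattern looks roughly like this (a simplified sketch, not the exact diff from the PR; `finish` is a hypothetical helper name, while `procs` and `woken` follow the test's variable names):

```python
# Simplified sketch of the fix, not the exact diff from the PR.
def finish(procs, woken):
    # One blocking acquire per worker guarantees that every worker has
    # actually executed woken.release() (the SyncManager Semaphore has
    # no get_value() to poll).
    for _ in procs:
        woken.acquire()
    # Join while cond/sleeping/woken are still referenced, so the
    # manager cannot unregister them under a worker that is still
    # running.
    for p in procs:
        p.join()
```
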
colesbury added a commit that referenced this issue Mar 7, 2025
(same commit message as above)
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Mar 7, 2025
(same commit message as above; cherry picked from commit c476410, co-authored by Sam Gross)
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Mar 7, 2025
(same commit message as above; cherry picked from commit c476410, co-authored by Sam Gross)
colesbury added a commit that referenced this issue Mar 7, 2025
…30951)

(same commit message as above; cherry picked from commit c476410, co-authored by Sam Gross)
colesbury added a commit that referenced this issue Mar 7, 2025
…30950)

(same commit message as above; cherry picked from commit c476410, co-authored by Sam Gross)
@colesbury
Contributor

test_notify_all should be fixed now, but see #130954 for a similar issue with test_notify_n.

seehwan pushed a commit to seehwan/cpython that referenced this issue Apr 16, 2025
(same commit message as above)