gh-110206: Fix multiprocessing test_notify_all #130933

Merged
colesbury merged 2 commits into python:main from gh-110206-test-notify-all
Mar 7, 2025

Conversation

colesbury
Contributor

@colesbury colesbury commented Mar 6, 2025

The test could deadlock while trying to join the worker processes due to a combination of behaviors:

  • The use of assertReachesEventually did not ensure that workers actually called woken.release(), because the SyncManager's Semaphore does not implement get_value.

  • This means that the test could finish and the variable "sleeping" would go out of scope and be collected. This unregisters the proxy, leading to failures in the worker or possibly the manager.

  • The subsequent call to p.join() during cleanup therefore never finished.

This PR takes two approaches to fix the problem:

  1. Use woken.acquire() to ensure that the workers actually finish calling woken.release().

  2. Wait until the workers finish during the test, while cond, sleeping, and woken are still valid.
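
The two approaches above can be sketched roughly as follows. This is an illustrative sketch only, not the code from the PR: it uses `threading` primitives instead of SyncManager proxies so that it runs standalone, and `worker` and `run_notify_all` are hypothetical names. The `cond`, `sleeping`, and `woken` names mirror the description.

```python
import threading


def worker(cond, sleeping, woken):
    # Signal that we are about to wait, wait for the notification,
    # then signal that we woke up.
    with cond:
        sleeping.release()
        cond.wait()
    woken.release()


def run_notify_all(num_workers=3, timeout=10.0):
    # Hypothetical helper: threads stand in for the worker processes.
    cond = threading.Condition()
    sleeping = threading.Semaphore(0)
    woken = threading.Semaphore(0)

    workers = [
        threading.Thread(target=worker, args=(cond, sleeping, woken))
        for _ in range(num_workers)
    ]
    for t in workers:
        t.start()

    # Wait until every worker has announced that it is about to wait.
    for _ in workers:
        assert sleeping.acquire(timeout=timeout)

    # Taking the lock guarantees all workers are inside cond.wait().
    with cond:
        cond.notify_all()

    # Approach 1: block on woken.acquire() instead of polling a value,
    # so each successful acquire proves one woken.release() happened.
    acquired = sum(
        1 for _ in workers if woken.acquire(timeout=timeout)
    )

    # Approach 2: join the workers here, while cond, sleeping, and
    # woken are still referenced, rather than in a later cleanup.
    for t in workers:
        t.join(timeout)

    return acquired, [t.is_alive() for t in workers]


if __name__ == "__main__":
    print(run_notify_all())
```

Blocking on `acquire()` works even when the semaphore offers no way to read its counter, which is the situation the description attributes to the SyncManager's Semaphore.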

@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @colesbury for commit c06d0f1 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F130933%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Mar 6, 2025
@gpshead gpshead added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Mar 7, 2025
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @gpshead for commit 19c049d 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F130933%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Mar 7, 2025
@colesbury colesbury added needs backport to 3.12 only security fixes needs backport to 3.13 bugs and security fixes labels Mar 7, 2025
Member

@vstinner vstinner left a comment

LGTM

@colesbury
Contributor Author

nogil refleak buildbot failures will be fixed by #130901

@colesbury colesbury merged commit c476410 into python:main Mar 7, 2025
121 of 123 checks passed
@miss-islington-app

Thanks @colesbury for the PR 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13.
🐍🍒⛏🤖

@colesbury colesbury deleted the gh-110206-test-notify-all branch March 7, 2025 14:58
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Mar 7, 2025
(cherry picked from commit c476410)

Co-authored-by: Sam Gross <[email protected]>
@bedevere-app

bedevere-app bot commented Mar 7, 2025

GH-130950 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Mar 7, 2025
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Mar 7, 2025
(cherry picked from commit c476410)

Co-authored-by: Sam Gross <[email protected]>
@bedevere-app

bedevere-app bot commented Mar 7, 2025

GH-130951 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 only security fixes label Mar 7, 2025
colesbury added a commit to colesbury/cpython that referenced this pull request Mar 7, 2025
The test could deadlock while trying to join the worker processes.
Apply the same technique as pythongh-130933.

Join the process before the test ends in `test_notify` as well.
colesbury added a commit that referenced this pull request Mar 7, 2025
…30951)

(cherry picked from commit c476410)

Co-authored-by: Sam Gross <[email protected]>
colesbury added a commit that referenced this pull request Mar 7, 2025
…30950)

(cherry picked from commit c476410)

Co-authored-by: Sam Gross <[email protected]>
@bedevere-bot

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot s390x RHEL9 Refleaks 3.13 (tier-3) has failed when building commit 94b94d0.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/1575/builds/567) and take a look at the build logs.
  4. Check if the failure is related to this commit (94b94d0) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/1575/builds/567

Failed tests:

  • test.test_multiprocessing_spawn.test_manager

Summary of the results of the build (if available):

==

Click to see traceback logs
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/managers.py", line 265, in serve_client
    raise ke
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/managers.py", line 259, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
                              ~~~~~~~~~^^^^^^^
KeyError: '3ff7b113a80'
---------------------------------------------------------------------------
Timeout (0:45:00)!
Thread 0x000003ffa9d73740 (most recent call first):
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/popen_fork.py", line 28 in poll
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/popen_fork.py", line 44 in wait
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/process.py", line 149 in join
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/case.py", line 614 in _callCleanup
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/case.py", line 688 in doCleanups
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/case.py", line 655 in run
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/case.py", line 707 in __call__
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/suite.py", line 122 in run
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/suite.py", line 122 in run
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/runner.py", line 240 in run
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/single.py", line 57 in _run_suite
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/single.py", line 37 in run_unittest
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/single.py", line 135 in test_func
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/refleak.py", line 132 in runtest_refleak
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/single.py", line 87 in regrtest_runner
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/single.py", line 138 in _load_run_test
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/single.py", line 181 in _runtest_env_changed_exc
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/single.py", line 281 in _runtest
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/single.py", line 310 in run_single_test
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/worker.py", line 77 in worker_process
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/worker.py", line 112 in main
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/worker.py", line 116 in <module>
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/runpy.py", line 88 in _run_code
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/runpy.py", line 198 in _run_module_as_main


Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/managers.py", line 265, in serve_client
    raise ke
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/managers.py", line 259, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
                              ~~~~~~~~~^^^^^^^
KeyError: '3ff96a13a80'
---------------------------------------------------------------------------
Timeout (0:45:00)!
Thread 0x000003ff89d73740 (most recent call first):
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/popen_fork.py", line 28 in poll
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/popen_fork.py", line 44 in wait
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/process.py", line 149 in join
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/case.py", line 614 in _callCleanup
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/case.py", line 688 in doCleanups
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/case.py", line 655 in run
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/case.py", line 707 in __call__
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/suite.py", line 122 in run
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/suite.py", line 122 in run
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/unittest/runner.py", line 240 in run
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/single.py", line 57 in _run_suite
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/single.py", line 37 in run_unittest
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/single.py", line 135 in test_func
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/refleak.py", line 132 in runtest_refleak
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/single.py", line 87 in regrtest_runner
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/single.py", line 138 in _load_run_test
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/single.py", line 181 in _runtest_env_changed_exc
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/single.py", line 281 in _runtest
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/single.py", line 310 in run_single_test
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/worker.py", line 77 in worker_process
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/worker.py", line 112 in main
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/libregrtest/worker.py", line 116 in <module>
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/runpy.py", line 88 in _run_code
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/runpy.py", line 198 in _run_module_as_main


Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/process.py", line 313, in _bootstrap
    self.run()
    ~~~~~~~~^^
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/_test_multiprocessing.py", line 1581, in f
    cond.release()
    ~~~~~~~~~~~~^^
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/managers.py", line 1066, in release
    return self._callmethod('release')
           ~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/managers.py", line 847, in _callmethod
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError: 
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/managers.py", line 263, in serve_client
    self.id_to_local_proxy_obj[ident]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
KeyError: '3ff96a13a80'


Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/process.py", line 313, in _bootstrap
    self.run()
    ~~~~~~~~^^
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/test/_test_multiprocessing.py", line 1581, in f
    cond.release()
    ~~~~~~~~~~~~^^
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/managers.py", line 1066, in release
    return self._callmethod('release')
           ~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/managers.py", line 847, in _callmethod
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError: 
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.13.cstratak-rhel9-s390x.refleak/build/Lib/multiprocessing/managers.py", line 263, in serve_client
    self.id_to_local_proxy_obj[ident]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
KeyError: '3ff7b113a80'

colesbury added a commit that referenced this pull request Mar 8, 2025
The test could deadlock while trying to join the worker processes.
Apply the same technique as gh-130933.

Join the process before the test ends in `test_notify` as well.
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Mar 8, 2025
(cherry picked from commit edd1eca)

Co-authored-by: Sam Gross <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Mar 8, 2025
(cherry picked from commit edd1eca)

Co-authored-by: Sam Gross <[email protected]>
colesbury added a commit that referenced this pull request Mar 8, 2025
)

(cherry picked from commit edd1eca)

Co-authored-by: Sam Gross <[email protected]>
colesbury added a commit that referenced this pull request Mar 8, 2025
)

(cherry picked from commit edd1eca)

Co-authored-by: Sam Gross <[email protected]>
seehwan pushed a commit to seehwan/cpython that referenced this pull request Apr 16, 2025
seehwan pushed a commit to seehwan/cpython that referenced this pull request Apr 16, 2025
Labels
skip news tests Tests in the Lib/test dir
4 participants