[WIP] bpo-38323: Fix MultiLoopChildWatcher hangs #20142
Conversation
This changes asyncio.MultiLoopChildWatcher's attach_loop() method to call loop.add_signal_handler() instead of calling only signal.signal(). This should eliminate some rare hangs since loop.add_signal_handler() calls signal.set_wakeup_fd(). Without this, the main thread sometimes wasn't getting awakened if a signal occurred during an await.
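For illustration, here is a minimal sketch of the idea (not the actual diff in this PR; it assumes `MultiLoopChildWatcher`'s private `_sig_chld(signum, frame)` handler and that `attach_loop()` runs on the main thread):

```python
# Hedged sketch of the approach, not the real patch in Lib/asyncio/unix_events.py.
import asyncio
import signal


class SketchMultiLoopChildWatcher(asyncio.MultiLoopChildWatcher):
    def attach_loop(self, loop):
        if loop is None:
            return
        # loop.add_signal_handler() also calls signal.set_wakeup_fd(), so a
        # SIGCHLD that arrives while the main thread is blocked in
        # select()/epoll_wait() wakes the loop instead of going unnoticed
        # until the main thread next returns to Python code.
        loop.add_signal_handler(signal.SIGCHLD, self._sig_chld,
                                signal.SIGCHLD, None)
```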
```python
policy = asyncio.get_event_loop_policy()
watcher = policy.get_child_watcher()
policy.set_child_watcher(None)
watcher.attach_loop(None)
watcher.close()
# Since setUp() does super().setUp() first, do tearDown() last.
super().tearDown()
```
I think this change causes the CI to fail.
Calling `super().tearDown()` last here should not be causing dangling threads, so I think this could be pointing to a more fundamental issue.

In a previous PR, I added an internal `_join_threads()` method to `ThreadedChildWatcher`, which specifically cleans up all non-daemon threads; it was there to ensure all threads are joined when the watcher is closed. However, the `_do_waitpid()` threads created in `add_child_handler()` are started as daemon threads, so `_join_threads()` skips them because of its not-daemon check.
Based on the error message in Travis, it definitely seems to be the waitpid threads created in `ThreadedChildWatcher` that are causing the hangs:
```
0:11:41 load avg: 3.89 [375/424/1] test_asyncio failed (env changed) (2 min 3 sec) -- running: test_multiprocessing_spawn (1 min 2 sec)
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>
```
As a result, I think a viable fix might be simply removing the `and not thread.daemon` check in `ThreadedChildWatcher._join_threads()`, so that it also cleans up the waitpid threads when the watcher is closed (see the sketch below). Based on my local testing, this seems to resolve the issue, but `test-with-buildbots` should probably be used to make sure it doesn't cause side effects on other platforms.

A viable alternative might be changing `add_child_handler()` to spawn non-daemon threads instead. However, I suspect changing the internal `_join_threads()` method is less likely to disrupt user code that relies on those threads being daemon.
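A rough sketch of that tweak (illustrative monkeypatch only; it assumes `_join_threads()` keeps its current shape of filtering `self._threads` by liveness and daemon status):

```python
# Illustrative monkeypatch, not the real patch: drop the daemon filter so that
# closing the watcher also joins the daemon "waitpid-*" threads.
import asyncio.unix_events as unix_events


def _join_threads(self):
    """Internal: Join all live child-watcher threads, daemon or not."""
    threads = [thread for thread in list(self._threads.values())
               if thread.is_alive()]  # previously: ... and not thread.daemon
    for thread in threads:
        thread.join()


unix_events.ThreadedChildWatcher._join_threads = _join_threads
```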
I don't recall exactly why I made this particular `tearDown()` change for the purposes of this PR (there was probably some failure I was trying to get past quickly). In any case, it should be incidental to the `MultiLoopChildWatcher` hang. So if any changes are needed for the asyncio tests related to `tearDown()`, they should IMO be done in a separate PR, independent of the `MultiLoopChildWatcher` hang issue. In other words, the eventual PR shouldn't contain this change.
@aeros Just to clarify what I said above: if you feel your `ThreadedChildWatcher` change would help the test harness in general, I would recommend making that change in a separate PR, to reduce the number of moving parts in the changes needed to address the `MultiLoopChildWatcher` hang.
Alright, I'll look into opening a separate PR to address it. However, if this PR is continued (either by you or me), I'd suggest reverting the `super().tearDown()` location change.
Right. I guess there are a few steps to be done. First, you can make your `ThreadedChildWatcher` change in a separate PR, and the `super().tearDown()` change can be reverted in this PR. If, however, after resuming work on this PR it looks like a change to the teardown logic is needed for some reason, I would still suggest that the `tearDown()` change also be done in a separate PR, since it affects many of the watcher tests and not just `MultiLoopChildWatcher`, and it would be good to isolate that change.
@cjerdonek Are you still interested in/able to work on this? I'm mainly asking since it's been a few months since there was activity, and I'm fairly confident that my suggestion will resolve the CI failures. If you're not currently able to, I can re-open a separate PR applying these changes in addition to mine, with you as co-author, titling it something like "fix MultiLoopChildWatcher and ThreadedChildWatcher hangs". @vstinner recently brought to my attention that this issue has some degree of urgency, since it has caused intermittent failures in the buildbots for quite some time now. So there's definitely an advantage to having it resolved sooner rather than later.
@aeros As I said in the reply above to the review comment, I think the `tearDown()` change should be reverted here and handled in a separate PR if needed. Regarding the rest of the PR, sure, you're welcome to continue what I started. However, I definitely don't think the changes I made to […]. For example, I don't think […]
```python
if loop is None or self._saved_sighandler is not None:
    return

self._loop = loop
```
Part of why I was saying to remove this line is to honor the code comment 7 lines above:

```python
# Don't save the loop but initialize itself if called first time
```

Something else to be aware of is this implementation note:

```python
# Implementation note:
# The class keeps compatibility with AbstractChildWatcher ABC
# To achieve this it has empty attach_loop() method
# and doesn't accept explicit loop argument
# for add_child_handler()/remove_child_handler()
# but retrieves the current loop by get_running_loop()
```

The part mentioning its "empty `attach_loop()` method" would no longer be accurate with this change.
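To spell out what that note describes, here is a paraphrased sketch (not the CPython source; the real watcher also saves and restores the previous SIGCHLD handler and deals with several edge cases):

```python
# Paraphrased sketch of the design the implementation note describes.
import asyncio
import os


class WatcherSketch:
    def __init__(self):
        self._callbacks = {}

    def add_child_handler(self, pid, callback, *args):
        # No explicit loop argument: pick up the loop running in the calling
        # thread, which is what lets one watcher serve several event loops.
        loop = asyncio.get_running_loop()
        self._callbacks[pid] = (loop, callback, args)

    def _sig_chld(self, signum, frame):
        # Process-wide SIGCHLD handler: reap finished children and hand each
        # callback back to whichever loop registered the pid (possibly in
        # another thread).
        for pid, (loop, callback, args) in list(self._callbacks.items()):
            try:
                wpid, status = os.waitpid(pid, os.WNOHANG)
            except ChildProcessError:
                continue
            if wpid == pid:
                del self._callbacks[pid]
                returncode = os.waitstatus_to_exitcode(status)
                loop.call_soon_threadsafe(callback, pid, returncode, *args)
```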
From the git blame, it looks like Andrew wrote that note when initially implementing the `MultiLoopChildWatcher` class.
I reverted the […]
```python
# Don't save the loop but initialize itself if called first time
# The reason to do it here is that attach_loop() is called from
# unix policy only for the main thread.
# Main thread is required for subscription on SIGCHLD signal
if loop is None or self._saved_sighandler is not None:
```
@aeros I would think through what should happen if either `loop is None` or `self._saved_sighandler is not None`. For example, the previous implementation didn't use the `loop` argument, so there is a risk that code will stop working for people who were passing `loop=None`.
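A concrete example of the kind of code at risk (hypothetical usage, Unix main thread assumed): with the stock implementation, `attach_loop()` ignores its argument, so this still activates the watcher.

```python
# Hypothetical existing usage that an early "if loop is None: return" would break.
import asyncio

watcher = asyncio.MultiLoopChildWatcher()
asyncio.get_event_loop_policy().set_child_watcher(watcher)

# Stock behaviour: the SIGCHLD handler is installed even though loop is None.
watcher.attach_loop(None)
assert watcher.is_active()  # would start failing if attach_loop(None) became a no-op
```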
Can I suggest an improved documentation-comment?
I noticed a similar constraint in the existing code. MultiLoopChildWatcher can cause hangs if the main thread is waiting in a system call, because the watcher's signal handler does not get run until the system call returns. It only works properly if the main thread can be woken up by [EDIT: a signal that happens to be received on the main thread, or] the fd passed to signal.set_wakeup_fd().

This PR tightens the requirements slightly. MultiLoopChildWatcher now needs the main thread to run an asyncio event loop; it will cause hangs if the main thread waits elsewhere in a system call, or if the main thread runs a very long computation without returning to the event loop.

(The code looks plausible. Good analysis of a nasty problem: the signal arrives e.g. while the main thread is inside the libc select() function, and the Python signal handler, MultiLoopChildWatcher._sig_chld(), doesn't get a chance to run until the low-level system call returns. signal.set_wakeup_fd() is the solution: it gets the select() system call to wake up, so the Python-level SIGCHLD handler gets a chance to run.)

I would drop […]
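To make the wake-up mechanism concrete, here is a small standalone POSIX demonstration (illustrative only, independent of asyncio and of this PR):

```python
# The wakeup fd gives select() something readable to return on as soon as a
# signal is delivered, instead of resuming its wait for the full timeout.
import os
import select
import signal
import socket

signal.signal(signal.SIGCHLD, lambda signum, frame: print("SIGCHLD handled"))

r, w = socket.socketpair()
r.setblocking(False)
w.setblocking(False)
signal.set_wakeup_fd(w.fileno())   # the C-level handler writes one byte here

pid = os.fork()
if pid == 0:
    os._exit(0)                    # child exits immediately; parent gets SIGCHLD
else:
    # Without the wakeup fd, this select() would sit out (most of) the 30 s
    # timeout; with it, the wakeup byte ends the wait right away.
    readable, _, _ = select.select([r], [], [], 30)
    print("selector woke up:", readable)
    os.waitpid(pid, 0)
```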
Thanks @sourcejedi for the second pair of eyes and the great additional comments! I like your doc suggestion and agree it should be incorporated in some form into this PR. A couple of unrelated things: […]
I just noticed that changing […]. I guess this means […]
@asvetlov:

> MultiLoopChildWatcher problems can and should be fixed; there is nothing bad in the idea but slightly imperfect implementation.

This fix needs to switch from using signal.signal() to loop.add_signal_handler(), which requires the main thread to run an asyncio event loop. Previously, this requirement was only documented for SafeChildWatcher and FastChildWatcher. So we need to decide:

Q. Is it acceptable to add this new requirement? Or do we deprecate MultiLoopChildWatcher and warn people that it occasionally hangs?

In other words, it is possible an existing program used the "imperfect" MultiLoopChildWatcher while the main thread ran a non-asyncio event loop (and the non-asyncio event loop was using signal.set_wakeup_fd()). This fix would break any such program (hang in […]).

signal.set_wakeup_fd() is roughly the only way Python programs can use signal handlers correctly. The Python-level signal handler is run in the main thread. If the main thread is waiting in a system call, it needs something to wake it up after a signal was delivered to a different thread. POSIX says that process signals can be delivered to any thread. (I say "process signals" to exclude pthread_kill(), and faults like […])
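As a small standalone illustration of those last two points (POSIX-only; SIGUSR1 stands in for SIGCHLD): the Python-level handler always runs on the main thread, and only once the main thread gets a chance to execute Python code.

```python
# Demonstrates that Python-level signal handlers run in the main thread even
# when the signal is raised while a worker thread is active.
import os
import signal
import threading
import time


def handler(signum, frame):
    print("handler ran in:", threading.current_thread().name)   # MainThread


signal.signal(signal.SIGUSR1, handler)


def worker():
    time.sleep(0.2)
    os.kill(os.getpid(), signal.SIGUSR1)   # process-wide signal; the kernel
                                           # may deliver it to any thread


threading.Thread(target=worker).start()
time.sleep(1)   # the handler runs once the main thread returns to Python code
print("main thread done")
```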
I don't know how that would help, sorry. We absolutely need to interrupt the epoll_wait() call in the main thread, and we need that epoll to be watching the signal wakeup fd. Hence we need to talk to the main thread's event loop, and we may as well use loop.add_signal_handler().
I said that because if […] (see cpython/Lib/asyncio/unix_events.py, lines 1205 to 1206 at 95ad890).
So I was wondering if there might be any options for using […]. In the meantime, we might also want to think about whether a band-aid could be added to the tests, at least to prevent hangs as long as […]
Yeah, I would agree that changing […]
My only concern, though, is that by bandaging the tests temporarily, it may encourage usage in production in situations that could lead to the hangs (although admittedly, I'm not aware of specific real-world situations where it would currently be used). That being said, it seems like we may have to go that route for now, since this issue appears to be blocked on feedback from @asvetlov (it might require some fundamental design changes). This would at least ensure that other test failures aren't obfuscated by intermittent hangs.
I was just meaning in master (e.g. for the next few months) while a resolution is developed. By the way, @aeros, can you review my other PR #22756, which is independent of (but related to) this issue?
LGTM, this fixes some hanging tests on Solaris/illumos/OpenIndiana.
I'm closing this PR, which was mainly exploratory. See here for my conclusions (also noted above): https://bugs.python.org/issue38323#msg395113
https://bugs.python.org/issue38323