
[WIP] bpo-38323: Fix MultiLoopChildWatcher hangs #20142

Closed

Conversation

@cjerdonek (Member) commented May 17, 2020:

cjerdonek added 2 commits May 16, 2020 18:47
This changes asyncio.MultiLoopChildWatcher's attach_loop() method
to call loop.add_signal_handler() instead of calling only
signal.signal().  This should eliminate some rare hangs since
loop.add_signal_handler() calls signal.set_wakeup_fd().  Without
this, the main thread sometimes wasn't getting awakened if a
signal occurred during an await.
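
A minimal sketch of that idea, not the actual patch (it relies on the fact stated above, that loop.add_signal_handler() calls signal.set_wakeup_fd(); the class and method names are illustrative):

import signal

class WatcherSketch:
    def attach_loop(self, loop):
        if loop is None:
            return
        # loop.add_signal_handler() installs the handler *and* sets the wakeup
        # fd via signal.set_wakeup_fd(), so a SIGCHLD arriving while the main
        # thread is blocked in a system call still wakes it up.
        loop.add_signal_handler(signal.SIGCHLD, self._sig_chld)

    def _sig_chld(self):
        # Reap finished children here (waitpid loop elided in this sketch).
        pass
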
(The review thread below concerns this tearDown() hunk from the PR's test changes:)

policy = asyncio.get_event_loop_policy()
watcher = policy.get_child_watcher()
policy.set_child_watcher(None)
watcher.attach_loop(None)
watcher.close()
# Since setUp() does super().setUp() first, do tearDown() last.
super().tearDown()
Contributor:

I think this change causes the CI to fail.

@aeros (Contributor) commented Oct 16, 2020:

Calling super().tearDown() last here should not be causing dangling threads, so I think this could be pointing to a more fundamental issue.

In a previous PR, I added an internal _join_threads() method to ThreadedChildWatcher that cleans up all non-daemon threads; it was meant to ensure all threads are joined when the watcher is closed. However, the _do_waitpid() threads created in add_child_handler() are daemon threads, so _join_threads() skips them (its filter only keeps non-daemon threads).

Based on the error message in Travis, it definitely seems to be those waitpid threads created in ThreadedChildWatcher causing the hangs:

0:11:41 load avg: 3.89 [375/424/1] test_asyncio failed (env changed) (2 min 3 sec) -- running: test_multiprocessing_spawn (1 min 2 sec)
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>

As a result, I think a viable fix might be simply removing the "and not thread.daemon" check in ThreadedChildWatcher._join_threads(), so that it also cleans up the waitpid threads when the watcher is closed (see the sketch after this comment). Based on my local testing, this resolves the issue, but test-with-buildbots should probably be used to make sure it doesn't cause side effects on other platforms.

A viable alternative might be changing add_child_handler() to spawn non-daemon threads instead. However, I suspect changing the internal _join_threads() method is less likely to disrupt user code that relies on those threads being daemon.
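
A hedged sketch of that tweak (the method body is modeled on CPython's ThreadedChildWatcher at the time; treat it as illustrative rather than the exact patch):

class ThreadedChildWatcherSketch:
    def __init__(self):
        self._threads = {}  # pid -> threading.Thread, as in the real watcher

    def _join_threads(self):
        """Join all running waitpid threads, daemon or not."""
        threads = [t for t in list(self._threads.values())
                   if t.is_alive()]  # the "and not t.daemon" filter removed
        for t in threads:
            t.join()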

@cjerdonek (Member, Author):

I don't recall exactly why I made this particular tearDown() change for the purposes of this PR. (There was probably some failure I was trying to get past quickly.) In any case, it should be incidental to the MultiLoopChildWatcher hang. So if any tearDown()-related changes are needed for the asyncio tests, they should be made in a separate PR, IMO, independent of the MultiLoopChildWatcher hang issue. In other words, the eventual PR shouldn't contain this change.

@cjerdonek (Member, Author):

@aeros Just to clarify what I said above, if you feel your ThreadedChildWatcher change would help the test harness in general, I would recommend making that change in a separate PR to reduce the number of moving parts in the changes needed to address the MultiLoopChildWatcher hang.

Contributor:

Alright, I'll look into opening a separate PR to address it. However, if work on this PR is continued (either by you or by me), I'd suggest reverting the super().tearDown() reordering.

@cjerdonek (Member, Author):

Right. I guess there are a few steps to be done. First, you can make your ThreadedChildWatcher change in a separate PR, and the super().tearDown() change can be reverted in this PR. If, however, after resuming work on this PR, it looks like a change to the teardown logic is needed for some reason, I would still suggest making that tearDown() change in a separate PR, since it affects many of the watcher tests, not just MultiLoopChildWatcher, and it would be good to isolate it.

@aeros (Contributor) commented Oct 16, 2020:

@cjerdonek Are you still interested in/able to work on this? I'm mainly asking since it's been a few months since there was activity, and I'm fairly confident that my suggestion will resolve the CI failures. If you're not currently able to, I can open a separate PR applying these changes in addition to mine, with you as co-author, titling it something like "fix MultiLoopChildWatcher and ThreadedChildWatcher hangs".

@vstinner recently brought to my attention that this issue has some degree of urgency, as it has been causing intermittent failures in the buildbots for quite some time now. So there's definitely an advantage to having it resolved sooner rather than later.

@cjerdonek (Member, Author):

@aeros As I said in the reply above to the review comment, I think the tearDown() change I made here shouldn't be done in this PR (or in any PR in combination with a fix for the hang). It should be incidental to the MultiLoopChildWatcher hang and is a separate issue (if it is an issue).

Regarding the rest of the PR, sure, you're welcome to continue what I started. However, I definitely don't think the changes I made to MultiLoopChildWatcher in this PR should be made as is. This was just a minimal start that applied what I found and that I confirmed addressed the hang. So while the general direction of the solution and PR should be sound (i.e. ensuring that the wakeup file descriptor is set in attach_loop() as I mentioned in the bug report), there are other aspects that need to be addressed.

For example, I don't think attach_loop() should be setting self._loop as I did. And there's a chance it may need to support being called on more than one loop, which in turn means close() will need to be able to unwind the operation for each loop that attach_loop() was called on. So you may need to record, in an attribute, the set of loops that attach_loop() was called on (see the sketch below). I stopped working on the PR before thinking through these other issues.
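
A hedged sketch of that bookkeeping, assuming a hypothetical _loops attribute (none of this is from the actual patch):

import signal

class MultiLoopSketch:
    def __init__(self):
        self._loops = set()  # hypothetical: every loop attach_loop() saw

    def attach_loop(self, loop):
        if loop is None:
            return
        loop.add_signal_handler(signal.SIGCHLD, self._sig_chld)
        self._loops.add(loop)

    def close(self):
        # Unwind attach_loop() for each recorded loop.
        for loop in self._loops:
            loop.remove_signal_handler(signal.SIGCHLD)
        self._loops.clear()

    def _sig_chld(self):
        pass  # child reaping elided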

(The next review thread quotes this hunk from attach_loop():)

if loop is None or self._saved_sighandler is not None:
    return

self._loop = loop

@cjerdonek (Member, Author):

Part of why I was saying to remove this line is to honor the code comment 7 lines above:

# Don't save the loop but initialize itself if called first time

@cjerdonek (Member, Author):

Something else to be aware of is that MultiLoopChildWatcher has this code comment right after its docstring:

# Implementation note:
# The class keeps compatibility with AbstractChildWatcher ABC
# To achieve this it has empty attach_loop() method
# and doesn't accept explicit loop argument
# for add_child_handler()/remove_child_handler()
# but retrieves the current loop by get_running_loop()

The part mentioning its "empty attach_loop()" method should probably be updated. I'm not sure why it says that and what it means (since its attach_loop() method isn't currently empty).

@aeros (Contributor) commented Oct 16, 2020:

The part mentioning its "empty attach_loop()" method should probably be updated. I'm not sure why it says that and what it means (since its attach_loop() method isn't currently empty).

From the git blame, it looks like Andrew wrote that note when initially implementing the MultiLoopChildWatcher class in 0d671c0. @asvetlov, could you clarify?

@cjerdonek (Member, Author):

I reverted the tearDown() change. We'll see what the tests do.

(Another review thread quotes this hunk from attach_loop():)

# Don't save the loop but initialize itself if called first time
# The reason to do it here is that attach_loop() is called from
# unix policy only for the main thread.
# Main thread is required for subscription on SIGCHLD signal
if loop is None or self._saved_sighandler is not None:

@cjerdonek (Member, Author):

@aeros I would think through what should happen if either loop is None or self._saved_sighandler is not None. For example, the previous implementation didn't use the loop argument, so there is a risk that code will stop working for people who were passing loop=None.
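
For illustration only, here is a hedged sketch of those two cases as separate guard clauses (hypothetical names; the actual guard-clause refactoring is PR #22756, mentioned below):

import signal

class GuardSketch:
    def __init__(self):
        self._saved_sighandler = None

    def attach_loop(self, loop):
        # Two separate guards make the two questions explicit:
        if self._saved_sighandler is not None:
            return  # already initialized; should a second call re-attach?
        if loop is None:
            return  # old code ignored `loop`; None callers must keep working
        self._saved_sighandler = signal.getsignal(signal.SIGCHLD)
        loop.add_signal_handler(signal.SIGCHLD, self._sig_chld)

    def _sig_chld(self):
        pass  # child reaping elided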

@sourcejedi commented Oct 17, 2020:

Can I suggest an improved documentation-comment?

class MultiLoopChildWatcher(AbstractChildWatcher):
    """A watcher that allows running event loops outside of the main thread.

    This implementation registers a SIGCHLD handler in the event loop of
    the main thread.  It requires the main thread use an asyncio event loop,
    and it will be affected by anything that delays or blocks event processing
    in the main thread.  It will conflict with any other code that installs
    its own signal handler for SIGCHLD.
    """

I noticed a similar constraint in the existing code. MultiLoopChildWatcher could cause hangs if the main thread is waiting in a system call, because the watcher's signal handler would not get run until the system call returns. It only works properly if the main thread can be woken up by [EDIT: a signal that happens to be received on the main thread, or] the fd passed to signal.set_wakeup_fd().

This PR tightens the requirements slightly. MultiLoopChildWatcher now needs the main thread to run an asyncio event loop. It will cause hangs if the main thread waits elsewhere in a system call, or if the main thread runs a very long computation without returning to the event loop.


(The code looks plausible. Good analysis of a nasty problem. The signal arrives, e.g., while the main thread is inside the libc select() function, and the Python signal handler, MultiLoopChildWatcher._sig_chld(), doesn't get a chance to run until the low-level system call returns.

signal.set_wakeup_fd() is the solution: it gets the select() system call to wake up, so the Python-level SIGCHLD handler gets a chance to run.)
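
A self-contained demonstration of that mechanism (my own sketch, assuming a Unix platform; SIGUSR1 stands in for SIGCHLD): delivering the signal to a non-main thread does not EINTR-interrupt the main thread's select(), so only the byte written to the wakeup fd wakes it.

import select
import signal
import socket
import threading
import time

def handler(signum, frame):
    print("Python-level handler ran in the main thread")

rsock, wsock = socket.socketpair()
wsock.setblocking(False)  # set_wakeup_fd requires a non-blocking fd
signal.signal(signal.SIGUSR1, handler)
signal.set_wakeup_fd(wsock.fileno())  # must be called from the main thread

def send_from_worker():
    time.sleep(0.5)
    # Deliver the signal to *this* (non-main) thread: the main thread's
    # select() is not interrupted, so only the wakeup fd can wake it.
    signal.pthread_kill(threading.get_ident(), signal.SIGUSR1)

t = threading.Thread(target=send_from_worker)
t.start()

# Wakes up because CPython's C-level handler wrote a byte to wsock.
ready, _, _ = select.select([rsock], [], [], 5.0)
print("select() woke via wakeup fd:", bool(ready))
t.join()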

I would drop self._saved_sighandler if I could; this is getting messy. I'm a bit surprised to think of code that contrives to work that way, i.e. code that temporarily uses asyncio.subprocess, doesn't need its own SIGCHLD handler to work for the duration, but needs it to start working again afterwards. (The documentation warned from the start that this watcher would conflict with other SIGCHLD handlers, and it was only introduced in 3.8...)

@cjerdonek (Member, Author):

Thanks @sourcejedi for the second pair of eyes and great additional comments! I like your doc suggestion and agree it should be incorporated in some form into this PR.

A couple unrelated things:

  • I filed a trivial refactoring PR (bpo-38323: Add guard clauses in MultiLoopChildWatcher, #22756) to precede the changes here. It will make the changes here smaller and the diff easier to understand and review.

  • I noticed that AbstractChildWatcher.attach_loop() has this documentation:

    If the watcher was previously attached to an event loop, then it is first detached before attaching to the new loop.

    So the current PR will also need to be adjusted to handle the loop being different if attach_loop() was already called; one possible shape is sketched below.
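
A hedged sketch of that documented detach-before-reattach contract (hypothetical names; note it tracks the loop in an attribute, which the earlier comments debate):

import signal

class ReattachSketch:
    _loop = None  # hypothetical single-loop bookkeeping

    def attach_loop(self, loop):
        # AbstractChildWatcher's documented contract: detach from any
        # previously attached loop before attaching to the new one.
        if self._loop is not None and self._loop is not loop:
            self._loop.remove_signal_handler(signal.SIGCHLD)
        self._loop = loop
        if loop is not None:
            loop.add_signal_handler(signal.SIGCHLD, self._sig_chld)

    def _sig_chld(self):
        pass  # child reaping elided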

@cjerdonek (Member, Author) commented Oct 18, 2020:

I just noticed that changing MultiLoopChildWatcher to require a loop may be the same as turning MultiLoopChildWatcher into SafeChildWatcher:
https://docs.python.org/3/library/asyncio-policy.html#asyncio.SafeChildWatcher

I guess this means MultiLoopChildWatcher should be changed to use signal.set_wakeup_fd(), but not via a loop's loop.add_signal_handler().

@sourcejedi commented Oct 19, 2020:

@asvetlov "MultiLoopChildWatcher problems can and should be fixed; there is nothing bad in the idea but slightly imperfect implementation."

This fix needs to switch from using signal.signal() to loop.add_signal_handler(). This requires the main thread to run an asyncio event loop. Previously, this requirement was only documented for SafeChildWatcher and FastChildWatcher. So we need to decide:

Q. Is it acceptable to add this new requirement? Or do we deprecate MultiLoopChildWatcher, and warn people that it occasionally hangs?

In other words, it is possible that an existing program used the "imperfect" MultiLoopChildWatcher while the main thread ran a non-asyncio event loop (and that non-asyncio event loop was using signal.set_wakeup_fd()). This fix would break any such program (hang in await subprocess.wait()).


signal.set_wakeup_fd() is roughly the only way Python programs can use signal handlers correctly. The Python-level signal handler is run in the main thread. If the main thread is waiting in a system call, it needs something to wake it up after a signal is delivered to a different thread, because POSIX says that process signals can be delivered to any thread.

(I say "process signals" to exclude pthread_kill(), and faults like SIGSEGV. Most programs should assume they have some threads started by random library code - see e.g. ThreadedChildWatcher :-P).

I guess this means MultiLoopChildWatcher should be changed to use signal.set_wakeup_fd(), but not via a loop's loop.add_signal_handler().

I don't know how that would help, sorry. We absolutely need to interrupt the epoll_wait() call in the main thread. We need that epoll to be watching the signal wakeup fd. Hence we need to talk to the main thread event loop. We may as well use loop.add_signal_handler().

@cjerdonek (Member, Author):

I said that because if MultiLoopChildWatcher is changed to require an event loop to be running in the main thread, then it will be identical to SafeChildWatcher. So I don't think that change should be made. Indeed, looking at MultiLoopChildWatcher's docstring, that looks to be its distinguishing feature:

class MultiLoopChildWatcher(AbstractChildWatcher):
"""A watcher that doesn't require running loop in the main thread.

So I was wondering if there might be any options for using signal.set_wakeup_fd() that don't require running a full-blown asyncio event loop (MultiLoopChildWatcher could do any setup it requires); one hypothetical shape of that idea is sketched below.
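
A hedged sketch of that hypothetical direction (my illustration, not anyone's proposal): the watcher owns the wakeup fd itself and drains it from a helper thread. Note this would still conflict with an event loop that also calls signal.set_wakeup_fd(), which is essentially sourcejedi's objection above.

import os
import signal
import threading

class OwnWakeupFdSketch:
    """Hypothetical: react to SIGCHLD without an asyncio loop in the main thread."""

    def __init__(self):
        self._rfd, self._wfd = os.pipe()
        os.set_blocking(self._wfd, False)
        # Installing *some* Python-level handler makes CPython's C-level
        # handler run (and write to the wakeup fd) when SIGCHLD arrives.
        signal.signal(signal.SIGCHLD, lambda signum, frame: None)
        signal.set_wakeup_fd(self._wfd)  # main thread only
        threading.Thread(target=self._drain, daemon=True).start()

    def _drain(self):
        while True:
            os.read(self._rfd, 4096)  # blocks until a signal writes a byte
            self._reap_children()     # waitpid() loop would go here

    def _reap_children(self):
        pass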

In the meantime, we might also want to think about whether a band-aid could be added to the tests, at least to prevent hangs as long as MultiLoopChildWatcher stays the same.

@aeros (Contributor) commented Oct 19, 2020:

Yeah, I would agree that changing MultiLoopChildWatcher to require a running event loop in the main thread would definitely defeat its main purpose.

In the meantime, we might also want to think about whether a band-aid could be added to the tests, at least to prevent hangs as long as MultiLoopChildWatcher stays the same.

My only concern though is that by bandaging the tests temporarily, it may encourage usage in production in situations that could lead to the hangs (although admittedly, I'm not aware of specific real-world situations where it is currently used). That said, it seems we may have to go that route for now, since this issue appears blocked on feedback from @asvetlov (it may require some fundamental design changes). Bandaging would at least ensure that other test failures aren't obscured by intermittent MultiLoopChildWatcher hangs.

@cjerdonek (Member, Author):

My only concern though is that by bandaging the tests temporarily, it may encourage usage in production in situations that could lead to the hangs

I just meant in master (e.g. for the next few months) while a resolution is developed. By the way, @aeros, can you review my other PR #22756, which is independent of (but related to) this issue?

@jstasiak (Contributor) left a comment:

LGTM, this fixes some hanging tests on Solaris/illumos/OpenIndiana.

sourcejedi referenced this pull request in sourcejedi/cpython Jan 20, 2021

@cjerdonek (Member, Author):

I'm closing this PR, which was mainly exploratory. See here for my conclusions (also noted above): https://bugs.python.org/issue38323#msg395113

@cjerdonek closed this Jun 4, 2021