
[WIP] bpo-38323: Fix MultiLoopChildWatcher hangs #20142

Closed

Conversation

@cjerdonek (Member) commented May 17, 2020:

cjerdonek added 2 commits May 16, 2020 18:47
This changes asyncio.MultiLoopChildWatcher's attach_loop() method
to call loop.add_signal_handler() instead of calling only
signal.signal().  This should eliminate some rare hangs since
loop.add_signal_handler() calls signal.set_wakeup_fd().  Without
this, the main thread sometimes wasn't getting awakened if a
signal occurred during an await.
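
A minimal sketch of that idea, not the actual patch (it relies on the fact stated above, that loop.add_signal_handler() calls signal.set_wakeup_fd(); the class and method names are illustrative):

import signal

class WatcherSketch:
    def attach_loop(self, loop):
        if loop is None:
            return
        # loop.add_signal_handler() installs the handler *and* sets the wakeup
        # fd via signal.set_wakeup_fd(), so a SIGCHLD arriving while the main
        # thread is blocked in a system call still wakes it up.
        loop.add_signal_handler(signal.SIGCHLD, self._sig_chld)

    def _sig_chld(self):
        # Reap finished children here (waitpid loop elided in this sketch).
        pass
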
(The review thread below concerns this tearDown() hunk from the PR's test changes:)

policy = asyncio.get_event_loop_policy()
watcher = policy.get_child_watcher()
policy.set_child_watcher(None)
watcher.attach_loop(None)
watcher.close()
# Since setUp() does super().setUp() first, do tearDown() last.
super().tearDown()
Contributor:

I think this change causes the CI to fail.

@aeros (Contributor) commented Oct 16, 2020:

Calling super().tearDown() last here should not be causing dangling threads, so I think this could be pointing to a more fundamental issue.

In a previous PR, I added an internal _join_threads() method to ThreadedChildWatcher that cleans up all non-daemon threads; it was meant to ensure all threads are joined when the watcher is closed. However, the _do_waitpid() threads created in add_child_handler() are daemon threads, so _join_threads() skips them (its filter only keeps non-daemon threads).

Based on the error message in Travis, it definitely seems to be those waitpid threads created in ThreadedChildWatcher causing the hangs:

0:11:41 load avg: 3.89 [375/424/1] test_asyncio failed (env changed) (2 min 3 sec) -- running: test_multiprocessing_spawn (1 min 2 sec)
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>

As a result, I think a viable fix might be simply removing the "and not thread.daemon" check in ThreadedChildWatcher._join_threads(), so that it also cleans up the waitpid threads when the watcher is closed (see the sketch after this comment). Based on my local testing, this resolves the issue, but test-with-buildbots should probably be used to make sure it doesn't cause side effects on other platforms.

A viable alternative might be changing add_child_handler() to spawn non-daemon threads instead. However, I suspect changing the internal _join_threads() method is less likely to disrupt user code that relies on those threads being daemon.
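
A hedged sketch of that tweak (the method body is modeled on CPython's ThreadedChildWatcher at the time; treat it as illustrative rather than the exact patch):

class ThreadedChildWatcherSketch:
    def __init__(self):
        self._threads = {}  # pid -> threading.Thread, as in the real watcher

    def _join_threads(self):
        """Join all running waitpid threads, daemon or not."""
        threads = [t for t in list(self._threads.values())
                   if t.is_alive()]  # the "and not t.daemon" filter removed
        for t in threads:
            t.join()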

@cjerdonek (Member, Author):

I don't recall exactly why I made this particular tearDown() change for the purposes of this PR. (There was probably some failure I was trying to get past quickly.) In any case, it should be incidental to the MultiLoopChildWatcher hang. So if any tearDown()-related changes are needed for the asyncio tests, they should be made in a separate PR, IMO, independent of the MultiLoopChildWatcher hang issue. In other words, the eventual PR shouldn't contain this change.

@cjerdonek (Member, Author):

@aeros Just to clarify what I said above, if you feel your ThreadedChildWatcher change would help the test harness in general, I would recommend making that change in a separate PR to reduce the number of moving parts in the changes needed to address the MultiLoopChildWatcher hang.

Contributor:

Alright, I'll look into opening a separate PR to address it. However, if work on this PR is continued (either by you or by me), I'd suggest reverting the super().tearDown() reordering.

@cjerdonek (Member, Author):

Right. I guess there are a few steps to be done. First, you can make your ThreadedChildWatcher change in a separate PR, and the super().tearDown() change can be reverted in this PR. If, however, after resuming work on this PR, it looks like a change to the teardown logic is needed for some reason, I would still suggest making that tearDown() change in a separate PR, since it affects many of the watcher tests, not just MultiLoopChildWatcher, and it would be good to isolate it.

@aeros (Contributor) commented Oct 16, 2020:

@cjerdonek Are you still interested in/able to work on this? I'm mainly asking since it's been a few months since there was activity, and I'm fairly confident that my suggestion will resolve the CI failures. If you're not currently able to, I can open a separate PR applying these changes in addition to mine, with you as co-author, titling it something like "fix MultiLoopChildWatcher and ThreadedChildWatcher hangs".

@vstinner recently brought to my attention that this issue has some degree of urgency, as it has been causing intermittent failures in the buildbots for quite some time now. So there's definitely an advantage to having it resolved sooner rather than later.

@cjerdonek (Member, Author):

@aeros As I said in the reply above to the review comment, I think the tearDown() change I made here shouldn't be done in this PR (or in any PR in combination with a fix for the hang). It should be incidental to the MultiLoopChildWatcher hang and is a separate issue (if it is an issue).

Regarding the rest of the PR, sure, you're welcome to continue what I started. However, I definitely don't think the changes I made to MultiLoopChildWatcher in this PR should be made as is. This was just a minimal start that applied what I found and that I confirmed addressed the hang. So while the general direction of the solution and PR should be sound (i.e. ensuring that the wakeup file descriptor is set in attach_loop() as I mentioned in the bug report), there are other aspects that need to be addressed.

For example, I don't think attach_loop() should be setting self._loop as I did. And there's a chance it may need to support being called on more than one loop, which in turn means close() will need to be able to unwind the operation for each loop that attach_loop() was called on. So you may need to record, in an attribute, the set of loops that attach_loop() was called on (see the sketch below). I stopped working on the PR before thinking through these other issues.
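
A hedged sketch of that bookkeeping, assuming a hypothetical _loops attribute (none of this is from the actual patch):

import signal

class MultiLoopSketch:
    def __init__(self):
        self._loops = set()  # hypothetical: every loop attach_loop() saw

    def attach_loop(self, loop):
        if loop is None:
            return
        loop.add_signal_handler(signal.SIGCHLD, self._sig_chld)
        self._loops.add(loop)

    def close(self):
        # Unwind attach_loop() for each recorded loop.
        for loop in self._loops:
            loop.remove_signal_handler(signal.SIGCHLD)
        self._loops.clear()

    def _sig_chld(self):
        pass  # child reaping elided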

(The next review thread quotes this hunk from attach_loop():)

if loop is None or self._saved_sighandler is not None:
    return

self._loop = loop

@cjerdonek (Member, Author):

Part of why I was saying to remove this line is to honor the code comment 7 lines above:

# Don't save the loop but initialize itself if called first time

@cjerdonek (Member, Author):

Something else to be aware of is that MultiLoopChildWatcher has this code comment right after its docstring:

# Implementation note:
# The class keeps compatibility with AbstractChildWatcher ABC
# To achieve this it has empty attach_loop() method
# and doesn't accept explicit loop argument
# for add_child_handler()/remove_child_handler()
# but retrieves the current loop by get_running_loop()

The part mentioning its "empty attach_loop()" method should probably be updated. I'm not sure why it says that and what it means (since its attach_loop() method isn't currently empty).

@aeros (Contributor) commented Oct 16, 2020:

The part mentioning its "empty attach_loop()" method should probably be updated. I'm not sure why it says that and what it means (since its attach_loop() method isn't currently empty).

From the git blame, it looks like Andrew wrote that note when initially implementing the MultiLoopChildWatcher class in 0d671c0. @asvetlov, could you clarify?

@cjerdonek (Member, Author):

I reverted the tearDown() change. We'll see what the tests do.

(Another review thread quotes this hunk from attach_loop():)

# Don't save the loop but initialize itself if called first time
# The reason to do it here is that attach_loop() is called from
# unix policy only for the main thread.
# Main thread is required for subscription on SIGCHLD signal
if loop is None or self._saved_sighandler is not None:

@cjerdonek (Member, Author):

@aeros I would think through what should happen if either loop is None or self._saved_sighandler is not None. For example, the previous implementation didn't use the loop argument, so there is a risk that code will stop working for people who were passing loop=None.
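
For illustration only, here is a hedged sketch of those two cases as separate guard clauses (hypothetical names; the actual guard-clause refactoring is PR #22756, mentioned below):

import signal

class GuardSketch:
    def __init__(self):
        self._saved_sighandler = None

    def attach_loop(self, loop):
        # Two separate guards make the two questions explicit:
        if self._saved_sighandler is not None:
            return  # already initialized; should a second call re-attach?
        if loop is None:
            return  # old code ignored `loop`; None callers must keep working
        self._saved_sighandler = signal.getsignal(signal.SIGCHLD)
        loop.add_signal_handler(signal.SIGCHLD, self._sig_chld)

    def _sig_chld(self):
        pass  # child reaping elided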

@sourcejedi commented Oct 17, 2020:

Can I suggest an improved documentation-comment?

class MultiLoopChildWatcher(AbstractChildWatcher):
    """A watcher that allows running event loops outside of the main thread.

    This implementation registers a SIGCHLD handler in the event loop of
    the main thread.  It requires the main thread use an asyncio event loop,
    and it will be affected by anything that delays or blocks event processing
    in the main thread.  It will conflict with any other code that installs
    its own signal handler for SIGCHLD.
    """

I noticed a similar constraint in the existing code. MultiLoopChildWatcher could cause hangs if the main thread is waiting in a system call, because the watcher's signal handler would not get run until the system call returns. It only works properly if the main thread can be woken up by [EDIT: a signal that happens to be received on the main thread, or] the fd passed to signal.set_wakeup_fd().

This PR tightens the requirements slightly. MultiLoopChildWatcher now needs the main thread to run an asyncio event loop. It will cause hangs if the main thread waits elsewhere in a system call, or if the main thread runs a very long computation without returning to the event loop.


(The code looks plausible. Good analysis of a nasty problem. The signal arrives, e.g., while the main thread is inside the libc select() function, and the Python signal handler, MultiLoopChildWatcher._sig_chld(), doesn't get a chance to run until the low-level system call returns.

signal.set_wakeup_fd() is the solution: it gets the select() system call to wake up, so the Python-level SIGCHLD handler gets a chance to run.)
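
A self-contained demonstration of that mechanism (my own sketch, assuming a Unix platform; SIGUSR1 stands in for SIGCHLD): delivering the signal to a non-main thread does not EINTR-interrupt the main thread's select(), so only the byte written to the wakeup fd wakes it.

import select
import signal
import socket
import threading
import time

def handler(signum, frame):
    print("Python-level handler ran in the main thread")

rsock, wsock = socket.socketpair()
wsock.setblocking(False)  # set_wakeup_fd requires a non-blocking fd
signal.signal(signal.SIGUSR1, handler)
signal.set_wakeup_fd(wsock.fileno())  # must be called from the main thread

def send_from_worker():
    time.sleep(0.5)
    # Deliver the signal to *this* (non-main) thread: the main thread's
    # select() is not interrupted, so only the wakeup fd can wake it.
    signal.pthread_kill(threading.get_ident(), signal.SIGUSR1)

t = threading.Thread(target=send_from_worker)
t.start()

# Wakes up because CPython's C-level handler wrote a byte to wsock.
ready, _, _ = select.select([rsock], [], [], 5.0)
print("select() woke via wakeup fd:", bool(ready))
t.join()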

I would drop self._saved_sighandler if I could; this is getting messy. I'm a bit surprised to think of code that contrives to work that way, i.e. code that temporarily uses asyncio.subprocess, doesn't need its own SIGCHLD handler to work for the duration, but needs it to start working again afterwards. (The documentation warned from the start that this watcher would conflict with other SIGCHLD handlers, and it was only introduced in 3.8...)

@cjerdonek (Member, Author):

Thanks @sourcejedi for the second pair of eyes and great additional comments! I like your doc suggestion and agree it should be incorporated in some form into this PR.

A couple unrelated things:

  • I filed a trivial refactoring PR (bpo-38323: Add guard clauses in MultiLoopChildWatcher, #22756) to precede the changes here. It will make the changes here smaller and the diff easier to understand and review.

  • I noticed that AbstractChildWatcher.attach_loop() has this documentation:

    If the watcher was previously attached to an event loop, then it is first detached before attaching to the new loop.

    So the current PR will also need to be adjusted to handle the loop being different if attach_loop() was already called; one possible shape is sketched below.
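
A hedged sketch of that documented detach-before-reattach contract (hypothetical names; note it tracks the loop in an attribute, which the earlier comments debate):

import signal

class ReattachSketch:
    _loop = None  # hypothetical single-loop bookkeeping

    def attach_loop(self, loop):
        # AbstractChildWatcher's documented contract: detach from any
        # previously attached loop before attaching to the new one.
        if self._loop is not None and self._loop is not loop:
            self._loop.remove_signal_handler(signal.SIGCHLD)
        self._loop = loop
        if loop is not None:
            loop.add_signal_handler(signal.SIGCHLD, self._sig_chld)

    def _sig_chld(self):
        pass  # child reaping elided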

@cjerdonek (Member, Author) commented Oct 18, 2020:

I just noticed that changing MultiLoopChildWatcher to require a loop may be the same as turning MultiLoopChildWatcher into SafeChildWatcher:
https://docs.python.org/3/library/asyncio-policy.html#asyncio.SafeChildWatcher

I guess this means MultiLoopChildWatcher should be changed to use signal.set_wakeup_fd(), but not via a loop's loop.add_signal_handler().

@sourcejedi commented Oct 19, 2020:

@asvetlov "MultiLoopChildWatcher problems can and should be fixed; there is nothing bad in the idea but slightly imperfect implementation."

This fix needs to switch from using signal.signal() to loop.add_signal_handler(). This requires the main thread to run an asyncio event loop. Previously, this requirement was only documented for SafeChildWatcher and FastChildWatcher. So we need to decide:

Q. Is it acceptable to add this new requirement? Or do we deprecate MultiLoopChildWatcher, and warn people that it occasionally hangs?

In other words, it is possible that an existing program used the "imperfect" MultiLoopChildWatcher while the main thread ran a non-asyncio event loop (and that non-asyncio event loop was using signal.set_wakeup_fd()). This fix would break any such program (hang in await subprocess.wait()).


signal.set_wakeup_fd() is roughly the only way Python programs can use signal handlers correctly. The Python-level signal handler is run in the main thread. If the main thread is waiting in a system call, it needs something to wake it up after a signal is delivered to a different thread, because POSIX says that process signals can be delivered to any thread.

(I say "process signals" to exclude pthread_kill(), and faults like SIGSEGV. Most programs should assume they have some threads started by random library code - see e.g. ThreadedChildWatcher :-P).

I guess this means MultiLoopChildWatcher should be changed to use signal.set_wakeup_fd(), but not via a loop's loop.add_signal_handler().

I don't know how that would help, sorry. We absolutely need to interrupt the epoll_wait() call in the main thread. We need that epoll to be watching the signal wakeup fd. Hence we need to talk to the main thread event loop. We may as well use loop.add_signal_handler().

@cjerdonek (Member, Author):

I said that because if MultiLoopChildWatcher is changed to require an event loop to be running in the main thread, then it will be identical to SafeChildWatcher. So I don't think that change should be made. Indeed, looking at MultiLoopChildWatcher's docstring, that looks to be its distinguishing feature:

class MultiLoopChildWatcher(AbstractChildWatcher):
"""A watcher that doesn't require running loop in the main thread.

So I was wondering if there might be any options for using signal.set_wakeup_fd() that don't require running a full-blown asyncio event loop (MultiLoopChildWatcher could do any setup it requires); one hypothetical shape of that idea is sketched below.
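
A hedged sketch of that hypothetical direction (my illustration, not anyone's proposal): the watcher owns the wakeup fd itself and drains it from a helper thread. Note this would still conflict with an event loop that also calls signal.set_wakeup_fd(), which is essentially sourcejedi's objection above.

import os
import signal
import threading

class OwnWakeupFdSketch:
    """Hypothetical: react to SIGCHLD without an asyncio loop in the main thread."""

    def __init__(self):
        self._rfd, self._wfd = os.pipe()
        os.set_blocking(self._wfd, False)
        # Installing *some* Python-level handler makes CPython's C-level
        # handler run (and write to the wakeup fd) when SIGCHLD arrives.
        signal.signal(signal.SIGCHLD, lambda signum, frame: None)
        signal.set_wakeup_fd(self._wfd)  # main thread only
        threading.Thread(target=self._drain, daemon=True).start()

    def _drain(self):
        while True:
            os.read(self._rfd, 4096)  # blocks until a signal writes a byte
            self._reap_children()     # waitpid() loop would go here

    def _reap_children(self):
        pass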

In the meantime, we might also want to think about whether a band-aid could be added to the tests, at least to prevent hangs as long as MultiLoopChildWatcher stays the same.

@aeros (Contributor) commented Oct 19, 2020:

Yeah, I would agree that changing MultiLoopChildWatcher to require a running event loop in the main thread would definitely defeat its main purpose.

In the meantime, we might also want to think about whether a band-aid could be added to the tests, at least to prevent hangs as long as MultiLoopChildWatcher stays the same.

My only concern though is that by bandaging the tests temporarily, it may encourage usage in production in situations that could lead to the hangs (although admittedly, I'm not aware of specific real-world situations where it is currently used). That said, it seems we may have to go that route for now, since this issue appears blocked on feedback from @asvetlov (it may require some fundamental design changes). Bandaging would at least ensure that other test failures aren't obscured by intermittent MultiLoopChildWatcher hangs.

@cjerdonek (Member, Author):

My only concern though is that by bandaging the tests temporarily, it may encourage usage in production in situations that could lead to the hangs

I just meant in master (e.g. for the next few months) while a resolution is developed. By the way, @aeros, can you review my other PR #22756, which is independent of (but related to) this issue?

@jstasiak (Contributor) left a comment:

LGTM, this fixes some hanging tests on Solaris/illumos/OpenIndiana.

sourcejedi referenced this pull request in sourcejedi/cpython Jan 20, 2021

@cjerdonek (Member, Author):

I'm closing this PR, which was mainly exploratory. See here for my conclusions (also noted above): https://bugs.python.org/issue38323#msg395113

@cjerdonek closed this Jun 4, 2021