Skip to content

asyncio: Improve support Python 3.8 on Windows #2815

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Sep 20, 2020

Conversation

bdarnell
Copy link
Member

This commit removes the need for applications to work around the
backwards-incompatible change to the default event loop. Instead,
Tornado will detect the use of the windows proactor event loop and
start a selector event loop in a separate thread.

Closes #2804

@minrk
Copy link
Contributor

minrk commented Feb 27, 2020

As a Jupyter maintainer, I'd love to see this in a release, so anything I can do to help it along, let me know.

I opened bdarnell#2 against this branch, which gets things passing for me on non-Windows, but it looks like there are still a couple thread-related kinks when this actually gets used.

We've been pretty slow to roll out the application-level workaround for #2804 because we have so many applications. I think your points about this being a pain at the application level and not having a good library-level solution are being borne out. Notebook itself and ipykernel were fixed relatively promptly, but there are still stragglers like QtConsole, and there isn't a good way for us in Jupyter to have a single global fix that wouldn't have the same problems as tornado itself setting the event loop at import/start/etc.

@minrk
Copy link
Contributor

minrk commented Feb 27, 2020

The set_wakeup_fd error (ValueError: set_wakeup_fd only works in main thread) is a bug in Python 3.8.0, fixed in 3.8.1 https://bugs.python.org/issue34679 (commit), but appveyor still only has 3.8.0 on the Windows images.

@minrk
Copy link
Contributor

minrk commented Feb 27, 2020

Another test failure is in ioloop test test_handler_callback_file_object.

The test failure is because the fds list is expected to be [socket_obj, socket_fileno], but some runs (not every time), events occur more than once, e.g. [socket_obj, socket_obj, socket_fileno, socket_fileno, socket_fileno]. I believe this is caused by races in the thread synchronization:

When a read event fires:

  • [selector] calls _real_loop.call_soon_threadsafe to schedule read handler
  • [selector] still has read events waiting, next iteration schedules read handler again
  • [main] calls wrapped callback first time, closes sock
  • [main] calls wrapped callback second time

The _run_on_selector uses concurrent.futures to make a blocking call from real loop -> selector. I think to preserve these handlers rigorously, we need the callback to be similarly symmetrical with _run_on_main called frmo the selector, instead of just the async call_soon_threadsafe so that the selector is blocked until the actual callback finishes firing in the main loop.

@bdarnell
Copy link
Member Author

Excellent, thanks for jumping in and testing!

That test_handler_callback_file_object is a tricky one - wonder why it didn't show up on my machine? Anyway, a blocking _run_on_main won't work here - it would deadlock if the callback needed to call back to the selector thread to add or remove a handler. A short delay (like you have in your patch) would get this test passing (at least most of the time) but doesn't reliably solve the problem.

Other partial solutions include

  • Declaring spurious wakeups to be legal. Semantically I think this would be fine for most production code, although tests like this one are sensitive to it. Plus unless there is some sort of delay on the selector thread it could lead to unbounded memory growth by adding more callbacks to the main thread's queue.
  • Call select.select before running a callback to ensure the desired condition still holds. Same problem of unbounded memory growth, but it lets us reject the spurious callbacks a little faster and without visible changes to application code behavior
  • Unregister file descriptors from the selector before running their callbacks, and re-register them once they're finished. This is bad for performance but I think it would work OK.

I think we need to control the select loop itself instead of just using the SelectorEventLoop black box API. This probably means writing directly to the lower-level selector interface (or select.select? Since this is windows-only it wouldn't benefit from the stateful selector objects and I think this interface may turn out a little simpler), which I just said I didn't want to do in https://bugs.python.org/issue37373#msg362658. Ah well, at least socket.socketpair looks like it'll make the implementation simpler this time around.

@minrk
Copy link
Contributor

minrk commented Feb 28, 2020

That test_handler_callback_file_object is a tricky one - wonder why it didn't show up on my machine?

I think it makes sense since it’s a race - sometimes main will process the event before selector’s next tick, in which case it’ll pass, sometimes it won’t. It fails less than half of the time for me, but pretty often still. I saw everywhere from 0 to 7 extra events when I added some prints to debug it.

The delay I added wasn’t a generic sleep to help win the race more often; a future is used to prevent duplicate calls to the “real” callback, while duplicate firing in the selector results in a sleep until the “real” callback finishes firing.

I’m only working part time, so it’s a bit of a flyby contribution, but if you haven’t had a chance by the time I get some time next week, I’ll look at the select approach.

@bdarnell
Copy link
Member Author

I've reworked this to fix the race; all the tests are now passing on windows. It's still failing in CI, but only because there's a bunch of log spam. Most of that is coming from asyncio itself so I may just have to disable the log checks on windows.

@minrk If you'd like to take another look, pay attention to the issues behind #2719. That test needed some tweaking and I'm not entirely sure whether the test was at fault or if the bug is resurfacing with the new event loop.

@bdarnell bdarnell force-pushed the thread-selector branch 2 times, most recently from 0a0b7e3 to eebd582 Compare August 7, 2020 17:18
bdarnell and others added 7 commits September 2, 2020 11:10
Without this try/finally, if this test ever fails, errors can be
reported in a confusing way.
Closing the file descriptor without removing the corresponding handler
is technically incorrect, although the default IOLoops don't have a
problem with it.
This commit removes the need for applications to work around the
backwards-incompatible change to the default event loop. Instead,
Tornado will detect the use of the windows proactor event loop and
start a selector event loop in a separate thread.

Closes tornadoweb#2804
Running a whole event loop on the other thread leads to tricky
synchronization problems. Instead, keep as much as possible on the
main thread, and call out to a second thread only for the blocking
select system call itself.
Use this on windows due to a log spam issue in asyncio.
Restarting the event loop to "cleanly" shut down a coroutine introduces
other problems (mainly manifesting as errors logged while running
tornado.test.gen_test). Replace the coroutine with a pair of callbacks
so we don't need to do anything special to shut down without logging
warnings.
The just-released version 0.3.0 is incompatible with our older pinned
version of sphinx.
@bdarnell
Copy link
Member Author

bdarnell commented Sep 2, 2020

This is getting more or less ready. I've submitted a fix to cpython to eliminate the "self pipe" logging (we'll still need to run without the log assertion until that fix comes out in python 3.8.6). Testing locally with that build I found another unrelated source of log spam which I fixed by moving away from coroutines in the event loop itself (it's hard to shut these things down cleanly when they're a part of the event loop).

Now things are looking good except for one failure I just saw on appveyor that I haven't been able to reproduce either locally or retrying on appveyor. It's not completely clear what's going on because appveyor is running python with the log spam issue (and it's just running 3.8.0 instead of 3.8.5, although I don't see anything in the changelog that's obviously relevant).

@stonebig
Copy link

Any hope for jupyterlab-3 final ? Around mid october

@bdarnell
Copy link
Member Author

Yeah, that should be doable.

I've re-run CI a bunch of times without seeing that mysterious failure again, so I'll just chalk it up to a fluke and merge this. Then I think there are just a couple of open PRs I want to evaluate and maybe merge before starting the release process for 6.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NotImplementedError when calling HTTPServer.listen in Python 3.8.1 and Tornado 6.0.x
3 participants