bpo-35310: Retry select() calls on EINTR even if deadline has passed #10700

oranav · 2018-11-25T14:46:23Z

select() calls are retried on EINTR (per PEP 475). However, if a
timeout was provided and the deadline has passed after running signal
handlers, we should still retry select() with a non-blocking call -- to
make sure rlist, wlist and xlist are initialized appropriately.

https://bugs.python.org/issue35310

the-knights-who-say-ni · 2018-11-25T14:46:26Z

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Our records indicate we have not received your CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day
before our records are updated.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

select() calls are retried on EINTR (per PEP 475). However, if a timeout was provided and the deadline has passed after running signal handlers, we should still retry select() with a non-blocking call -- to make sure rlist, wlist and xlist are initialized appropriately.

pablogsal · 2018-11-26T21:10:59Z

Lib/test/eintrdata/eintr_tester.py

    # signal delivery periodicity
    signal_period = 0.1
    # default sleep time for tests - should obviously have:
    # sleep_time > signal_period
    sleep_time = 0.2

    def sighandler(self, signum, frame):
+        time.sleep(self.signal_duration)


I am not convinced about this test. It relies on timing and this is very error prone on different platforms (I am very certain this will fail on s390x buildbots) . We have a lot of buildbot failing all the time because the tests rely on timing. Here, the signal delivery can be affected by lots of things (they are not processed instantly, as the Python signal handlers only set up a flag for later execution).

Check out #10413 for a more extended rationale.

I agree, I wasn't 100% happy with it either. I will take a look and revise the test. Thanks.

Please remove this sleep. It seems like a very bad idea to sleep in a thread handler. Moreover, I don't see the purpose of sleeping!?

bedevere-bot · 2018-11-26T21:12:20Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

vstinner

I'm not convinced that "retry select() with a non-blocking call" is the correct behavior.

oranav · 2018-11-26T21:30:08Z

@vstinner So what is it then? Obviously returning a file descriptor which is not ready is the wrong behavior (this is what happens now). In older Python versions (3.5 for example) it may even non-existent file descriptors!
Returning empty lists is another sane option, but it is also wrong in my opinion. The extra select(2) call takes negligible time, and may save the user from issuing another select.select().

asvetlov · 2018-11-27T08:38:59Z

I prefer an empty list on timeout instead of extra syscall.
Usually, people do iteration over select.select(), if sockets are not ready -- the check will be applied on the next iteration step.
Non-blocking syscall if the timeout has expired looks wrong.

I doubt if we need to clean up returning lists explicitly also: from my understanding select syscall always returns a correct (non-broken) result:

On error, -1 is returned, and errno is set to
indicate the error; the file descriptor sets are unmodified, and timeout becomes undefined

I read it "as EINTR returns empty lists". No code change is required for the case.

oranav · 2018-11-27T09:52:29Z

@asvetlov "unmodified" means "as you provided them", not "cleared". The test I added clearly breaks on my Linux 4.18 machine; see also the test case uploaded in the issue tracker. A code change is definitely needed.

vstinner

poll() and epoll() for example have the same bug. Please also fix them.

vstinner · 2018-11-27T11:03:39Z

Lib/test/eintrdata/eintr_tester.py

+        self.addCleanup(os.close, wr)
+        # timeout is enough for signal to fire, but not for it to return
+        timeout = self.signal_delay + self.signal_duration/2
+        rlist, _, _ = select.select([rd], [], [], timeout)


Please use self.sleep_time as the timeout.

But then the test doesn't necessarily trigger the bug!
As the comment states, we have to make sure the signal fires before reaching the timeout, but finishes running after reaching it. Any other case, no bug is triggered.

Currently, a signal is sent every 100 ms, whereas the sleep is 200 ms. tearDown() ensures that the Python signal handler has been called at least once. See https://bugs.python.org/issue35316 ;-)

Am I missing something? If timeout is 200ms:

select.select() is called

Signal handler starts after some time, say 90ms -- it fires before reaching the timeout, which is okay

Signal handler finishes immediately (or within 50ms, because of the time.sleep I've added)

Less than 200ms have passed (140ms) -- bug is not triggered, thus the test case is basically useless!

Again, I need a very fine timing situation - handler starts before reaching the timeout, but finishes only after.

vstinner · 2018-11-27T11:04:05Z

Lib/test/eintrdata/eintr_tester.py

    # signal delivery periodicity
    signal_period = 0.1
    # default sleep time for tests - should obviously have:
    # sleep_time > signal_period
    sleep_time = 0.2

    def sighandler(self, signum, frame):
+        time.sleep(self.signal_duration)


Please remove this sleep. It seems like a very bad idea to sleep in a thread handler. Moreover, I don't see the purpose of sleeping!?

vstinner · 2018-11-27T11:04:14Z

Lib/test/eintrdata/eintr_tester.py

@@ -39,13 +39,16 @@ class EINTRBaseTest(unittest.TestCase):

    # delay for initial signal delivery
    signal_delay = 0.1
+    # signal handler run time duration
+    signal_duration = 0.05


Please remove this variable.

Regarding both comments: as I've said, for the test case to be valid - we have to make sure the signal handler returns only after the timeout was reached. That's exactly the purpose of sleeping. How do you want me to make sure it happened?
I can maybe use a special parameter initialized just before the select.select() call which indicates the deadline, and sleep within the signal handler until this deadline is reached. It may be safer for buildbots, but it doesn't save us from sleeping. Honestly, I don't see how you trigger this bug without sleeping inside the signal handler.

Again, just to clarify - the bug only happens if the signal handler fires before the deadline, and returns after. It follows that the signal handler must take significant time.

vstinner · 2018-11-27T15:02:50Z

Ah, I see your confusion. A reliable test triggering your bug cannot be designed. I suggest to only write a test on select.select() getting EINTR... but there is already a test. Aha. Now I'm confused as well.

@oranav, @pablogsal: I propose to not add an unit test for the fix. Just fix the bugs, that's it.

oranav · 2018-12-01T13:44:32Z

I can devise a reliable test, but it will be somewhat more complex. Right before we call select.select(), we set up a variable indicating the select deadline time; then, when the signal handler fires, it waits until this deadline is reached, then returns. In all other tests, no deadline is set, and the handler returns immediately. It sounds good to me because of these reasons:

It doesn't affect any other test (a deadline won't be set, therefore the handler returns immediately)
It reliably tests the bug: the signal handler fires before the deadline (because of the original signal_delay and sleep_time values), but it returns only after the deadline is reached -- by design
Even in case something went wrong (e.g. timings were off), the test won't fail; the worst case is that the test passes even though the bug is there, which is, well, acceptable...

Reason 1 convinced me that it is a good idea. That being said, I don't mind removing this test entirely; I just want the bug to be fixed. I didn't want the bug to reappear later, so I thought adding a test might be a good idea.

@vstinner @pablogsal Would like to hear your opinion.

vstinner · 2018-12-03T16:35:30Z

I can devise a reliable test, but it will be somewhat more complex.

When you write "complex", I read "fragile". Signals are not reliable at all, it's just a mess :-)

I do maintain all CI, mostyl with @pablogsal. It's a painful work. I don't want to merge a new test which might fail randomly. So I suggest to remove the test, sorry.

My review stands. Please fix also other function of the select module.

oranav · 2018-12-03T18:28:43Z

Just to be clear - I got you up until the point you said "which might fail randomly". This is entirely incorrect: the test will fail only if a file descriptor that is not readable will be returned from select. If this happens, there is definitely a bug. The test cannot fail randomly! It can only pass randomly.

A file descriptor which is not readable should never be returned from select, and if it is, then there's a bug, certainly.

Again, I don't have any special reason to insist for this test (as I said, I just want the bug fixed) -- I just believe that you misunderstood and therefore decline it. I will be glad to remove the test to get this merged; I just want it to be for the right reasons :-)

pablogsal · 2018-12-03T18:46:54Z

@oranav The test can fail randomly among other things because signal delivery is not deterministic in all platforms. The documentation of alarm(2) says:

Scheduling delays can, as ever, cause the execution of the process to be delayed by an arbitrary amount of time.

oranav · 2018-12-03T19:34:05Z

@pablogsal I know it's not deterministic. Even if everything's going wrong, which assert call is going to fail in this case?
I'll remove the test anyway; I'm just curious, that's all :)

pablogsal · 2018-12-03T19:56:13Z

@oranav The test may not be testing what it says is testing: The signal can be delayed an arbitrary amount of time and therefore select may not be interrupted by SIGALRM because it arrives late
Multiply this by very slow buildbots and weird platforms (we have s390 machines) and you have maximum unpredictability and a new circle of madness to maintain (think that the signal can arrive while other test is running).

oranav · 2018-12-03T20:53:19Z

@pablogsal I see, thanks!
I have made the requested changes; please review again.
I'm also referring to pull request #10877, which fixes the bug by clearing the lists before returning (instead of issuing another non-blocking syscall); obviously one pull request should be declined and the other hopefully merged.

@vstinner You said poll and epoll have the same bug. I don't think so: This line clears the result for poll calls, and this one clears it for epoll.poll(). In addition, devpoll.poll() and kqueue.control() are also unaffected. Only select.select() is.

bedevere-bot · 2018-12-03T20:53:22Z

Thanks for making the requested changes!

@pablogsal, @vstinner: please review the changes made to this pull request.

vstinner · 2018-12-03T22:15:22Z

@vstinner You said poll and epoll have the same bug. I don't think so: This line clears the result for poll calls, and this one clears it for epoll.poll(). In addition, devpoll.poll() and kqueue.control() are also unaffected. Only select.select() is.

Oh, I see. The bug is not the general concept of not calling again the function in non-blocking mode. The bug is specific to select() and the "fd_set" types. If select() failed with EINTR, file descriptor sets are in an undefined state: likely in a bug state, considering the bug report. So the bug is really specific to select() and how we handle EINTR.

I understood that there are two options to fix the bug:

retry select() in non-blocking mode
clear the file descriptor sets

If an attacker process bombs the victim process with signals, is it possible that select() keeps failing with EINTR for signficant time? Say at least 1 second? If it's possible, it would be safer to clear the sets instead.

poll, epoll, etc. don't retry in blocking mode. In case of doubt, I would prefer to mimick epoll behavior because this selector is one of the most tested selector: default one used by the selectors module and so by asyncio. So I would prefer to clear sets.

What do you think?

oranav · 2018-12-03T22:53:25Z

@vstinner The fd-sets are unmodified (they remain in the same state as the user passed them), which is indeed a bug: one of the fds might be unreadable even though it is returned in the rlist. Since all the other functions clear the lists before returning, the bug is not there.

Yes, you understood correctly. I don't think that an attacker process could case such a significant time, as it really depends on the signal handler itself; if the signal handler itself is slow, then the attacker process might cause such delays with or without the select() bug.

That being said, I'm more and more convinced that we should take the second approach (as in #10877). It mimics other behaviors in this module, and it also "feels right" to me to return as soon as possible after the deadline is reached.

vstinner · 2018-12-03T23:21:43Z

That being said, I'm more and more convinced that we should take the second approach (as in #10877). It mimics other behaviors in this module, and it also "feels right" to me to return as soon as possible after the deadline is reached.

Well, that's the normal behavior:

>>> import select, os
>>> r,w=os.pipe()
>>> select.select([r], [], [], 1.0)
([], [], [])

In case of timeout, select.select() returns 3 empty lists. That's it. The fact that it was interrupted shouldn't change its behavior.

Note: time.sleep() is implemented internally using select(). So I would expect that time.sleep(1.0) and select.select([fd], [], [], 1.0) would have the same duration even if select.select() is interrupted by many signals. The contract is that the overall operation takes up to 1.0 second, but not more.

vstinner · 2018-12-03T23:26:08Z

I propose to close this PR in favor of PR #10877 which is IMHO the right fix.

the-knights-who-say-ni added the CLA not signed label Nov 25, 2018

bedevere-bot added the awaiting review label Nov 25, 2018

oranav force-pushed the fix-bpo-35310 branch from f3be8d9 to d076c53 Compare November 25, 2018 14:56

pablogsal requested changes Nov 26, 2018

View reviewed changes

bedevere-bot removed the awaiting review label Nov 26, 2018

bedevere-bot added the awaiting changes label Nov 26, 2018

the-knights-who-say-ni added CLA signed and removed CLA not signed labels Nov 26, 2018

vstinner requested changes Nov 26, 2018

View reviewed changes

vstinner requested changes Nov 27, 2018

View reviewed changes

bpo-35310: Remove EINTR test case when handler is delayed

9683be9

bedevere-bot added awaiting change review and removed awaiting changes labels Dec 3, 2018

oranav mentioned this pull request Dec 3, 2018

bpo-35310: Clear select() lists before returning upon EINTR #10877

Merged

oranav closed this Dec 5, 2018

pablogsal mentioned this pull request May 20, 2019

bpo-36977: Make SharedMemoryManager release its resources if its parent process dies #13451

Open

Uh oh!

bpo-35310: Retry select() calls on EINTR even if deadline has passed #10700

bpo-35310: Retry select() calls on EINTR even if deadline has passed #10700

Uh oh!

Conversation

oranav commented Nov 25, 2018 • edited by bedevere-bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

the-knights-who-say-ni commented Nov 25, 2018

Uh oh!

pablogsal Nov 26, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bedevere-bot commented Nov 26, 2018

Uh oh!

vstinner left a comment

Choose a reason for hiding this comment

Uh oh!

oranav commented Nov 26, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

asvetlov commented Nov 27, 2018

Uh oh!

oranav commented Nov 27, 2018

Uh oh!

vstinner left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

oranav Nov 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vstinner commented Nov 27, 2018

Uh oh!

oranav commented Dec 1, 2018

Uh oh!

vstinner commented Dec 3, 2018

Uh oh!

oranav commented Dec 3, 2018

Uh oh!

pablogsal commented Dec 3, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

oranav commented Dec 3, 2018

Uh oh!

pablogsal commented Dec 3, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

oranav commented Dec 3, 2018

Uh oh!

bedevere-bot commented Dec 3, 2018

Uh oh!

vstinner commented Dec 3, 2018

Uh oh!

oranav commented Dec 3, 2018

Uh oh!

vstinner commented Dec 3, 2018

Uh oh!

vstinner commented Dec 3, 2018

Uh oh!

Uh oh!

oranav commented Nov 25, 2018 •

edited by bedevere-bot

Loading

pablogsal Nov 26, 2018 •

edited

Loading

oranav commented Nov 26, 2018 •

edited

Loading

oranav Nov 27, 2018 •

edited

Loading

pablogsal commented Dec 3, 2018 •

edited

Loading

pablogsal commented Dec 3, 2018 •

edited

Loading