Skip to content

bpo-35310: Retry select() calls on EINTR even if deadline has passed #10700

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

oranav
Copy link
Contributor

@oranav oranav commented Nov 25, 2018

select() calls are retried on EINTR (per PEP 475). However, if a
timeout was provided and the deadline has passed after running signal
handlers, we should still retry select() with a non-blocking call -- to
make sure rlist, wlist and xlist are initialized appropriately.

https://bugs.python.org/issue35310

@the-knights-who-say-ni
Copy link

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Our records indicate we have not received your CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day
before our records are updated.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

select() calls are retried on EINTR (per PEP 475).  However, if a
timeout was provided and the deadline has passed after running signal
handlers, we should still retry select() with a non-blocking call -- to
make sure rlist, wlist and xlist are initialized appropriately.
# signal delivery periodicity
signal_period = 0.1
# default sleep time for tests - should obviously have:
# sleep_time > signal_period
sleep_time = 0.2

def sighandler(self, signum, frame):
time.sleep(self.signal_duration)
Copy link
Member

@pablogsal pablogsal Nov 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not convinced about this test. It relies on timing and this is very error prone on different platforms (I am very certain this will fail on s390x buildbots) . We have a lot of buildbot failing all the time because the tests rely on timing. Here, the signal delivery can be affected by lots of things (they are not processed instantly, as the Python signal handlers only set up a flag for later execution).

Check out #10413 for a more extended rationale.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I wasn't 100% happy with it either. I will take a look and revise the test. Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this sleep. It seems like a very bad idea to sleep in a thread handler. Moreover, I don't see the purpose of sleeping!?

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that "retry select() with a non-blocking call" is the correct behavior.

@oranav
Copy link
Contributor Author

oranav commented Nov 26, 2018

@vstinner So what is it then? Obviously returning a file descriptor which is not ready is the wrong behavior (this is what happens now). In older Python versions (3.5 for example) it may even non-existent file descriptors!
Returning empty lists is another sane option, but it is also wrong in my opinion. The extra select(2) call takes negligible time, and may save the user from issuing another select.select().

@asvetlov
Copy link
Contributor

I prefer an empty list on timeout instead of extra syscall.
Usually, people do iteration over select.select(), if sockets are not ready -- the check will be applied on the next iteration step.
Non-blocking syscall if the timeout has expired looks wrong.

I doubt if we need to clean up returning lists explicitly also: from my understanding select syscall always returns a correct (non-broken) result:

On error, -1 is returned, and errno is set to
indicate the error; the file descriptor sets are unmodified, and timeout becomes undefined

I read it "as EINTR returns empty lists". No code change is required for the case.

@oranav
Copy link
Contributor Author

oranav commented Nov 27, 2018

@asvetlov "unmodified" means "as you provided them", not "cleared". The test I added clearly breaks on my Linux 4.18 machine; see also the test case uploaded in the issue tracker. A code change is definitely needed.

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

poll() and epoll() for example have the same bug. Please also fix them.

self.addCleanup(os.close, wr)
# timeout is enough for signal to fire, but not for it to return
timeout = self.signal_delay + self.signal_duration/2
rlist, _, _ = select.select([rd], [], [], timeout)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use self.sleep_time as the timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then the test doesn't necessarily trigger the bug!
As the comment states, we have to make sure the signal fires before reaching the timeout, but finishes running after reaching it. Any other case, no bug is triggered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, a signal is sent every 100 ms, whereas the sleep is 200 ms. tearDown() ensures that the Python signal handler has been called at least once. See https://bugs.python.org/issue35316 ;-)

Copy link
Contributor Author

@oranav oranav Nov 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I missing something? If timeout is 200ms:

  1. select.select() is called
  2. Signal handler starts after some time, say 90ms -- it fires before reaching the timeout, which is okay
  3. Signal handler finishes immediately (or within 50ms, because of the time.sleep I've added)
  4. Less than 200ms have passed (140ms) -- bug is not triggered, thus the test case is basically useless!

Again, I need a very fine timing situation - handler starts before reaching the timeout, but finishes only after.

# signal delivery periodicity
signal_period = 0.1
# default sleep time for tests - should obviously have:
# sleep_time > signal_period
sleep_time = 0.2

def sighandler(self, signum, frame):
time.sleep(self.signal_duration)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this sleep. It seems like a very bad idea to sleep in a thread handler. Moreover, I don't see the purpose of sleeping!?

@@ -39,13 +39,16 @@ class EINTRBaseTest(unittest.TestCase):

# delay for initial signal delivery
signal_delay = 0.1
# signal handler run time duration
signal_duration = 0.05
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding both comments: as I've said, for the test case to be valid - we have to make sure the signal handler returns only after the timeout was reached. That's exactly the purpose of sleeping. How do you want me to make sure it happened?
I can maybe use a special parameter initialized just before the select.select() call which indicates the deadline, and sleep within the signal handler until this deadline is reached. It may be safer for buildbots, but it doesn't save us from sleeping. Honestly, I don't see how you trigger this bug without sleeping inside the signal handler.

Again, just to clarify - the bug only happens if the signal handler fires before the deadline, and returns after. It follows that the signal handler must take significant time.

@vstinner
Copy link
Member

Ah, I see your confusion. A reliable test triggering your bug cannot be designed. I suggest to only write a test on select.select() getting EINTR... but there is already a test. Aha. Now I'm confused as well.

@oranav, @pablogsal: I propose to not add an unit test for the fix. Just fix the bugs, that's it.

@oranav
Copy link
Contributor Author

oranav commented Dec 1, 2018

I can devise a reliable test, but it will be somewhat more complex. Right before we call select.select(), we set up a variable indicating the select deadline time; then, when the signal handler fires, it waits until this deadline is reached, then returns. In all other tests, no deadline is set, and the handler returns immediately. It sounds good to me because of these reasons:

  1. It doesn't affect any other test (a deadline won't be set, therefore the handler returns immediately)
  2. It reliably tests the bug: the signal handler fires before the deadline (because of the original signal_delay and sleep_time values), but it returns only after the deadline is reached -- by design
  3. Even in case something went wrong (e.g. timings were off), the test won't fail; the worst case is that the test passes even though the bug is there, which is, well, acceptable...

Reason 1 convinced me that it is a good idea. That being said, I don't mind removing this test entirely; I just want the bug to be fixed. I didn't want the bug to reappear later, so I thought adding a test might be a good idea.

@vstinner @pablogsal Would like to hear your opinion.

@vstinner
Copy link
Member

vstinner commented Dec 3, 2018

I can devise a reliable test, but it will be somewhat more complex.

When you write "complex", I read "fragile". Signals are not reliable at all, it's just a mess :-)

I do maintain all CI, mostyl with @pablogsal. It's a painful work. I don't want to merge a new test which might fail randomly. So I suggest to remove the test, sorry.

My review stands. Please fix also other function of the select module.

@oranav
Copy link
Contributor Author

oranav commented Dec 3, 2018

Just to be clear - I got you up until the point you said "which might fail randomly". This is entirely incorrect: the test will fail only if a file descriptor that is not readable will be returned from select. If this happens, there is definitely a bug. The test cannot fail randomly! It can only pass randomly.

A file descriptor which is not readable should never be returned from select, and if it is, then there's a bug, certainly.

Again, I don't have any special reason to insist for this test (as I said, I just want the bug fixed) -- I just believe that you misunderstood and therefore decline it. I will be glad to remove the test to get this merged; I just want it to be for the right reasons :-)

@pablogsal
Copy link
Member

pablogsal commented Dec 3, 2018

@oranav The test can fail randomly among other things because signal delivery is not deterministic in all platforms. The documentation of alarm(2) says:

Scheduling delays can, as ever, cause the execution of the process to be delayed by an arbitrary amount of time.

@oranav
Copy link
Contributor Author

oranav commented Dec 3, 2018

@pablogsal I know it's not deterministic. Even if everything's going wrong, which assert call is going to fail in this case?
I'll remove the test anyway; I'm just curious, that's all :)

@pablogsal
Copy link
Member

pablogsal commented Dec 3, 2018

@oranav The test may not be testing what it says is testing: The signal can be delayed an arbitrary amount of time and therefore select may not be interrupted by SIGALRM because it arrives late
Multiply this by very slow buildbots and weird platforms (we have s390 machines) and you have maximum unpredictability and a new circle of madness to maintain (think that the signal can arrive while other test is running).

@oranav
Copy link
Contributor Author

oranav commented Dec 3, 2018

@pablogsal I see, thanks!
I have made the requested changes; please review again.
I'm also referring to pull request #10877, which fixes the bug by clearing the lists before returning (instead of issuing another non-blocking syscall); obviously one pull request should be declined and the other hopefully merged.

@vstinner You said poll and epoll have the same bug. I don't think so: This line clears the result for poll calls, and this one clears it for epoll.poll(). In addition, devpoll.poll() and kqueue.control() are also unaffected. Only select.select() is.

@bedevere-bot
Copy link

Thanks for making the requested changes!

@pablogsal, @vstinner: please review the changes made to this pull request.

@vstinner
Copy link
Member

vstinner commented Dec 3, 2018

@vstinner You said poll and epoll have the same bug. I don't think so: This line clears the result for poll calls, and this one clears it for epoll.poll(). In addition, devpoll.poll() and kqueue.control() are also unaffected. Only select.select() is.

Oh, I see. The bug is not the general concept of not calling again the function in non-blocking mode. The bug is specific to select() and the "fd_set" types. If select() failed with EINTR, file descriptor sets are in an undefined state: likely in a bug state, considering the bug report. So the bug is really specific to select() and how we handle EINTR.

I understood that there are two options to fix the bug:

  • retry select() in non-blocking mode
  • clear the file descriptor sets

If an attacker process bombs the victim process with signals, is it possible that select() keeps failing with EINTR for signficant time? Say at least 1 second? If it's possible, it would be safer to clear the sets instead.

poll, epoll, etc. don't retry in blocking mode. In case of doubt, I would prefer to mimick epoll behavior because this selector is one of the most tested selector: default one used by the selectors module and so by asyncio. So I would prefer to clear sets.

What do you think?

@oranav
Copy link
Contributor Author

oranav commented Dec 3, 2018

@vstinner The fd-sets are unmodified (they remain in the same state as the user passed them), which is indeed a bug: one of the fds might be unreadable even though it is returned in the rlist. Since all the other functions clear the lists before returning, the bug is not there.

Yes, you understood correctly. I don't think that an attacker process could case such a significant time, as it really depends on the signal handler itself; if the signal handler itself is slow, then the attacker process might cause such delays with or without the select() bug.

That being said, I'm more and more convinced that we should take the second approach (as in #10877). It mimics other behaviors in this module, and it also "feels right" to me to return as soon as possible after the deadline is reached.

@vstinner
Copy link
Member

vstinner commented Dec 3, 2018

That being said, I'm more and more convinced that we should take the second approach (as in #10877). It mimics other behaviors in this module, and it also "feels right" to me to return as soon as possible after the deadline is reached.

Well, that's the normal behavior:

>>> import select, os
>>> r,w=os.pipe()
>>> select.select([r], [], [], 1.0)
([], [], [])

In case of timeout, select.select() returns 3 empty lists. That's it. The fact that it was interrupted shouldn't change its behavior.

Note: time.sleep() is implemented internally using select(). So I would expect that time.sleep(1.0) and select.select([fd], [], [], 1.0) would have the same duration even if select.select() is interrupted by many signals. The contract is that the overall operation takes up to 1.0 second, but not more.

@vstinner
Copy link
Member

vstinner commented Dec 3, 2018

I propose to close this PR in favor of PR #10877 which is IMHO the right fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants