-
-
Notifications
You must be signed in to change notification settings - Fork 32.1k
bpo-35310: Retry select() calls on EINTR even if deadline has passed #10700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Our records indicate we have not received your CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. If you have recently signed the CLA, please wait at least one business day You can check yourself to see if the CLA has been received. Thanks again for your contribution, we look forward to reviewing it! |
select() calls are retried on EINTR (per PEP 475). However, if a timeout was provided and the deadline has passed after running signal handlers, we should still retry select() with a non-blocking call -- to make sure rlist, wlist and xlist are initialized appropriately.
f3be8d9
to
d076c53
Compare
Lib/test/eintrdata/eintr_tester.py
Outdated
# signal delivery periodicity | ||
signal_period = 0.1 | ||
# default sleep time for tests - should obviously have: | ||
# sleep_time > signal_period | ||
sleep_time = 0.2 | ||
|
||
def sighandler(self, signum, frame): | ||
time.sleep(self.signal_duration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not convinced about this test. It relies on timing and this is very error prone on different platforms (I am very certain this will fail on s390x buildbots) . We have a lot of buildbot failing all the time because the tests rely on timing. Here, the signal delivery can be affected by lots of things (they are not processed instantly, as the Python signal handlers only set up a flag for later execution).
Check out #10413 for a more extended rationale.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I wasn't 100% happy with it either. I will take a look and revise the test. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this sleep. It seems like a very bad idea to sleep in a thread handler. Moreover, I don't see the purpose of sleeping!?
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced that "retry select() with a non-blocking call" is the correct behavior.
@vstinner So what is it then? Obviously returning a file descriptor which is not ready is the wrong behavior (this is what happens now). In older Python versions (3.5 for example) it may even non-existent file descriptors! |
I prefer an empty list on timeout instead of extra syscall. I doubt if we need to clean up returning lists explicitly also: from my understanding
I read it "as EINTR returns empty lists". No code change is required for the case. |
@asvetlov "unmodified" means "as you provided them", not "cleared". The test I added clearly breaks on my Linux 4.18 machine; see also the test case uploaded in the issue tracker. A code change is definitely needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
poll() and epoll() for example have the same bug. Please also fix them.
Lib/test/eintrdata/eintr_tester.py
Outdated
self.addCleanup(os.close, wr) | ||
# timeout is enough for signal to fire, but not for it to return | ||
timeout = self.signal_delay + self.signal_duration/2 | ||
rlist, _, _ = select.select([rd], [], [], timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use self.sleep_time as the timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then the test doesn't necessarily trigger the bug!
As the comment states, we have to make sure the signal fires before reaching the timeout, but finishes running after reaching it. Any other case, no bug is triggered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, a signal is sent every 100 ms, whereas the sleep is 200 ms. tearDown() ensures that the Python signal handler has been called at least once. See https://bugs.python.org/issue35316 ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I missing something? If timeout is 200ms:
select.select()
is called- Signal handler starts after some time, say 90ms -- it fires before reaching the timeout, which is okay
- Signal handler finishes immediately (or within 50ms, because of the
time.sleep
I've added) - Less than 200ms have passed (140ms) -- bug is not triggered, thus the test case is basically useless!
Again, I need a very fine timing situation - handler starts before reaching the timeout, but finishes only after.
Lib/test/eintrdata/eintr_tester.py
Outdated
# signal delivery periodicity | ||
signal_period = 0.1 | ||
# default sleep time for tests - should obviously have: | ||
# sleep_time > signal_period | ||
sleep_time = 0.2 | ||
|
||
def sighandler(self, signum, frame): | ||
time.sleep(self.signal_duration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this sleep. It seems like a very bad idea to sleep in a thread handler. Moreover, I don't see the purpose of sleeping!?
Lib/test/eintrdata/eintr_tester.py
Outdated
@@ -39,13 +39,16 @@ class EINTRBaseTest(unittest.TestCase): | |||
|
|||
# delay for initial signal delivery | |||
signal_delay = 0.1 | |||
# signal handler run time duration | |||
signal_duration = 0.05 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding both comments: as I've said, for the test case to be valid - we have to make sure the signal handler returns only after the timeout was reached. That's exactly the purpose of sleeping. How do you want me to make sure it happened?
I can maybe use a special parameter initialized just before the select.select()
call which indicates the deadline, and sleep within the signal handler until this deadline is reached. It may be safer for buildbots, but it doesn't save us from sleeping. Honestly, I don't see how you trigger this bug without sleeping inside the signal handler.
Again, just to clarify - the bug only happens if the signal handler fires before the deadline, and returns after. It follows that the signal handler must take significant time.
Ah, I see your confusion. A reliable test triggering your bug cannot be designed. I suggest to only write a test on select.select() getting EINTR... but there is already a test. Aha. Now I'm confused as well. @oranav, @pablogsal: I propose to not add an unit test for the fix. Just fix the bugs, that's it. |
I can devise a reliable test, but it will be somewhat more complex. Right before we call
Reason 1 convinced me that it is a good idea. That being said, I don't mind removing this test entirely; I just want the bug to be fixed. I didn't want the bug to reappear later, so I thought adding a test might be a good idea. @vstinner @pablogsal Would like to hear your opinion. |
When you write "complex", I read "fragile". Signals are not reliable at all, it's just a mess :-) I do maintain all CI, mostyl with @pablogsal. It's a painful work. I don't want to merge a new test which might fail randomly. So I suggest to remove the test, sorry. My review stands. Please fix also other function of the select module. |
Just to be clear - I got you up until the point you said "which might fail randomly". This is entirely incorrect: the test will fail only if a file descriptor that is not readable will be returned from A file descriptor which is not readable should never be returned from Again, I don't have any special reason to insist for this test (as I said, I just want the bug fixed) -- I just believe that you misunderstood and therefore decline it. I will be glad to remove the test to get this merged; I just want it to be for the right reasons :-) |
@oranav The test can fail randomly among other things because signal delivery is not deterministic in all platforms. The documentation of alarm(2) says:
|
@pablogsal I know it's not deterministic. Even if everything's going wrong, which assert call is going to fail in this case? |
@oranav The test may not be testing what it says is testing: The signal can be delayed an arbitrary amount of time and therefore select may not be interrupted by SIGALRM because it arrives late |
@pablogsal I see, thanks! @vstinner You said |
Thanks for making the requested changes! @pablogsal, @vstinner: please review the changes made to this pull request. |
Oh, I see. The bug is not the general concept of not calling again the function in non-blocking mode. The bug is specific to select() and the "fd_set" types. If select() failed with EINTR, file descriptor sets are in an undefined state: likely in a bug state, considering the bug report. So the bug is really specific to select() and how we handle EINTR. I understood that there are two options to fix the bug:
If an attacker process bombs the victim process with signals, is it possible that select() keeps failing with EINTR for signficant time? Say at least 1 second? If it's possible, it would be safer to clear the sets instead. poll, epoll, etc. don't retry in blocking mode. In case of doubt, I would prefer to mimick epoll behavior because this selector is one of the most tested selector: default one used by the selectors module and so by asyncio. So I would prefer to clear sets. What do you think? |
@vstinner The fd-sets are unmodified (they remain in the same state as the user passed them), which is indeed a bug: one of the fds might be unreadable even though it is returned in the rlist. Since all the other functions clear the lists before returning, the bug is not there. Yes, you understood correctly. I don't think that an attacker process could case such a significant time, as it really depends on the signal handler itself; if the signal handler itself is slow, then the attacker process might cause such delays with or without the That being said, I'm more and more convinced that we should take the second approach (as in #10877). It mimics other behaviors in this module, and it also "feels right" to me to return as soon as possible after the deadline is reached. |
Well, that's the normal behavior:
In case of timeout, select.select() returns 3 empty lists. That's it. The fact that it was interrupted shouldn't change its behavior. Note: time.sleep() is implemented internally using select(). So I would expect that time.sleep(1.0) and select.select([fd], [], [], 1.0) would have the same duration even if select.select() is interrupted by many signals. The contract is that the overall operation takes up to 1.0 second, but not more. |
I propose to close this PR in favor of PR #10877 which is IMHO the right fix. |
select() calls are retried on EINTR (per PEP 475). However, if a
timeout was provided and the deadline has passed after running signal
handlers, we should still retry select() with a non-blocking call -- to
make sure rlist, wlist and xlist are initialized appropriately.
https://bugs.python.org/issue35310