-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Fix high cpu usage caused by fd leak #1581
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
We found a problem of high CPU usage of the supervisor. This problem is caused by continuous polling of a wrong fd in the main loop of the supervisor. Busy polling leads to a CPU usage close to 100%. (We can confirm this problem through the strace tool) This issue can be reproduced by: 1. Continuously initiate arbitrary requests to supervisor through supervisorctl 2. After the socket fd is closed, trigger the supervisor's subprocess to rotate the log (or reopen the file) 3. If the above steps are completed within a single main loop of the supervisor, the problem can be triggered The reason for the problem is that supervisor relies on using _ignore_invalid() in the main loop to close fds. This method has a flaw that if fd is reused before _ignore_invalid() is called, then the fd may always exist in the fd list of poll . This commit fixes the problem. By checking the validity of the fd in the event list in the main loop, if the fd is not in the combined_map, it is considered to be an invalid fd and will be removed from the list.
39744bf
to
9ed8069
Compare
Thanks. This looks promising, and I've deployed it a few places to test. The warnings you added my be as important as the fix itself - if we had that 3 years ago, the bug probably would've been a lot more obvious. Are there any other warnings that should be added ? For example, what about when an FD that isn't a pipe ends up on the list of FDs to be polled, as I've also seen. I suspect there are more bugs with FDs, which may be rare and hard to hit, but it'll be amply easy to address them if logs are added to warn about inconsistencies. I didn't understand what did you meant when you said "trigger the supervisor's subprocess to rotate the log" ? Do you mean by connecting to that process and causing it to write an adequately large logs to stdout ? Is it possible to consistently make the bug easier to hit by injecting a "sleep" command ? |
Yes, more warnings is needed, but what important is we need to unregister the FD from polling list after each event is handled (if it's need to), instead of unregistering FD by _ignore_invalid().
Yes, but the faster method is to use a shared variable or signal to reproduce:
|
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
I've deployed this change to customers and saw no issues since last month. |
Hello, I deployed also this change and I confirm it fixes the issue. Do you know when this will be merged ? |
Encountered similar issue where supervisord is taking high CPU because it is busy processing POLLERR on pipe fd for which the other end of the pipe does not exist. Running version 4.0.3. Question: 1. Not sure if this fix will help here as we don't handle POLLERR as part of |
We found a problem of high CPU usage of the supervisor. We believe it's same reason for #807 . This problem is caused by continuous polling of a wrong fd in the main loop of the supervisor. Busy polling leads to a CPU usage close to 100%. (We can confirm this problem through the strace tool)
This issue can be reproduced by:
The reason for the problem is that supervisor relies on using _ignore_invalid() in the main loop to close fds. This method has a flaw that if fd is reused before _ignore_invalid() is called, then the fd may always exist in the fd list of poll .
This commit fixes the problem. By checking the validity of the fd in the event list in the main loop, if the fd is not in the combined_map, it is considered to be an invalid fd and will be removed from the list.