Fix high cpu usage caused by fd leak #1581

aftersnow · 2023-03-14T08:38:22Z

We found that supervisor can show high CPU usage, and we believe it has the same cause as #807. The main loop of supervisor keeps polling a wrong fd, and this busy polling drives CPU usage close to 100%. (This can be confirmed with the strace tool.)

This issue can be reproduced by:

1. Continuously initiate arbitrary requests to supervisor through supervisorctl.
2. After the socket fd is closed, trigger the supervisor's subprocess to rotate the log (or reopen the file).
3. If the above steps are completed within a single iteration of supervisor's main loop, the problem is triggered.

The reason for the problem is that supervisor relies on _ignore_invalid() in the main loop to close fds. The flaw in this approach is that if an fd number is reused before _ignore_invalid() is called, the fd may stay in poll's fd list forever.
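The fd reuse described above can be shown outside supervisor. The following is a minimal sketch, not supervisor code: stale_registration_demo is a hypothetical name, and it assumes a POSIX system where select.poll is available and no other thread opens files at the same time.

```python
import os
import select
import socket


def stale_registration_demo():
    """Register a socket fd, close it, then let a new file reuse the number."""
    poller = select.poll()
    left, right = socket.socketpair()
    sock_fd = left.fileno()
    poller.register(sock_fd, select.POLLIN)

    # Close the socket without unregistering it, as happens when cleanup is
    # deferred until poll reports the fd as invalid.
    left.close()

    # POSIX hands out the lowest free descriptor, so this open() normally
    # receives the number the socket just released.
    new_fd = os.open(os.devnull, os.O_RDONLY)
    try:
        # The stale registration now refers to the new file, so poll has no
        # reason to report POLLNVAL for it.
        events = poller.poll(0)
    finally:
        os.close(new_fd)
        right.close()
    return sock_fd, new_fd, events
```

Because the number is valid again by the time poll runs, a cleanup path that only reacts to POLLNVAL never sees this registration as invalid.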

This commit fixes the problem by checking the validity of each fd in the event list in the main loop: if an fd is not in combined_map, it is considered invalid and is removed from the list.
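The idea of the check can be sketched roughly as follows. This is an illustration under assumptions, not the code from the commit: FakePoller and drop_unknown_fds are hypothetical names, and combined_map is modelled as a plain dict mapping fds to dispatchers.

```python
class FakePoller:
    """Hypothetical stand-in for a poller; only tracks registered fds."""

    def __init__(self, fds):
        self.registered = set(fds)

    def unregister(self, fd):
        self.registered.discard(fd)


def drop_unknown_fds(ready_fds, combined_map, poller, warn=print):
    """Return the ready fds that have a dispatcher; unregister the rest."""
    known = []
    for fd in ready_fds:
        if fd in combined_map:
            known.append(fd)
        else:
            # No dispatcher owns this fd, so nothing would ever consume the
            # event. Warn and stop polling it.
            warn("unexpected event from fd %r; unregistering" % fd)
            poller.unregister(fd)
    return known
```

For example, with fds 3, 4 and 7 registered and only 3 and 4 present in the map, drop_unknown_fds([3, 7], {3: "a", 4: "b"}, poller) keeps fd 3 and unregisters fd 7.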


justinpryzby · 2023-03-14T20:41:21Z

Thanks. This looks promising, and I've deployed it a few places to test.

The warnings you added may be as important as the fix itself. If we had had them 3 years ago, the bug would probably have been a lot more obvious. Are there any other warnings that should be added? For example, what about when an FD that isn't a pipe ends up on the list of FDs to be polled, as I've also seen? I suspect there are more bugs with FDs, which may be rare and hard to hit, but they will be much easier to address if logs are added to warn about inconsistencies.

I didn't understand what you meant by "trigger the supervisor's subprocess to rotate the log". Do you mean connecting to that process and causing it to write a sufficiently large amount of log output to stdout? Is it possible to make the bug consistently easier to hit by injecting a "sleep" command?

aftersnow · 2023-03-15T06:11:26Z

> Thanks. This looks promising, and I've deployed it a few places to test.
>
> The warnings you added may be as important as the fix itself. If we had had them 3 years ago, the bug would probably have been a lot more obvious. Are there any other warnings that should be added? For example, what about when an FD that isn't a pipe ends up on the list of FDs to be polled, as I've also seen? I suspect there are more bugs with FDs, which may be rare and hard to hit, but they will be much easier to address if logs are added to warn about inconsistencies.

Yes, more warnings are needed, but what matters most is that we unregister the FD from the polling list after each event is handled (if needed), instead of relying on _ignore_invalid() to unregister it.
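A rough sketch of what eager unregistration could look like, assuming a POSIX system with select.poll. PolledFd is a hypothetical wrapper invented for illustration and is not part of supervisor:

```python
import os
import select


class PolledFd:
    """Hypothetical wrapper pairing an fd with the poller that watches it."""

    def __init__(self, poller, fd, eventmask=select.POLLIN):
        self.poller = poller
        self.fd = fd
        poller.register(fd, eventmask)

    def close(self):
        """Unregister and close together so no stale registration remains."""
        if self.fd is None:
            return  # already closed
        # Unregister first, while the number still refers to this file.
        self.poller.unregister(self.fd)
        os.close(self.fd)
        self.fd = None
```

With this ordering, a file opened later that reuses the same number is not polled by accident.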

> I didn't understand what you meant by "trigger the supervisor's subprocess to rotate the log". Do you mean connecting to that process and causing it to write a sufficiently large amount of log output to stdout? Is it possible to make the bug consistently easier to hit by injecting a "sleep" command?

Yes, but the faster method is to use a shared variable or signal to reproduce:

1. When the supervisorctl socket FD is closed, change the variable.
2. When handling POutputDispatcher's read event, the variable tells the logger to rotate the log file. This causes supervisor to reopen the file and get a new FD, which must be the same number as the closed socket FD.
3. The issue is reproduced.

justinpryzby · 2023-04-18T21:05:12Z

I deployed this change to customers last month and have seen no issues since.
Thanks to @aftersnow for diagnosing the issue.

Tomo59 · 2023-05-23T08:03:20Z

Hello, I also deployed this change and I confirm it fixes the issue.

Do you know when this will be merged?

mandaramle · 2025-01-13T22:26:57Z

We encountered a similar issue where supervisord uses high CPU because it is busy processing POLLERR on a pipe fd whose other end no longer exists. We are running version 4.0.3.

Questions:

1. I'm not sure this fix will help here, since we don't handle POLLERR in def _ignore_invalid(self, fd, eventmask):. Can we add logic there to handle POLLERR, for example: if (eventmask & select.POLLNVAL) or (eventmask & select.POLLERR):
2. Another symptom is that restarting one of the processes on our system resolves the issue, which suggests that some fd cleanup after a supervisor child process restarts kicks in and cleans up that fd. Also, the issue only happens after 50+ days on our system.
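For reference, the POLLERR condition on a pipe can be observed in isolation. This is a minimal sketch assuming Linux behaviour, where the write end of a pipe with no remaining reader is reported with POLLERR. Both function names are hypothetical, and is_error_event only illustrates the mask check proposed above, not existing supervisor behaviour.

```python
import os
import select


def poll_write_end_without_reader():
    """Poll the write end of a pipe whose read end is already closed."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)
    poller = select.poll()
    poller.register(write_fd, select.POLLOUT)
    try:
        # poll reports POLLERR even though only POLLOUT was requested.
        events = poller.poll(0)
    finally:
        os.close(write_fd)
    return events


def is_error_event(eventmask):
    """Treat POLLERR like POLLNVAL when deciding whether to drop an fd."""
    return bool(eventmask & (select.POLLNVAL | select.POLLERR))
```

If such an fd stays registered and nothing handles the event, every poll call returns immediately, which matches the busy loop described here.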

aftersnow force-pushed the fix-high-cpu-usage branch from 39744bf to 9ed8069 Compare March 14, 2023 08:41

mnaberez mentioned this pull request Mar 14, 2023

100% CPU usage (maybe caused by new poll implementation?) (help wanted) #807

Open

mnaberez mentioned this pull request Mar 27, 2023

Adding epoll Poller #1516

Closed

mnaberez merged commit 2a93d6b into Supervisor:main May 27, 2023
