
Supervisord crashes when over 1023 files are open (even with ulimit set) #26


Closed
shimon opened this issue Jul 7, 2011 · 27 comments

Comments

@shimon

shimon commented Jul 7, 2011

Supervisord uses select.select to monitor the file descriptors of the processes it supervises. This is problematic because select.select raises a ValueError for file descriptors numbered above 1023. (Observed with supervisor 3.0a8 on an Ubuntu GNU/Linux 11.04 amd64 machine.)

We ran into this problem when running approximately 254 supervised processes. Initially, we assumed it was a ulimit configuration problem, but found that the crash occurred even when running supervisord in non-daemon mode. I've been able to reproduce the stacktrace by supervising a large number of /bin/cat processes, and have included it below. Here's a conf file to run 1100 cats:
https://gist.github.com/1068713

To reproduce this bug, just install that config file and run something like:
sudo bash -c "ulimit -n 10000; supervisord -n"
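
For readers who cannot open the gist, a config of the same general shape could be generated with a short script. This is only a sketch and not the gist's actual contents: the program names (cat0, cat1, ...) and the function name make_cat_config are made up for illustration, and only the standard [supervisord] and [program:x] sections with a command= line are assumed.

```python
def make_cat_config(count=1100):
    """Build a supervisord config that supervises `count` /bin/cat processes.

    Section names like "cat0" are hypothetical; any unique names would do.
    """
    lines = ["[supervisord]", ""]
    for i in range(count):
        lines.append("[program:cat%d]" % i)
        lines.append("command=/bin/cat")
        lines.append("")
    return "\n".join(lines)


config = make_cat_config()
# Count the generated program sections as a quick sanity check.
print(config.count("[program:"))
```

Writing the returned text to a file and pointing supervisord at it with -c should give a comparable test setup.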

You'll see a ValueError (out of range) from select.select(), called from supervisord's runforever():
https://github.com/Supervisor/supervisor/blob/master/supervisor/supervisord.py#L218

It appears this is a limitation of Python's select() function, which raises a ValueError on file descriptors > 1023. I've seen some suggestions that beyond this limit, one should use poll() instead of select(), but I'm not an expert.

FULL TRACEBACK:

Traceback (most recent call last):
  File "/usr/bin/supervisord", line 9, in <module>
    load_entry_point('supervisor==3.0a8', 'console_scripts', 'supervisord')()
  File "/usr/lib/pymodules/python2.7/supervisor/supervisord.py", line 371, in main
    go(options)
  File "/usr/lib/pymodules/python2.7/supervisor/supervisord.py", line 381, in go
    d.main()
  File "/usr/lib/pymodules/python2.7/supervisor/supervisord.py", line 94, in main
    self.run()
  File "/usr/lib/pymodules/python2.7/supervisor/supervisord.py", line 111, in run
    self.runforever()
  File "/usr/lib/pymodules/python2.7/supervisor/supervisord.py", line 229, in runforever
    r, w, x = self.options.select(r, w, x, timeout)
  File "/usr/lib/pymodules/python2.7/supervisor/options.py", line 1097, in select
    return select.select(r, w, x, timeout)
ValueError: filedescriptor out of range in select()
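
The limit can be seen without supervisor at all. The sketch below assumes a typical Linux Python build where FD_SETSIZE is 1024, and written for Python 3; the helper name select_rejects is made up for illustration. It shows that select.select() rejects a descriptor by number alone, before it ever checks whether the descriptor is open, which is why raising ulimit does not help.

```python
import select


def select_rejects(fd):
    """Return True if select.select() refuses `fd` because of its number alone."""
    try:
        select.select([fd], [], [], 0)
    except ValueError:
        # Raised by Python's argument check: fd is outside the fd_set range.
        return True
    except OSError:
        # The number is in range, but the descriptor is not open (EBADF).
        return False
    return False


print(select_rejects(1024))
```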

@chuckadams

I threw together a quick hack that emulates select() with poll(). I don't recommend trusting your production boxes with my skin-deep knowledge of poll(), but it does withstand the cat-bomb as well as pass all tests.

https://github.com/sproingie/supervisor/commit/2d3e753b8bca6c39eca3840290b5425f583fb0db

@chuckadams

The above hack improperly handled POLLPRI; it's fixed in the head of my fork:

https://github.com/sproingie/supervisor/compare/d5aa987d26786c46ee5397b76c6a39afd84c9d0b...sproingie:master

@shimon
Author

shimon commented Jul 9, 2011

Thanks for the quick attention, sproingie!

@chuckadams

Since renaming my repo (owing to some incompatible changes I'm making), I can't seem to nail down that changeset anymore, but if you're trying it, you'll want to grab the code out of the head revision. I recall making at least one extra change, namely multiplying the timeout by 1000. Turns out that select.select specifies the timeout in seconds, whereas for select.poll it's in milliseconds. Oops. The tiny timeout caused a lot of spinning and probably even some livelock.

It's still a hack job, since the proper design would be to keep the poll object around persistently and not try to make it emulate the statelessness of select().
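
For anyone following along, the general idea of emulating select() with poll() might look roughly like the sketch below. This is not the code from the fork; the function name select_via_poll and the choice of event masks are assumptions for illustration. It includes the seconds-to-milliseconds conversion mentioned above and, like the hack, rebuilds the poll object on every call rather than keeping it around.

```python
import os
import select


def select_via_poll(rlist, wlist, timeout):
    """Rough stand-in for select.select(rlist, wlist, [], timeout) using poll()."""
    masks = {}
    for fd in rlist:
        masks[fd] = masks.get(fd, 0) | select.POLLIN | select.POLLPRI
    for fd in wlist:
        masks[fd] = masks.get(fd, 0) | select.POLLOUT

    poller = select.poll()
    for fd, mask in masks.items():
        poller.register(fd, mask)

    # select() takes seconds; poll() takes milliseconds (None blocks forever).
    timeout_ms = None if timeout is None else int(timeout * 1000)

    readable, writable = [], []
    for fd, event in poller.poll(timeout_ms):
        # POLLHUP is reported even if not requested; treat it as readable
        # so the caller notices end-of-file on its next read.
        if fd in rlist and event & (select.POLLIN | select.POLLPRI | select.POLLHUP):
            readable.append(fd)
        if fd in wlist and event & select.POLLOUT:
            writable.append(fd)
    return readable, writable


r, w = os.pipe()
os.write(w, b"x")
print(select_via_poll([r], [w], 0.1))
```

Unlike select(), poll() has no fixed upper bound on descriptor numbers, so the same call keeps working past fd 1023.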

@mcdonc
Member

mcdonc commented Sep 2, 2011

FWIW, I think there is a way to compile a Python (involving some FD_SETSIZE hackery IIRC) that allows for more file descriptors to be accessible by select(). Googling doesn't lead to any obvious URLs however.

@mikekhristo

Running into the error in the initial post. Any advice on how to deal with it?

@mcdonc
Member

mcdonc commented Sep 7, 2011

Currently the only workaround is to compile a Python that supports > 1024 file descriptors and run supervisor under that.

@mikekhristo

Any idea how to do that? Google hasn't been helpful. I have the python 2.7.2 source extracted and ready to go.

@mcdonc
Member

mcdonc commented Sep 7, 2011

Nope. As I said in a previous entry, I could not find a suitable Google entry. Likely have to either ask on python newsgroup or stackoverflow.

@mikekhristo

@timbaileyjones

Shouldn't supervisor just switch from select.select to select.poll? By my math (5 fds per child), this restricts supervisor to about 204 processes (1024 / 5), actually fewer if you subtract stdin/stdout/stderr, listeners for rpc/http, and whatever shlibs python has open. So maybe 200 or 201.

For the time being, we are probably going to cope with this by running two instances of supervisord and splitting our workload among them.
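
The back-of-the-envelope estimate above can be written out as a small calculation. The figure of 5 fds per child comes from the comment; the reserved count is a guess that will vary by setup, and the function name max_children is made up for illustration.

```python
def max_children(fd_limit=1024, fds_per_child=5, reserved=3):
    """Estimate how many children fit before an fd number reaches fd_limit.

    `reserved` covers descriptors supervisord holds for itself (stdin/stdout/
    stderr at minimum; log files and rpc/http listeners would add more).
    """
    return (fd_limit - reserved) // fds_per_child


print(max_children())             # only stdin/stdout/stderr reserved
print(max_children(reserved=20))  # a guess at extra listeners and open files
```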

@timbaileyjones

I am taking a whack at fixing this myself, since Chuck Adams can't seem to find his change set: https://github.com/linuxtampa/supervisor

@kevin1024

We just ran into this issue in production today. Not sure what our interim solution will be. Running two supervisors would be awkward.

@mcdonc
Member

mcdonc commented Jan 14, 2012

I looked into maybe trying to implement the mainloop in terms of select.poll, but it doesn't appear to work on Mac OS X, or at least the out-of-the-box Python builds on Mac OS X don't support it:

http://bugs.python.org/issue5154

Bleh.

@timbaileyjones

I tried making that change too... it seemed to work for the first day, but then it got all sluggish and eventually got stuck. I don't really know what I'm doing wrong, but here's my fork.

https://github.com/linuxtampa/supervisor

tlj


@igorsobreira
Contributor

Hi, I've started working on replacing select() with poll(); my fork is at: https://github.com/igorsobreira/supervisor/commits/master
It kind of works now, but there is a lot left to do; this first commit is just an attempt to understand the solution...

  • I need to verify what the correct bitmask is when registering a file descriptor with poll()
  • Most of the tests pass (I've only run python setup.py test, not tox yet); there are just 3 failures
  • As @mcdonc pointed out, OS X's Python doesn't have select.poll(), so I plan to use select.kqueue in that case. For this I will detect the OS and move the poll() call to self.options (as is done with select() now), which will use the correct one based on the OS
  • There is an error being raised when the process starts: Cannot allocate memory. This is probably because it's trying to read the fd while the supervisor dispatcher says the process is STARTING. It should not be a big problem though

I would love some feedback, and please let me know if I'm on the wrong track.
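
The platform selection described in the list above could also be done by feature detection rather than by checking the OS name. The following is only a sketch of that idea, not the code in the fork; the function name pick_poller_name is hypothetical.

```python
import select


def pick_poller_name():
    """Name the best readiness API this Python build exposes.

    Checks for attributes on the select module instead of inspecting the OS,
    so a build that lacks poll() falls through to kqueue or plain select.
    """
    if hasattr(select, "poll"):
        return "poll"
    if hasattr(select, "kqueue"):
        return "kqueue"
    return "select"


print(pick_poller_name())
```

Checking for the attribute avoids hard-coding which operating systems ship which call, at the cost of not knowing whether an available poll() actually behaves correctly on that platform.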

@igorsobreira
Contributor

@mnaberez
Member

See also #145 which sounds like it is also caused by this issue.

@weissi

weissi commented Jan 10, 2013

What is the status with this issue? @igorsobreira what about your pull request?

@igorsobreira
Contributor

@weissi my pull request has a working solution; I mean there are no more features I had in mind that were needed. But two issues were reported on the pull request (maybe they're the same; see the comments), and I haven't had time to dive into those yet. I plan to investigate this hands-on on Linux this weekend.

Anyway, needs more testing, and maybe an update to supervisor master.

@weissi

weissi commented Jan 10, 2013

Cool, thank you!

@spleeyah

spleeyah commented Apr 5, 2013

Any possible hope of this being addressed? :(

@jeff-minard-ck

Our organization is hitting this same bug too. It's a pretty big deal.

@sandra1n

Same bug. Thanks to @igorsobreira, his version works fine for me.

@frankmayer

Just installed 3.0 and hitting this issue. Is there a plan to resolve this?

@akimicyu

akimicyu commented Jan 9, 2014

I hit the same bug. Thanks to @igorsobreira; your work is very cool.

@mnaberez
Member

Fixed in 9e6aa44 (PR #129).
