net: DialTimeout causes thread exhaustion #5625
Comments
Comment 2 by [email protected]: You're right. If I catch the errors, this actually runs. The problem is that we did find a bug in production, but it seems to be a lot harder to pin down; this was just my first try at capturing it. The actual production dump is attached. We make a lot of calls to many different HTTP APIs from customers. Some of them are HTTPS, and some have really long timeouts and connection interruptions. We had no problem with 1.0.3, but with 1.1 it does crash. (And yes, in production we do not ignore errors; this was just me trying to write a quick demo hack.) |
Comment 3 by [email protected]: I believe this could be related to https://code.google.com/p/go/source/detail?r=d4e1ec84876c. I will try to isolate the issue from our production environment and deliver something reproducible. |
Comment 5 by [email protected]: Pretty much the same thing: it works normally for a while, then suddenly stops making any calls, and some moments later takes huge amounts of RAM and gets killed. We saw that the fewer http.Client instances we used, the faster the bug would occur. So far we are still trying to build a reproducible case. We can reliably crash it in production, but not yet with a test. |
@paul. The reason your code crashed was that there were over 500 threads doing DNS lookups. Because Go uses the host's DNS resolver by default on Linux, each of these lookups consumes a thread, as cgo always runs on an operating system thread, not a goroutine. This exhausted the number of threads allowed by the ulimit of the user you were running as. I would *not* recommend raising this limit as a stopgap; it will only make things worse, as the system resolver will fail if you have more than 1024 FDs open in your program. Is the code to this custom transport available? The relevant frames from your dump:
/usr/lib/go/src/pkg/net/dial.go:138 +0xa5
github.com/adeven/adjust_backend/callback_worker.dialWithTimeout(0x76e010, 0x3, 0xc263588b80, 0x14, 0x0, ...)
/home/callback_worker/app/releases/20130525213259/callback_worker/run/.go/src/github.com/adeven/adjust_backend/callback_worker/callback_consumer.go:96 +0x196
net/http.(*Transport).dial(0xc2002d4f80, 0x76e010, 0x3, 0xc263588b80, 0x14, ...)
/usr/lib/go/src/pkg/net/http/transport.go:382 +0x87 |
Comment 7 by [email protected]: Hi, the dial is simply:
    func dialWithTimeout(network, address string) (connection net.Conn, err error) {
        connection, err = net.DialTimeout(network, address, requestTimeout)
        return
    }
where the timeout (requestTimeout) is 5s. We iterated through a lot of different versions of this code, as we thought the problem was in our code. In the end even http.Get crashed, so we downgraded again. |
Comment 10 by [email protected]: Around 100-300/sec. Theoretically the address should be cacheable, as we usually query fewer than 10 hostnames. |
I think you are going to have to cache the address. If you read the comment inside net/lookup.go, there is no throttling and each lookup occupies its own thread, so if there is even a tiny blip in your DNS server, or the name is slow to resolve, you'll blow your thread limit quickly. Alternatively you could:
a. compile with cgo disabled (this is my recommendation), or
b. switch to github.com/miekg/dns for DNS resolution, then pass a *net.TCPAddr to net.DialTCP.
Marking as accepted, but leaving priority as triage for now; I do not know the correct priority to apply to this. Status changed to Accepted. |
Comment 12 by [email protected]: OK, so this may be a stupid question, but how does one disable cgo? And does this have any other downsides? |
CGO_ENABLED=0 ./all.bash will build a version of Go with cgo disabled. The drawback is that parts of the standard library that use cgo will be disabled, or will work with reduced or altered functionality. For example, the net package will switch to a native Go DNS resolver, which shouldn't be a problem unless you need more esoteric resolvers (e.g., nsswitch). The os/user package will also stop working. Obviously you will not be able to use packages that require cgo. |
Comment 14 by [email protected]: OK, thanks a lot so far. We'll see which solution is best for us now. But why exactly does Go 1.0.3 work like a charm while 1.1 has this issue? |
Looking at the implementation in 1.0.3, I cannot see why you did not see the same problem. Possibly there is some additional blocking in 1.0.x which slowed your accept or dial rate down. You could try hitting your program with SIGQUIT under the same load and counting the number of goroutines waiting in [syscall]. |
We should at least have a cache of in-flight lookups, so that 100 simultaneous dials of one host name don't do the work 100x. That's easy and (assuming we forget the answer once they all get it) doesn't pose any consistency problems. It just merges simultaneous work. The rest is issue #4056. Labels changed: added priority-later, go1.2, removed priority-triage. |
This issue was updated by revision 61d3b2d. R=golang-dev, iant, cespare, rsc, dave, rogpeppe, remyoudompheng CC=golang-dev https://golang.org/cl/10079043 |
This issue was closed by revision 1d3efd6. Status changed to Fixed. |
by [email protected]: