pubsub/awssnssqs: DNS resolution failure in tight receive-ack loop (AWS SQS) #2672
Comments
We've done load testing with 100 concurrent goroutines and it has worked fine, so I'm not sure that's the issue. Some Googling implies that the DNS failures may be due to running out of file descriptors. https://grokbase.com/t/gg/golang-nuts/14br47sfpj/go-nuts-no-such-host-with-many-get-requests describes similar issues, with some suggestions for things to try. Hope that helps!
It indeed seems to be related to the file descriptor limit, and I guess it's related to golang/go/issues/18588 (although the "dial tcp: XXX: no such host" error is not preceded by a "socket: too many open files" message in this case). The default file descriptor limit on my Mac is 256, and I don't remember messing with this, so I assume it's the default on macOS.
After increasing the descriptor limit for that session to 1024 (which I believe is the default on many Linux systems) it works fine. I still think that requiring devs (at least the ones running macOS, which is quite popular) to change this limit to run such a simple program is a bad user experience. Did you see significant throughput improvements in your load tests when moving from 10 to 100 goroutines? I'm asking because even with concurrency level = 10 it runs really fast. These people ran some benchmarks on SQS (albeit with a JVM client, and they use the same thread for …). Would it be possible to reduce the default concurrency levels (and maybe expose a way for users to tune them)?
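For anyone hitting the same wall, here is a minimal sketch (my illustration, not from this thread) of raising the soft file descriptor limit from inside the process instead of via `ulimit -n`; it assumes the hard limit is already at least 1024, which is typically the case on macOS and Linux:

```go
package main

import (
	"fmt"
	"log"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("before: soft=%d hard=%d\n", rl.Cur, rl.Max)

	// Raising the soft limit up to the hard limit needs no special privileges.
	if rl.Cur < 1024 {
		rl.Cur = 1024
		if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
			log.Fatal(err) // fails if the hard limit is below the requested value
		}
	}
	fmt.Printf("after: soft=%d hard=%d\n", rl.Cur, rl.Max)
}
```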
Yes. Since the SQS ReceiveMessage RPC only returns at most 10 messages at a time, and each RPC takes non-trivial time to round-trip, we have to make many concurrent RPCs to get good throughput. #1657 is stale, but shows a benchmark that at one point demonstrated that 100 concurrent is 5x faster than 10 concurrent (2500 messages/sec vs 500 messages/sec). I'm pretty sure the difference would be even greater now due to performance improvements in the …
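As a rough back-of-envelope illustration (my numbers, not from the thread): if each ReceiveMessage round trip takes about 100 ms and returns the maximum of 10 messages, a single receive loop tops out around 10 / 0.1 s = 100 messages/sec, so 10 concurrent loops cap near 1,000 messages/sec and 100 loops near 10,000 messages/sec before other limits kick in.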
Interesting data point. It's a couple of years old, and uses the Java client library instead of the Go one, so not sure how directly relevant it is.
We have explicitly tried to avoid exposing knobs for tuning performance, to make the library easier to use. It dynamically adjusts to the throughput it sees (i.e., it won't start off making 100 concurrent RPCs to AWS; it will only get there if your message processing throughput warrants it). Here are some data from benchmarks using simulated backends with various characteristics. Note that because of this, another way to work around this issue (other than increasing the file descriptor limit) is to slow down your message processing so the concurrency never ramps up that high.
I'm aware that SQS allows receiving at most 10 messages at a time, and that for maximum throughput one needs a lot of concurrent requests due to long RTTs. The heuristic you use (where the concurrency level depends on "consumer throughput") seems very clever, but I still think the upper bounds might be a bit too high for many scenarios: 100 concurrent receive and 100 concurrent delete (ack) requests is quite a lot (especially if there's no HTTP/2 multiplexing or HTTP/1.1 pipelining employed). On macOS the only problem is the crazily low default limit, which one can change with a single command (changing soft limits does not require special privileges), but I can imagine setups where this might not be possible (luckily that's not my use case). I can't access the spreadsheet you linked to, but from the throughput numbers shared in #1657 I see that you used some beefy nodes, probably with 20+ Gb networking. This is not what most people use :) E.g. I can imagine someone wanting to write a simple pub/sub client running on a Raspberry Pi... That said, I don't have any hard facts at hand and I can easily solve my problem (with a wrapper script that bumps the maxfiles limit), so I'm fine if you decide to close this issue for now.
Yeah, I think we can leave this alone. Users can control the concurrency indirectly, as I described above, by controlling the message processing speed explicitly if necessary.
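To make that workaround concrete, here is a minimal sketch (my illustration, not code from this thread) of a deliberately throttled receive loop; because the library only ramps up its RPC concurrency when message processing keeps up, adding a small per-message delay keeps the number of concurrent requests, and thus open connections, low:

```go
// Package example sketches the indirect throttling workaround described above.
package example

import (
	"context"
	"log"
	"time"

	"gocloud.dev/pubsub"
)

// ThrottledLoop is a hypothetical helper: the per-message delay bounds
// processing throughput, so the subscription never scales up to its maximum
// receive/ack concurrency (and hence never opens that many connections).
func ThrottledLoop(ctx context.Context, sub *pubsub.Subscription, delay time.Duration) {
	for {
		msg, err := sub.Receive(ctx)
		if err != nil {
			log.Printf("receive: %v", err)
			return
		}
		// ... handle msg.Body here ...
		msg.Ack()
		time.Sleep(delay) // e.g. a few milliseconds, tuned to taste
	}
}
```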
Describe the bug
A simple program with a receive-print-ack loop, running against an SQS queue with a huge backlog, runs fine for some time, after which it gets stuck with DNS resolution errors ("dial tcp: ...: no such host").
I'm pretty sure that my network is fine, and this fails consistently on my machine.
Now, I haven't spent too much time debugging this issue, and I don't blame you for the broken DNS resolver in Go (I think that might be specific to the `cgo` resolver used by default on macOS), but is it maybe possible that you do too much DNS resolving at the same time? A quick look into the code shows me that you have the max concurrency for receives and the max concurrency for acks both set to 100. That looks like a high number to me. If I change these numbers to something more sane (like 10), the problem disappears.

To Reproduce
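Since the original snippet didn't survive here, this is only a rough sketch of the kind of receive-print-ack loop described above; the queue URL is a placeholder, and the exact awssqs:// URL format depends on the module version:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"gocloud.dev/pubsub"
	_ "gocloud.dev/pubsub/awssnssqs" // registers the awssqs:// URL scheme
)

func main() {
	ctx := context.Background()
	// Placeholder queue URL; substitute your own region/account/queue.
	sub, err := pubsub.OpenSubscription(ctx,
		"awssqs://sqs.us-east-1.amazonaws.com/123456789012/myqueue?region=us-east-1")
	if err != nil {
		log.Fatal(err)
	}
	defer sub.Shutdown(ctx)

	// Tight receive-print-ack loop over a queue with a large backlog.
	for {
		msg, err := sub.Receive(ctx)
		if err != nil {
			log.Fatal(err) // eventually fails with "no such host" as described
		}
		fmt.Printf("%s\n", msg.Body)
		msg.Ack()
	}
}
```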
Expected behavior
Program continues to print all messages
Version
v0.17.0
Additional context