
pubsub/awssnssqs: DNS resolution failure in tight receive-ack loop (AWS SQS) #2672


Closed
telendt opened this issue Sep 19, 2019 · 5 comments

@telendt

telendt commented Sep 19, 2019

Describe the bug

A simple program with a receive-print-ack loop, running against an SQS queue with a huge backlog, runs fine for some time and then gets stuck with the following error:

pubsub (code=Unknown): RequestError: send request failed
caused by: Post https://sqs.eu-west-1.amazonaws.com/: dial tcp: lookup sqs.eu-west-1.amazonaws.com: no such host

I'm pretty sure my network is fine, and this fails consistently on my machine.

Now, I haven't spent much time debugging this issue, and I don't blame you for Go's broken DNS resolver (I think that might be specific to the cgo resolver used by default on macOS), but is it possible that you do too much DNS resolving at the same time? A quick look at the code shows that both the max concurrency for receives and the max concurrency for acks are set to 100. That looks like a high number to me. If I change these numbers to something more sane (like 10), the problem disappears.

To Reproduce

// Uses gocloud.dev/pubsub with a blank import of gocloud.dev/pubsub/awssnssqs
// to register the awssqs:// scheme.
sub, err := pubsub.OpenSubscription(ctx, "awssqs://SOME_SQS_URL?region=SQS_REGION")
if err != nil {
	log.Fatal(err)
}
for {
	msg, err := sub.Receive(ctx)
	if err != nil {
		log.Println(err.Error())
		time.Sleep(1 * time.Second)
		continue
	}
	fmt.Println(string(msg.Body))
	msg.Ack()
}

Expected behavior

The program continues to print all messages.

Version

v0.17.0

Additional context

go version go1.13 darwin/amd64
@vangent
Contributor

vangent commented Sep 19, 2019

We've done load testing with 100 concurrent goroutines and it has worked fine, so I'm not sure that's the issue.

Some Googling suggests that the DNS failures may be due to running out of file descriptors. Try running ulimit -n to see your current limit, then set it to something smaller and see if the problem happens faster; i.e., if it is currently 1024, try ulimit -n 256. If you do start seeing the problem sooner than before, try setting a higher ulimit.

https://grokbase.com/t/gg/golang-nuts/14br47sfpj/go-nuts-no-such-host-with-many-get-requests
rclone/rclone#1111

are similar issues with some suggestions for things to try. Hope that helps!
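
If it helps, you can also check the limit the process actually sees from inside Go; a minimal sketch using the standard syscall package (works on macOS and Linux):

package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		fmt.Println("getrlimit:", err)
		return
	}
	// Cur is the soft limit (the one you hit first); Max is the hard limit.
	fmt.Printf("open files: soft=%d hard=%d\n", rl.Cur, rl.Max)
}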

@telendt
Author

telendt commented Sep 20, 2019

It does indeed seem to be related to the file descriptor limit, and I guess it's related to golang/go#18588 (although in this case "dial tcp: XXX: no such host" is not preceded by a "socket: too many open files" message).

The default file descriptor limit on my Mac is 256, and I don't remember ever messing with it, so I assume that's the macOS default:

launchctl limit | grep maxfiles
	maxfiles    256            unlimited

After increasing the descriptor limit for that session to 1024 (which I believe is the default on many Linux systems), it works fine.
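
For reference, the soft limit can also be bumped from inside the program at startup; a minimal sketch using the standard syscall package (raiseFDLimit is a hypothetical helper of my own, not part of any library; soft limits can be raised up to the hard limit without special privileges):

// raiseFDLimit raises the soft open-file limit to at least n,
// capped at the hard limit (which unprivileged processes cannot exceed).
func raiseFDLimit(n uint64) error {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return err
	}
	if rl.Cur >= n {
		return nil // already high enough
	}
	if n > rl.Max {
		n = rl.Max
	}
	rl.Cur = n
	return syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl)
}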

I still think that requiring devs (at least the ones running macOS, which is quite popular) to change this limit just to run such a simple program is a bad user experience.

Did you see significant throughput improvements in your load tests when moving from 10 to 100 goroutines? I'm asking because even with a concurrency level of 10 it runs really fast.

These people ran some benchmarks on SQS (albeit with a JVM client, and using the same thread for receive and delete) and they noticed peak throughput at ~25 threads:
https://softwaremill.com/amazon-sqs-performance-latency/

Would it be possible to reduce the default concurrency levels (and maybe expose a way for users to tune them)?

@vangent
Contributor

vangent commented Sep 20, 2019

Did you see significant throughput improvements in your load tests when moving from 10 to 100 goroutines?

Yes. Since the SQS ReceiveMessage RPC only returns at most 10 messages at a time, and each RPC takes non-trivial time to round-trip, we have to make many concurrent RPCs to get good throughput.

#1657 is stale, but shows a benchmark that at one point demonstrated that 100 concurrent is 5x faster than 10 concurrent (2500 messages/sec vs 500 messages/sec). I'm pretty sure the difference would be even greater now due to performance improvements in the pubsub concrete type code.
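
To make the arithmetic concrete, here's a back-of-envelope upper bound (the 50ms RTT is an assumed number, not a measurement):

// Each ReceiveMessage RPC returns at most 10 messages per round-trip,
// so throughput is bounded by concurrency * 10 / RTT.
rtt := 50 * time.Millisecond // assumed round-trip time
for _, conc := range []int{1, 10, 100} {
	perSec := float64(conc*10) / rtt.Seconds()
	fmt.Printf("concurrency %3d -> at most %.0f msgs/sec\n", conc, perSec)
}

Real numbers land well below these ceilings once ack RPCs and client-side overhead are factored in, but the scaling with concurrency is the point.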

These people ran some benchmarks on SQS

Interesting data point. It's a couple of years old, and uses the Java client library instead of the Go one, so not sure how directly relevant it is.

Would it be possible to reduce the default concurrency levels (and maybe expose to users a way to tune it)?

We have explicitly tried to avoid exposing knobs for tuning performance, to make the library easier to use. It dynamically adjusts to the throughput it sees (i.e., it won't start off making 100 concurrent RPCs to AWS; it only gets there if your message processing throughput warrants it). Here is some data from benchmarks using simulated backends with various characteristics.

Note that because of this, another way to work around the issue (other than increasing ulimit) might be to process messages more slowly. I.e., if you add a time.Sleep(100 * time.Millisecond) to your message processing loop, your throughput will go down, but so will the concurrency to AWS. Not sure if that's an option for you.
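
Concretely, applied to the loop from the original report, that would look like this (a sketch):

for {
	msg, err := sub.Receive(ctx)
	if err != nil {
		log.Println(err.Error())
		time.Sleep(1 * time.Second)
		continue
	}
	fmt.Println(string(msg.Body))
	msg.Ack()
	// Throttling here lowers observed throughput, so the driver ramps
	// up to fewer concurrent ReceiveMessage/DeleteMessage RPCs.
	time.Sleep(100 * time.Millisecond)
}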

@vangent vangent self-assigned this Sep 20, 2019
@telendt
Author

telendt commented Sep 21, 2019

I'm aware that SQS allows receiving at most 10 messages at a time, and that for maximum throughput one needs a lot of concurrent requests due to long RTTs. The heuristic you use (where the concurrency level depends on consumer throughput) seems very clever, but I still think the upper bounds might be a bit too high for many scenarios - 100 concurrent receive and 100 concurrent delete (ack) requests is quite a lot (especially if there's no HTTP/2 multiplexing or HTTP/1.1 pipelining employed).

On macOS the only problem is the crazily low default limit, which one can change with a single command (changing soft limits does not require special privileges), but I can imagine setups where this might not be possible (luckily that's not my use case).

I can't access the spreadsheet you linked to, but from the throughput numbers shared in #1657 I see that you used some beefy nodes - probably with a 20+ Gbit network. This is not what most people use :) E.g. I can imagine someone wanting to write a simple pub/sub client running on a Raspberry Pi...

That said, I don't have any hard facts at hand, and I can easily solve my problem (with a wrapper script that bumps the maxfiles limit), so I'm fine if you decide to close this issue for now.

@vangent
Contributor

vangent commented Sep 21, 2019

Yea, I think we can leave this alone. Users can control the concurrency indirectly as I described above, by controlling the message processing speed explicitly if necessary.

@vangent vangent closed this as completed Sep 21, 2019