
pubsub/awssnssqs: DNS resolution failure in tight receive-ack loop (AWS SQS) #2672


Closed
telendt opened this issue Sep 19, 2019 · 5 comments

@telendt

telendt commented Sep 19, 2019

Describe the bug

A simple program with a receive-print-ack loop, running against an SQS queue with a huge backlog, runs fine for some time and then gets stuck with the following error:

pubsub (code=Unknown): RequestError: send request failed
caused by: Post https://sqs.eu-west-1.amazonaws.com/: dial tcp: lookup sqs.eu-west-1.amazonaws.com: no such host

I'm pretty sure my network is fine, and this fails consistently on my machine.

Now, I haven't spent much time debugging this issue, and I don't blame you for Go's broken DNS resolver (I think that might be specific to the cgo resolver used by default on macOS), but is it possible that you do too much DNS resolving at the same time? A quick look at the code shows that both the max concurrency for receives and the max concurrency for acks are set to 100. That looks like a high number to me. If I change these numbers to something more sane (like 10), the problem disappears.

To Reproduce

// Uses gocloud.dev/pubsub with a blank import of gocloud.dev/pubsub/awssnssqs
// to register the awssqs:// scheme.
sub, err := pubsub.OpenSubscription(ctx, "awssqs://SOME_SQS_URL?region=SQS_REGION")
if err != nil {
	log.Fatal(err)
}
for {
	msg, err := sub.Receive(ctx)
	if err != nil {
		log.Println(err.Error())
		time.Sleep(1 * time.Second)
		continue
	}
	fmt.Println(string(msg.Body))
	msg.Ack()
}

Expected behavior

The program continues to print all messages.

Version

v0.17.0

Additional context

go version go1.13 darwin/amd64
@vangent
Contributor

vangent commented Sep 19, 2019

We've done load testing with 100 concurrent goroutines and it has worked fine, so I'm not sure that's the issue.

Some Googling suggests that the DNS failures may be due to running out of file descriptors. Try running ulimit -n to see your current limit, then set it to something smaller and see if the problem happens faster; i.e., if it is currently 1024, try ulimit -n 256. If you do start seeing the problem sooner than before, try setting a higher ulimit.

https://grokbase.com/t/gg/golang-nuts/14br47sfpj/go-nuts-no-such-host-with-many-get-requests
rclone/rclone#1111

are similar issues with some suggestions for things to try. Hope that helps!
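
If it helps, you can also check the limit the process actually sees from inside Go; a minimal sketch using the standard syscall package (works on macOS and Linux):

package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		fmt.Println("getrlimit:", err)
		return
	}
	// Cur is the soft limit (the one you hit first); Max is the hard limit.
	fmt.Printf("open files: soft=%d hard=%d\n", rl.Cur, rl.Max)
}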

@telendt
Author

telendt commented Sep 20, 2019

It does indeed seem to be related to the file descriptor limit, and I guess it's related to golang/go#18588 (although in this case "dial tcp: XXX: no such host" is not preceded by a "socket: too many open files" message).

The default file descriptor limit on my Mac is 256, and I don't remember ever messing with it, so I assume that's the macOS default:

launchctl limit | grep maxfiles
	maxfiles    256            unlimited

After increasing the descriptor limit for that session to 1024 (which I believe is the default on many Linux systems), it works fine.
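
For reference, the soft limit can also be bumped from inside the program at startup; a minimal sketch using the standard syscall package (raiseFDLimit is a hypothetical helper of my own, not part of any library; soft limits can be raised up to the hard limit without special privileges):

// raiseFDLimit raises the soft open-file limit to at least n,
// capped at the hard limit (which unprivileged processes cannot exceed).
func raiseFDLimit(n uint64) error {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return err
	}
	if rl.Cur >= n {
		return nil // already high enough
	}
	if n > rl.Max {
		n = rl.Max
	}
	rl.Cur = n
	return syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl)
}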

I still think that requiring devs (at least the ones running macOS, which is quite popular) to change this limit just to run such a simple program is a bad user experience.

Did you see significant throughput improvements in your load tests when moving from 10 to 100 goroutines? I'm asking because even with a concurrency level of 10 it runs really fast.

These people ran some benchmarks on SQS (albeit with a JVM client, and using the same thread for receive and delete) and they noticed peak throughput at ~25 threads:
https://softwaremill.com/amazon-sqs-performance-latency/

Would it be possible to reduce the default concurrency levels (and maybe expose a way for users to tune them)?

@vangent
Contributor

vangent commented Sep 20, 2019

Did you see significant throughput improvements in your load tests when moving from 10 to 100 goroutines?

Yes. Since the SQS ReceiveMessage RPC only returns at most 10 messages at a time, and each RPC takes non-trivial time to round-trip, we have to make many concurrent RPCs to get good throughput.

#1657 is stale, but shows a benchmark that at one point demonstrated that 100 concurrent is 5x faster than 10 concurrent (2500 messages/sec vs 500 messages/sec). I'm pretty sure the difference would be even greater now due to performance improvements in the pubsub concrete type code.
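
To make the arithmetic concrete, here's a back-of-envelope upper bound (the 50ms RTT is an assumed number, not a measurement):

// Each ReceiveMessage RPC returns at most 10 messages per round-trip,
// so throughput is bounded by concurrency * 10 / RTT.
rtt := 50 * time.Millisecond // assumed round-trip time
for _, conc := range []int{1, 10, 100} {
	perSec := float64(conc*10) / rtt.Seconds()
	fmt.Printf("concurrency %3d -> at most %.0f msgs/sec\n", conc, perSec)
}

Real numbers land well below these ceilings once ack RPCs and client-side overhead are factored in, but the scaling with concurrency is the point.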

These people ran some benchmarks on SQS

Interesting data point. It's a couple of years old, and uses the Java client library instead of the Go one, so not sure how directly relevant it is.

Would it be possible to reduce the default concurrency levels (and maybe expose to users a way to tune it)?

We have explicitly tried to avoid exposing knobs for tuning performance, to make the library easier to use. It dynamically adjusts to the throughput it sees (i.e., it won't start off making 100 concurrent RPCs to AWS; it only gets there if your message processing throughput warrants it). Here is some data from benchmarks using simulated backends with various characteristics.

Note that because of this, another way to work around the issue (other than increasing ulimit) might be to process messages more slowly. I.e., if you add a time.Sleep(100 * time.Millisecond) to your message processing loop, your throughput will go down, but so will the concurrency to AWS. Not sure if that's an option for you.
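
Concretely, applied to the loop from the original report, that would look like this (a sketch):

for {
	msg, err := sub.Receive(ctx)
	if err != nil {
		log.Println(err.Error())
		time.Sleep(1 * time.Second)
		continue
	}
	fmt.Println(string(msg.Body))
	msg.Ack()
	// Throttling here lowers observed throughput, so the driver ramps
	// up to fewer concurrent ReceiveMessage/DeleteMessage RPCs.
	time.Sleep(100 * time.Millisecond)
}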

@vangent vangent self-assigned this Sep 20, 2019
@telendt
Author

telendt commented Sep 21, 2019

I'm aware that SQS allows receiving at most 10 messages at a time, and that for maximum throughput one needs a lot of concurrent requests due to long RTTs. The heuristic you use (where the concurrency level depends on consumer throughput) seems very clever, but I still think the upper bounds might be a bit too high for many scenarios - 100 concurrent receive and 100 concurrent delete (ack) requests is quite a lot (especially if there's no HTTP/2 multiplexing or HTTP/1.1 pipelining employed).

On macOS the only problem is the crazily low default limit, which one can change with a single command (changing soft limits does not require special privileges), but I can imagine setups where this might not be possible (luckily that's not my use case).

I can't access the spreadsheet you linked to, but from the throughput numbers shared in #1657 I see that you used some beefy nodes - probably with a 20+ Gbit network. This is not what most people use :) E.g. I can imagine someone wanting to write a simple pub/sub client running on a Raspberry Pi...

That said, I don't have any hard facts at hand, and I can easily solve my problem (with a wrapper script that bumps the maxfiles limit), so I'm fine if you decide to close this issue for now.

@vangent
Contributor

vangent commented Sep 21, 2019

Yea, I think we can leave this alone. Users can control the concurrency indirectly as I described above, by controlling the message processing speed explicitly if necessary.

@vangent vangent closed this as completed Sep 21, 2019