Excessive TLS connections - CPU/Memory Usage #3067


Closed

nbaztec opened this issue Jan 7, 2020 · 8 comments
Labels
performance service-api This issue is due to a problem in a service API, not the SDK implementation.

Comments

@nbaztec

nbaztec commented Jan 7, 2020

Version of AWS SDK for Go?

v1.26.8

Version of Go (go version)?

go version go1.13 darwin/amd64

What issue did you see?

Profiling shows that the app was spending a significant share of its resources handling TLS handshakes.
Over a third of CPU/memory was being used for TLS negotiation (which improved after setting MaxIdleConnsPerHost).
However, even with IdleConnTimeout set, connections still seem to be discarded after roughly 6-8 seconds of inactivity, and a new TLS negotiation is initiated.

Steps to reproduce

  • Connect to Kinesis
sess := session.Must(session.NewSession(&aws.Config{
	Region:                        aws.String(awsRegion),
	CredentialsChainVerboseErrors: aws.Bool(verboseErrors),
}))

stsConfig := &aws.Config{
	Credentials:                   creds,
	Region:                        aws.String(awsRegion),
	CredentialsChainVerboseErrors: aws.Bool(verboseErrors),
	HTTPClient: &http.Client{
		Transport: &http.Transport{
			Proxy: http.ProxyFromEnvironment,
			DialContext: (&net.Dialer{
				Timeout:   30 * time.Second,
				KeepAlive: 30 * time.Second,
			}).DialContext,
			MaxIdleConns:          100,
			IdleConnTimeout:       90 * time.Second,
			MaxIdleConnsPerHost:   50,
			TLSHandshakeTimeout:   3 * time.Second,
			ExpectContinueTimeout: 1 * time.Second,
		},
	},
}

client := kinesis.New(sess, stsConfig)
client.PutRecord(...)

  • Execute the script to send some data

net/http/transport.go's addTLS() is invoked to start a new TLS session for every request that is 6-10 seconds apart from the previous one.

Expected

One would expect the TLS session to be negotiated only once, with subsequent requests reusing the idle connection.

@diehlaws diehlaws self-assigned this Jan 8, 2020
@jasdel
Contributor

jasdel commented Jan 14, 2020

Thanks for reaching out @nbaztec. The connection issue you are experiencing is most likely due to the Kinesis server closing the connection after 6-10 seconds of inactivity. Since it is the server that is closing the connection, the client's idle-connection configuration makes no difference.

This can be verified with httptrace.ClientTrace: the httptrace.GotConnInfo value passed into the ClientTrace's GotConn callback reports whether the connection was reused and how long it sat idle.
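
For example, a minimal sketch of wiring such a trace into a Kinesis call (the region and stream name are placeholders, not values from this issue):

package main

import (
	"context"
	"log"
	"net/http/httptrace"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/kinesis"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	client := kinesis.New(sess)

	// Report whether each request got a fresh or a reused connection, and how
	// long a reused connection had been idle.
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			log.Printf("reused=%v wasIdle=%v idleTime=%s", info.Reused, info.WasIdle, info.IdleTime)
		},
		TLSHandshakeStart: func() { log.Println("TLS handshake started") },
	}
	ctx := httptrace.WithClientTrace(context.Background(), trace)

	_, err := client.PutRecordWithContext(ctx, &kinesis.PutRecordInput{
		StreamName:   aws.String("example-stream"), // placeholder
		PartitionKey: aws.String("pk"),
		Data:         []byte("payload"),
	})
	if err != nil {
		log.Println("PutRecord error:", err)
	}
}

If the server is closing connections after a few seconds of inactivity, requests made after that window will log reused=false followed by a fresh TLS handshake.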

Each time the server closes the connection, the client needs to re-establish it. The Go HTTP client's transport does not resume TLS sessions across re-established connections, so a new TLS handshake is performed for each new connection. This is the source of the TLS activity you are seeing.

One potential workaround is to spread your record reporting out so that the gap between calls to PutRecord stays short, for example by sending the records to a channel that paces the calls to PutRecord (see the sketch below). I'm not sure whether Kinesis accepts PutRecord calls with empty data, but if it does, your application could make these (or a similar lightweight API call) to keep the connection alive. Note that some AWS APIs will only allow a reused connection to serve a limited number of requests.
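
A rough sketch of that channel-based pacing idea (the record type, the 5-second interval, and the stream handling are illustrative assumptions, not SDK features):

package sender

import (
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/kinesis"
)

// record is an illustrative stand-in for whatever the application reports.
type record struct {
	partitionKey string
	data         []byte
}

// startSender drains records on a fixed cadence so the gap between PutRecord
// calls stays below the observed ~6-8 second idle cutoff.
func startSender(client *kinesis.Kinesis, stream string, records <-chan record) {
	ticker := time.NewTicker(5 * time.Second) // illustrative pacing interval
	defer ticker.Stop()

	var pending []record
	for {
		select {
		case r, ok := <-records:
			if !ok {
				return
			}
			pending = append(pending, r)
		case <-ticker.C:
			if len(pending) == 0 {
				continue
			}
			r := pending[0]
			pending = pending[1:]
			if _, err := client.PutRecord(&kinesis.PutRecordInput{
				StreamName:   aws.String(stream),
				PartitionKey: aws.String(r.partitionKey),
				Data:         r.data,
			}); err != nil {
				log.Println("PutRecord:", err)
			}
		}
	}
}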

Kinesis's API does support HTTP without TLS, which would remove the overhead of the TLS session setup, but would make your record data visible on the network.
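
If that trade-off is acceptable for your data, a minimal sketch of opting out of TLS via the SDK's DisableSSL config option (the region is a placeholder, and sess is the session from the earlier snippet):

// Build a Kinesis client over plain HTTP. The record data will travel
// unencrypted, so only consider this on a trusted network.
cfg := &aws.Config{
	Region:     aws.String("us-east-1"), // placeholder
	DisableSSL: aws.Bool(true),
}
plainClient := kinesis.New(sess, cfg)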

@jasdel jasdel added performance service-api This issue is due to a problem in a service API, not the SDK implementation. labels Jan 14, 2020
@nbaztec
Author

nbaztec commented Jan 15, 2020

Thanks @jasdel for the added insight. Kinesis closing the connections was our guess as well, and in that light the approach of using channels to throttle/batch the PutRecord calls is something we are considering.

As you've correctly pointed out, there is no library-level fix that can mitigate the issue, so we'll look at mitigating it with Go channels instead.

Thanks!

@diehlaws
Contributor

Hi @nbaztec, please do let us know if you require further assistance from us on this. Otherwise feel free to close the issue, or we can let it auto-close due to inactivity.

@diehlaws diehlaws added the closing-soon This issue will automatically close in 4 days unless further comments are made. label May 15, 2020
@nbaztec
Author

nbaztec commented May 16, 2020

Hi! Thanks for the help. We've successfully resolved the issue by tweaking the keepalive parameters for the connection on our end.

Closing this one.

@nbaztec nbaztec closed this as completed May 16, 2020
@diehlaws diehlaws removed the closing-soon This issue will automatically close in 4 days unless further comments are made. label May 18, 2020
@diehlaws diehlaws removed their assignment Aug 26, 2020
@leventov

@nbaztec could you please describe specifically how you changed the keep-alive parameters to resolve this problem?

We have exactly the same issue with TLS reconnections when accessing DynamoDB. We have configured net.Dialer.KeepAlive to 10 seconds and still see a lot of full handshakes (addTLS() calls) in the profile.

OTOH, I don't understand why keep-alive even matters under load.

However, even when IdleConnTimeout is set, the connections still seem to be discarded after around ~6-8 seconds of inactivity and a new TLS negotiation is initiated.

If there is a performance problem, why are there long periods of connection inactivity? (The same reasoning applies to our case with DynamoDB.)

@jasdel do you have some insights on this?

@nbaztec
Author

nbaztec commented Apr 30, 2021

@leventov I mitigated it by explicitly setting the following connection-pool parameters on the HTTPClient:

cfg := &aws.Config{
	...
	HTTPClient: &http.Client{
		Transport: &http.Transport{
			MaxIdleConns:        100,
			IdleConnTimeout:     90 * time.Second,
			MaxIdleConnsPerHost: 50,
			MaxConnsPerHost:     100,
		},
	},
}

This is my hypothesis, with some memory jogging from the profiler: the per-host idle connection limit defaults to 2 in net/http (DefaultMaxIdleConnsPerHost), and hence under heavy load, for reasons I cannot fully explain, only the first 2 connections get reused; any further connection needs to perform the TLS handshake once more. Hope it helps.
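
For later readers, a self-contained sketch of wiring a tuned transport like this into the session itself (the region and stream name are placeholders; the pool limits mirror the snippet above and should be tuned to your own workload):

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/kinesis"
)

func main() {
	httpClient := &http.Client{
		Transport: &http.Transport{
			Proxy:               http.ProxyFromEnvironment,
			MaxIdleConns:        100,
			IdleConnTimeout:     90 * time.Second,
			MaxIdleConnsPerHost: 50,
			MaxConnsPerHost:     100,
		},
	}

	sess := session.Must(session.NewSession(&aws.Config{
		Region:     aws.String("us-east-1"), // placeholder region
		HTTPClient: httpClient,
	}))
	client := kinesis.New(sess)

	if _, err := client.PutRecord(&kinesis.PutRecordInput{
		StreamName:   aws.String("example-stream"), // placeholder stream
		PartitionKey: aws.String("pk"),
		Data:         []byte("payload"),
	}); err != nil {
		log.Println("PutRecord:", err)
	}
}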

@leventov

@nbaztec thanks for the reply!

It seems to me that setting MaxConnsPerHost makes a significant difference, even though in theory it shouldn't.

It might be that these issues are related: golang/go#20960, golang/go#42650.

Also, surprisingly, http2.ConfigureTransport() doesn't make any noticeable difference.

@KingJayant

@leventov Hi, I am also seeing high CPU usage from concurrent requests requiring TLS negotiation with DynamoDB. Can you please explain how you resolved this issue?
