Describe the bug
When using the GCS storage backend, we noticed high latency for Vault authentication and reads when performing operations in parallel from several clients (sometimes the clients even receive errors). We first suspected throttling on the GCS API, but accessing the same bucket and keys with the gsutil CLI remained fast during these high-latency events.
Average latency reported by the telemetry endpoint climbs to dozens of seconds for most operations during these events.
Looking at the Vault logs while this was happening, we saw a lot of errors like the following:
2018-09-25T17:39:09.236Z [ERROR] core: error checking health: error="failed to check for initialization: failed to read object: Get https://www.googleapis.com/storage/v1/b//o?alt=json&delimiter=%2F&pageToken=&prefix=core%2F&projection=full&versions=false: stream error: stream ID 68481; REFUSED_STREAM"
error closing connection: Post https://www.googleapis.com/upload/storage/v1/b//o?alt=json&projection=full&uploadType=multipart: http2: Transport: cannot retry err [stream error: stream ID 68483; REFUSED_STREAM] after Request.Body was written; define Request.GetBody to avoid this error"
REFUSED_STREAM seems to indicate that the GCS client is trying to open too many streams on the HTTP/2 connection. I had a quick look at the code: the GCS backend appears to use the standard GCS Go client library, so I wonder whether the issue could actually be there.
We will try to work around this by decreasing the max_parallel option of the backend (I will update the issue after our tests), but I was wondering if you had other ideas. A rough sketch of the change we plan to try is shown below.
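As a minimal sketch, assuming the bucket name keeps coming from the GOOGLE_STORAGE_BUCKET environment variable as in our config below, and using 64 only as an illustrative (untested) value, the tuned storage stanza would look something like this:

storage "gcs" {
  # Cap the number of concurrent requests Vault issues to GCS; lowering this
  # should reduce how many streams are opened on a single HTTP/2 connection.
  max_parallel = 64
}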
To Reproduce
Steps to reproduce the behavior:
- Run a Vault server using GCS as the storage backend
- Authenticate and read keys from many clients in parallel. In our case, ~1000 clients authenticate (Kubernetes auth backend) and read ~5 keys each over 15 minutes; since this happens during a Kubernetes rolling update, the load arrives in 3 batches of ~300 clients, each batch performing these operations within 1-2 minutes.
Expected behavior
Limited increase in latency and no errors
Environment:
- Vault Server Version: 0.11.1
- Vault CLI Version: consul-template 0.19.4
- Server Operating System/Architecture: Ubuntu 18.04
Vault server configuration file(s):
plugin_directory = "/var/lib/vault/plugins"

storage "gcs" {}

disable_clustering = true

ha_storage "consul" {
  address = "127.0.0.1:8500"
  path    = "vault/"
}

ui = true

listener "tcp" {
  address         = ":443"
  cluster_address = ":8201"
}

telemetry {
  dogstatsd_addr = "localhost:8125"
}
GOOGLE_STORAGE_BUCKET is set via an environment variable.
Additional context
N/A