
Errors and high latency when using GCS backend under (medium) load #5419

@lbernail

Description

Describe the bug
When using the GCS backend we noticed high latency for Vault authentication and reads when performing operations in parallel from several clients (sometimes the clients even receive errors). We first suspected throttling on the GCS API, but accessing the same bucket and keys with the gsutil CLI remained fast during these high-latency events.

Average latency reported by the telemetry endpoint goes up to dozens of seconds during these events for most operations.

When this was happening, the Vault logs contained many errors like:

2018-09-25T17:39:09.236Z [ERROR] core: error checking health: error="failed to check for initialization: failed to read object: Get https://www.googleapis.com/storage/v1/b//o?alt=json&delimiter=%2F&pageToken=&prefix=core%2F&projection=full&versions=false: stream error: stream ID 68481; REFUSED_STREAM"

error closing connection: Post https://www.googleapis.com/upload/storage/v1/b//o?alt=json&projection=full&uploadType=multipart: http2: Transport: cannot retry err [stream error: stream ID 68483; REFUSED_STREAM] after Request.Body was written; define Request.GetBody to avoid this error"

REFUSED_STREAM seems to indicate that the GCS client is trying to open too many streams on the HTTP/2 connection. I had a quick look at the code: the GCS backend uses the standard GCS Go library, so I wonder if the issue could actually be there.

We will try to work around this by decreasing the max_parallel option of the backend (I will update the issue after our tests), but I was wondering if you had other ideas.
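For reference, a sketch of what that workaround would look like in the server config; max_parallel is the documented option of the gcs storage stanza, and the value 64 here is an arbitrary starting point we will tune during testing:

```hcl
# Cap the number of concurrent requests Vault makes to GCS.
# 64 is a guess to be validated under load (the option's default is higher).
storage "gcs" {
  max_parallel = 64
}
```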

To Reproduce
Steps to reproduce the behavior:

  1. Run a vault server using gcs as backend
  2. Authenticate and read keys from many clients in parallel. In our case, ~1000 clients perform authentication (Kubernetes auth backend) and read ~5 keys each over 15 min (but since this happens during a Kubernetes rolling update, we have 3 batches of ~300 clients that perform these operations over 1-2 minutes)

Expected behavior
Limited increase in latency and no errors

Environment:

  • Vault Server Version: 0.11.1
  • Vault CLI Version: consul-template 0.19.4
  • Server Operating System/Architecture: Ubuntu 18.04

Vault server configuration file(s):

plugin_directory = "/var/lib/vault/plugins"

storage "gcs" {}

disable_clustering = true

ha_storage "consul" {
  address = "127.0.0.1:8500"
  path    = "vault/"
}

ui = true

listener "tcp" {
  address         = ":443"
  cluster_address = ":8201"
}

telemetry {
  dogstatsd_addr = "localhost:8125"
}

GOOGLE_STORAGE_BUCKET is set via an environment variable.

Additional context
N/A
