
Errors and high latency when using GCS backend under (medium) load #5419

@lbernail

Description

Describe the bug
When using the GCS backend we noticed high latency for Vault authentication and reads when performing operations in parallel from several clients (sometimes the clients even receive errors). We first suspected throttling on the GCS API, but accessing the same bucket and keys with the gsutil CLI remained fast during these high-latency events.

Average latency reported by the telemetry endpoint goes up to dozens of seconds during these events for most operations.

When this was happening, the Vault logs contained many errors like:

2018-09-25T17:39:09.236Z [ERROR] core: error checking health: error="failed to check for initialization: failed to read object: Get https://www.googleapis.com/storage/v1/b//o?alt=json&delimiter=%2F&pageToken=&prefix=core%2F&projection=full&versions=false: stream error: stream ID 68481; REFUSED_STREAM"

error closing connection: Post https://www.googleapis.com/upload/storage/v1/b//o?alt=json&projection=full&uploadType=multipart: http2: Transport: cannot retry err [stream error: stream ID 68483; REFUSED_STREAM] after Request.Body was written; define Request.GetBody to avoid this error"

REFUSED_STREAM seems to indicate that the GCS client is trying to open too many streams on the HTTP/2 connection. I had a quick look at the code: the GCS backend uses the standard GCS Go library, so I wonder if the issue could actually be there.

We will try to work around this by decreasing the max_parallel option of the backend (I will update the issue after our tests), but I was wondering if you had other ideas.
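For reference, a sketch of what that workaround would look like in the server config; max_parallel is the documented option of the gcs storage stanza, and the value 64 here is an arbitrary starting point we will tune during testing:

```hcl
# Cap the number of concurrent requests Vault makes to GCS.
# 64 is a guess to be validated under load (the option's default is higher).
storage "gcs" {
  max_parallel = 64
}
```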

To Reproduce
Steps to reproduce the behavior:

  1. Run a vault server using gcs as backend
  2. Authenticate and read keys from many clients in parallel. In our case, ~1000 clients perform authentication (Kubernetes auth backend) and read ~5 keys each over 15 min (but since this happens during a Kubernetes rolling update, we have 3 batches of ~300 clients that perform these operations over 1-2 minutes)

Expected behavior
Limited increase in latency and no errors

Environment:

  • Vault Server Version: 0.11.1
  • Vault CLI Version: consul-template 0.19.4
  • Server Operating System/Architecture: Ubuntu 18.04

Vault server configuration file(s):

plugin_directory = "/var/lib/vault/plugins"

storage "gcs" {}

disable_clustering = true

ha_storage "consul" {
  address = "127.0.0.1:8500"
  path    = "vault/"
}

ui = true

listener "tcp" {
  address         = ":443"
  cluster_address = ":8201"
}

telemetry {
  dogstatsd_addr = "localhost:8125"
}

GOOGLE_STORAGE_BUCKET is set via an environment variable.

Additional context
N/A
