Transient failures in loadTrustedCerts can wind up poisoning the cache #31348

@doty-db

Describe the bug
Transient failures can cause loadTrustedCerts to cache only a partial set of certificates, and that cache is then used for authentication.

Specifically, if the backend encounters a transient error in the Cert constructor, the error is dropped on the floor and the loop continues. Every certificate that loaded successfully is put into the cache, and every certificate that failed to load, for any reason, is excluded.

We hit this when a request timed out while the cache was being filled. Our Vault instance saw a spike of traffic after a restart and was consequently throttled. As a result, the request timed out while it was loading certificates, and the Cert constructor failed because the context was canceled. Despite the canceled context, the partial list of certificates was installed in the cache. From that point on, the clients could never successfully authenticate with the process.

To Reproduce
I don't have a turnkey reproduction of this problem yet.

Expected behavior
I would expect the process to avoid winding up with a poisoned cache, and to self-correct eventually.

Environment:

  • Vault Server Version (retrieve with vault status): 1.19.3
  • Vault CLI Version (retrieve with vault version): Vault v1.19.3 ('a2de3bb7bcf4a073cbb8724863a5a88d3c2f83da+CHANGES'), built 2025-04-29T10:34:52Z
  • Server Operating System/Architecture: Linux hc-vault-2 5.4.0-1149-azure-fips #157+fips1-Ubuntu SMP Thu Apr 3 05:40:02 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Vault server configuration file(s):

# cat /config/vault.conf
listener "tcp" {
  address = "0.0.0.0:8200"
  tls_cert_file = "/etc/tls/tls.crt"
  tls_key_file = "/etc/tls/tls.key"
  tls_cipher_suites = "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
  telemetry {
    unauthenticated_metrics_access = true
  }
}
listener "tcp" {
  address = "127.0.0.1:9000"
  tls_disable = 1
  telemetry {
    unauthenticated_metrics_access = true
  }
}
telemetry {
  prometheus_retention_time = "360s"
  disable_hostname = true
  statsd_address = "localhost:9125"
}
plugin_directory = "/databricks/vault/plugins"
ha_storage "raft" {
  path = "/vault/data"
}

disable_mlock = true
ui = true
service_registration "kubernetes" {}
# cat /storage-config/vault-azure.conf
storage "azure" {
  accountName = [REDACTED]
  accountKey = [REDACTED]
  container = [REDACTED]
  environment = "AzurePublicCloud"
}
seal "azurekeyvault" {
  client_id = [REDACTED]
  tenant_id = [REDACTED]
  vault_name = [REDACTED]
  key_name = [REDACTED]
  environment = "AZUREPUBLICCLOUD"
}

Additional context

Labels: auth/cert, bug
