Description
Describe the bug
Transient failures can result in loadTrustedCerts storing only a partial set of certificates in the cache used for authentication.
Specifically, if the backend encounters a transient error in the Cert constructor, the error is dropped on the floor and the loop continues. Any certificate that was successfully loaded gets put into the cache, and any certificate that failed to load for any reason is excluded.
We had a case where a request timed out while the cache was being filled. Our Vault instance saw a spike of traffic after a restart and was consequently throttled. As a result, the request timed out while it was loading certificates, and the Cert constructor failed because the context was canceled. Despite the canceled context, the partial list of certificates was installed in the cache. From that point on, clients whose certificates were missing from the cache could never successfully authenticate with the process.
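To make the failure mode concrete, here is a minimal sketch of a loading loop that refuses to install a partial set. It is not Vault's actual code: certCache, trustedCert, loader, refresh, and simulate are hypothetical names invented for this illustration, and the loader merely stands in for whatever the Cert constructor does against storage.

```go
package main

import (
	"context"
	"fmt"
)

// trustedCert is a hypothetical stand-in for one parsed trusted certificate entry.
type trustedCert struct{ Name string }

// loader is a hypothetical function that loads a single certificate from storage.
type loader func(ctx context.Context, name string) (*trustedCert, error)

// certCache holds the last complete set of trusted certificates.
type certCache struct {
	certs []*trustedCert
}

// refresh loads every named certificate and replaces the cached set only if
// all of them loaded. On any error the previous cache contents are kept.
func (c *certCache) refresh(ctx context.Context, names []string, load loader) error {
	loaded := make([]*trustedCert, 0, len(names))
	for _, name := range names {
		cert, err := load(ctx, name)
		if err != nil {
			// Propagate the error instead of skipping the entry and continuing.
			return fmt.Errorf("loading cert %q: %w", name, err)
		}
		loaded = append(loaded, cert)
	}
	c.certs = loaded
	return nil
}

// simulate runs one refresh over three certificates. The load at index failAt
// fails with a canceled context (use -1 for no failure). It returns how many
// certificates ended up cached and whether refresh reported an error.
func simulate(failAt int) (cached int, failed bool) {
	cache := &certCache{}
	calls := 0
	load := func(ctx context.Context, name string) (*trustedCert, error) {
		idx := calls
		calls++
		if idx == failAt {
			return nil, context.Canceled
		}
		return &trustedCert{Name: name}, nil
	}
	err := cache.refresh(context.Background(), []string{"a", "b", "c"}, load)
	return len(cache.certs), err != nil
}

func main() {
	fmt.Println(simulate(1))  // 0 true
	fmt.Println(simulate(-1)) // 3 false
}
```

This sketch aborts on any error. A real fix might instead distinguish transient errors such as a canceled context from permanently malformed entries, and skip only the latter.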
To Reproduce
I don't have a turnkey reproduction of this problem yet.
Expected behavior
I would expect that the process doesn't wind up with a poisoned cache, and eventually self-corrects.
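One way the self-correcting behavior could look is sketched below, under the assumption that the cache is only marked as populated after a complete load, so a later request retries. The names lazyCache, get, and recoverAfter are hypothetical and do not correspond to Vault's implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// lazyCache is a hypothetical cache that remembers whether a complete
// certificate set has ever been stored.
type lazyCache struct {
	mu     sync.Mutex
	loaded bool
	certs  []string
}

// get returns the cached set, or calls load if no complete set is stored yet.
// A failed load leaves loaded == false, so the next call tries again.
func (c *lazyCache) get(load func() ([]string, error)) ([]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.loaded {
		return c.certs, nil
	}
	certs, err := load()
	if err != nil {
		return nil, err
	}
	c.certs, c.loaded = certs, true
	return certs, nil
}

// recoverAfter simulates a loader that fails the given number of times and
// then succeeds. It returns how many get calls were needed and how many
// certificates were cached at the end.
func recoverAfter(failures int) (calls int, cached int) {
	c := &lazyCache{}
	remaining := failures
	load := func() ([]string, error) {
		if remaining > 0 {
			remaining--
			return nil, errors.New("transient storage error")
		}
		return []string{"a", "b", "c"}, nil
	}
	for {
		calls++
		certs, err := c.get(load)
		if err == nil {
			return calls, len(certs)
		}
	}
}

func main() {
	fmt.Println(recoverAfter(2)) // 3 3
}
```

With this shape, a request that hits a transient failure still fails, but it does not leave behind state that makes every later request fail as well.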
Environment:
- Vault Server Version (retrieve with vault status): 1.19.3
- Vault CLI Version (retrieve with vault version): Vault v1.19.3 ('a2de3bb7bcf4a073cbb8724863a5a88d3c2f83da+CHANGES'), built 2025-04-29T10:34:52Z
- Server Operating System/Architecture:
Linux hc-vault-2 5.4.0-1149-azure-fips #157+fips1-Ubuntu SMP Thu Apr 3 05:40:02 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Vault server configuration file(s):
# cat /config/vault.conf
listener "tcp" {
address = "0.0.0.0:8200"
tls_cert_file = "/etc/tls/tls.crt"
tls_key_file = "/etc/tls/tls.key"
tls_cipher_suites = "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
telemetry {
unauthenticated_metrics_access = true
}
}
listener "tcp" {
address = "127.0.0.1:9000"
tls_disable = 1
telemetry {
unauthenticated_metrics_access = true
}
}
telemetry {
prometheus_retention_time = "360s"
disable_hostname = true
statsd_address = "localhost:9125"
}
plugin_directory = "/databricks/vault/plugins"
ha_storage "raft" {
path = "/vault/data"
}
disable_mlock = true
ui = true
service_registration "kubernetes" {}
# cat /storage-config/vault-azure.conf
storage "azure" {
accountName = [REDACTED]
accountKey = [REDACTED]
container = [REDACTED]
environment = "AzurePublicCloud"
}
seal "azurekeyvault" {
client_id = [REDACTED]
tenant_id = [REDACTED]
vault_name = [REDACTED]
key_name = [REDACTED]
environment = "AZUREPUBLICCLOUD"
}
Additional context