Skip to content

Conversation

benashz
Copy link
Collaborator

@benashz benashz commented Mar 22, 2024

In the case where a lifetimeWatcher fails to renew the Vault client lease, we want all related VDS instances to be synced. This helps to mitigate the issue where an external revocation of the client token causes the issued secret leases to be revoked. In that case VSO would have no idea that the token has been revoked. This fix should also work in the case where the token has a max TTL set.

The ideal TTL for the client token should be relatively short, e.g. 1m, so as to trigger the lifetimeWatcher earlier.

In the future, VSO will be able to subscribe to lease revocation events from Vault. In that case, VSO will be able to perform the sync immediately.

This PR adds a new source.Channel watcher for handling external reconciliation requests. These are triggered by the Vault Client's token having expired or whenever a token lease renewal has failed.

The CachingClientFactory has been updated to handle when a Client's LifetimeWatcher is done. In which case, the factory will remove the Vault Client from the its cache, and call any of the registered ClientCallback functions on behalf of the VDS reconciler.

This fix should also allow for using non-periodic Vault client tokens, although that is not a recommended practice as all of the other Vault secret reconcilers have this this new callback feature.

@benashz benashz added dynamic Dynamic secrets reproduced This issue has been reproduced by a Vault engineer labels Mar 22, 2024
@benashz benashz linked an issue Apr 4, 2024 that may be closed by this pull request
@benashz benashz force-pushed the VAULT-25273/vds-trigger-reconciliation-on-token-errors branch 2 times, most recently from 3fc9613 to 38e20fd Compare April 11, 2024 14:14
In the case where a lifetimeWatcher fails to renew the Vault client
lease, we want all related VDS instances to be synced. This helps to
mitigate the issue where an external revocation of the client token
causes the issued secret leases to be revoked. In that case VSO would
have no idea that the token has been revoked.

The ideal TTL for the client token should be relatively short, e.g. 1m,
so as to trigger the lifetimeWatcher earlier.

In the future, VSO will be able to subscribe to lease revocation events
from Vault. In that case, VSO will be able to perform the sync
immediately.
@benashz benashz force-pushed the VAULT-25273/vds-trigger-reconciliation-on-token-errors branch from 38e20fd to e1b3b26 Compare April 11, 2024 14:18
@benashz benashz added this to the v0.6.0 milestone Apr 15, 2024
@benashz benashz requested a review from tvoran April 16, 2024 22:08
@benashz benashz marked this pull request as ready for review April 16, 2024 22:08
@benashz benashz requested a review from a team as a code owner April 16, 2024 22:08
@benashz benashz requested a review from thyton April 16, 2024 22:08
@benashz benashz force-pushed the VAULT-25273/vds-trigger-reconciliation-on-token-errors branch from 044b340 to 1cdae92 Compare April 16, 2024 22:22
@benashz benashz changed the title WIP: reconcile VDS instances on lifetimeWatcher errors Reconcile VDS instances on lifetimeWatcher errors Apr 16, 2024
@benashz benashz changed the title Reconcile VDS instances on lifetimeWatcher errors Reconcile VDS instances on lifetimeWatcher done events Apr 16, 2024
benashz added 5 commits April 19, 2024 11:26
- remove the called back client from the factory's cache to ensure that
  any of its clones are purged. The next call to Sync() will get a new
  client back
- fix bogus lock handling in the factory's Get()
@benashz benashz requested a review from tvoran April 19, 2024 19:15
@tvoran
Copy link
Member

tvoran commented Apr 23, 2024

Have you considered using a GenericEvent source channel in the controller's watches to enqueue reconciles? Looks like it could simplify the sync controller portion of this. There's a slightly outdated example here.

@benashz
Copy link
Collaborator Author

benashz commented Apr 23, 2024

Have you considered using a GenericEvent source channel in the controller's watches to enqueue reconciles? Looks like it could simplify the sync controller portion of this. There's a slightly outdated example here.

I had considered its use but it did not offer an easy way to specify a sync delay. We may be able to add a random sleep or something before enqueuing the object. I can put up a spike PR to see if its use fits well here.

@benashz
Copy link
Collaborator Author

benashz commented Apr 24, 2024

Switched to use GenericEvents in #704

benashz added 4 commits April 24, 2024 12:58
Update the VaultDynamicSecretReconciler to rely on the
controller-runtime's source.Channel that is watched as a raw source.
This ensures that there is only one reconciler work queue required to
handle external reconciliation requests.
Previously, the client factory would restore all stored clients to warm
its client cache. Some issues with that:
- restore all is blocking and can lead to a failure of the VSO pod to
  become ready if there a large number of clients to be restored
- restore all is unnecessary, since the client factory will always
  attempt a restoration upon resource reconciliation. All resources are
  reconciled when VSO starts up.
It is nice to have k8s cluster metrics during integration testing.
@benashz benashz merged commit 5c62c3c into main Apr 24, 2024
@benashz benashz deleted the VAULT-25273/vds-trigger-reconciliation-on-token-errors branch April 24, 2024 18:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dynamic Dynamic secrets reproduced This issue has been reproduced by a Vault engineer
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Secrets (still) not being renewed occasionally
2 participants