Skip to content

Conversation

rfredette
Copy link
Contributor

@rfredette rfredette commented Feb 7, 2023

This PR makes the router parse the certificates supplied in ROUTER_MUTUAL_TLS_AUTH_CA, download CRLs from any CRL distribution points that are found, and write them to /crls/crls.pem. That file is supplied to haproxy to use for mTLS client auth.

This PR requires that the router deployment include an EmptyDir mount for the /crls directory

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 7, 2023
@openshift-ci openshift-ci bot requested review from gcs278 and knobunc February 7, 2023 20:29
@rfredette rfredette force-pushed the ocpbugs-6661-crl-limit branch from 706a67b to e431715 Compare February 15, 2023 18:37
@rfredette rfredette changed the title WIP: Download CRLs as part of the router startup WIP: Manage mTLS CRLs Feb 15, 2023
if nextUpdate.IsZero() {
// TODO: Make this a longer interval (for example, hourly) or implement backoff logic?
log.V(1).Info("Couldn't determine next CRL update time. Falling back to updating every 5 minutes.")
nextUpdate = time.Now().Add(5 * time.Minute)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every five minutes seems rather frequent, and CRLs can be quite large. I think daily is more typical for CRLs, right?

In any case, can you add a named constant for the default update interval?

const (
	// defaultUpdateInterval is the interval to wait before updating CRLs
	// when the CRLs do not specify any next update time.
	defaultUpdateInterval = 1 * time.Day
	// ...
)
Suggested change
nextUpdate = time.Now().Add(5 * time.Minute)
nextUpdate = time.Now().Add(defaultUpdateInterval)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I currently have it set to 5 minutes to aid in testing, since I manually update the CRL files, and sometimes I've missed the actual expiration time. I will add a constant with the appropriate fallback interval before this is ready to merge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the failure was transient should we wait for another day before downloading again? I could see a case where we do a number of retries before falling back to downloading once a day. Otherwise (and assuming a transient download failure) the only way to get a new list (in a hurry) is to delete the pod.

@rfredette rfredette changed the title WIP: Manage mTLS CRLs WIP: OCPBUGS-6661: Manage mTLS CRLs Feb 17, 2023
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Feb 17, 2023
@openshift-ci-robot
Copy link
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-6661, which is invalid:

  • expected the bug to target the "4.13.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

This PR makes the router parse the certificates supplied in ROUTER_MUTUAL_TLS_AUTH_CA, download CRLs from any CRL distribution points that are found, and write them to /crls/crls.pem. That file is supplied to haproxy to use for mTLS client auth.

This PR requires that the router deployment include an EmptyDir mount for the /crls directory

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Feb 17, 2023
@rfredette rfredette force-pushed the ocpbugs-6661-crl-limit branch from 1673c1c to d309252 Compare February 27, 2023 19:37
if nextUpdate.IsZero() {
// TODO: Make this a longer interval (for example, hourly) or implement backoff logic?
log.V(1).Info("Couldn't determine next CRL update time. Falling back to updating every 5 minutes.")
nextUpdate = time.Now().Add(5 * time.Minute)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the failure was transient should we wait for another day before downloading again? I could see a case where we do a number of retries before falling back to downloading once a day. Otherwise (and assuming a transient download failure) the only way to get a new list (in a hurry) is to delete the pod.

Copy link
Contributor

@Miciah Miciah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the latest changes. The cleanup of stale mTLS config looks much more straightforward now. Nice improvement!

However, several comments from my previous review don't appear to have been addressed. Can you make sure that past review comments haven't been overlooked? It's fine if you decide to reject some of the suggestions, but please reply in those cases so that I know a comment wasn't inadvertently missed.

@rfredette rfredette force-pushed the ocpbugs-6661-crl-limit branch from 7a406f0 to f4898ec Compare March 14, 2023 00:05
@rfredette
Copy link
Contributor Author

I plan to remove the build changes I made (in Makefile and the haproxy Dockerfile.rhel8), set the retry timeout to a final value, probably larger than 5 minutes, and squash the commits before this merges. It still needs some cleanup in the tags, but just in case:

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 14, 2023
@frobware
Copy link
Contributor

/retest

Copy link
Contributor

@gcs278 gcs278 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall look pretty good. A bit hard to understand what's happening, but I think I got the essence of it.

for {
updated := false
if nextUpdate.IsZero() {
log.V(4).Info("no nextUpdate. only watching for CA updates")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit Should we have these log messages describe where these are coming from? As they stand on their own, it might be confusing to a dev in the future when they turn on verbose logging. e.g.

Suggested change
log.V(4).Info("no nextUpdate. only watching for CA updates")
log.V(4).Info("ManageCRLs: no nextUpdate. only watching for CA updates")

crls := make(map[string]*x509.RevocationList)
updated := false
now := time.Now()
for len(clientCAData) > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is duplicated with ShouldHaveCRLs. Is there a way to clean this up? Seems like a maintenance burden.

Seems like you could just wait to read the clientCAData (it's done early in writeCRLFile, but not necessary), and then pass caBundleFilename to ShouldHaveCRLs.

return nil
}

// ShouldHaveCRLs returns true if any of the certificates in caBundleFilename specify a CRL distribution point
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit and it returns an error

@@ -290,8 +292,8 @@ frontend fe_sni
{{- "" }} crt-list /var/lib/haproxy/conf/cert_config.map accept-proxy
{{- with (env "ROUTER_MUTUAL_TLS_AUTH") }}
{{- "" }} verify {{. }}
{{- with (env "ROUTER_MUTUAL_TLS_AUTH_CA") }} ca-file {{. }} {{ else }} ca-file /etc/ssl/certs/ca-bundle.trust.crt {{ end }}
{{- with (env "ROUTER_MUTUAL_TLS_AUTH_CRL") }} crl-file {{. }} {{ end }}
{{- if $haveClientCA }} ca-file /var/lib/haproxy/mtls/latest/ca-bundle.pem {{ else }} ca-file /etc/ssl/certs/ca-bundle.trust.crt {{ end }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought, /var/lib/haproxy/mtls/latest/ca-bundle.pem could also be GO template variables, since they are already defined in crl.go

// caUpdateChannel (indicating the CA bundle has been updated), or when any existing CRL expires. Whenever either the CA
// bundle or the CRL file has changed, updateCallback is called, with a boolean indicating whether crl-file needs to be
// specified in the HAProxy config.
func ManageCRLs(caBundleFilename string, caUpdateChannel <-chan struct{}, updateCallback func(bool)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have some unit testing for at least the functions that aren't a part of templateRouter struct. Might have to be a little creative with the tests, but I think it's do-able.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most (if not all) of the functions I've added either deal with downloading data from another source, or read and write files in fixed locations on the filesystem, so isolating them to run in a unit test scenario would take some work. If I'm able to get good enough coverage with e2e tests, what do you think about me circling back to figure out unit tests in a later PR?

@rfredette rfredette force-pushed the ocpbugs-6661-crl-limit branch from 990dd53 to 79f25b4 Compare March 27, 2023 22:14
@rfredette rfredette force-pushed the ocpbugs-6661-crl-limit branch from 79f25b4 to b2e8375 Compare April 4, 2023 22:53
@gcs278
Copy link
Contributor

gcs278 commented Apr 5, 2023

Commandeering this CI to test if haproxy22 RPM is still available
/test images

@rfredette rfredette changed the title WIP: OCPBUGS-6661: Manage mTLS CRLs OCPBUGS-6661: Manage mTLS CRLs Apr 20, 2023
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 20, 2023
@rfredette
Copy link
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 20, 2023
@openshift-ci-robot
Copy link
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-6661, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from lihongan April 20, 2023 21:34
@rfredette
Copy link
Contributor Author

/retest-required

@candita
Copy link
Contributor

candita commented May 10, 2023

Assigning Miciah and Andy due to their current level of feedback. Feel free to remove, and add to PR Scrub instead.
/assign @Miciah
/assign @frobware

@rfredette rfredette force-pushed the ocpbugs-6661-crl-limit branch from fd46043 to 8c320e2 Compare May 10, 2023 21:58
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 10, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from frobware. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 11, 2023

@rfredette: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-upgrade 8c320e2 link true /test e2e-upgrade
ci/prow/e2e-agnostic 8c320e2 link true /test e2e-agnostic

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Comment on lines +295 to +296
{{- if $haveClientCA }} ca-file /var/lib/haproxy/mtls/latest/ca-bundle.pem {{ else }} ca-file /etc/ssl/certs/ca-bundle.trust.crt {{ end }}
{{- if $haveCRLs }} crl-file /var/lib/haproxy/mtls/latest/crls.pem {{ end }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You replaced spaces with tabs. Please use spaces for consistency with the rest of the template.

I'm wary of breaking OKD users, or people using custom router deployments (unsupported, but better not to break them if we don't need to). If ROUTER_MUTUAL_TLS_AUTH_CA or ROUTER_MUTUAL_TLS_AUTH_CRL is set, we should still respect it. Likewise for ROUTER_MUTUAL_TLS_AUTH_CRL below.

Suggested change
{{- if $haveClientCA }} ca-file /var/lib/haproxy/mtls/latest/ca-bundle.pem {{ else }} ca-file /etc/ssl/certs/ca-bundle.trust.crt {{ end }}
{{- if $haveCRLs }} crl-file /var/lib/haproxy/mtls/latest/crls.pem {{ end }}
{{- with (env "ROUTER_MUTUAL_TLS_AUTH_CA") }} ca-file {{ . }} {{- else }}{{ if $haveClientCA }} ca-file /var/lib/haproxy/mtls/latest/ca-bundle.pem {{- else }} ca-file /etc/ssl/certs/ca-bundle.trust.crt {{- end }}{{ end }}
{{- with (env "ROUTER_MUTUAL_TLS_AUTH_CRL") }} crl-file {{ . }} {{- else }}{{ if $haveCRLs }} crl-file /var/lib/haproxy/mtls/latest/crls.pem {{- end }}{{ end }}

Likewise for ROUTER_MUTUAL_TLS_AUTH_CRL below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My router changes still use ROUTER_MUTUAL_TLS_AUTH_CA since the operator still needs to supply the CA bundle, so using with (env "ROUTER_MUTUAL_TRLS_AUTH_CA") by itself to select the old behavior isn't specific enough.

What do you think of something like this?

Suggested change
{{- if $haveClientCA }} ca-file /var/lib/haproxy/mtls/latest/ca-bundle.pem {{ else }} ca-file /etc/ssl/certs/ca-bundle.trust.crt {{ end }}
{{- if $haveCRLs }} crl-file /var/lib/haproxy/mtls/latest/crls.pem {{ end }}
{{- with (env "ROUTER_MUTUAL_TLS_AUTH_CRL") }} crl-file {{ . }}
{{- with (env "ROUTER_MUTUAL_TLS_AUTH_CA") }} ca-file {{ . }} {{- else }} ca-file /etc/ssl/certs/ca-bundle.trust.crt {{ end }}
{{- else }}
{{- if $haveClientCA }} ca-file /var/lib/haproxy/mtls/latest/ca-bundle.pem {{- else }} ca-file /etc/ssl/certs/ca-bundle.trust.crt {{ end }}
{{- if $haveCRLs }} crl-file /var/lib/haproxy/mtls/latest/crls.pem {{ end }}
{{- end}}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could work (though I think you also need to re-add {{- with (env "ROUTER_MUTUAL_TLS_AUTH_CRL") }} crl-file {{. }} {{ end }}).

Copy link
Contributor Author

@rfredette rfredette May 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix I proposed does include crl-file ... in that case, but maybe it's a bit confusing. I'll see if I can make it clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've included a clearer version of this fix in #472.

@rfredette
Copy link
Contributor Author

Closing in favor of #472, which includes this fix as well as a fix for OCPBUGS-9464.

@rfredette rfredette closed this May 17, 2023
@openshift-ci-robot
Copy link
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-6661. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

This PR makes the router parse the certificates supplied in ROUTER_MUTUAL_TLS_AUTH_CA, download CRLs from any CRL distribution points that are found, and write them to /crls/crls.pem. That file is supplied to haproxy to use for mTLS client auth.

This PR requires that the router deployment include an EmptyDir mount for the /crls directory

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants