
Contributor

@kflynn kflynn commented Jul 23, 2025

This is a GEP for an extra-minimal out-of-cluster Gateway (OCG) API, intended not to be production-ready but to permit experimentation.

/kind gep

Fixes #3951

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/gep PRs related to Gateway Enhancement Proposal(GEP) cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 23, 2025
@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 23, 2025
Comment on lines +172 to +186
- The trust bundle
in the Gateway resource
will define the CA certificate(s)
that the OCG
should accept as trusted
when validating connections
from meshed peers.

- The trust bundle
in the Mesh resource
will define the CA certificate(s)
that the mesh
should accept as trusted
when validating connections
from the OCG.

Can we add a bit more info around why we chose to have the GW trust bundle in the Mesh resource and the Mesh trust bundle in the GW resource?

While I agree that this model is simpler, IMO it's worth mentioning the alternatives.
I added some info around this in https://github.com/kubernetes-sigs/gateway-api/pull/3941/files#diff-4a7d8011b2ad7222ce2d13ee98f49443d6eb56518625438daa62c10e94d9f772R279-R489,
specifically proposals 1 and 2.

With the proposed model, is the mesh trust bundle duplicated in every single Gateway resource?
(Is there a common Gateway config somewhere?)

Contributor Author

The proposed model is that each Gateway gets its own trust bundle. We may want to consider having a default in the GatewayClass, but this is the extra minimal API so it's not there yet.

Contributor Author

And I agree about alternatives -- I had it in my head that in many cases I should move them into GEP-3792 itself, but thinking about it while the sun is up, that seems silly. 🙂 Will update.
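To make the two trust directions concrete, here is a toy Python sketch. It assumes certificates are reduced to subject and issuer names, with no real X.509 parsing or signature checking, and models "ultimately signed by" as walking the issuer chain until it reaches a CA listed in the relevant bundle. `Cert`, `chain_is_trusted`, and `validating_bundle` are hypothetical names, not part of the proposed API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cert:
    """Toy certificate: only subject and issuer names, no keys or signatures."""
    subject: str
    issuer: str


def chain_is_trusted(chain: list, bundle: set) -> bool:
    """Return True if `chain` (leaf first) links each certificate to the next
    by issuer/subject name and the last certificate was issued by a CA listed
    in `bundle`. A real verifier must also check signatures, validity dates,
    key usage, etc.; this only models "ultimately signed by"."""
    if not chain:
        return False
    for child, parent in zip(chain, chain[1:]):
        if child.issuer != parent.subject:
            return False  # the chain is broken
    return chain[-1].issuer in bundle


def validating_bundle(validator: str, gateway_bundle: set, mesh_bundle: set) -> set:
    """Select which resource's trustBundle a side uses to validate its peer."""
    if validator == "ocg":
        # Gateway trustBundle: CAs the OCG trusts when validating meshed peers.
        return gateway_bundle
    if validator == "mesh":
        # Mesh trustBundle: CAs the mesh trusts when validating the OCG.
        return mesh_bundle
    raise ValueError(f"unknown validator: {validator!r}")
```

For example, a workload chain `svc-a → mesh-intermediate → mesh-root-ca` passes when the OCG validates it against a Gateway bundle containing `mesh-root-ca`, and fails against a Mesh bundle that only lists the OCG's CA.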

Comment on lines +151 to +155
trustBundle:
  name: mesh-trust-bundle
  namespace: mesh-namespace
  # Key in ConfigMap; defaults to "ca-bundle.crt"
  bundleKey: ca-bundle.crt

I personally think ClusterTrustBundle is a good fit for this. Can we mention that we intend to support ClusterTrustBundle in the future? Or do you see some fundamental problem with it?

Member

@robscott robscott Jul 24, 2025

+1, I'd much rather start with ClusterTrustBundle as the recommendation where available, and ConfigMap as an optional backfill where it's not. I know that's not great now, but by the time this API is stable/GA, I'm guessing ClusterTrustBundle will be much more widely available.
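A Python sketch of resolving the `trustBundle` reference quoted above. It assumes ConfigMaps are supplied as a plain dict keyed by `(namespace, name)`, that `namespace` is present (as in the Mesh snippet), and that an omitted `bundleKey` falls back to `ca-bundle.crt` as the snippet's comment says. `resolve_trust_bundle` and `split_pem_bundle` are hypothetical helpers; a real controller would read the ConfigMap (or a ClusterTrustBundle, per the suggestion above) from the API server and parse the certificates properly.

```python
import re

# Default key named in the snippet's comment.
DEFAULT_BUNDLE_KEY = "ca-bundle.crt"

# Matches one PEM certificate block, including its BEGIN/END markers.
PEM_CERT_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----\s.*?-----END CERTIFICATE-----", re.DOTALL
)


def split_pem_bundle(pem: str) -> list:
    """Split a PEM bundle into its individual certificate blocks."""
    return PEM_CERT_RE.findall(pem)


def resolve_trust_bundle(ref: dict, configmaps: dict) -> list:
    """Return the CA certificates referenced by a trustBundle stanza."""
    key = ref.get("bundleKey", DEFAULT_BUNDLE_KEY)
    data = configmaps.get((ref["namespace"], ref["name"]))
    if data is None:
        raise LookupError(f"ConfigMap {ref['namespace']}/{ref['name']} not found")
    if key not in data:
        raise KeyError(f"key {key!r} missing from ConfigMap {ref['name']}")
    certs = split_pem_bundle(data[key])
    if not certs:
        raise ValueError("trust bundle contains no certificates")
    return certs
```

A bundle may legitimately hold several CA certificates (for example during a root rotation), which is why the sketch returns a list rather than a single certificate.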

Comment on lines +157 to +158
matchLabels:
  mesh: one-mesh-to-mesh-them-all

Just to confirm: is this selecting the Mesh resource, or the Routes or Namespaces?
I think it's the latter, but `one-mesh-to-mesh-them-all` is confusing me because you have that set on the Mesh resource in the Mesh GEP.

Also, can you consider calling out some alternative mechanisms?
https://github.com/kubernetes-sigs/gateway-api/pull/3941/files#diff-4a7d8011b2ad7222ce2d13ee98f49443d6eb56518625438daa62c10e94d9f772R638-R730.
It doesn't have to be this, but I think it's useful to document the alternatives.

Member

If lines 142-144 still reflect the current state, it looks like this selects Routes. I left another comment on those lines. We also had a different comment thread on the original PR.

My strong preference is to go with namespace, and potentially provide opt-out for services.
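The namespace-first model preferred above could be sketched in Python roughly as follows. It assumes `matchLabels`-only selection of namespaces and uses a hypothetical per-Service opt-out label, `example.com/mesh-opt-out`, which is not part of any proposal.

```python
# Hypothetical per-Service opt-out label; not part of any proposed API.
OPT_OUT_LABEL = "example.com/mesh-opt-out"


def match_labels(selector: dict, labels: dict) -> bool:
    """matchLabels semantics: every selector key must be present with an equal value."""
    return all(labels.get(key) == value for key, value in selector.items())


def service_is_meshed(selector: dict, namespace_labels: dict, service_labels: dict) -> bool:
    """Namespace-first selection: a Service is meshed when its namespace matches
    the selector, unless the Service carries the (hypothetical) opt-out label."""
    if service_labels.get(OPT_OUT_LABEL) == "true":
        return False
    return match_labels(selector, namespace_labels)
```

The design choice here is that the namespace decides by default and individual Services only ever subtract from that decision, which keeps the common case to a single label on the namespace.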

Comment on lines +293 to +312
The extra-minimal API
solves this problem
by adding a label selector
to the Gateway resource
that indicates which Routes
are meshed.
When the OCG connects
to any Route
that either directly matches this selector,
or is in a namespace that matches this selector,
it MUST use mTLS
with a certificate
that is ultimately signed
by a CA certificate
in the Mesh resource's `trustBundle`,
and the OCG MUST validate
that the peer presents a certificate
that is ultimately signed
by a CA certificate
in the Gateway resource's `trustBundle`.

A couple of open questions (copied from my #3941):

  1. How to configure the Gateway's identity certificate and private key?

This GEP defines how the OCG and the mesh should be configured to trust each
other by exchanging CA bundles. However, it does not standardize how an
administrator configures the specific client certificate and private key that
the OCG uses to identify itself to the mesh.

Currently, this is left as an implementation detail, likely handled via a
provider-specific CRD referenced from the GatewayClass or through an out-of-band
mechanism. The open question is: Should a future version of this GEP standardize
this configuration to ensure a consistent user experience? This could involve
adding a new identityCertificateRef field to the Gateway spec.

  2. Should use cases where mesh workloads disable mTLS be supported?

This GEP focuses on meshes where mTLS is strictly enforced for
communication. However, some service meshes support a "DISABLE" mode where mTLS
can be disabled for certain workloads. This raises the question: How should an
OCG behave when a target workload is discovered as "meshed" but does not require
or accept an mTLS connection?

Contributor

> Should a future version of this GEP standardize this configuration to ensure a consistent user experience?

I'm not certain there's much value in standardizing the inner mechanism here as long as the high-level interface is sufficient for implementations to configure mTLS.

> How should an OCG behave when a target workload is discovered as "meshed" but does not require
> or accept an mTLS connection?

This might have some overlap with #3876
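A Python sketch of the rule in the quoted GEP text: the OCG must use mTLS when the Route, or the Route's namespace, matches the Gateway's label selector. It assumes `matchLabels`-only selectors and reports which resource's `trustBundle` anchors each side of the handshake; `UpstreamTLS` and `upstream_tls` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UpstreamTLS:
    """What the OCG must do when connecting to a Route's backends."""
    mtls: bool
    # Resource whose trustBundle must anchor the OCG's client certificate.
    client_cert_anchor: Optional[str] = None
    # Resource whose trustBundle the OCG uses to validate the peer.
    peer_validation_anchor: Optional[str] = None


def match_labels(selector: dict, labels: dict) -> bool:
    """matchLabels semantics; an empty selector matches everything."""
    return all(labels.get(key) == value for key, value in selector.items())


def upstream_tls(selector: dict, route_labels: dict, namespace_labels: dict) -> UpstreamTLS:
    """mTLS is mandatory if the Route, or its namespace, matches the selector."""
    if match_labels(selector, route_labels) or match_labels(selector, namespace_labels):
        return UpstreamTLS(
            mtls=True,
            client_cert_anchor="Mesh",         # OCG cert must chain to a CA in the Mesh trustBundle
            peer_validation_anchor="Gateway",  # peer cert must chain to a CA in the Gateway trustBundle
        )
    return UpstreamTLS(mtls=False)
```

With this sketch an empty selector matches every Route, mirroring Kubernetes' empty-`LabelSelector` semantics; an implementation could reasonably choose to treat an empty selector as "nothing is meshed" instead.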

@kflynn kflynn mentioned this pull request Jul 25, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kflynn
Once this PR has been reviewed and has the lgtm label, please assign danwinship for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2025
kflynn added 3 commits July 26, 2025 09:53
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2025
@k8s-ci-robot
Contributor

@kflynn: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-gateway-api-verify | d17ffa6 | link | true | `/test pull-gateway-api-verify` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines +21 to +31
[GEP-3792] defines the rationale
for allowing out-of-cluster Gateways (OCGs)
to participate in a
GAMMA-compliant in-cluster service mesh,
and the problems that must be solved
to allow them to do so.
This GEP defines
an extremely minimal API
to permit experimentation
with OCGs and
in-cluster mTLS meshes.
Member

This is odd formatting. Is this a new formatter or something?

Contributor

Refs discussion at #3950 (comment). I personally find it harder to read, even if it does make edit diffs clearer.

Comment on lines +142 to +144
- a `labelSelector` field
that indicates which Routes
are meshed.
Member

I think we had a similar comment thread on the original PR: #3894 (comment).

I am unclear why we need to select Routes; namespace selection seems like it would cover 90% of the cases, and we can opt services in OR out.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 17, 2025
@k8s-ci-robot
Contributor

PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

@mikemorris mikemorris left a comment

Finally got around to reading through this, left a few high-level notes.

...
spec:
  ...
  ocg:
Contributor

@mikemorris mikemorris Oct 13, 2025

Suggested change: replace `ocg:` with `trustFederation:`

`ocg` is a dense acronym that would be unfamiliar to most users; I would suggest spelling out the field name and using more familiar language. Additionally, naming this more generically might unlock additional use cases under the broader "trust domain federation" umbrella, such as transitioning a mesh over to a new root, migrating a mesh deployment to a new cluster, or extending trust to other off-cluster resources, such as "mesh expansion" for connectivity with VMs using their own PKI root (and I'm intentionally not even citing "mesh federation" yet).

#### Additions to the Gateway Resource

The Gateway resource
gains a `mesh` stanza
Contributor

This again feels too narrowly scoped, and likely more controversial than splitting the trust federation capability into some way to specify a bundle under https://gateway-api.sigs.k8s.io/reference/spec/#gatewaytlsconfig and handling the mesh scoping entirely within the Mesh resource through something like a namespace selector, as suggested by @LiorLieberman (with some consideration for how to handle exclusion, likely via matchExpressions).

(Also the Gateway trust bundle scope feels relatively duplicative of #91?)
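The namespace-selector-with-exclusion idea above can be sketched in Python as an evaluator for the subset of Kubernetes `LabelSelector` semantics it would need: `matchLabels` plus `matchExpressions` with the `In`, `NotIn`, `Exists`, and `DoesNotExist` operators. This is an illustration, not the apimachinery implementation; the example selector and labels are made up.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """Evaluate a Kubernetes-style LabelSelector against a label map.
    All requirements are ANDed; an empty selector matches everything."""
    for key, value in selector.get("matchLabels", {}).items():
        if labels.get(key) != value:
            return False
    for expr in selector.get("matchExpressions", []):
        key, op = expr["key"], expr["operator"]
        values = expr.get("values", [])
        present = key in labels
        if op == "In":
            ok = present and labels[key] in values
        elif op == "NotIn":
            ok = not present or labels[key] not in values  # absent key satisfies NotIn
        elif op == "Exists":
            ok = present
        elif op == "DoesNotExist":
            ok = not present
        else:
            raise ValueError(f"unsupported operator: {op}")
        if not ok:
            return False
    return True


# Example: mesh every namespace labeled mesh=enabled except "legacy".
EXAMPLE_SELECTOR = {
    "matchLabels": {"mesh": "enabled"},
    "matchExpressions": [
        {"key": "kubernetes.io/metadata.name", "operator": "NotIn", "values": ["legacy"]},
    ],
}
```

The exclusion uses the `kubernetes.io/metadata.name` label that recent Kubernetes versions set automatically on namespaces, so no extra labeling is needed to carve out a single namespace.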

Comment on lines +209 to +212
The `namespace` field is required
in the Mesh resource,
but may be omitted
in the Gateway resource
Contributor

Probably better to model these as two different Go types similar to the existing WithNamespace patterns to avoid UX confusion.
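The two-type suggestion could look roughly like the Python sketch below (the real API would use Go types). The class names are hypothetical, chosen to echo the "WithNamespace" naming mentioned above, and the sketch assumes an omitted Gateway-side `namespace` means the Gateway's own namespace; the quoted text does not say what the default is.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_BUNDLE_KEY = "ca-bundle.crt"


@dataclass(frozen=True)
class TrustBundleRefWithNamespace:
    """Hypothetical Mesh-side reference: namespace is required."""
    name: str
    namespace: str
    bundle_key: str = DEFAULT_BUNDLE_KEY

    def __post_init__(self):
        if not self.namespace:
            raise ValueError("namespace is required")


@dataclass(frozen=True)
class TrustBundleRef:
    """Hypothetical Gateway-side reference: namespace may be omitted."""
    name: str
    namespace: Optional[str] = None
    bundle_key: str = DEFAULT_BUNDLE_KEY

    def with_namespace(self, gateway_namespace: str) -> TrustBundleRefWithNamespace:
        # ASSUMPTION: an omitted namespace means the Gateway's own namespace.
        return TrustBundleRefWithNamespace(
            name=self.name,
            namespace=self.namespace or gateway_namespace,
            bundle_key=self.bundle_key,
        )
```

Keeping the types separate means the "namespace is required" rule is enforced by the type itself rather than by documentation, and resolution code can always work with the fully qualified form.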

Comment on lines +284 to +286
In practice, this isn't
actually a question of _workloads_
but of _Routes_:
Contributor

In simple cases this holds, but I think it's actually inaccurate. A single route could absolutely split traffic between a "meshed" backend in one namespace and an "unmeshed" backend in another namespace.
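To illustrate the point about split traffic, here is a Python sketch that decides mTLS per `backendRef` instead of per Route. It assumes meshed-ness is determined only by the backend namespace's labels (`matchLabels` only); the function name and data shapes are hypothetical.

```python
def match_labels(selector: dict, labels: dict) -> bool:
    """matchLabels semantics: every selector key must be present with an equal value."""
    return all(labels.get(key) == value for key, value in selector.items())


def mtls_per_backend(route_namespace: str, backend_refs: list,
                     namespace_labels: dict, selector: dict) -> dict:
    """Decide mTLS for each backendRef of one Route, keyed by "namespace/name".
    Meshed-ness is judged from each backend's namespace, not the Route's."""
    decisions = {}
    for ref in backend_refs:
        # Gateway API backendRefs default to the Route's own namespace.
        ns = ref.get("namespace", route_namespace)
        decisions[f"{ns}/{ref['name']}"] = match_labels(selector, namespace_labels.get(ns, {}))
    return decisions
```

With a Route in `shop` sending 90% of traffic to `cart` and 10% to `legacy/cart-v1`, where only `shop` is labeled as meshed, the sketch yields `{"shop/cart": True, "legacy/cart-v1": False}`, so no single per-Route answer is correct. (A cross-namespace backendRef like that one would also need a ReferenceGrant in Gateway API; the sketch ignores that.)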


Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/gep PRs related to Gateway Enhancement Proposal(GEP) needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Extra-minimal OCG API

6 participants