
capture the optional managed identities proposal using API Model. #1037


Conversation

@machi1990 (Contributor) commented May 15, 2025

{
 ...rest of cluster spec
 "azure": {
   "required_operators_authentication": {
     "managed_identities": {
       "managed_identities_data_plane_identity_url": "https://dummyhost.identity.azure.net",
       "control_plane_operators_managed_identities": {
         "operator1": {
           "resource_id": "example_resource_id"
         }
       },
       "data_plane_operators_managed_identities": {
         "example_operator": {
           "resource_id": "example_resource_id"
         }
       },
       "service_managed_identity": {
         "resource_id": "example_resource_id"
       }
     }
   },
   "image_registry": {
     "state": "enabled",
     "control_plane_identity": {
       "resource_id": "example_resource_id"
     },
     "data_plane_identity": {
       "resource_id": "example_resource_id"
     }
   }
 }
}

The PR captures the above proposal as stated in the DDR around optional managed identities.
However, there are a few challenges with the current proposal:

  • If, in the future, an operator is no longer required all the time, it'll be an API-breaking change, as we'll have to move it from the required_operators_authentication section to its own section as per the current proposal. Do people see a better design to future-proof ourselves? This is the case we're facing now, with the image registry operator taken as an example.
  • How are generic optional features that require Managed/Workload identities designed across different clouds?
    The proposal above designs it as cluster.image_registry; however, the concept of managed identities is tied to Azure.
    Other clouds have different concepts/naming. An alternative to the proposal above would be to have cluster.azure.image_registry and cluster.aws.image_registry, i.e. each cloud section in the cluster's spec would have its own type defining the feature. However, this duplicates the feature definition across different clouds.

Another alternative is cluster.image_registry.aws and cluster.image_registry.azure, i.e. on the feature itself, have configurations per cloud. This duplicates the cloud platform everywhere. Both alternatives are sketched below.
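Sketching both alternatives with placeholder knobs (the field contents below are illustrative, not part of the proposal):

// Alternative A: each cloud section owns the feature definition
{
  "azure": {
    "image_registry": { ... Azure-specific identity knobs }
  },
  "aws": {
    "image_registry": { ... AWS-specific identity knobs }
  }
}

// Alternative B: the feature owns per-cloud configuration
{
  "image_registry": {
    "azure": { ... Azure-specific identity knobs },
    "aws": { ... AWS-specific identity knobs }
  }
}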

I am opening this discussion so that we can iterate on the design and come up with a solid one to be captured in the DDR and adopted across offerings, specifically for aro-hcp optional features design.

@tzvatot @vkareh @miguelsorianod @deads2k @flavianmissi @mbarnes

openshift-ci bot commented May 15, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 15, 2025
openshift-ci bot commented May 15, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: machi1990

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 15, 2025
@@ -0,0 +1,4 @@
struct ClusterImageRegistry {
ControlPlaneIdentity AzureControlPlaneManagedIdentity
@miguelsorianod (Contributor) May 15, 2025

Another related question we have here is: what happens if the cluster image registry functionality is supported not only on Azure but also on AWS? We would need to have here something like:

ClusterImageRegistry:
  State: enabled/disabled
  AWS:
    ControlPlaneIdentity AWSSTSControlPlaneIdentity
    DataPlaneIdentity AWSSTSDataPlaneIdentity
  Azure:
    ControlPlaneIdentity AzureControlPlaneManagedIdentity
    DataPlaneIdentity AzureDataPlaneManagedIdentity

or similar (if we want we could put the cloud providers in a union type inside ClusterImageRegistry, etc...)

(Member)

This is an interesting problem. For completeness, the way we've solved this in ROSA is by providing an array of arbitrary operators that require identity:

{
  "aws": {
    "operator_iam_roles": [
      {"id": "openshift-image-registry-operator", "role_arn": "<identity resource>"}
    ]
  }
}

However, when it comes to any knobs for said operator, they don't live there; they are provided elsewhere on an as-needed basis.

We should probably follow a similar model.

@miguelsorianod (Contributor) May 16, 2025

A similar approach is what's implemented in master currently for aro-hcp. This was implemented before optional functionalities came into play. The difference with what you described is that instead of having an array we have a free-form map where the key is the id of the operator.

I explain the details about that here: #1037 (comment)

The use of a map instead of an array is intentional. The reason is that arrays are not PATCH friendly. JSON Merge Patch doesn’t support adding, removing, or changing array elements (except by explicitly replacing the whole array).
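To illustrate that point (the operator names below are placeholders): with RFC 7386 JSON Merge Patch, a map lets a single PATCH body add one entry and delete another, whereas an array can only be replaced wholesale.

// Map: one merge patch adds "operator2" and removes "operator1" (null deletes a key)
{
  "managed_identities": {
    "operator2": { "resource_id": "new_resource_id" },
    "operator1": null
  }
}

// Array: changing any element means resending the entire array
{
  "operator_iam_roles": [
    { "id": "operator2", "role_arn": "..." },
    ... every other existing element repeated ...
  ]
}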

@machi1990 (Contributor, Author)

I've pushed an update on this in line with what Miguel was suggesting above.

// Defines how the operators of the cluster authenticate to Azure.
// Required during creation.
// Immutable.
OperatorsAuthentication AzureOperatorsAuthentication
RequiredOperatorsAuthentication AzureOperatorsAuthentication
@miguelsorianod (Contributor) May 15, 2025

Even if this remains called OperatorsAuthentication and holds only the core operators' information, the same issue would happen. The name doesn't really change much in that regard.

@machi1990 (Contributor, Author)

+1

(Member)

The name would matter if you group all identities together, regardless of whether they are for core components or optional components.

@miguelsorianod (Contributor) May 16, 2025

The name would matter if you group all identities together, regardless of whether they are for core components or optional components.

Yes, in that case the name would matter, that's correct. That's how the API currently looks in the master branch, if you want to take a look:
That is, all identities are grouped together and there's no differentiation between core and optional components. There is differentiation between control plane and data plane identities, because they are conceptually different, configured in a different way, and receive/return a different set of data. The structure we use to represent the set of identities is a free-form map where the key is an arbitrary name that we define representing that operator/functionality identity, and the value is the associated identity information. The reason a map was chosen is that it allows you to introduce new keys without needing to change the API definition. It is also PATCH-friendly. The set of key names is defined by us.
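A minimal sketch of that shape, with placeholder operator names (the real key set is whatever CS defines):

"azure": {
  "operators_authentication": {
    "managed_identities": {
      "control_plane_operators": {
        "<operator-name>": { "resource_id": "..." }
      },
      "data_plane_operators": {
        "<operator-name>": { "resource_id": "..." }
      }
    }
  }
}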

As you well explained in #1037 (review), there are tradeoffs between the two approaches.

@miguelsorianod (Contributor) commented May 15, 2025

As a side note:

Even though a PR is a good place to leave comments, have them tracked, and have discussions around them, the fact that we are iterating on the custom ocm api model definitions makes it hard to understand how that would actually end up looking: as in how the JSON payload will look, the OpenAPI definition, the Go types, ... None of those are tracked in the repository, and therefore neither in the PR.

@vkareh (Member) left a comment

There are two paths forward here:

  1. Bundle all identities; component knobs go in separate structs
  • Pros:
    • All identities are in the same location
    • Good separation of concerns: Identities belong to the cloud platform, whereas components belong to the cluster.
    • Single location for validating all identities and providing status
    • More consistent with ROSA
  • Cons:
    • Harder to distinguish visually which identities are required
  2. Co-locate identity within the discrete component configuration
  • Pros:
    • Identities belong where they are used
    • Good separation of configuration
    • Easy to visualize what's required based on components
  • Cons:
    • More places for validation/status is cumbersome for client consumption
    • Inconsistent with ROSA

The current option of bundling identities for some components, but co-locating for others is inconsistent and it seems harder to future-proof. Let's not do that and instead let's focus our efforts on improving on one of these two options above. My preference is for the first option, as it's more consistent with how ROSA does it today, but I'm willing to concede based on good arguments for the second option.

@miguelsorianod (Contributor)

(Quoting @vkareh's review above in full.)

I think when considering this it is also relevant to keep cluster upgrades in mind:
When a cluster is upgraded to a new OCP version, aside from changing the version attribute in the API payload, if the new OCP version has new required core operators, the end user will need to introduce the corresponding managed identity information for them. In the same way, if the new OCP version stops using previously existing core operators, the end user would need to remove them when performing the patch. Something similar could happen if for some reason an optional CS functionality stops being supported, although the latter seems unlikely, as it would be a breaking change. A sketch of such an upgrade patch follows below.
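As a hypothetical sketch of that flow (the version and operator names here are invented for illustration), the merge patch for an upgrade that adds one required operator and retires another could look like:

PATCH /api/clusters_mgmt/v1/clusters/{cluster_id}
{
  "version": { "id": "openshift-v4.y.z" },
  "azure": {
    "operators_authentication": {
      "managed_identities": {
        "control_plane_operators": {
          "new-required-operator": { "resource_id": "..." },
          "retired-operator": null
        }
      }
    }
  }
}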

@machi1990 (Contributor, Author)

Thanks @vkareh @miguelsorianod for chiming in.

I am for option 2, which is feature/functionality-centric.

Re: consistency with ROSA: as mentioned in the Slack thread, rosa-hcp isn't consistent with itself. That is, STS IAM roles for operators are configured in the STS operator roles section; however, there is a feature (an optional one, a user can choose to activate it or not) named AuditLog forwarding which follows the design proposed in the DDR, similar to option 2.

Re: status of managed identities: as a user, I care more about the status of the cluster or the extra functionalities I've enabled on it. The cluster already has a "status" object which can be enriched to return the "state of the world at time T" of managed identities or anything else we would like to report. For example, today, as part of inflight checks on MIs (if the MIs don't exist or anything), the status object would report that. Additionally, the inflight check status endpoint provides additional detail on whatever the failure was. I personally find the cluster.status object to be the place where the "state of the world" is reported, not the managed_identities part, which is the spec.

Re: more places for validation: this concern already exists today. The cluster payload is big, and the user/consumer of the API needs to make sure that they are providing a valid payload. In regards to validation, some functionality validations are interconnected through the knobs associated with them. Having the knobs related to a functionality in one place helps quite a lot here and makes it easier for the user, and for CS developers themselves, to perform those validations.
E.g. the image registry case: if everything is in one spec, it is a matter of validating that image registry spec. However, if one knob on managed identity is in the "managed_identities" spec, another knob about feature enablement is in the "ocp_capabilities" spec, and yet another knob is in the "image_registry" spec, validating this becomes hard in my opinion.

Re: a user wanting to know the managed identities associated with a cluster in a central place while looking at the cluster schema: this is an interesting use case that the second option doesn't offer; however, there are solutions to that (e.g. a dedicated endpoint, similar to what ROSA offers, that returns the list of all operator roles of a given cluster) if this really becomes an issue from the visualisation point of view. Also, this use case doesn't address the next follow-up question a user might ask themselves: why do I have this MI? Which feature/functionality is it needed for? The answer can be "for this operator identified by this name", but operator names change or could change, so it is important (from my point of view) that a user clearly sees the presence of a certain managed identity from a functionality/feature standpoint.

@miguelsorianod (Contributor) commented May 16, 2025

(Quoting @machi1990's comment above in full.)

My thinking is similar to Manyanda's, and it is what I commented in previous discussions.

For me, if we decide to go this route, we need to define how we would provide authentication support when a given optional functionality is supported by multiple cloud providers. That is, imagine the image registry functionality is supported in both AWS- and Azure-based cluster types. In the case of AWS, an AWS STS identity would be required, and in the case of Azure-based clusters, an Azure Managed Identity would be required. How would this look at the CS API level, considering that we currently have azure and aws top-level sections in the Cluster payload too?
In the case of approach 2, we would also need to decide how we would define the API for the "required/core" operators.

@vkareh (Member) commented May 16, 2025

@miguelsorianod

if the new OCP version stops using previously existing core operators the end user would need to remove them when performing the patch

Not strictly true. The identity would be unnecessary, but not removing it would not affect cluster operation/upgrade. In either case, this seems orthogonal to the shape of the API; the user behaviour would exist in either case.

@machi1990

I am for option 2; which is feature/functionality centric.

Yes. I personally like the ergonomics of this approach. However, we need to define it very carefully, since based on @miguelsorianod's comment, it will become awkward for OCP components that require identities and are supported on more than one cloud/platform. That alone is enough to tip me in the other direction.

Can you update this PR to see how it would look in this case?

@miguelsorianod (Contributor) commented May 16, 2025

Not strictly true. The identity would be unnecessary, but not removing it would not affect cluster operation/upgrade.

👍 . That's accurate for the case of the removal. However, I think conceptually that's what we should require. I don't think we should have identity information in the API payload that is not used or supported anymore.

In either case, this seems orthogonal to the shape of the API, the user behaviour would exist in either case.

The user behavior would exist in either case, but we have to think about how the user's experience would change depending on how the API is designed. As in, what changes the user would need to make in the API payload when performing a cluster upgrade via a PATCH: they would need to change this attribute here and that other one there, etc.

@machi1990 (Contributor, Author)

(Quoting @vkareh's comment above in full.)

@vkareh @miguelsorianod I've pushed 2271d41

This gives us a model like

image_registry:
  state: enabled/disabled
  authentication:
    azure:
      managed_identity:
        controlPlaneIdentity:
        dataPlaneIdentity:
    aws:
      sts:
        controlPlaneIamRole:
        dataPlaneIamRole:

@deads2k (Contributor) commented May 16, 2025

The RequiredOperatorsAuthentication will not age well as OCP and HCM evolve. An inability to handle product and service evolution over time blocks the API change until it is resolved. Adding a hold to prevent accidental merge until the upgrade/evolution problem is resolved.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 16, 2025
Comment on lines +10 to +14
// A union type of cloud authentication mechanism.
struct OperatorAuthentication {
Aws AwsOperatorAuthentication
Azure AzureOperatorAuthentication
}
(Contributor)

While placing the union at this level appears to give a function-based API, our experience (and even this PR here) demonstrates differences between the offerings that we need to admit to ourselves. Having a top-level union instead of a per-component one makes a future refactor that fully separates Azure and AWS cleaner. Such a top-level split does not preclude embedding the same type for top-level configuration, but it does provide two clearly separate schemas for evolution.

@machi1990 (Contributor, Author)

If I understand correctly, if we are to go with option (2), the way you'd address the cloud concern is by placing per-component knobs under the top-level cloud section, i.e.

// the cluster spec
{
  ...
  "azure": {
    "image_registry": {
      ... // image registry knobs
    }
  }
}

The reason being that it'll make a future split of the services easier - the world where aro-hcp has its own CS, separate from rosa-hcp?

(Contributor)

Yes, for both what I intend to express in the API and the reasoning behind it. I think our experience with trying to have a unified API has aged not-so-well, and we should be willing to reflect that in our API and code.

@machi1990 (Contributor, Author)

I see, and thanks for clarifying. That's an approach we could take, and it has been considered as well. It was mentioned as a possible alternative in the PR's description and left as a comment in the first commit of the PR:

// We could move it to the .Azure field as it contains Azure specific things.

@deads2k (Contributor) commented May 16, 2025

I am for option 2; which is feature/functionality centric.
Yes. I personally like the ergonomics of this approach. However, we need to define it very carefully, since based on @miguelsorianod's comment, it will become awkward for OCP components that require identities and are supported on more than one cloud/platform. That alone is enough to tip me in the other direction.

Thinking through adding additional identities, a choice of adding them per-component requires

  1. components remain stable over time and don't move. This isn't true for OCP, we move parts of components all the time.
  2. every time a component is added that needs an identity, additional validation wiring must be added to suit
  3. any status reporting code built to ensure the validity of identity post-API call also needs to be updated
  4. every time a component is added, even if it has no other configuration, it must have an API change.
  5. there is no mechanism to remove a component when it is EOL'd because that produces an incompatibility that makes moving to new schemas more difficult

Can it be done? Yes. The con-list is notably longer than listed above though.

@machi1990
Copy link
Contributor Author

machi1990 commented May 16, 2025

Thanks @deads2k for chiming in and additional insights.

components remain stable over time and don't move. This isn't true for OCP, we move parts of components all the time.

The per-component API won't have the operator name in it. It'll be a functionality. If the operator name in OCP changes or moves, as long as the desired functionality is there and can be leveraged, I don't see the move having any effect to the API.

every time a component is added that needs an identity, addition validation wiring must be added to suit

This can be true as well with a centralised list for the case where an optional identity is added. e.g what do you do when the identity is provided but the feature not enabled? What do you when the identity is not provided when the feature is enabled?

any status reporting code built to ensure the validity of identity post-API call also needs to be updated

Same reasoning as above. With the centralised list of identities, if it is an optional identity, you'll need to report it's status conditionally depending on the feature being enabled or not.

  4. every time a component is added, even if it has no other configuration, it must have an API change.

True. Taking care of these kinds of issues was one of the motivations for originally using a map struct for identities in the cluster.azure.operators_authentication.managed_identities field; this was a problem that both the RP and CS wanted to solve. However, with some functionality being optional, and in consequence their operators too, we noticed that if we stick with this approach, for a given functionality we'd end up with knobs related to that functionality dispersed everywhere - which has its cons.

  5. there is no mechanism to remove a component when it is EOL'd because that produces an incompatibility that makes moving to new schemas more difficult

An API deprecation policy is what will address this best.
However, I think this con also applies to the centralised list approach when the EOL'd component had extra knobs defined elsewhere in the cluster's spec.

@deads2k (Contributor) commented May 16, 2025

5. there is no mechanism to remove a component when it is EOL'd because that produces an incompatibility that makes moving to new schemas more difficult

An API deprecation policy is what will address this best. However, I think this con also applies to the centralised list approach when the EOL'd component had extra knobs defined elsewhere in the cluster's spec.

While I can appreciate that the current API compatibility is treated fast and loose, with a map (or other arbitrary list) no schema change is required to be communicated to clients, nor do all clients have to update to serialize properly. The more widespread the clients, the harder the required shifts.

4. every time a component is added, even if it has no other configuration, it must have an API change.

(Quoting @machi1990's reply above in full.)

The only listed con is "Harder to distinguish visually which identities are required". Making the change looks a lot like "the grass must be greener over there". Think carefully about motivations before taking a plunge with more downsides than the current technique.

@deads2k (Contributor) commented May 16, 2025

(Quoting @miguelsorianod's side note above.)

Good points. Have a look at

  1. make it possible to have separate serialization API types and the client to access them ocm-api-metamodel#219
  2. Move generator for clientapi to repo where the model is defined #1024
  3. move JSON structs (serialization) to ocm-api-model ocm-sdk-go#1037

Which do that coalescing and move the openapi and json types into this repo, while using type aliases to avoid causing problems for ocm-sdk-go consumers. I'm anticipating that landing early this summer.

@vkareh (Member) commented May 16, 2025

Thanks for this good discussion, folks!

Some things I can see coming from this:

  1. Using {component: {aws: {}, azure: {}}} will get cumbersome fast. In fact, it already is cumbersome to think/type in that dimension. If we go the way of de-centralized identities, the way I would split it would look more like
{
  "azure": {
    "component": {
      "identity": "..."
    }
  },
  "component": {
    "knob": "..."
  }
}

It separates product functionality from cloud requirement.

  2. Using the required* prefix in the attribute is a hard no. It breaks both options 1 and 2, breaks future-proofing of the API, and confuses the heck out of me when thinking of "optional components that have required identities". 😜

  3. Whether identities are part of a required or an optional component doesn't matter, since a component is either enabled or not, and the identity is therefore required for that component. Therefore, there is no distinction between those identities, and so they should either co-exist in the same struct, or live under each discrete component (caveat: see 1. above).

  4. Based on 1. and 3. above, I'm leaning towards co-locating all identities in the same location. Therefore, I would like to see how this API would look. @machi1990 can you update the API to reflect this, or at least post a sample snippet? That is, unless we're at the point where options 1 and 2 in my original comment start looking very much the same in light of my comment 1. above?

@machi1990 (Contributor, Author)

(Quoting the exchange with @deads2k above about EOL'd components.)

@deads2k The point in my statement (quoted below)

I think this con also applies to the centralised list approach when the EOL'd component had extra knobs defined elsewhere in the cluster's spec.

is that, if a component is EOL'd, no schema change is required for the centralised approach (option 1) if and only if the component didn't have extra configurations, i.e. the only configuration for it was the identity information.
Consider this sample schema that mimics option 1:

{
  "azure": {
    "operators_authentication": {
      "managed_identities": {
        "control_plane_operators": {
          "component1": { // this component has extra configs
            "resource_id": "foo-bar"
          }
        }
      }
    }
  },
  "component1ExtraConfigs": {
    "config1": ...,
    "config2": ...
  }
}

If component1 is EOL'd, the .azure.operators_authentication.managed_identities.control_plane_operators map gives you the benefit of no schema change, but you have a schema change anyway for the case of ".component1ExtraConfigs". So yes, what you said holds only for a component whose configuration is only the identity information.

@machi1990 (Contributor, Author)

@vkareh

Based on 1. and 3. above, I'm leaning towards co-locating all identities in the same location. Therefore, I would like to see how this API would look. @machi1990 can you update the API to reflect this, or at least post a sample snippet? That is, unless we're at the point where options 1 and 2 in my original comment start looking very much the same in light of my comment 1. above?

By "I'm leaning towards co-locating all identities in the same location." - that is co-locating only the identity information in one place and the component related information in a different place? i.e pretty much option (1) from #1037 (review)

If so, the snippet of that API would look like:

{
  "azure": {
    "operators_authentication": {
      "managed_identities": {
        "control_plane_operators": {
          "image-registry": {
            "resource_id": "foo-bar"
          }
          ... extra control plane components identities
        },
        "data_plane_operators": {
          "image-registry": {
            "resource_id": "bazinga"
          }
          ... extra data plane components identities
        }
      }
    }
  },
  "image-registry": {
    "state": "enabled"
  }
}

The snippet assumes that we'll get rid of the .capabilities section for enablement/disablement of components. If we stick with .capabilities, the snippet would look like:

{
  "azure": {
    "operators_authentication": {
      "managed_identities": {
        "control_plane_operators": {
          "image-registry": {
            "resource_id": "foo-bar"
          }
          ... extra control plane components identities
        },
        "data_plane_operators": {
          "image-registry": {
            "resource_id": "bazinga"
          }
          ... extra data plane components identities
        }
      }
    }
  },
  "capabilities": {
    "disabled": []
  }
}

@miguelsorianod (Contributor) commented May 16, 2025

Using {component: {aws: {}, azure: {}}} will get cumbersome fast. In fact, it already is cumbersome to think/type in that dimension. If we go the way of de-centralized identities, the way I would split it would look more like

Doesn't the same apply to the other way around? With the other way around you will have:

{
aws:
    functionality1:
    functionalityN:
azure:
    functionality1:
    functionalityN:
}

which is equivalent and is subject to the same arguments you raised.

On top of that, now you will have components that do not have cloud-specific configuration, and you will end up with something like:

{
aws:
    functionality1:
    functionalityN:
azure:
    functionality1:
    functionalityN:
functionalityM:
}

But the other way around, you don't have the functionality configuration separate, so the argument that "it is functionally the same as the original option 1" does not hold in that case.

Using the required* prefix in the attribute is a hard no. It breaks both options 1 and 2, breaks future-proofing of the API, and confuses the heck out of me when thinking of "optional components that have required identities". 😜

Using required* as the name is not what breaks the API in option 2. What breaks option 2 is the fact that we have a section used to designate a set of operators that are required; the breaking aspect is the existence of that section, where over time some elements in it no longer apply. The name is independent of that; using required just makes it less accurate. Just having the section named "operators_authentication" without the required part would still break option 2. Do you agree with that assessment? If so, the discussion with that option is not about the name; it is about finding an alternative that does not leverage that section.

@mbarnes commented May 16, 2025

From an outsider perspective, it seems to me like the central problem here is how to express the role of an identity in the cluster. Each OpenShift version has a discrete set of roles for identities; some required, some not.

I had an alternate idea to the above proposals. Bear with me because it's not fully baked yet.

Instead of trying to cram this all into the cluster API, what if these roles were exposed as a new set of pre-defined top-level endpoints? I'll call them "identity roles". An identity role and its endpoint might look something like:

GET /api/clusters_mgmt/v1/identity_roles/{identity_role_id}
{
    "required": true/false,
    "cloud_providers": [
        CloudProviderLink,
        ...
     ],
     "cluster_plane": "control"/"data",
     "purpose": operator or capability name (whatever we decide on)
}

Individual identity roles are static. As new OpenShift versions introduce new operators/functionalities that require identities, Cluster Service could publish new identity roles. Similarly, if a previously required identity role became optional in later versions (or vice versa), this would entail a new identity role.

The schema for /api/clusters_mgmt/v1/versions/{version_id} could then be extended to return a list of identity roles appropriate for that version across all cloud providers:

GET /api/clusters_mgmt/v1/versions/openshift-v4.18.1
{
   ...
   "identity_roles": [
      IdentityRoleLink,
      ...
   ]
}

Then when a user POSTs a cluster manifest, the "managed_identities" section could just be a list (or map) of identity role ID and resource ID pairings.

"azure": {
   "operators_authentication": {
      "managed_identities": [
         {
            "role": "identity-role-1",
            "resource_id": "foo-bar"
         },
         {
            "role": "identity-role-2",
            "resource_id": "bazinga"
         },
         ...
      ]
   }
}, 

I'm not sure if identity role IDs should be meaningful names or random like cluster IDs. I'm leaning toward random, just because naming is hard. It would require some effort on the user's part to look up identity roles for a given OpenShift version, but that level of effort seems inescapable whatever we settle on.

Maybe there's a seed of a workable idea here. I'm not as familiar with the design constraints on this API as the rest of you. I'm just trying to break our thinking out of the box we seem to be stuck in.

@vkareh (Member) commented May 26, 2025

To follow up on an example of a feature that contains cloud-specific configurations, we have etcd encryption. Today the model looks like:

{
  "etcd_encryption": true,
  "aws": {
    "etcd_encryption": {
      "kms_key_arn": "..."
    }
  },
  "gcp_encryption_key": {
    "kms_key_service_account": "..."
    "key_location": "...",
    "key_name": "...",
    "key_ring": "..."
  }
}

Under the guidelines I outlined, this is largely correct (even though it should be nested in the case of GCP and ideally we refrain from using a boolean for the top-level config). If we follow the guidelines, the API should look like:

{
  "etcd_encryption": {
    "state": "enabled"
  },
  "aws": {
    "etcd_encryption": {
      "kms_key_arn": "..."
    }
  },
  "azure": {
    "managed_identities": {
      "etcd_encryption": {
        "resource_id": "..."
      }
    }
  },
  "gcp": {
    "etcd_encryption": {
      "kms_key_service_account": "..."
      "key_location": "...",
      "key_name": "...",
      "key_ring": "..."
    }
  }
}

@machi1990 (Contributor, Author)

(Quoting @vkareh's comment above in full.)

The way I see it, the struct will look more like the one below. From the struct we can clearly see how distributed the feature-related knobs are, which imposes a high cognitive load when reasoning about the data model; the load falls both on users of the API and on its maintainers (CS developers).
This is a tradeoff and one of the implications we'll be taking on with option (1), and it is worth capturing that in the guideline.

{
  "etcd_encryption": {
    "state": "enabled"
  },
  "aws": { 
     "sts": {
        "operator_iam_roles": [
          .... {
            "name": "kms-provider",
            "role_arn": "kms-role-arn",
            ....
          }
       ]
     },
    "etcd_encryption": {
      "kms_key_arn": "..."
    }
  },
  "azure": {
    "etcd_encryption": {
        "foo_bar_ksm_key": {
          ...Azure KV related info
        }
      },
    "managed_identities": {
      "control_plane_operators": {
         ...
        "kms": {
            "resource_id": ....
          }
       }, 
      "data_plane_operators": {
        ....
      }
    }
  },
  "gcp": {
    "etcd_encryption": {
      "kms_key_service_account": "..."
      "key_location": "...",
      "key_name": "...",
      "key_ring": "..."
    }
  }
}

@machi1990 machi1990 requested a review from vkareh June 2, 2025 08:53
@vkareh (Member) commented Jun 2, 2025

@machi1990

The way I see it, the struct will be more like below. [...]

Yes, this is what I had in mind. There might be ways of improving on the cognitive load issue you mention by rearranging some things inside the cloud component (possibly grouping all identities closer together under .azure.managed_identities?), but what you posted seems correct to me.

@machi1990 (Contributor, Author)

@machi1990

The way I see it, the struct will be more like below. [...]

Yes, this is what I had in mind.

Thank you for confirming.

There might be ways of improving on the cognitive load issue you mention by rearranging some things inside the cloud component (possibly grouping all identities closer together under .azure.managed_identities?), but what you posted seems correct to me.

I don't think any possible arrangement in the cloud component will make the situation better. In my opinion, the distributed nature of the knobs for a certain feature is what adds to the cognitive load when reasoning about the feature, and I only see option (2) as a solution for that, at the "cost" of potentially duplicating the cloud across different features. Between the two, the choice was to go for option (1) in this regard. This is one implication.

@vkareh @tzvatot there are other, broader cases highlighted in #1037 (comment); please take a look at that comment and chime in.

@tzvatot (Contributor) commented Jun 3, 2025

From what I see in #1037 (comment), I feel the current guidelines fulfil their requirements.

@miguelsorianod (Contributor) commented Jun 3, 2025

OK, so if you think that's the correct design, and you agree with Manyanda's example in #1037 (comment), including the key name when you specify the data of the etcd-related identity,

then I think the way forward with this MR is to close it, because that's the design that is already in the main branch regarding the Azure managed identities area. You can take a look at the main branch to see how it is defined.

Can you confirm that we are OK with this way forward? If that's the case, we'll close the MR without merging.

@machi1990 (Contributor, Author)

Below is the schema of what we have in main. The schema captures the configuration of etcd encryption for the 3 clouds we support.

For the aro-hcp case, cluster.azure.operators_authentication.managed_identities contains two subsections: control_plane_operators and data_plane_operators. These sections are represented by a Map struct, the key being the operator name.

{
  "etcd_encryption": {
    "state": "enabled"
  },
  "aws": { 
     "sts": {
        "operator_iam_roles": [
          .... {
            "name": "kms-provider",
            "role_arn": "kms-role-arn",
            ....
          }
       ]
     },
    "etcd_encryption": {
      "kms_key_arn": "..."
    }
  },
  "azure": {
    "etcd_encryption": {
        "foo_bar_ksm_key": {
          ...Azure KV related info
        }
      },
      "operators_authentication": {
           "managed_identities": {
                "control_plane_operators": { // this is a Map struct with key being the operator
                   ...
                  "kms": { // here the key is the operator name
                      "resource_id": ....
                    }
                }, 
                "data_plane_operators": { // this is a Map struct with key being the operator name
                  ....
                }
             }
       }
  },
  "gcp": {
    "etcd_encryption": {
      "kms_key_service_account": "..."
      "key_location": "...",
      "key_name": "...",
      "key_ring": "..."
    }
  }
}

@vkareh @tzvatot if you agree with that, I'll close the PR.

@machi1990 machi1990 requested a review from tzvatot June 3, 2025 12:09
@tzvatot (Contributor) commented Jun 3, 2025

"kms": { // here the key is the operator name

Is the operator name "kms"?

@tzvatot (Contributor) commented Jun 3, 2025

The schema captures the configuration of etcd encryption for the 3 clouds we support.

Schema looks good to me. @vkareh ?

@machi1990 (Contributor, Author)

"kms": { // here the key is the operator name

Is the operator name "kms"?

Yes. The name of the key in the payload will be whatever is in this config file: https://github.com/Azure/ARO-HCP/blob/main/cluster-service/deploy/templates/azure-operators-managed-identities-config.configmap.yaml#L89 (the keys in the config there are the operator names).

@machi1990 machi1990 closed this Jun 5, 2025
@machi1990 machi1990 deleted the image-registry-proposal-as-optional-functionality branch June 6, 2025 06:49
mbarnes pushed commits to Azure/ARO-HCP that referenced this pull request (Jun 26 – Jul 3, 2025):

We decided to structure optional features differently after a
lengthy discussion in
openshift-online/ocm-api-model#1037

This keeps the nice test framework in ocm_test.go even though
it doesn't do much now. This can be expanded upon in the future.