
capture the optional managed identities proposal using API Model. #1037


Conversation

@machi1990 (Contributor) commented May 15, 2025

{
 ...rest of cluster spec
 "azure": {
   "required_operators_authentication": {
     "managed_identities": {
       "managed_identities_data_plane_identity_url": "https://dummyhost.identity.azure.net",
       "control_plane_operators_managed_identities": {
         "operator1": {
           "resource_id": "example_resource_id"
         }
       },
       "data_plane_operators_managed_identities": {
         "example_operator": {
           "resource_id": "example_resource_id"
         }
       },
       "service_managed_identity": {
         "resource_id": "example_resource_id"
       }
     }
   },
   "image_registry": {
     "state": "enabled",
     "control_plane_identity": {
       "resource_id": "example_resource_id"
     },
     "data_plane_identity": {
       "resource_id": "example_resource_id"
     }
   }
 }
}

The PR captures the above proposal as stated in the DDR around optional managed identities.
However, there are a few challenges with the current proposal:

  • If, in the future, an operator is no longer required all the time, it'll be an API-breaking change, as we'll have to move it from the required_operators_authentication section to its own section as per the current proposal. Do people see a better design to future-proof ourselves? This is the case we're facing now, with the image registry operator taken as an example.
  • How are generic optional features that require Managed/Workload identities designed across different clouds?
    The proposal above designs it as cluster.image_registry; however, the concept of managed identities is tied to Azure.
    Other clouds have different concepts/naming. An alternative to the proposal above would be to have cluster.azure.image_registry and cluster.aws.image_registry, i.e. each cloud section in the cluster's spec would have its own type defining the feature. However, this duplicates the feature definition across different clouds.

Another alternative is cluster.image_registry.aws and cluster.image_registry.azure, i.e. on the feature itself, have configurations per cloud. This duplicates the cloud platform everywhere. Both alternatives are sketched below.
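Sketching both alternatives with placeholder knobs (the field contents below are illustrative, not part of the proposal):

// Alternative A: each cloud section owns the feature definition
{
  "azure": {
    "image_registry": { ... Azure-specific identity knobs }
  },
  "aws": {
    "image_registry": { ... AWS-specific identity knobs }
  }
}

// Alternative B: the feature owns per-cloud configuration
{
  "image_registry": {
    "azure": { ... Azure-specific identity knobs },
    "aws": { ... AWS-specific identity knobs }
  }
}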

I am opening this discussion so that we can iterate on the design and come up with a solid one to be captured in the DDR and adopted across offerings, specifically for aro-hcp optional features design.

@tzvatot @vkareh @miguelsorianod @deads2k @flavianmissi @mbarnes

openshift-ci bot commented May 15, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 15, 2025
openshift-ci bot commented May 15, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: machi1990

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 15, 2025
@@ -0,0 +1,4 @@
struct ClusterImageRegistry {
ControlPlaneIdentity AzureControlPlaneManagedIdentity
@miguelsorianod (Contributor) May 15, 2025

Another related question we have here is: what happens if the cluster image registry functionality is supported not only on Azure but also on AWS? We would need to have here something like:

ClusterImageRegistry:
  State: enabled/disabled
  AWS:
    ControlPlaneIdentity AWSSTSControlPlaneIdentity
    DataPlaneIdentity AWSSTSDataPlaneIdentity
  Azure:
    ControlPlaneIdentity AzureControlPlaneManagedIdentity
    DataPlaneIdentity AzureDataPlaneManagedIdentity

or similar (if we want we could put the cloud providers in a union type inside ClusterImageRegistry, etc...)

(Member)

This is an interesting problem. For completeness, the way we've solved this in ROSA is by providing an array of arbitrary operators that require identity:

{
  "aws": {
    "operator_iam_roles": [
      {"id": "openshift-image-registry-operator", "role_arn": "<identity resource>"}
    ]
  }
}

However, when it comes to any knobs for said operator, they don't live there; they are provided elsewhere on an as-needed basis.

We should probably follow a similar model.

@miguelsorianod (Contributor) May 16, 2025

A similar approach is what's implemented in master currently for aro-hcp. This was implemented before optional functionalities came into play. The difference with what you described is that instead of having an array we have a free-form map where the key is the id of the operator.

I explain the details about that here: #1037 (comment)

The use of a map instead of an array is intentional. The reason is that arrays are not PATCH friendly. JSON Merge Patch doesn’t support adding, removing, or changing array elements (except by explicitly replacing the whole array).
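To illustrate that point (the operator names below are placeholders): with RFC 7386 JSON Merge Patch, a map lets a single PATCH body add one entry and delete another, whereas an array can only be replaced wholesale.

// Map: one merge patch adds "operator2" and removes "operator1" (null deletes a key)
{
  "managed_identities": {
    "operator2": { "resource_id": "new_resource_id" },
    "operator1": null
  }
}

// Array: changing any element means resending the entire array
{
  "operator_iam_roles": [
    { "id": "operator2", "role_arn": "..." },
    ... every other existing element repeated ...
  ]
}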

@machi1990 (Contributor, Author)

I've pushed an update on this in line with what Miguel was suggesting above.

// Defines how the operators of the cluster authenticate to Azure.
// Required during creation.
// Immutable.
OperatorsAuthentication AzureOperatorsAuthentication
RequiredOperatorsAuthentication AzureOperatorsAuthentication
@miguelsorianod (Contributor) May 15, 2025

Even if this remains called OperatorsAuthentication and holds only the core operators' information, the same issue would happen. The name doesn't really change much in that regard.

@machi1990 (Contributor, Author)

+1

(Member)

The name would matter if you group all identities together, regardless of whether they are for core components or optional components.

@miguelsorianod (Contributor) May 16, 2025

The name would matter if you group all identities together, regardless of whether they are for core components or optional components.

Yes, in that case the name would matter, that's correct. That's how the API currently looks in the master branch, if you want to take a look:
That is, all identities are grouped together and there's no differentiation between core and optional components. There is differentiation between control plane and data plane identities, because they are conceptually different, configured in a different way, and receive/return a different set of data. The structure we use to represent the set of identities is a free-form map where the key is an arbitrary name that we define representing that operator/functionality identity, and the value is the associated identity information. The reason a map was chosen is that it allows you to introduce new keys without needing to change the API definition. It is also PATCH-friendly. The set of key names is defined by us.
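A minimal sketch of that shape, with placeholder operator names (the real key set is whatever CS defines):

"azure": {
  "operators_authentication": {
    "managed_identities": {
      "control_plane_operators": {
        "<operator-name>": { "resource_id": "..." }
      },
      "data_plane_operators": {
        "<operator-name>": { "resource_id": "..." }
      }
    }
  }
}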

As you well explained in #1037 (review), there are tradeoffs between the two approaches.

@miguelsorianod (Contributor) commented May 15, 2025

As a side note:

Even though a PR is a good place to leave comments, have them tracked, and have discussions around them, the fact that we are iterating on the custom ocm api model definitions makes it hard to understand how that would actually end up looking: as in how the JSON payload will look, the OpenAPI definition, the Go types, ... None of those are tracked in the repository, and therefore neither in the PR.

@vkareh (Member) left a comment

There are two paths forward here:

  1. Bundle all identities; component knobs go in separate structs
  • Pros:
    • All identities are in the same location
    • Good separation of concerns: Identities belong to the cloud platform, whereas components belong to the cluster.
    • Single location for validating all identities and providing status
    • More consistent with ROSA
  • Cons:
    • Harder to distinguish visually which identities are required
  2. Co-locate identity within the discrete component configuration
  • Pros:
    • Identities belong where they are used
    • Good separation of configuration
    • Easy to visualize what's required based on components
  • Cons:
    • More places for validation/status is cumbersome for client consumption
    • Inconsistent with ROSA

The current option of bundling identities for some components, but co-locating for others is inconsistent and it seems harder to future-proof. Let's not do that and instead let's focus our efforts on improving on one of these two options above. My preference is for the first option, as it's more consistent with how ROSA does it today, but I'm willing to concede based on good arguments for the second option.

@miguelsorianod (Contributor)

(Quoting @vkareh's review above in full.)

I think when considering this it is also relevant to keep cluster upgrades in mind:
When a cluster is upgraded to a new OCP version, aside from changing the version attribute in the API payload, if the new OCP version has new required core operators, the end user will need to introduce the corresponding managed identity information for them. In the same way, if the new OCP version stops using previously existing core operators, the end user would need to remove them when performing the patch. Something similar could happen if for some reason an optional CS functionality stops being supported, although the latter seems unlikely, as it would be a breaking change. A sketch of such an upgrade patch follows below.
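As a hypothetical sketch of that flow (the version and operator names here are invented for illustration), the merge patch for an upgrade that adds one required operator and retires another could look like:

PATCH /api/clusters_mgmt/v1/clusters/{cluster_id}
{
  "version": { "id": "openshift-v4.y.z" },
  "azure": {
    "operators_authentication": {
      "managed_identities": {
        "control_plane_operators": {
          "new-required-operator": { "resource_id": "..." },
          "retired-operator": null
        }
      }
    }
  }
}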

@machi1990 (Contributor, Author)

Thanks @vkareh @miguelsorianod for chiming in.

I am for option 2, which is feature/functionality-centric.

Re: consistency with ROSA: as mentioned in the Slack thread, rosa-hcp isn't consistent with itself. That is, STS IAM roles for operators are configured in the STS operator roles section; however, there is a feature (an optional one, a user can choose to activate it or not) named AuditLog forwarding which follows the design proposed in the DDR, similar to option 2.

Re: status of managed identities: as a user, I care more about the status of the cluster or the extra functionalities I've enabled on it. The cluster already has a "status" object which can be enriched to return the "state of the world at time T" of managed identities or anything else we would like to report. For example, today, as part of inflight checks on MIs (if the MIs don't exist or anything), the status object would report that. Additionally, the inflight check status endpoint provides additional detail on whatever the failure was. I personally find the cluster.status object to be the place where the "state of the world" is reported, not the managed_identities part, which is the spec.

Re: more places for validation: this concern already exists today. The cluster payload is big, and the user/consumer of the API needs to make sure that they are providing a valid payload. In regards to validation, some functionality validations are interconnected through the knobs associated with them. Having the knobs related to a functionality in one place helps quite a lot here and makes it easier for the user, and for CS developers themselves, to perform those validations.
E.g. the image registry case: if everything is in one spec, it is a matter of validating that image registry spec. However, if one knob on managed identity is in the "managed_identities" spec, another knob about feature enablement is in the "ocp_capabilities" spec, and yet another knob is in the "image_registry" spec, validating this becomes hard in my opinion.

Re: a user wanting to know the managed identities associated with a cluster in a central place while looking at the cluster schema: this is an interesting use case that the second option doesn't offer; however, there are solutions to that (e.g. a dedicated endpoint, similar to what ROSA offers, that returns the list of all operator roles of a given cluster) if this really becomes an issue from the visualisation point of view. Also, this use case doesn't address the next follow-up question a user might ask themselves: why do I have this MI? Which feature/functionality is it needed for? The answer can be "for this operator identified by this name", but operator names change or could change, so it is important (from my point of view) that a user clearly sees the presence of a certain managed identity from a functionality/feature standpoint.

@miguelsorianod (Contributor) commented May 16, 2025

(Quoting @machi1990's comment above in full.)

My thinking is similar to Manyanda's, and it is what I commented in previous discussions.

For me, if we decide to go this route, we need to define how we would provide authentication support when a given optional functionality is supported by multiple cloud providers. That is, imagine the image registry functionality is supported in both AWS- and Azure-based cluster types. In the case of AWS, an AWS STS identity would be required, and in the case of Azure-based clusters, an Azure Managed Identity would be required. How would this look at the CS API level, considering that we currently have azure and aws top-level sections in the Cluster payload too?
In the case of approach 2, we would also need to decide how we would define the API for the "required/core" operators.

@vkareh (Member) commented May 16, 2025

@miguelsorianod

if the new OCP version stops using previously existing core operators the end user would need to remove them when performing the patch

Not strictly true. The identity would be unnecessary, but not removing it would not affect cluster operation/upgrade. In either case, this seems orthogonal to the shape of the API; the user behaviour would exist in either case.

@machi1990

I am for option 2; which is feature/functionality centric.

Yes. I personally like the ergonomics of this approach. However, we need to define it very carefully, since based on @miguelsorianod's comment, it will become awkward for OCP components that require identities and are supported on more than one cloud/platform. That alone is enough to tip me in the other direction.

Can you update this PR to see how it would look in this case?

@miguelsorianod (Contributor) commented May 16, 2025

Not strictly true. The identity would be unnecessary, but not removing it would not affect cluster operation/upgrade.

👍 . That's accurate for the case of the removal. However, I think conceptually that's what we should require. I don't think we should have identity information in the API payload that is not used or supported anymore.

In either case, this seems orthogonal to the shape of the API, the user behaviour would exist in either case.

The user behavior would exist in either case, but we have to think about how the user's experience would change depending on how the API is designed. As in, what changes the user would need to make in the API payload when performing a cluster upgrade via a PATCH: they would need to change this attribute here and that other one there, etc.

@machi1990 (Contributor, Author)

(Quoting @vkareh's comment above in full.)

@vkareh @miguelsorianod I've pushed 2271d41

This gives us a model like

image_registry:
  state: enabled/disabled
  authentication:
    azure:
      managed_identity:
        controlPlaneIdentity:
        dataPlaneIdentity:
    aws:
      sts:
        controlPlaneIamRole:
        dataPlaneIamRole:

@deads2k (Contributor) commented May 16, 2025

The RequiredOperatorsAuthentication will not age well as OCP and HCM evolve. An inability to handle product and service evolution over time blocks the API change until it is resolved. Adding a hold to prevent accidental merge until the upgrade/evolution problem is resolved.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 16, 2025
Comment on lines +10 to +14
// A union type of cloud authentication mechanism.
struct OperatorAuthentication {
Aws AwsOperatorAuthentication
Azure AzureOperatorAuthentication
}
(Contributor)

While placing the union at this level appears to give a function-based API, our experience (and even this PR here) demonstrates differences between the offerings that we need to admit to ourselves. Having a top-level union instead of a per-component one makes a future refactor that fully separates Azure and AWS cleaner. Such a top-level split does not preclude embedding the same type for top-level configuration, but it does provide two clearly separate schemas for evolution.

@machi1990 (Contributor, Author)

If I understand correctly, if we are to go with option (2), the way you'd address the cloud concern is by placing per-component knobs under the top-level cloud section, i.e.

// the cluster spec
{
  ...
  "azure": {
    "image_registry": {
      ... // image registry knobs
    }
  }
}

The reason being that it'll make a future split of the services easier - the world where aro-hcp has its own CS, separate from rosa-hcp?

(Contributor)

Yes, for both what I intend to express in the API and the reasoning behind it. I think our experience with trying to have a unified API has aged not-so-well, and we should be willing to reflect that in our API and code.

@machi1990 (Contributor, Author)

I see, and thanks for clarifying. That's an approach we could take, and it has been considered as well. It was mentioned as a possible alternative in the PR's description and left as a comment in the first commit of the PR:

// We could move it to the .Azure field as it contains Azure specific things.

@deads2k (Contributor) commented May 16, 2025

I am for option 2; which is feature/functionality centric.
Yes. I personally like the ergonomics of this approach. However, we need to define it very carefully, since based on @miguelsorianod's comment, it will become awkward for OCP components that require identities and are supported on more than one cloud/platform. That alone is enough to tip me in the other direction.

Thinking through adding additional identities, a choice of adding them per-component requires

  1. components remain stable over time and don't move. This isn't true for OCP, we move parts of components all the time.
  2. every time a component is added that needs an identity, additional validation wiring must be added to suit
  3. any status reporting code built to ensure the validity of identity post-API call also needs to be updated
  4. every time a component is added, even if it has no other configuration, it must have an API change.
  5. there is no mechanism to remove a component when it is EOL'd because that produces an incompatibility that makes moving to new schemas more difficult

Can it be done? Yes. The con-list is notably longer than listed above though.

@machi1990
Copy link
Contributor Author

machi1990 commented May 16, 2025

Thanks @deads2k for chiming in and additional insights.

components remain stable over time and don't move. This isn't true for OCP, we move parts of components all the time.

The per-component API won't have the operator name in it. It'll be a functionality. If the operator name in OCP changes or moves, as long as the desired functionality is there and can be leveraged, I don't see the move having any effect to the API.

every time a component is added that needs an identity, addition validation wiring must be added to suit

This can be true as well with a centralised list for the case where an optional identity is added. e.g what do you do when the identity is provided but the feature not enabled? What do you when the identity is not provided when the feature is enabled?

any status reporting code built to ensure the validity of identity post-API call also needs to be updated

Same reasoning as above. With the centralised list of identities, if it is an optional identity, you'll need to report it's status conditionally depending on the feature being enabled or not.

  4. every time a component is added, even if it has no other configuration, it must have an API change.

True. Taking care of these kinds of issues was one of the motivations for originally using a map struct for identities in the cluster.azure.operators_authentication.managed_identities field; this was a problem that both the RP and CS wanted to solve. However, with some functionality being optional, and in consequence their operators too, we noticed that if we stick with this approach, for a given functionality we'd end up with knobs related to that functionality dispersed everywhere - which has its cons.

  5. there is no mechanism to remove a component when it is EOL'd because that produces an incompatibility that makes moving to new schemas more difficult

An API deprecation policy is what will address this best.
However, I think this con also applies to the centralised list approach when the EOL'd component had extra knobs defined elsewhere in the cluster's spec.

@deads2k (Contributor) commented May 16, 2025

5. there is no mechanism to remove a component when it is EOL'd because that produces an incompatibility that makes moving to new schemas more difficult

An API deprecation policy is what will address this best. However, I think this con also applies to the centralised list approach when the EOL'd component had extra knobs defined elsewhere in the cluster's spec.

While I can appreciate that the current API compatibility is treated fast and loose, with a map (or other arbitrary list) no schema change is required to be communicated to clients, nor do all clients have to update to serialize properly. The more widespread the clients, the harder the required shifts.

4. every time a component is added, even if it has no other configuration, it must have an API change.

(Quoting @machi1990's reply above in full.)

The only listed con is "Harder to distinguish visually which identities are required". Making the change looks a lot like "the grass must be greener over there". Think carefully about motivations before taking a plunge with more downsides than the current technique.

@deads2k (Contributor) commented May 16, 2025

(Quoting @miguelsorianod's side note above.)

Good points. Have a look at

  1. make it possible to have separate serialization API types and the client to access them ocm-api-metamodel#219
  2. Move generator for clientapi to repo where the model is defined #1024
  3. move JSON structs (serialization) to ocm-api-model ocm-sdk-go#1037

Which do that coalescing and move the openapi and json types into this repo, while using type aliases to avoid causing problems for ocm-sdk-go consumers. I'm anticipating that landing early this summer.

@vkareh (Member) commented May 16, 2025

Thanks for this good discussion, folks!

Some things I can see coming from this:

  1. Using {component: {aws: {}, azure: {}}} will get cumbersome fast. In fact, it already is cumbersome to think/type in that dimension. If we go the way of de-centralized identities, the way I would split it would look more like
{
  "azure": {
    "component": {
      "identity": "..."
    }
  },
  "component": {
    "knob": "..."
  }
}

It separates product functionality from cloud requirement.

  2. Using the required* prefix in the attribute is a hard no. It breaks both options 1 and 2, breaks future-proofing of the API, and confuses the heck out of me when thinking of "optional components that have required identities". 😜

  3. Whether identities are part of a required or an optional component doesn't matter, since a component is either enabled or not, and the identity is therefore required for that component. Therefore, there is no distinction between those identities, and so they should either co-exist in the same struct, or live under each discrete component (caveat: see 1. above).

  4. Based on 1. and 3. above, I'm leaning towards co-locating all identities in the same location. Therefore, I would like to see how this API would look. @machi1990 can you update the API to reflect this, or at least post a sample snippet? That is, unless we're at the point where options 1 and 2 in my original comment start looking very much the same in light of my comment 1. above?

@machi1990 (Contributor, Author)

(Quoting the exchange with @deads2k above about EOL'd components.)

@deads2k The point in my statement (quoted below)

I think this con also applies to the centralised list approach when the EOL'd component had extra knobs defined elsewhere in the cluster's spec.

is that, if a component is EOL'd, no schema change is required for the centralised approach (option 1) if and only if the component didn't have extra configurations, i.e. the only configuration for it was the identity information.
Consider this sample schema that mimics option 1:

{
  "azure": {
    "operators_authentication": {
      "managed_identities": {
        "control_plane_operators": {
          "component1": { // this component has extra configs
            "resource_id": "foo-bar"
          }
        }
      }
    }
  },
  "component1ExtraConfigs": {
    "config1": ...,
    "config2": ...
  }
}

If component1 is EOL'd, the .azure.operators_authentication.managed_identities.control_plane_operators map gives you the benefit of no schema change, but you have a schema change anyway for the case of ".component1ExtraConfigs". So yes, what you said holds only for a component whose configuration is only the identity information.

@machi1990 (Contributor, Author)

@vkareh

Based on 1. and 3. above, I'm leaning towards co-locating all identities in the same location. Therefore, I would like to see how this API would look. @machi1990 can you update the API to reflect this, or at least post a sample snippet? That is, unless we're at the point where options 1 and 2 in my original comment start looking very much the same in light of my comment 1. above?

By "I'm leaning towards co-locating all identities in the same location." - that is co-locating only the identity information in one place and the component related information in a different place? i.e pretty much option (1) from #1037 (review)

If so, the snippet of that API would look like:

{
  "azure": {
    "operators_authentication": {
      "managed_identities": {
        "control_plane_operators": {
          "image-registry": {
            "resource_id": "foo-bar"
          }
          ... extra control plane components identities
        },
        "data_plane_operators": {
          "image-registry": {
            "resource_id": "bazinga"
          }
          ... extra data plane components identities
        }
      }
    }
  },
  "image-registry": {
    "state": "enabled"
  }
}

The snippet assumes that we'll get rid of the .capabilities section for enablement/disablement of components. If we stick with .capabilities, the snippet would look like:

{
  "azure": {
    "operators_authentication": {
      "managed_identities": {
        "control_plane_operators": {
          "image-registry": {
            "resource_id": "foo-bar"
          }
          ... extra control plane components identities
        },
        "data_plane_operators": {
          "image-registry": {
            "resource_id": "bazinga"
          }
          ... extra data plane components identities
        }
      }
    }
  },
  "capabilities": {
    "disabled": []
  }
}

@miguelsorianod (Contributor) commented May 16, 2025

Using {component: {aws: {}, azure: {}}} will get cumbersome fast. In fact, it already is cumbersome to think/type in that dimension. If we go the way of de-centralized identities, the way I would split it would look more like

Doesn't the same apply to the other way around? With the other way around you will have:

{
aws:
    functionality1:
    functionalityN:
azure:
    functionality1:
    functionalityN:
}

which is equivalent and is subject to the same arguments you raised.

On top of that, now you will have components that do not have cloud-specific configuration, and you will end up with something like:

{
aws:
    functionality1:
    functionalityN:
azure:
    functionality1:
    functionalityN:
functionalityM:
}

But the other way around, you don't have the functionality configuration separate, so the argument that "it is functionally the same as the original option 1" does not hold in that case.

Using the required* prefix in the attribute is a hard no. It breaks both options 1 and 2, breaks future-proofing of the API, and confuses the heck out of me when thinking of "optional components that have required identities". 😜

Using required* as the name is not what breaks the API in option 2. What breaks option 2 is the fact that we have a section used to designate a set of operators that are required; the breaking aspect is the existence of that section, where over time some elements in it no longer apply. The name is independent of that; using required just makes it less accurate. Just having the section named "operators_authentication" without the required part would still break option 2. Do you agree with that assessment? If so, the discussion with that option is not about the name; it is about finding an alternative that does not leverage that section.

@mbarnes commented May 16, 2025

From an outsider perspective, it seems to me like the central problem here is how to express the role of an identity in the cluster. Each OpenShift version has a discrete set of roles for identities; some required, some not.

I had an alternate idea to the above proposals. Bear with me because it's not fully baked yet.

Instead of trying to cram this all into the cluster API, what if these roles were exposed as a new set of pre-defined top-level endpoints? I'll call them "identity roles". An identity role and its endpoint might look something like:

GET /api/clusters_mgmt/v1/identity_roles/{identity_role_id}
{
    "required": true/false,
    "cloud_providers": [
        CloudProviderLink,
        ...
     ],
     "cluster_plane": "control"/"data",
     "purpose": operator or capability name (whatever we decide on)
}

Individual identity roles are static. As new OpenShift versions introduce new operators/functionalities that require identities, Cluster Service could publish new identity roles. Similarly, if a previously required identity role became optional in later versions (or vice versa), this would entail a new identity role.

The schema for /api/clusters_mgmt/v1/versions/{version_id} could then be extended to return a list of identity roles appropriate for that version across all cloud providers:

GET /api/clusters_mgmt/v1/versions/openshift-v4.18.1
{
   ...
   "identity_roles": [
      IdentityRoleLink,
      ...
   ]
}

Then when a user POSTs a cluster manifest, the "managed_identities" section could just be a list (or map) of identity role ID and resource ID pairings.

"azure": {
   "operators_authentication": {
      "managed_identities": [
         {
            "role": "identity-role-1",
            "resource_id": "foo-bar"
         },
         {
            "role": "identity-role-2",
            "resource_id": "bazinga"
         },
         ...
      ]
   }
}, 

I'm not sure if identity role IDs should be meaningful names or random like cluster IDs. I'm leaning toward random, just because naming is hard. It would require some effort on the user's part to look up identity roles for a given OpenShift version, but that level of effort seems inescapable whatever we settle on.

Maybe there's a seed of a workable idea here. I'm not as familiar with the design constraints on this API as the rest of you. I'm just trying to break our thinking out of the box we seem to be stuck in.

@vkareh (Member) commented May 26, 2025

To follow up on an example of a feature that contains cloud-specific configurations, we have etcd encryption. Today the model looks like:

{
  "etcd_encryption": true,
  "aws": {
    "etcd_encryption": {
      "kms_key_arn": "..."
    }
  },
  "gcp_encryption_key": {
    "kms_key_service_account": "..."
    "key_location": "...",
    "key_name": "...",
    "key_ring": "..."
  }
}

Under the guidelines I outlined, this is largely correct (even though it should be nested in the case of GCP and ideally we refrain from using a boolean for the top-level config). If we follow the guidelines, the API should look like:

{
  "etcd_encryption": {
    "state": "enabled"
  },
  "aws": {
    "etcd_encryption": {
      "kms_key_arn": "..."
    }
  },
  "azure": {
    "managed_identities": {
      "etcd_encryption": {
        "resource_id": "..."
      }
    }
  },
  "gcp": {
    "etcd_encryption": {
      "kms_key_service_account": "..."
      "key_location": "...",
      "key_name": "...",
      "key_ring": "..."
    }
  }
}

@machi1990 (Contributor, Author)

(Quoting @vkareh's comment above in full.)

The way I see it, the struct will look more like the one below. From the struct we can clearly see how distributed the feature-related knobs are, which imposes a high cognitive load when reasoning about the data model; the load falls both on users of the API and on its maintainers (CS developers).
This is a tradeoff and one of the implications we'll be taking on with option (1), and it is worth capturing that in the guideline.

{
  "etcd_encryption": {
    "state": "enabled"
  },
  "aws": { 
     "sts": {
        "operator_iam_roles": [
          .... {
            "name": "kms-provider",
            "role_arn": "kms-role-arn",
            ....
          }
       ]
     },
    "etcd_encryption": {
      "kms_key_arn": "..."
    }
  },
  "azure": {
    "etcd_encryption": {
        "foo_bar_ksm_key": {
          ...Azure KV related info
        }
      },
    "managed_identities": {
      "control_plane_operators": {
         ...
        "kms": {
            "resource_id": ....
          }
       }, 
      "data_plane_operators": {
        ....
      }
    }
  },
  "gcp": {
    "etcd_encryption": {
      "kms_key_service_account": "..."
      "key_location": "...",
      "key_name": "...",
      "key_ring": "..."
    }
  }
}

@machi1990 machi1990 requested a review from vkareh June 2, 2025 08:53
@vkareh (Member) commented Jun 2, 2025

@machi1990

The way I see it, the struct will be more like below. [...]

Yes, this is what I had in mind. There might be ways of improving on the cognitive load issue you mention by rearranging some things inside the cloud component (possibly grouping all identities closer together under .azure.managed_identities?), but what you posted seems correct to me.

@machi1990 (Contributor, Author)

@machi1990

The way I see it, the struct will be more like below. [...]

Yes, this is what I had in mind.

Thank you for confirming.

There might be ways of improving on the cognitive load issue you mention by rearranging some things inside the cloud component (possibly grouping all identities closer together under .azure.managed_identities?), but what you posted seems correct to me.

I don't think any possible arrangement in the cloud component will make the situation better. In my opinion, the distributed nature of the knobs for a certain feature is what adds to the cognitive load when reasoning about the feature, and I only see option (2) as a solution for that, at the "cost" of potentially duplicating the cloud across different features. Between the two, the choice was to go for option (1) in this regard. This is one implication.

@vkareh @tzvatot there are other, broader cases highlighted in #1037 (comment); please take a look at that comment and chime in.

@tzvatot (Contributor) commented Jun 3, 2025

From what I see in #1037 (comment), I feel the current guidelines fulfil their requirements.

@miguelsorianod (Contributor) commented Jun 3, 2025

OK, so if you think that's the correct design, and you agree with Manyanda's example in #1037 (comment), including the key name when you specify the data of the etcd-related identity,

then I think the way forward with this MR is to close it, because that's the design that is already in the main branch regarding the Azure managed identities area. You can take a look at the main branch to see how it is defined.

Can you confirm that we are OK with this way forward? If that's the case, we'll close the MR without merging.

@machi1990 (Contributor, Author)

Below is the schema of what we have in main. The schema captures the configuration of etcd encryption for the 3 clouds we support.

For the aro-hcp case, cluster.azure.operators_authentication.managed_identities contains two subsections: control_plane_operators and data_plane_operators. These sections are represented by a Map struct, the key being the operator name.

{
  "etcd_encryption": {
    "state": "enabled"
  },
  "aws": { 
     "sts": {
        "operator_iam_roles": [
          .... {
            "name": "kms-provider",
            "role_arn": "kms-role-arn",
            ....
          }
       ]
     },
    "etcd_encryption": {
      "kms_key_arn": "..."
    }
  },
  "azure": {
    "etcd_encryption": {
        "foo_bar_ksm_key": {
          ...Azure KV related info
        }
      },
      "operators_authentication": {
           "managed_identities": {
                "control_plane_operators": { // this is a Map struct with key being the operator
                   ...
                  "kms": { // here the key is the operator name
                      "resource_id": ....
                    }
                }, 
                "data_plane_operators": { // this is a Map struct with key being the operator name
                  ....
                }
             }
       }
  },
  "gcp": {
    "etcd_encryption": {
      "kms_key_service_account": "..."
      "key_location": "...",
      "key_name": "...",
      "key_ring": "..."
    }
  }
}

@vkareh @tzvatot if you agree with that, I'll close the PR.

@machi1990 machi1990 requested a review from tzvatot June 3, 2025 12:09
@tzvatot (Contributor) commented Jun 3, 2025

"kms": { // here the key is the operator name

Is the operator name "kms"?

@tzvatot (Contributor) commented Jun 3, 2025

The schema captures the configuration of etcd encryption for the 3 clouds we support.

Schema looks good to me. @vkareh ?

@machi1990 (Contributor, Author)

"kms": { // here the key is the operator name

Is the operator name "kms"?

Yes. The name of the key in the payload will be whatever is in this config file: https://github.com/Azure/ARO-HCP/blob/main/cluster-service/deploy/templates/azure-operators-managed-identities-config.configmap.yaml#L89 (the keys in the config there are the operator names).

@machi1990 machi1990 closed this Jun 5, 2025
@machi1990 machi1990 deleted the image-registry-proposal-as-optional-functionality branch June 6, 2025 06:49
mbarnes pushed commits to Azure/ARO-HCP that referenced this pull request (Jun 26 – Jul 3, 2025):

We decided to structure optional features differently after a
lengthy discussion in
openshift-online/ocm-api-model#1037

This keeps the nice test framework in ocm_test.go even though
it doesn't do much now. This can be expanded upon in the future.