capture the optional managed identities proposal using API Model. #1037

Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: machi1990. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@@ -0,0 +1,4 @@
struct ClusterImageRegistry {
    ControlPlaneIdentity AzureControlPlaneManagedIdentity
Another related question we have here is: what happens if the cluster image registry functionality is supported not only on azure but also on aws? We would need to have something like:
ClusterImageRegistry:
State: enabled/disabled
AWS:
ControlPlaneIdentity AWSSTSControlPlaneIdentity
DataPlaneIdentity AWSSTSDataPlaneIdentity
Azure:
ControlPlaneIdentity AzureControlPlaneManagedIdentity
DataPlaneIdentity AzureDataPlaneManagedIdentity
or similar (if we want, we could put the cloud providers in a union type inside ClusterImageRegistry, as sketched below, etc.)
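For illustration, the union-style variant might look roughly like this in the payload (all names here are illustrative, not an agreed schema):

{
  "image_registry": {
    "state": "enabled",
    // exactly one of the cloud sections below would be set:
    "aws": {
      "control_plane_identity": "...",
      "data_plane_identity": "..."
    },
    "azure": {
      "control_plane_identity": "...",
      "data_plane_identity": "..."
    }
  }
}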
This is an interesting problem. For completeness, the way we've solved this in ROSA is by providing an array of arbitrary operators that require identity:
{
  "aws": {
    "operator_iam_roles": [
      {"id": "openshift-image-registry-operator", "role_arn": "<identity resource>"}
    ]
  }
}
However, when it comes to any knobs for said operator, they don't live there; they are provided elsewhere on an as-needed basis.
We should probably follow a similar model.
A similar approach is what's implemented in master currently for aro-hcp. This was implemented before optional functionalities came into play. The difference from what you described is that instead of having an array we have a free-form map where the key is the id of the operator.
I explain the details about that here: #1037 (comment)
The use of a map instead of an array is intentional. The reason is that arrays are not PATCH friendly: JSON Merge Patch doesn't support adding, removing, or changing array elements (except by explicitly replacing the whole array).
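To illustrate why (using a hypothetical operators map; the names are made up):

// Existing resource, map form:
{
  "operators": {
    "image-registry": {"resource_id": "id-1"}
  }
}

// JSON Merge Patch (RFC 7386) adding one entry - only the new key is sent:
{
  "operators": {
    "kms": {"resource_id": "id-2"}
  }
}

// Removing an entry - set its key to null:
{
  "operators": {
    "image-registry": null
  }
}

With an array, the same addition or removal would require the client to resend the entire array.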
I've pushed an update on this in line with what Miguel was suggesting above.
// Defines how the operators of the cluster authenticate to Azure.
// Required during creation.
// Immutable.
OperatorsAuthentication AzureOperatorsAuthentication
RequiredOperatorsAuthentication AzureOperatorsAuthentication
Even if this remains called OperatorsAuthentication, with only the core operators' information in it, the same issue would happen. The name doesn't really change much in this proposal in that regard.
+1
The name would matter if you group all identities together, regardless of whether they are for core components or optional components.
> The name would matter if you group all identities together, regardless of whether they are for core components or optional components.
Yes, in that case the name would matter, that's correct. In the master branch, that's how the API currently looks if you want to take a look:
That is, all identities are grouped together and there's no differentiation between core and optional components. There is differentiation between control plane and data plane identities, because they are conceptually different, configured in a different way, and receive/return a different set of data. The structure we use to represent the set of identities is a free-form map where the key is an arbitrary name that we define representing that operator/functionality identity, and the value is the associated identity information. The reason a map was chosen is that it allows you to introduce new keys without needing to change the API definition. It is PATCH friendly as well. The set of key names is defined by us.
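Abbreviated, that shape looks roughly like this (operator names and fields trimmed for brevity; the authoritative definition is in the master branch):

"azure": {
  "operators_authentication": {
    "managed_identities": {
      "control_plane_operators": {
        // key: operator name, value: its identity information
        "image-registry": {"resource_id": "..."}
      },
      "data_plane_operators": {
        "image-registry": {"resource_id": "..."}
      }
    }
  }
}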
As you well explained in #1037 (review), there are tradeoffs between the two approaches.
As a side note: even though a PR is a good place to leave comments, have them tracked, and have discussions around them, the fact that we are iterating on the custom ocm api model definitions makes it hard to understand how that would actually end up looking: as in, how the json payload would look, the openapi definition, the go types, etc. None of those are tracked in the repository, and therefore neither in the PR.
…abled or disabled via a flag
There are two paths forward here:
- Bundle all identities; component knobs go in separate structs
- Pros:
- All identities are in the same location
- Good separation of concerns: Identities belong to the cloud platform, whereas components belong to the cluster.
- Single location for validating all identities and providing status
- More consistent with ROSA
- Cons:
- Harder to distinguish visually which identities are required
- Co-locate identity within the discrete component configuration
- Pros:
- Identities belong where they are used
- Good separation of configuration
- Easy to visualize what's required based on components
- Cons:
- More places for validation/status, which is cumbersome for client consumption
- Inconsistent with ROSA
The current option of bundling identities for some components but co-locating for others is inconsistent and seems harder to future-proof. Let's not do that; instead, let's focus our efforts on improving one of the two options above. My preference is for the first option, as it's more consistent with how ROSA does it today, but I'm willing to concede based on good arguments for the second option.
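To make the contrast concrete, here is a rough sketch of the two shapes (component and field names are illustrative):

// Option 1: identities bundled under the cloud section; component knobs live with the component.
{
  "azure": {
    "managed_identities": {
      "image_registry": {"resource_id": "..."}
    }
  },
  "image_registry": {"state": "enabled"}
}

// Option 2: identity co-located with the component configuration.
{
  "image_registry": {
    "state": "enabled",
    "azure": {"identity": {"resource_id": "..."}}
  }
}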
I think when considering this it is also relevant to keep in mind cluster upgrades:
Thanks @vkareh @miguelsorianod for chiming in. I am for option 2, which is feature/functionality centric.

Re: consistency with rosa: as mentioned in the slack thread, rosa-hcp isn't consistent with itself. That is, STS iam roles for operators are configured in the STS operators roles; however, there is a feature (an optional one, a user can choose to activate it or not) named AuditLog forwarding which follows the same design as what's proposed in the DDR, similar to option 2.

Re: status of managed identities: as a user, I care more about the status of the cluster or the extra functionalities that I've enabled on it. The cluster already has a "status" object, which can be enriched to return the state of the world at time T of managed identities or anything else we would like to report. For example today, as part of inflight checks on MIs (if the MIs don't exist or anything), the status object would report that. Additionally, the inflight check status endpoint provides additional status of whatever the failure was.

Re: "More places for validation" - this concern is already the case. The cluster payload is big, and the user/consumer of the API needs to make sure that they are providing a valid payload. In regards to validation, some functionality validations are interconnected through the knobs that are associated with that validation. Having knobs related to the functionality in one place helps quite a lot with this and makes it easier for the user, and for CS developers themselves, to perform those validations.

Re: a user wanting to know the managed identities associated with a cluster in a central place while looking at the cluster schema: as a user of the API, this is an interesting use case that the second option doesn't offer; however, there are solutions for that (e.g. a dedicated endpoint, similar to what rosa offers, that returns the list of all operator roles of a given cluster) if this really becomes an issue from the visualisation point of view. Also, this use case doesn't address the next follow-up question a user might ask themselves: why do I have this MI? For which feature / functionality is this needed? The answer can be "for this operator identified by this name", but operator names change or could change - so it is important (from my point of view) that a user clearly sees the presence of a certain managed identity from a functionality/feature standpoint.
My thinking is similar to Manyanda's, and it is what I commented in previous discussions. For me, if we decide to go this route, though, we need to define how we would provide authentication support if the given optional functionality is supported in multiple cloud providers. That is, imagine you have the image registry functionality, which is supported in both AWS and Azure based cluster types. In the case of AWS clusters an AWS STS identity would be required, and in the case of Azure based clusters an Azure Managed Identity would be required. How would this look at the CS API level? This is considering that we currently have an
Not strictly true. The identity would be unnecessary, but not removing it would not affect cluster operation/upgrade. In either case, this seems orthogonal to the shape of the API; the user behaviour would exist either way.
Yes. I personally like the ergonomics of this approach. However, we need to define it very carefully, since, based on @miguelsorianod's comment, it will become awkward for OCP components that require identities and are supported on more than one cloud/platform. That alone is enough to tip me in the other direction. Can you update this PR to see how it would look in this case?
👍. That's accurate for the case of the removal. However, I think conceptually that's what we should require. I think we shouldn't have identity information in the API payload that is not used nor supported anymore.
The user behavior would exist in either case, but we have to think about how the experience of the user would change depending on how the API is designed. As in, what changes in the API payload the user would need to perform when doing a cluster upgrade via a PATCH. Like, they would need to change this attribute here and that other one there, etc.
@vkareh @miguelsorianod I've pushed 2271d41. This gives us a model like the OperatorAuthentication union shown below.
/hold
// A union type of cloud authentication mechanism.
struct OperatorAuthentication {
    Aws AwsOperatorAuthentication
    Azure AzureOperatorAuthentication
}
While placing the union at this level appears to give a function-based API, our experience (and even this PR) demonstrates differences between the offerings that we need to admit to ourselves. Having a top-level union instead of a per-component one makes a future refactor that fully separates Azure and AWS cleaner. Such a top-level split does not preclude embedding the same type for top-level configuration, but it does provide two clearly separate schemas for evolution.
If I understand correctly, "if we are to go with option (2)", the way you'll address the cloud concern is by placing per-component knobs in the top-level cloud section, i.e.
// the cluster spec
{
...
"azure": {
"image_registry": {
... // image registry knobs
}
}
}
The reason is that it'll make it easier for a future split of the services - the world where aro-hcp has its own CS, separate from rosa-hcp?
Yes for both: what I intend to express in the API and the reasoning behind it. I think our experience with trying to have a unified API has aged not-so-well, and we should be willing to reflect that in our API and code.
I see, and thanks for clarifying. That's an approach we could take, and it has been considered as well. This is what was mentioned as a possible alternative in the PR's description, and it was left as a comment in the first commit of the PR:
// We could move it to the .Azure field as it contains Azure specific things.
Thinking through adding additional identities, a choice of adding them per-component requires:
Can it be done? Yes. The con-list is notably longer than listed above, though.
Thanks @deads2k for chiming in and for the additional insights.
The per-component API won't have the operator name in it. It'll be a functionality. If the operator name in OCP changes or moves, as long as the desired functionality is there and can be leveraged, I don't see the move having any effect on the API.
This can be true as well with a centralised list, for the case where an optional identity is added. E.g. what do you do when the identity is provided but the feature is not enabled? What do you do when the identity is not provided but the feature is enabled?
Same reasoning as above. With the centralised list of identities, if it is an optional identity, you'll need to report its status conditionally, depending on whether the feature is enabled or not.
True. Taking care of these kinds of issues was one of the motivations for using a
An API deprecation policy is what will address this best.
While I can appreciate that the current API compatibility is treated fast and loose, with a map (or other arbitrary list) no schema change needs to be communicated to clients, and no client updates are needed to serialize properly. The wider spread the clients, the harder the required shifts.
The only listed con is "Harder to distinguish visually which identities are required". Making the change looks a lot like "the grass must be greener over there". Think carefully about motivations before taking a plunge with more downsides than the current technique.
Good points. Have a look at the linked changes, which do that coalescing and move the openapi and json types into this repo, while using type aliases to avoid causing problems for ocm-sdk-go consumers. I'm anticipating that landing early this summer.
Thanks for this good discussion, folks! Some things I can see coming from this:
{
"azure": {
"component": {
"identity": "..."
}
},
"component": {
"knob": "..."
}
}

It separates product functionality from cloud requirement.
@deads2k The point in my statement (quoted below)
is that, if a component is EOL-ed, no schema change is required for the centralised approach (option 1) if and only if the component didn't have extra configuration, i.e. the only configuration for it was the identity information.
If component1 is EOL'd, the
By "I'm leaning towards co-locating all identities in the same location." - that is co-locating only the identity information in one place and the component related information in a different place? i.e pretty much option (1) from #1037 (review) If so, the snippet of that API would look like
The snippet assumes that we'll get rid of the
|
Doesn't the same apply the other way around? With the other way around you will have:
which is equivalent and invites the same arguments that you raised. On top of that, you will now have components that do not have cloud-specific configuration, and you will end up with something like:
But the other way around doesn't keep the functionality configuration separate, so the argument "it is functionally the same as the original option 1" does not hold in that case, because you do not have the configuration separate.

Using required* as the name is not what breaks the API in option 2. What breaks option 2 is the fact that we have a section that is used to designate a set of operators that are required. The aspect that breaks it is the existence of that section, where over time some elements in it no longer apply. The name is independent of that; using "required" just makes it less accurate. What breaks option 2 is the existence of the section, not the name chosen for it. Just having the section named "operators_authentication" without the "required" part would still break option 2. Do you agree with that assessment? If so, the discussion with that option is not about the name; it is about finding an alternative that does not leverage that section.
From an outsider's perspective, it seems to me like the central problem here is how to express the role of an identity in the cluster. Each OpenShift version has a discrete set of roles for identities; some required, some not. I had an alternate idea to the above proposals. Bear with me, because it's not fully baked yet. Instead of trying to cram this all into the cluster API, what if these roles were exposed as a new set of pre-defined top-level endpoints? I'll call them "identity roles". An identity role and its endpoint might look something like:
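// GET /api/clusters_mgmt/v1/identity_roles/{identity_role_id}
// (a sketch only - the field names are illustrative, not an agreed schema)
{
  "id": "abc123",
  "kind": "IdentityRole",
  "operator": "openshift-image-registry-operator",
  "cloud_provider": "azure",
  "required": false
}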
Individual identity roles are static. As new OpenShift versions introduce new operators/functionalities that require identities, Cluster Service could publish new identity roles. Similarly, if a previously required identity role became optional in later versions (or vice versa), this would entail a new identity role. The schema for /api/clusters_mgmt/v1/versions/{version_id} could then be extended to return a list of identity roles appropriate for that version across all cloud providers:
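// GET /api/clusters_mgmt/v1/versions/{version_id}
// (sketch - showing only the hypothetical identity_roles extension)
{
  "id": "openshift-v4.17.0",
  "identity_roles": [
    {"id": "abc123", "required": true},
    {"id": "def456", "required": false}
  ]
}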
Then when a user POSTs a cluster manifest, the identities supplied could be matched against the identity roles for the requested version.
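A manifest referencing identity roles might then look something like this (again, purely hypothetical field names):

// POST /api/clusters_mgmt/v1/clusters
{
  "version": {"id": "openshift-v4.17.0"},
  "azure": {
    "identities": [
      {"identity_role_id": "abc123", "resource_id": "..."}
    ]
  }
}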
I'm not sure whether identity role IDs should be meaningful names or random like cluster IDs. I'm leaning toward random, just because naming is hard. It would require some effort on the user's part to look up identity roles for a given OpenShift version, but that level of effort seems inescapable whatever we settle on. Maybe there's a seed of a workable idea here. I'm not as familiar with the design constraints on this API as the rest of you; I'm just trying to break our thinking out of the box we seem to be stuck in.
To follow up on an example of a feature that contains cloud-specific configurations, we have etcd encryption. Today the model looks like:

{
"etcd_encryption": true,
"aws": {
"etcd_encryption": {
"kms_key_arn": "..."
}
},
"gcp_encryption_key": {
"kms_key_service_account": "..."
"key_location": "...",
"key_name": "...",
"key_ring": "..."
}
} Under the guidelines I outlined, this is largely correct (even though it should be nested in the case of GCP and ideally we refrain from using a boolean for the top-level config). If we follow the guidelines, the API should look like: {
"etcd_encryption": {
"state": "enabled"
},
"aws": {
"etcd_encryption": {
"kms_key_arn": "..."
}
},
"azure": {
"managed_identities": {
"etcd_encryption": {
"resource_id": "..."
}
}
},
"gcp": {
"etcd_encryption": {
"kms_key_service_account": "..."
"key_location": "...",
"key_name": "...",
"key_ring": "..."
}
}
}
The way I see it, the struct will be more like the below. And from the struct we can clearly see how distributed the feature-related knobs are, which imposes a high cognitive load when reasoning about the data model; the cognitive load falls both on the users of the API and on the maintainers of the API (cs developers).

{
"etcd_encryption": {
"state": "enabled"
},
"aws": {
"sts": {
"operator_iam_roles": [
.... {
"name": "kms-provider",
"role_arn": "kms-role-arn",
....
}
]
},
"etcd_encryption": {
"kms_key_arn": "..."
}
},
"azure": {
"etcd_encryption": {
"foo_bar_ksm_key": {
...Azure KV related info
}
},
"managed_identities": {
"control_plane_operators": {
...
"kms": {
"resource_id": ....
}
},
"data_plane_operators": {
....
}
}
},
"gcp": {
"etcd_encryption": {
"kms_key_service_account": "..."
"key_location": "...",
"key_name": "...",
"key_ring": "..."
}
}
}
Yes, this is what I had in mind. There might be ways of improving on the cognitive-load issue you mention by rearranging some things inside the cloud component (possibly grouping all identities closer together under
Thank you for confirming.
I don't think any possible arrangement inside the cloud component will make the situation better. In my opinion, the distributed nature of the knobs for a certain feature is what adds to the cognitive load when reasoning about the feature, and I only see option (2) as a solution for that, at the "cost" of potentially duplicating the cloud section across different features. Between the two, the choice was to go for option (1) in this regard. This is one implication. @vkareh @tzvatot there are other, broader cases highlighted in #1037 (comment); please take a look at that comment and chime in.
From what I see in #1037 (comment), I feel the current guidelines fulfil their requirements.
Ok, so if you think that's the correct design, and you agree with Manyanda's example in #1037 (comment), including the key name when you specify the data of the etcd-related identity, then I think the way forward with this MR is to close it, because that's the design that is already in the main branch regarding the azure managed identities area. You can take a look at the main branch to see how it is being defined. Can you confirm that we are ok with this way forward? If that's the case, we'll close the MR without merging.
Below is the schema of what we have in main. The schema captures the configuration of etcd encryption for the 3 clouds we support. For the aro-hcp case, the identities live under azure.operators_authentication:

{
"etcd_encryption": {
"state": "enabled"
},
"aws": {
"sts": {
"operator_iam_roles": [
.... {
"name": "kms-provider",
"role_arn": "kms-role-arn",
....
}
]
},
"etcd_encryption": {
"kms_key_arn": "..."
}
},
"azure": {
"etcd_encryption": {
"foo_bar_ksm_key": {
...Azure KV related info
}
},
"operators_authentication": {
"managed_identities": {
"control_plane_operators": { // this is a Map struct with key being the operator
...
"kms": { // here the key is the operator name
"resource_id": ....
}
},
"data_plane_operators": { // this is a Map struct with key being the operator name
....
}
}
}
},
"gcp": {
"etcd_encryption": {
"kms_key_service_account": "..."
"key_location": "...",
"key_name": "...",
"key_ring": "..."
}
}
}
Is the operator name "kms"?
Schema looks good to me. @vkareh?
Yes. The name of the key in the payload will be whatever is in this config file https://github.com/Azure/ARO-HCP/blob/main/cluster-service/deploy/templates/azure-operators-managed-identities-config.configmap.yaml#L89 (the keys in the config there are the operator names).
We decided to structure optional features differently after a lengthy discussion in openshift-online/ocm-api-model#1037 This keeps the nice test framework in ocm_test.go even though it doesn't do much now. This can be expanded upon in the future.
The PR captures the above proposal as set out in the DDR around optional managed identities.
However, there are a few challenges with the current proposal; e.g. when a required operator becomes optional, its identity would move from the `required_operators_authentication` section to its own section as per the current proposal. How do people see a better design to future-proof ourselves? This is the case we're having now, with the image registry operator taken as an example.

The proposal above designs it as `cluster.image_registry`; however, the concept of managed identities is tied to Azure. Other clouds have different concepts / naming. An alternative to the proposal above would be to have `cluster.azure.image_registry` and `cluster.aws.image_registry`, i.e. each cloud section in the cluster's spec would have its own type defining the feature. However, this duplicates the feature definition across different clouds.

Another alternative is `cluster.image_registry.aws` and `cluster.image_registry.azure`, i.e. on the feature itself, have configurations per cloud. This duplicates the cloud platform everywhere.

I am opening this discussion so that we can iterate on the design and come up with a solid one to be captured in the DDR and adopted across offerings, and specifically for aro-hcp's optional features design.
@tzvatot @vkareh @miguelsorianod @deads2k @flavianmissi @mbarnes