spec: LocalDevices, NextDevice, and NodeGetCapabilities #84
The AWS Elastic Block Store (EBS) service requires the path to the next available device when attaching a volume. Additionally, it's useful for a Controller plug-in to be aware of the local device data from a Node host. I'd like to propose that NodeGetCapabilities be extended, or some new message be introduced, that requires a CO to fetch data from Node hosts in order to inform a Controller plug-in on a Controller host of a Node's next available device path and the current state of a Node host's local devices.
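A rough sketch of the flow this proposal implies, in Go, from the CO's point of view; the NodeInfo type and both callback parameters are hypothetical names invented here, not anything defined by the spec.

```go
package sketch

import "context"

// NodeInfo is a hypothetical container for the node-local facts the
// proposal describes: the next free device path and the devices
// currently in use on the node.
type NodeInfo struct {
	NextDevice   string            // e.g. "/dev/xvdf"
	LocalDevices map[string]string // device path -> volume ID
}

// publishWithNodeInfo shows the order of operations the CO would
// follow: query the node plugin first, then forward the result to the
// controller plugin's publish call. Both callbacks are stand-ins for
// whatever RPCs the spec would actually define.
func publishWithNodeInfo(
	ctx context.Context,
	volumeID, nodeID string,
	getNodeInfo func(ctx context.Context, nodeID string) (NodeInfo, error),
	controllerPublish func(ctx context.Context, volumeID, nodeID string, info NodeInfo) error,
) error {
	info, err := getNodeInfo(ctx, nodeID)
	if err != nil {
		return err
	}
	// If NextDevice is stale by the time the attach actually happens,
	// the CO would retry from getNodeInfo.
	return controllerPublish(ctx, volumeID, nodeID, info)
}
```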
Comments

This feels very similar to the
Hi @julian-hj, Caching is in fact why I filed #71 and raised it for discussion in the most recent CSI public forum. @saad-ali filed a TODO, but only to make it explicit that the NodeID is immutable:
So yeah, I agree that
This feels to me like a compelling use case for reversing that decision about immutable node ids, and maybe also changing the name to
Looking through the spec, and at all instances of NodeID:

In this sense, it does make sense to me that the NodeID is small and immutable. It is the smallest amount of info that uniquely identifies a node to the plugin, and since it's immutable, it can be cached.

The need for knowing
Is this something that the CO needs to know about? It sounds like this decision can be made local to the Node as an implementation detail of the CSI plugin for EBS.
Won't this information be stale before it reaches the Controller host? This sounds like a race to me.
NodeID is intended to remain small and immutable. If there's a need for CSI to expose additional node state then we'd need to define a new type and RPC to fetch it. Maybe it's worth stepping back and re-casting the problem space.

Given:
There's no OOTB strategy for plugin (a) to directly communicate with plugin (b) over the same (UNIX socket) endpoint that plugin (b) exposes for CSI. The proposal to essentially hack NodeID in order to propagate sideband, ephemeral state feels very dirty and outside the scope of what NodeID was originally intended for: to represent a Node's identity (that shall not change).

Alternative proposal 1: extend the CSI Node service with a NodeGetState RPC, and extend ControllerPublishVolumeRequest to include a NodeState field. This doesn't resolve the possible race that @gpaul identified -- worst case, the controller publish call fails (because the EBS device slot is already populated) and so perhaps the CO backs up and tries again, beginning with NodeGetState.

Alternative proposal 2: extend CSI with messaging RPCs (gated by
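One possible shape for proposal 1, sketched as Go structs rather than protobuf; NodeGetState, NodeState, and every field shown here are hypothetical and exist nowhere in the current spec.

```go
package sketch

// NodeState would carry ephemeral, node-local facts that a controller
// plugin may need before it can publish a volume to that node.
type NodeState struct {
	NextDevice   string            // next unused device path, e.g. "/dev/xvdg"
	LocalDevices map[string]string // device path -> volume ID currently occupying it
}

// NodeGetStateResponse would be returned by the new Node-service RPC.
type NodeGetStateResponse struct {
	State NodeState
}

// ControllerPublishVolumeRequest would gain an optional NodeState
// field so the CO can forward the node's report to the controller.
// Only the new field is shown; the real request carries more fields.
type ControllerPublishVolumeRequest struct {
	VolumeID  string
	NodeID    string
	NodeState *NodeState // populated only when the plugin advertises the need via a capability
}
```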
I am not sure I agree that changing

So the question for me is: do we really want to have 2 separate RPCs and an additional create parameter to handle the set of use cases where the controller plugin needs extra information about the node, or would one RPC suffice?

Regarding the race condition, yes, I suppose this is a real problem. The obvious answer would be to simply move the attach logic out of
I agree with @jdef's assessment re proposal 1. It feels necessary, as @codenrhoden calls out, to include it in capabilities so that the CO is aware it must run this info-collection step on the node prior to an attach, and therefore pass that information to the attach call.

Regarding the race condition, it is present today in K8s. You can see the K8s AWS code has an expected race condition if the same device is captured for two requests. Short of doing more complex solutions to this, it could be called out that 1) the CO should retry under certain errors, and 2) the burden is on the plugin to minimize the possibility of race conditions. Here the AWS plugin could be modified to choose device names randomly rather than always choosing the next available one.
FWIW, there will always be race conditions in play unless the CO completely owns concurrency with respect to parallel volume operations, including post-crash. Unless a storage platform supports a distributed locking mechanism that the CO or the plug-in can leverage, there is always the potential for some failure during the workflow due to another, similar operation occurring first.

Additionally, as @julian-hj said, unless we move certain operations to the Node host / Node plug-in, the fact that a Controller plug-in will need information available only from the Node host will result in these situations. The other option is introducing a message queue that Controller and Node plug-ins can use in order to communicate with one another. However, that may be an implementation detail. I'd rather this be something that the CO handles, as I'm sure this won't be the only case. There are additional storage platforms, such as Ceph RBD and Fittedcloud (to name two), that require a Node host's participation in workflows relegated to the Controller in the CSI model.
@akutz I'm not familiar with the workings of Ceph/Fittedcloud, so it's hard for me to understand whether the NodeGetState RPC I've proposed would be sufficient for those cases. I also didn't describe very well what a
Multiple operations racing for the next available device path is unavoidable, but if the Node was the one performing the attach operation, it could immediately retry with successive (or random) device paths until the operation succeeded. If the Node isn't the one attaching the volume, the responsibility to retry is forced upon the CO.
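For example, a node plugin that owns the attach could absorb the race with a loop along these lines; attachVolume is a hypothetical callback standing in for the EBS AttachVolume call, and the device-name range is only an illustration.

```go
package sketch

import (
	"errors"
	"fmt"
)

// attachWithRetry walks candidate EBS device names until one attach
// succeeds. A failure is treated here as "slot already taken by a
// concurrent attach" and the next name is tried; a real implementation
// would inspect the error before retrying.
func attachWithRetry(volumeID string, attachVolume func(volumeID, device string) error) (string, error) {
	for c := 'f'; c <= 'p'; c++ { // /dev/xvdf .. /dev/xvdp
		device := fmt.Sprintf("/dev/xvd%c", c)
		if err := attachVolume(volumeID, device); err != nil {
			continue
		}
		return device, nil
	}
	return "", errors.New("no free device names left")
}
```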
That's a key observation, thanks for pointing that out. I take this to mean, firstly, that the split between Node Plugin and Controller Plugin is fundamentally necessary, and secondly, that there are SP workflows where the Controller Plugin needs ephemeral information from the Node in order to perform its actions. Does this sum it up?

Next question: do Ceph RBD, Fittedcloud, or others require bi-directional communication between the Controller Plugin and the Node Plugin? If so, a message bus sounds better than adding an RPC. Thanks again for the insights: I'm trying to stake out a solution space I'm unfamiliar with.
I am pretty familiar with Ceph RBD. Bi-directional comms between a controller and node plugin for Ceph are not necessarily required, but where it gets interesting is when it comes to auth keys. Ceph does not have a centralized API for attach/detach. A centralized controller can create/delete volumes, but attach and detach have to be done on the node. So for CSI, that would mean that attach and mount logic would need to occur in the node plugin.

So, if you wanted to have centralized credentials, then a mechanism to (securely) transport those credentials is required. Otherwise, each node plugin has to be given valid credentials locally.
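To illustrate the Ceph point: the node plugin ends up invoking rbd map locally, which is exactly where the credentials have to be available; the Ceph user and keyring path below are placeholders.

```go
package sketch

import (
	"os/exec"
	"strings"
)

// mapRBDImage maps an RBD image on the node, since Ceph has no central
// attach API. rbd map prints the resulting block device (e.g. /dev/rbd0).
func mapRBDImage(pool, image string) (string, error) {
	out, err := exec.Command(
		"rbd", "map", pool+"/"+image,
		"--id", "admin", // placeholder Ceph user
		"--keyring", "/etc/ceph/ceph.client.admin.keyring", // placeholder keyring path
	).Output()
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(out)), nil
}
```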
I wholeheartedly concur. It's almost as if...
:) We can discuss specifications and possibilities all we want, but there are real-world examples today in libStorage and REX-Ray where platforms are limited by their own backends. Ceph RBD, for example, has to issue all attach operations via the Node host. It's distributed block storage with a design pattern that more closely follows NFS.

I personally like the siloed separation of concerns presented by the Controller and Node plug-in models. However, I also know that if you want to support a centralized approach for all storage platforms, you have to meet them in the middle, as you cannot expect all storage platforms to fit your model. Or if you do, you will be waiting a while. Whether it's AWS EBS that requires Node-specific information for an otherwise centralizable attach op, or it's Ceph RBD that must issue attach operations from a Node host, there are cases where it would be useful for a Node and Controller to speak to one another.

Here's my sample case that I plan to introduce in a future CSI implementation of the EBS platform. Under the current spec, when the Controller receives a

It would be great if that workflow/logic/data was part of the spec, but I'm not sure it could be without running into the aforementioned concurrency issues. With a message queue the plug-ins could implement locking and other information sharing. Still, I'd prefer any solution to be a part of the spec where possible.
Right! I was referring to your suggestion but as it was in the original issue description I figured no need to say so explicitly. |
Hi @gpaul, No, I know. I was trying to be funny. All tone and intent is lost in text, sorry :( Thank you for restating it, since, as you said, it is a pretty big shift from the original intent of the issue. |
@akutz 👍
I'm not sure why
Hi @gpaul, I hate to keep bringing up libStorage, but it, if anything, is a treasure trove of lessons learned with respect to producing a container-storage orchestrator :)

Anyway, the entire purpose of libStorage's centralization of certain operations, such as a Volume Attach operation, was, exactly as you said, the restriction and centralization of credentials required to execute privileged operations. However, there are so many limitations with this model, starting with who can access these privileged endpoints.

The other aspect of centralization is centralized configuration. Credentials are a subset of that. However, the superset, the centralized configuration, can also be handled differently while solving the credential issue at the same time: distributed config. That's where I intended to take libStorage prior to the introduction of CSI. Products like etcd enable different response data based on the requestor's role. I intended to essentially get rid of the centralized libStorage controller and make every running libStorage process both a client and a controller, each accessing its config, and any possibly sensitive data, via a distributed config product like etcd. The natural side effect of this design decision is that you also have distributed, persistent storage for your project, on which you can build message queues, distributed locking, etc.
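A minimal sketch of that distributed-config idea using the etcd v3 Go client; the import path, endpoint, and role-based key layout are all assumptions for illustration.

```go
package sketch

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// loadConfig lets each plugin process pull its own, role-scoped
// configuration (credentials included) from a shared etcd cluster
// instead of asking a central controller.
func loadConfig(ctx context.Context, role string) (map[string]string, error) {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return nil, err
	}
	defer cli.Close()

	resp, err := cli.Get(ctx, "/config/"+role+"/", clientv3.WithPrefix())
	if err != nil {
		return nil, err
	}
	cfg := make(map[string]string)
	for _, kv := range resp.Kvs {
		cfg[string(kv.Key)] = string(kv.Value)
	}
	return cfg, nil
}
```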
Hi All, if possible, here are the resolutions I would like to see occur for this issue in the immediate future. Either:
@akutz keep the REX-Ray and libStorage examples coming, real stories from the trenches are golden. I find concrete examples illuminating. They eliminate infeasible solutions early. Also, nice list.
I suggest we go with option 4 from Andrew's list. Kubernetes does not support transfer of a device map between node hosts and controller, and the Kubernetes EBS plugin gets around this by managing a map in-tree. This is not perfect, but it gets the job done without complicating the CSI API further. We can revisit this in the future, if needed (expanding the API is always easier than trimming it).
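For reference, the in-tree bookkeeping amounts to roughly the allocator below on the controller side; this is a simplified sketch, not the actual Kubernetes code, which also reconciles against the instance's reported block-device mappings.

```go
package sketch

import (
	"errors"
	"sync"
)

// deviceAllocator hands out EBS device names and remembers which
// volume holds each one, so concurrent attaches issued by the same
// process don't pick the same slot.
type deviceAllocator struct {
	mu    sync.Mutex
	inUse map[string]string // device suffix ("f".."p") -> volume ID
}

func (a *deviceAllocator) allocate(volumeID string) (string, error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.inUse == nil {
		a.inUse = make(map[string]string)
	}
	for c := 'f'; c <= 'p'; c++ {
		suffix := string(c)
		if _, taken := a.inUse[suffix]; !taken {
			a.inUse[suffix] = volumeID
			return "/dev/xvd" + suffix, nil
		}
	}
	return "", errors.New("no device names left")
}
```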
I also vote for option 4 |
+1 to option 4. |
Given the consensus, I don't think there are follow-up items for this. Closing.