spec: LocalDevices, NextDevice, and NodeGetCapabilities #84


Closed
akutz opened this issue Aug 16, 2017 · 23 comments

Comments

@akutz
Contributor

akutz commented Aug 16, 2017

AWS Elastic Block Store (EBS) requires the path to the next available device when attaching a volume. Additionally, it's useful for a Controller plug-in to be aware of the local device data from a Node host. I'd like to propose that NodeGetCapabilities be extended, or some new message be introduced, that requires a CO to fetch data from Node hosts to inform a Controller plug-in on a Controller host of a Node's next available device path and the current state of a Node host's local devices.

@julian-hj
Contributor

This feels very similar to the GetNodeID requirement, except that I think we intended the result of GetNodeID to be cacheable by the CO. Maybe we could solve this by adding some cache-control semantics to GetNodeID?

@akutz
Contributor Author

akutz commented Aug 16, 2017

Hi @julian-hj,

Caching is in fact why I filed #71 and raised it for discussion in the most recent CSI public forum. @saad-ali filed a TODO, but only to make it explicit that the NodeID is immutable:

TODO: Make it clear that NodeID and other IDs can be cached and are immutable.

So yeah, I agree that GetNodeID makes sense, and I'd planned to piggy-back off of that, but others determined that its data should be immutable.

@julian-hj
Contributor

This feels to me like a compelling use case for reversing that decision about immutable node ids, and maybe also changing the name to GetNodeInfo, along with some bits in the response to let the CO know if the response can be re-used or not. I can't speak for others, but I certainly didn't have this use case in mind when we made that decision to make the IDs explicitly immutable.

@codenrhoden

Looking through the spec, and at all instances of NodeID, it appears to me that the purpose of the NodeID is to provide it to a controller when it is required in order to have that controller attach/detach volumes from remote nodes (e.g. "controller, please attach volume X to node Y"). Its secondary purpose comes up in error messages, when you want to provide details about nodes that already have the volume attached.

In this sense, it does make sense to me that the NodeID is small and immutable: the smallest amount of info that uniquely identifies a node to the plugin, and since it's immutable, it can be cached.

The need for knowing nextDevice is real, as EBS requires it. The only time it is needed is for ControllerPublishVolume, I believe. The CO has a lot to juggle in this scenario, as it needs to go get nextDevice from the Node service (via some method that does not exist today), then do ControllerPublishVolume, then go back to the node for NodePublishVolume. To me, this makes a case for adding a new Node RPC, and for a controller to indicate via capabilities that it requires local/next device info. It could be an optional field in ControllerPublishVolumeRequest. Perhaps that way a CO can make that extra step to the node (e.g. getLocalDevices) optional for plugins that don't need it.
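
To make that extra round trip concrete, here is a minimal sketch of what the CO-side flow could look like, written in Go with made-up types; the NodeGetLocalDevices RPC and the next-device field do not exist in the spec today and are purely illustrative:

```go
package coflow

import (
	"context"
	"fmt"
)

// Hypothetical, simplified stand-ins for CSI messages. The spec today has no
// NodeGetLocalDevices RPC and no next-device field on ControllerPublishVolume.
type LocalDevices struct {
	NextDevice string            // e.g. "/dev/xvdf"
	Devices    map[string]string // device path -> attached volume ID
}

type nodeClient interface {
	NodeGetLocalDevices(ctx context.Context) (*LocalDevices, error)
}

type controllerClient interface {
	// nextDevice stands in for the proposed optional field on
	// ControllerPublishVolumeRequest.
	ControllerPublishVolume(ctx context.Context, volumeID, nodeID, nextDevice string) error
}

// publish shows the extra hop the CO would make when a plugin advertises
// (via capabilities) that it needs local/next device info from the node.
func publish(ctx context.Context, node nodeClient, ctrl controllerClient, volumeID, nodeID string) error {
	devs, err := node.NodeGetLocalDevices(ctx) // 1. ask the Node service
	if err != nil {
		return fmt.Errorf("node get local devices: %w", err)
	}
	// 2. controller-side attach, carrying the node-reported device path
	if err := ctrl.ControllerPublishVolume(ctx, volumeID, nodeID, devs.NextDevice); err != nil {
		return fmt.Errorf("controller publish: %w", err)
	}
	// 3. NodePublishVolume would follow, back on the node.
	return nil
}
```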

@gpaul

gpaul commented Aug 17, 2017

AWS Elastic Block Store (EBS) requires the path to the next available device when attaching a volume.

Is this something that the CO needs to know about? It sounds like this decision can be made local to the Node as an implementation detail of the CSI plugin for EBS.

[...] inform a Controller plug-in on a Controller host of a Node's next available device path and the current state of a Node host's local devices.

Won't this information be stale before it reaches the Controller host? This sounds like a race to me.

@jdef
Member

jdef commented Aug 17, 2017

NodeID is intended to remain small and immutable. If there's a need for CSI to expose additional node state then we'd need to define a new type and RPC to fetch it.

Maybe it's worth stepping back and re-casting the problem space:

Given:

  1. there are two plugin instance types, one instance of each: (a) identity+controller; (b) identity+node
  2. plugin(a) (on a master) needs information from the node (agent) on which plugin(b) is running
  3. CSI endpoints are explicitly local to the node upon which they're running (via UNIX socket)
  4. CSI doesn't provide RPCs to facilitate communication between distributed instances of a plugin - that's left as an exercise for the plugin implementation.

There's no OOTB strategy for plugin(a) to directly communicate to plugin(b) over the same (UNIX socket) endpoint that plugin(b) exposes for CSI.

The proposal to essentially hack NodeID in order to propagate sideband, ephemeral state feels very dirty and outside the scope of what NodeID was originally intended for: to represent a Node's identity (that shall not change).

Alternative proposal 1: extend CSI Node service with a NodeGetState RPC; extend ControllerPublishVolumeRequest to include a NodeState field. This doesn't resolve the possible race that @gpaul identified -- worst case, the controller publish call fails (because the EBS device slot is already populated) and so perhaps the CO backs up and tries again, beginning with NodeGetState.
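
For illustration only, proposal 1 might shape up roughly as below; the messages are expressed as Go structs standing in for the protobuf definitions, and NodeGetState, NodeState, and the node-state field are all hypothetical names:

```go
package csisketch

// NodeState is a hypothetical complement to NodeID: ephemeral, node-local
// facts that MAY change across invocations and so must never be cached.
type NodeState struct {
	NextDevice   string            // next unused device path, e.g. "/dev/xvdf"
	LocalDevices map[string]string // currently attached device path -> volume ID
}

// NodeGetStateResponse would be returned by the proposed Node RPC NodeGetState.
type NodeGetStateResponse struct {
	State NodeState
}

// ControllerPublishVolumeRequest is shown only with the fields relevant here;
// the hypothetical NodeState field carries the ephemeral data the CO fetched
// from the node immediately before calling the controller.
type ControllerPublishVolumeRequest struct {
	VolumeID  string
	NodeID    string
	NodeState *NodeState // optional; set only for plugins that require it
}
```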

Alternative proposal 2: extend CSI with messaging RPCs (gated by *Capabilities) that facilitate unreliable, at-least-once messaging between CSI plugin instances, whereby the CO subscribes to message streams from plugin instances and routes messages received from one plugin instance to another. This proposal probably enables tons of other use cases as well, though honestly this seems outside of the stated goals of CSI and is otherwise too much to add for a v1.

@julian-hj
Contributor

julian-hj commented Aug 17, 2017

I am not sure I agree that changing GetNodeID to GetNodeInfo is a hack any more than GetNodeID itself is a hack. The basic intent of GetNodeID was to provide information about the node that the controller plugin is not well positioned to gather. We assumed that that information could be immutable and that we could treat it as an ID, but this issue seems to point to a use case where perhaps that assumption is invalid.

So the question for me is, do we really want to have 2 separate RPCs and an additional create parameter to handle the set of use cases where the controller plugin needs extra information about the node, or would one RPC suffice?

Regarding the race condition, yes I suppose this is a real problem. The obvious answer would be to simply move the attach logic out of ControllerPublishVolume and into NodePublishVolume and then use some sort of local locking on the node, but that would require the node plugin to have AWS keys and network access to the AWS EC2 API. I am not sure that's a viable answer for every CO, but if it is, it would render the whole issue moot, I think.

@clintkitson

clintkitson commented Aug 17, 2017

I agree with @jdef's assessment re proposal 1. As @codenrhoden calls out, it feels necessary to include it in capabilities so that the CO is aware it must run this info-collection step on the node prior to an attach and then pass that information to the attach call. Regarding the race condition, it is present today in K8s.

You can see the K8s AWS code has an expected race condition if the same device is captured for two requests.
https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L1294-L1298

Short of more complex solutions, it could be called out that 1) the CO should retry under certain errors, and 2) the burden is on the plugin to minimize the possibility of race conditions.

Here, the AWS plugin could be modified to randomly choose device names rather than choosing only the next available one.
https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/device_allocator.go#L77-L79
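
A rough sketch of that "choose randomly" variant (illustrative Go, not the actual Kubernetes allocator, and the /dev/xvdb..z range is an assumption):

```go
package devalloc

import (
	"fmt"
	"math/rand"
)

// pickDeviceName returns a random free EBS device name instead of the next
// one in order, shrinking (but not eliminating) the window in which two
// concurrent attaches pick the same name. inUse holds suffixes already
// assigned on the node, e.g. {"f": true} for /dev/xvdf.
func pickDeviceName(inUse map[string]bool) (string, error) {
	var free []string
	for c := 'b'; c <= 'z'; c++ {
		if s := string(c); !inUse[s] {
			free = append(free, s)
		}
	}
	if len(free) == 0 {
		return "", fmt.Errorf("no device names available")
	}
	return "/dev/xvd" + free[rand.Intn(len(free))], nil
}
```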

@akutz
Contributor Author

akutz commented Aug 17, 2017

FWIW, there will always be race conditions in play unless the CO completely owns concurrency with respect to parallel volume operations, including post-crash. Unless a storage platform supports a distributed locking mechanism that the CO or the Plug-in can leverage, there is always the potential for some failure during the workflow due to another, similar operation occurring first.

Additionally, as @julian-hj said, unless we move certain operations to the Node host / Node plug-in, the fact that a Controller plug-in will need information available only from the Node host will result in these situations. The other option is introducing a message queue that Controller and Node plug-ins can use in order to communicate with one another. However, that may be an implementation detail. I'd rather this be something that the CO handles, as I'm sure this won't be the only case. There are additional storage platforms such as Ceph RBD and Fittedcloud (to name two) that require a Node host's participation in workflows relegated to the Controller in the CSI model.

@jdef
Member

jdef commented Aug 17, 2017

@akutz I'm not familiar with the workings of Ceph/Fittedcloud so it's hard for me to understand whether the NodeGetState RPC I've proposed would be sufficient for those cases. I also didn't describe very well what a NodeState object would entail - my goal was that it would be complementary to NodeID and would contain ephemeral state that MAY change across invocations.

@gpaul

gpaul commented Aug 17, 2017

Multiple operations racing for the next available device path is unavoidable, but if the Node was the one performing the attach operation, it could immediately retry with successive (or random) device paths until the operation succeeded. If the Node isn't the one attaching the volume, the responsibility to retry is forced upon the CO.

Additionally, as @julian-hj said, unless we move certain operations to the Node host / Node plug-in, the fact that a Controller plug-in will need information available only from the Node host will result in these situations.

That's a key observation, thanks for pointing that out. I take this to mean that, firstly, the split between Node Plugin and Controller Plugin is fundamentally necessary, and secondly, that there are SP workflows where the Controller Plugin needs ephemeral information from the Node in order to perform its actions.

Does this sum it up?

Next question: do Ceph RBD, Fittedcloud or others require bi-directional communication between the Controller Plugin and Node Plugin? If so, a message bus sounds better than adding an RPC.

Thanks again for the insights: I'm trying to stake out a solution space I'm unfamiliar with.

@codenrhoden

I am pretty familiar with Ceph RBD. Bi-directional comms between a controller and node plugin for Ceph are not necessarily required, but where it gets interesting is when it comes to auth keys.

Ceph does not have a centralized API for attach/detach. A centralized controller can create/delete volumes, but attach and detach have to be done on the node. So for CSI, that would mean that attach and mount logic would need to occur in NodePublishVolume. However, in order to perform that attach, a node has to have a cephx key (the same idea as an AWS credential). This does not have to be the same credential used by the controller to create/delete a volume.

So, if you wanted to have centralized credentials, then a mechanism to (securely) transport those credentials is required. Otherwise, each node plugin has to be given valid credentials locally.

@akutz
Contributor Author

akutz commented Aug 17, 2017

If so, a message bus sounds better than adding an RPC.

I wholeheartedly concur. It's almost as if...

The other option is introducing a message queue that Controller and Node plug-ins can use in order to communicate with one another.

:)

We can discuss specifications and possibilities all we want, but there are real-world examples today in libStorage and REX-Ray where platforms are limited by their own backends. Ceph RBD, for example, has to issue all attach operations via the Node Host. It's distributed block storage with a design pattern that more closely follows NFS.

I personally like the siloed separation of concerns presented by the Controller and Node plug-in models. However, I also know that if you want to support a centralized model for all storage platforms, you have to meet them in the middle, as you cannot expect all storage platforms to fit your model. Or if you do, you will be waiting a while. Whether it's AWS EBS that requires Node-specific information for an otherwise centralizable attach op, or it's Ceph RBD that must issue attach operations from a Node host, there are cases where it would be useful for a Node and Controller to speak to one another.

Here's my sample case that I plan to introduce in a future CSI implementation of the EBS platform. Under the current spec, when the Controller receives a ControllerPublishVolume request, the message lacks the next device information necessary to do the attach. However, the request does contain the NodeID. I plan on introducing something like etcd, consul, libkv, etc. to provide a message queue that all participating plug-ins will use. The Controller will place a message on the queue that says "Hi, could NodeID=XYZ please tell me your next available device?". The Node host will see that message and say "Huh, I'm NodeID=XYZ, here you go!" and place a response back onto the queue, with the next device ID.
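
A rough sketch of that request/response pattern over a shared key-value store; the kv interface and key layout here are invented for illustration, and a real implementation would sit on etcd, consul, or libkv:

```go
package sideband

import (
	"context"
	"fmt"
	"time"
)

// kv abstracts whatever shared store (etcd, consul, libkv) the plug-ins use.
type kv interface {
	Put(ctx context.Context, key, value string) error
	// WaitFor blocks until the key exists (or ctx expires) and returns its value.
	WaitFor(ctx context.Context, key string) (string, error)
}

// requestNextDevice is the Controller plug-in side: on ControllerPublishVolume
// it posts a question keyed by NodeID and waits for the Node plug-in's answer.
func requestNextDevice(ctx context.Context, store kv, nodeID string) (string, error) {
	reqKey := fmt.Sprintf("csi/next-device/request/%s", nodeID)
	respKey := fmt.Sprintf("csi/next-device/response/%s", nodeID)

	if err := store.Put(ctx, reqKey, "next-device?"); err != nil {
		return "", fmt.Errorf("post request: %w", err)
	}
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	return store.WaitFor(ctx, respKey) // e.g. "/dev/xvdf"
}

// answerNextDevice is the Node plug-in side: it waits for a request addressed
// to its own NodeID and answers with the node's next free device path.
func answerNextDevice(ctx context.Context, store kv, nodeID string, nextDevice func() string) error {
	reqKey := fmt.Sprintf("csi/next-device/request/%s", nodeID)
	respKey := fmt.Sprintf("csi/next-device/response/%s", nodeID)

	if _, err := store.WaitFor(ctx, reqKey); err != nil {
		return err
	}
	return store.Put(ctx, respKey, nextDevice())
}
```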

It would be great if that workflow/logic/data were part of the spec, but I'm not sure it could be without running into the aforementioned concurrency issues. With a message queue the plug-ins could implement locking and other information sharing.

Still, I'd prefer any solution to be a part of the spec where possible.

@gpaul

gpaul commented Aug 17, 2017

It's almost as if...

Right! I was referring to your suggestion, but as it was in the original issue description I figured there was no need to say so explicitly.

@akutz
Contributor Author

akutz commented Aug 17, 2017

Hi @gpaul,

No, I know. I was trying to be funny. All tone and intent is lost in text, sorry :( Thank you for restating it, since, as you said, it is a pretty big shift from the original intent of the issue.

@gpaul

gpaul commented Aug 17, 2017

@akutz 👍

the split between Node Plugin and Controller Plugin is fundamentally necessary

I'm not sure why ControllerPublishVolume exists and is furthermore required. I imagine it is to prevent credentials from passing through a compromised Node. I wonder if security concerns could be addressed differently, thereby opening the door to #86

@akutz
Contributor Author

akutz commented Aug 17, 2017

Hi @gpaul,

I hate to keep bringing up libStorage, but it, if anything, is a treasure trove of lessons learned with respect to producing a container-storage orchestrator :) Anyway, the entire purpose of libStorage's centralization of certain operations, such as a Volume Attach operation, was, exactly as you said, the restriction and centralization of credentials required to execute privileged operations.

However, there are so many limitations with this model, just starting with who can access these privileged endpoints.

The other aspect of centralization is centralized configuration. Credentials are a subset of that. However, the superset, the centralized configuration, can also be handled differently, solving the credential issue at the same time.

Distributed config.

That's where I intended to take libStorage prior to the introduction of CSI. Products like etcd enable different response data based on the requestor's role. I intended to essentially get rid of the centralized libStorage controller and make every running libStorage process both a client and a controller, each accessing its config, and any possibly sensitive data, via a distributed config product like etcd.

The natural side-effect of this design decision is that you also now have distributed, persistent storage for your project, on which you can build message queues, distributed locking, etc.

@akutz
Contributor Author

akutz commented Aug 17, 2017

Hi All,

If possible, here are the resolutions I would like to see occur for this issue in the immediate future. Either:

  1. The CSI spec is enhanced to include transporting a data map from the Node host to the Controller host for certain operations AND the CO owns all locking for operations on a given volume and the CO is responsible for recovering from a crash.
  2. The CSI spec is enhanced to include transporting a data map from the Node host to the Controller host for certain operations AND the group admits there is a known issue with possible race conditions as a result of this enhancement and they will be addressed at a later date.
  3. The group acknowledges there are some storage platforms that are simply not supported in a CSI Centralized model due to operations that occur on the Controller needing information from a Node.
  4. The group acknowledges that platforms that have the aforementioned requirements must provide solutions as part of their plug-in implementations, possibly requiring external support software such as etcd, etc.

@gpaul

gpaul commented Aug 17, 2017

@akutz keep the REX-Ray and libStorage examples coming, real stories from the trenches are golden. I find concrete examples illuminating. They eliminate infeasible solutions early. Also, nice list.

@saad-ali
Member

I suggest we go with option 4 from Andrew's list. Kubernetes does not support transfer of a device map between node hosts and controller, and the Kubernetes EBS plugin gets around this by managing a map in-tree. This is not perfect, but it gets the job done without complicating the CSI API further. We can revisit this in the future, if needed (expanding the API is always easier than trimming it).

@jdef
Member

jdef commented Oct 30, 2017

I also vote for option 4

@cpuguy83
Contributor

+1 to option 4.

@saad-ali
Member

saad-ali commented Nov 1, 2018

Given the consensus, I don't think there are follow-up items for this. Closing.

saad-ali closed this as completed Nov 1, 2018