Skip to content

spec: Fix the error codes related to idempotency #160

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Dec 19, 2017

Conversation

jieyu
Copy link
Member

@jieyu jieyu commented Dec 7, 2017

The patch fixed the error codes that are related to idempotency:

(1) For CreateVolume, if the volume already exists and is compatible, return OK instead. If the volume exists but not compatible, return ALREADY_EXISTS.
(2) For DeleteVolume, if the volume does not exist, return OK instead.
(3) For ControllerUnpublishVolume, if the volume is already detached from the node, return OK instead.

Fixes #157
xref #158

jieyu added 2 commits December 6, 2017 17:55
The patch fixed the error codes that are related to idempotency:

(1) For `CreateVolume`, if the volume already exists and is compatible,
return OK instead. If the volume exists but not compatible, return
ALREADY_EXISTS.

(2) For `DeleteVolume`, if the volume does not exist, return OK instead.

(3) For `ControllerUnpublishVolume`, if the volume is already detached
from the node, return OK instead.

Fixes container-storage-interface#157
xref container-storage-interface#158
@jieyu
Copy link
Member Author

jieyu commented Dec 7, 2017

@@ -724,8 +725,8 @@ If the plugin is unable to complete the ControllerPublishVolume call successfull
If the conditions defined below are encountered, the plugin MUST return the specified gRPC error code.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call is idempotent, too. Is there similar reasoning that can be applied that allows OK to be returned? Perhaps it's very SP-specific whether or not that is possible.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case it likely isn't sufficient to check volume_capability as the user_credentials used to publish the volume to a node may impose different access policies.

Copy link

@gpaul gpaul Dec 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps

If the plugin determines that the volume has already been published to the node in a way that satisfies the current request it MAY respond with 0 OK.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fix the wording and added a new error code. PTAL

Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Condition | gRPC Code | Description | Recovery Behavior |
|-----------|-----------|-------------|-------------------|
| Volume does not exists | 5 NOT_FOUND | Indicates that a volume corresponding to the specified `volume_id` does not exist. | Caller MUST verify that the `volume_id` is correct and that the volume is accessible and has not been deleted before retrying with exponential back off. |
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NodePublishVolume is also idempotent and should respond 0 OK if the request is already satisfied by an existing mount.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comments below. It's slightly different than volume not found case. I'll add some wording at the top about this.

Condition | gRPC Code | Description | Recovery Behavior
| --- | --- | --- | --- |
| Condition | gRPC Code | Description | Recovery Behavior |
|-----------|-----------|-------------|-------------------|
| Volume does not exists | 5 NOT_FOUND | Indicates that a volume corresponding to the specified `volume_id` does not exist. | Caller MUST verify that the `volume_id` is correct and that the volume is accessible and has not been deleted before retrying with exponential back off. |
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NodeUnpublishVolume should respond 0 OK if the volume is not mounted at the target_path.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is slightly different than the NOT_FOUND case, where the volume does not exist. I'll add some wording at the top to say that if the volume published at target_path has already been unpublished, the plugin should return OK.

spec.md Outdated
| Volume already exists | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified volume `name` already exists. Plugin MUST also return a valid `CreateVolumeResponse`. | Caller MUST assume the `CreateVolume` call succeeded. |
| Condition | gRPC Code | Description | Recovery Behavior |
|-----------|-----------|-------------|-------------------|
| Volume already exists but incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified volume `name` already exists but is incompatible with the specified `capacity_range`, `volume_capabilities` and `parameters`. | Caller MUST fix the arguments or use a different `name` before retrying. |
Copy link
Contributor

@cpuguy83 cpuguy83 Dec 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "Volume already exists but is incompatible"

Copy link
Contributor

@cpuguy83 cpuguy83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM
Just a minor wording nit.

Copy link
Contributor

@julian-hj julian-hj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jieyu
Copy link
Member Author

jieyu commented Dec 7, 2017

Comments addressed. PTAL again. I slightly changed the error code for ControllerPublishVolume to distinguish the following two cases:

  1. volume already published to the same node, but with incompatble arguments (ALREADY_EXISTS)
  2. volume already published to a different node (ABORTED)

Now, the idempotency behavior is more consistent in my mind.

Copy link
Contributor

@julian-hj julian-hj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM again

spec.md Outdated
@@ -434,7 +434,7 @@ A Controller Plugin MUST implement this RPC call if it has `CREATE_DELETE_VOLUME
This RPC will be called by the CO to provision a new volume on behalf of a user (to be consumed as either a block device or a mounted filesystem).

This operation MUST be idempotent.
If a volume corresponding to the specified volume `name` already exists and is compatible with the specified `capacity_range`, `volume_capabilities` and `parameters` in the `CreateVolumeRequest`, the Plugin MUST reply `0 OK` with the corresponding `CreateVolumeResponse`.
If a volume corresponding to the specified volume `name` already exists and is compatible with the specified `capacity_range`, `volume_capabilities`, `parameters` and `user_credentials` in the `CreateVolumeRequest`, the Plugin MUST reply `0 OK` with the corresponding `CreateVolumeResponse`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not convinced that user_credentials should be considered part of the "consistent request" parameters. credentials MAY change over time (e.g. rotation) but the other create-volume parameters already listed here should be consistent/stable for an idempotent request.

spec.md Outdated
@@ -605,7 +605,7 @@ The CO MUST implement the specified error recovery behavior when it encounters t

| Condition | gRPC Code | Description | Recovery Behavior |
|-----------|-----------|-------------|-------------------|
| Volume already exists but incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified volume `name` already exists but is incompatible with the specified `capacity_range`, `volume_capabilities` and `parameters`. | Caller MUST fix the arguments or use a different `name` before retrying. |
| Volume already exists but is incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified volume `name` already exists but is incompatible with the specified `capacity_range`, `volume_capabilities`, `parameters` or `user_credentials`. | Caller MUST fix the arguments or use a different `name` before retrying. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto re: user_credentials

spec.md Outdated
@@ -667,6 +667,8 @@ The Plugin SHOULD perform the work that is necessary for making the volume avail
The Plugin MUST NOT assume that this RPC will be executed on the node where the volume will be used.

This operation MUST be idempotent.
If the volume corresponding to the `volume_id` has already been published at the node corresponding to the `node_id`, and is compatible with the specified `volume_capability`, `readonly` flag and `user_credentials`, the Plugin MUST reply `0 OK`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto re: user_credentials

spec.md Outdated
@@ -729,7 +731,8 @@ The CO MUST implement the specified error recovery behavior when it encounters t
|-----------|-----------|-------------|-------------------|
| Volume does not exists | 5 NOT_FOUND | Indicates that a volume corresponding to the specified `volume_id` does not exist. | Caller MUST verify that the `volume_id` is correct and that the volume is accessible and has not been deleted before retrying with exponential back off. |
| Node does not exists | 5 NOT_FOUND | Indicates that a node corresponding to the specified `node_id` does not exist. | Caller MUST verify that the `node_id` is correct and that the node is available and has not been terminated or deleted before retrying with exponential backoff. |
| Volume published to another node | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified `volume_id` is already attached to another node and does not support multi-node attach. If this error code is returned, the Plugin SHOULD specify the `node_id` of the node the volume is already attached to as part of the gRPC `status.message`. | Caller SHOULD ensure the specified volume is not attached to any other node before retrying with exponential back off. |
| Volume published but is incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified `volume_id` has already been published at the node corresponding to the specified `volume_id` but is incompatible with the specified `volume_capability`, `readonly` flag or `user_credentials`. | Caller MUST fix the arguments before retying. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto re: user_credentials

spec.md Outdated
@@ -1026,17 +1029,19 @@ The Plugin SHALL assume that this RPC will be executed on the node where the vol
If the corresponding Controller Plugin has `PUBLISH_UNPUBLISH_VOLUME` controller capability, the CO MUST guarantee that this RPC is called after `ControllerPublishVolume` is called for the given volume on the given node and returns a success.

This operation MUST be idempotent.
If the volume corresponding to the `volume_id` has already been published at the specified `target_path`, and is compatible with the specified `volume_capability`, `readonly` flag and `user_credentials`, the Plugin MUST reply `0 OK`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto re: user_credentials

spec.md Outdated
@@ -1100,11 +1105,11 @@ The CO MUST implement the specified error recovery behavior when it encounters t
| Condition | gRPC Code | Description | Recovery Behavior |
|-----------|-----------|-------------|-------------------|
| Volume does not exists | 5 NOT_FOUND | Indicates that a volume corresponding to the specified `volume_id` does not exist. | Caller MUST verify that the `volume_id` is correct and that the volume is accessible and has not been deleted before retrying with exponential back off. |
| Volume published but is incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified `volume_id` has already been published at the specified `target_path` but is incompatible with the specified `volume_capability`, `readonly` flag or `user_credentials`. | Caller MUST fix the arguments before retying. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto re: user_credentials

@jieyu
Copy link
Member Author

jieyu commented Dec 7, 2017

@julian-hj @cpuguy83 @saad-ali please take a second look. Thanks!

spec.md Outdated
| Volume does not exists | 5 NOT_FOUND | Indicates that a volume corresponding to the specified `volume_id` does not exist. | Caller MUST verify that the `volume_id` is correct and that the volume is accessible and has not been deleted before retrying with exponential back off. |
| Node does not exists | 5 NOT_FOUND | Indicates that a node corresponding to the specified `node_id` does not exist. | Caller MUST verify that the `node_id` is correct and that the node is available and has not been terminated or deleted before retrying with exponential backoff. |
| Volume published to another node | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified `volume_id` is already attached to another node and does not support multi-node attach. If this error code is returned, the Plugin SHOULD specify the `node_id` of the node the volume is already attached to as part of the gRPC `status.message`. | Caller SHOULD ensure the specified volume is not attached to any other node before retrying with exponential back off. |
| Volume published but is incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified `volume_id` has already been published at the node corresponding to the specified `volume_id` but is incompatible with the specified `volume_capability` or `readonly` flag . | Caller MUST fix the arguments before retying. |
| Volume published to another node | 10 ABORTED | Indicates that a volume corresponding to the specified `volume_id` has already been published at another node and does not have MULTI_NODE volume capability. If this error code is returned, the Plugin SHOULD specify the `node_id` of the node at which the volume is published as part of the gRPC `status.message`. | Caller SHOULD ensure the specified volume is not published at any other node before retrying with exponential back off. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that this should be ABORTED. This sounds like FAILED_PRECONDITION to me.

   // A litmus test that may help a service implementor in deciding
    // between FailedPrecondition, Aborted, and Unavailable:
    //  (a) Use Unavailable if the client can retry just the failing call.
    //  (b) Use Aborted if the client should retry at a higher-level
    //      (e.g., restarting a read-modify-write sequence).
    //  (c) Use FailedPrecondition if the client should not retry until
    //      the system state has been explicitly fixed.  E.g., if an "rmdir"
    //      fails because the directory is non-empty, FailedPrecondition
    //      should be returned since the client should not retry unless
    //      they have first fixed up the directory by deleting files from it.
    //  (d) Use FailedPrecondition if the client performs conditional
    //      REST Get/Update/Delete on a resource and the resource on the
    //      server does not match the condition. E.g., conflicting
    //      read-modify-write on the same resource.

The case sounds like "c" or "d" in the example and the CO should not retry until things are fixed... e.g. maybe a unpublish request failed on another node and needs to be re-performed and then retry the publish.

Copy link
Member Author

@jieyu jieyu Dec 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot use FAILED_PRECONDITION because this is what we use for operation pending case. For each rpc, each error status can only have one recovery behavior. Otherwise, the CO won't know how to recover based on the error status.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jieyu,

Hmm, and this ties in neatly with my recommended approach below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really, really dislike that broad-use error codes are being restricted to mean a single error state when we have the ability to specify the error details in a more RPC-specific manner. Please see below for more information.

@akutz
Copy link
Contributor

akutz commented Dec 7, 2017

I want to review this PR, but I'm at Kubecon. I'd ask you please leave it unmerged until at least tomorrow morning so I have some time away from the conf. to review it this evening. Thank you!

@akutz
Copy link
Contributor

akutz commented Dec 7, 2017

Hi @jieyu,

I took a first pass look at this, and my initial response is the error codes you're attempting to qualify should perhaps be replaced or enhanced with additional details marshalled in the gRPC error details? For example:

// Error details for ControllerPublishVolume
message ControllerPublishVolumeError {
  enum Type {
    UNKNOWN = 0;
    INCOMPATIBLE = 1;
    // Indicates that a volume corresponding to the specified `volume_id` has 
    // already been published  at another node and does not have MULTI_NODE 
    // volume capability.
    // If this error code is returned then the Plugin SHOULD set the `node_id`
    // field to the ID of the node to which the volume is published.
    PUBLISHED_TO_ANOTHER_NODE = 2;
  }
  Type type = 1;
  string node_id = 2;
}

Something akin to the above would be consistent with the intent of the gRPC error model and add structure to the error details instead of expecting a free-form error message field to contain a value without any means of enforcing the value's purpose.

@akutz
Copy link
Contributor

akutz commented Dec 7, 2017

Hi @jieyu,

As an addendum to my previous response...I just do not like the way that certain gRPC error codes can be easily overused and imply multiple error states. It seems like this is a direction that will eventually result in a difficulty to properly deduce the actual error without resorting to inspecting free-form error message fields.

@jdef jdef added this to the v0.1 milestone Dec 7, 2017
@jieyu
Copy link
Member Author

jieyu commented Dec 8, 2017

@akutz, yes, I saw your suggestion and appreciate your feedback!

Discussed in today's WG meeting on this. All CO agreed that it's better not to introduce error details in v0.1. In fact, error detailed should be avoided if possible if error status can be used to distinguish different failure scenarios.

@cpuguy83 @julian-hj @jdef @saad-ali I revised the spec again based on today's discussion, PTAL.

@akutz
Copy link
Contributor

akutz commented Dec 8, 2017

Hi @jieyu,

I disagree. If you buy into error codes in v0.1 then it becomes difficult to move away from those to more discreet error detail later.

I’m troubled by this community’s reluctance to use the gRPC error details for any purpose — whether to encode idemptotent results or additional error information.

Why are we avoiding the value the new gRPC error mode provides?

@akutz
Copy link
Contributor

akutz commented Dec 8, 2017

Also, what discussion today? It’s Thursday. Are you referring to yesterday? Was there a call? Many in the community are at Kubecon and not able to participate. I find the notion of a call when people are unavailable to contribute to the discussion somewhat akin to backroom politics.

@jieyu
Copy link
Member Author

jieyu commented Dec 12, 2017

ping @julian-hj @saad-ali @cpuguy83

Copy link
Contributor

@cpuguy83 cpuguy83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jieyu
Copy link
Member Author

jieyu commented Dec 19, 2017

ping @saad-ali, any further comments?

@saad-ali
Copy link
Member

Nope. LGTM Thanks!!

@saad-ali saad-ali merged commit 9e88e4b into container-storage-interface:master Dec 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Idempotent functions cannot return both data & errors
7 participants