Missing error code (for volume lifecycle, plugin-specific behavior, etc.). #23

Closed
chhsia0 opened this issue May 27, 2017 · 16 comments

chhsia0 (Contributor) commented May 27, 2017

Should we add new error codes to indicate a plugin-specific error for each RPC call?

DeleteVolume: When a volume is in the NODE_READY or PUBLISHED state, what error code should we return?

ControllerUnpublishVolume: When a volume is in the PUBLISHED state, what error code should we return?

ListVolumes: Should we add a ListVolumesError?

Misc:
- What is UNSUPPORTED_VOLUME_TYPE for?
- GetNodeIDError and ProbeNodeError seem redundant.

jdef (Member) commented May 31, 2017

Don't we already have a general/generic error code that implementations can use for vendor-specific codes?

ListVolumes: Are there specific errors that you'd like to see? How would the recovery semantics differ across such errors?

chhsia0 (Contributor, Author) commented Jun 1, 2017

For vendor-specific error handling: the closest error code we have is GeneralError::UNDEFINED. I guess a plugin can use this to return any vendor-specific error. For managing volume lifecycles, I personally prefer using more explicit error codes such as VOLUME_ALREADY_PUBLISHED.

For ListVolumes, maybe we can have an INVALID_TOKEN. Or maybe it's OK to just use GeneralError::UNDEFINED.

jdef (Member) commented Jun 2, 2017

Specific error codes are only useful if there are meaningful recovery semantics. Do you have any recommendations for recovery semantics for additional error codes that you'd find useful?

cpuguy83 (Contributor) commented Jun 2, 2017

How come we aren't using grpc error codes?

jdef (Member) commented Jun 2, 2017 via email

cpuguy83 (Contributor) commented Jun 2, 2017

gRPC error codes are a standard way to communicate classes of errors across the RPC boundary, including which errors are temporary and which requests can be retried vs. should not be retried.
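
For illustration, here is a rough Go sketch (a hypothetical callWithRetry helper, not from the spec) of how a caller might branch on the canonical codes to decide whether a failed call is worth retrying:

```go
package example

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// callWithRetry retries an RPC only when the canonical code indicates a
// transient condition; other codes are returned to the caller unchanged.
func callWithRetry(ctx context.Context, rpc func(context.Context) error) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if err = rpc(ctx); err == nil {
			return nil
		}
		switch status.Code(err) {
		case codes.Unavailable, codes.DeadlineExceeded, codes.Aborted:
			time.Sleep(time.Second << uint(attempt)) // simple backoff before retrying
		default:
			// e.g. InvalidArgument or FailedPrecondition: retrying the
			// same request will not help.
			return err
		}
	}
	return err
}
```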

jdef (Member) commented Jun 2, 2017

/cc @saad-ali

chhsia0 (Contributor, Author) commented Jun 7, 2017

FYI we can use grpc::Status::error_details() to deliver a binary string that represents a protobuf or anything we want for error handling.
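
For example, in Go the equivalent mechanism is status.WithDetails; the sketch below is hypothetical (the PreconditionFailure detail and the "VOLUME_IN_USE" condition name are illustrative, not from the spec) but shows how a plugin could attach a typed message explaining a failure:

```go
package example

import (
	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// deleteVolumeError is a hypothetical helper showing how a plugin could
// attach a typed detail message to a status; the detail travels in the
// response trailer alongside the canonical code and message.
func deleteVolumeError(volumeID string) error {
	st := status.New(codes.FailedPrecondition, "volume is still published")
	st, err := st.WithDetails(&errdetails.PreconditionFailure{
		Violations: []*errdetails.PreconditionFailure_Violation{{
			Type:        "VOLUME_IN_USE", // hypothetical condition name, not from the spec
			Subject:     volumeID,
			Description: "unpublish the volume before deleting it",
		}},
	})
	if err != nil {
		// Fall back to a plain status if the detail cannot be attached.
		return status.Error(codes.Internal, "failed to attach error details")
	}
	return st.Err()
}
```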

stevvooe commented Jul 19, 2017

Sorry for butting in here, but this is a bit of a concern and I thought I could help get to the right solution for error handling.

Looking at the proposed API in https://github.com/container-storage-interface/spec/blob/master/spec.md, much of what gRPC provides is being reproduced by this protocol design. For example, the wire format already encodes the algebraic data type of reply/error, so it doesn't need to be represented again in the higher-level protocol.

The main concern with not using the gRPC errors to signify error conditions is that we force users of this API to correctly map API error semantics into their language's semantics. This is error prone when done manually and is already provided by the gRPC generators. Defining an alternative error path will lead to missed errors and bugs.

> gRPC error codes are intended for gRPC-protocol-level errors. For CSI ops
> they're too low level: they don't allow us to encapsulate additional
> information, and don't allow us to accurately model/advertise the recovery
> semantics that we want clients to use.

This is simply not true. Take a peek at the wire format for details: protocol errors are actually mapped to HTTP/2 codes, which surface as grpc/codes.Internal. The rest of the error codes are for use by the application.

Let's take the Status message. This is the response error protobuf used by implementations, and it is carried in the response trailer in HTTP/2 (grpc::Status::error_details(), as stated above). It can pack most of the semantic information required to direct a client to do the correct thing. If you need help finding a way to do this cleanly, let me know and I can provide some direction.
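
For instance, here is a rough Go sketch (assuming the grpc-go status package rather than the C++ API quoted above; the handleRPCError helper is hypothetical) of a client recovering those details and acting on the canonical code:

```go
package example

import (
	"log"

	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// handleRPCError lets the canonical code drive control flow and uses any
// packed detail messages from the response trailer to refine the report.
func handleRPCError(err error) {
	st := status.Convert(err)
	switch st.Code() {
	case codes.FailedPrecondition:
		for _, d := range st.Details() {
			if pf, ok := d.(*errdetails.PreconditionFailure); ok {
				for _, v := range pf.Violations {
					log.Printf("violation %s on %s: %s", v.Type, v.Subject, v.Description)
				}
			}
		}
	case codes.Internal:
		log.Printf("internal error, nothing to do but report it: %s", st.Message())
	default:
		log.Printf("rpc failed: %v", st.Err())
	}
}
```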

I usually look to the Google APIs for best practices. They use this error code space at the application level to great effect. I've found that applying those concepts broadly (in containerd and swarmkit, among others) has also been quite successful. Their methodology for versioning APIs is also very clean and usable in practice, so I'd implore you to compare the approaches.

Again, I am sorry if this is presumptuous, but I think this project will have a large impact and having a pleasant API experience is a huge part of that.

If this is not the right issue to discuss this, let me know I can make another issue to highlight these points.

TL;DR

  1. The current approach is redundant with features already provided by gRPC.
  2. The current approach may lead to incorrect and inconsistent error handling.
  3. Best practices for error handling in gRPC are available across several projects; the Google APIs are a good example, but there are others.
  4. I'm here to help!

saad-ali (Member)

Thanks for the great input here. The reason for having our own error handling semantics is that it makes it very easy for the storage plugin to return very specific, actionable error codes to the CO. Using a bounded set of generic, pre-defined gRPC error codes makes it harder to understand what exactly the problem is and what the expected recovery behavior for the CO should be.

Looking through https://github.com/grpc/grpc/blob/master/src/proto/grpc/status/status.proto it appears that int32 code should be an enum value of google.rpc.Code, but it may accept additional error codes if needed. If that is true, then we can use that to specify more structured error codes.

So I'll take another look and see if we can leverage gRPC error codes instead of redefining our own error handling semantics.

stevvooe

@saad-ali I understand the constraints here and I've actually been down this path myself. In practice, however, we found that using the GRPC error codes was more than sufficient. Coupling the error codes with solid messaging and extending with detail should be enough.

jieyu (Member) commented Jul 31, 2017

@stevvooe thanks for the feedback. IIUC, the suggestion was to use the gRPC canonical error codes [1] [2] for some of the common errors. For instance, INVALID_VOLUME_ID ([3]) would probably translate into INVALID_ARGUMENT ([4]) (the error handling behavior for that error should be the same across all RPCs).

If we cannot find a mapping to the canonical errors, we'll use INTERNAL ([5]). The error_details field will be a serialized ::google::rpc::Status ([6]), which can contain any type of extension we want.

The question is: how does the client distinguish between two "internal" errors programmatically? Are we going to encode a CSI-specific error code in the details ([7]) field?

Another route is to take a step back and really see if it's necessary to introduce CSI-specific errors that map to the "internal" error. Maybe the gRPC canonical errors are sufficient for most use cases (as @stevvooe suggested).
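
To make that concrete, here is a hedged Go sketch of the two cases (the helper names and the "example.plugin" domain are illustrative, not from the spec):

```go
package example

import (
	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// invalidVolumeID covers the first case: a condition with a natural
// canonical mapping (INVALID_VOLUME_ID -> INVALID_ARGUMENT) needs no
// CSI-specific code.
func invalidVolumeID(id string) error {
	return status.Errorf(codes.InvalidArgument, "malformed volume id %q", id)
}

// vendorFailure covers the fallback case: no canonical mapping, so the
// plugin returns INTERNAL and attaches structured details that a client
// could inspect programmatically.
func vendorFailure(reason string) error {
	st := status.New(codes.Internal, "backend failure")
	st, err := st.WithDetails(&errdetails.ErrorInfo{
		Reason: reason,           // e.g. a vendor-specific error identifier
		Domain: "example.plugin", // hypothetical plugin domain
	})
	if err != nil {
		return status.Error(codes.Internal, "backend failure: "+reason)
	}
	return st.Err()
}
```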

cpuguy83 (Contributor) commented Aug 1, 2017

Everything should map well, and anything that is truly "internal" is an error whose details you can report, but there wouldn't really be anything to do with it other than report it.

jieyu (Member) commented Aug 1, 2017

@cpuguy83 If that's the case, I like that! I think the error code part of the spec can then be re-structured in the following way:

- RPC 1 errors
  - GRPC error code 1:
    - Scenarios:
      - Scenario 1 for this error code:
      - Scenario 2 for this error code:
      ...
    - Error handling
    - 'details' struct if needed
  - GRPC error code 2:
    - Scenarios:
      - Scenario 1 for this error code:
      - Scenario 2 for this error code:
      ...
    - Error handling
    - 'details' struct if needed
  ...

cpuguy83 (Contributor) commented Aug 3, 2017

@jieyu I think this is perfect, and similar to how Linux kernel errors are documented.

jdef (Member) commented Dec 19, 2017

Resolved by #115; please re-open if you disagree.

jdef closed this as completed Dec 19, 2017