It is very easy to misconfigure gRPC's client keepalive feature. Any disagreement between the client and server results in RPCs failing at unpredictable intervals with code=Unknown desc=transport is closing. grpc-go's own README calls this "hard to debug". Because of this, you must be extremely careful about how you order deploys when you change this configuration.
At a minimum this should be better documented, and possibly the feature should be redesigned. More details about the problems it can cause are in Additional context below.
Describe the solution you'd like
Clearly document this in all API references and in general documentation such as https://github.com/grpc/grpc/blob/master/doc/keepalive.md and the gRFC for this feature. They should all say something like: "Prefer to configure keepalive on the server side, and only configure client keepalive if you are absolutely sure you need it. Incorrectly configured client keepalive will cause RPCs to fail. You must deploy the server with a matching configuration before you deploy the client. When disabling this feature, you must deploy the change in the opposite order."

Clients should log useful error messages when this occurs. For example, see my specific suggestions for the Go client: Make debugging client keepalive misconfigurations easier grpc-go#4266
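To make "matching configuration" concrete, here is a minimal sketch of a matched client/server pair in grpc-go. The 30s/10s values and the address are arbitrary assumptions for illustration, not recommendations:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// Client side: ping every 30s, wait up to 10s for the ack.
func dialWithKeepalive(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // interval between keepalive pings
			Timeout:             10 * time.Second, // how long to wait for a ping ack
			PermitWithoutStream: true,             // ping even with no active RPCs
		}),
	)
}

// Server side: deploy this FIRST. MinTime must not exceed the client's Time,
// or the server accumulates "ping strikes" and closes the connection with
// GOAWAY(too_many_pings), failing all in-flight RPCs.
func newServerWithKeepalive() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             30 * time.Second, // must be <= the client's Time
			PermitWithoutStream: true,             // must match the client's setting
		}),
	)
}

func main() {
	srv := newServerWithKeepalive()
	defer srv.Stop()
	if conn, err := dialWithKeepalive("localhost:50051"); err == nil {
		conn.Close()
	}
}
```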
Clients should all implement the "double the keepalive timeout" policy described in the client keepalive gRFC. As far as I can tell, C and Java implement this, but I don't think Go does. Even this isn't perfect: I had a client with a 10 second keepalive interval, while the default server minimum is 5 minutes. With doubling, every client still has 5 connections closed before it reaches an acceptable keepalive interval, and each close terminates all RPCs in flight. A sketch of that arithmetic follows.
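A minimal sketch of the doubling policy, assuming a 10 second starting interval against the 5 minute default server minimum, which reproduces the five-dropped-connections arithmetic above:

```go
package main

import (
	"fmt"
	"time"
)

// Sketch of the "double on GOAWAY(too_many_pings)" backoff from the client
// keepalive gRFC. Counts how many connections a client loses before its
// keepalive interval satisfies the server's minimum.
func closesBeforeAccepted(clientInterval, serverMin time.Duration) int {
	closes := 0
	for clientInterval < serverMin {
		closes++            // server sends GOAWAY; all in-flight RPCs fail
		clientInterval *= 2 // client doubles its interval and reconnects
	}
	return closes
}

func main() {
	// 10s client interval vs. the 5-minute default server minimum:
	// 10 -> 20 -> 40 -> 80 -> 160 -> 320s, i.e. 5 dropped connections.
	fmt.Println(closesBeforeAccepted(10*time.Second, 5*time.Minute))
}
```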
It is possible this feature could be removed entirely. Go enables TCP keepalives with a 15 second timer by default. The C gRPC implementation could do the same using SO_KEEPALIVE, and Java could do something similar. This would probably eliminate most WAN NAT/load balancer issues with zero configuration. If a more aggressive check is needed, the server could initiate the ping using the existing server-side keepalive. The disadvantage is that clients may not learn a connection is dead for some time; however, robust gRPC applications need to set deadlines on every RPC anyway.
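For comparison, this is roughly what relying on OS-level TCP keepalive looks like in grpc-go today. The 15 second value is Go's documented net.Dialer default, made explicit here for visibility; the address is a placeholder:

```go
package main

import (
	"context"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// Go's net.Dialer already enables TCP keepalive (SO_KEEPALIVE) with a
// 15-second period by default; no gRPC-level client keepalive is configured,
// so dead-peer detection happens entirely at the TCP layer.
func dialWithTCPKeepalive(addr string) (*grpc.ClientConn, error) {
	d := &net.Dialer{KeepAlive: 15 * time.Second}
	return grpc.Dial(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithContextDialer(func(ctx context.Context, target string) (net.Conn, error) {
			return d.DialContext(ctx, "tcp", target)
		}),
	)
}

func main() {
	if conn, err := dialWithTCPKeepalive("localhost:50051"); err == nil {
		conn.Close()
	}
}
```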
If client-side keepalive is really needed, it could be redesigned to be less error-prone. The ideal way, in my opinion, would be for it to be under server control, since the "denial of service" concern described in the gRFC only really applies to servers. This would require the server to send the client its permitted settings. gRPC could possibly use some unused bits in the HTTP/2 SETTINGS frame, or define a special request where the client asks the server for additional settings. This would eliminate these misconfigurations, at the cost of substantial additional complexity.
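Purely as a hypothetical sketch of that idea (none of these types or fields exist in any gRPC implementation): the server advertises its enforcement limits, and the client clamps its configuration to them, so disagreement becomes impossible by construction:

```go
package main

import (
	"fmt"
	"time"
)

// What a server might advertise during connection setup. Hypothetical.
type advertisedKeepalive struct {
	MinPingInterval     time.Duration // fastest ping rate the server will tolerate
	PermitWithoutStream bool          // whether pings on idle connections are allowed
}

// The client clamps its desired interval to whatever the server permits,
// instead of being killed with GOAWAY for guessing wrong.
func effectiveInterval(desired time.Duration, adv advertisedKeepalive) time.Duration {
	if desired < adv.MinPingInterval {
		return adv.MinPingInterval
	}
	return desired
}

func main() {
	adv := advertisedKeepalive{MinPingInterval: 5 * time.Minute}
	// 10s desired vs. 5m advertised minimum: client pings at 5m, no failed RPCs.
	fmt.Println(effectiveInterval(10*time.Second, adv))
}
```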
Additional context
I just spent a day debugging an error caused by a client with an aggressive keepalive configuration talking to a server using the default configuration. This is a tricky error because it is timing-dependent: it only happens for really slow RPCs, or when the connection is continuously active. If the connection is idle for long enough periods, it never accumulates enough ping strikes.
Inside a large organization, gRPC configuration like this tends to get copy/pasted from one service to the next. When someone copies the keepalive configuration for a client but not for the server, the result seems to work fine under light testing, then fails at unpredictable intervals in production. After I encountered this error, I searched internally for other uses and found multiple places where I believe this setting is being used incorrectly.
You can never make the server's configuration more restrictive (i.e., increase the minimum permitted ping interval)
Deploying this feature correctly is extremely hard. Imagine you have a service in production and decide you need client keepalives for some reason. There is only one safe order in which to deploy this configuration change:

1. Deploy the updated server configuration everywhere.
2. Deploy the clients with the new setting.
If you discover clients are pinging too often, or you want to remove this configuration, you must do it in the opposite order: deploy all clients first, then change the setting on the server. As a result, in complex situations with multiple clients, or with clients whose deploy cycle you don't control, you can only ever make the server more permissive.