
After a server crashes and then recovers, gRPC requests keep returning DEADLINE_EXCEEDED #8380

Closed
a544793138 opened this issue Aug 4, 2021 · 6 comments

@a544793138

Environment

  • gRPC version 1.30.0
  • Linux
  • Java
  • JDK 11

Background

Our team is using gRPC to send requests from the client to the server.

I am testing whether the gRPC connection from the client to the server is still available after the server crashes and then recovers.

We ran this test because we once found that after the server had crashed for a while, requests sent by the client via gRPC kept returning DEADLINE_EXCEEDED even after the server recovered.

In our client, we do not re-create the ManagedChannel for every request: unless the ManagedChannel is interrupted or closed, we create one when the client starts and hold it for the client's whole lifetime. For each request, however, we create a new stub from that ManagedChannel.

In the test, the length of the server outage is neither very short nor very long (I used iptables to simulate the server crash/recovery: iptables -A INPUT -s server-ip -j DROP / iptables -D INPUT -s server-ip -j DROP). For example, after the server became unreachable, I sent 20 requests and each returned DEADLINE; I then immediately restored the server, and the situation described above appeared. But if I send only one request after the server becomes unreachable and restore the server right after getting the DEADLINE result, the problem does not occur and the server returns results normally.

What is even stranger is that our client has a thread that checks the server's health. This thread creates a new gRPC ManagedChannel and sends a simple request to the server. After the server recovered, the results returned by this thread were normal.

I guessed that there was a problem with the ManagedChannel, so I used a trick: in the listener for the request result, if the request fails with a status code such as DEADLINE_EXCEEDED, I call ManagedChannel.enterIdle(). That resolves the problem of requests always returning DEADLINE_EXCEEDED.
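
A rough sketch of this workaround (OurResponse and the channel variable are placeholders, not our actual code):

// Sketch only: on a DEADLINE_EXCEEDED failure, push the long-lived ManagedChannel
// ("channel" below) back to idle so the next RPC builds a fresh transport.
StreamObserver<OurResponse> responseObserver = new StreamObserver<OurResponse>() {
  @Override
  public void onNext(OurResponse value) {
    // handle the normal response
  }

  @Override
  public void onError(Throwable t) {
    if (Status.fromThrowable(t).getCode() == Status.Code.DEADLINE_EXCEEDED) {
      channel.enterIdle();  // force the ManagedChannel to drop its current connection
    }
  }

  @Override
  public void onCompleted() {
  }
};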

However, our team believes that gRPC should already handle restoring client connections after a server crash, so we turned our attention to the gRPC retry configuration we have been using.

Our retry configuration:

Map<String, Object> retryPolicy = new HashMap<>();
retryPolicy.put("maxAttempts", 4D);
retryPolicy.put("initialBackoff", "0.01s");
retryPolicy.put("maxBackoff", "0.1s");
retryPolicy.put("backoffMultiplier", 2D);
retryPolicy.put("retryableStatusCodes", Arrays.asList("UNAVAILABLE", "DEADLINE_EXCEEDED", "ABORTED", "UNIMPLEMENTED"));

Map<String, Object> methodConfig = new HashMap<>();
methodConfig.put("name", Collections.singletonList(Collections.singletonMap("service", "OurService")));
methodConfig.put("timeout", "3s");
methodConfig.put("retryPolicy", retryPolicy);

Question

  • Could you please tell me why gRPC cannot restore the connection to the server?
  • The retry feature does not seem to take effect; is it simply not applicable to the server-crash situation described above?
  • I worked around the problem with enterIdle, but what is the correct way to handle this?
  • I saw the issue about getting retry logs, but I don’t know how to use the following APIs:
CallOptions.withStreamTracerFactory(ClientStreamTracer.Factory factory)
ClientStreamTracer.inboundTrailers(Metadata trailers)
StreamTracer.streamClosed(Status status)

Can you give me a simple example of how to call them, please?

@ejona86
Member

ejona86 commented Aug 4, 2021

To begin, you are not simulating a server crash. You are instead simulating a network failure.

Could you please tell me why gRPC cannot restore the connection to the server?

When you drop packets, that is indistinguishable from an overloaded network. So TCP backs off and retransmits with an exponential backoff. If the network is down for a short enough period of time, TCP should resume and you may even notice buffered RPCs run (including those that saw DEADLINE_EXCEEDED; they could run on the server after they failed on the client).

The big questions are, "how long did you wait for it to recover?" and "how long was the network down?" Both in terms of seconds, not "requests" or the like.

The retry feature does not seem to take effect; is it simply not applicable to the server-crash situation described above?

TCP and gRPC haven't noticed that the network was broken and are slowly trying to restore connectivity. gRPC offers application-level keepalive to help determine situations where a TCP connection is killed by the network, although that's a bit different than blackholing like you did.

Retry is probably retrying, but gRPC is continuing to use a TCP connection that has slowed itself to avoid a congestion collapse. When the TCP connection recovers then RPCs would begin working again. And given enough time I'd expect it to recover.

I worked around the problem with enterIdle, but what is the correct way to handle this?

I would not suggest your enterIdle approach, as it will bring connections down needlessly. Keepalive could help in some situations for some timings to decide the old TCP connection is broken, but it depends on some of the questions I asked above.

@a544793138
Author

Many thanks for your reply, and for helping me realize that my description was imprecise.

The big questions are, "how long did you wait for it to recover?" and "how long was the network down?" Both in terms of seconds, not "requests" or the like.

In fact, "20 requests" contains an approximate time, because the return result of each request is DEADLINE. To add, my test process is to send a request to our application with postman, and then our application will act as a client, using gRPC to initiate a request to the server. So I can see a time-consuming of each failed request by postman console. It is about 3 seconds (
I guess this is because we set timeout to 3 seconds in retry policy). So "20 requests and returns" is about 60 seconds. Then there will be the phenomenon of always returning to DEADLINE. I waited for more than 5 minutes at the time, but couldn't see its recovery.

In addition, we also test from the perspective of disconnection time:

  • After the client is cut off from the server, it sends one request; after getting a failure, it waits another 10 seconds, then the connection is restored and the request is sent again.

    • Result: normal. The server returns a normal response.
  • After the client is cut off from the server, it sends one request; after getting a failure, it waits another 200 seconds, then the connection is restored and the request is sent again.

    • Result: failure. Requests keep returning DEADLINE; after waiting more than 5 minutes there was still no recovery.

gRPC offers application-level keepalive to help determine situations where a TCP connection is killed by the network

We have indeed not been using keepalive in our program; we will look into how to use it and what it does.

@a544793138
Author

We tried the keepAliveTime method when creating the ManagedChannel, and it does solve our problem. In fact, we do not have strict requirements on how quickly the connection is restored, as long as it is not unreasonably long.
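
A minimal sketch of the keepalive settings on the channel builder (the values here are illustrative, not necessarily the ones we ended up with):

ManagedChannel channel = ManagedChannelBuilder
    .forAddress("server-host", 443)          // placeholder address
    .keepAliveTime(30, TimeUnit.SECONDS)     // send a keepalive ping after this much idle time
    .keepAliveTimeout(10, TimeUnit.SECONDS)  // consider the connection dead if the ping is not acked in time
    .keepAliveWithoutCalls(true)             // keep pinging even when no RPCs are active
    .build();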

In short, thank you very much for your reply and help.

In addition, I also tried to debug the code during the earlier test. I found the code that retries after a request fails, but in my test scenario I never saw execution enter the retry path. Is that expected?

io.grpc.internal.RetriableStream#closed(Status status, RpcProgress rpcProgress, Metadata trailers):

...
if (state.winningSubstream == null) {
  // the retry logic begins here
  ...
}

If possible, I hope you can provide an example of the APIs that need to be called to inject a tracing stream. I want to know how to use these APIs correctly:

CallOptions.withStreamTracerFactory(ClientStreamTracer.Factory factory)
ClientStreamTracer.inboundTrailers(Metadata trailers)
StreamTracer.streamClosed(Status status)
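
For reference, here is a minimal sketch of how these three APIs might fit together (the class names are illustrative): a ClientStreamTracer that logs the trailers and closing status of each stream attempt, attached through an interceptor via CallOptions.withStreamTracerFactory. I am not sure this is the intended usage, so corrections are welcome.

import io.grpc.CallOptions;
import io.grpc.Channel;
import io.grpc.ClientCall;
import io.grpc.ClientInterceptor;
import io.grpc.ClientStreamTracer;
import io.grpc.Metadata;
import io.grpc.MethodDescriptor;
import io.grpc.Status;

// Emits one log line per physical stream attempt, so retries show up as multiple closes.
class LoggingTracerFactory extends ClientStreamTracer.Factory {
  @Override
  public ClientStreamTracer newClientStreamTracer(
      ClientStreamTracer.StreamInfo info, Metadata headers) {
    return new ClientStreamTracer() {
      @Override
      public void inboundTrailers(Metadata trailers) {
        System.out.println("stream trailers: " + trailers);
      }

      @Override
      public void streamClosed(Status status) {
        System.out.println("stream closed with: " + status);
      }
    };
  }
}

// Attaches the tracer factory to every call's CallOptions.
class TracingInterceptor implements ClientInterceptor {
  @Override
  public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
      MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
    return next.newCall(
        method, callOptions.withStreamTracerFactory(new LoggingTracerFactory()));
  }
}

The interceptor could then be registered with ManagedChannelBuilder.intercept(...) or stub.withInterceptors(...).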

@ejona86
Member

ejona86 commented Aug 5, 2021

After the client is cut off from the server, it sends one request; after getting a failure, it waits another 200 seconds, then the connection is restored and the request is sent again.

Hmm... I'd have expected 5 minutes to be enough. It seems RTO is limited to 2 minutes on Linux. So 2 minutes (or 4 minutes if very pessimistic) should have been enough.

200 seconds is high enough that some machine's tcp_retries2 setting (if 8 or fewer) may cause TCP to consider the connection dead, but the default of 15 wouldn't. And if TCP killed the connection gRPC should respond by re-creating it and carrying on happily.

So this is a bit of a mystery to me, but it'd take reproduction and deeper investigation to figure out why. And the result would probably be the same as now: "you probably want keepalive enabled."

We tried the keepAliveTime method when creating the ManagedChannel, and it does solve our problem.

FWIW, note that when we can (if the netty epoll component is working), we'll set TCP_USER_TIMEOUT to the keepAliveTimeout value. This means that we can detect the "broken" TCP connection without actually waiting for keepAliveTime+keepAliveTimeout. That is to say that even if keepAliveTime is 2 hours, you may notice the TCP connection being considered dead after keepAliveTimeout amount of time.

If possible, I hope you can provide an example of the APIs that need to be called to inject a tracing stream.

In your situation the tracer would mostly show you what you already know and can see from the ClientCall/stub. The tracer API isn't great for monitoring the connection itself.

@a544793138
Author

I tested again without using enterIdle or keepAlive.
I found that it takes about 10 minutes to return to normal. I hope this is useful to you.

Thank you for all your answers and help. I will close this issue.

@ejona86
Member

ejona86 commented Aug 6, 2021

I found that it takes about 10 minutes to return to normal. I hope this is useful to you.

Thank you! It is helpful to know that it does return to normal, just after an inordinate amount of time. And 10 minutes is helpful for future debugging.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 4, 2021