The problem caused by "For a server that crashed and then recovered, the result of gRPC sending a request is always DEADLINE_EXCEEDED" #8380
To begin, you are not simulating a server crash. You are instead simulating a network failure.
When you drop packets, that is indistinguishable from an overloaded network. So TCP backs off and retransmits with an exponential backoff. If the network is down for a short enough period of time, TCP should resume and you may even notice buffered RPCs run (including those that saw DEADLINE_EXCEEDED; they could run on the server after they failed on the client). The big questions are, "how long did you wait for it to recover?" and "how long was the network down?" Both in terms of seconds, not "requests" or the like.
TCP and gRPC haven't noticed that the network was broken and are slowly trying to restore connectivity. gRPC offers application-level keepalive to help determine situations where a TCP connection is killed by the network, although that's a bit different than blackholing like you did. Retry is probably retrying, but gRPC is continuing to use a TCP connection that has slowed itself to avoid a congestion collapse. When the TCP connection recovers then RPCs would begin working again. And given enough time I'd expect it to recover.
I would not suggest your enterIdle approach, as it will bring connections down needlessly. Keepalive could help in some situations for some timings to decide the old TCP connection is broken, but it depends on some of the questions I asked above.
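For reference, client-side keepalive is configured on the channel builder along these lines (a sketch only; the host, port, and timing values are illustrative and need tuning against your network and the server's keepalive enforcement policy):

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.TimeUnit;

final class KeepaliveChannelFactory {
  // Illustrative values only; pick keepAliveTime/keepAliveTimeout for your
  // environment and confirm the server's keepalive enforcement allows them.
  static ManagedChannel create(String host, int port) {
    return ManagedChannelBuilder.forAddress(host, port)
        .usePlaintext()
        .keepAliveTime(30, TimeUnit.SECONDS)     // send a keepalive ping after 30s without reads
        .keepAliveTimeout(10, TimeUnit.SECONDS)  // treat the connection as dead if the ping isn't answered in 10s
        .keepAliveWithoutCalls(true)             // also ping while there are no active RPCs
        .build();
  }
}
```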
Many thanks for your reply, and for helping me realize that my description was unclear.
In fact, "20 requests" corresponds to an approximate amount of time, because each request returned DEADLINE. To add some detail: my test process is to send a request to our application with Postman, and our application then acts as the gRPC client and sends a request to the server. So I can see in the Postman console how long each failed request takes; it is about 3 seconds. In addition, we also tested from the perspective of disconnection time.
It is indeed not used. We will try to use it when creating the ManagedChannel.
In short, thank you very much for your reply and help. In addition, I also tried to debug the code during the previous test. I found the code that retries after a request fails, but in my test scenario I never saw it enter the retry path. Is it this method? io.grpc.internal.RetriableStream#closed(Status status, RpcProgress rpcProgress, Metadata trailers):
    ...
    if (state.winningSubstream == null) {
      // the beginning of the retry policy
      ...
    }

If possible, I hope you can provide an example of the APIs that need to be called to inject a stream tracer. I want to know how to use these APIs correctly:
CallOptions.withStreamTracerFactory(ClientStreamTracer.Factory factory)
ClientStreamTracer.inboundTrailers(Metadata trailers)
StreamTracer.streamClosed(Status status)
Hmm... I'd have expected 5 minutes to be enough. It seems RTO is limited to 2 minutes on Linux. So 2 minutes (or 4 minutes if very pessimistic) should have been enough. 200 seconds is high enough that some machine's tcp_retries2 setting (if 8 or fewer) may cause TCP to consider the connection dead, but the default of 15 wouldn't. And if TCP killed the connection gRPC should respond by re-creating it and carrying on happily. So this is a bit of a mystery to me, but it'd take reproduction and deeper investigation to figure out why. And the result would probably be the same as now: "you probably want keepalive enabled."
FWIW, note that when we can (if the netty epoll component is working), we'll enable TCP_USER_TIMEOUT to the keepAliveTimeout value. This means that we can detect the "broken" TCP connection without actually waiting for keepAliveTime+keepAliveTimeout. That is to say that even if keepAliveTime is 2 hours, you may notice the TCP connection being considered dead after keepAliveTimeout amount of time.
In your situation the tracer would mostly show you what you already know and can see from the ClientCall/stub. The tracer API isn't great for monitoring the connection itself.
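For completeness, a tracer is attached per call roughly like this (a sketch only: the interceptor, class names, and println logging are illustrative, and the Factory method signature differs slightly between gRPC versions):

```java
import io.grpc.CallOptions;
import io.grpc.Channel;
import io.grpc.ClientCall;
import io.grpc.ClientInterceptor;
import io.grpc.ClientStreamTracer;
import io.grpc.Metadata;
import io.grpc.MethodDescriptor;
import io.grpc.Status;

// Sketch: attach a ClientStreamTracer.Factory to every call via an interceptor.
final class TracingInterceptor implements ClientInterceptor {
  private final ClientStreamTracer.Factory factory =
      new ClientStreamTracer.Factory() {
        @Override
        public ClientStreamTracer newClientStreamTracer(
            ClientStreamTracer.StreamInfo info, Metadata headers) {
          return new ClientStreamTracer() {
            @Override
            public void inboundTrailers(Metadata trailers) {
              System.out.println("trailers: " + trailers);
            }

            @Override
            public void streamClosed(Status status) {
              System.out.println("stream closed: " + status);
            }
          };
        }
      };

  @Override
  public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
      MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
    // Inject the tracer factory into the CallOptions for this call.
    return next.newCall(method, callOptions.withStreamTracerFactory(factory));
  }
}
```

The interceptor could then be registered on the channel with ManagedChannelBuilder.intercept(...) or on a stub with stub.withInterceptors(...).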
I tested again without the enterIdle() workaround; the requests did return to normal, but only after about 10 minutes. Thank you for all your answers and help. I will close this issue.
Thank you! It is helpful to know that it does return to normal, just after an inordinate amount of time. And 10 minutes is helpful for future debugging.
Environment
Background
Our team is using gRPC to send requests from the client to the server.
I am testing whether the gRPC connection from the client to the server is still available after the server crashes and then recovers.
We did this test because we once found that after the server had crashed for a while, even after it recovered, requests sent by the client via gRPC still kept returning DEADLINE_EXCEEDED.
In our client we do not re-create the ManagedChannel for every request; unless it is interrupted or closed, we create one ManagedChannel when the client starts and hold it forever. For every request, however, we create a new stub from that ManagedChannel to complete the request.
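A minimal sketch of that setup (the service, request, and response types here are hypothetical placeholders for our generated stub classes):

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public final class GrpcClient {
  // Created once at client startup and held for the life of the process.
  private final ManagedChannel channel =
      ManagedChannelBuilder.forAddress("server-ip", 50051).usePlaintext().build();

  // A new stub is created from the shared channel for every request.
  // SomeServiceGrpc/SomeRequest/SomeResponse stand in for our generated classes.
  public SomeResponse send(SomeRequest request) {
    SomeServiceGrpc.SomeServiceBlockingStub stub = SomeServiceGrpc.newBlockingStub(channel);
    return stub.someMethod(request);
  }
}
```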
In the test, the server crash lasted neither a very short nor a very long time (I used the iptables command to simulate the server crash/recovery: iptables -A INPUT -s server-ip -j DROP / iptables -D INPUT -s server-ip -j DROP). For example, after the server became abnormal I sent 20 requests and each got a DEADLINE result; I then immediately restored the server, and the situation described above appeared. But if I send only one request after the server becomes abnormal, and restore the server immediately after that request gets its DEADLINE result, the problem does not occur and the server returns results normally.
What is even stranger is that our client has a thread that checks the health status of the server. What this thread actually does is create a new gRPC ManagedChannel
to send a simple request to the server. After the server resumed, the result returned by this thread was normal.
I guessed that there was a problem with the ManagedChannel, so I used a tricky method: in the listener for the request result, if the request fails with a status code such as DEADLINE_EXCEEDED, I call ManagedChannel.enterIdle().
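Roughly, the workaround looks like the following (a sketch only; the listener class and how it is wired into the call are illustrative):

```java
import io.grpc.ClientCall;
import io.grpc.ForwardingClientCallListener.SimpleForwardingClientCallListener;
import io.grpc.ManagedChannel;
import io.grpc.Metadata;
import io.grpc.Status;

// When a call closes with DEADLINE_EXCEEDED, push the channel back to idle so
// the next RPC builds a fresh connection.
final class EnterIdleOnDeadlineListener<RespT>
    extends SimpleForwardingClientCallListener<RespT> {
  private final ManagedChannel channel;

  EnterIdleOnDeadlineListener(ClientCall.Listener<RespT> delegate, ManagedChannel channel) {
    super(delegate);
    this.channel = channel;
  }

  @Override
  public void onClose(Status status, Metadata trailers) {
    if (status.getCode() == Status.Code.DEADLINE_EXCEEDED) {
      channel.enterIdle();  // drops the current (possibly broken) connection
    }
    super.onClose(status, trailers);
  }
}
```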
With that trick, the problem of always returning DEADLINE_EXCEEDED went away. However, our team believes that gRPC itself should handle recovering client connections after a server crash, so we then looked at the gRPC retry feature we have been using.
Our retry configuration:
Question
I use enterIdle to solve the problem by a trick, but what should I do correctly? Can you give me the simplest call example, please?