In our usage, we hit a case where the shuffle worker's registration timed out and triggered a fatal error, but the shuffle worker process did not exit, so no new worker was spawned to replace it.
The cause is as follows. On a fatal error the shuffle worker executes closeAsync and shuts down all of its component services, and the process exits only once all non-daemon threads have exited. Our metric client, however, starts extra threads that are not closed properly, which keeps the process alive. This should be fixed by closing these threads in the reporter#close method.
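To illustrate the kind of fix meant for reporter#close, here is a minimal sketch. SketchMetricReporter and its flusher executor are hypothetical stand-ins; the real metric client and its threads are not shown in this report. The point is only that a thread pool created with the default thread factory uses non-daemon threads, so close has to stop it explicitly.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical reporter used for illustration only; not the actual metric client.
class SketchMetricReporter implements AutoCloseable {

    // The default thread factory creates non-daemon threads, so this
    // executor keeps the JVM alive until it is shut down.
    private final ScheduledExecutorService flusher =
            Executors.newSingleThreadScheduledExecutor();

    void open() {
        // Placeholder for the periodic metric flush.
        flusher.scheduleAtFixedRate(() -> { }, 0, 10, TimeUnit.MILLISECONDS);
    }

    boolean isStopped() {
        return flusher.isTerminated();
    }

    @Override
    public void close() {
        // Cancel pending flushes and interrupt the running one.
        flusher.shutdownNow();
        try {
            // Wait a bounded time so close() cannot hang forever.
            if (!flusher.awaitTermination(5, TimeUnit.SECONDS)) {
                System.err.println("Metric flusher did not stop within 5 seconds");
            }
        } catch (InterruptedException e) {
            // Preserve the interrupt for the caller.
            Thread.currentThread().interrupt();
        }
    }
}
```

Without the shutdownNow call in close, the flusher thread would outlive closeAsync and the process would stay up, which matches the behavior described above.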
Still, I think we should improve the shutdown logic a bit: we could explicitly exit the shuffle worker process once the termination future completes. That keeps shutdown safe even when some threads cannot be released in time.
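A rough sketch of that proposal, assuming the termination future is a CompletableFuture<Void>. The class name, the exit codes, and the injected exit hook are all hypothetical; the hook is there only so the logic can be exercised without actually terminating the JVM.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.IntConsumer;

// Hypothetical helper for illustration; names and exit codes are assumptions.
class SketchWorkerRunner {

    static final int SUCCESS_EXIT_CODE = 0;
    static final int FAILURE_EXIT_CODE = 1;

    // Calls the exit hook as soon as the termination future completes,
    // whether it completed normally or exceptionally.
    static void exitOnTermination(
            CompletableFuture<Void> terminationFuture, IntConsumer exit) {
        terminationFuture.whenComplete(
                (ignored, failure) ->
                        exit.accept(failure == null ? SUCCESS_EXIT_CODE : FAILURE_EXIT_CODE));
    }
}
```

In the worker entry point the hook would be System::exit, so the process terminates even if a leftover non-daemon thread is still running.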