Improved Heartbeat Write Timer Handling #636

michaelklishin merged 1 commit into rabbitmq:master from ricado-group:heartbeat-write-deadlock
Conversation
- Resolves an issue with the Heartbeat Write Timer repeatedly firing and blocking on the synchronization lock. The timer will now fire once and be restarted (if not disposed) at the end of method processing.
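The one-shot-and-restart pattern described above can be sketched as follows. This is an illustrative sketch, not the client's actual code: the class, member names, and interval are assumptions, but the `Timeout.Infinite` period and the re-arm-after-processing sequence are the mechanism the PR describes.

```csharp
using System;
using System.Threading;

class HeartbeatWriter : IDisposable
{
    private readonly object _heartbeatWriteLock = new object();
    private readonly int _intervalMs;
    private Timer _heartbeatWriteTimer;
    private volatile bool _disposed;

    public HeartbeatWriter(int intervalMs)
    {
        _intervalMs = intervalMs;
        // A period of Timeout.Infinite makes this a one-shot timer: it fires
        // once after _intervalMs and never again until explicitly re-armed.
        _heartbeatWriteTimer = new Timer(HeartbeatWriteTimerCallback, null,
                                         _intervalMs, Timeout.Infinite);
    }

    private void HeartbeatWriteTimerCallback(object state)
    {
        lock (_heartbeatWriteLock)
        {
            // Write the heartbeat frame to the socket here. If this write
            // blocks on a dead connection, no further callbacks pile up
            // behind the lock, because the timer cannot fire again until
            // it is re-armed below.
        }

        // Restart the timer only after processing completes, and only if
        // the writer has not been disposed in the meantime.
        if (!_disposed)
        {
            _heartbeatWriteTimer?.Change(_intervalMs, Timeout.Infinite);
        }
    }

    public void Dispose()
    {
        _disposed = true;
        _heartbeatWriteTimer?.Dispose();
        _heartbeatWriteTimer = null;
    }
}
```

The key contrast with a periodic timer is that a periodic timer keeps dispatching callbacks onto the ThreadPool even while an earlier callback is stuck holding the lock; the one-shot form makes a stuck write cost exactly one blocked thread.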
It appears builds are failing due to configuration issues?

@ash-ricado you are correct, CI needs propping every so often. We will QA this, don't worry.

@ash-ricado any way we can reasonably reliably reproduce this?
kjnilsson left a comment:

I think this looks fine to me, apart from the interval change, which I don't know the reason for.
```csharp
_heartbeatReadTimer = new Timer(HeartbeatReadTimerCallback);
// …
_heartbeatReadTimer.Change(300, Timeout.Infinite);
```
Why the change from 200 to 300 here?
I simply changed this to prevent both the Read and Write Timers firing at the same time. I was considering using Random.Next(100, 300) to ensure that, with multiple connections, the timers fire at different times.
Not required to fix the issue this PR deals with. More than happy to remove if desired.
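For illustration, the jitter idea mentioned above could look like this (a sketch only; the PR does not depend on it, and the helper name is hypothetical):

```csharp
using System;
using System.Threading;

static class TimerJitter
{
    private static readonly Random Rand = new Random();

    public static void ArmWithJitter(Timer timer)
    {
        // Pick a first due time between 100 and 299 ms so the read and
        // write timers (and those of other connections) fire at different
        // instants instead of all at once.
        timer.Change(Rand.Next(100, 300), Timeout.Infinite);
    }
}
```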
We strongly discourage heartbeat timeouts < 5 seconds, so any timer interval < 1 s should be reasonable, and anything < 500 ms is optimal IMO.
This looks reasonable. @ash-ricado please undo the interval change (or justify it) and provide some detail on the problem this PR addresses.
michaelklishin left a comment:

Let's wait for some extra details first.
@michaelklishin Apologies for the delay in my response. The problem solved by this PR was identified through our use of remote nodes that talk to a RabbitMQ cluster over connections that can sometimes become unstable. Monitoring allowed us to watch the ThreadPool grow over the course of an hour until the maximum number of threads was consumed.

It should be possible to reliably reproduce by publishing data (12K or larger) to a channel every 5 seconds, then dropping the network interface that provides access to the RabbitMQ server. The Heartbeat Write Timer will continue to fire and wait on the lock.

NOTE: this relies on the socket hanging after an unclean network disconnection. If the socket can be shut down cleanly, the attempt to write frames will return immediately.
OK, so injecting a latency spike with Toxiproxy or similar would do. Thank you.
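A Toxiproxy-based repro along those lines might look like the following. The proxy name, ports, and latency value are illustrative assumptions; it presumes `toxiproxy-server` is running and a broker is listening locally on 5672.

```shell
# Route client connections through Toxiproxy
# (listen on 25672, upstream broker on 5672).
toxiproxy-cli create rabbit --listen localhost:25672 --upstream localhost:5672

# Point the client at localhost:25672, then inject a long latency so socket
# writes hang without the connection being cleanly closed.
toxiproxy-cli toxic add rabbit -t latency -a latency=60000

# Before the fix: watch ThreadPool threads accumulate behind the heartbeat
# write lock. After the fix: the write timer fires once and waits.
```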
Proposed Changes
Modified synchronization and parameters of the Write Heartbeat Timer to ensure it can only fire once. It will then be started again at the end of the callback.
This resolves an issue that can occur when writing a heartbeat frame to the socket fails (e.g. a broken connection) and the Write Heartbeat Timer continues to fire and block on the lockable _heartbeatWriteLock object. Over time this would eventually exhaust the ThreadPool if the socket write did not return.

Types of Changes
What types of changes does your code introduce to this project? Put an x in the boxes that apply.

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask on the mailing list. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

CONTRIBUTING.md document

Further Comments