Exponential round timeouts cause very long restart times #261
Description
Related: #245.
We deliberately made some nodes faulty to test the resiliency of the network and noticed that the nodes take a very long time to get back to consensus, even after 2/3 of the nodes are no longer byzantine.
It's very possible this is why we thought our testnet was stuck (#232, #248). In reality, a network partition or another bug in node discovery may have caused the initial stall, and this issue then made the restart take a very long time.
Inspired by
Their implementation here: getamis/go-ethereum#99
Code here which is a WIP adapted for the SDK:
The main issue is that if a cluster stops for, say, an hour (due to connectivity problems, etc.), it might take 3+ hours for the nodes to recover even if they are all still connected, giving the illusion of a stuck chain. This is not what a chain operator would expect.
The only way to "fix" it is to restart all the nodes, effectively resetting their round to 1.
Admittedly, this is technically "working as intended" in the code, but it is not behaviour you would expect.
Your environment
- OS and version: Ubuntu 20
- Version of the Polygon SDK: 7f2e61d
- Branch that causes this issue: develop, with adapted modes from above
Steps to reproduce
- Create a 7 node cluster. Let it produce some blocks.
- Make 3 of the nodes byzantine (don't gossip blocks, gossip wrong messages, etc.), which pushes the number of honest nodes below the 2/3 threshold.
- The cluster should stop producing blocks.
- Wait 10 minutes or so.
- Replace the byzantine node with a standard node
- Observe that it takes 30m+ for all nodes to reach the same round and produce blocks again.
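The threshold arithmetic behind these steps can be sketched as follows. This assumes the standard BFT bound (tolerate f = floor((n-1)/3) faulty validators, require 2f+1 votes); `maxFaulty` and `quorum` are hypothetical helper names, and the SDK's actual quorum calculation may differ.

```go
package main

import "fmt"

// maxFaulty returns f, the largest number of byzantine validators
// tolerated among n validators under the standard bound f = floor((n-1)/3).
func maxFaulty(n int) int {
	return (n - 1) / 3
}

// quorum returns the number of matching votes needed to make progress,
// ASSUMING the usual 2f+1 rule.
func quorum(n int) int {
	return 2*maxFaulty(n) + 1
}

func main() {
	n, byzantine := 7, 3
	honest := n - byzantine
	fmt.Printf("validators=%d tolerated faults=%d quorum=%d honest=%d stalled=%v\n",
		n, maxFaulty(n), quorum(n), honest, honest < quorum(n))
}
```

With 7 validators this gives a quorum of 5, so 3 byzantine nodes leave only 4 honest ones and the cluster cannot commit blocks.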
Expected behaviour
It should not take so long to recover and start producing blocks again.
Actual behaviour
It takes a very long time for the nodes to reach the same round and produce blocks.
Logs
2021-11-30T17:37:30.154-0500 [DEBUG] polygon.consensus.ibft: state change: new=CommitState
2021-11-30T17:37:30.154-0500 [INFO] polygon.blockchain: write block: num=2680 parent=0xa145ad73ef2d8c399ce14712c068af2beba574288f0a991494ac6e9bcd718536
2021-11-30T18:13:28.291-0500 [INFO] polygon.blockchain: write block: num=2681 parent=0x6a5c7127f8619b017887050065a198f0573eefdb713ffae70bd936558b8415ed
2021-11-30T18:13:38.377-0500 [INFO] polygon.blockchain: write block: num=2682 parent=0x45d5213d7dcb10c7b62d69fbb4ea0a5b51cc0609da7e2332d5cb6a2c863a046a
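The stall is visible as the gap between the writes of blocks 2680 and 2681. A small sketch for measuring it from the log timestamps, assuming the timestamp format shown above (`logLayout` and `gap` are hypothetical names):

```go
package main

import (
	"fmt"
	"time"
)

// logLayout matches timestamps such as 2021-11-30T17:37:30.154-0500.
const logLayout = "2006-01-02T15:04:05.000-0700"

// gap returns the time elapsed between two log timestamps.
func gap(before, after string) (time.Duration, error) {
	t0, err := time.Parse(logLayout, before)
	if err != nil {
		return 0, err
	}
	t1, err := time.Parse(logLayout, after)
	if err != nil {
		return 0, err
	}
	return t1.Sub(t0), nil
}

func main() {
	// Timestamps of the "write block" lines for blocks 2680 and 2681.
	d, err := gap("2021-11-30T17:37:30.154-0500", "2021-11-30T18:13:28.291-0500")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("gap between blocks 2680 and 2681:", d)
}
```

For the two lines above the gap is 35m58.137s, while block 2682 follows block 2681 about 10 seconds later.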
Proposed solution
We have some ideas here