Unintended Consequences from Leader Race #18308
Comments
I didn't discuss network topology above, but we need to take into account that the network delay for any leader's first slot will always be higher than for slots 2-4. In the example above, the data for slot 591 traveled straight-ish from Ogden, UT to my node in Frankfurt, DE because that was 93P6's 4th slot. For slot 592, I need to wait for the final shred of 591 to be sent from Ogden, UT to Tokyo, JP before FzPo can start building 592. Then the first shred of 592 is sent from Tokyo to Frankfurt. For slots 593-595, we don't have the extra hop because FzPo is building on its own blocks. For every leader transition, we need to insert NN milliseconds of 'sync time' while waiting for the first shred of the new leader's first slot. For anyone who has rowed in an 8-person rowing shell, you will know intuitively how this works. There is a slight pause near 'the catch' for all of the rowers to get in sync before the next power stroke. The rhythm of the cluster looks like 1-2-3-4-sync-1-2-3-4-sync-1-2-3-4 and so on.
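To put rough numbers on that rhythm, here is a toy sketch (not validator code): 400ms is the nominal slot time, and `SYNC_MS` is only an assumed stand-in for the 'NN milliseconds' above.

```rust
// Toy illustration of the 1-2-3-4-sync rhythm described above. The 400ms slot
// time is the nominal target; SYNC_MS stands in for the hypothetical
// "NN milliseconds" of hand-off time and is not a real constant.
const SLOT_MS: u64 = 400;
const SLOTS_PER_LEADER: u64 = 4;
const SYNC_MS: u64 = 100; // assumed value for illustration only

fn main() {
    let mut wall_clock_ms: u64 = 0;
    for leader in 0..3 {
        // The first slot of each new leader pays the extra propagation hop.
        wall_clock_ms += SYNC_MS;
        for slot in 0..SLOTS_PER_LEADER {
            wall_clock_ms += SLOT_MS;
            println!("leader {leader}, slot {slot}: ends at {wall_clock_ms} ms");
        }
    }
    // 12 slots of nominal work (4800 ms) take 5100 ms of wall clock here.
    println!("total: {wall_clock_ms} ms for {} slots", 3 * SLOTS_PER_LEADER);
}
```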
This is a related issue: #18044
Wow, what a cool thing that we got a well-reasoned bug report, complete with experiments, from our community. I'm proud of it. Recently, I was bothered about skip rates on testnet as well. I was observing similar behavior there and thinking about patient leaders, too. @behzadnouri @carllin summoning the consensus expert and the gossip/turbine expert :)
@brianlong Do we have the full logs from this?
I wouldn't describe Shinobi waiting 2 seconds before voting as comfortable. It waited until it had actually completed the block and frozen it, so it didn't wait any longer than it had to, assuming it wants to know the block is actually complete before voting on it. This is different from waiting to start your own leader block, which happens when the node thinks it has waited the appropriate amount of time to receive blocks from previous leaders.
Hi @sakridge, thanks for looking into this. See logs posted above. I can also upload a bunch of logs from non-voting RPC servers if that will be helpful.
This is the point that I'm making. There are different concepts of the "appropriate amount of time." L1 seemed to give up waiting on L0 much earlier than Shinobi or Block Daemon did when voting. When L1 creates the fork, we are guaranteed to skip 4 slots from either L0 or L1. When I look at
There's not. That 2 seconds to vote on 592 wasn't baked into the validator code anywhere. It wasn't "waiting" any more than usual; the validator always waits for blocks and votes on them as fast as it can. So that was purely determined by when all the other validators managed to send the Shinobi node the block. If it had had access to better blocks earlier, then it would have voted on those instead.
@sakridge Why didn't DDnA also follow the same logic?
In which operation?
When it created the fork for 596. Shreds for 592 were streaming in at that time, and we were still within the window of time for L0's 4 slots. L1 (DDnA) jumped the gun.
The 592 window is the first 400ms after 591. DDnA waited 4 slots (1.4s) just to receive 592, and it didn't know whether it would even complete the block or not.
And this is where the "Unintended Consequences from Leader Race" come into the conversation. What is the benefit of the leader race? We are currently skipping 23.724% of slots, and most of those are coming in groups of 4. The race is clearly causing skipped slots. Does the benefit outweigh the cost?
Delinquency on mainnet is usually quite low -- currently 0.16%. The current skip rate is not from delinquent nodes.
There is another data point that stands out to me: the duration of 596 on each of the machines.

Block Logic (Frankfurt, DE): 0.574 from new fork to vote

I am assuming that the network latency from Block Logic to the other two nodes is neutralized because I'm only looking at the local duration on each node. That's a huge gap between Shinobi & the EU validators. I wonder how much of that time was lost in Turbine/Repair? The leader for 592 was also a great distance away in Tokyo. I realize that is only a single data point, but it's worth looking into. I'll review the logs again. The unintended consequences I described above might be exacerbated by messy shred transmissions. Improvements there might make this less of a problem...
The transmission time for Shinobi is this:

From this I think it had to repair the slot, so I think that's probably why it took longer: bank frozen came quite late, 1 second after

From the compute stats:
Actual compute time was only 110ms. This seems related to an issue we are currently working on to make this faster.
@sakridge Thanks again for digging into this issue. That data is super helpful.
Please let me know if I have visualized this correctly:
When the actual duration of Work diverges from the fixed schedule, the only way to maintain the schedule is to skip Slots. In the example above, the actual Work time for each Leader exceeds the allocated time for their four Slots. By the time we get to the 4th Leader, the actual duration of Work has consumed 5 Slots. To keep the Work in sync with the schedule, we skip all four of the 5th Leader's Slots. The schedule is back in sync starting with Leader 6. In the impatient scenario, the "Leader Race" is the mechanism that provides for skipping Slots to maintain the schedule. In the patient scenario, we accept that block times are longer in order to skip fewer slots. |
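To make the arithmetic concrete, here is a toy model of that divergence; the 400ms scheduled slot time is the nominal target, while the 500ms of "actual" work per slot is an invented number chosen so the drift reaches one full rotation by the 5th leader.

```rust
// Toy model of "work vs. schedule" divergence: each leader is scheduled for
// 4 slots of 400ms but actually needs 500ms per slot (invented figure). A
// leader's slots are skipped once the previous leaders' actual work has
// already consumed that leader's entire scheduled window.
fn main() {
    const SLOT_MS: u64 = 400;
    const ACTUAL_SLOT_MS: u64 = 500; // assumed, for illustration
    const SLOTS_PER_LEADER: u64 = 4;

    let mut schedule_ms = 0; // where the fixed schedule says the cluster should be
    let mut actual_ms = 0; // where the cluster actually is
    for leader in 1..=6 {
        schedule_ms += SLOTS_PER_LEADER * SLOT_MS;
        if actual_ms >= schedule_ms {
            // The schedule marches on; this leader's whole window is gone.
            println!("leader {leader}: all {SLOTS_PER_LEADER} slots skipped");
        } else {
            actual_ms += SLOTS_PER_LEADER * ACTUAL_SLOT_MS;
            println!(
                "leader {leader}: built 4 slots, actual {actual_ms} ms vs schedule {schedule_ms} ms"
            );
        }
    }
    // Output: leaders 1-4 build, leader 5 is skipped entirely, leader 6 is back in sync.
}
```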
Those skipped slots are not benign -- they are responsible for forks and failed transactions. Take a moment to look at this situation from a different point of view. For example, when a price oracle is sending transactions during turn 5 in the schedule, the oracle expects to receive price results in a timely manner. If the skipped slots cause the RPC server to go off on a minority fork, the oracle will fail to receive results within its SLA window. Missing data points are very bad for an oracle.

Solution Ideas: A) Adjust the schedule to be closer to the actual work. For example, add 100ms of sync time at the leader transition to accommodate the extra data hop between the 4th slot of one leader and the 1st slot of the next leader.
BTW, that price oracle problem above is real. We are over-provisioning RPC servers N:1 with the hope that at least one of the N servers is on the majority fork (highest root).
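In case it helps anyone reproducing that setup, here is a minimal client-side sketch of the workaround (not anything the validator itself does): ask each RPC server for its most recent finalized slot and route reads to the one that is furthest ahead. It assumes the `solana-client` crate's blocking `RpcClient`; the endpoint URLs are placeholders.

```rust
use solana_client::rpc_client::RpcClient;
use solana_sdk::commitment_config::CommitmentConfig;

fn main() {
    // Placeholder endpoints for the N over-provisioned RPC servers.
    let endpoints = [
        "http://rpc-0.example.internal:8899",
        "http://rpc-1.example.internal:8899",
        "http://rpc-2.example.internal:8899",
    ];

    // Query each server for its latest finalized (rooted) slot and keep the
    // one that is furthest ahead; unreachable servers are simply skipped.
    let best = endpoints
        .iter()
        .filter_map(|url| {
            let client = RpcClient::new(url.to_string());
            client
                .get_slot_with_commitment(CommitmentConfig::finalized())
                .ok()
                .map(|slot| (*url, slot))
        })
        .max_by_key(|(_, slot)| *slot);

    match best {
        Some((url, slot)) => println!("routing reads to {url} (finalized slot {slot})"),
        None => eprintln!("no RPC server reachable"),
    }
}
```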
@brianlong well, I think RPC's
so, there were 3 forks at the time:
|
The root cause of the forks is that leader 1's blocks are really slow to propagate (it seems this is partly dependent on the transaction count in a block; signature_count in the above log messages is approximate):

The odd part is that the 4th slot for leader 1 got full across the cluster before its 1st slot did:
Then, leader 2 and leader 3 naively continue to create a minor fork even after they supposedly had received

Then, their forks are abandoned by the whole cluster... I guess they would be better off waiting for leader 1 more patiently, or at least interrupting their fork and holding their remaining leader slots to connect to leader 1's fork, as soon as
@ryoqun thanks for the great analysis, this tool is super useful for narrowing down cluster-wide behavior! So it seems like there's some correlation between number of transactions and how long things take to propagate. Combined with this observation:
I wonder if this is a failure in turbine to propagate some number of the blocks. You would think that as long as turbine propagated at least 50% of the blocks, we should get recovery (unless erasure recovery is also really slow?). For that block where |
@jbiseda this might be a good thing for you to look at: why turbine is taking so long here. @carllin in this case there was no hope for leader 1 to start its first slot (84414596) from 84414595, but potentially it could have chosen 84414595 as the parent of 84414598 and/or 84414599. That might have hedged the bet that 84414595 would not be the chosen fork. I also wonder what kind of other propagation signals (epoch slots/votes) might be there that might have indicated that was a smart thing to do. What do you think?
@sakridge hmmm yeah, this is an interesting case.
Clarification: I think you mean

Analysis: First

It's good to note for some background (tagging @jstarry, @AshwinSekar since they were part of this review):
Behavior we're observing:

So essentially we have an issue that looks like
Given that this node before the bank

Then this node would pick
So indeed, based on the logs we kind of see a scenario where 1) happens but 2) fails briefly:
In summary:
To @sakridge's point here, the switching threshold prevented us from building
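For readers following along, here is a simplified sketch of the idea behind the switching threshold: a validator may only switch away from its current fork once enough stake is visibly committed elsewhere. The 38% figure and the shape of the check are my assumptions for illustration; the real logic in the replay/consensus code is considerably more involved.

```rust
// Simplified sketch of a switch-fork threshold check. This is NOT the real
// consensus code; the 38% figure and the function shape are assumptions.
const SWITCH_FORK_THRESHOLD: f64 = 0.38; // assumed value

/// `other_fork_stake`: stake observed voting on (or locked out on) forks that
/// do not descend from our last vote. `total_stake`: total active epoch stake.
fn can_switch_forks(other_fork_stake: u64, total_stake: u64) -> bool {
    (other_fork_stake as f64 / total_stake as f64) > SWITCH_FORK_THRESHOLD
}

fn main() {
    // With only 20% of stake seen on competing forks, the node must keep
    // building on / voting for its current fork, even if that fork looks dead.
    assert!(!can_switch_forks(20, 100));
    // Once more than 38% of stake is visible elsewhere, switching is allowed.
    assert!(can_switch_forks(40, 100));
    println!("switch-threshold sketch holds");
}
```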
Yes sorry, slot 84414592. |
Thanks so much for your excellent work on this issue. Regarding servers being stuck on a lower root, here is an example from earlier today on mainnet RPC:
For server #s 0-3, the root is 68 to 72 slots lower than on the other six servers. For defi apps, I'm assuming that servers 0-3 will return inaccurate results until their root catches up with the rest of the cluster. I hope that a smooth transition from one leader to the next can eliminate most of the forks that we see. The problems being solved here will make a huge difference for the apps.
Yeah, definitely the current forky cluster is contributing to the differing roots. However, if your defi apps are that concerned about timing, it would be better for them to do their read-consistency management on the client side, or for the RPC pools to implement session affinity. After all, minor forks are inevitable no matter what, due to delinquent leaders or small across-the-globe networking glitches. Personally, I care about these consistency issues (I'm a nit-picky guy... lol), so I'm intending to fix this at the web3 level, but I've had no time up to now: #15705
I wouldn't say for 100% certain, but I don't think those differences are really due to forking. All the nodes are essentially processing the forks in the same way. There are probably timing differences between nodes, though, in terms of receiving network packets or encountering processing delays. I think providing logs or metrics to compare 2 of the nodes with a large difference might tell us something about it.

Edit: I noticed I disagree a bit with Ryo here. A new root for all nodes could be delayed because of forking, but if one node has set a fork and another hasn't, it should just mean that the slower node has not yet processed blocks which are already available in the network.
I spent some time looking at another sequence of skipped slots from the point of view of three RPC nodes. This is the slot sequence from epoch 202 (annotated with location)
I have also attached a zip file with logs & a spreadsheet showing a side-by-side comparison. There are a few points that stand out for me.
I'm not sure what "abandon its remaining slots" above will actually look like. Deliver no blocks? Just ticks? Say 'pass' like when playing poker? |
(CodePen - consensus2.pdf: this is the rainbow chart of the same time window.) I'll dig in later...
@brianlong sorry, I couldn't reply in a timely manner at all. I was too busy fixing the skip problem... still working on this. And as part of this, I've finally made the rainbow tool accessible (hopefully) for all (still buggy and slow): https://codepen.io/ryoqun/full/JjNbeyE You can input your validator identity and see how it produced blocks (and possibly how you won the main fork if there are competing forks).
As an illustration, this is how to read the chart. I hope labor-intensive log comparison is a thing of the past:
@ryoqun Woah! That's a cool trick. I'll need a little time to dig into this. Thanks! |
Problem
The race between the current leader "L0" and the next leader in line "L1", described in https://github.com/solana-labs/solana/blob/22a18a68e3ee68ae013d647e62e12128433d7230/docs/src/implemented-proposals/leader-leader-transition.md, is causing skipped slots. See the "Supporting Information" section below for a view of four skipped slots from the perspective of three different validators.
I believe that the race is intended to shorten the gaps in transaction processing when L0 is offline. Instead, with slow network propagation, we see situations where L1 creates a new fork for its 4 slots and then loses out when the voters vote for L0. In cases like this, the ONLY WAY to resolve the conflict is to skip 4 slots. Ironically, the policy intended to minimize the impact of skipped slots is actually causing them.
I see an inconsistency between the relative patience of L1 and the voters. L1 is acting in an impatient manner by jumping ahead to create a new fork. In contrast, the voters will patiently wait for L0's work to arrive before voting. We are asking/begging for skipped slots/forks when L1 and the voters are working from a different set of rules.
Supporting Information
Thanks to @nyet | Blockdaemon & @zantetsu | Shinobi Systems for sharing logs. I see some inconsistent behavior related to event timing between the leader and the validators. Here are some details for discussion. Relevant slots & leaders:
For all three of us (Block Daemon, BlockLogic, & Shinobi), the transition from 591 => 592 looked normal. Here are some events from Zantetsu's log (timestamps are seconds):
You'll notice that DDnA started to send shreds for 596 shortly after FzPo sent shreds for 592.
After both banks were frozen, Shinobi voted on 592 about 2.26 seconds after the first shreds arrived and 2.46 seconds after voting for 591. Keep those times in mind as we go along. I think they matter.
For Block Daemon, we also see a smooth handoff from 591 => 592. They vote for 592 about 1.30 seconds after receiving the first shred and 1.74 seconds after voting for 591.
On my server, the hand-off from 591-592 took a little longer (1.14 seconds). However, it looks like DDnA didn't wait for a reasonable grace period before starting a new fork. The fork for 596 was created 0.32 seconds after the first shred from 592 and 1.47 seconds after voting for 591. The frozen bank for 592 was delivered 1.02 seconds after the first shred and 2.17 seconds after voting for 591.
So, here is where I see the inconsistent behavior: Shinobi was comfortable waiting over 2 seconds before voting on 592. Yet DDnA created a new fork less than 1.5 seconds after voting on 591.
If DDnA had waited a similar period, it would have seen the frozen bank for 592 and NOT created the fork at 596.
Proposed Solution Ideas
To resolve the conflict, we need L1 and the voters to share the same behavior. Either the cluster is patient or impatient, not both at the same time. Here are a couple of solution ideas: one for a patient cluster and another for the impatient, fast-twitch crowd.
Patient Cluster: L1 should only jump ahead when L0 is known to be delinquent. Otherwise, wait for the shreds to arrive and proceed normally. Please note that L0 in the example above is located in Tokyo and it will have a longer propagation time to most of the nodes in the cluster. The cluster will be much more tolerant of network propagation delays. The cluster will also have higher transactional throughput because we are not skipping as many slots. (There will still be a maximum wait period to avoid stalls. Both L1 and the voters will use the same wait period.)
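As a sketch of what the patient rule could look like (every name and the 2-second value here are hypothetical, not an existing API): L1 starts its own fork only if L0 is already known to be delinquent, or if a shared maximum wait has elapsed with no new shreds from L0. The same constant would gate how long the voters wait, so leaders and voters follow one rule.

```rust
use std::time::{Duration, Instant};

// Hypothetical "patient cluster" rule; all names and the 2s value are invented.
const MAX_WAIT_FOR_PREVIOUS_LEADER: Duration = Duration::from_secs(2);

fn should_start_own_fork(
    l0_known_delinquent: bool,
    last_shred_from_l0: Option<Instant>,
    l0_window_start: Instant,
    now: Instant,
) -> bool {
    if l0_known_delinquent {
        return true; // no point waiting on a leader the cluster knows is down
    }
    // Measure patience from the most recent sign of life, falling back to the
    // scheduled start of L0's window if nothing has arrived at all.
    let last_progress = last_shred_from_l0.unwrap_or(l0_window_start);
    now.duration_since(last_progress) > MAX_WAIT_FOR_PREVIOUS_LEADER
}

fn main() {
    let window_start = Instant::now();
    // Shreds are still streaming in, so L1 keeps waiting instead of forking.
    assert!(!should_start_own_fork(false, Some(window_start), window_start, window_start));
    println!("patient-cluster sketch holds");
}
```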
The Impatient Cluster: The behavior of the voters will be consistent with L1, not L0. The voters will recognize that L0 was slow to send data, and will prefer to vote for L1 in these cases. This solution may have a further unintended consequence of forcing nodes out of locations like Tokyo. We can't be too impatient if we want to see validator nodes all over the world. This solution also means that we are allowing more skipped slots which lowers the transactional throughput of the cluster. Sometimes faster isn't actually faster.
We'll need to debate the Patient v. Impatient cluster before arriving at a final solution. Once decided, the path forward will be clear.