Rianhughes/tendermint sync #2962
Conversation
Codecov Report

Additional details and impacted files:

@@ Coverage Diff @@
##             main    #2962      +/-   ##
==========================================
- Coverage   71.79%   71.77%   -0.02%
==========================================
  Files         267      268       +1
  Lines       28808    28901      +93
==========================================
+ Hits        20683    20744      +61
- Misses       6728     6756      +28
- Partials     1397     1401       +4

View full report in Codecov by Sentry.
// Stop syncing when we receive a quorum of prevotes
if t.uponPolkaAny() || t.uponPolkaNil() {
Why do we stop syncing if we receive a quorum of prevotes?
We should stop syncing when we receive a quorum of prevotes at our current height. E.g.:
t=0
- Our node is at height 0.
- The network is at height 100.
- Msgs are at height 100 (+-1).
- We don't see a quorum of prevotes at our current height 0, so we don't stop syncing.
t=t'
- Our node has synced to height 110.
- The network is at height 110.
- Msgs are at height 110 (+-1).
- We see a quorum of prevotes at our current height 110, so we stop syncing (see the sketch below).
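A minimal sketch of the height guard described above, using hypothetical, simplified fields (currentHeight, prevotes, quorum are illustrative; the real conditions are the state machine's uponPolkaAny/uponPolkaNil rules):

```go
package main

import "fmt"

// Simplified stand-in for the consensus state machine; fields are illustrative.
type stateMachine struct {
	currentHeight uint64
	prevotes      map[uint64]int // height -> number of distinct prevotes seen
	quorum        int
}

// stopSync reports whether the sync service should switch off: only when a
// prevote quorum is observed at our *current* height, not at a height carried
// by stale or forged messages.
func (s *stateMachine) stopSync() bool {
	return s.prevotes[s.currentHeight] >= s.quorum
}

func main() {
	s := &stateMachine{
		currentHeight: 110,
		prevotes:      map[uint64]int{110: 67, 1: 67},
		quorum:        67,
	}
	fmt.Println(s.stopSync()) // true: the quorum is at our current height
}
```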
Doesn't this mean we can be blocked forever if we never receive the precommits? I'm thinking about the case:
- We receive a prevote quorum at height 100.
- We lose internet connectivity for 20 seconds.
- Connectivity is restored, but the network has moved to height 110 and no longer broadcasts the precommits for height 100.
Another case is:
- We're at height 1, the network is at height 1000000.
- Attackers send a prevote quorum for height 1.
- This stop condition is triggered, which blocks the sync at height 1 forever.
// Todo: this needs to be added to the spec.
L2GasConsumed: 1,
}
s.proposalStore.Store(msgH, &buildResult)
buildResult must be written to proposalStore first, otherwise there can be a race condition where the driver decides to commit quickly and cannot find the proposal in proposalStore, because it's not written yet. We did this similarly in the proposal stream demux.
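A hedged sketch of the ordering being requested, using simplified stand-in types (the store, the result type, and the commit channel here are illustrative, not the real implementation): the proposal is written to the store before the driver is signalled, so even an immediate commit can find it.

```go
package main

import (
	"fmt"
	"sync"
)

// Minimal stand-ins for the real types; names are illustrative.
type buildResult struct{ l2GasConsumed uint64 }

type proposalStore struct {
	mu sync.Mutex
	m  map[uint64]*buildResult
}

func (p *proposalStore) Store(h uint64, b *buildResult) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.m[h] = b
}

func (p *proposalStore) Get(h uint64) *buildResult {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.m[h]
}

func main() {
	store := &proposalStore{m: map[uint64]*buildResult{}}
	commitCh := make(chan uint64, 1)
	done := make(chan struct{})

	// Driver side: commits as soon as it is told a height is ready.
	go func() {
		defer close(done)
		h := <-commitCh
		if store.Get(h) == nil {
			fmt.Println("race: proposal missing at commit time")
			return
		}
		fmt.Println("committed height", h)
	}()

	// Sync side: store the proposal BEFORE signalling the driver, so the
	// driver can never observe the height without finding the proposal.
	const msgH uint64 = 100
	store.Store(msgH, &buildResult{l2GasConsumed: 1})
	commitCh <- msgH

	<-done
}
```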
precommits := s.getPrecommits(types.Height(committedBlock.Block.Number))
for _, precommit := range precommits {
	s.driverPrecommitCh <- precommit
I think we should select with context.Context if possible.
To cancel early?
Any "naked" blocking operation is a source of deadlock preventing graceful shutdown.
toValue func(*felt.Felt) V
toHash func(*felt.Felt) H
proposalStore *proposal.ProposalStore[H]
blockCh chan p2pSync.BlockBody
Suggested change:
- blockCh chan p2pSync.BlockBody
+ blockCh <-chan p2pSync.BlockBody
// Todo: this needs to be added to the spec.
L2GasConsumed: 1,
I think we should also check with Starkware to understand whether the node is expected to validate the block or blindly trust it as long as there are 2f+1 commits.
// Todo: this interface allows us to mock the P2P service until we implement additional tests / test infrastructure
type WithBlockCh interface {
	service.Service
	WithBlockCh(blockCh chan p2pSync.BlockBody)
Suggested change:
- WithBlockCh(blockCh chan p2pSync.BlockBody)
+ Listen() <-chan p2pSync.BlockBody
This is because, ideally, the write side should be the one that owns the channel: a write to a closed channel can panic, while a read doesn't. The existing code already writes data to a channel, so we can expose that channel instead of forwarding it again to another channel.
To do this, we can (sketched below):
- Initialize the channel in New
- Return this channel in Listen
- Modify the pipeline.Bridge utils to accept a channel as an argument instead of initializing it inside.
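A rough sketch of that shape under the assumptions above (BlockBody, New, Listen, and Run here are simplified stand-ins for the real p2p sync service API): the service creates, writes to, and closes the channel it owns, while consumers only ever receive.

```go
package main

import (
	"context"
	"fmt"
)

// BlockBody is a simplified stand-in for p2pSync.BlockBody.
type BlockBody struct{ Number uint64 }

// Service owns blockCh: it creates it in New, writes to it in Run, and closes
// it when Run returns, so no writer can panic on a send to a closed channel.
type Service struct {
	blockCh chan BlockBody
}

func New() *Service {
	return &Service{blockCh: make(chan BlockBody)}
}

// Listen exposes the channel as receive-only, so callers cannot write to or
// close a channel they don't own.
func (s *Service) Listen() <-chan BlockBody {
	return s.blockCh
}

// Run is the write side and the only place that closes the channel.
func (s *Service) Run(ctx context.Context) {
	defer close(s.blockCh)
	for i := uint64(1); i <= 3; i++ {
		select {
		case s.blockCh <- BlockBody{Number: i}:
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	svc := New()
	go svc.Run(context.Background())
	for b := range svc.Listen() {
		fmt.Println("received block", b.Number)
	}
}
```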
func (s *Sync[V, H, A]) Run(originalCtx context.Context) {
	ctx, cancel := context.WithCancel(originalCtx)
	go func() {
		s.syncService.WithBlockCh(s.blockCh)
This can result in a race condition, because it's possible for WithBlockCh to be called after the syncService has already started receiving blocks.
This PR implements the sync service for consensus. Its purpose is to sync to the chain head, then switch off. It is not a mechanism to catch up to the chain head if we fall behind (which should be addressed separately).
The service asks P2P for the next block. It then queries peers for the precommits associated with this block (assuming they won't be in the header), builds the proposal, and sends all of this to the Driver. The Driver should then commit it (by triggering line 49). The sync service is stopped whenever the state machine sees a quorum of prevotes (the earliest possible indication that we are at the chain head) and sends a signal to the sync service to shut down.

Note: sync requires the precommits to be exposed. Currently they are not. To push the block through the state machine, we may have to forge them, i.e. create a {H, R, sender, ID} for each sender for a given block (a rough sketch of such a vote is below).
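A hedged sketch of what such a forged precommit could look like; every type and field name here is illustrative, not the real consensus vote type:

```go
package main

import "fmt"

// Illustrative stand-ins; the real consensus types may differ.
type (
	Height uint64
	Round  uint64
	Addr   string
	Hash   string
)

// Precommit is the {H, R, sender, ID} tuple described above: one vote per
// validator for the block being synced.
type Precommit struct {
	H      Height
	R      Round
	Sender Addr
	ID     Hash // hash of the block being committed
}

// forgePrecommits fabricates one precommit per known sender for a block, so
// the synced block can be pushed through the state machine even though real
// precommits are not exposed yet.
func forgePrecommits(h Height, r Round, blockID Hash, senders []Addr) []Precommit {
	out := make([]Precommit, 0, len(senders))
	for _, s := range senders {
		out = append(out, Precommit{H: h, R: r, Sender: s, ID: blockID})
	}
	return out
}

func main() {
	votes := forgePrecommits(100, 0, "0xabc", []Addr{"validator-1", "validator-2"})
	fmt.Println(votes)
}
```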
Note: The p2p logic has little to no tests (e.g. there is no test for the Run() function of the p2p.Service type). To get around this I implemented a new interface (WithBlockCh) until we implement the p2p tests. This should probably be done next, given it's a core part of the node's functionality.