Rianhughes/tendermint sync #2962

Open
wants to merge 6 commits into
base: main

rianhughes
Contributor

@rianhughes rianhughes commented Jul 8, 2025

This PR implements the sync service for consensus. Its purpose is to sync to the chain head and then switch off. It is not a mechanism for catching up to the chain head if we fall behind (which should be addressed separately).

The service asks P2P for the next block. It then queries peers for the precommits associated with this block (assuming they won't be in the header), builds the proposal, and sends all of this to the Driver. The Driver should then commit it (by triggering line 49). The sync service is stopped whenever the state machine sees a quorum of prevotes (earliest possible indication we are at the chain head), and sends a signal to the sync service to shut down.

Note: sync requires the precommits to be exposed. Currently they are not. To push the block through the state machine, we may have to forge them, i.e. create a {H, R, sender, ID} precommit for each sender for a given block.
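
A minimal sketch of what forging could look like, assuming a hypothetical Precommit struct that carries the {H, R, sender, ID} fields mentioned above. The type, the string-typed sender and block ID, and the function name are illustrative assumptions, not the repository's actual types.

```go
package main

import "fmt"

// Precommit is a hypothetical stand-in for the consensus precommit type.
// Its fields mirror the {H, R, sender, ID} tuple; the real types will differ.
type Precommit struct {
	Height uint64
	Round  uint32
	Sender string // validator address (assumed representation)
	ID     string // hash of the committed block (assumed representation)
}

// forgePrecommits builds one precommit per validator for a block that the
// network has already committed.
func forgePrecommits(height uint64, round uint32, blockID string, validators []string) []Precommit {
	out := make([]Precommit, 0, len(validators))
	for _, v := range validators {
		out = append(out, Precommit{Height: height, Round: round, Sender: v, ID: blockID})
	}
	return out
}

func main() {
	votes := forgePrecommits(100, 0, "0xabc", []string{"val1", "val2", "val3", "val4"})
	for _, v := range votes {
		fmt.Printf("%+v\n", v)
	}
}
```

Forged votes like these carry no signatures, so this only works while the state machine does not verify them.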

Note: The p2p logic has few to no tests (e.g. no test for the Run() function of the p2p.Service type). To get around this, I implemented a new interface (WithBlockCh) until we implement the p2p tests. Those tests should probably be done next, given that p2p is a core part of the node's functionality.

@rianhughes rianhughes force-pushed the rianhughes/tendermint-sync branch from 68b3693 to 883fe4f Compare July 8, 2025 09:59
@rianhughes rianhughes force-pushed the rianhughes/tendermint-sync branch 2 times, most recently from b4efe35 to 8f0605d Compare July 9, 2025 09:54
@rianhughes rianhughes marked this pull request as ready for review July 9, 2025 09:57
@rianhughes rianhughes force-pushed the rianhughes/tendermint-sync branch from 7ffb2da to 9d6f89f Compare July 9, 2025 13:13

codecov bot commented Jul 9, 2025

Codecov Report

Attention: Patch coverage is 65.81197% with 40 lines in your changes missing coverage. Please review.

Project coverage is 71.77%. Comparing base (475f08d) to head (beba841).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| p2p/sync/sync.go | 0.00% | 27 Missing ⚠️ |
| consensus/tendermint/process.go | 0.00% | 5 Missing and 1 partial ⚠️ |
| p2p/p2p.go | 0.00% | 6 Missing ⚠️ |
| consensus/types/action.go | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2962      +/-   ##
==========================================
- Coverage   71.79%   71.77%   -0.02%     
==========================================
  Files         267      268       +1     
  Lines       28808    28901      +93     
==========================================
+ Hits        20683    20744      +61     
- Misses       6728     6756      +28     
- Partials     1397     1401       +4     

@rianhughes rianhughes force-pushed the rianhughes/tendermint-sync branch from b8b6021 to 1afdb2d Compare July 10, 2025 08:12
Comment on lines +92 to +93
// Stop syncing when we receive a quorum of prevotes
if t.uponPolkaAny() || t.uponPolkaNil() {
Contributor

Why do we stop syncing if we receive a quorum of prevotes?

Contributor Author

We should stop syncing when we receive a quorum of prevotes at our current height. E.g.

t=0
Our node is at height 0
Network is at height 100
Msgs are at height 100 (±1).
We don't see a quorum of prevotes at our current height 0, so we don't stop syncing.

t=t'
Our node is at height 110
Network is at height 110
Msgs are at height 110.
We see a quorum of prevotes at our height 110, so we should stop syncing.
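
The rule described here can be sketched as follows. This is an illustration only: it assumes equal voting power per validator and a hypothetical Prevote type, whereas the actual check in the PR goes through uponPolkaAny and uponPolkaNil and may differ.

```go
package main

import "fmt"

// Prevote is a hypothetical, simplified vote: just a height and a sender.
type Prevote struct {
	Height uint64
	Sender string
}

// hasPrevoteQuorum reports whether more than two thirds of the validators
// have prevoted at ourHeight. Votes for other heights are ignored and each
// sender is counted once. Equal voting power is an assumption of this sketch.
func hasPrevoteQuorum(ourHeight uint64, totalValidators int, votes []Prevote) bool {
	seen := make(map[string]struct{})
	for _, v := range votes {
		if v.Height == ourHeight {
			seen[v.Sender] = struct{}{}
		}
	}
	return 3*len(seen) > 2*totalValidators
}

func main() {
	// The network is voting at height 100; there are four validators.
	votes := []Prevote{{100, "a"}, {100, "b"}, {100, "c"}}

	fmt.Println(hasPrevoteQuorum(0, 4, votes))   // false: our node is still at height 0
	fmt.Println(hasPrevoteQuorum(100, 4, votes)) // true: our node has reached the voting height
}
```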

Contributor

Doesn't this mean we can be blocked forever if we don't receive the precommits? I'm thinking about this case:

  • We receive a prevote quorum at height 100.
  • We lose our internet connection for 20 seconds.
  • Connectivity is restored, but the network has moved to height 110 and no longer broadcasts the precommits for height 100.

Contributor

Another case is:

  • We're at height 1, network is at height 1000000.
  • Attackers send a prevote quorum for height 1.
  • This is triggered, which blocks the sync at height 1 forever.

// Todo: this needs added to the spec.
L2GasConsumed: 1,
}
s.proposalStore.Store(msgH, &buildResult)
Contributor

buildResult must be written to proposalStore first. Otherwise there can be a race condition where the driver decides to commit quickly and cannot find the proposal in proposalStore because it has not been written yet. We did this similarly in the proposal stream demux.
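
A rough sketch of the ordering being asked for, using a sync.Map and a plain string hash as hypothetical stand-ins for proposalStore and the proposal hash type. Because a channel send happens before the corresponding receive completes, a receiver that sees the hash is guaranteed to find the stored result.

```go
package main

import (
	"fmt"
	"sync"
)

// BuildResult is a hypothetical stand-in for the stored proposal data.
type BuildResult struct {
	L2GasConsumed uint64
}

// publish stores the build result first and only then announces its hash,
// so any receiver that sees the hash can already load the result.
func publish(store *sync.Map, driverCh chan<- string, hash string, res *BuildResult) {
	store.Store(hash, res)
	driverCh <- hash
}

// runDemo plays the driver role in a goroutine and reports whether the
// driver found the proposal for the hash it received.
func runDemo() bool {
	var store sync.Map
	driverCh := make(chan string)
	found := make(chan bool)

	go func() {
		hash := <-driverCh
		_, ok := store.Load(hash)
		found <- ok
	}()

	publish(&store, driverCh, "0xabc", &BuildResult{L2GasConsumed: 1})
	return <-found
}

func main() {
	fmt.Println("driver found proposal:", runDemo())
}
```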


precommits := s.getPrecommits(types.Height(committedBlock.Block.Number))
for _, precommit := range precommits {
s.driverPrecommitCh <- precommit
Contributor

I think we should select with context.Context if possible.

Contributor Author

To cancel early?

Contributor

Any "naked" blocking operation is a source of deadlock preventing graceful shutdown.
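
As an illustration of the suggestion, here is a small sketch of a context-aware send loop. The helper name sendAll and the generic element type are assumptions made for the example; in the PR the loop would forward precommits to driverPrecommitCh.

```go
package main

import (
	"context"
	"fmt"
)

// sendAll forwards each item to out, but gives up as soon as ctx is
// cancelled instead of blocking forever on a channel nobody reads.
func sendAll[T any](ctx context.Context, out chan<- T, items []T) error {
	for _, item := range items {
		select {
		case out <- item:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

// cancelledSend simulates shutdown: the context is already cancelled and
// the unbuffered channel has no reader, so a naked send would deadlock.
func cancelledSend() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()

	out := make(chan int)
	return sendAll(ctx, out, []int{1, 2, 3})
}

func main() {
	fmt.Println(cancelledSend()) // context canceled
}
```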

toValue func(*felt.Felt) V
toHash func(*felt.Felt) H
proposalStore *proposal.ProposalStore[H]
blockCh chan p2pSync.BlockBody
Contributor

Suggested change
- blockCh chan p2pSync.BlockBody
+ blockCh <-chan p2pSync.BlockBody

Comment on lines +111 to +112
// Todo: this needs added to the spec.
L2GasConsumed: 1,
Contributor

I think we should also check with Starkware to understand whether the node is expected to validate the block or blindly trust it as long as there are 2f+1 commits.

// Todo: this interface allows us to mock the P2P service until we implement additional tests / test infrastructure
type WithBlockCh interface {
service.Service
WithBlockCh(blockCh chan p2pSync.BlockBody)
Contributor

@infrmtcs infrmtcs Jul 16, 2025

Suggested change
- WithBlockCh(blockCh chan p2pSync.BlockBody)
+ Listen() <-chan p2pSync.BlockBody

This is because, ideally, the write side should be the one that "owns" the channel: a send on a closed channel panics, whereas a receive does not. The existing code already writes data to a channel, so we can expose that channel instead of forwarding the data again to another one.

Contributor

To do this, we can:

  • Initialize the channel in New
  • Return this channel in Listen
  • Modify pipeline.Bridge utils to accept a channel as an argument instead of initializing it inside.
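
The first two steps can be sketched roughly as below. Service, BlockBody, and the Run signature are simplified, hypothetical stand-ins for the p2p sync service, and the pipeline.Bridge change is not shown.

```go
package main

import "fmt"

// BlockBody is a hypothetical stand-in for p2pSync.BlockBody.
type BlockBody struct {
	Number uint64
}

// Service owns blockCh: it is the only writer and the only closer.
type Service struct {
	blockCh chan BlockBody
}

// New initializes the channel up front, so it exists before Run starts.
func New() *Service {
	return &Service{blockCh: make(chan BlockBody)}
}

// Listen exposes the receive-only end; callers cannot send on or close it.
func (s *Service) Listen() <-chan BlockBody {
	return s.blockCh
}

// Run produces blocks and closes the channel when it is done.
func (s *Service) Run(upTo uint64) {
	defer close(s.blockCh)
	for n := uint64(0); n < upTo; n++ {
		s.blockCh <- BlockBody{Number: n}
	}
}

// collect consumes every block the service produces.
func collect(upTo uint64) []uint64 {
	svc := New()
	blocks := svc.Listen() // obtained before Run starts, so no block is missed
	go svc.Run(upTo)

	var got []uint64
	for b := range blocks {
		got = append(got, b.Number)
	}
	return got
}

func main() {
	fmt.Println(collect(3)) // [0 1 2]
}
```

Since the consumer only ever holds the receive-only end, there is no setter that could be called after the service has started producing blocks.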

func (s *Sync[V, H, A]) Run(originalCtx context.Context) {
ctx, cancel := context.WithCancel(originalCtx)
go func() {
s.syncService.WithBlockCh(s.blockCh)
Contributor

This can result in a race condition, because WithBlockCh may be called after the syncService has started receiving blocks.

2 participants