Rianhughes/tendermint sync #2962
Conversation
Codecov Report

Additional details and impacted files:

@@ Coverage Diff @@
##             main    #2962      +/-   ##
==========================================
- Coverage   71.79%   71.77%   -0.02%
==========================================
  Files         267      268       +1
  Lines       28808    28901      +93
==========================================
+ Hits        20683    20744      +61
- Misses       6728     6756      +28
- Partials     1397     1401       +4

View full report in Codecov by Sentry.
// Stop syncing when we receive a quorum of prevotes
if t.uponPolkaAny() || t.uponPolkaNil() {
Why do we stop syncing if we receive a quorum of prevotes?
We should stop syncing when we receive a quorum of prevotes at our current height. E.g.:
t=0
- Our node is at height 0.
- The network is at height 100.
- Msgs are at height 100 (+-1).
- We don't see a quorum of prevotes at our current height 0, so we don't stop syncing.
t=t'
- Our node has synced to height 110.
- The network is at height 110.
- Msgs are at height 110 (+-1).
- We see a quorum of prevotes at our current height 110, so we stop syncing (see the sketch below).
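A minimal sketch of the height guard described above, using hypothetical, simplified fields (currentHeight, prevotes, quorum are illustrative; the real conditions are the state machine's uponPolkaAny/uponPolkaNil rules):

```go
package main

import "fmt"

// Simplified stand-in for the consensus state machine; fields are illustrative.
type stateMachine struct {
	currentHeight uint64
	prevotes      map[uint64]int // height -> number of distinct prevotes seen
	quorum        int
}

// stopSync reports whether the sync service should switch off: only when a
// prevote quorum is observed at our *current* height, not at a height carried
// by stale or forged messages.
func (s *stateMachine) stopSync() bool {
	return s.prevotes[s.currentHeight] >= s.quorum
}

func main() {
	s := &stateMachine{
		currentHeight: 110,
		prevotes:      map[uint64]int{110: 67, 1: 67},
		quorum:        67,
	}
	fmt.Println(s.stopSync()) // true: the quorum is at our current height
}
```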
Doesn't this mean we can be blocked forever if we never receive the precommits? I'm thinking about the case:
- We receive a prevote quorum at height 100.
- We lose internet connectivity for 20 seconds.
- Connectivity is restored, but the network has moved to height 110 and no longer broadcasts the precommits for height 100.
Another case is:
- We're at height 1, the network is at height 1000000.
- Attackers send a prevote quorum for height 1.
- This stop condition is triggered, which blocks the sync at height 1 forever.
// Todo: this needs to be added to the spec.
L2GasConsumed: 1,
}
s.proposalStore.Store(msgH, &buildResult)
buildResult must be written to proposalStore first, otherwise there can be a race condition where the driver decides to commit quickly and cannot find the proposal in proposalStore, because it's not written yet. We did this similarly in the proposal stream demux.
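A hedged sketch of the ordering being requested, using simplified stand-in types (the store, the result type, and the commit channel here are illustrative, not the real implementation): the proposal is written to the store before the driver is signalled, so even an immediate commit can find it.

```go
package main

import (
	"fmt"
	"sync"
)

// Minimal stand-ins for the real types; names are illustrative.
type buildResult struct{ l2GasConsumed uint64 }

type proposalStore struct {
	mu sync.Mutex
	m  map[uint64]*buildResult
}

func (p *proposalStore) Store(h uint64, b *buildResult) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.m[h] = b
}

func (p *proposalStore) Get(h uint64) *buildResult {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.m[h]
}

func main() {
	store := &proposalStore{m: map[uint64]*buildResult{}}
	commitCh := make(chan uint64, 1)
	done := make(chan struct{})

	// Driver side: commits as soon as it is told a height is ready.
	go func() {
		defer close(done)
		h := <-commitCh
		if store.Get(h) == nil {
			fmt.Println("race: proposal missing at commit time")
			return
		}
		fmt.Println("committed height", h)
	}()

	// Sync side: store the proposal BEFORE signalling the driver, so the
	// driver can never observe the height without finding the proposal.
	const msgH uint64 = 100
	store.Store(msgH, &buildResult{l2GasConsumed: 1})
	commitCh <- msgH

	<-done
}
```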
precommits := s.getPrecommits(types.Height(committedBlock.Block.Number))
for _, precommit := range precommits {
	s.driverPrecommitCh <- precommit
I think we should select with context.Context if possible.
To cancel early?
Any "naked" blocking operation is a source of deadlock preventing graceful shutdown.
toValue func(*felt.Felt) V
toHash func(*felt.Felt) H
proposalStore *proposal.ProposalStore[H]
blockCh chan p2pSync.BlockBody
Suggested change:
- blockCh chan p2pSync.BlockBody
+ blockCh <-chan p2pSync.BlockBody
// Todo: this needs to be added to the spec.
L2GasConsumed: 1,
I think we should also check with Starkware to understand whether the node is expected to validate the block or blindly trust it as long as there are 2f+1 commits.
// Todo: this interface allows us to mock the P2P service until we implement additional tests / test infrastructure
type WithBlockCh interface {
	service.Service
	WithBlockCh(blockCh chan p2pSync.BlockBody)
Suggested change:
- WithBlockCh(blockCh chan p2pSync.BlockBody)
+ Listen() <-chan p2pSync.BlockBody
This is because, ideally, the write side should be the one that owns the channel: a write to a closed channel can panic, while a read doesn't. The existing code already writes data to a channel, so we can expose that channel instead of forwarding it again to another channel.
To do this, we can (sketched below):
- Initialize the channel in New
- Return this channel in Listen
- Modify the pipeline.Bridge utils to accept a channel as an argument instead of initializing it inside.
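A rough sketch of that shape under the assumptions above (BlockBody, New, Listen, and Run here are simplified stand-ins for the real p2p sync service API): the service creates, writes to, and closes the channel it owns, while consumers only ever receive.

```go
package main

import (
	"context"
	"fmt"
)

// BlockBody is a simplified stand-in for p2pSync.BlockBody.
type BlockBody struct{ Number uint64 }

// Service owns blockCh: it creates it in New, writes to it in Run, and closes
// it when Run returns, so no writer can panic on a send to a closed channel.
type Service struct {
	blockCh chan BlockBody
}

func New() *Service {
	return &Service{blockCh: make(chan BlockBody)}
}

// Listen exposes the channel as receive-only, so callers cannot write to or
// close a channel they don't own.
func (s *Service) Listen() <-chan BlockBody {
	return s.blockCh
}

// Run is the write side and the only place that closes the channel.
func (s *Service) Run(ctx context.Context) {
	defer close(s.blockCh)
	for i := uint64(1); i <= 3; i++ {
		select {
		case s.blockCh <- BlockBody{Number: i}:
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	svc := New()
	go svc.Run(context.Background())
	for b := range svc.Listen() {
		fmt.Println("received block", b.Number)
	}
}
```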
func (s *Sync[V, H, A]) Run(originalCtx context.Context) {
	ctx, cancel := context.WithCancel(originalCtx)
	go func() {
		s.syncService.WithBlockCh(s.blockCh)
This can result in a race condition, because it's possible for WithBlockCh to be called after the syncService has already started receiving blocks.
This PR implements the sync service for consensus. Its purpose is to sync to the chain head, then switch off. It is not a mechanism to catch up to the chain head if we fall behind (which should be addressed separately).
The service asks P2P for the next block. It then queries peers for the precommits associated with this block (assuming they won't be in the header), builds the proposal, and sends all of this to the Driver. The Driver should then commit it (by triggering line 49). The sync service is stopped whenever the state machine sees a quorum of prevotes (the earliest possible indication that we are at the chain head) and sends a signal to the sync service to shut down.

Note: sync requires the precommits to be exposed. Currently they are not. To push the block through the state machine, we may have to forge them, i.e. create a {H, R, sender, ID} for each sender for a given block (a rough sketch of such a vote is below).
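A hedged sketch of what such a forged precommit could look like; every type and field name here is illustrative, not the real consensus vote type:

```go
package main

import "fmt"

// Illustrative stand-ins; the real consensus types may differ.
type (
	Height uint64
	Round  uint64
	Addr   string
	Hash   string
)

// Precommit is the {H, R, sender, ID} tuple described above: one vote per
// validator for the block being synced.
type Precommit struct {
	H      Height
	R      Round
	Sender Addr
	ID     Hash // hash of the block being committed
}

// forgePrecommits fabricates one precommit per known sender for a block, so
// the synced block can be pushed through the state machine even though real
// precommits are not exposed yet.
func forgePrecommits(h Height, r Round, blockID Hash, senders []Addr) []Precommit {
	out := make([]Precommit, 0, len(senders))
	for _, s := range senders {
		out = append(out, Precommit{H: h, R: r, Sender: s, ID: blockID})
	}
	return out
}

func main() {
	votes := forgePrecommits(100, 0, "0xabc", []Addr{"validator-1", "validator-2"})
	fmt.Println(votes)
}
```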
Note: The p2p logic has little to no tests (e.g. there is no test for the Run() function of the p2p.Service type). To get around this I implemented a new interface (WithBlockCh) until we implement the p2p tests. This should probably be done next, given it's a core part of the node's functionality.