Refactoring Parachain Consensus in Cumulus #2301
Description
Proposing a refactor for Cumulus' consensus code.
Requirements
The main issue with Cumulus’ current architecture is the ownership relationship between the collator and the parachain consensus code. The architecture currently has a `Collator` own a `T: ParachainConsensus`, which it invokes internally with a `build_collation` function. This is backwards: consensus should be at the top, as it may need to be highly specialized per parachain - especially as blockspace offerings such as on-demand parachains and elastic scaling evolve.
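To make the inverted ownership concrete, here is a minimal sketch of consensus sitting at the top and borrowing generic collation services. Every name in it (`ConsensusEngine`, `CollatorService`, `EvenSlotConsensus`, `on_trigger`) is hypothetical and heavily simplified; none of it is the existing Cumulus API, and the `build_collation` method is only a stand-in that reuses the name mentioned above.

```rust
// Illustrative only: simplified stand-ins, not the real Cumulus types.

/// A produced collation, reduced to an opaque payload.
#[derive(Debug, Clone, PartialEq)]
pub struct Collation(pub Vec<u8>);

/// Generic collation services that any consensus engine may reuse.
pub struct CollatorService {
    pub para_id: u32,
}

impl CollatorService {
    /// Stand-in: prefix the block bytes with the parachain id.
    pub fn build_collation(&self, block: &[u8]) -> Collation {
        let mut payload = self.para_id.to_le_bytes().to_vec();
        payload.extend_from_slice(block);
        Collation(payload)
    }
}

/// Hypothetical top-level trait: the engine decides when to author
/// and calls into the service, rather than being called by a collator.
pub trait ConsensusEngine {
    /// React to one trigger; return a collation if the engine authored.
    fn on_trigger(&mut self, service: &CollatorService, trigger: u64) -> Option<Collation>;
}

/// Toy engine that authors only on even-numbered triggers.
pub struct EvenSlotConsensus;

impl ConsensusEngine for EvenSlotConsensus {
    fn on_trigger(&mut self, service: &CollatorService, trigger: u64) -> Option<Collation> {
        if trigger % 2 == 0 {
            Some(service.build_collation(&trigger.to_le_bytes()))
        } else {
            None
        }
    }
}
```

The point of the shape is that a parachain could swap `EvenSlotConsensus` for something highly specialized without touching the shared service.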
Collating takes vastly different inputs as it’s only required for liveness and needs to interface with external blockspace markets. Examples of these inputs:
- Notifications of new relay-chain heads
- New pending transactions
- Time
- New parachain blocks
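One way to picture these inputs is as a single event type that a consensus engine matches on. This is a sketch under assumptions: the enum `CollationTrigger`, its fields, and the thresholds in `should_author` are invented for illustration and do not come from Cumulus.

```rust
use std::time::Duration;

/// Hypothetical event type covering the four inputs listed above.
#[derive(Debug, Clone, PartialEq)]
pub enum CollationTrigger {
    NewRelayHead { number: u32 },
    PendingTransactions { count: usize },
    Tick { elapsed: Duration },
    NewParachainBlock { number: u32 },
}

/// Toy authoring policy with arbitrary, made-up thresholds.
/// A real engine would choose its own rules per parachain.
pub fn should_author(trigger: &CollationTrigger) -> bool {
    match trigger {
        // Always try to build on a fresh relay-chain head.
        CollationTrigger::NewRelayHead { .. } => true,
        // Author once enough transactions have queued up.
        CollationTrigger::PendingTransactions { count } => *count >= 10,
        // Author if too much time has passed since the last attempt.
        CollationTrigger::Tick { elapsed } => *elapsed >= Duration::from_secs(6),
        // Someone else's block only updates our view of the chain.
        CollationTrigger::NewParachainBlock { .. } => false,
    }
}
```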
Things we should intend to support:
- Asynchronous backing: collation not driven by the relay chain (Prepare block authoring for asynchronous backing #2267)
- Parallelized collation authoring (when multiple execution cores at a time are possible)
- Ordering on-demand execution cores or other types of blockspace purchases.
- Pre-validation functions (Torrent-style fetching for PoVs: high-level polkadot-sdk#968)
- Quorum-based collation (tendermint-style)
- Slot-based collation (aura/sassafras-style)
- Scraping the relay chain for upcoming claims on execution cores
- Max execution times set by relay chain (Ensure that collators respect the backers timeout polkadot-sdk#72)
- Low-economic-security finality for parachain blocks, by having collators come to a consensus on an "inner" block which is then wrapped into a collation and can be re-submitted as many times as needed until it lands on the relay chain, even with different relay parents.
Proposal
The general idea is that each collator should spawn a background worker whose responsibility it is to actually create collations and share them in the network. We also need an import queue, just like Substrate, although we have a requirement for parachains which doesn’t exist in vanilla Substrate: the parachain runtime is responsible for verifying blocks’ seals. This is required in parachains, because the Polkadot validators need to check the seal as well.
The worker should have access to an `import_and_share_collation` to actually import the block locally and share this collation to the network. Separation of creating a collation and sharing it with the network is important, because it may be that a collation requires a quorum of collators. In that case, we need to create the collation, then collect a seal through a network protocol, then amend the collation, and only then share it.
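The create, collect seal, amend, share sequence could look roughly like the following. It is a sketch only: the worker is a plain thread fed by a channel, signatures are modelled as one byte per collator, and `create_collation`, `collect_seal`, `amend_with_seal`, and `spawn_worker` are hypothetical helpers. Only the name `import_and_share_collation` comes from the proposal, and its signature here is an assumption.

```rust
use std::sync::mpsc;
use std::thread;

/// Simplified collation: block bytes plus an optional quorum seal.
#[derive(Debug, Clone, PartialEq)]
pub struct Collation {
    pub block: Vec<u8>,
    pub seal: Option<Vec<u8>>,
}

/// Step 1: create an unsealed collation.
pub fn create_collation(block: Vec<u8>) -> Collation {
    Collation { block, seal: None }
}

/// Step 2 (stand-in for a network protocol): succeed only if at least
/// `quorum` signatures are available; one byte models one signature.
pub fn collect_seal(signatures: &[u8], quorum: usize) -> Option<Vec<u8>> {
    if signatures.len() >= quorum {
        Some(signatures[..quorum].to_vec())
    } else {
        None
    }
}

/// Step 3: amend the collation with the collected seal.
pub fn amend_with_seal(mut collation: Collation, seal: Vec<u8>) -> Collation {
    collation.seal = Some(seal);
    collation
}

/// Background worker: runs steps 1-3 for every incoming block and only
/// then hands the result to `import_and_share_collation` (step 4).
/// The worker exits once all senders for `blocks` are dropped.
pub fn spawn_worker<F>(
    blocks: mpsc::Receiver<Vec<u8>>,
    signatures: Vec<u8>,
    quorum: usize,
    mut import_and_share_collation: F,
) -> thread::JoinHandle<()>
where
    F: FnMut(Collation) + Send + 'static,
{
    thread::spawn(move || {
        for block in blocks {
            let collation = create_collation(block);
            // Without a quorum the collation is never shared.
            if let Some(seal) = collect_seal(&signatures, quorum) {
                import_and_share_collation(amend_with_seal(collation, seal));
            }
        }
    })
}
```

Keeping creation and sharing as separate calls is what lets the quorum step sit in between; a slot-based engine could skip `collect_seal` and go straight to sharing.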
Submitting blockspace orders or bids can be done in the same worker or in a separate background task. The general framework doesn't need to support this yet, but we should write this code (#2154).
We should also not remove any current `client/collator` or `aura` code, as downstream users depend on it. Instead, this should be built alongside, in a backwards-compatible way, giving users time to adjust their code.
Aura
The new framework should not be merged until a fully backwards-compatible "aura" implementation is done alongside it, which can be swapped in by downstream users. This would come with a deprecation, not a deletion, of the old code for doing this.
Actually rewriting Aura logic is not ideal, so it’d be better to still wrap the `SlotWorker` as is currently done, even though it was not originally designed for this use-case. To do this, we need to modularize the `simple_slot_worker` a bit more. `fn on_slot` currently does these things:
1. Calculate remaining proposal duration
2. Slot claiming (including skipping if certain criteria are not met)
3. Creating a proposal + storage proof
4. Sealing the block with a digest
5. Importing the block
To be clear, the slot worker already has a lot of these things as individual functions, but `on_slot` still does a bunch of the work that we'd like to split out, especially (3), (4), and (5).
These should be split out into separate helper functions, to the extent that they aren’t already. For basic aura, for instance, the worker logic should detect a slot (not based on real time, as is currently done), and then: compute (1) outside, then call into the aura slot worker for (2), (3), and (4), and then handle (5) itself alongside sharing to the network.
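The split described above might be sketched as follows. All types and function names are hypothetical and simplified: this is not the `simple_slot_worker` API, the proposal carries no storage proof, and the "seal" is a placeholder value rather than a signature. The slot claim uses Aura's round-robin rule (author index is the slot number modulo the authority count).

```rust
use std::time::Duration;

/// Simplified slot description handed to the worker.
pub struct SlotInfo {
    pub slot: u64,
    pub slot_duration: Duration,
    pub elapsed_in_slot: Duration,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Proposal {
    pub slot: u64,
    pub body: Vec<u8>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct SealedBlock {
    pub proposal: Proposal,
    pub seal: u64,
}

/// (1) Remaining proposal duration, computed outside the slot worker.
pub fn remaining_proposal_duration(info: &SlotInfo) -> Duration {
    info.slot_duration.saturating_sub(info.elapsed_in_slot)
}

/// (2) Round-robin slot claim: succeeds only if it is our turn.
pub fn claim_slot(slot: u64, authorities: &[u32], our_id: u32) -> Option<u32> {
    if authorities.is_empty() {
        return None;
    }
    let expected = authorities[(slot % authorities.len() as u64) as usize];
    if expected == our_id { Some(our_id) } else { None }
}

/// (3) Create a proposal; skipped when no time budget is left.
pub fn propose(slot: u64, txs: &[u8], budget: Duration) -> Option<Proposal> {
    if budget.is_zero() {
        return None;
    }
    Some(Proposal { slot, body: txs.to_vec() })
}

/// (4) Seal the block. Placeholder arithmetic, not a real signature.
pub fn seal(proposal: Proposal, author: u32) -> SealedBlock {
    let seal = proposal.slot ^ u64::from(author);
    SealedBlock { proposal, seal }
}

/// Worker logic: (1) outside, (2)-(4) via helpers, (5) handled by the
/// caller-supplied import-and-share callback.
pub fn author_for_slot(
    info: &SlotInfo,
    authorities: &[u32],
    our_id: u32,
    txs: &[u8],
    import_and_share: &mut dyn FnMut(&SealedBlock),
) -> Option<SealedBlock> {
    let budget = remaining_proposal_duration(info);
    let author = claim_slot(info.slot, authorities, our_id)?;
    let proposal = propose(info.slot, txs, budget)?;
    let block = seal(proposal, author);
    import_and_share(&block);
    Some(block)
}
```

Because import is a callback rather than part of the helpers, the same steps (2) to (4) can be reused by a worker that imports and shares in a different way.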
As for import queues: because of the “runtime needs to check seals” property, we can get away with simpler import queues that only do basic checks, like making sure a parablock’s slot isn’t in the future. These too should be customizable at the top level of Cumulus. For migrations to new consensus algorithms, the verifier should also be able to check the runtime API or version of the block, to internally delegate verification to different internal verifiers.
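A delegating verifier along these lines could be sketched as below. The trait `Verifier`, the header fields, and the version-based switch are assumptions made for illustration; they are not the Substrate import-queue traits, and `SlotWithinDrift` is an invented stand-in for "the new algorithm's check".

```rust
/// Simplified header: just what the basic checks need.
#[derive(Debug, Clone, PartialEq)]
pub struct BlockHeader {
    pub slot: u64,
    pub runtime_version: u32,
}

#[derive(Debug, PartialEq)]
pub enum VerifyError {
    SlotInFuture,
}

/// Hypothetical verifier interface for a lightweight import queue.
/// Seal checking is left to the runtime, as described above.
pub trait Verifier {
    fn verify(&self, header: &BlockHeader, current_slot: u64) -> Result<(), VerifyError>;
}

/// Basic check: the parablock's slot must not be in the future.
pub struct SlotNotInFuture;

impl Verifier for SlotNotInFuture {
    fn verify(&self, header: &BlockHeader, current_slot: u64) -> Result<(), VerifyError> {
        if header.slot > current_slot {
            Err(VerifyError::SlotInFuture)
        } else {
            Ok(())
        }
    }
}

/// Invented stand-in for a newer algorithm that tolerates some drift.
pub struct SlotWithinDrift {
    pub max_drift: u64,
}

impl Verifier for SlotWithinDrift {
    fn verify(&self, header: &BlockHeader, current_slot: u64) -> Result<(), VerifyError> {
        if header.slot > current_slot.saturating_add(self.max_drift) {
            Err(VerifyError::SlotInFuture)
        } else {
            Ok(())
        }
    }
}

/// Delegates to `old` or `new` depending on the block's runtime version.
pub struct DelegatingVerifier {
    pub switch_version: u32,
    pub old: Box<dyn Verifier>,
    pub new: Box<dyn Verifier>,
}

impl Verifier for DelegatingVerifier {
    fn verify(&self, header: &BlockHeader, current_slot: u64) -> Result<(), VerifyError> {
        if header.runtime_version < self.switch_version {
            self.old.verify(header, current_slot)
        } else {
            self.new.verify(header, current_slot)
        }
    }
}
```

Since `DelegatingVerifier` itself implements `Verifier`, it can be nested to cover more than one migration.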