Skip to content

Versioned bundle and Execution receipt #3623

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 24 commits into
base: main
Choose a base branch
from
Open

Conversation

vedhavyas
Copy link
Contributor

@vedhavyas vedhavyas commented Jul 9, 2025

First of all, changes may appear big but most of the changes are cosmetic.
I did not change everything at once but rather inroduced necessary intermediate types so that I'm aware which parts require direct changes, like runtime, and which parts require an actual compatilibility code. These intermediate types would have be removed in the following commits.

Notable changes

  • New Bundle V1 version
  • New ER version V0
  • New FraudProof V1 version that will hold both old and new Bundle Versions.

For the reviewers, you can either follow commit by commit to see how the migration is applied or look at overall changes. Overall changes are surprisingly easy to understand.

Please let me know @NingLin-P if there are changes that missed that may cause incomaptibility on current Taurus.

Cleanup

We can safely remove

  • Bundle V0
  • FraudProof V0
    Once the new traurus is deployed. As for the mainnet, assuming, this PR lands before we instantiate domains this should be safe to remove as well.

Closes: #2942

Code contributor checklist:

@vedhavyas vedhavyas requested a review from teor2345 July 9, 2025 10:12
Copy link
Member

@teor2345 teor2345 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, but it's a large change. What is our testing and risk mitigation strategy?

There are a lot of new functions and types without documentation. Can you add a short description to the production ones? (Tests aren't as important, but it can still be useful to say what a test is meant to check.)

What does a version upgrade look like? Do we need documentation that says how to do it, and tests to make sure it works? (Or a documented manual test process.)

There are some duplicated code blocks and types we could fix up, to avoid confusion.

) -> ReceiptType {
let receipt_number = execution_receipt.domain_block_number;
let ExecutionReceiptRef::V0(ExecutionReceiptV0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will this hard-coded V0 work when we upgrade the version?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently has only one version of execution receipt on runtime. If we introduce a new version, then we should handle it at that point based on new changes.

@vedhavyas
Copy link
Contributor Author

What is our testing and risk mitigation strategy?

Same as usual.

  • Deploy Local testnet with latest client but older runtime and do an upgrade.
  • Sync taurus node from scratch with latest client.

Both should cover all the cases

What does a version upgrade look like? Do we need documentation that says how to do it, and tests to make sure it works? (Or a documented manual test process.)

no, its purely defined by the runtime and client uses whatever runtime expects. Only thing we should aware is the node release should go in first followed by runtime upgrade for Taurus. For mainnet, since there are no domains, we can release a client as soon as taurus is tested.

@teor2345
Copy link
Member

no, its purely defined by the runtime and client uses whatever runtime expects. Only thing we should aware is the node release should go in first followed by runtime upgrade for Taurus.

These things seem important to document somewhere, because we'll be doing another upgrade on mainnet with domains at some point in the future.

@vedhavyas
Copy link
Contributor Author

These things seem important to document somewhere, because we'll be doing another upgrade on mainnet with domains at some point in the future.

Not necessarily. Take a look at how XDM is versioned. This is same as that. Runtime defines the versions that need to be used and client uses that.

As for protocol specs, there are no changes to the specs since nothing has changed but rather made type versioned

@teor2345 teor2345 added the needs benchmarking Significant benchmark or runtime code changes, re-run benchmarks before the runtime release label Jul 10, 2025
@vedhavyas vedhavyas requested a review from NingLin-P July 11, 2025 11:27
Copy link
Member

@NingLin-P NingLin-P left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of changes and a lot of duplicated code, I don't have the confidence to find every issue (if any) with my bare eye, thus let's see how the test goes.

@vedhavyas
Copy link
Contributor Author

vedhavyas commented Jul 14, 2025

@NingLin-P I think there maybe confusion on Bundle version and Er version when it comes to storing them

A runtime will only accept the Bundle version that is defined on the Runtime's CurrentVersion.
If a runtime says it expects V1 bundle version, then bundle submission must be of version V1 and not any other.

Now coming to ER version, it changes a bit.
ER version is defined based on the consensus block number at which ER is derived
So if the Runtime's ER version at the time of deriving block is V0, then runtime will accept V0 ER when that ER is submitted through the bundle or singleton receipt since between ER is derived and Submitted there maybe a Runtime upgrade and ER version would have changed.

now coming to how we store the versions, we hook into the set_code
When we initiate the runtime upgrade, example spec_version 2, set_code is called on runtime with spec_version 1
Spec version 1's current versions are stored against the block number which included set_code extrinsics. So any ERs derived until the upgraded number should come in the same ER version as spec version 1 ER version
From upgraded_number + 1, ER version may have changed, and all the ER derived from that number should come with version accepted by new Runtime.

I have added tests to explain this better.

Let me know if this is clear or we can do a sync call

Note: conflicts will be fixed once we have an approval from the team else merge commits will pollute the actual commits from this PR

Copy link
Contributor Author

@vedhavyas vedhavyas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies again for force pushes.

@jfrank-summit and I decided to remove the compatibility code and instead run a devent to verify this in the interest of time.

Due to that, I just simply removed the old versions instead of re-working from ground up as this might take more time.

For the reviewers, for this PR alone, please look at the full diff instead of individual commits since this will be much more cleaner due to no compatibility code.

Appreciate your patience and understanding on this.

Note: Now that we dont need compatibility code, there will be a follow-up PR based on this PR to remove any comaptibility code we had for Domains in Taurus

@vedhavyas vedhavyas requested review from NingLin-P and teor2345 July 16, 2025 06:23
@vedhavyas vedhavyas added breaking-execution This PR introduces breaking changes to the execution and removed needs benchmarking Significant benchmark or runtime code changes, re-run benchmarks before the runtime release labels Jul 16, 2025
@NingLin-P
Copy link
Member

I actually prefer a hook when the runtime is upgraded over on_initialize. This to me is much more easy to understand when the versions actually change ie... at when runtime is actually upgraded.

As for the tests you have written, will take a look why its failing later today

I don't have a strong opinion on either using set_code or on_initialize, as long as @teor2345 is okay, since he will also maintain this code.

As for the test in 00a4c43, I have investigated the failure and did find an issue with the implementation in this PR.

More specifically, when the ER version is upgraded from V0 to V1 in consensus block #N, the runtime will assume using V0 for block #N and using V1 starting from block #N+1. While on the client side, when querying the ER versions in block #N it will return V1, thus it will use V1 to construct ER derived from #N. This inconsistency will cause the bundle to be rejected and the domain chain to stop progressing.

FYI, in 048fd2b, this edge case is handled by:

// NOTE: the new runtime code is take affect in the next block, thus we pass the
// `parent_number` since which brings the new runtime code. And most important, in the
// client side while using the runtime API at `parent_hash` to query the bundle/ER
// version, it will also return the bundle/ER version in the new runtime code.
Self::record_new_bundle_and_er_version(parent_number);

Copy link
Member

@NingLin-P NingLin-P left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

Copy link
Member

@teor2345 teor2345 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This is a good change, and we should merge this PR eventually. But I do not know the best time to merge it.

I wonder if it is better to do ongoing maintenance of the current code. I think there are risks in merging within the next week (or two). And risks and costs in delaying other plans to fit it in.

Review of PR Complexity

Different members of the team have different ideas of how complex this PR is.

I think this PR is now one of our most complex PRs, because of:

  • the number of reviews/comments/questions
  • the number changed lines
    • I appreciate we've reduced that by about 25%
  • the number of PR changes after opening the PR
  • the number of bugs found during review and reviewer testing
    • including broken benchmarks

Suggested Next Steps

We're already doing some things to reduce the risk of this PR.

Another way to reduce the risk is splitting the PR into:

  • refactors that do not change functionality at all
    • renames, with the exact replacement commands in the commit message
    • turning fields into functions, with replacements in the commit
    • code movement
    • any tests that work on the old code
  • everything else that changes functionality
    • including tests for those changes

Then the first PR would be easy to review and merge, and we could focus our review efforts on the second PR.

I know this is extra work, but it might be the fastest way forward?

Copy link
Contributor Author

@vedhavyas vedhavyas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More specifically, when the ER version is upgraded from V0 to V1 in consensus block #N, the runtime will assume using V0 for block #N and using V1 starting from block #N+1. While on the client side, when querying the ER versions in block #N it will return V1, thus it will use V1 to construct ER derived from #N. This inconsistency will cause the bundle to be rejected and the domain chain to stop progressing.

Thanks for pointing it out @NingLin-P . This was indeed missed from my end. Should be fixed in the new commits.

I wonder if it is better to do ongoing maintenance of the current code. I think there are risks in merging within the next week (or two). And risks and costs in delaying other plans to fit it in.

@teor2345 The whole reason why the compatibility code is removed to actually push to devnet and subsequently to mainnet since we dont have a tarus beyond this point.

Another way to reduce the risk is splitting the PR into:

I do not agree on this. IF a PR changes or introduces somethings, then its tests and everything affected by it should be part of the PR. We did this earlier and it did not bode well especially its multiple steps of opening a new PR and context switching to something else.

I would rather have reviewers take as much as they need to understand the changes and the side-effects it brings, if any, before merging the PR or else its a No-go from my end.

@teor2345
Copy link
Member

Another way to reduce the risk is splitting the PR into:

I do not agree on this. IF a PR changes or introduces somethings, then its tests and everything affected by it should be part of the PR. We did this earlier and it did not bode well especially its multiple steps of opening a new PR and context switching to something else.

Hmm, I'm not sure if there's been a misunderstanding here.

Why are tests needed for "refactors that do not change functionality at all"?

teor2345
teor2345 previously approved these changes Jul 18, 2025
Copy link
Member

@teor2345 teor2345 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some non-blocking code style questions.

There will be some residual risk no matter how much we review this. Let's do some initial testing, and see if any bugs remain?

That will give us a better idea of the risk of this change, and we can decide what to do next when we know more.

Comment on lines +3308 to +3314
let versions = PreviousBundleAndExecutionReceiptVersions::<T>::get();
match versions.get(&block_number) {
// no upgrade happened at this number, so safe to return the current version
None => T::CurrentBundleAndExecutionReceiptVersion::get(),
// upgrade did happen at this block, so return the version stored at this number.
Some(version) => *version,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code seems to be the same as:

bundle_and_execution_receipt_version_for_consensus_number(block_number, ...)

Would it be better to remove the code duplication? (Or explain in comments why they are different.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm why you think this would be similar here. are the code comments not clear ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because both functions return the current version if there is no upgrade in this block, and the previous version if there is?

Copy link
Contributor Author

@vedhavyas vedhavyas Jul 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair! But we cannot remove this function anyway since the block_number unlike other would require use to get current block number and is straightforward getter where in other function introduces maybe even though we always get the Version.

I would still prefer to keep this as is for easier read

Note: the other function does a loop iteration to find the current version for any block number while this one does a direct lookup.

Comment on lines +3262 to +3272
pub(crate) fn set_previous_bundle_and_execution_receipt_version<SV, PV, BEV>(
block_number: BlockNumberFor<T>,
set_version: SV,
previous_versions: PV,
current_version: BEV,
) where
SV: Fn(BTreeMap<BlockNumberFor<T>, BEV>),
PV: Fn() -> BTreeMap<BlockNumberFor<T>, BEV>,
BEV: PartialEq,
{
let mut versions = previous_versions();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unusual code style.

Normally we would just pass a BTreeMap<BlockNumberFor<T>, BEV> directly to the function, return a BTreeMap<BlockNumberFor<T>, BEV> from the function, and then use the return value to set the storage.

Since the get and set are called unconditionally, I'm not sure why we're passing Fns to this function.

Passing Fns will reduce the amount of inlining and optimisation the compiler can do, potentially reducing performance (or increasing code size).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explained why I did above

Comment on lines +2943 to +2952
pub(crate) fn bundle_and_execution_receipt_version_for_consensus_number<PV, BEV>(
er_derived_number: BlockNumberFor<T>,
previous_versions: PV,
current_version: BEV,
) -> Option<BEV>
where
PV: Fn() -> BTreeMap<BlockNumberFor<T>, BEV>,
BEV: Copy + Clone,
{
let versions = previous_versions();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar feedback here, normally we would just pass BTreeMap<BlockNumberFor<T>, BEV> to this function directly.

Passing a Fn will reduce the amount of inlining and optimisation the compiler can do, potentially reducing performance (or increasing code size).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done specifically for testing. PTAL at testing this logic with mock versions and mock storage but reusing the same the same logic to pick the correct versions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made these comments based on the test code.

What stops us calling the production code as:

bundle_and_execution_receipt_version_for_consensus_number(
    er_derived_number,
    CurrentBundleAndExecutionReceiptVersion::get(),
    PreviousBundleAndExecutionReceiptVersions::get(),
)

And the test code as:

bundle_and_execution_receipt_version_for_consensus_number(
    er_derived_number,
    MockCurrentBundleAndExecutionReceiptVersion::get(),
    MockPreviousBundleAndExecutionReceiptVersions::get(),
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is due to the change in type between production and testing for BundleAndExecutionReceiptVersion and in the inner types.

Copy link
Member

@teor2345 teor2345 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears 3 benchmarks are broken now:

- pallet_messenger_from_domains_extension::from_domains_relay_message
- pallet_messenger_from_domains_extension::from_domains_relay_message_channel_open
- pallet_messenger_from_domains_extension::from_domains_relay_message_response

https://github.com/autonomys/subspace/actions/runs/16361523792/job/46230168298?pr=3623#step:6:5460

@vedhavyas
Copy link
Contributor Author

Nice! very helpful. Thanks 🙏🏼

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-execution This PR introduces breaking changes to the execution
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Versioned types that are shared between Runtime and Client for easy upgrades
4 participants