
Forward CoreInfo via a digest to the runtime #9002


Open · bkchr wants to merge 27 commits into master

Conversation

@bkchr (Member) commented Jun 26, 2025

Before this pull request we had the rather inflexible SelectCore type in parachain-system. It simply took the last byte of the block number as the core selector, which resulted in issues like #8893. While it was not totally static, it was very complicated to forward the needed information to the runtime. When running with block bundling (500ms blocks), multiple blocks are actually validated on the same core, and finding out the selector and offset without access to the claim queue is rather hard. The claim queue could be forwarded to the runtime, but that would waste PoV size, as we would need to include the entire claim queue of all parachains.

This pull request solves the problem by moving the entire core selection to the collator side. From there, the information is passed to the runtime via a PreRuntime digest. The CoreInfo contains the selector, claim_queue_offset and number_of_cores. Doing this on the collator side is fine as long as parachain slot durations are not lower than the relay chain slot duration. Since we have agreed that parachain slot durations are always equal to or greater than the relay chain's, this change should not cause any problems.
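To make the mechanism concrete, here is a minimal sketch of how such a pre-runtime digest could be assembled on the collator side. The engine ID, field types and function name are illustrative assumptions; the actual definitions live in cumulus and may differ:

use codec::{Decode, Encode};
use sp_runtime::DigestItem;

// Hypothetical consensus engine ID for the CoreInfo pre-runtime digest;
// the identifier actually used by this PR may differ.
const CORE_INFO_ENGINE_ID: [u8; 4] = *b"CORE";

/// Sketch of the payload the collator forwards to the runtime.
#[derive(Encode, Decode)]
struct CoreInfo {
	/// Which of the para's assigned cores was selected.
	selector: u8,
	/// Claim queue offset the block was built against.
	claim_queue_offset: u8,
	/// Total number of cores assigned to the para.
	number_of_cores: u16,
}

/// Collator side: wrap the SCALE-encoded payload in a `PreRuntime` digest
/// item so the runtime can decode it while executing the block.
fn core_info_digest(info: &CoreInfo) -> DigestItem {
	DigestItem::PreRuntime(CORE_INFO_ENGINE_ID, info.encode())
}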

Downstream users need to remove the SelectCore type from their parachain_system::Config:

- type SelectCore = ...;
+

Closes: #8893 #8906

@bkchr bkchr requested review from skunert and alindima June 26, 2025 15:41
@bkchr bkchr requested a review from a team as a code owner June 26, 2025 15:41
@bkchr bkchr added T0-node This PR/Issue is related to the topic “node”. T9-cumulus This PR/Issue is related to cumulus. labels Jun 26, 2025
@bkchr (Member, Author) commented Jun 27, 2025

/cmd fmt

@skunert (Contributor) left a comment

Overall changes look good.

I think it is worth mentioning that these changes have some impact on which component limits our block-building throughput. With the static CoreSelector we had before, we were always authoring at a fixed claim queue offset, so once our cores for that offset were used up, we would skip authoring. After this change, we are still limited by the slot_timer, which is updated based on the claim queue at offset 0. However, the main responsibility now lies with the velocity configured in the runtime: it needs to be set correctly to prevent excessive block production (see the sketch below).
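For context, the velocity limit lives in the runtime's consensus hook. A minimal sketch of the usual wiring via cumulus-pallet-aura-ext, with purely illustrative constant values (not the values of any particular chain):

use cumulus_pallet_aura_ext::FixedVelocityConsensusHook;

/// Relay chain slot duration in milliseconds.
const RELAY_CHAIN_SLOT_DURATION_MILLIS: u32 = 6000;
/// Maximum number of parachain blocks authored per relay chain block.
const BLOCK_PROCESSING_VELOCITY: u32 = 1;
/// Maximum number of para blocks awaiting inclusion at any time.
const UNINCLUDED_SEGMENT_CAPACITY: u32 = 3;

/// The hook a runtime plugs into `parachain_system::Config::ConsensusHook`;
/// it limits how many blocks may be authored per relay chain slot.
type ConsensusHook<Runtime> = FixedVelocityConsensusHook<
	Runtime,
	RELAY_CHAIN_SLOT_DURATION_MILLIS,
	BLOCK_PROCESSING_VELOCITY,
	UNINCLUDED_SEGMENT_CAPACITY,
>;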

Also, I quickly discussed with @bkchr the possibility of abusing the dynamic claim queue offset in a scenario where an elastic-scaling chain is configured for, say, 3 cores but has only 1 scheduled. In that scenario, the velocity and runtime constraints are too generous and allow block stealing from future authors.

One thing that is not yet clear to me (or I forgot) is the exact timing of backing:

  • If I build a block on claim queue offset 2, will this block be backed immediately, or only when this claim arrives at position 0 in two relay chain blocks?
  • If I build a block on claim queue offset 2, can the cores at offset 0 and 1 still be used, or are they "blocked" by the usage of offset 2? Intuitively, I would expect that we need to use the cores in order.

@alindima do you know the details of these points?

/// Determine the core for the given `para_id`.
///
/// Takes into account the `parent` core to find the next available core.
async fn determine_core<Header: HeaderT, RI: RelayChainInterface + 'static>(
Contributor: Can we have a test for this one?

relay_parent: &RelayHeader,
para_id: ParaId,
parent: &Header,
) -> Result<Option<(CoreSelector, ClaimQueueOffset, CoreIndex, u16)>, ()> {
Contributor: Would be nice to add a little doc on what the u16 is here.
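One way to make it self-documenting (a sketch: it assumes the u16 is the total number of cores assigned to the para, matching the number_of_cores forwarded in the digest, and the import path is likewise an assumption):

// Import path is an assumption for this sketch.
use cumulus_primitives_core::{ClaimQueueOffset, CoreIndex, CoreSelector};

/// Sketch: a named return type instead of the anonymous 4-tuple.
struct DeterminedCore {
	/// Selector to forward to the runtime.
	selector: CoreSelector,
	/// Claim queue offset the block will be built against.
	claim_queue_offset: ClaimQueueOffset,
	/// The concrete relay chain core the candidate targets.
	core_index: CoreIndex,
	/// Total number of cores assigned to the para (the previously
	/// undocumented `u16`).
	number_of_cores: u16,
}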

});

for (offset, cores) in offset_to_core_count {
if (offset as u32) < claim_queue_offset {
Contributor: Why bother adding items with offset < claim_queue_offset to the map in the first place?
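A sketch of that suggestion, with a stand-in iterator since the code that builds the map is not visible in this hunk:

use std::collections::BTreeMap;

/// Sketch: only count offsets at or beyond `claim_queue_offset`, so the
/// loop above no longer needs the `< claim_queue_offset` guard. `claims`
/// stands in for whatever `(offset, core)` iterator feeds the map in the
/// real code.
fn offsets_from(
	claims: impl Iterator<Item = (u8, u32)>,
	claim_queue_offset: u32,
) -> BTreeMap<u8, u32> {
	claims
		.filter(|(offset, _)| u32::from(*offset) >= claim_queue_offset)
		.fold(BTreeMap::new(), |mut map, (offset, _core)| {
			*map.entry(offset).or_default() += 1;
			map
		})
}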

Comment on lines 550 to 564
let res = if relay_parent_offset >
	core_info.as_ref().map(|ci| ci.claim_queue_offset).unwrap_or_default().0 as u32
{
	claim_queue.find_core(para_id, 0, 0)
} else {
	claim_queue.find_core(
		para_id,
		core_info.as_ref().map_or(0, |ci| ci.selector.0 as u32 + 1),
		core_info
			.as_ref()
			.map_or(0, |ci| ci.claim_queue_offset.0 as u32 - relay_parent_offset),
	)
};

Ok(res)
Contributor: I found this part a bit hard to digest. What do you think about this?

Suggested change (replacing the block above):

let (cores_claimed, queue_offset) = match core_info {
	Some(CoreInfo { selector, claim_queue_offset, .. })
		if relay_parent_offset <= claim_queue_offset.0 as u32 =>
		(selector.0 as u32 + 1, claim_queue_offset.0 as u32 - relay_parent_offset),
	_ => (0, 0),
};

Ok(claim_queue.find_core(para_id, cores_claimed, queue_offset))

?claimed_cores,
"Claimed cores.",
slot_timer.update_scheduling(
claim_queue
Contributor: Why not use number_of_cores?

@skunert (Contributor) commented Jul 9, 2025

One more thing to think about is backward compatibility. These changes are breaking, since older runtimes which use the CoreSelector runtime API are no longer compatible with this node. However, technically ES is already released and chains are able to use it.

@alindima (Contributor) commented Jul 9, 2025

> One more thing to think about is backward compatibility. These changes are breaking, since older runtimes which use the CoreSelector runtime API are no longer compatible with this node. However, technically ES is already released and chains are able to use it.

They can't yet use it, since the v2 receipts feature is not yet enabled (but will soon be). And even after it's enabled, they could only use it if they enabled the experimental-ump-signals compile feature (or implemented their own custom logic for sending UMP signals).

But it would indeed be worth thinking about the worst-case scenario if they did.

@alindima (Contributor) commented Jul 9, 2025

> If I build a block on claim queue offset 2, will this block be backed immediately, or only when this claim arrives at position 0 in two relay chain blocks?

If you also have a claim at offset 0 on the same core, it will be backed immediately.

> If I build a block on claim queue offset 2, can the cores at offset 0 and 1 still be used, or are they "blocked" by the usage of offset 2? Intuitively, I would expect that we need to use the cores in order.

You can only occupy the core if you have the full candidate chain up until the latest included candidate of the para. And you can only occupy the cores at offset 0.

Therefore, you can't occupy a core at offset 0 if it's not building on the latest included block (unless you have the full chain being backed right now at offset 0). So your intuition is right.
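A toy model of that rule, just to pin down the reasoning (hypothetical types, not the actual scheduler code):

/// Toy candidate: `parent` is the hash of the block it builds on.
struct Candidate {
	hash: u64,
	parent: u64,
}

/// A candidate can occupy a core (always at claim queue offset 0) only if
/// the pending candidates form an unbroken chain from it back to the
/// latest included block of the para.
fn can_occupy(latest_included: u64, pending: &[Candidate], candidate: &Candidate) -> bool {
	let mut parent = candidate.parent;
	// A well-formed chain is at most as long as the pending set, so bound
	// the walk to avoid looping on malformed input.
	for _ in 0..=pending.len() {
		if parent == latest_included {
			return true;
		}
		match pending.iter().find(|c| c.hash == parent) {
			Some(c) => parent = c.parent,
			None => return false,
		}
	}
	false
}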

@bkchr bkchr changed the title from "Forward CoreInfo via an inherent to the runtime" to "Forward CoreInfo via a digest to the runtime" on Aug 4, 2025
@bkchr bkchr requested a review from skunert August 5, 2025 18:14
@bkchr (Member, Author) commented Aug 5, 2025

/cmd prdoc --audience runtime_dev --bump major

@paritytech-workflow-stopper

All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/16761436884
Failed job name: cargo-clippy

Successfully merging this pull request may close these issues.

CoreSelector wraparound causes some skipped blocks