
Checkpointing part 2: final integration#846

Draft
awariac wants to merge 18 commits into kuznia-rdzeni:master from awariac:piotro/checkpointing-integration

Conversation

@awariac (Member) commented Oct 20, 2025

Checkpointing is finally here!!! 🎉 (it took a bit of work and debugging)

Integrates the rollback-on-branch-misprediction flow and the use of instruction tags into the core.
Async interrupts and other exceptions are still handled the old way, by flushing.

TODO:
- fix the remaining unit tests
- verify Linux boot
- verify benchmarks
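For readers unfamiliar with the technique: the rollback flow restores a saved rename-table snapshot when a tagged branch resolves as mispredicted, instead of flushing the whole pipeline. The sketch below is a toy Python behavioral model; all class, register, and tag names are illustrative and do not come from the actual core.

```python
# Toy behavioral model of checkpoint-based rollback on branch
# misprediction. Names are illustrative, not from the core's sources.
import copy

class CheckpointManager:
    def __init__(self):
        # Speculative map from architectural to physical registers.
        self.rat = {f"x{i}": f"p{i}" for i in range(4)}
        self.checkpoints = {}  # branch tag -> saved RAT snapshot
        self.next_tag = 0

    def take_checkpoint(self):
        """On dispatching a branch, snapshot the speculative RAT."""
        tag = self.next_tag
        self.next_tag += 1
        self.checkpoints[tag] = copy.deepcopy(self.rat)
        return tag

    def rename(self, arch_reg, phys_reg):
        """Speculatively rename an instruction past the branch."""
        self.rat[arch_reg] = phys_reg

    def rollback(self, tag):
        """On misprediction, restore the RAT instead of a full flush."""
        self.rat = self.checkpoints.pop(tag)

    def free(self, tag):
        """On correct prediction, discard the checkpoint."""
        self.checkpoints.pop(tag)

cm = CheckpointManager()
tag = cm.take_checkpoint()   # branch dispatched
cm.rename("x1", "p9")        # wrong-path rename
cm.rollback(tag)             # branch resolved as mispredicted
assert cm.rat["x1"] == "p1"  # mapping restored without flushing
```

The point of the tag is that only state younger than the mispredicted branch is discarded; older in-flight instructions keep executing.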

@awariac awariac added enhancement New feature or request performance Improves performance benchmark Benchmarks should be run for this change microarch Involves the processor's microarchitecture labels Oct 20, 2025
@awariac awariac marked this pull request as draft October 20, 2025 20:07
@awariac awariac added benchmark Benchmarks should be run for this change and removed benchmark Benchmarks should be run for this change labels Oct 20, 2025
@github-actions (Bot)

Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
|---|---|---|---|---|---|---|---|
| 0.495 | 0.605 | 0.47 | 0.659 | 0.4 | 0.388 | 0.4 | 0.501 |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation (ECP5): LUTs | LUTs used as DFF | LUTs used as carry | LUTs used as RAM | Max clock frequency (Fmax) |
|---|---|---|---|---|
| 17252 | 5004 | 1574 | 1796 | 47 |

Synthesis benchmarks (full)

| Device utilisation (ECP5): LUTs | LUTs used as DFF | LUTs used as carry | LUTs used as RAM | Max clock frequency (Fmax) |
|---|---|---|---|---|
| 31793 | 8199 | 2058 | 2324 | 42 |

@awariac (Member, Author) commented Oct 24, 2025

yoooo

@tilk (Member) commented Oct 27, 2025

Nice improvement!

@awariac (Member, Author) commented Nov 5, 2025

> Nice improvement!

@tilk yeah, it is, but I expected it to be bigger (hope that's not wishful thinking). Did you have any expectations for it?

@tilk (Member) commented Nov 6, 2025

> yeah, it is, but I expected it to be bigger (hope that's not wishful thinking). Did you have any expectations for it?

Also hoped for a bigger difference, but it looks like there are other factors at play. A few of the most obvious to me are:

- Mispredictions still have a cost (refilling the pipeline, keeping the functional units busy with unneeded operations).
- The core cannot execute dependent instructions cycle after cycle, as there is no forwarding from instruction results to the RS (and full forwarding would probably be costly to implement). Superscalarity would help discover independent instructions faster.
- According to the metrics, quite a lot of instructions seem to linger in the RS for some time. There could be multiple reasons for that: maybe results are not coming in fast enough, or maybe the other end of the pipeline is the limiting factor. I would consider adding more metrics to the RS to help understand this behavior.
- Lack of superscalarity means that if multiple instructions in different RSes have their operands ready in the same cycle, they still have to complete in sequence.
- The LSU blocks reordering of instructions.

It's hard to judge how much each of these factors weighs on final performance. There is probably no single biggest reason; I'm starting to believe that OoO performance is complex and comes from a combination of many factors.
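The second point (no result-to-RS forwarding) can be illustrated with a toy cycle count. This is purely illustrative: the latencies are assumed, not measured on the core, and `wakeup_delay` is a made-up parameter standing for the extra writeback-before-issue gap.

```python
# Toy cycle count for a chain of dependent single-cycle ops,
# contrasting result forwarding into the reservation stations with a
# design where a result must be written back before dependents issue.
# Latencies are illustrative assumptions, not measurements.

def chain_cycles(n_ops: int, wakeup_delay: int) -> int:
    """Each op takes 1 cycle to execute; its dependent can issue only
    `wakeup_delay` cycles after the result is produced."""
    cycle = 0
    for _ in range(n_ops):
        cycle += 1 + wakeup_delay
    return cycle

with_forwarding = chain_cycles(8, wakeup_delay=0)     # back-to-back: 8 cycles
without_forwarding = chain_cycles(8, wakeup_delay=1)  # one dead cycle each: 16
```

Even a single extra wakeup cycle halves throughput on a fully serial dependence chain, which is consistent with the "dependent instructions cannot execute cycle after cycle" observation above.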

@awariac (Member, Author) commented Nov 7, 2025

> verify linux boot

and it doesn't boot, so more obscure bugs await :/

@awariac (Member, Author) commented Nov 8, 2025

#451 would be useful for finding some manageable test cases

@tilk (Member) commented Nov 8, 2025

Given that the benchmarks and other full-core tests work fine, the issue is most probably related to M-mode and interrupt handling. Indeed, RISCV-DV claims to cover this.

@github-actions (Bot) commented Mar 2, 2026

Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
|---|---|---|---|---|---|---|---|
| ▲ 0.517 (+0.085) | ▲ 0.596 (+0.042) | ▲ 0.477 (+0.115) | ▲ 0.672 (+0.018) | ▲ 0.406 (+0.044) | ▲ 0.397 (+0.095) | ▲ 0.404 (+0.072) | ▲ 0.505 (+0.067) |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation (ECP5): LUTs | LUTs used as DFF | LUTs used as carry | LUTs used as RAM | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▲ 16659 (+1333) | ▲ 4989 (+761) | ▲ 1570 (+120) | ▲ 1796 (+244) | ▼ 48 (-2) |

Synthesis benchmarks (full)

| Device utilisation (ECP5): LUTs | LUTs used as DFF | LUTs used as carry | LUTs used as RAM | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▲ 33134 (+1387) | ▲ 8459 (+770) | ▲ 2086 (+152) | ▲ 2420 (+244) | ▼ 40 (-1) |

@tilk tilk added the nlnet The work is part of the NLnet grant label Apr 10, 2026
A comment by awariac was marked as outdated.

@awariac (Member, Author) commented May 4, 2026

Finally merged in superscalarity, the new frontend, and other changes.
The core and checkpointing tests stopped passing; needs further investigation.

@awariac awariac force-pushed the piotro/checkpointing-integration branch from b1c7a58 to be81f6d Compare May 4, 2026 15:19
@github-actions (Bot) commented May 4, 2026

Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
|---|---|---|---|---|---|---|---|
| ▲ 0.503 (+0.072) | ▲ 0.605 (+0.051) | ▲ 0.492 (+0.110) | ▲ 0.670 (+0.017) | ▲ 0.402 (+0.041) | ▲ 0.385 (+0.080) | ▲ 0.393 (+0.062) | ▲ 0.504 (+0.054) |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation (ECP5): LUTs | LUTs used as DFF | LUTs used as carry | LUTs used as RAM | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▲ 18368 (+392) | ▲ 5565 (+765) | ▲ 1490 (+140) | ▲ 1612 (+240) | ▼ 41 (-6) |

Synthesis benchmarks (full)

| Device utilisation (ECP5): LUTs | LUTs used as DFF | LUTs used as carry | LUTs used as RAM | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▲ 50392 (+1855) | ▲ 10948 (+764) | ▲ 3668 (+112) | ▲ 2860 (+224) | ▼ 24 (-11) |

@github-actions (Bot) commented May 4, 2026

Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
|---|---|---|---|---|---|---|---|
| ▲ 0.503 (+0.072) | ▲ 0.605 (+0.051) | ▲ 0.492 (+0.110) | ▲ 0.670 (+0.017) | ▲ 0.402 (+0.041) | ▲ 0.385 (+0.080) | ▲ 0.393 (+0.062) | ▲ 0.504 (+0.054) |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation (ECP5): LUTs | LUTs used as DFF | LUTs used as carry | LUTs used as RAM | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▲ 18559 (+583) | ▲ 5565 (+765) | ▲ 1490 (+140) | ▲ 1612 (+240) | ▼ 44 (-3) |

Synthesis benchmarks (full)

| Device utilisation (ECP5): LUTs | LUTs used as DFF | LUTs used as carry | LUTs used as RAM | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▲ 49882 (+1345) | ▲ 10948 (+764) | ▲ 3668 (+112) | ▲ 2860 (+224) | ▼ 27 (-9) |

@github-actions (Bot) commented May 4, 2026

Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
|---|---|---|---|---|---|---|---|
| ▲ 0.503 (+0.072) | ▲ 0.605 (+0.051) | ▲ 0.492 (+0.110) | ▲ 0.670 (+0.017) | ▲ 0.402 (+0.041) | ▲ 0.385 (+0.080) | ▲ 0.393 (+0.062) | ▲ 0.504 (+0.054) |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation (ECP5): LUTs | LUTs used as DFF | LUTs used as carry | LUTs used as RAM | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▲ 18676 (+700) | ▲ 5565 (+765) | ▲ 1458 (+108) | ▲ 1612 (+240) | ▼ 43 (-4) |

Synthesis benchmarks (full)

| Device utilisation (ECP5): LUTs | LUTs used as DFF | LUTs used as carry | LUTs used as RAM | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▲ 48875 (+338) | ▲ 10948 (+764) | ▲ 3664 (+108) | ▲ 2860 (+224) | ▼ 25 (-10) |
