Skip to content

Fix performance problems after #532#545

Merged
tilk merged 10 commits into
kuznia-rdzeni:masterfrom
awariac:piotro888/interrupt-fmax-fix
Jan 4, 2024
Merged

Fix performance problems after #532#545
tilk merged 10 commits into
kuznia-rdzeni:masterfrom
awariac:piotro888/interrupt-fmax-fix

Conversation

@awariac
Copy link
Copy Markdown
Member

@awariac awariac commented Dec 18, 2023

#532 introduced loss in Fmax and IPC.
Both problems are fixed and explained in comments.
For other details see discussion at end of comments in #532

@awariac awariac added the performance Improves performance label Dec 18, 2023
Comment thread coreblocks/structs_common/exception.py Outdated
self.layouts = gp.get(ExceptionRegisterLayouts)

# Break long combinational paths from single-cycle FUs
# Insertion of FIFO is fine, because self.report is always ready, and will delay report by
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a single FIFO for all FUs. What will happen when two FUs report an exception in the same clock cycle? Did that work previously?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the same behaviour as last time. Both FUs cannot report exception at the same time, and only one will proceed with its Transaction.
This is a performance problem, because calls to report are under Ifs so all FUs reporting instruction are mutually exclusive at some stage.
As we discussed on the meeting, it may be solved via call under condition (btw, does it fix that -- will it block parent transction if we want to call report but it is not ready??).
Optimizing to allow reporting multiple exceptions at one cycle is not worth it (we have large flush/resume penalty, so one additional blocked cycle wouldn't change too much).
Anyway it is topic for another PR.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still wondering about the cycle delays.

  • The delay from e.g. CSR to ROB change is in best case only one cycle. CSR's get_result goes through FuncBlocksUnifier, which uses a Collector, which uses a Forwarder, which finally connects to the transaction in ResultAnnouncement. The same transaction calls rob_mark_done, which updates the ROB.

  • On the other hand, the update of the ExceptionCauseRegister is two cycles from CSR's get_result: one cycle for the FIFO, and another cycle for the register update.

  • Therefore there is a possibility for Retirement to see the ROB exception bit set, but not have the right cause.

Am I wrong?

Ideally, I would like to have this latency-insensitive; in other words, Retirement should wait if the rob_id in ExceptionCauseRegister is not the rob_id of currently processed entry. One way to do this cleanly would be to add a rob_id argument to the ExceptionCauseRegister's get method, and use validate_arguments to only accept the call when the rob_id matches the rob_id in the register.

Copy link
Copy Markdown
Member Author

@awariac awariac Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you are right, thanks! I missed the sync delay in ExceptionCauseRegister.

Retirement solution seems nice.
Now I think about it, we don't even need to do that! We just need to check a valid bit of ExceptionCauseRegister. We only care about the first trap. And if it is processed, ExceptionCauseRegister is in the clear() state!
I will look into validate_arguments, I didn't used it before. Hope it is possible to do that (we need exception bit from rob_peek and exception cause register result to calculate it)

At the same time I'm preparing Retirement refactor (and will make PR for it in a minute), I think it should be merged first (but lets see what the reception will be)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I think about it, we don't even need to do that! We just need to check a valid bit of ExceptionCauseRegister. We only care about the first trap.

The first trap in wall clock time might come from an instruction which was not the first in instruction order. So the valid bit will be 1, but the rob_id will be wrong, and we get the wrong cause. Am I missing something?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, my bad

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is still one problem - ExceptionCauseRegister allows reporting multiple exceptions for one instruction, and decides if the cause should be updated. In this case rob_id would match, but returned cause would be outdated.

Looking at current usage of ExceptionCauseRegister, this functionality may be not needed at all. All exceptions in FUs are reported in one place only (when setting exception bit), so priority is selected locally. If any exception happens before FU it is also reported in ExceptionFU - priority also needs to be manually determined in pipeline.

I think we can safely add limitation to support reporting only one exception per rob_id.
ECR will be simplified and I don't know if the addition of separation FIFO for Fmax will be necessary. Should it stay? It probably be too much noticeable, because JB mispedictions report exceptions one cycle earlier that your example.

lekcyjna123
lekcyjna123 previously approved these changes Dec 18, 2023
@awariac awariac changed the title Fix preformace problems after #532 Fix performance problems after #532 Dec 18, 2023
@lekcyjna123 lekcyjna123 self-requested a review December 20, 2023 19:53
@awariac awariac marked this pull request as draft December 21, 2023 10:55
@tilk tilk mentioned this pull request Dec 21, 2023
@awariac awariac force-pushed the piotro888/interrupt-fmax-fix branch from a73ca05 to 3387b08 Compare December 23, 2023 14:12
Comment on lines 143 to 147
with condition(m, priority=False) as cond:
with cond(commit):
retire_instr(rob_entry)
with cond(~commit): # Not using default, because we want to block if condition is not ready
with cond(~commit):
flush_instr(rob_entry)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition in retirement can be eliminated with merge of #551, by using retire_instr and flush_instr instead of setting commit flag. Non-blocking is guaranteed by condition that is needed on higher level for rob.exception (and makes more sense)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember right, condition was put here not for aesthetics, but to avoid locking FRAT's rename method.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be removed because it would be separated by existing condition of (~)rob_entry.exception.

No longer valid, because that exception condition caused problems with 2* LUT usage.
I fixed it by using separate Transaction instead of validate_args. get method is not blocking so excpetion condition can be removed and commit condition stays the same.

@awariac awariac marked this pull request as ready for review December 23, 2023 14:16
@awariac
Copy link
Copy Markdown
Member Author

awariac commented Dec 23, 2023

For some reason Fmax is now 40 MHz with critical path not related to changes here :/

EDIT: LUTS increased *2. Am I doing something wrong, or there is a bug in those functionalities?

@awariac awariac marked this pull request as draft December 23, 2023 14:39
@awariac awariac force-pushed the piotro888/interrupt-fmax-fix branch from 3387b08 to 70a8619 Compare December 23, 2023 14:44
@awariac awariac force-pushed the piotro888/interrupt-fmax-fix branch from 4736793 to 96400a6 Compare January 2, 2024 14:31
@awariac awariac marked this pull request as ready for review January 2, 2024 15:24
@awariac
Copy link
Copy Markdown
Member Author

awariac commented Jan 2, 2024

Fixed, Fmax is 54 MHz, LUT usage back to normal.
It was caused by rob_entry.exception condition. Maybe it needs investigation, I don't see a reason why it doubled the LUT usage. Still happened at this commit after removal of validate_args.
For more change details refer to review comments

@tilk tilk dismissed lekcyjna123’s stale review January 2, 2024 15:37

Stale review

@lekcyjna123
Copy link
Copy Markdown
Contributor

lekcyjna123 commented Jan 2, 2024

It is interesting. I checkouted 96400a6 generated verilog by ./scripts/gen_verilog.py -c basic and tried to synthesise it with Quartus, but I got syntax error. After some manual tweaks code get accepted. On the newest commit there is no more syntax error reported by Quartus.

EDIT: Btw. it looks like there is some regression in detecting amaranth memory by Quartus. On this PR there is 0 bits of memory found by Quratus. On my master thesis it was more than 100k bits.

Info (276007): RAM logic "top.main_module:main_module|top.main_module.c:c|top.main_module.c.icache:icache|top.main_module.c.icache.mem:mem|data_mem_0_rp" is uninferred due to asynchronous read logic

@tilk tilk merged commit 9b5a156 into kuznia-rdzeni:master Jan 4, 2024
github-actions Bot pushed a commit that referenced this pull request Jan 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Improves performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants