Fix performance problems after #532 #545
| self.layouts = gp.get(ExceptionRegisterLayouts) | ||
| # Break long combinational paths from single-cycle FUs | ||
| # Insertion of FIFO is fine, because self.report is always ready, and will delay report by |
This is a single FIFO for all FUs. What will happen when two FUs report an exception in the same clock cycle? Did that work previously?
It is the same behaviour as before. Two FUs cannot report an exception in the same cycle; only one will proceed with its Transaction.
This is a performance problem: calls to report are under Ifs, so all reporting FUs are mutually exclusive at some stage.
As we discussed at the meeting, it may be solved by calling under a condition (btw, does that fix it? Will it block the parent transaction if we want to call report but it is not ready?).
Optimizing to allow reporting multiple exceptions in one cycle is not worth it (we have a large flush/resume penalty, so one additional blocked cycle wouldn't change much).
Anyway, it is a topic for another PR.
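To illustrate the "only one will proceed" behaviour, here is a rough plain-Python model, not Transactron code. The function name, the FU labels and the first-come arbitration order are assumptions made for the sketch; the real scheduler may pick a different winner.

```python
from collections import deque


def simulate_shared_report(requests_per_cycle):
    """Model a single report method that accepts one caller per cycle.

    requests_per_cycle[i] lists the (hypothetical) FU names that start
    wanting to report in cycle i. Returns (cycle, fu) pairs in the order
    the reports are accepted.
    """
    pending = deque()  # FUs still waiting for their turn
    accepted = []
    cycle = 0
    while cycle < len(requests_per_cycle) or pending:
        if cycle < len(requests_per_cycle):
            pending.extend(requests_per_cycle[cycle])
        if pending:
            # Only one caller proceeds per cycle; the rest stay blocked.
            accepted.append((cycle, pending.popleft()))
        cycle += 1
    return accepted


# Two FUs raise an exception in the same cycle: the second one waits a cycle.
print(simulate_shared_report([["alu", "csr"]]))  # [(0, 'alu'), (1, 'csr')]
```

The one extra blocked cycle in this model is the cost that is small compared to the flush/resume penalty.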
I'm still wondering about the cycle delays.
- The delay from e.g. CSR to ROB change is in the best case only one cycle. CSR's `get_result` goes through `FuncBlocksUnifier`, which uses a `Collector`, which uses a `Forwarder`, which finally connects to the transaction in `ResultAnnouncement`. The same transaction calls `rob_mark_done`, which updates the ROB.
- On the other hand, the update of the `ExceptionCauseRegister` is two cycles from CSR's `get_result`: one cycle for the FIFO, and another cycle for the register update.
- Therefore there is a possibility for `Retirement` to see the ROB exception bit set, but not have the right cause.
Am I wrong?
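A back-of-the-envelope model of the two paths, in plain Python rather than the actual design. The constants simply restate the best-case delays described above, and `retirement_view` is a hypothetical helper for illustration.

```python
ROB_PATH_DELAY = 1  # get_result -> ResultAnnouncement -> rob_mark_done
ECR_PATH_DELAY = 2  # one cycle for the FIFO + one cycle for the register update


def retirement_view(cycle, result_cycle=0):
    """Return (exception_bit_visible, cause_visible) as seen in `cycle`,
    assuming the FU result appeared in `result_cycle`."""
    exception_bit = cycle >= result_cycle + ROB_PATH_DELAY
    cause_visible = cycle >= result_cycle + ECR_PATH_DELAY
    return exception_bit, cause_visible


# Cycles in which the exception bit is set but the cause is not there yet.
hazard = [c for c in range(4) if retirement_view(c) == (True, False)]
print(hazard)  # [1]
```

Under these assumptions there is exactly one cycle in which Retirement could read a stale cause.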
Ideally, I would like to have this latency-insensitive; in other words, Retirement should wait if the rob_id in ExceptionCauseRegister is not the rob_id of currently processed entry. One way to do this cleanly would be to add a rob_id argument to the ExceptionCauseRegister's get method, and use validate_arguments to only accept the call when the rob_id matches the rob_id in the register.
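A minimal behavioural sketch of that idea in plain Python. `CauseRegisterModel` and its fields are made up for illustration; this is not the real `ExceptionCauseRegister`, and it does not use the Transactron `validate_arguments` API. Returning `None` stands in for "the call is not accepted, so the caller waits".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CauseRegisterModel:
    """Hypothetical stand-in for the cause register."""

    valid: bool = False
    rob_id: int = 0
    cause: int = 0

    def report(self, rob_id: int, cause: int) -> None:
        self.valid, self.rob_id, self.cause = True, rob_id, cause

    def get(self, rob_id: int) -> Optional[int]:
        # Accept the call only when the stored entry belongs to the
        # instruction being retired; otherwise signal "not ready".
        if self.valid and self.rob_id == rob_id:
            return self.cause
        return None


ecr = CauseRegisterModel()
print(ecr.get(3))   # None: nothing reported yet, Retirement waits
ecr.report(rob_id=3, cause=11)
print(ecr.get(5))   # None: valid, but for a different rob_id
print(ecr.get(3))   # 11
```

With this shape, the number of cycles between the ROB update and the register update no longer matters to the caller.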
Oh, you are right, thanks! I missed the sync delay in ExceptionCauseRegister.
Retirement solution seems nice.
Now I think about it, we don't even need to do that! We just need to check a valid bit of ExceptionCauseRegister. We only care about the first trap. And if it is processed, ExceptionCauseRegister is in the clear() state!
I will look into validate_arguments; I haven't used it before. I hope it is possible to do that (we need the exception bit from rob_peek and the exception cause register result to calculate it).
At the same time I'm preparing a Retirement refactor (and will make a PR for it in a minute). I think it should be merged first (but let's see what the reception will be).
> Now I think about it, we don't even need to do that! We just need to check a valid bit of ExceptionCauseRegister. We only care about the first trap.
The first trap in wall clock time might come from an instruction which was not the first in instruction order. So the valid bit will be 1, but the rob_id will be wrong, and we get the wrong cause. Am I missing something?
There is still one problem - ExceptionCauseRegister allows reporting multiple exceptions for one instruction, and decides if the cause should be updated. In this case rob_id would match, but returned cause would be outdated.
Looking at the current usage of ExceptionCauseRegister, this functionality may not be needed at all. All exceptions in FUs are reported in one place only (when setting the exception bit), so priority is selected locally. If any exception happens before the FU, it is also reported in ExceptionFU; priority also needs to be determined manually in the pipeline.
I think we can safely add a limitation to support reporting only one exception per rob_id.
ECR will be simplified, and I don't know if adding the separating FIFO for Fmax will still be necessary. Should it stay? It probably won't be too noticeable, because JB mispredictions report exceptions one cycle earlier than in your example.
a73ca05 to 3387b08
| with condition(m, priority=False) as cond: | ||
| with cond(commit): | ||
| retire_instr(rob_entry) | ||
| with cond(~commit): # Not using default, because we want to block if condition is not ready | ||
| with cond(~commit): | ||
| flush_instr(rob_entry) |
This condition in retirement can be eliminated once #551 is merged, by using retire_instr and flush_instr instead of setting the commit flag. Non-blocking is guaranteed by the condition that is needed at a higher level for rob.exception (and that makes more sense).
If I remember right, the condition was put here not for aesthetics, but to avoid locking FRAT's rename method.
It could be removed because it would be separated by the existing condition on (~)rob_entry.exception.
No longer valid, because that exception condition caused problems: LUT usage doubled.
I fixed it by using a separate Transaction instead of validate_args. The get method is not blocking, so the exception condition can be removed and the commit condition stays the same.
For some reason Fmax is now 40 MHz, with a critical path not related to the changes here :/ EDIT: LUT usage doubled. Am I doing something wrong, or is there a bug in those functionalities?
3387b08 to 70a8619
4736793 to 96400a6
Fixed, Fmax is 54 MHz, LUT usage back to normal.
It is interesting. I checked out 96400a6 generated verilog by EDIT: Btw. it looks like there is some regression in Quartus detecting Amaranth memories. On this PR there is
#532 introduced a loss in Fmax and IPC.
Both problems are fixed and explained in the comments.
For other details, see the discussion at the end of the comments in #532.