Lower instruction decoding and dispatch overhead #88
Known issue: lower performance than comparable emulators when tested with the same benchmark.
TODO: Use
I used perf to find the hot spots of rv32emu:

```
        │ if (!emulate(rv, block->ir + i))
   0.02 │   lea (%rax,%rax,2),%rdx
- 10.16 │   mov 0x18(%r14),%rax
   0.26 │   lea (%rax,%rdx,4),%rbx
        │ emulate():
        │ rv->compressed = (ir->insn_len == INSN_16);
   0.26 │   movzbl 0x9(%rbx),%esi
   0.09 │   cmp $0x2,%sil
-  9.74 │   sete 0x1cc(%rbp)
        │ switch (ir->opcode) {
        │   cmpb $0x7a,0x7(%rbx)
   0.10 │ ↓ ja 1a40
-  9.02 │   movzbl 0x7(%rbx),%eax
   0.33 │   movslq (%r12,%rax,4),%rax
   0.26 │   add %r12,%rax
-  9.40 │   notrack jmpq *%rax
...
        │ rv->PC += ir->insn_len;
   0.02 │ 170: add %r10d,%esi
   0.13 │   mov %esi,0xd0(%rbp)
        │ block_emulate():
        │ rv->csr_cycle++;
   0.01 │ 179: mov 0x198(%rbp),%rax
        │ for (uint32_t i = 0; i < block->n_insn; i++) {
-  8.18 │   add $0x1,%r13d
        │ rv->csr_cycle++;
   0.01 │   add $0x1,%rax
   0.88 │   mov %rax,0x198(%rbp)
        │ for (uint32_t i = 0; i < block->n_insn; i++) {
        │   cmp (%r14),%r13d
-  8.36 │ ↑ jb 100
        │ for (auto &i : block.insn) {
        │   mov 0x18(%rsi),%rbx
   0.04 │   cmp %r13,%rbx
        │ ↓ je ed
        │   mov %rdi,%rbp
   2.11 │   lea std::piecewise_construct+0x478,%r12
        │   nop
        │ // enforce zero register
        │ rv->X[rv_reg_zero] = 0;
        │ 30: movl $0x0,0x50(%rbp)
        │ switch (i.opcode) {
   0.04 │   cmpb $0x5b,0x8(%rbx)
        │ emulate():
   2.37 │ ↓ ja f01
- 14.38 │   movzbl 0x8(%rbx),%eax
   0.06 │   movslq (%r12,%rax,4),%rax
   0.20 │   add %r12,%rax
- 15.71 │   notrack jmpq *%rax
```

According to the perf result, I assume the following code may cause rv32emu's slowness:
1. `rv->compressed = (ir->insn_len == INSN_16);` -> `sete`
2. `rv->csr_cycle++;` -> `mov`
3. `emulate(rv, block->ir + i)`
4. `for (uint32_t i = 0; i < block->n_insn; i++)`
First, I tried removing the following code, just to check its influence on the emulator:

```c
rv->compressed = (ir->insn_len == INSN_16);
rv->csr_cycle++;
```

Executing Dhrystone (1.1-mc): 10000000 passes, 4425698 microseconds, 1283 DMIPS. The performance improved from 1114 to 1283 DMIPS.

For typical RISC-V programs, it is unlikely to switch between RV32I and RV32C frequently. Therefore, we do not have to check for RV32C in every cycle; such information can be carried at the instruction decoding stage.
Second, I rewrote `emulate_block`:

```diff
 void emulate_block(riscv_t *rv, block_t &block)
 {
-    for (auto &i : block.insn) {
+    for (uint32_t i = 0; i < block.instructions; i++) {
         // enforce zero register
         rv->X[rv_reg_zero] = 0;
         // emulate an instruction
-        emulate(*rv, i);
+        emulate(*rv, block.insn[i]);
     }
 }
```

Executing Dhrystone (1.1-mc): 10000000 passes, 4232523 microseconds, 1341 DMIPS. The performance dropped from 1631 to 1341 DMIPS. Using perf to inspect the difference of `emulate(*rv, block.insn[i]);`:
```
        │   mov %ebp,%eax
        │ _ZNSt6vectorI9rv_insn_tSaIS0_EEixEm():
        │ */
        │ reference
        │ operator[](size_type __n) _GLIBCXX_NOEXCEPT
        │ {
        │     __glibcxx_requires_subscript(__n);
        │     return *(this->_M_impl._M_start + __n);
        │   lea (%rax,%rax,2),%rdx
- 11.53 │   mov 0x18(%r13),%rax
   1.37 │   lea (%rax,%rdx,4),%rbx
        │ _Z13emulate_blockP7riscv_tR7block_t():
        │ switch (i.opcode) {
   0.07 │   cmpb $0x5b,0x8(%rbx)
        │ emulate():
   1.44 │ ↓ ja fe3
```

We can see that the new implementation increases the instruction-fetch overhead: `std::vector::operator[]` recomputes the element address, reloading the base pointer on every iteration.
To prevent pointless movement of data, we can combine the
I think we can pass the instruction-length value in the IR to the exception handler rather than verifying it each time `emulate` is called, because the value is only needed when an exception is raised. #92 is the proposed change for review.
#93 has since been merged into `master`.
Commit 285a988 provides a minor performance increase that reflects the above.
#95 is an ongoing experiment which attempts to lower IR instruction dispatching overhead. @Risheng1128, please measure and compare computed-goto vs. TCO.
I implemented the computed-goto version in "Use computed-goto for efficient dispatch" and tested it against "Use TCO of C compiler to speed up emulation" (CPU: i5-12500). I think computed-goto may perform slightly better than TCO.
Because computed-goto can only decrease the overhead of instruction dispatching within a single block rather than across the entire program, the speedup would not be immediately apparent. If neither gcc nor clang is present, the classic switch-case is employed as a fallback.

Close sysprog21#88
Co-authored-by: Jim Huang <[email protected]>
Signed-off-by: Jim Huang <[email protected]>
The `wip/instruction-decode` branch breaks RISC-V instruction decoding and emulation into separate stages, meaning that it is feasible to incorporate further IR optimizations and JIT code generation, as the `wip/jit` branch does. However, we do need additional effort to make it practical. All of the above should appear in the `wip/instruction-decode` branch before its merge into the `master` branch.