Skip to content

Lower instruction decoding and dispatch overhead #88

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jserv opened this issue Nov 7, 2022 · 11 comments
Closed

Lower instruction decoding and dispatch overhead #88

jserv opened this issue Nov 7, 2022 · 11 comments
Assignees

Comments

@jserv
Copy link
Contributor

jserv commented Nov 7, 2022

wip/instruction-decode branch breaks RISC-V instruction decoding and emulation into separate stage, meaning that it is feasible to incorporate further IR optimizations and JIT code generation. However, we do need additional efforts to make it practical:

  1. Executing RISC-V instructions by compiling the program a basic block at a time, thus avoiding unnecessary translation;
  2. Implementing an efficient way to look in a hash map for a code block matching the current program counter as wip/jit does;
  3. Reducing IR dispatch cost means of computed-goto or tail-call elimination (as wasm3 does).

All of the above should appear in wip/instruction-decode branch before its merge into master branch.

@Risheng1128
Copy link
Collaborator

Risheng1128 commented Nov 28, 2022

Known issue: Lower performance than refactor-rv32

When we test both of them using dhrystone.elf, it is discovered that the performance of the wip/instruction-decode branch is worse than refactor-rv32.

  • refactor-rv32 : 1631 DMIPS
  • rv32emu - wip/instruction-decode : 1138 DMIPS

TODO: Use perf to determine the root cause of rv32emu's slowness

@Risheng1128
Copy link
Collaborator

Risheng1128 commented Dec 5, 2022

I use perf to find the hot spot of rv32emu and refactor-rv32 respectively. Both of them execute dhrystone.elf.

rv32emu:

        │      if (!emulate(rv, block->ir + i))                                                                                                                                                            
   0.02 │        lea       (%rax,%rax,2),%rdx                                                                                                                                                              
- 10.16 │        mov       0x18(%r14),%rax                                                                                                                                                                 
   0.26 │        lea       (%rax,%rdx,4),%rbx                                                                                                                                                              
        │      emulate():                                                                                                                                                                                  
        │      rv->compressed = (ir->insn_len == INSN_16);                                                                                                                                                 
   0.26 │        movzbl    0x9(%rbx),%esi                                                                                                                                                                  
   0.09 │        cmp       $0x2,%sil                                                                                                                                                                       
-  9.74 │        sete      0x1cc(%rbp)                                                                                                                                                                     
        │      switch (ir->opcode) {                                                                                                                                                                       
        │        cmpb      $0x7a,0x7(%rbx)                                                                                                                                                                 
   0.10 │      ↓ ja        1a40                                                                                                                                                                            
-  9.02 │        movzbl    0x7(%rbx),%eax                                                                                                                                                                  
   0.33 │        movslq    (%r12,%rax,4),%rax                                                                                                                                                              
   0.26 │        add       %r12,%rax                                                                                                                                                                       
-  9.40 │        notrack   jmpq *%rax                                                                                                                                                                      
   ...                                                                                                                                                                                       ▒
        │      rv->PC += ir->insn_len;                                                                                                                                                                    
   0.02 │ 170:   add       %r10d,%esi                                                                                                                                                                      
   0.13 │        mov       %esi,0xd0(%rbp)                                                                                                                                                                 
        │      block_emulate():                                                                                                                                                                            
        │      rv->csr_cycle++;                                                                                                                                                                            
   0.01 │ 179:   mov       0x198(%rbp),%rax                                                                                                                                                                
        │      for (uint32_t i = 0; i < block->n_insn; i++) {                                                                                                                                              
-  8.18 │        add       $0x1,%r13d                                                                                                                                                                      
        │      rv->csr_cycle++;                                                                                                                                                                            
   0.01 │        add       $0x1,%rax                                                                                                                                                                       
   0.88 │        mov       %rax,0x198(%rbp)                                                                                                                                                                
        │      for (uint32_t i = 0; i < block->n_insn; i++) {                                                                                                                                              
        │        cmp       (%r14),%r13d                                                                                                                                                                    
-  8.36 │      ↑ jb        100 

refactor-rv32:

        │     for (auto &i : block.insn) {                                                                                                                                                                 
        │       mov       0x18(%rsi),%rbx                                                                                                                                                                  
   0.04 │       cmp       %r13,%rbx                                                                                                                                                                        
        │     ↓ je        ed                                                                                                                                                                               
        │       mov       %rdi,%rbp                                                                                                                                                                        
   2.11 │       lea       std::piecewise_construct+0x478,%r12                                                                                                                                              
        │       nop                                                                                                                                                                                        
        │     // enforce zero regiser                                                                                                                                                                      
        │     rv->X[rv_reg_zero] = 0;                                                                                                                                                                      
        │ 30:   movl      $0x0,0x50(%rbp)                                                                                                                                                                  
        │     switch (i.opcode) {                                                                                                                                                                          
   0.04 │       cmpb      $0x5b,0x8(%rbx)                                                                                                                                                                  
        │     emulate():                                                                                                                                                                                   
   2.37 │     ↓ ja        f01                                                                                                                                                                              
- 14.38 │       movzbl    0x8(%rbx),%eax                                                                                                                                                                   
   0.06 │       movslq    (%r12,%rax,4),%rax                                                                                                                                                               
   0.20 │       add       %r12,%rax                                                                                                                                                                        
- 15.71 │       notrack   jmpq *%rax           

According to the perf result, I assume the following codes may cause the rv32emu's slowness.

1. rv->compressed = (ir->insn_len == INSN_16); -> sete
2. rv->csr_cycle++; -> mov
3. emulate(rv, block->ir + i)
4. for (uint32_t i = 0; i < block->n_insn; i++)

@Risheng1128
Copy link
Collaborator

Risheng1128 commented Dec 5, 2022

First, I try to remove the following code. (Just for checking its influences on emulator.)

rv->compressed = (ir->insn_len == INSN_16);
rv->csr_cycle++;

And execute dhrystone.elf again.

Dhrystone(1.1-mc), 10000000 passes, 4425698 microseconds, 1283 DMIPS

The performance from 1114 to 1283 DMIPS.

@jserv
Copy link
Contributor Author

jserv commented Dec 5, 2022

First, I try to remove the following code. (Just to check its influences on emulator.)

rv->compressed = (ir->insn_len == INSN_16);
rv->csr_cycle++;

The performance from 1114 to 1283 DMIPS.

For typical RISC-V programs, it is unlikely to switch between RV32I and RV32C frequently. Therefore, we don't have to check RV32C in every cycle. Such information can be carried at the instruction decoding stage.

@Risheng1128
Copy link
Collaborator

Risheng1128 commented Dec 5, 2022

Second, I rewrite the refactor-rv32 as following like rv32emu. (To check the influence of loop.)

void emulate_block(riscv_t *rv, block_t &block)
{
- for (auto &i : block.insn) {
+ for (uint32_t i = 0; i < block.instructions; i++) {
        // enforce zero regiser
        rv->X[rv_reg_zero] = 0;
        // emulate an instruction
-       emulate(*rv, i);
+       emulate(*rv, block.insn[i]);
  }
}

And execute dhrystone.elf

Dhrystone(1.1-mc), 10000000 passes, 4232523 microseconds, 1341 DMIPS

The performance from 1631 to 1341 DMIPS.

Use perf to find the difference of refactor-rv32.

emulate(*rv, block.insn[i]);
       │        mov       %ebp,%eax
       │      _ZNSt6vectorI9rv_insn_tSaIS0_EEixEm():
       │      */
       │      reference
       │      operator[](size_type __n) _GLIBCXX_NOEXCEPT
       │      {
       │      __glibcxx_requires_subscript(__n);
       │      return *(this->_M_impl._M_start + __n);
       │        lea       (%rax,%rax,2),%rdx
-11.53 │        mov       0x18(%r13),%rax
  1.37 │        lea       (%rax,%rdx,4),%rbx
bjdump:│Warnin_Z13emulate_blockP7riscv_tR7block_t():
       │      switch (i.opcode) {
  0.07 │        cmpb      $0x5b,0x8(%rbx)
       │      emulate():
  1.44 │      ↓ ja        fe3

We can find the new implementation increases the instruction mov 0x18(%r13),%rax, and can find the same one in rv32emu

@jserv
Copy link
Contributor Author

jserv commented Dec 5, 2022

We can find the new implementation increases the instruction mov 0x18(%r13),%rax, and can find the same one in rv32emu

To prevent pointless movement of data, we can combine the emulate function body with emulate_block.

@qwe661234
Copy link
Collaborator

qwe661234 commented Dec 7, 2022

First, I try to remove the following code. (Just to check its influences on emulator.)

rv->compressed = (ir->insn_len == INSN_16);
rv->csr_cycle++;

The performance from 1114 to 1283 DMIPS.

For typical RISC-V programs, it is unlikely to switch between RV32I and RV32C frequently. Therefore, we don't have to check RV32C in every cycle. Such information can be carried at the instruction decoding stage.

I think we can pass the value of instruction length in IR to the exception handler rather than verifying it each time the emulate function is called, because we only need the value of rv->compressed in the exception handler.

#92 is the proposed change for reviewing.

@jserv
Copy link
Contributor Author

jserv commented Dec 11, 2022

Since #93 is merged into master branch, let's concentrate on on master branch now.

@jserv
Copy link
Contributor Author

jserv commented Dec 12, 2022

Second, I rewrite the refactor-rv32 as following like rv32emu. (To check the influence of loop.)
[...]
Use perf to find the difference of refactor-rv32.
[...]
We can find the new implementation increases the instruction mov 0x18(%r13),%rax, and can find the same one in rv32emu

commit 285a988 provides a minor increase in performance that reflects the above.

@jserv
Copy link
Contributor Author

jserv commented Dec 13, 2022

#95 is an on-going experiment which attempts to lower IR instruction dispatching overheads. @Risheng1128, please measure and compare computed-goto vs. TCO.
If TCO does matter, we can then safely remove computed-goto specific build rules.

@Risheng1128
Copy link
Collaborator

Risheng1128 commented Dec 14, 2022

I implement computed-goto version in Use computed-goto for efficient dispatch, and test it and Use TCO of C compiler to speed up emulation with coremark.elf and dhrystone.elf respectively. The following is the result:

cpu: i5-12500
compiler: gcc - 9.4.0

case coremark dhrystone
TCO 1132.824525 iter/sec 1195 DMIPS
computed-goto 1121.039724 iter/sec 1286 DMIPS

However, the computed-goto result with dhrystone.elf fluctuates wildly. It is usually about 1200 ~ 1300, and TCO is usually about 1180 ~ 1200.

I think computed-goto may have a slightly efficient performance than TCO.

@jserv jserv closed this as completed in 1b0391a Dec 16, 2022
vestata pushed a commit to vestata/rv32emu that referenced this issue Jan 24, 2025
Because computed-goto can only decrease the overhead of instruction
dispatching in a single block instead of the entire program scope,
the speedup would not be immediately apparent.

If neither gcc nor clang are present, classic switch-case is employed
as a fallback.

Close sysprog21#88

Co-authored-by: Jim Huang <[email protected]>
Signed-off-by: Jim Huang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants