Lower instruction decoding and dispatch overhead #88

jserv · 2022-11-07T19:53:24Z

wip/instruction-decode branch breaks RISC-V instruction decoding and emulation into separate stage, meaning that it is feasible to incorporate further IR optimizations and JIT code generation. However, we do need additional efforts to make it practical:

Executing RISC-V instructions by compiling the program a basic block at a time, thus avoiding unnecessary translation;
Implementing an efficient way to look in a hash map for a code block matching the current program counter as wip/jit does;
Reducing IR dispatch cost means of computed-goto or tail-call elimination (as wasm3 does).

All of the above should appear in wip/instruction-decode branch before its merge into master branch.

The text was updated successfully, but these errors were encountered:

Risheng1128 · 2022-11-28T20:00:05Z

Known issue: Lower performance than refactor-rv32

When we test both of them using dhrystone.elf, it is discovered that the performance of the wip/instruction-decode branch is worse than refactor-rv32.

refactor-rv32 : 1631 DMIPS
rv32emu - wip/instruction-decode : 1138 DMIPS

TODO: Use perf to determine the root cause of rv32emu's slowness

Risheng1128 · 2022-12-05T10:37:12Z

I use perf to find the hot spot of rv32emu and refactor-rv32 respectively. Both of them execute dhrystone.elf.

rv32emu:

        │      if (!emulate(rv, block->ir + i))                                                                                                                                                            
   0.02 │        lea       (%rax,%rax,2),%rdx                                                                                                                                                              
- 10.16 │        mov       0x18(%r14),%rax                                                                                                                                                                 
   0.26 │        lea       (%rax,%rdx,4),%rbx                                                                                                                                                              
        │      emulate():                                                                                                                                                                                  
        │      rv->compressed = (ir->insn_len == INSN_16);                                                                                                                                                 
   0.26 │        movzbl    0x9(%rbx),%esi                                                                                                                                                                  
   0.09 │        cmp       $0x2,%sil                                                                                                                                                                       
-  9.74 │        sete      0x1cc(%rbp)                                                                                                                                                                     
        │      switch (ir->opcode) {                                                                                                                                                                       
        │        cmpb      $0x7a,0x7(%rbx)                                                                                                                                                                 
   0.10 │      ↓ ja        1a40                                                                                                                                                                            
-  9.02 │        movzbl    0x7(%rbx),%eax                                                                                                                                                                  
   0.33 │        movslq    (%r12,%rax,4),%rax                                                                                                                                                              
   0.26 │        add       %r12,%rax                                                                                                                                                                       
-  9.40 │        notrack   jmpq *%rax                                                                                                                                                                      
   ...                                                                                                                                                                                       ▒
        │      rv->PC += ir->insn_len;                                                                                                                                                                    
   0.02 │ 170:   add       %r10d,%esi                                                                                                                                                                      
   0.13 │        mov       %esi,0xd0(%rbp)                                                                                                                                                                 
        │      block_emulate():                                                                                                                                                                            
        │      rv->csr_cycle++;                                                                                                                                                                            
   0.01 │ 179:   mov       0x198(%rbp),%rax                                                                                                                                                                
        │      for (uint32_t i = 0; i < block->n_insn; i++) {                                                                                                                                              
-  8.18 │        add       $0x1,%r13d                                                                                                                                                                      
        │      rv->csr_cycle++;                                                                                                                                                                            
   0.01 │        add       $0x1,%rax                                                                                                                                                                       
   0.88 │        mov       %rax,0x198(%rbp)                                                                                                                                                                
        │      for (uint32_t i = 0; i < block->n_insn; i++) {                                                                                                                                              
        │        cmp       (%r14),%r13d                                                                                                                                                                    
-  8.36 │      ↑ jb        100

refactor-rv32:

        │     for (auto &i : block.insn) {                                                                                                                                                                 
        │       mov       0x18(%rsi),%rbx                                                                                                                                                                  
   0.04 │       cmp       %r13,%rbx                                                                                                                                                                        
        │     ↓ je        ed                                                                                                                                                                               
        │       mov       %rdi,%rbp                                                                                                                                                                        
   2.11 │       lea       std::piecewise_construct+0x478,%r12                                                                                                                                              
        │       nop                                                                                                                                                                                        
        │     // enforce zero regiser                                                                                                                                                                      
        │     rv->X[rv_reg_zero] = 0;                                                                                                                                                                      
        │ 30:   movl      $0x0,0x50(%rbp)                                                                                                                                                                  
        │     switch (i.opcode) {                                                                                                                                                                          
   0.04 │       cmpb      $0x5b,0x8(%rbx)                                                                                                                                                                  
        │     emulate():                                                                                                                                                                                   
   2.37 │     ↓ ja        f01                                                                                                                                                                              
- 14.38 │       movzbl    0x8(%rbx),%eax                                                                                                                                                                   
   0.06 │       movslq    (%r12,%rax,4),%rax                                                                                                                                                               
   0.20 │       add       %r12,%rax                                                                                                                                                                        
- 15.71 │       notrack   jmpq *%rax

According to the perf result, I assume the following codes may cause the rv32emu's slowness.

1. rv->compressed = (ir->insn_len == INSN_16); -> sete
2. rv->csr_cycle++; -> mov
3. emulate(rv, block->ir + i)
4. for (uint32_t i = 0; i < block->n_insn; i++)

Risheng1128 · 2022-12-05T10:47:43Z

First, I try to remove the following code. (Just for checking its influences on emulator.)

rv->compressed = (ir->insn_len == INSN_16);
rv->csr_cycle++;

And execute dhrystone.elf again.

Dhrystone(1.1-mc), 10000000 passes, 4425698 microseconds, 1283 DMIPS

The performance from 1114 to 1283 DMIPS.

jserv · 2022-12-05T11:35:15Z

First, I try to remove the following code. (Just to check its influences on emulator.)
rv->compressed = (ir->insn_len == INSN_16);
rv->csr_cycle++;
The performance from 1114 to 1283 DMIPS.

For typical RISC-V programs, it is unlikely to switch between RV32I and RV32C frequently. Therefore, we don't have to check RV32C in every cycle. Such information can be carried at the instruction decoding stage.

Risheng1128 · 2022-12-05T11:39:42Z

Second, I rewrite the refactor-rv32 as following like rv32emu. (To check the influence of loop.)

void emulate_block(riscv_t *rv, block_t &block)
{
- for (auto &i : block.insn) {
+ for (uint32_t i = 0; i < block.instructions; i++) {
        // enforce zero regiser
        rv->X[rv_reg_zero] = 0;
        // emulate an instruction
-       emulate(*rv, i);
+       emulate(*rv, block.insn[i]);
  }
}

And execute dhrystone.elf

Dhrystone(1.1-mc), 10000000 passes, 4232523 microseconds, 1341 DMIPS

The performance from 1631 to 1341 DMIPS.

Use perf to find the difference of refactor-rv32.

emulate(*rv, block.insn[i]);
       │        mov       %ebp,%eax
       │      _ZNSt6vectorI9rv_insn_tSaIS0_EEixEm():
       │      */
       │      reference
       │      operator[](size_type __n) _GLIBCXX_NOEXCEPT
       │      {
       │      __glibcxx_requires_subscript(__n);
       │      return *(this->_M_impl._M_start + __n);
       │        lea       (%rax,%rax,2),%rdx
-11.53 │        mov       0x18(%r13),%rax
  1.37 │        lea       (%rax,%rdx,4),%rbx
bjdump:│Warnin_Z13emulate_blockP7riscv_tR7block_t():
       │      switch (i.opcode) {
  0.07 │        cmpb      $0x5b,0x8(%rbx)
       │      emulate():
  1.44 │      ↓ ja        fe3

We can find the new implementation increases the instruction mov 0x18(%r13),%rax, and can find the same one in rv32emu

jserv · 2022-12-05T17:04:31Z

We can find the new implementation increases the instruction mov 0x18(%r13),%rax, and can find the same one in rv32emu

To prevent pointless movement of data, we can combine the emulate function body with emulate_block.

qwe661234 · 2022-12-07T09:01:07Z

First, I try to remove the following code. (Just to check its influences on emulator.)
rv->compressed = (ir->insn_len == INSN_16);
rv->csr_cycle++;
The performance from 1114 to 1283 DMIPS.
For typical RISC-V programs, it is unlikely to switch between RV32I and RV32C frequently. Therefore, we don't have to check RV32C in every cycle. Such information can be carried at the instruction decoding stage.

I think we can pass the value of instruction length in IR to the exception handler rather than verifying it each time the emulate function is called, because we only need the value of rv->compressed in the exception handler.

#92 is the proposed change for reviewing.

jserv · 2022-12-11T15:41:37Z

Since #93 is merged into master branch, let's concentrate on on master branch now.

jserv · 2022-12-12T01:11:56Z

Second, I rewrite the refactor-rv32 as following like rv32emu. (To check the influence of loop.)
[...]
Use perf to find the difference of refactor-rv32.
[...]
We can find the new implementation increases the instruction mov 0x18(%r13),%rax, and can find the same one in rv32emu

commit 285a988 provides a minor increase in performance that reflects the above.

jserv · 2022-12-13T13:19:55Z

#95 is an on-going experiment which attempts to lower IR instruction dispatching overheads. @Risheng1128, please measure and compare computed-goto vs. TCO.
If TCO does matter, we can then safely remove computed-goto specific build rules.

Risheng1128 · 2022-12-14T21:17:16Z

I implement computed-goto version in Use computed-goto for efficient dispatch, and test it and Use TCO of C compiler to speed up emulation with coremark.elf and dhrystone.elf respectively. The following is the result:

cpu: i5-12500
compiler: gcc - 9.4.0

case	coremark	dhrystone
TCO	1132.824525 iter/sec	1195 DMIPS
computed-goto	1121.039724 iter/sec	1286 DMIPS

However, the computed-goto result with dhrystone.elf fluctuates wildly. It is usually about 1200 ~ 1300, and TCO is usually about 1180 ~ 1200.

I think computed-goto may have a slightly efficient performance than TCO.

Because computed-goto can only decrease the overhead of instruction dispatching in a single block instead of the entire program scope, the speedup would not be immediately apparent. If neither gcc nor clang are present, classic switch-case is employed as a fallback. Close sysprog21#88 Co-authored-by: Jim Huang <[email protected]> Signed-off-by: Jim Huang <[email protected]>

jserv assigned Risheng1128 Nov 7, 2022

Risheng1128 mentioned this issue Nov 13, 2022

Introduce basic block #91

Merged

jserv assigned qwe661234 Dec 5, 2022

jserv closed this as completed in 1b0391a Dec 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lower instruction decoding and dispatch overhead #88

Lower instruction decoding and dispatch overhead #88

jserv commented Nov 7, 2022 •

edited

Loading

Risheng1128 commented Nov 28, 2022 •

edited

Loading

Risheng1128 commented Dec 5, 2022 •

edited

Loading

Risheng1128 commented Dec 5, 2022 •

edited

Loading

jserv commented Dec 5, 2022

Risheng1128 commented Dec 5, 2022 •

edited

Loading

jserv commented Dec 5, 2022

qwe661234 commented Dec 7, 2022 •

edited by jserv

Loading

jserv commented Dec 11, 2022

jserv commented Dec 12, 2022

jserv commented Dec 13, 2022

Risheng1128 commented Dec 14, 2022 •

edited

Loading

Lower instruction decoding and dispatch overhead #88

Lower instruction decoding and dispatch overhead #88

Comments

jserv commented Nov 7, 2022 • edited Loading

Risheng1128 commented Nov 28, 2022 • edited Loading

Risheng1128 commented Dec 5, 2022 • edited Loading

Risheng1128 commented Dec 5, 2022 • edited Loading

jserv commented Dec 5, 2022

Risheng1128 commented Dec 5, 2022 • edited Loading

jserv commented Dec 5, 2022

qwe661234 commented Dec 7, 2022 • edited by jserv Loading

jserv commented Dec 11, 2022

jserv commented Dec 12, 2022

jserv commented Dec 13, 2022

Risheng1128 commented Dec 14, 2022 • edited Loading

jserv commented Nov 7, 2022 •

edited

Loading

Risheng1128 commented Nov 28, 2022 •

edited

Loading

Risheng1128 commented Dec 5, 2022 •

edited

Loading

Risheng1128 commented Dec 5, 2022 •

edited

Loading

Risheng1128 commented Dec 5, 2022 •

edited

Loading

qwe661234 commented Dec 7, 2022 •

edited by jserv

Loading

Risheng1128 commented Dec 14, 2022 •

edited

Loading