Introduce preliminary macro operation fusion #132

Merged
merged 1 commit into from
May 29, 2023

Collaborator

@qwe661234 qwe661234 commented May 22, 2023

We have observed recurring patterns in RISC-V instruction sequences. By converting these specific patterns into faster, equivalent code, we can improve execution efficiency; the measured gains are listed below.

In our current analysis, we focus on a commonly used benchmark and have found the following frequently occurring instruction patterns: auipc + addi, auipc + add, multiple sw, and multiple lw.

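| Metric    | commit fba5802             | macro fuse operation       | Speedup |
|-----------|----------------------------|----------------------------|---------|
| CoreMark  | 1351.065 (Iterations/Sec)  | 1352.843 (Iterations/Sec)  | +0.13%  |
| dhrystone | 1073 DMIPS                 | 1146 DMIPS                 | +6.8%   |
| nqueens   | 8295 msec                  | 7824 msec                  | +6.0%   |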
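
As a rough illustration of the auipc + addi pattern, the sketch below fuses the pair at decode time and executes it with a single dispatch. The `struct insn` layout, the opcode names, and both helper functions are hypothetical simplifications for this example and are not rv32emu's actual data structures or code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified decoded-instruction record (not rv32emu's). */
enum opcode { OP_AUIPC, OP_ADDI, OP_FUSED_AUIPC_ADDI, OP_OTHER };

struct insn {
    enum opcode op;
    uint8_t rd, rs1;
    int32_t imm;  /* auipc: upper immediate, already shifted left by 12 */
    int32_t imm2; /* fused form only: the addi immediate */
};

/* Rewrite insns[i] in place when insns[i] and insns[i + 1] form
 *     auipc rd, imm
 *     addi  rd, rd, imm2
 * Returns 1 if the pair was fused, 0 otherwise. The second slot is left
 * untouched; the executor must skip it after running the fused op. */
static int try_fuse_auipc_addi(struct insn *insns, int n, int i)
{
    assert(i >= 0 && i < n);
    if (i + 1 >= n)
        return 0;
    struct insn *a = &insns[i];
    const struct insn *b = &insns[i + 1];
    if (a->op != OP_AUIPC || b->op != OP_ADDI)
        return 0;
    /* Conservative: require the same destination and a true dependency. */
    if (a->rd == 0 || b->rd != a->rd || b->rs1 != a->rd)
        return 0;
    a->op = OP_FUSED_AUIPC_ADDI;
    a->imm2 = b->imm;
    return 1;
}

/* Execute the fused pair: one register write, then skip both
 * 4-byte instructions. Returns the next PC. */
static uint32_t exec_fused_auipc_addi(uint32_t x[32],
                                      uint32_t pc,
                                      const struct insn *ir)
{
    x[ir->rd] = pc + (uint32_t) ir->imm + (uint32_t) ir->imm2;
    return pc + 8;
}
```

The saving comes from dispatching once instead of twice and from avoiding the intermediate register write; the arithmetic itself is unchanged.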

@qwe661234 qwe661234 force-pushed the Add_fuse_operation branch from bb1b72f to 3ae3059 Compare May 22, 2023 06:22
Contributor

jserv commented May 22, 2023

> To enhance execution efficiency, we employ instruction fusion by combining sequences that adhere to specific patterns into fused instructions. Currently, we have incorporated four fused instructions: auipc + addi, auipc + add, multiple sw, and multiple lw.

Please show some numbers to illustrate how we benefit from macro operation fusion.
In addition, why were these 4 patterns picked? Demonstrate them with existing benchmark programs.

@jserv jserv changed the title Add fuse instruction Introduce macro operation fusion May 22, 2023
@qwe661234 qwe661234 force-pushed the Add_fuse_operation branch from 3ae3059 to b66f4dc Compare May 27, 2023 08:23
@qwe661234 qwe661234 requested a review from jserv May 27, 2023 08:27
@qwe661234 qwe661234 force-pushed the Add_fuse_operation branch from b66f4dc to 3f84ce5 Compare May 27, 2023 11:26
@qwe661234 qwe661234 requested a review from jserv May 27, 2023 11:26
@qwe661234 qwe661234 force-pushed the Add_fuse_operation branch from 3f84ce5 to 1f9cbea Compare May 27, 2023 11:32
@@ -1219,6 +1220,60 @@ RVOP(cswsp, {
})
#endif

/* auipc + addi */

Contributor

@jserv jserv May 27, 2023

Is it possible to handle the sequence lui + addi as well?
See #81 (comment)

Disassembly of CoreMark:

   10324:       000087b7                lui     a5,0x8
   10328:       b0578793                addi    a5,a5,-1275 # 0x7b05
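
To make the disassembly above concrete: a fused lui + addi would load a constant in one step. The helper below is a hypothetical illustration (not part of rv32emu) that computes the value the pair leaves in the destination register, assuming `imm20` is the raw 20-bit upper immediate and `imm12` is the sign-extended addi immediate.

```c
#include <assert.h>
#include <stdint.h>

/* Value left in rd by:
 *     lui  rd, imm20
 *     addi rd, rd, imm12
 * Arithmetic wraps modulo 2^32, as on RV32. */
static uint32_t lui_addi_value(uint32_t imm20, int32_t imm12)
{
    assert(imm20 <= 0xFFFFF);
    assert(imm12 >= -2048 && imm12 <= 2047);
    return (imm20 << 12) + (uint32_t) imm12;
}
```

For the CoreMark pair shown, `lui_addi_value(0x8, -1275)` gives 0x8000 - 1275 = 0x7b05, which matches the value annotated in the disassembly.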

Collaborator Author

It is possible. However, importing this pattern causes problems when running qrcode.elf, so I have skipped it in this pull request.

Contributor

> It is possible. However, importing this pattern causes problems when running qrcode.elf, so I have skipped it in this pull request.

Add a comment starting with "FIXME: lui + addi"

rv->PC += ir->insn_len * (ir->imm2 - 1);
})

/* multiple lw */

Contributor

lw is the most frequent instruction (see #34), and we might dive into its use case more.

Contributor

Can you handle the following case? (disassembly from CoreMark)

   10248:       03012603                lw      a2,48(sp)
   1024c:       01c11583                lh      a1,28(sp)
   10250:       03412503                lw      a0,52(sp)

Contributor

In addition, consider the following scenario:

   10a84:       01c12083                lw      ra,28(sp)
   10a88:       07f47513                andi    a0,s0,127
   10a8c:       01812403                lw      s0,24(sp)
   10a90:       01412483                lw      s1,20(sp)
   10a94:       01012903                lw      s2,16(sp)
   10a98:       00c12983                lw      s3,12(sp)

It can be regarded as 5 lw. Roughly speaking, if peephole optimization can be applied, we shall benefit from further optimizations.

Contributor

Another case: (disassembly from CoreMark)

   10c08:       01162023                sw      a7,0(a2)
   10c0c:       00052783                lw      a5,0(a0)
   10c10:       00059883                lh      a7,0(a1)
   10c14:       00259603                lh      a2,2(a1)
   10c18:       00f82023                sw      a5,0(a6)
   10c1c:       01052023                sw      a6,0(a0)
   10c20:       00e82223                sw      a4,4(a6)
   10c24:       0006a783                lw      a5,0(a3)

Mixture of sw and lw.

Collaborator Author

> Another case: (disassembly from CoreMark)
>
>    10c08:       01162023                sw      a7,0(a2)
>    10c0c:       00052783                lw      a5,0(a0)
>    10c10:       00059883                lh      a7,0(a1)
>    10c14:       00259603                lh      a2,2(a1)
>    10c18:       00f82023                sw      a5,0(a6)
>    10c1c:       01052023                sw      a6,0(a0)
>    10c20:       00e82223                sw      a4,4(a6)
>    10c24:       0006a783                lw      a5,0(a3)
>
> Mixture of sw and lw.

In this case, the memory addresses are not contiguous. All we can do is pack these instructions together; we cannot save any work, such as the misalignment checks.
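
To illustrate what is saved when the accesses do share a base register (and what is lost when they do not), here is a minimal sketch of a fused run of lw instructions. It assumes a little-endian host, a flat byte-array memory, and a hypothetical `struct mem_op`; none of these names come from rv32emu. If every offset is a multiple of 4, a single alignment check on the base address covers the whole run.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record for one lw in a fused run: lw rd, imm(base). */
struct mem_op {
    uint8_t rd;
    int32_t imm;
};

/* Decided at fusion time: the run may share one alignment check only if
 * every offset is word-aligned and no load overwrites the base register. */
static int can_share_alignment_check(uint8_t base,
                                     const struct mem_op *ops,
                                     int count)
{
    for (int i = 0; i < count; i++) {
        if (ops[i].imm % 4 != 0 || ops[i].rd == base)
            return 0;
    }
    return count > 0;
}

/* Execute the fused run. Returns 0 on success, -1 on a misaligned or
 * out-of-range access (a real emulator would raise a trap instead). */
static int exec_fused_lw(uint32_t x[32],
                         const uint8_t *mem,
                         uint32_t mem_size,
                         uint8_t base,
                         const struct mem_op *ops,
                         int count)
{
    uint32_t addr0 = x[base];
    if (addr0 & 3) /* the single misalignment check for the whole run */
        return -1;
    for (int i = 0; i < count; i++) {
        uint32_t addr = addr0 + (uint32_t) ops[i].imm;
        if (mem_size < 4 || addr > mem_size - 4)
            return -1;
        uint32_t value;
        memcpy(&value, mem + addr, sizeof(value));
        if (ops[i].rd != 0) /* writes to x0 are discarded */
            x[ops[i].rd] = value;
    }
    return 0;
}
```

In the CoreMark sequence quoted above, the base registers differ (a2, a0, a1, a6, a3), so `can_share_alignment_check` has nothing to share and each access would still need its own check.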

Collaborator Author

> In addition, consider the following scenario:
>
>    10a84:       01c12083                lw      ra,28(sp)
>    10a88:       07f47513                andi    a0,s0,127
>    10a8c:       01812403                lw      s0,24(sp)
>    10a90:       01412483                lw      s1,20(sp)
>    10a94:       01012903                lw      s2,16(sp)
>    10a98:       00c12983                lw      s3,12(sp)
>
> It can be regarded as 5 lw. Roughly speaking, if peephole optimization can be applied, we shall benefit from further optimizations.

In this case, we can pack the last four lw instructions. To handle it by packing all 5 lw, we would need to reorder the instructions, for example by swapping the first and the second instruction.
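
Whether two adjacent instructions may be swapped depends on their register dependencies. The sketch below is a hypothetical check (the `struct reg_use` summary is not an rv32emu type) that rejects a swap on any read-after-write, write-after-read, or write-after-write hazard. It deliberately ignores memory ordering and trap ordering, which a real reordering pass would also have to consider.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical summary of the registers one instruction touches.
 * Register 0 (x0) doubles as "unused", since x0 never carries a value. */
struct reg_use {
    uint8_t rd, rs1, rs2;
};

/* Returns 1 if instruction a followed by b can be executed as b then a
 * without changing any register result. */
static int can_swap(struct reg_use a, struct reg_use b)
{
    assert(a.rd < 32 && b.rd < 32);
    /* RAW: b reads a register that a writes. */
    if (a.rd != 0 && (b.rs1 == a.rd || b.rs2 == a.rd))
        return 0;
    /* WAR: b writes a register that a reads. */
    if (b.rd != 0 && (a.rs1 == b.rd || a.rs2 == b.rd))
        return 0;
    /* WAW: both write the same register. */
    if (a.rd != 0 && a.rd == b.rd)
        return 0;
    return 1;
}
```

With ABI numbering (ra = x1, sp = x2, s0 = x8, a0 = x10), `lw ra,28(sp)` and `andi a0,s0,127` pass this check, so moving the andi first would leave five consecutive lw instructions. By contrast, `andi a0,s0,127` cannot move below `lw s0,24(sp)`, because the load overwrites s0.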

Collaborator Author

> Can you handle the following case? (disassembly from CoreMark)
>
>    10248:       03012603                lw      a2,48(sp)
>    1024c:       01c11583                lh      a1,28(sp)
>    10250:       03412503                lw      a0,52(sp)

Ditto. To handle this case, we need a strategy for reordering the instructions.

Contributor

In this pull request, let's concentrate on preliminary support of macro operation fusion. You shall add some comments for further efforts such as instruction reordering.

@jserv jserv changed the title Introduce macro operation fusion Introduce preliminary macro operation fusion May 28, 2023
@qwe661234 qwe661234 force-pushed the Add_fuse_operation branch from a7b8455 to 8933804 Compare May 29, 2023 08:44
@qwe661234 qwe661234 force-pushed the Add_fuse_operation branch from 8933804 to fc9c3b8 Compare May 29, 2023 08:59
@qwe661234 qwe661234 force-pushed the Add_fuse_operation branch from fc9c3b8 to 56b14b8 Compare May 29, 2023 08:59
@qwe661234 qwe661234 requested a review from jserv May 29, 2023 09:01
@qwe661234 qwe661234 force-pushed the Add_fuse_operation branch from 56b14b8 to 743110f Compare May 29, 2023 09:17
Contributor

@jserv jserv left a comment

Add some FIXME/TODO comments that note further macro operation fusion opportunities worth pursuing.

@qwe661234 qwe661234 force-pushed the Add_fuse_operation branch from 743110f to 9636542 Compare May 29, 2023 09:32
case rv_insn_lw:
COMBINE_MEM_OPS(1);
break;
/* FIXME: lui + addi */

Check notice

Code scanning / CodeQL

FIXME comment

FIXME comment: lui + addi
@qwe661234 qwe661234 requested a review from jserv May 29, 2023 09:58
@qwe661234 qwe661234 force-pushed the Add_fuse_operation branch from 9636542 to 18213bc Compare May 29, 2023 15:33
@qwe661234
Collaborator Author

Check CI failure.

In debug mode, rv_step emulates only one instruction per step: it executes only the first instruction of a basic block and then translates the next basic block at PC + 4. Applying macro operation fusion in debug mode can therefore cause errors. For instance, suppose auipc and addi are fused and executed together. Because the emulator only advances past the first instruction, the instruction at PC + 4 is still the original addi rather than a nop. That addi then executes a second time, so the net effect becomes auipc + addi + addi.

Therefore, we cannot apply macro operation fusion in debug mode.
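
The failure can be reproduced with a toy model. The sketch below is a simplified stand-in for single-stepping, not rv32emu's rv_step: every name in it is hypothetical, and it assumes each step runs exactly one decoded slot and then advances the PC by 4.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified decoded-instruction record. */
enum opcode { OP_AUIPC, OP_ADDI, OP_FUSED_AUIPC_ADDI };

struct insn {
    enum opcode op;
    uint8_t rd, rs1;
    int32_t imm;  /* auipc: upper immediate, already shifted left by 12 */
    int32_t imm2; /* fused form only: the addi immediate */
};

/* Execute a single decoded slot located at address pc. */
static void exec_one(uint32_t x[32], uint32_t pc, const struct insn *ir)
{
    assert(ir->rd < 32 && ir->rs1 < 32);
    switch (ir->op) {
    case OP_AUIPC:
        x[ir->rd] = pc + (uint32_t) ir->imm;
        break;
    case OP_ADDI:
        x[ir->rd] = x[ir->rs1] + (uint32_t) ir->imm;
        break;
    case OP_FUSED_AUIPC_ADDI:
        x[ir->rd] = pc + (uint32_t) ir->imm + (uint32_t) ir->imm2;
        break;
    }
}

/* Debug-style stepping: one slot per step, PC always advances by 4,
 * so the slot following a fused instruction is NOT skipped. */
static uint32_t single_step_run(uint32_t x[32],
                                const struct insn *prog,
                                int n)
{
    uint32_t pc = 0;
    for (int i = 0; i < n; i++) {
        exec_one(x, pc, &prog[i]);
        pc += 4;
    }
    return pc;
}
```

Running `auipc a0, 0x1` followed by `addi a0, a0, 8` from PC 0 yields a0 = 0x1008 in this model. If slot 0 is replaced by the fused form while slot 1 still holds the addi, the same stepping loop yields 0x1010, because the addi is applied twice.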

@qwe661234 qwe661234 requested a review from jserv May 29, 2023 15:45
@jserv jserv merged commit 5fb9d8b into sysprog21:master May 29, 2023
vestata pushed a commit to vestata/rv32emu that referenced this pull request Jan 24, 2025
Introduce preliminary macro operation fusion