c.mulh instruction #208

gsmecher · 2023-01-16T17:28:40Z

A c.mulh instruction to mirror c.mul in zcb would allow the following to be expressed in purely compressed instructions:

32-bit DSP (dot product/convolution/correlation)
wider multiplies (64 bits and upwards)

There is room in the instruction encodings directly below c.mul (see Table 16-6, RISC-V Unprivileged ISA) - there is a natural spot for c.mulh and it seems a shame not to take advantage of the symmetry here.

Related: #9. The MAC proposal was (reasonably) excluded as out-of-scope here. A c.mulh bookend to c.mul would partly address the same need but fits better within the mandate of this WG.

gsmecher · 2023-01-16T17:51:02Z

Here's a little additional colour: I develop the minimax RISC-V implementation (https://github.com/gsmecher/minimax), which directly executes only 16-bit instructions and emulates 32-bit instructions in microcode. Because of this architecture, there is an ~100x performance difference between compressed and uncompressed instructions. The ZC* extensions are critical because more code in the 16-bit space is a massive performance improvement for minimax.

More specifically: Minimax is currently useless for DSP because all multiplies must be emulated. A c.mul instruction brings high-performance 16-bit convolutions within reach. A c.mulh instruction would make performant 32-bit DSP possible in this core. This would open the core up to an entirely new type of application.

I understand that a ~100x performance differential in a weird implementation is a Me problem, not a You problem. I also understand it's late in the cycle for a new instruction. However: there is a perfect encoding candidate right below c.mul (see Table 16.6, RISC-V Unprivileged ISA). Since c.mul implies M, which mandates a 32x32 -> 64-bit multiplier anyways, the implementation cost is basically the multiplexer to select between low and high words of the 64-bit product. From this perspective: is the lack of c.mulh a deliberate decision, or just something nobody has requested yet?

Again, thanks for all your efforts - I am excited about this WG's work, with or without c.mulh.

jnk0le · 2023-01-16T18:39:48Z

The c.mulh would be usually paired with c.mul and in this case, there needs to be an explicit move due to destructive operand.
Therefore c.mulh doesn't offer enough size reduction

gsmecher · 2023-01-16T18:54:36Z

@jnk0le - that is an excellent point. A synthetic c.mulh using Zcb instructions would be pretty short, and it looks like Zcb has all the right primitives already. Thank you - closing.

gsmecher · 2023-01-20T21:47:50Z

On reflection, @jnk0le, I closed this issue too soon. Your comment is predicated on paired c.mul/c.mulh, which is true for wider multiplies. It's not true for DSP algorithms (FFTs, FIRs, correlations, etc.). Because these are not general-purpose algorithms, they don't show up on general-purpose benchmarks.

I am, unfortunately, flatly wrong about the "obvious" encoding space for c.mulh. The slot below c.mulh in Table 16-6 is already consumed in Zc* and it does not look like there's an easy place to put c.mulh even if it were justified.

tovine · 2023-01-20T21:51:07Z

If you have any suggested benchmarks (preferably open source but IIRC ELF files is all you need) then I guess it should be possible/easy to add support for analyzing the impact of a c.mulh encoding to the hca script in this repository.

gsmecher · 2023-01-20T22:55:03Z

Thanks, @tovine: for DSP applications, ARM's CMSIS DSP library 0 seems like a good reference point. CMSIS is Apache licensed and should be directly portable to RISC-V (see e.g. 1).

The following is an example convolution kernel adapted from 2. The loop is unrolled by a factor of 4, but here's the compact version first:

#include <stdint.h>

typedef int32_t q31_t;
typedef int64_t q63_t;

int conv(int k, const q31_t * px, const q31_t * py) {
    q63_t sum;
    while(k > 0u) {
      sum = (q31_t) ((((q63_t) sum << 32) +
                      ((q63_t) * px++ * (*py--))) >> 32);

      k--;
    }
    return sum;
}

I get:

$ clang --target=riscv32 -march=rv32gc -Oz -c -o conv.o conv.c
$ riscv64-unknown-elf-objdump -d conv.o
00000000 <conv>:
   0:   c911                    beqz    a0,14 <conv+0x14>
   2:   4198                    lw      a4,0(a1)
   4:   421c                    lw      a5,0(a2)
   6:   0591                    addi    a1,a1,4
   8:   1671                    addi    a2,a2,-4
   a:   02e79733                mulh    a4,a5,a4
   e:   96ba                    add     a3,a3,a4
  10:   157d                    addi    a0,a0,-1
  12:   f965                    bnez    a0,2 <conv+0x2>
  14:   8536                    mv      a0,a3
  16:   8082                    ret

There are 8 instructions in the non-unrolled loop, of which mulh is the only uncompressed one. Pretty tough to make a case that c.mulh would improve code density here. I am not sure if there's a strong microarchitectural argument, and/or whether that's relevant to this WG anyway.

If I use the 4-way unrolled version of the kernel instead, I get something the following:

$ clang --target=riscv32 -march=rv32gc -Oz -c -o conv4.o conv4.c
$ riscv64-unknown-elf-objdump -d conv4.o
00000000 <conv>:
   0:   587d                    li      a6,-1
   2:   c529                    beqz    a0,4c <conv+0x4c>
   4:   419c                    lw      a5,0(a1)
   6:   4218                    lw      a4,0(a2)
   8:   02f71733                mulh    a4,a4,a5
   c:   0045a283                lw      t0,4(a1)
  10:   ffc62303                lw      t1,-4(a2)
  14:   98ba                    add     a7,a7,a4
  16:   0085a383                lw      t2,8(a1)
  1a:   ff862683                lw      a3,-8(a2)
  1e:   02531733                mulh    a4,t1,t0
  22:   0108f7b3                and     a5,a7,a6
  26:   973e                    add     a4,a4,a5
  28:   027698b3                mulh    a7,a3,t2
  2c:   45dc                    lw      a5,12(a1)
  2e:   ff462683                lw      a3,-12(a2)
  32:   01077733                and     a4,a4,a6
  36:   9746                    add     a4,a4,a7
  38:   01077733                and     a4,a4,a6
  3c:   02f696b3                mulh    a3,a3,a5
  40:   00d708b3                add     a7,a4,a3
  44:   157d                    addi    a0,a0,-1
  46:   05c1                    addi    a1,a1,16
  48:   1641                    addi    a2,a2,-16
  4a:   fd4d                    bnez    a0,4 <conv+0x4>
  4c:   8546                    mv      a0,a7
  4e:   8082                    ret

Here, there are 24 instructions in the loop (13 uncompressed, of which 4 are mulh.) There are RVC equivalents to everything except c.mulh - though it looks like it would take a different compiler or handwritten kernel to make a 16-bit c.mulh stand out.

An FFT implementation would be nearly as general-purpose as a convolution, but CMSIS doesn't have one. I'll poke around and see if I can put together something similar.

Ultimately, though, this is somewhat application-specific and I don't know if a benchmark belongs in your tree or not.

gsmecher · 2023-01-20T23:19:44Z

Here's an FFT, taken brutally and somewhat arbitrarily from cmsis-dsp.

#include <stdint.h>

typedef int32_t q31_t;
typedef int64_t q63_t;

#define multAcc_32x32_keep32_R(a, x, y) a = (q31_t) (((((q63_t) a) << 32) + ((q63_t) x * y) + 0x80000000LL ) >> 32)
#define multSub_32x32_keep32_R(a, x, y) a = (q31_t) (((((q63_t) a) << 32) - ((q63_t) x * y) + 0x80000000LL ) >> 32)
#define mult_32x32_keep32_R(a, x, y) a = (q31_t) (((q63_t) x * y + 0x80000000LL ) >> 32)
#define multAcc_32x32_keep32(a, x, y) a += (q31_t) (((q63_t) x * y) >> 32)
#define multSub_32x32_keep32(a, x, y) a -= (q31_t) (((q63_t) x * y) >> 32)
#define mult_32x32_keep32(a, x, y) a = (q31_t) (((q63_t) x * y ) >> 32)

void arm_split_rifft_q31(
        q31_t * pSrc,
        uint32_t fftLen,
  const q31_t * pATable,
  const q31_t * pBTable,
        q31_t * pDst,
        uint32_t modifier) {       
        q31_t outR, outI;
  const q31_t *pCoefA, *pCoefB;
        q31_t CoefA1, CoefA2, CoefB1;
        q31_t *pIn1 = &pSrc[0], *pIn2 = &pSrc[2 * fftLen + 1];

  pCoefA = &pATable[0];
  pCoefB = &pBTable[0];

  while (fftLen > 0U) {
     CoefA1 = *pCoefA++;
     CoefA2 = *pCoefA;

     mult_32x32_keep32_R (outR, *pIn1, CoefA1);
     mult_32x32_keep32_R (outI, *pIn1++, -CoefA2);
     multAcc_32x32_keep32_R (outR, *pIn1, CoefA2);
     multAcc_32x32_keep32_R (outI, *pIn1++, CoefA1);
     multAcc_32x32_keep32_R (outR, *pIn2, CoefA2);
     CoefB1 = *pCoefB;

     multSub_32x32_keep32_R (outI, *pIn2--, CoefB1);
     multAcc_32x32_keep32_R (outR, *pIn2, CoefB1);
     multAcc_32x32_keep32_R (outI, *pIn2--, CoefA2);

     *pDst++ = outR;
     *pDst++ = outI;

     pCoefB = pCoefB + (modifier * 2);
     pCoefA = pCoefA + (modifier * 2 - 1);

     fftLen--;
  }
}

I get the following:

$ clang --target=riscv32 -march=rv32gc -Oz -c -o cmsis_fft.o cmsis_fft.c
$ riscv64-unknown-elf-objdump -d cmsis_fft.o
00000000 <arm_split_rifft_q31>:
   0:	1141                	addi	sp,sp,-16
   2:	c622                	sw	s0,12(sp)
   4:	c426                	sw	s1,8(sp)
   6:	c24a                	sw	s2,4(sp)
   8:	c04e                	sw	s3,0(sp)
   a:	4801                	li	a6,0
   c:	58fd                	li	a7,-1
   e:	800002b7          	lui	t0,0x80000
  12:	00359313          	slli	t1,a1,0x3
  16:	932a                	add	t1,t1,a0
  18:	0311                	addi	t1,t1,4
  1a:	00379393          	slli	t2,a5,0x3
  1e:	c9e5                	beqz	a1,10e <arm_split_rifft_q31+0x10e>
  20:	01060eb3          	add	t4,a2,a6
  24:	000eae03          	lw	t3,0(t4)
  28:	411c                	lw	a5,0(a0)
  2a:	01068f33          	add	t5,a3,a6
  2e:	004eae83          	lw	t4,4(t4)
  32:	03c79fb3          	mulh	t6,a5,t3
  36:	03c78433          	mul	s0,a5,t3
  3a:	005404b3          	add	s1,s0,t0
  3e:	0084b433          	sltu	s0,s1,s0
  42:	9fa2                	add	t6,t6,s0
  44:	41d00433          	neg	s0,t4
  48:	02879933          	mulh	s2,a5,s0
  4c:	028787b3          	mul	a5,a5,s0
  50:	00578433          	add	s0,a5,t0
  54:	4144                	lw	s1,4(a0)
  56:	00f437b3          	sltu	a5,s0,a5
  5a:	993e                	add	s2,s2,a5
  5c:	011fffb3          	and	t6,t6,a7
  60:	03d499b3          	mulh	s3,s1,t4
  64:	03d48433          	mul	s0,s1,t4
  68:	005407b3          	add	a5,s0,t0
  6c:	0087b7b3          	sltu	a5,a5,s0
  70:	97ce                	add	a5,a5,s3
  72:	9fbe                	add	t6,t6,a5
  74:	01197933          	and	s2,s2,a7
  78:	03c497b3          	mulh	a5,s1,t3
  7c:	03c484b3          	mul	s1,s1,t3
  80:	00548433          	add	s0,s1,t0
  84:	009434b3          	sltu	s1,s0,s1
  88:	00032403          	lw	s0,0(t1)
  8c:	97a6                	add	a5,a5,s1
  8e:	01278e33          	add	t3,a5,s2
  92:	011fffb3          	and	t6,t6,a7
  96:	03d41933          	mulh	s2,s0,t4
  9a:	03d404b3          	mul	s1,s0,t4
  9e:	005487b3          	add	a5,s1,t0
  a2:	0097b7b3          	sltu	a5,a5,s1
  a6:	000f2483          	lw	s1,0(t5)
  aa:	97ca                	add	a5,a5,s2
  ac:	01f78f33          	add	t5,a5,t6
  b0:	011e7e33          	and	t3,t3,a7
  b4:	02849fb3          	mulh	t6,s1,s0
  b8:	02848433          	mul	s0,s1,s0
  bc:	0082b433          	sltu	s0,t0,s0
  c0:	ffc32783          	lw	a5,-4(t1)
  c4:	947e                	add	s0,s0,t6
  c6:	408e0e33          	sub	t3,t3,s0
  ca:	011f7f33          	and	t5,t5,a7
  ce:	02979fb3          	mulh	t6,a5,s1
  d2:	029784b3          	mul	s1,a5,s1
  d6:	00548433          	add	s0,s1,t0
  da:	00943433          	sltu	s0,s0,s1
  de:	008f84b3          	add	s1,t6,s0
  e2:	9f26                	add	t5,t5,s1
  e4:	011e7e33          	and	t3,t3,a7
  e8:	03d794b3          	mulh	s1,a5,t4
  ec:	03d787b3          	mul	a5,a5,t4
  f0:	00578433          	add	s0,a5,t0
  f4:	00f437b3          	sltu	a5,s0,a5
  f8:	97a6                	add	a5,a5,s1
  fa:	97f2                	add	a5,a5,t3
  fc:	01e72023          	sw	t5,0(a4)
 100:	c35c                	sw	a5,4(a4)
 102:	15fd                	addi	a1,a1,-1
 104:	981e                	add	a6,a6,t2
 106:	0521                	addi	a0,a0,8
 108:	1361                	addi	t1,t1,-8
 10a:	0721                	addi	a4,a4,8
 10c:	f991                	bnez	a1,20 <arm_split_rifft_q31+0x20>
 10e:	4432                	lw	s0,12(sp)
 110:	44a2                	lw	s1,8(sp)
 112:	4912                	lw	s2,4(sp)
 114:	4982                	lw	s3,0(sp)
 116:	0141                	addi	sp,sp,16
 118:	8082                	ret

This implementation looks like a dead end for c.mulh: most of the loop consists of 32-bit instructions (mul/mulh pairs not least among them, per @jnk0le's point above).

gsmecher · 2023-01-20T23:31:07Z

Just to be 100% clear: I think this idea is DOA for non-technical reasons (schedule, scope) and technical reasons outside the scope of all my analysis above (lack of available encoding space for c.mulh). It is probably not worth discussing the motivational benchmarks alone if it's moot.

abukharmeh · 2023-01-23T14:08:23Z

In the initial analysis phase, we generated frequency count of all instructions used in all benchmarks we have used, and before trying to fit to any compressed encoding, the saving from c.mulh were way too low to consider. Like you said this would be more applicable in DSP type application , but would need to prove some how that the code size saving per encoding bit spent is worth while, and I think that would be fairly hard.

Zc* extension is frozen at its current state, so its very unlikely that any new instructions would be added. But if you would like to make a better case for such instruction, I would encourage you to find opensource benchmarks that you think would benefit from it, and PR their compilation instructions for RISC-V and link to their source here thus we can evaluate using them when we work on the next iteration of the extension !

gsmecher · 2023-01-23T17:15:16Z

Thanks - this context is helpful, and I understand and agree with what you say. I will keep my eyes open for a justifiable benchmark that supports c.mulh.

gsmecher closed this as completed Jan 16, 2023

gsmecher reopened this Jan 20, 2023

abukharmeh added the future idea Something for a future version of the spec label Jan 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

c.mulh instruction #208

c.mulh instruction #208

gsmecher commented Jan 16, 2023 •

edited

Loading

gsmecher commented Jan 16, 2023

jnk0le commented Jan 16, 2023

gsmecher commented Jan 16, 2023

gsmecher commented Jan 20, 2023

tovine commented Jan 20, 2023

gsmecher commented Jan 20, 2023 •

edited

Loading

gsmecher commented Jan 20, 2023

gsmecher commented Jan 20, 2023

abukharmeh commented Jan 23, 2023 •

edited

Loading

gsmecher commented Jan 23, 2023

c.mulh instruction #208

c.mulh instruction #208

Comments

gsmecher commented Jan 16, 2023 • edited Loading

gsmecher commented Jan 16, 2023

jnk0le commented Jan 16, 2023

gsmecher commented Jan 16, 2023

gsmecher commented Jan 20, 2023

tovine commented Jan 20, 2023

gsmecher commented Jan 20, 2023 • edited Loading

gsmecher commented Jan 20, 2023

gsmecher commented Jan 20, 2023

abukharmeh commented Jan 23, 2023 • edited Loading

gsmecher commented Jan 23, 2023

gsmecher commented Jan 16, 2023 •

edited

Loading

gsmecher commented Jan 20, 2023 •

edited

Loading

abukharmeh commented Jan 23, 2023 •

edited

Loading