[llvm] [RISCV] Strength reduce mul by 2^M - 3/5/9 (PR #88993)

Wed Apr 17 20:39:31 PDT 2024

================
@@ -49,19 +53,24 @@ define i64 @add_mul_combine_accept_a3(i64 %x) {
 ; RV32IMB-LABEL: add_mul_combine_accept_a3:
 ; RV32IMB:       # %bb.0:
 ; RV32IMB-NEXT:    li a2, 29
-; RV32IMB-NEXT:    mul a1, a1, a2
-; RV32IMB-NEXT:    mulhu a3, a0, a2
-; RV32IMB-NEXT:    add a1, a3, a1
-; RV32IMB-NEXT:    mul a2, a0, a2
+; RV32IMB-NEXT:    mulhu a2, a0, a2
----------------
wangpc-pp wrote:

I'm not opposed to this optimization.
For high-performance cores (multiplier latency is normally 3-4 cycles; 2 is too extreme), there should be more simple ALU units than multiplier units, so this optimization should be an uplift.
For RV32 cores, which are normally single-issue with fewer units, the multiplier latency can be very small because of the low clock frequency (if you agree), so I think this optimization will be a regression there.
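
For reference, here is a minimal sketch of the arithmetic being traded against the multiply, assuming the expansion is `slli` + `shNadd` + `sub` on a 64-bit value. The helper names `shNadd` and `mul_pow2_minus` are made up for illustration and are not taken from the patch; the 29 in this test is 2^5 - 3.

```cpp
#include <cassert>
#include <cstdint>

// Models the Zba shNadd instructions: rd = (rs1 << n) + rs2, for n = 1, 2, 3.
constexpr uint64_t shNadd(uint64_t rs1, uint64_t rs2, unsigned n) {
  return (rs1 << n) + rs2;
}

// Hypothetical helper: computes x * (2^m - (2^n + 1)) without a multiply.
// n = 1, 2, 3 gives the "2^M - 3", "2^M - 5" and "2^M - 9" cases.
// Unsigned arithmetic wraps modulo 2^64, which matches the low 64 bits of mul.
// Assumes m < 64.
constexpr uint64_t mul_pow2_minus(uint64_t x, unsigned m, unsigned n) {
  return (x << m) - shNadd(x, x, n);  // slli; shNadd; sub
}

static_assert(mul_pow2_minus(7, 5, 1) == 7 * 29, "29 = 2^5 - 3");
static_assert(mul_pow2_minus(7, 5, 2) == 7 * 27, "27 = 2^5 - 5");
static_assert(mul_pow2_minus(7, 5, 3) == 7 * 23, "23 = 2^5 - 9");

// Runtime spot check that the wrap-around case also agrees with a multiply.
inline void check_wraparound() {
  const uint64_t x = UINT64_MAX;
  assert(mul_pow2_minus(x, 5, 1) == x * 29);
}
```

That is three simple ALU ops in place of one `mul`, so whether it wins depends on the multiplier latency and on how many ALU ports are free.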

> First, I agree that we are likely to want a per processor cost based threshold here.
> For comparison, gcc appears to be defaulting to an expansion budget of at least four instructions for rv64gc. (example: https://godbolt.org/z/zzvqfha8f)

Yeah, I agree too. Actually, this is what GCC does: the cost of mul is 2 for the generic OoO tuning, 1 when optimizing for size, and 4 for rocket and the others (https://godbolt.org/z/8nvdbo5s1)
https://github.com/gcc-mirror/gcc/blob/58a0b190a256bd2a184554de0fae0031a614ec67/gcc/config/riscv/riscv.cc#L275-L292
We could add a similar subtarget feature in LLVM too.
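
A rough sketch of what such a per-processor threshold could look like. Everything here is hypothetical: `TuneInfo`, `mulCost` and `shouldExpandMul` are not existing LLVM or GCC names, and comparing the expansion's instruction count directly against the mul cost is an assumption made for illustration. The three cost values are the ones quoted above.

```cpp
#include <cassert>

// Hypothetical per-processor tuning record (not an existing LLVM type).
struct TuneInfo {
  int mulCost;  // cost of one mul, in units of one simple ALU instruction
};

// Values as quoted above for GCC's tunings.
constexpr TuneInfo kGenericOoO{2};
constexpr TuneInfo kOptSize{1};
constexpr TuneInfo kRocket{4};

// Assumed rule: expand only if the shift/add sequence is no more
// expensive than the multiply it replaces.
constexpr bool shouldExpandMul(int expansionInsns, const TuneInfo &tune) {
  return expansionInsns <= tune.mulCost;
}

// slli + shNadd + sub is three instructions.
static_assert(shouldExpandMul(3, kRocket), "fits a budget of 4");
static_assert(!shouldExpandMul(3, kGenericOoO), "exceeds a budget of 2");
static_assert(!shouldExpandMul(3, kOptSize), "exceeds a budget of 1");
```

Under this rule the three-instruction expansion would only be chosen for the rocket-like tunings, and the low-latency and size cases would keep the `mul`.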

https://github.com/llvm/llvm-project/pull/88993