[llvm] [RISCV] Strength reduce mul by 2^M - 3/5/9 (PR #88993)

Wed Apr 17 10:59:27 PDT 2024

================
@@ -49,19 +53,24 @@ define i64 @add_mul_combine_accept_a3(i64 %x) {
 ; RV32IMB-LABEL: add_mul_combine_accept_a3:
 ; RV32IMB:       # %bb.0:
 ; RV32IMB-NEXT:    li a2, 29
-; RV32IMB-NEXT:    mul a1, a1, a2
-; RV32IMB-NEXT:    mulhu a3, a0, a2
-; RV32IMB-NEXT:    add a1, a3, a1
-; RV32IMB-NEXT:    mul a2, a0, a2
+; RV32IMB-NEXT:    mulhu a2, a0, a2
----------------
preames wrote:

A couple things here.

First, I agree that we are likely to want a per-processor, cost-based threshold here. The only question for me is the order of work and relative importance, not the net result.

I wasn't aware SCR1 has a single-cycle mul latency. I glanced at the schedule model for it just now, and given that the only latencies set appear to be for div/rem, I have to ask: are you sure this is actually true?

Second, I'll argue that instruction count != latency. In this particular case, we have two independent instructions followed by a dependent subtract. On any core with multiple issue and single-cycle shl/add, I'd expect that to have a latency of 2. Obviously, this argument can be taken to a ridiculous extreme, but it isn't obvious to me that a core with single-cycle mul only wants single-cycle expansions. The SCR1 might be a weird example here: in the schedule model it looks like there's only one ALU? Even then, that's not solely a function of latency.
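
To make the "two independent instructions followed by a dependent subtract" point concrete, here is a minimal Python sketch of the 2^M - 3/5/9 expansion for the constant 29 = 32 - 3 from the test above. The helper names (`shnadd`, `mul_pow2_minus_k`, `critical_path`) and the latency table are hypothetical, chosen for illustration only; this models the arithmetic and the dependence depth, not the actual lowering code in the PR.

```python
MASK64 = (1 << 64) - 1  # model 64-bit wrapping register arithmetic

def shnadd(n, rs1, rs2):
    """Model of Zba shNadd (n = 1, 2, 3): rd = (rs1 << n) + rs2."""
    return ((rs1 << n) + rs2) & MASK64

def mul_pow2_minus_k(x, m, k):
    """Compute x * (2**m - k) for k in {3, 5, 9} without a multiply."""
    n = {3: 1, 5: 2, 9: 3}[k]     # x*3, x*5, x*9 via sh1add/sh2add/sh3add
    hi = (x << m) & MASK64        # slli: does not depend on lo
    lo = shnadd(n, x, x)          # shNadd: x * k, does not depend on hi
    return (hi - lo) & MASK64     # sub: depends on both hi and lo

def critical_path(latency):
    """Dependence depth assuming hi and lo can issue in the same cycle."""
    return max(latency["slli"], latency["shnadd"]) + latency["sub"]

# Hypothetical latencies for a multi-issue core with single-cycle shl/add.
single_cycle = {"slli": 1, "shnadd": 1, "sub": 1}
print(mul_pow2_minus_k(7, 5, 3), critical_path(single_cycle))
```

Three instructions are emitted, but under these assumed latencies the critical path is two cycles, because only the subtract has to wait. On a single-issue core the two independent instructions would serialize, which is where the per-processor threshold comes in.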

For comparison, gcc appears to default to an expansion budget of at least four instructions for rv64gc.

https://github.com/llvm/llvm-project/pull/88993