[llvm] [RISCV] Strength reduce mul by 2^M - 3/5/9 (PR #88993)
Philip Reames via llvm-commits
llvm-commits at lists.llvm.org
Wed Apr 17 10:59:27 PDT 2024
================
@@ -49,19 +53,24 @@ define i64 @add_mul_combine_accept_a3(i64 %x) {
; RV32IMB-LABEL: add_mul_combine_accept_a3:
; RV32IMB: # %bb.0:
; RV32IMB-NEXT: li a2, 29
-; RV32IMB-NEXT: mul a1, a1, a2
-; RV32IMB-NEXT: mulhu a3, a0, a2
-; RV32IMB-NEXT: add a1, a3, a1
-; RV32IMB-NEXT: mul a2, a0, a2
+; RV32IMB-NEXT: mulhu a2, a0, a2
----------------
preames wrote:
A couple things here.
First, I agree that we are likely to want a per-processor, cost-based threshold here. The only question for me is the order of work and relative importance, not the net result.
I wasn't aware that SCR1 has a single-cycle multiply latency. I glanced at its schedule model just now, and given that the only latencies set appear to be for div/rem, I have to ask: are you sure this is actually true?
Second, I'll argue that instruction count != latency. In this particular case, we have two independent instructions followed by a dependent subtract. On any multi-issue core with single-cycle shl/add, I'd expect that to have a latency of 2. Obviously, this argument can be taken to a ridiculous extreme, but it isn't obvious to me that a core with single-cycle mul only wants single-cycle expansions. The SCR1 might be a weird example here; in the schedule model it looks like there's only one ALU? Even then, that's not solely a function of latency.
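To make the latency argument concrete, here is a minimal C++ sketch of the expansion for the constant 29 from the test above, assuming it is decomposed as 32x - 3x (the 2^M - 3 form from the PR title). The function name and the instruction mapping in the comments are illustrative assumptions, not the code LLVM actually emits.

```cpp
#include <cstdint>

// Hypothetical illustration: x * 29 computed as (x << 5) - ((x << 1) + x).
constexpr std::uint64_t mul29_expanded(std::uint64_t x) {
  std::uint64_t x32 = x << 5;        // 32x; would map to a shift (e.g. slli)
  std::uint64_t x3 = (x << 1) + x;   // 3x; would map to one sh1add, assuming Zba
  return x32 - x3;                   // sub; depends on both values above
}

// Three instructions, but the dependency chain is only two deep:
// x32 and x3 are independent of each other, so a multi-issue core
// could start both in the same cycle and the sub in the next.
static_assert(mul29_expanded(7) == 203u, "7 * 29 == 203");
static_assert(mul29_expanded(0xFFFFFFFFFFFFFFFFull) == 0xFFFFFFFFFFFFFFFFull * 29u,
              "wraps the same way as a 64-bit multiply");
```

Unsigned arithmetic is used so that wraparound matches the behavior of a truncating multiply.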
For comparison, gcc appears to be defaulting to an expansion budget of at least four instructions for rv64gc.
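As a rough sketch of how an instruction budget could gate this kind of expansion, the C++ below checks whether a constant has the form 2^M - K with K in {3, 5, 9} (the forms named in the PR title) and compares an assumed three-instruction cost against a budget. All names here (`Pow2MinusK`, `decompose`, `shouldExpand`, `kExpansionInstrs`) are hypothetical, and the three-instruction cost assumes Zba's sh1add/sh2add/sh3add are available; this is not how gcc or LLVM implement the decision.

```cpp
#include <cstdint>

// Hypothetical result type: when found, c == (1 << shift) - sub.
struct Pow2MinusK {
  bool found;
  unsigned shift;
  unsigned sub;
};

constexpr bool isPowerOf2(std::uint64_t v) {
  return v != 0 && (v & (v - 1)) == 0;
}

constexpr unsigned log2Exact(std::uint64_t v) {
  unsigned n = 0;
  while (v > 1) { v >>= 1; ++n; }
  return n;
}

// Search for M and K in {3, 5, 9} such that c == 2^M - K.
constexpr Pow2MinusK decompose(std::uint64_t c) {
  const unsigned ks[3] = {3, 5, 9};
  for (int i = 0; i < 3; ++i) {
    std::uint64_t sum = c + ks[i];
    if (sum < c) continue;  // skip candidates that overflow 64 bits
    if (isPowerOf2(sum)) return Pow2MinusK{true, log2Exact(sum), ks[i]};
  }
  return Pow2MinusK{false, 0, 0};
}

// Assumed cost: one shift, one shNadd, one sub.
constexpr unsigned kExpansionInstrs = 3;

// Hypothetical per-target check: expand only if the cost fits the budget.
constexpr bool shouldExpand(std::uint64_t c, unsigned budget) {
  return decompose(c).found && kExpansionInstrs <= budget;
}

static_assert(decompose(29).shift == 5 && decompose(29).sub == 3, "29 == 32 - 3");
static_assert(shouldExpand(29, 4), "fits a four-instruction budget");
static_assert(!shouldExpand(29, 2), "rejected under a two-instruction budget");
```

A real threshold would presumably weigh latency and issue width as well as instruction count, per the argument above.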
https://github.com/llvm/llvm-project/pull/88993