[PATCH] D132322: [AArch64][SelectionDAG] Optimize multiplication by constant

Eli Friedman via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Aug 22 18:23:13 PDT 2022


efriedma added a comment.

It's not obvious that the replacement sequences are consistently faster.  At least on some cores, "add x8, x8, w0, sxtw #1" and "smull x0, w0, w8" have exactly the same throughput, so replacing the smull with a two-instruction sequence built around the add isn't really profitable.
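
For example, for a sign-extending i32 -> i64 multiply by 3, both forms cost two instructions (a sketch; register choices and the exact sequences are an illustration, not taken from the patch):

  mov   w8, #3               // materialize the constant
  smull x0, w0, w8           // widening multiply: x*3

  sxtw  x8, w0               // sign-extend x
  add   x0, x8, w0, sxtw #1  // x + x*2 = x*3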

On a related note, many cores have optimizations for arithmetic with a shifted register (lsl #n), so we should prefer that form over the uxtw/sxtw extended-register forms.
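
In other words, prefer something like the first of these over the second when both are available (a sketch with hypothetical operands):

  add x0, x1, x2, lsl #2     // shifted-register form, cheap on many cores
  add x0, x1, w2, sxtw #2    // extended-register form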



================
Comment at: llvm/test/CodeGen/AArch64/mul_pow2.ll:183
 
+; TODO: mov w8, w0 + lsl x8, x8, #2 should combine into lsl x8, x0, #2
 define i64 @test6_umaddl(i32 %x, i64 %y) {
----------------
I think the suggested combine misses a zero-extension: "mov w8, w0" implicitly zero-extends into x8, so "lsl x8, x0, #2" alone isn't equivalent.  Should be able to do the zero-extend and shift in one ubfiz, though.
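
Something like this (a sketch; register choices follow the TODO):

  ubfiz x8, x0, #2, #32      // zero-extend the low 32 bits of x0 and shift left by 2

keeps the zero-extension that a plain "lsl x8, x0, #2" would drop.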


================
Comment at: llvm/test/CodeGen/AArch64/mul_pow2.ll:296
+; CHECK-NEXT:    add x8, x8, w0, sxtw #1
+; CHECK-NEXT:    neg x0, x8
 ; CHECK-NEXT:    ret
----------------
I think you can save an instruction here: instead of "-(x*4+x*2)", compute x*2-x*8.
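
A minimal sketch of the two-instruction form, assuming the sign-extended value is already in a register (register names hypothetical):

  lsl x8, x9, #1             // x*2
  sub x0, x8, x9, lsl #3     // x*2 - x*8 = -(x*6)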


================
Comment at: llvm/test/CodeGen/AArch64/mul_pow2.ll:518
+; CHECK-NEXT:    add w8, w0, w0, lsl #1
+; CHECK-NEXT:    neg w0, w8
 ; CHECK-NEXT:    ret
----------------
This seems to be overriding our existing logic here to produce a worse result.
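
For reference, -(x*3) fits in a single instruction, which is presumably what the existing logic produced here (a sketch; I haven't checked the exact pre-patch output):

  sub w0, w0, w0, lsl #2     // x - x*4 = -(x*3)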


================
Comment at: llvm/test/CodeGen/AArch64/sve-intrinsics-counting-elems-i32.ll:169
+; CHECK-NEXT:    dech x8, vl16, mul #8
+; CHECK-NEXT:    add w0, w0, w8
 ; CHECK-NEXT:    ret
----------------
Probably need some logic to allow folding inch/dech, assuming there isn't some reason to avoid them.
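
A hedged sketch of what a folded form might look like, assuming only the low 32 bits of the result matter for the i32 return value (whether it's actually legal here depends on context not shown in the diff):

  dech x0, vl16, mul #8      // decrement x0 directly instead of going through x8 + add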


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D132322/new/

https://reviews.llvm.org/D132322


