[PATCH] D132322: [AArch64][SelectionDAG] Optimize multiplication by constant

Tue Sep 13 06:37:32 PDT 2022

dmgreen added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:13712
+    return false;
+  SDNode *Mul = *C->use_begin();
+  // Block the SExt/ZExt as ADD/SUB with sxtw/zxtw has low throughput.
----------------
We can't assume that the first user of a constant will be the mul we are interested in.

All the other handling of decomposing a mul into add+shift is currently done in performMulCombine. Would it be better to just alter the code that is already there? That would make it easier to be more precise with the cost model.
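
To illustrate the first point, here is a minimal standalone sketch. It assumes only that a constant node may be shared by several users and that the order of its use list is not something to rely on. `Node`, `Opcode`, `firstUserIsMul` and `anyUserIsMul` are hypothetical toy names, not LLVM API.

```cpp
#include <cassert>
#include <vector>

// Toy stand-ins for DAG nodes; these are NOT the LLVM SDNode types.
enum class Opcode { Constant, Mul, Store };

struct Node {
  Opcode Op;
  std::vector<const Node *> Users; // order is an implementation detail
};

// Fragile: looks only at whichever user happens to be first.
inline bool firstUserIsMul(const Node &C) {
  return !C.Users.empty() && C.Users.front()->Op == Opcode::Mul;
}

// Order-independent: looks at every user of the constant.
inline bool anyUserIsMul(const Node &C) {
  for (const Node *U : C.Users)
    if (U->Op == Opcode::Mul)
      return true;
  return false;
}

// The same constant used by a mul and a store, with two use-list orders.
inline void demoSharedConstant() {
  Node Mul{Opcode::Mul, {}};
  Node Store{Opcode::Store, {}};
  Node SixA{Opcode::Constant, {&Mul, &Store}};
  Node SixB{Opcode::Constant, {&Store, &Mul}};
  assert(firstUserIsMul(SixA) && !firstUserIsMul(SixB)); // depends on order
  assert(anyUserIsMul(SixA) && anyUserIsMul(SixB));      // does not
}
```

Starting from the mul itself, as performMulCombine does, avoids the question entirely because the combine is handed the node it cares about.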

================
Comment at: llvm/test/CodeGen/AArch64/mul_pow2.ll:146-148
+; CHECK-NEXT:    lsl w8, w0, #2
+; CHECK-NEXT:    add w8, w8, w0, lsl #1
+; CHECK-NEXT:    add w0, w8, w1
----------------
I think that (considering the mov as free in terms of latency) the madd in this case would cost 2-3, and the lsl+add_lsl+add would cost 3-4. It would depend heavily on the exact CPU though.
For i64 muls the madd would have a higher cost (it was 4 on one CPU I tested, but newer CPUs are better).
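
For reference, the two sequences being compared compute the same value. The sketch below mirrors the CHECK lines above in plain C++. The function names are made up for illustration, and the mov+madd form in the second function is an assumption about what the alternative lowering would look like.

```cpp
#include <cassert>
#include <cstdint>

// Shift+add form, one statement per instruction in the CHECK lines.
// Unsigned arithmetic keeps wraparound well defined.
constexpr uint32_t mul6AddShifts(uint32_t x, uint32_t y) {
  uint32_t t = x << 2; // lsl w8, w0, #2          -> 4*x
  t = t + (x << 1);    // add w8, w8, w0, lsl #1  -> 4*x + 2*x
  return t + y;        // add w0, w8, w1          -> 6*x + y
}

// Multiply-add form (assumed: mov of the constant 6, then a single madd).
constexpr uint32_t mul6AddMadd(uint32_t x, uint32_t y) {
  return x * 6u + y;
}

// Spot-check that both forms agree, including on wrapping inputs.
inline void checkEquivalent() {
  const uint32_t Samples[] = {0u, 1u, 7u, 0x7fffffffu, 0xffffffffu};
  for (uint32_t X : Samples)
    for (uint32_t Y : Samples)
      assert(mul6AddShifts(X, Y) == mul6AddMadd(X, Y));
}
```

The shift+add form is a chain of three dependent instructions, which is why its latency is compared against a single madd here.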

================
Comment at: llvm/test/CodeGen/AArch64/mul_pow2.ll:309
+; GISEL-NEXT:    ret
+  %mul = mul nsw i32 %x, 6
+  %sub = add nsw i32 %mul, -1
----------------
Is this the case you are interested in? Could we change the existing cost model to be more precise about which sub operand it considers free?
```
    // Conservatively do not lower to shift+add+shift if the mul might be
    // folded into madd or msub.
    if (N->hasOneUse() && (N->use_begin()->getOpcode() == ISD::ADD ||
                           N->use_begin()->getOpcode() == ISD::SUB))
      return SDValue();
```
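
One possible shape for a more precise check, as a standalone toy rather than a patch, and not necessarily the only refinement worth making: madd computes Ra + Rn*Rm and msub computes Ra - Rn*Rm, so a mul feeding a sub only folds when it is the operand being subtracted. `UserOp` and `mulFoldsIntoUser` are hypothetical names.

```cpp
#include <cassert>

// Toy opcode for the single user of the mul; not the ISD enum.
enum class UserOp { Add, Sub, Other };

// OperandNo is the position of the mul result in the user:
// 0 = left-hand side, 1 = right-hand side.
constexpr bool mulFoldsIntoUser(UserOp Op, unsigned OperandNo) {
  switch (Op) {
  case UserOp::Add:
    return true;           // madd: Ra + Rn*Rm, either operand works
  case UserOp::Sub:
    return OperandNo == 1; // msub: Ra - Rn*Rm, the mul must be subtracted
  default:
    return false;
  }
}

inline void checkFoldRules() {
  assert(mulFoldsIntoUser(UserOp::Add, 0));
  assert(mulFoldsIntoUser(UserOp::Sub, 1));
  assert(!mulFoldsIntoUser(UserOp::Sub, 0)); // (mul - x) is not an msub
}
```

With a rule like this, sub(mul, x) would no longer block the shift+add lowering, while sub(x, mul) still would.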

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D132322/new/

https://reviews.llvm.org/D132322