[llvm] [X86] Attempt to use VPMADD52L/VPMULUDQ instead of VPMULLQ on slow VPMULLQ targets (or when VPMULLQ is unavailable) (PR #171760)

Sat Dec 13 07:01:02 PST 2025

================
@@ -49926,6 +49873,39 @@ static SDValue combineMul(SDNode *N, SelectionDAG &DAG,
   if (SDValue V = combineMulToPMULDQ(N, DL, DAG, Subtarget))
     return V;
 
+  // ==============================================================
+  // Optimize VPMULLQ on slow targets
+  // ==============================================================
+  if (VT.getScalarType() == MVT::i64 && Subtarget.hasSlowPMULLQ()) {
+    SDValue Op0 = N->getOperand(0);
+    SDValue Op1 = N->getOperand(1);
+
+    KnownBits Known0 = DAG.computeKnownBits(Op0);
+    KnownBits Known1 = DAG.computeKnownBits(Op1);
+    unsigned Count0 = Known0.countMinLeadingZeros();
+    unsigned Count1 = Known1.countMinLeadingZeros();
+
+    // Optimization 1: Use VPMULUDQ (32-bit multiply).
+    // If the upper 32 bits are zero, we can use the standard PMULUDQ
+    // instruction. This is generally the fastest option and widely supported.
+    if (Count0 >= 32 && Count1 >= 32) {
+      return DAG.getNode(X86ISD::PMULUDQ, DL, VT, Op0, Op1);
+    }
----------------
RKSimon wrote:

(style) unnecessary braces

https://github.com/llvm/llvm-project/pull/171760