[llvm] [X86] Generate `vpmuludq` instead of `vpmullq` (PR #121456)

Wed Jan 1 23:37:40 PST 2025

https://github.com/abhishek-kaushik22 created https://github.com/llvm/llvm-project/pull/121456

When lowering the `_mm512_mul_epu32` intrinsic, if the generated value is later used in a vector shuffle, we generate `vpmullq` instead of `vpmuludq` (https://godbolt.org/z/WbaGMqs8e). This happens because `SimplifyDemandedVectorElts` simplifies the arguments, and the combine to `PMULDQ` then fails.

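This patch adds an override of `shouldSimplifyDemandedVectorElts` in `X86TargetLowering`. For a `VECTOR_SHUFFLE` with a `MUL` operand, it first checks whether the `MUL` can be combined to `PMULDQ`; if so, it performs the combine and returns false so that the simplification is skipped.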
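For readers less familiar with the two instructions, here is a minimal scalar sketch of why the choice matters. It models a single 64-bit lane and is not LLVM code; the helper names `pmuludq_lane` and `pmullq_lane` are hypothetical and chosen only for illustration. The underlying assumption, which I believe matches what the `PMULDQ` combine needs to prove, is that the two multiplies agree only when the upper 32 bits of both operands are known to be zero.

```cpp
#include <cstdint>

// Scalar model of one 64-bit lane of vpmuludq (hypothetical helper name):
// multiply the low 32 bits of each operand as unsigned values and keep the
// full 64-bit product.
constexpr std::uint64_t pmuludq_lane(std::uint64_t a, std::uint64_t b) {
  return (a & 0xFFFFFFFFULL) * (b & 0xFFFFFFFFULL);
}

// Scalar model of one 64-bit lane of vpmullq (hypothetical helper name):
// the low 64 bits of the 64x64-bit product.
constexpr std::uint64_t pmullq_lane(std::uint64_t a, std::uint64_t b) {
  return a * b;
}

// When the upper halves of both operands are zero, the two lanes agree, so
// the narrower multiply can stand in for the full one.
static_assert(pmuludq_lane(0xFFFFFFFFULL, 0xFFFFFFFFULL) ==
                  pmullq_lane(0xFFFFFFFFULL, 0xFFFFFFFFULL),
              "lanes agree when upper 32 bits are zero");

// When an operand has bits set in its upper half, the results differ, so
// the replacement is only valid if those bits are known to be zero.
static_assert(pmuludq_lane(0x100000002ULL, 3) != pmullq_lane(0x100000002ULL, 3),
              "lanes differ when upper 32 bits are nonzero");
```

If the simplification rewrites the operands so that this zero-upper-half property can no longer be recognized, the combine has nothing to match on, which would be consistent with the `vpmullq` output described above.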

From a0551f887bf63971ecb3bb16155b48972bb631b8 Mon Sep 17 00:00:00 2001
From: abhishek-kaushik22 <abhishek.kaushik at intel.com>
Date: Thu, 2 Jan 2025 13:05:07 +0530
Subject: [PATCH] [X86] Generate `vpmuludq` instead of `vpmullq`

When lowering the `_mm512_mul_epu32` intrinsic, if the generated value is later used in a vector shuffle, we generate `vpmullq` instead of `vpmuludq` (https://godbolt.org/z/WbaGMqs8e). This happens because `SimplifyDemandedVectorElts` simplifies the arguments, and the combine to `PMULDQ` then fails.

Add an override of `shouldSimplifyDemandedVectorElts` in `X86TargetLowering` that first checks whether the `MUL` can be combined to `PMULDQ`.
---
 llvm/lib/Target/X86/X86ISelLowering.cpp | 21 +++++++++++++++++++++
 llvm/lib/Target/X86/X86ISelLowering.h   |  3 +++
 2 files changed, 24 insertions(+)

diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index a0514e93d6598b..e104264bcbf918 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -60832,3 +60832,24 @@ Align X86TargetLowering::getPrefLoopAlignment(MachineLoop *ML) const {
     return Align(1ULL << ExperimentalPrefInnermostLoopAlignment);
   return TargetLowering::getPrefLoopAlignment();
 }
+
+bool X86TargetLowering::shouldSimplifyDemandedVectorElts(
+    SDValue Op, const TargetLoweringOpt &TLO) const {
+  if (Op.getOpcode() == ISD::VECTOR_SHUFFLE) {
+    SDValue V0 = peekThroughBitcasts(Op.getOperand(0));
+    SDValue V1 = peekThroughBitcasts(Op.getOperand(1));
+
+    if (V0.getOpcode() == ISD::MUL || V1.getOpcode() == ISD::MUL) {
+      SDNode *Mul = V0.getOpcode() == ISD::MUL ? V0.getNode() : V1.getNode();
+      SelectionDAG &DAG = TLO.DAG;
+      const X86Subtarget &Subtarget = DAG.getSubtarget<X86Subtarget>();
+      const SDLoc DL(Mul);
+
+      if (SDValue V = combineMulToPMULDQ(Mul, DL, DAG, Subtarget)) {
+        DAG.ReplaceAllUsesWith(Mul, V.getNode());
+        return false;
+      }
+    }
+  }
+  return true;
+}
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index 2b7a8eaf249d83..0a6cd53f557bb2 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1207,6 +1207,9 @@ namespace llvm {
 
     bool hasBitTest(SDValue X, SDValue Y) const override;
 
+    bool shouldSimplifyDemandedVectorElts(
+        SDValue Op, const TargetLoweringOpt &TLO) const override;
+
     bool shouldProduceAndByConstByHoistingConstFromShiftsLHSOfAnd(
         SDValue X, ConstantSDNode *XC, ConstantSDNode *CC, SDValue Y,
         unsigned OldShiftOpcode, unsigned NewShiftOpcode,