[llvm] [RFC][IR] Add llvm.masked.{udiv, sdiv, urem, srem} intrinsics (PR #189705)
Luke Lau via llvm-commits
llvm-commits at lists.llvm.org
Wed Apr 1 03:21:30 PDT 2026
https://github.com/lukel97 updated https://github.com/llvm/llvm-project/pull/189705
>From 922cf810b5c4be4742220e611f44ee3052a3c901 Mon Sep 17 00:00:00 2001
From: Luke Lau <luke at igalia.com>
Date: Tue, 31 Mar 2026 22:05:11 +0800
Subject: [PATCH 1/7] [IR] Add llvm.masked.{udiv,sdiv,urem,srem} intrinsics
---
llvm/docs/LangRef.rst | 110 +++
llvm/include/llvm/CodeGen/ISDOpcodes.h | 12 +
llvm/include/llvm/IR/Intrinsics.td | 28 +
.../include/llvm/Target/TargetSelectionDAG.td | 9 +
.../SelectionDAG/LegalizeIntegerTypes.cpp | 43 +
llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h | 7 +
.../SelectionDAG/LegalizeVectorOps.cpp | 23 +
.../SelectionDAG/LegalizeVectorTypes.cpp | 98 +++
.../lib/CodeGen/SelectionDAG/SelectionDAG.cpp | 15 +
.../SelectionDAG/SelectionDAGBuilder.cpp | 24 +
.../SelectionDAG/SelectionDAGDumper.cpp | 8 +
llvm/lib/CodeGen/TargetLoweringBase.cpp | 5 +
.../AArch64/masked-sdiv-fixed-length.ll | 469 +++++++++++
.../CodeGen/AArch64/masked-sdiv-scalable.ll | 74 ++
.../AArch64/masked-srem-fixed-length.ll | 502 ++++++++++++
.../CodeGen/AArch64/masked-srem-scalable.ll | 86 ++
.../AArch64/masked-udiv-fixed-length.ll | 468 +++++++++++
.../CodeGen/AArch64/masked-udiv-scalable.ll | 74 ++
.../AArch64/masked-urem-fixed-length.ll | 501 ++++++++++++
.../CodeGen/AArch64/masked-urem-scalable.ll | 86 ++
llvm/test/CodeGen/PowerPC/masked-sdiv.ll | 399 +++++++++
llvm/test/CodeGen/PowerPC/masked-srem.ll | 463 +++++++++++
llvm/test/CodeGen/PowerPC/masked-udiv.ll | 397 +++++++++
llvm/test/CodeGen/PowerPC/masked-urem.ll | 461 +++++++++++
llvm/test/CodeGen/RISCV/rvv/masked-sdiv.ll | 303 +++++++
llvm/test/CodeGen/RISCV/rvv/masked-srem.ll | 303 +++++++
llvm/test/CodeGen/RISCV/rvv/masked-udiv.ll | 302 +++++++
llvm/test/CodeGen/RISCV/rvv/masked-urem.ll | 302 +++++++
llvm/test/CodeGen/X86/masked-sdiv.ll | 758 +++++++++++++++++
llvm/test/CodeGen/X86/masked-srem.ll | 762 ++++++++++++++++++
llvm/test/CodeGen/X86/masked-udiv.ll | 756 +++++++++++++++++
llvm/test/CodeGen/X86/masked-urem.ll | 760 +++++++++++++++++
32 files changed, 8608 insertions(+)
create mode 100644 llvm/test/CodeGen/AArch64/masked-sdiv-fixed-length.ll
create mode 100644 llvm/test/CodeGen/AArch64/masked-sdiv-scalable.ll
create mode 100644 llvm/test/CodeGen/AArch64/masked-srem-fixed-length.ll
create mode 100644 llvm/test/CodeGen/AArch64/masked-srem-scalable.ll
create mode 100644 llvm/test/CodeGen/AArch64/masked-udiv-fixed-length.ll
create mode 100644 llvm/test/CodeGen/AArch64/masked-udiv-scalable.ll
create mode 100644 llvm/test/CodeGen/AArch64/masked-urem-fixed-length.ll
create mode 100644 llvm/test/CodeGen/AArch64/masked-urem-scalable.ll
create mode 100644 llvm/test/CodeGen/PowerPC/masked-sdiv.ll
create mode 100644 llvm/test/CodeGen/PowerPC/masked-srem.ll
create mode 100644 llvm/test/CodeGen/PowerPC/masked-udiv.ll
create mode 100644 llvm/test/CodeGen/PowerPC/masked-urem.ll
create mode 100644 llvm/test/CodeGen/RISCV/rvv/masked-sdiv.ll
create mode 100644 llvm/test/CodeGen/RISCV/rvv/masked-srem.ll
create mode 100644 llvm/test/CodeGen/RISCV/rvv/masked-udiv.ll
create mode 100644 llvm/test/CodeGen/RISCV/rvv/masked-urem.ll
create mode 100644 llvm/test/CodeGen/X86/masked-sdiv.ll
create mode 100644 llvm/test/CodeGen/X86/masked-srem.ll
create mode 100644 llvm/test/CodeGen/X86/masked-udiv.ll
create mode 100644 llvm/test/CodeGen/X86/masked-urem.ll
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 5584e0828d3cd..6850160788ab9 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -27895,6 +27895,116 @@ The '``llvm.masked.compressstore``' intrinsic is designed for compressing data i
Other targets may support this intrinsic differently, for example, by lowering it into a sequence of branches that guard scalar store operations.
+Masked Vector Arithmetic Intrinsics
+-----------------------------------
+
+'``llvm.masked.udiv.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+ declare <8 x i32> @llvm.masked.udiv.v8i32(<8 x i32> <op1>, <8 x i32> <op2>, <8 x i1> <mask>)
+ declare <vscale x 2 x i64> @llvm.masked.udiv.nxv2i64(<vscale x 2 x i64> <op1>, <vscale x 2 x i64> <op2>, <vscale x 2 x i1> <mask>)
+
+Overview:
+"""""""""
+
+Performs unsigned division (:ref:`udiv <i_udiv>`) of two vectors of integers, but only on enabled lanes.
+
+Arguments:
+""""""""""
+
+The first two arguments and the result have the same vector of integer type. The third argument is the vector mask and has the same number of elements as the result vector type.
+
+Semantics:
+""""""""""
+
+Unlike :ref:`udiv <i_udiv>`, disabled lanes produce poison, and division by zero on a disabled lane is not undefined behavior. Division by zero on an enabled lane is still undefined behavior.
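+
+For example, if the mask ``%m`` is ``<i1 true, i1 false>``, the following call
+is well defined even if the second element of ``%y`` is zero, and the second
+element of the result is poison:
+
+::
+
+      %res = call <2 x i32> @llvm.masked.udiv.v2i32(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m)
+
+On targets without native support, one possible lowering substitutes ``1`` for
+the divisor on disabled lanes and emits an unmasked division; each disabled
+result lane then holds the corresponding element of ``%x``, which is a valid
+refinement of poison:
+
+::
+
+      %safe.y = select <2 x i1> %m, <2 x i32> %y, <2 x i32> <i32 1, i32 1>
+      %res = udiv <2 x i32> %x, %safe.y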
+
+'``llvm.masked.sdiv.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+ declare <8 x i32> @llvm.masked.sdiv.v8i32(<8 x i32> <op1>, <8 x i32> <op2>, <8 x i1> <mask>)
+ declare <vscale x 2 x i64> @llvm.masked.sdiv.nxv2i64(<vscale x 2 x i64> <op1>, <vscale x 2 x i64> <op2>, <vscale x 2 x i1> <mask>)
+
+Overview:
+"""""""""
+
+Performs signed division (:ref:`sdiv <i_sdiv>`) of two vectors of integers, but only on enabled lanes.
+
+Arguments:
+""""""""""
+
+The first two arguments and the result have the same vector of integer type. The third argument is the vector mask and has the same number of elements as the result vector type.
+
+Semantics:
+""""""""""
+
+Unlike :ref:`sdiv <i_sdiv>`, disabled lanes produce poison, and division by zero on a disabled lane is not undefined behavior. Division by zero on an enabled lane is still undefined behavior.
+
+'``llvm.masked.urem.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+ declare <8 x i32> @llvm.masked.urem.v8i32(<8 x i32> <op1>, <8 x i32> <op2>, <8 x i1> <mask>)
+ declare <vscale x 2 x i64> @llvm.masked.urem.nxv2i64(<vscale x 2 x i64> <op1>, <vscale x 2 x i64> <op2>, <vscale x 2 x i1> <mask>)
+
+Overview:
+"""""""""
+
+Computes the remainder from the unsigned division (:ref:`urem <i_urem>`) of two vectors of integers, but only on enabled lanes.
+
+Arguments:
+""""""""""
+
+The first two arguments and the result have the same vector of integer type. The third argument is the vector mask and has the same number of elements as the result vector type.
+
+Semantics:
+""""""""""
+
+Unlike :ref:`urem <i_urem>`, disabled lanes produce poison, and taking the remainder of a division by zero on a disabled lane is not undefined behavior. Taking the remainder of a division by zero on an enabled lane is still undefined behavior.
+
+'``llvm.masked.srem.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+ declare <8 x i32> @llvm.masked.srem.v8i32(<8 x i32> <op1>, <8 x i32> <op2>, <8 x i1> <mask>)
+ declare <vscale x 2 x i64> @llvm.masked.srem.nxv2i64(<vscale x 2 x i64> <op1>, <vscale x 2 x i64> <op2>, <vscale x 2 x i1> <mask>)
+
+Overview:
+"""""""""
+
+Computes the remainder from the signed division (:ref:`srem <i_srem>`) of two vectors of integers, but only on enabled lanes.
+
+Arguments:
+""""""""""
+
+The first two arguments and the result have the same vector of integer type. The third argument is the vector mask and has the same number of elements as the result vector type.
+
+Semantics:
+""""""""""
+
+Unlike :ref:`srem <i_srem>`, disabled lanes produce poison, and taking the remainder of a division by zero on a disabled lane is not undefined behavior. Taking the remainder of a division by zero on an enabled lane is still undefined behavior.
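+
+As with the masked division intrinsics, one possible lowering on targets
+without native support substitutes ``1`` for the divisor on disabled lanes;
+``srem`` by ``1`` yields ``0`` there, which is a valid refinement of poison:
+
+::
+
+      %safe.y = select <8 x i1> %m, <8 x i32> %y, <8 x i32> splat (i32 1)
+      %res = srem <8 x i32> %x, %safe.y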
Memory Use Markers
------------------
diff --git a/llvm/include/llvm/CodeGen/ISDOpcodes.h b/llvm/include/llvm/CodeGen/ISDOpcodes.h
index 2bc575621f01f..ed7927e47b88e 100644
--- a/llvm/include/llvm/CodeGen/ISDOpcodes.h
+++ b/llvm/include/llvm/CodeGen/ISDOpcodes.h
@@ -1618,6 +1618,14 @@ enum NodeType {
LOOP_DEPENDENCE_WAR_MASK,
LOOP_DEPENDENCE_RAW_MASK,
+  /// Masked vector arithmetic that returns poison on disabled lanes. Division
+  /// by zero on a disabled lane is not undefined behavior. The first two
+  /// operands are the input vectors; the third operand is the mask.
+ MASKED_UDIV,
+ MASKED_SDIV,
+ MASKED_UREM,
+ MASKED_SREM,
+
/// llvm.clear_cache intrinsic
/// Operands: Input Chain, Start Addres, End Address
/// Outputs: Output Chain
@@ -1650,6 +1658,10 @@ LLVM_ABI NodeType getOppositeSignednessMinMaxOpcode(unsigned MinMaxOpc);
/// For example ISD::AND for ISD::VECREDUCE_AND.
LLVM_ABI NodeType getVecReduceBaseOpcode(unsigned VecReduceOpcode);
+/// Given a \p MaskedOpc of ISD::MASKED_(U|S)(DIV|REM), returns the unmasked
+/// ISD::(U|S)(DIV|REM).
+LLVM_ABI NodeType getUnmaskedBinOpOpcode(unsigned MaskedOpc);
+
/// Whether this is a vector-predicated Opcode.
LLVM_ABI bool isVPOpcode(unsigned Opcode);
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index f0e6fa692bb68..368d112161829 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -2606,6 +2606,34 @@ def int_experimental_vector_compress:
[LLVMMatchType<0>, LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>, LLVMMatchType<0>],
[IntrNoMem]>;
+def int_masked_udiv:
+ DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+ [LLVMMatchType<0>,
+ LLVMMatchType<0>,
+ LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
+ [IntrNoMem]>;
+
+def int_masked_sdiv:
+ DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+ [LLVMMatchType<0>,
+ LLVMMatchType<0>,
+ LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
+ [IntrNoMem]>;
+
+def int_masked_urem:
+ DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+ [LLVMMatchType<0>,
+ LLVMMatchType<0>,
+ LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
+ [IntrNoMem]>;
+
+def int_masked_srem:
+ DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+ [LLVMMatchType<0>,
+ LLVMMatchType<0>,
+ LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
+ [IntrNoMem]>;
+
// Test whether a pointer is associated with a type metadata identifier.
def int_type_test : DefaultAttrsIntrinsic<[llvm_i1_ty], [llvm_ptr_ty, llvm_metadata_ty],
[IntrNoMem, IntrSpeculatable]>;
diff --git a/llvm/include/llvm/Target/TargetSelectionDAG.td b/llvm/include/llvm/Target/TargetSelectionDAG.td
index 3cf110615f279..573342846b4cf 100644
--- a/llvm/include/llvm/Target/TargetSelectionDAG.td
+++ b/llvm/include/llvm/Target/TargetSelectionDAG.td
@@ -282,6 +282,11 @@ def SDTMaskedScatter : SDTypeProfile<0, 4, [
SDTCisSameNumEltsAs<0, 1>, SDTCisSameNumEltsAs<0, 3>
]>;
+def SDTMaskedIntBinOp : SDTypeProfile<1, 3, [
+  SDTCisVec<0>, SDTCisSameAs<0, 1>, SDTCisSameAs<0, 2>,
+  SDTCisSameNumEltsAs<0, 3>, SDTCisInt<0>, SDTCisInt<3>
+]>;
+
def SDTVectorCompress : SDTypeProfile<1, 3, [
SDTCisVec<0>, SDTCisSameAs<0, 1>,
SDTCisVec<2>, SDTCisSameNumEltsAs<1, 2>,
@@ -455,6 +460,10 @@ def sdiv : SDNode<"ISD::SDIV" , SDTIntBinOp>;
def udiv : SDNode<"ISD::UDIV" , SDTIntBinOp>;
def srem : SDNode<"ISD::SREM" , SDTIntBinOp>;
def urem : SDNode<"ISD::UREM" , SDTIntBinOp>;
+def masked_udiv : SDNode<"ISD::MASKED_UDIV", SDTMaskedIntBinOp>;
+def masked_sdiv : SDNode<"ISD::MASKED_SDIV", SDTMaskedIntBinOp>;
+def masked_urem : SDNode<"ISD::MASKED_UREM", SDTMaskedIntBinOp>;
+def masked_srem : SDNode<"ISD::MASKED_SREM", SDTMaskedIntBinOp>;
def sdivrem : SDNode<"ISD::SDIVREM" , SDTIntBinHiLoOp>;
def udivrem : SDNode<"ISD::UDIVREM" , SDTIntBinHiLoOp>;
def srl : SDNode<"ISD::SRL" , SDTIntShiftOp>;
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
index c6a4fe0b64cd7..8b42f64927bce 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
@@ -235,6 +235,15 @@ void DAGTypeLegalizer::PromoteIntegerResult(SDNode *N, unsigned ResNo) {
case ISD::VP_UDIV:
case ISD::VP_UREM: Res = PromoteIntRes_ZExtIntBinOp(N); break;
+ case ISD::MASKED_UDIV:
+ case ISD::MASKED_UREM:
+ Res = PromoteIntRes_ZExtMaskedIntBinOp(N);
+ break;
+ case ISD::MASKED_SDIV:
+ case ISD::MASKED_SREM:
+ Res = PromoteIntRes_SExtMaskedIntBinOp(N);
+ break;
+
case ISD::SADDO:
case ISD::SSUBO: Res = PromoteIntRes_SADDSUBO(N, ResNo); break;
case ISD::UADDO:
@@ -1543,6 +1552,22 @@ SDValue DAGTypeLegalizer::PromoteIntRes_ZExtIntBinOp(SDNode *N) {
Mask, EVL);
}
+SDValue DAGTypeLegalizer::PromoteIntRes_ZExtMaskedIntBinOp(SDNode *N) {
+ SDValue LHS = ZExtPromotedInteger(N->getOperand(0));
+ SDValue RHS = ZExtPromotedInteger(N->getOperand(1));
+ SDValue Mask = N->getOperand(2);
+ return DAG.getNode(N->getOpcode(), SDLoc(N), LHS.getValueType(), LHS, RHS,
+ Mask);
+}
+
+SDValue DAGTypeLegalizer::PromoteIntRes_SExtMaskedIntBinOp(SDNode *N) {
+ SDValue LHS = SExtPromotedInteger(N->getOperand(0));
+ SDValue RHS = SExtPromotedInteger(N->getOperand(1));
+ SDValue Mask = N->getOperand(2);
+ return DAG.getNode(N->getOpcode(), SDLoc(N), LHS.getValueType(), LHS, RHS,
+ Mask);
+}
+
SDValue DAGTypeLegalizer::PromoteIntRes_UMINUMAX(SDNode *N) {
SDValue LHS = N->getOperand(0);
SDValue RHS = N->getOperand(1);
@@ -2171,6 +2196,12 @@ bool DAGTypeLegalizer::PromoteIntegerOperand(SDNode *N, unsigned OpNo) {
case ISD::GET_ACTIVE_LANE_MASK:
Res = PromoteIntOp_GET_ACTIVE_LANE_MASK(N);
break;
+ case ISD::MASKED_UDIV:
+ case ISD::MASKED_SDIV:
+ case ISD::MASKED_UREM:
+ case ISD::MASKED_SREM:
+ Res = PromoteIntOp_MaskedBinOp(N, OpNo);
+ break;
case ISD::PARTIAL_REDUCE_UMLA:
case ISD::PARTIAL_REDUCE_SMLA:
case ISD::PARTIAL_REDUCE_SUMLA:
@@ -3016,6 +3047,18 @@ SDValue DAGTypeLegalizer::PromoteIntOp_GET_ACTIVE_LANE_MASK(SDNode *N) {
return SDValue(DAG.UpdateNodeOperands(N, NewOps), 0);
}
+SDValue DAGTypeLegalizer::PromoteIntOp_MaskedBinOp(SDNode *N, unsigned OpNo) {
+  assert(OpNo == 2 && "Can only promote the mask operand");
+ SmallVector<SDValue, 3> NewOps(N->ops());
+
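+  // Promote the mask with sext or zext to match how the target represents
+  // vector booleans (all-ones vs. zero-or-one elements).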
+ if (TLI.getBooleanContents(NewOps[2].getValueType()) ==
+ TargetLowering::ZeroOrNegativeOneBooleanContent)
+ NewOps[2] = SExtPromotedInteger(NewOps[2]);
+ else
+ NewOps[2] = ZExtPromotedInteger(NewOps[2]);
+ return SDValue(DAG.UpdateNodeOperands(N, NewOps), 0);
+}
+
SDValue DAGTypeLegalizer::PromoteIntOp_PARTIAL_REDUCE_MLA(SDNode *N) {
SmallVector<SDValue, 1> NewOps(N->ops());
switch (N->getOpcode()) {
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h b/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
index 4362845450acf..84c91a80ade79 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
@@ -333,6 +333,8 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
SDValue PromoteIntRes_SimpleIntBinOp(SDNode *N);
SDValue PromoteIntRes_ZExtIntBinOp(SDNode *N);
SDValue PromoteIntRes_SExtIntBinOp(SDNode *N);
+ SDValue PromoteIntRes_ZExtMaskedIntBinOp(SDNode *N);
+ SDValue PromoteIntRes_SExtMaskedIntBinOp(SDNode *N);
SDValue PromoteIntRes_UMINUMAX(SDNode *N);
SDValue PromoteIntRes_SIGN_EXTEND_INREG(SDNode *N);
SDValue PromoteIntRes_SRA(SDNode *N);
@@ -422,6 +424,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
SDValue PromoteIntOp_GET_ACTIVE_LANE_MASK(SDNode *N);
SDValue PromoteIntOp_PARTIAL_REDUCE_MLA(SDNode *N);
SDValue PromoteIntOp_LOOP_DEPENDENCE_MASK(SDNode *N, unsigned OpNo);
+ SDValue PromoteIntOp_MaskedBinOp(SDNode *N, unsigned OpNo);
void SExtOrZExtPromotedOperands(SDValue &LHS, SDValue &RHS);
void PromoteSetCCOperands(SDValue &LHS,SDValue &RHS, ISD::CondCode Code);
@@ -828,6 +831,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
SDValue ScalarizeVecRes_MERGE_VALUES(SDNode *N, unsigned ResNo);
SDValue ScalarizeVecRes_LOOP_DEPENDENCE_MASK(SDNode *N);
SDValue ScalarizeVecRes_BinOp(SDNode *N);
+ SDValue ScalarizeVecRes_MaskedBinOp(SDNode *N);
SDValue ScalarizeVecRes_CMP(SDNode *N);
SDValue ScalarizeVecRes_TernaryOp(SDNode *N);
SDValue ScalarizeVecRes_UnaryOp(SDNode *N);
@@ -882,6 +886,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
SDValue ScalarizeVecOp_FAKE_USE(SDNode *N);
SDValue ScalarizeVecOp_VECTOR_FIND_LAST_ACTIVE(SDNode *N);
SDValue ScalarizeVecOp_CTTZ_ELTS(SDNode *N);
+ SDValue ScalarizeVecOp_MaskedBinOp(SDNode *N, unsigned OpNo);
//===--------------------------------------------------------------------===//
// Vector Splitting Support: LegalizeVectorTypes.cpp
@@ -911,6 +916,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
// Vector Result Splitting: <128 x ty> -> 2 x <64 x ty>.
void SplitVectorResult(SDNode *N, unsigned ResNo);
void SplitVecRes_BinOp(SDNode *N, SDValue &Lo, SDValue &Hi);
+ void SplitVecRes_MaskedBinOp(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_TernaryOp(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_CMP(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_UnaryOp(SDNode *N, SDValue &Lo, SDValue &Hi);
@@ -1064,6 +1070,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
SDValue WidenVecRes_Ternary(SDNode *N);
SDValue WidenVecRes_Binary(SDNode *N);
+ SDValue WidenVecRes_MaskedBinary(SDNode *N);
SDValue WidenVecRes_CMP(SDNode *N);
SDValue WidenVecRes_BinaryCanTrap(SDNode *N);
SDValue WidenVecRes_BinaryWithExtraScalarOp(SDNode *N);
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
index b6e7c275bb3a7..131e24ad73c06 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
@@ -139,6 +139,7 @@ class VectorLegalizer {
SDValue ExpandVP_FABS(SDNode *Node);
SDValue ExpandVP_FCOPYSIGN(SDNode *Node);
SDValue ExpandLOOP_DEPENDENCE_MASK(SDNode *N);
+ SDValue ExpandMaskedBinOp(SDNode *N);
SDValue ExpandSELECT(SDNode *Node);
std::pair<SDValue, SDValue> ExpandLoad(SDNode *N);
SDValue ExpandStore(SDNode *N);
@@ -482,6 +483,10 @@ SDValue VectorLegalizer::LegalizeOp(SDValue Op) {
case ISD::UCMP:
case ISD::LOOP_DEPENDENCE_WAR_MASK:
case ISD::LOOP_DEPENDENCE_RAW_MASK:
+ case ISD::MASKED_UDIV:
+ case ISD::MASKED_SDIV:
+ case ISD::MASKED_UREM:
+ case ISD::MASKED_SREM:
Action = TLI.getOperationAction(Node->getOpcode(), Node->getValueType(0));
break;
case ISD::SMULFIX:
@@ -1390,6 +1395,12 @@ void VectorLegalizer::Expand(SDNode *Node, SmallVectorImpl<SDValue> &Results) {
return;
}
break;
+ case ISD::MASKED_UDIV:
+ case ISD::MASKED_SDIV:
+ case ISD::MASKED_UREM:
+ case ISD::MASKED_SREM:
+ Results.push_back(ExpandMaskedBinOp(Node));
+ return;
}
SDValue Unrolled = DAG.UnrollVectorOp(Node);
@@ -1917,6 +1928,18 @@ SDValue VectorLegalizer::ExpandLOOP_DEPENDENCE_MASK(SDNode *N) {
return DAG.getNode(ISD::GET_ACTIVE_LANE_MASK, DL, VT, LaneOffset, MaskN);
}
+SDValue VectorLegalizer::ExpandMaskedBinOp(SDNode *N) {
+  // Masked binary ops don't have undefined behavior when dividing by zero on
+  // disabled lanes; they produce poison instead. Replace the divisor on the
+  // disabled lanes with 1 to avoid division by zero.
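+  // E.g. masked_udiv X, Y, M -> udiv X, (select M, Y, splat(1)).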
+ SDLoc dl(N);
+ EVT VT = N->getValueType(0);
+ SDValue SafeDivisor = DAG.getSelect(
+ dl, VT, N->getOperand(2), N->getOperand(1), DAG.getConstant(1, dl, VT));
+ return DAG.getNode(ISD::getUnmaskedBinOpOpcode(N->getOpcode()), dl, VT,
+ N->getOperand(0), SafeDivisor);
+}
+
void VectorLegalizer::ExpandFP_TO_UINT(SDNode *Node,
SmallVectorImpl<SDValue> &Results) {
// Attempt to expand using TargetLowering.
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
index 20f8d2fb4ed73..4d6172bcfbdaa 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
@@ -209,6 +209,13 @@ void DAGTypeLegalizer::ScalarizeVectorResult(SDNode *N, unsigned ResNo) {
R = ScalarizeVecRes_BinOp(N);
break;
+ case ISD::MASKED_UDIV:
+ case ISD::MASKED_SDIV:
+ case ISD::MASKED_UREM:
+ case ISD::MASKED_SREM:
+ R = ScalarizeVecRes_MaskedBinOp(N);
+ break;
+
case ISD::SCMP:
case ISD::UCMP:
R = ScalarizeVecRes_CMP(N);
@@ -263,6 +270,27 @@ SDValue DAGTypeLegalizer::ScalarizeVecRes_BinOp(SDNode *N) {
LHS.getValueType(), LHS, RHS, N->getFlags());
}
+SDValue DAGTypeLegalizer::ScalarizeVecRes_MaskedBinOp(SDNode *N) {
+ SDLoc DL(N);
+ SDValue LHS = GetScalarizedVector(N->getOperand(0));
+ SDValue RHS = GetScalarizedVector(N->getOperand(1));
+ SDValue Mask = N->getOperand(2);
+ EVT MaskVT = Mask.getValueType();
+  // The result and input vectors need scalarizing, but it's not a given
+  // that the mask does. For instance, in AVX512 v1i1 is legal.
+  // See the similar logic in ScalarizeVecRes_SETCC.
+ if (getTypeAction(MaskVT) == TargetLowering::TypeScalarizeVector)
+ Mask = GetScalarizedVector(Mask);
+ else
+ Mask = DAG.getExtractVectorElt(DL, MaskVT.getVectorElementType(), Mask, 0);
+  // Masked binary ops don't have UB on disabled lanes but produce poison, so
+  // use 1 as the divisor on disabled lanes to avoid division by zero.
+ SDValue Divisor = DAG.getSelect(DL, LHS.getValueType(), Mask, RHS,
+ DAG.getConstant(1, DL, LHS.getValueType()));
+ return DAG.getNode(ISD::getUnmaskedBinOpOpcode(N->getOpcode()), DL,
+ LHS.getValueType(), LHS, Divisor);
+}
+
SDValue DAGTypeLegalizer::ScalarizeVecRes_CMP(SDNode *N) {
SDLoc DL(N);
@@ -913,6 +941,12 @@ bool DAGTypeLegalizer::ScalarizeVectorOperand(SDNode *N, unsigned OpNo) {
case ISD::CTTZ_ELTS_ZERO_POISON:
Res = ScalarizeVecOp_CTTZ_ELTS(N);
break;
+ case ISD::MASKED_UDIV:
+ case ISD::MASKED_SDIV:
+ case ISD::MASKED_UREM:
+ case ISD::MASKED_SREM:
+ Res = ScalarizeVecOp_MaskedBinOp(N, OpNo);
+ break;
}
// If the result is null, the sub-method took care of registering results etc.
@@ -1237,6 +1271,21 @@ SDValue DAGTypeLegalizer::ScalarizeVecOp_CTTZ_ELTS(SDNode *N) {
return DAG.getZExtOrTrunc(SetCC, SDLoc(N), N->getValueType(0));
}
+SDValue DAGTypeLegalizer::ScalarizeVecOp_MaskedBinOp(SDNode *N, unsigned OpNo) {
+ assert(OpNo == 2 && "Can only scalarize mask operand");
+ SDLoc DL(N);
+ EVT VT = N->getOperand(0).getValueType().getVectorElementType();
+ SDValue LHS = DAG.getExtractVectorElt(DL, VT, N->getOperand(0), 0);
+ SDValue RHS = DAG.getExtractVectorElt(DL, VT, N->getOperand(1), 0);
+ SDValue Mask = GetScalarizedVector(N->getOperand(2));
+  // Masked binary ops don't have UB on disabled lanes but produce poison, so
+  // use 1 as the divisor on disabled lanes to avoid division by zero.
+ SDValue BinOp =
+ DAG.getNode(ISD::getUnmaskedBinOpOpcode(N->getOpcode()), DL, VT, LHS,
+ DAG.getSelect(DL, VT, Mask, RHS, DAG.getConstant(1, DL, VT)));
+ return DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, N->getValueType(0), BinOp);
+}
+
//===----------------------------------------------------------------------===//
// Result Vector Splitting
//===----------------------------------------------------------------------===//
@@ -1498,6 +1547,12 @@ void DAGTypeLegalizer::SplitVectorResult(SDNode *N, unsigned ResNo) {
case ISD::VP_FCOPYSIGN:
SplitVecRes_BinOp(N, Lo, Hi);
break;
+ case ISD::MASKED_UDIV:
+ case ISD::MASKED_SDIV:
+ case ISD::MASKED_UREM:
+ case ISD::MASKED_SREM:
+ SplitVecRes_MaskedBinOp(N, Lo, Hi);
+ break;
case ISD::FMA: case ISD::VP_FMA:
case ISD::FSHL:
case ISD::VP_FSHL:
@@ -1629,6 +1684,23 @@ void DAGTypeLegalizer::SplitVecRes_BinOp(SDNode *N, SDValue &Lo, SDValue &Hi) {
{LHSHi, RHSHi, MaskHi, EVLHi}, Flags);
}
+void DAGTypeLegalizer::SplitVecRes_MaskedBinOp(SDNode *N, SDValue &Lo,
+ SDValue &Hi) {
+ SDValue LHSLo, LHSHi;
+ GetSplitVector(N->getOperand(0), LHSLo, LHSHi);
+ SDValue RHSLo, RHSHi;
+ GetSplitVector(N->getOperand(1), RHSLo, RHSHi);
+ auto [MaskLo, MaskHi] = SplitMask(N->getOperand(2));
+ SDLoc dl(N);
+
+ const SDNodeFlags Flags = N->getFlags();
+ unsigned Opcode = N->getOpcode();
+ Lo = DAG.getNode(Opcode, dl, LHSLo.getValueType(), LHSLo, RHSLo, MaskLo,
+ Flags);
+ Hi = DAG.getNode(Opcode, dl, LHSHi.getValueType(), LHSHi, RHSHi, MaskHi,
+ Flags);
+}
+
void DAGTypeLegalizer::SplitVecRes_TernaryOp(SDNode *N, SDValue &Lo,
SDValue &Hi) {
SDValue Op0Lo, Op0Hi;
@@ -5118,6 +5190,13 @@ void DAGTypeLegalizer::WidenVectorResult(SDNode *N, unsigned ResNo) {
Res = WidenVecRes_Binary(N);
break;
+ case ISD::MASKED_UDIV:
+ case ISD::MASKED_SDIV:
+ case ISD::MASKED_UREM:
+ case ISD::MASKED_SREM:
+ Res = WidenVecRes_MaskedBinary(N);
+ break;
+
case ISD::SCMP:
case ISD::UCMP:
Res = WidenVecRes_CMP(N);
@@ -5348,6 +5427,25 @@ SDValue DAGTypeLegalizer::WidenVecRes_Binary(SDNode *N) {
{InOp1, InOp2, Mask, N->getOperand(3)}, N->getFlags());
}
+SDValue DAGTypeLegalizer::WidenVecRes_MaskedBinary(SDNode *N) {
+ SDLoc dl(N);
+ EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
+ SDValue InOp1 = GetWidenedVector(N->getOperand(0));
+ SDValue InOp2 = GetWidenedVector(N->getOperand(1));
+ SDValue Mask = N->getOperand(2);
+ EVT MaskVT = Mask.getValueType();
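+  // Any widened lanes of the mask must be false so that the extra lanes are
+  // disabled and cannot introduce division by zero.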
+  if (getTypeAction(MaskVT) == TargetLowering::TypeWidenVector) {
+    Mask = GetWidenedMask(Mask, WidenVT.getVectorElementCount());
+  } else {
+    EVT WidenMaskVT = WidenVT.changeVectorElementType(
+        *DAG.getContext(), MaskVT.getVectorElementType());
+    Mask = DAG.getInsertSubvector(dl, DAG.getConstant(0, dl, WidenMaskVT), Mask,
+                                  0);
+  }
+ return DAG.getNode(N->getOpcode(), dl, WidenVT, InOp1, InOp2, Mask,
+ N->getFlags());
+}
+
SDValue DAGTypeLegalizer::WidenVecRes_CMP(SDNode *N) {
LLVMContext &Ctxt = *DAG.getContext();
SDLoc dl(N);
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
index 3716de880cce3..a457fb028e9d3 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
@@ -505,6 +505,21 @@ ISD::NodeType ISD::getVecReduceBaseOpcode(unsigned VecReduceOpcode) {
}
}
+ISD::NodeType ISD::getUnmaskedBinOpOpcode(unsigned MaskedOpc) {
+ switch (MaskedOpc) {
+ case ISD::MASKED_UDIV:
+ return ISD::UDIV;
+ case ISD::MASKED_SDIV:
+ return ISD::SDIV;
+ case ISD::MASKED_UREM:
+ return ISD::UREM;
+ case ISD::MASKED_SREM:
+ return ISD::SREM;
+ default:
+ llvm_unreachable("Expected masked binop opcode");
+ }
+}
+
bool ISD::isVPOpcode(unsigned Opcode) {
switch (Opcode) {
default:
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 69ba9721401e1..6ebd0966b7fd1 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -8488,6 +8488,30 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
getValue(I.getOperand(1)), getValue(I.getOperand(2)),
DAG.getConstant(0, sdl, MVT::i64)));
return;
+ case Intrinsic::masked_udiv:
+ setValue(&I,
+ DAG.getNode(ISD::MASKED_UDIV, sdl, EVT::getEVT(I.getType()),
+ getValue(I.getOperand(0)), getValue(I.getOperand(1)),
+ getValue(I.getOperand(2))));
+ return;
+ case Intrinsic::masked_sdiv:
+ setValue(&I,
+ DAG.getNode(ISD::MASKED_SDIV, sdl, EVT::getEVT(I.getType()),
+ getValue(I.getOperand(0)), getValue(I.getOperand(1)),
+ getValue(I.getOperand(2))));
+ return;
+ case Intrinsic::masked_urem:
+ setValue(&I,
+ DAG.getNode(ISD::MASKED_UREM, sdl, EVT::getEVT(I.getType()),
+ getValue(I.getOperand(0)), getValue(I.getOperand(1)),
+ getValue(I.getOperand(2))));
+ return;
+ case Intrinsic::masked_srem:
+ setValue(&I,
+ DAG.getNode(ISD::MASKED_SREM, sdl, EVT::getEVT(I.getType()),
+ getValue(I.getOperand(0)), getValue(I.getOperand(1)),
+ getValue(I.getOperand(2))));
+ return;
}
}
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
index 6c9387418ae2e..ce78072d21114 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
@@ -614,6 +614,14 @@ std::string SDNode::getOperationName(const SelectionDAG *G) const {
return "loop_dep_war";
case ISD::LOOP_DEPENDENCE_RAW_MASK:
return "loop_dep_raw";
+ case ISD::MASKED_UDIV:
+ return "masked_udiv";
+ case ISD::MASKED_SDIV:
+ return "masked_sdiv";
+ case ISD::MASKED_UREM:
+ return "masked_urem";
+ case ISD::MASKED_SREM:
+ return "masked_srem";
// Vector Predication
#define BEGIN_REGISTER_VP_SDNODE(SDID, LEGALARG, NAME, ...) \
diff --git a/llvm/lib/CodeGen/TargetLoweringBase.cpp b/llvm/lib/CodeGen/TargetLoweringBase.cpp
index 2ad00eaaadc98..2f1e3f2f3ff7a 100644
--- a/llvm/lib/CodeGen/TargetLoweringBase.cpp
+++ b/llvm/lib/CodeGen/TargetLoweringBase.cpp
@@ -1252,6 +1252,11 @@ void TargetLoweringBase::initActions() {
setOperationAction(ISD::RESET_FPENV, VT, Expand);
setOperationAction(ISD::MSTORE, VT, Expand);
+
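+    // Masked div/rem default to Expand: the generic expansion selects a safe
+    // divisor on disabled lanes and emits the unmasked operation.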
+ setOperationAction(ISD::MASKED_UDIV, VT, Expand);
+ setOperationAction(ISD::MASKED_SDIV, VT, Expand);
+ setOperationAction(ISD::MASKED_UREM, VT, Expand);
+ setOperationAction(ISD::MASKED_SREM, VT, Expand);
}
// Most targets ignore the @llvm.prefetch intrinsic.
diff --git a/llvm/test/CodeGen/AArch64/masked-sdiv-fixed-length.ll b/llvm/test/CodeGen/AArch64/masked-sdiv-fixed-length.ll
new file mode 100644
index 0000000000000..44d56c2a2afe7
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/masked-sdiv-fixed-length.ll
@@ -0,0 +1,469 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple aarch64 < %s | FileCheck %s --check-prefix=NEON
+; RUN: llc -mtriple aarch64 -mattr=+sve < %s | FileCheck %s --check-prefix=SVE
+
+; Legal
+define <4 x i32> @sdiv_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
+; NEON-LABEL: sdiv_v4i32:
+; NEON: // %bb.0:
+; NEON-NEXT: shl v2.4h, v2.4h, #15
+; NEON-NEXT: mov w9, v1.s[1]
+; NEON-NEXT: mov w10, v0.s[1]
+; NEON-NEXT: mov w11, v1.s[2]
+; NEON-NEXT: mov w12, v0.s[2]
+; NEON-NEXT: mov w13, v0.s[3]
+; NEON-NEXT: cmlt v2.4h, v2.4h, #0
+; NEON-NEXT: umov w8, v2.h[1]
+; NEON-NEXT: tst w8, #0xffff
+; NEON-NEXT: csinc w8, w9, wzr, ne
+; NEON-NEXT: umov w9, v2.h[0]
+; NEON-NEXT: sdiv w8, w10, w8
+; NEON-NEXT: fmov w10, s1
+; NEON-NEXT: tst w9, #0xffff
+; NEON-NEXT: fmov w9, s0
+; NEON-NEXT: csinc w10, w10, wzr, ne
+; NEON-NEXT: sdiv w9, w9, w10
+; NEON-NEXT: umov w10, v2.h[2]
+; NEON-NEXT: tst w10, #0xffff
+; NEON-NEXT: csinc w10, w11, wzr, ne
+; NEON-NEXT: umov w11, v2.h[3]
+; NEON-NEXT: sdiv w10, w12, w10
+; NEON-NEXT: mov w12, v1.s[3]
+; NEON-NEXT: fmov s0, w9
+; NEON-NEXT: tst w11, #0xffff
+; NEON-NEXT: mov v0.s[1], w8
+; NEON-NEXT: csinc w9, w12, wzr, ne
+; NEON-NEXT: sdiv w8, w13, w9
+; NEON-NEXT: mov v0.s[2], w10
+; NEON-NEXT: mov v0.s[3], w8
+; NEON-NEXT: ret
+;
+; SVE-LABEL: sdiv_v4i32:
+; SVE: // %bb.0:
+; SVE-NEXT: shl v2.4h, v2.4h, #15
+; SVE-NEXT: mov w9, v1.s[1]
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: mov w11, v1.s[2]
+; SVE-NEXT: ptrue p0.s, vl4
+; SVE-NEXT: cmlt v2.4h, v2.4h, #0
+; SVE-NEXT: umov w8, v2.h[1]
+; SVE-NEXT: umov w10, v2.h[0]
+; SVE-NEXT: tst w8, #0xffff
+; SVE-NEXT: csinc w8, w9, wzr, ne
+; SVE-NEXT: fmov w9, s1
+; SVE-NEXT: tst w10, #0xffff
+; SVE-NEXT: umov w10, v2.h[2]
+; SVE-NEXT: csinc w9, w9, wzr, ne
+; SVE-NEXT: fmov s3, w9
+; SVE-NEXT: mov w9, v1.s[3]
+; SVE-NEXT: tst w10, #0xffff
+; SVE-NEXT: csinc w10, w11, wzr, ne
+; SVE-NEXT: mov v3.s[1], w8
+; SVE-NEXT: umov w8, v2.h[3]
+; SVE-NEXT: mov v3.s[2], w10
+; SVE-NEXT: tst w8, #0xffff
+; SVE-NEXT: csinc w8, w9, wzr, ne
+; SVE-NEXT: mov v3.s[3], w8
+; SVE-NEXT: sdiv z0.s, p0/m, z0.s, z3.s
+; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
+; SVE-NEXT: ret
+ %res = call <4 x i32> @llvm.masked.sdiv(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
+ ret <4 x i32> %res
+}
+
+define <2 x i64> @sdiv_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
+; NEON-LABEL: sdiv_v2i64:
+; NEON: // %bb.0:
+; NEON-NEXT: shl v2.2s, v2.2s, #31
+; NEON-NEXT: mov x9, v1.d[1]
+; NEON-NEXT: cmlt v2.2s, v2.2s, #0
+; NEON-NEXT: mov w8, v2.s[1]
+; NEON-NEXT: fmov w10, s2
+; NEON-NEXT: cmp w8, #0
+; NEON-NEXT: fmov x8, d1
+; NEON-NEXT: csinc x9, x9, xzr, ne
+; NEON-NEXT: cmp w10, #0
+; NEON-NEXT: fmov x10, d0
+; NEON-NEXT: csinc x8, x8, xzr, ne
+; NEON-NEXT: sdiv x8, x10, x8
+; NEON-NEXT: mov x10, v0.d[1]
+; NEON-NEXT: sdiv x9, x10, x9
+; NEON-NEXT: fmov d0, x8
+; NEON-NEXT: mov v0.d[1], x9
+; NEON-NEXT: ret
+;
+; SVE-LABEL: sdiv_v2i64:
+; SVE: // %bb.0:
+; SVE-NEXT: shl v2.2s, v2.2s, #31
+; SVE-NEXT: mov x9, v1.d[1]
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: fmov x10, d1
+; SVE-NEXT: ptrue p0.d, vl2
+; SVE-NEXT: cmlt v2.2s, v2.2s, #0
+; SVE-NEXT: mov w8, v2.s[1]
+; SVE-NEXT: cmp w8, #0
+; SVE-NEXT: fmov w8, s2
+; SVE-NEXT: csinc x9, x9, xzr, ne
+; SVE-NEXT: cmp w8, #0
+; SVE-NEXT: csinc x8, x10, xzr, ne
+; SVE-NEXT: fmov d1, x8
+; SVE-NEXT: mov v1.d[1], x9
+; SVE-NEXT: sdiv z0.d, p0/m, z0.d, z1.d
+; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
+; SVE-NEXT: ret
+ %res = call <2 x i64> @llvm.masked.sdiv(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m)
+ ret <2 x i64> %res
+}
+
+; Splitting
+define <4 x i64> @sdiv_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
+; NEON-LABEL: sdiv_v4i64:
+; NEON: // %bb.0:
+; NEON-NEXT: ushll v4.4s, v4.4h, #0
+; NEON-NEXT: mov x9, v2.d[1]
+; NEON-NEXT: mov x10, v0.d[1]
+; NEON-NEXT: mov x11, v3.d[1]
+; NEON-NEXT: fmov x12, d3
+; NEON-NEXT: shl v5.2s, v4.2s, #31
+; NEON-NEXT: cmlt v5.2s, v5.2s, #0
+; NEON-NEXT: mov w8, v5.s[1]
+; NEON-NEXT: cmp w8, #0
+; NEON-NEXT: csinc x8, x9, xzr, ne
+; NEON-NEXT: fmov w9, s5
+; NEON-NEXT: sdiv x8, x10, x8
+; NEON-NEXT: fmov x10, d2
+; NEON-NEXT: cmp w9, #0
+; NEON-NEXT: fmov x9, d0
+; NEON-NEXT: ext v0.16b, v4.16b, v4.16b, #8
+; NEON-NEXT: csinc x10, x10, xzr, ne
+; NEON-NEXT: shl v0.2s, v0.2s, #31
+; NEON-NEXT: cmlt v0.2s, v0.2s, #0
+; NEON-NEXT: sdiv x9, x9, x10
+; NEON-NEXT: mov w10, v0.s[1]
+; NEON-NEXT: cmp w10, #0
+; NEON-NEXT: fmov w10, s0
+; NEON-NEXT: csinc x11, x11, xzr, ne
+; NEON-NEXT: cmp w10, #0
+; NEON-NEXT: csinc x10, x12, xzr, ne
+; NEON-NEXT: fmov x12, d1
+; NEON-NEXT: sdiv x10, x12, x10
+; NEON-NEXT: mov x12, v1.d[1]
+; NEON-NEXT: fmov d0, x9
+; NEON-NEXT: mov v0.d[1], x8
+; NEON-NEXT: sdiv x11, x12, x11
+; NEON-NEXT: fmov d1, x10
+; NEON-NEXT: mov v1.d[1], x11
+; NEON-NEXT: ret
+;
+; SVE-LABEL: sdiv_v4i64:
+; SVE: // %bb.0:
+; SVE-NEXT: ushll v4.4s, v4.4h, #0
+; SVE-NEXT: mov x9, v2.d[1]
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: // kill: def $q1 killed $q1 def $z1
+; SVE-NEXT: ptrue p0.d, vl2
+; SVE-NEXT: shl v5.2s, v4.2s, #31
+; SVE-NEXT: cmlt v5.2s, v5.2s, #0
+; SVE-NEXT: mov w8, v5.s[1]
+; SVE-NEXT: fmov w10, s5
+; SVE-NEXT: cmp w8, #0
+; SVE-NEXT: fmov x8, d2
+; SVE-NEXT: csinc x9, x9, xzr, ne
+; SVE-NEXT: cmp w10, #0
+; SVE-NEXT: csinc x8, x8, xzr, ne
+; SVE-NEXT: fmov d2, x8
+; SVE-NEXT: mov v2.d[1], x9
+; SVE-NEXT: mov x9, v3.d[1]
+; SVE-NEXT: sdiv z0.d, p0/m, z0.d, z2.d
+; SVE-NEXT: ext v2.16b, v4.16b, v4.16b, #8
+; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
+; SVE-NEXT: shl v2.2s, v2.2s, #31
+; SVE-NEXT: cmlt v2.2s, v2.2s, #0
+; SVE-NEXT: mov w8, v2.s[1]
+; SVE-NEXT: fmov w10, s2
+; SVE-NEXT: cmp w8, #0
+; SVE-NEXT: fmov x8, d3
+; SVE-NEXT: csinc x9, x9, xzr, ne
+; SVE-NEXT: cmp w10, #0
+; SVE-NEXT: csinc x8, x8, xzr, ne
+; SVE-NEXT: fmov d2, x8
+; SVE-NEXT: mov v2.d[1], x9
+; SVE-NEXT: sdiv z1.d, p0/m, z1.d, z2.d
+; SVE-NEXT: // kill: def $q1 killed $q1 killed $z1
+; SVE-NEXT: ret
+ %res = call <4 x i64> @llvm.masked.sdiv(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
+ ret <4 x i64> %res
+}
+
+; Widening
+define <2 x i32> @sdiv_v2i32(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m) {
+; NEON-LABEL: sdiv_v2i32:
+; NEON: // %bb.0:
+; NEON-NEXT: shl v2.2s, v2.2s, #31
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: fmov w8, s0
+; NEON-NEXT: cmlt v2.2s, v2.2s, #0
+; NEON-NEXT: and v1.8b, v1.8b, v2.8b
+; NEON-NEXT: mvn v2.8b, v2.8b
+; NEON-NEXT: sub v1.2s, v1.2s, v2.2s
+; NEON-NEXT: fmov w9, s1
+; NEON-NEXT: mov w10, v1.s[1]
+; NEON-NEXT: sdiv w8, w8, w9
+; NEON-NEXT: mov w9, v0.s[1]
+; NEON-NEXT: sdiv w9, w9, w10
+; NEON-NEXT: fmov s0, w8
+; NEON-NEXT: mov v0.s[1], w9
+; NEON-NEXT: // kill: def $d0 killed $d0 killed $q0
+; NEON-NEXT: ret
+;
+; SVE-LABEL: sdiv_v2i32:
+; SVE: // %bb.0:
+; SVE-NEXT: shl v2.2s, v2.2s, #31
+; SVE-NEXT: ptrue p0.s, vl2
+; SVE-NEXT: // kill: def $d0 killed $d0 def $z0
+; SVE-NEXT: cmlt v2.2s, v2.2s, #0
+; SVE-NEXT: and v1.8b, v1.8b, v2.8b
+; SVE-NEXT: mvn v2.8b, v2.8b
+; SVE-NEXT: sub v1.2s, v1.2s, v2.2s
+; SVE-NEXT: sdiv z0.s, p0/m, z0.s, z1.s
+; SVE-NEXT: // kill: def $d0 killed $d0 killed $z0
+; SVE-NEXT: ret
+ %res = call <2 x i32> @llvm.masked.sdiv(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m)
+ ret <2 x i32> %res
+}
+
+; Promotion
+define <4 x i16> @sdiv_v4i16(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m) {
+; NEON-LABEL: sdiv_v4i16:
+; NEON: // %bb.0:
+; NEON-NEXT: shl v2.4h, v2.4h, #15
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: smov w8, v0.h[1]
+; NEON-NEXT: cmlt v2.4h, v2.4h, #0
+; NEON-NEXT: and v1.8b, v1.8b, v2.8b
+; NEON-NEXT: mvn v2.8b, v2.8b
+; NEON-NEXT: sub v1.4h, v1.4h, v2.4h
+; NEON-NEXT: smov w9, v1.h[1]
+; NEON-NEXT: smov w10, v1.h[0]
+; NEON-NEXT: smov w11, v1.h[2]
+; NEON-NEXT: smov w12, v1.h[3]
+; NEON-NEXT: sdiv w8, w8, w9
+; NEON-NEXT: smov w9, v0.h[0]
+; NEON-NEXT: sdiv w9, w9, w10
+; NEON-NEXT: smov w10, v0.h[2]
+; NEON-NEXT: sdiv w10, w10, w11
+; NEON-NEXT: smov w11, v0.h[3]
+; NEON-NEXT: fmov s0, w9
+; NEON-NEXT: mov v0.h[1], w8
+; NEON-NEXT: sdiv w8, w11, w12
+; NEON-NEXT: mov v0.h[2], w10
+; NEON-NEXT: mov v0.h[3], w8
+; NEON-NEXT: // kill: def $d0 killed $d0 killed $q0
+; NEON-NEXT: ret
+;
+; SVE-LABEL: sdiv_v4i16:
+; SVE: // %bb.0:
+; SVE-NEXT: shl v2.4h, v2.4h, #15
+; SVE-NEXT: sshll v0.4s, v0.4h, #0
+; SVE-NEXT: ptrue p0.s, vl4
+; SVE-NEXT: cmlt v2.4h, v2.4h, #0
+; SVE-NEXT: and v1.8b, v1.8b, v2.8b
+; SVE-NEXT: mvn v2.8b, v2.8b
+; SVE-NEXT: sub v1.4h, v1.4h, v2.4h
+; SVE-NEXT: sshll v1.4s, v1.4h, #0
+; SVE-NEXT: sdiv z0.s, p0/m, z0.s, z1.s
+; SVE-NEXT: xtn v0.4h, v0.4s
+; SVE-NEXT: ret
+ %res = call <4 x i16> @llvm.masked.sdiv(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m)
+ ret <4 x i16> %res
+}
+
+; Scalarization
+define <1 x i64> @sdiv_v1i64(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m) {
+; NEON-LABEL: sdiv_v1i64:
+; NEON: // %bb.0:
+; NEON-NEXT: // kill: def $d1 killed $d1 def $q1
+; NEON-NEXT: fmov x8, d1
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: fmov x9, d0
+; NEON-NEXT: tst w0, #0x1
+; NEON-NEXT: csinc x8, x8, xzr, ne
+; NEON-NEXT: sdiv x8, x9, x8
+; NEON-NEXT: fmov d0, x8
+; NEON-NEXT: ret
+;
+; SVE-LABEL: sdiv_v1i64:
+; SVE: // %bb.0:
+; SVE-NEXT: // kill: def $d1 killed $d1 def $q1
+; SVE-NEXT: fmov x8, d1
+; SVE-NEXT: // kill: def $d0 killed $d0 def $q0
+; SVE-NEXT: fmov x9, d0
+; SVE-NEXT: tst w0, #0x1
+; SVE-NEXT: csinc x8, x8, xzr, ne
+; SVE-NEXT: sdiv x8, x9, x8
+; SVE-NEXT: fmov d0, x8
+; SVE-NEXT: ret
+ %res = call <1 x i64> @llvm.masked.sdiv(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m)
+ ret <1 x i64> %res
+}
+
+; Expansion
+define <2 x i128> @sdiv_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
+; NEON-LABEL: sdiv_v2i128:
+; NEON: // %bb.0:
+; NEON-NEXT: stp x30, x25, [sp, #-64]! // 16-byte Folded Spill
+; NEON-NEXT: stp x24, x23, [sp, #16] // 16-byte Folded Spill
+; NEON-NEXT: stp x22, x21, [sp, #32] // 16-byte Folded Spill
+; NEON-NEXT: stp x20, x19, [sp, #48] // 16-byte Folded Spill
+; NEON-NEXT: .cfi_def_cfa_offset 64
+; NEON-NEXT: .cfi_offset w19, -8
+; NEON-NEXT: .cfi_offset w20, -16
+; NEON-NEXT: .cfi_offset w21, -24
+; NEON-NEXT: .cfi_offset w22, -32
+; NEON-NEXT: .cfi_offset w23, -40
+; NEON-NEXT: .cfi_offset w24, -48
+; NEON-NEXT: .cfi_offset w25, -56
+; NEON-NEXT: .cfi_offset w30, -64
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: fmov w8, s0
+; NEON-NEXT: mov x21, x3
+; NEON-NEXT: mov x22, x2
+; NEON-NEXT: mov x19, x7
+; NEON-NEXT: mov x20, x6
+; NEON-NEXT: mov w25, v0.s[1]
+; NEON-NEXT: tst w8, #0x1
+; NEON-NEXT: csel x3, x5, xzr, ne
+; NEON-NEXT: csinc x2, x4, xzr, ne
+; NEON-NEXT: bl __divti3
+; NEON-NEXT: tst w25, #0x1
+; NEON-NEXT: mov x23, x0
+; NEON-NEXT: mov x24, x1
+; NEON-NEXT: csel x3, x19, xzr, ne
+; NEON-NEXT: csinc x2, x20, xzr, ne
+; NEON-NEXT: mov x0, x22
+; NEON-NEXT: mov x1, x21
+; NEON-NEXT: bl __divti3
+; NEON-NEXT: mov x2, x0
+; NEON-NEXT: mov x3, x1
+; NEON-NEXT: mov x0, x23
+; NEON-NEXT: mov x1, x24
+; NEON-NEXT: ldp x20, x19, [sp, #48] // 16-byte Folded Reload
+; NEON-NEXT: ldp x22, x21, [sp, #32] // 16-byte Folded Reload
+; NEON-NEXT: ldp x24, x23, [sp, #16] // 16-byte Folded Reload
+; NEON-NEXT: ldp x30, x25, [sp], #64 // 16-byte Folded Reload
+; NEON-NEXT: ret
+;
+; SVE-LABEL: sdiv_v2i128:
+; SVE: // %bb.0:
+; SVE-NEXT: stp x30, x25, [sp, #-64]! // 16-byte Folded Spill
+; SVE-NEXT: stp x24, x23, [sp, #16] // 16-byte Folded Spill
+; SVE-NEXT: stp x22, x21, [sp, #32] // 16-byte Folded Spill
+; SVE-NEXT: stp x20, x19, [sp, #48] // 16-byte Folded Spill
+; SVE-NEXT: .cfi_def_cfa_offset 64
+; SVE-NEXT: .cfi_offset w19, -8
+; SVE-NEXT: .cfi_offset w20, -16
+; SVE-NEXT: .cfi_offset w21, -24
+; SVE-NEXT: .cfi_offset w22, -32
+; SVE-NEXT: .cfi_offset w23, -40
+; SVE-NEXT: .cfi_offset w24, -48
+; SVE-NEXT: .cfi_offset w25, -56
+; SVE-NEXT: .cfi_offset w30, -64
+; SVE-NEXT: // kill: def $d0 killed $d0 def $q0
+; SVE-NEXT: fmov w8, s0
+; SVE-NEXT: mov x21, x3
+; SVE-NEXT: mov x22, x2
+; SVE-NEXT: mov x19, x7
+; SVE-NEXT: mov x20, x6
+; SVE-NEXT: mov w25, v0.s[1]
+; SVE-NEXT: tst w8, #0x1
+; SVE-NEXT: csel x3, x5, xzr, ne
+; SVE-NEXT: csinc x2, x4, xzr, ne
+; SVE-NEXT: bl __divti3
+; SVE-NEXT: tst w25, #0x1
+; SVE-NEXT: mov x23, x0
+; SVE-NEXT: mov x24, x1
+; SVE-NEXT: csel x3, x19, xzr, ne
+; SVE-NEXT: csinc x2, x20, xzr, ne
+; SVE-NEXT: mov x0, x22
+; SVE-NEXT: mov x1, x21
+; SVE-NEXT: bl __divti3
+; SVE-NEXT: mov x2, x0
+; SVE-NEXT: mov x3, x1
+; SVE-NEXT: mov x0, x23
+; SVE-NEXT: mov x1, x24
+; SVE-NEXT: ldp x20, x19, [sp, #48] // 16-byte Folded Reload
+; SVE-NEXT: ldp x22, x21, [sp, #32] // 16-byte Folded Reload
+; SVE-NEXT: ldp x24, x23, [sp, #16] // 16-byte Folded Reload
+; SVE-NEXT: ldp x30, x25, [sp], #64 // 16-byte Folded Reload
+; SVE-NEXT: ret
+ %res = call <2 x i128> @llvm.masked.sdiv(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m)
+ ret <2 x i128> %res
+}
+
+; Promotion and widening
+define <3 x i10> @sdiv_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
+; NEON-LABEL: sdiv_v3i10:
+; NEON: // %bb.0:
+; NEON-NEXT: fmov s0, w6
+; NEON-NEXT: fmov s1, w3
+; NEON-NEXT: ldr w8, [sp]
+; NEON-NEXT: fmov s2, w0
+; NEON-NEXT: mov v0.h[1], w7
+; NEON-NEXT: mov v1.h[1], w4
+; NEON-NEXT: mov v2.h[1], w1
+; NEON-NEXT: mov v0.h[2], w8
+; NEON-NEXT: mov v1.h[2], w5
+; NEON-NEXT: mov v2.h[2], w2
+; NEON-NEXT: shl v0.4h, v0.4h, #15
+; NEON-NEXT: shl v1.4h, v1.4h, #6
+; NEON-NEXT: shl v2.4h, v2.4h, #6
+; NEON-NEXT: cmlt v0.4h, v0.4h, #0
+; NEON-NEXT: sshr v1.4h, v1.4h, #6
+; NEON-NEXT: sshr v2.4h, v2.4h, #6
+; NEON-NEXT: and v1.8b, v1.8b, v0.8b
+; NEON-NEXT: mvn v0.8b, v0.8b
+; NEON-NEXT: smov w8, v2.h[0]
+; NEON-NEXT: sub v0.4h, v1.4h, v0.4h
+; NEON-NEXT: smov w9, v0.h[0]
+; NEON-NEXT: sdiv w0, w8, w9
+; NEON-NEXT: smov w8, v2.h[1]
+; NEON-NEXT: smov w9, v0.h[1]
+; NEON-NEXT: sdiv w1, w8, w9
+; NEON-NEXT: smov w8, v2.h[2]
+; NEON-NEXT: smov w9, v0.h[2]
+; NEON-NEXT: sdiv w2, w8, w9
+; NEON-NEXT: ret
+;
+; SVE-LABEL: sdiv_v3i10:
+; SVE: // %bb.0:
+; SVE-NEXT: fmov s0, w6
+; SVE-NEXT: fmov s1, w3
+; SVE-NEXT: ldr w8, [sp]
+; SVE-NEXT: fmov s2, w0
+; SVE-NEXT: ptrue p0.s, vl4
+; SVE-NEXT: mov v0.h[1], w7
+; SVE-NEXT: mov v1.h[1], w4
+; SVE-NEXT: mov v2.h[1], w1
+; SVE-NEXT: mov v0.h[2], w8
+; SVE-NEXT: mov v1.h[2], w5
+; SVE-NEXT: mov v2.h[2], w2
+; SVE-NEXT: shl v0.4h, v0.4h, #15
+; SVE-NEXT: shl v1.4h, v1.4h, #6
+; SVE-NEXT: shl v2.4h, v2.4h, #6
+; SVE-NEXT: cmlt v0.4h, v0.4h, #0
+; SVE-NEXT: sshr v1.4h, v1.4h, #6
+; SVE-NEXT: sshr v2.4h, v2.4h, #6
+; SVE-NEXT: and v1.8b, v1.8b, v0.8b
+; SVE-NEXT: mvn v0.8b, v0.8b
+; SVE-NEXT: sub v0.4h, v1.4h, v0.4h
+; SVE-NEXT: sshll v1.4s, v2.4h, #0
+; SVE-NEXT: sshll v0.4s, v0.4h, #0
+; SVE-NEXT: sdivr z0.s, p0/m, z0.s, z1.s
+; SVE-NEXT: xtn v0.4h, v0.4s
+; SVE-NEXT: umov w0, v0.h[0]
+; SVE-NEXT: umov w1, v0.h[1]
+; SVE-NEXT: umov w2, v0.h[2]
+; SVE-NEXT: ret
+ %res = call <3 x i10> @llvm.masked.sdiv(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m)
+ ret <3 x i10> %res
+}
diff --git a/llvm/test/CodeGen/AArch64/masked-sdiv-scalable.ll b/llvm/test/CodeGen/AArch64/masked-sdiv-scalable.ll
new file mode 100644
index 0000000000000..bb3ce4124907a
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/masked-sdiv-scalable.ll
@@ -0,0 +1,74 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple aarch64 -mattr=+sve < %s | FileCheck %s
+
+define <vscale x 4 x i16> @sdiv_nxv4i16(<vscale x 4 x i16> %x, <vscale x 4 x i16> %y, <vscale x 4 x i1> %m) {
+; CHECK-LABEL: sdiv_nxv4i16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ptrue p1.s
+; CHECK-NEXT: mov z2.s, #1 // =0x1
+; CHECK-NEXT: sxth z1.s, p1/m, z1.s
+; CHECK-NEXT: sxth z0.s, p1/m, z0.s
+; CHECK-NEXT: sel z1.s, p0, z1.s, z2.s
+; CHECK-NEXT: sdiv z0.s, p1/m, z0.s, z1.s
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i16> @llvm.masked.sdiv(<vscale x 4 x i16> %x, <vscale x 4 x i16> %y, <vscale x 4 x i1> %m)
+ ret <vscale x 4 x i16> %res
+}
+
+define <vscale x 4 x i32> @sdiv_nxv4i32(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %m) {
+; CHECK-LABEL: sdiv_nxv4i32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z2.s, #1 // =0x1
+; CHECK-NEXT: sel z1.s, p0, z1.s, z2.s
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: sdiv z0.s, p0/m, z0.s, z1.s
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i32> @llvm.masked.sdiv(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %m)
+ ret <vscale x 4 x i32> %res
+}
+
+define <vscale x 8 x i32> @sdiv_nxv8i32(<vscale x 8 x i32> %x, <vscale x 8 x i32> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: sdiv_nxv8i32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z4.s, #1 // =0x1
+; CHECK-NEXT: punpklo p1.h, p0.b
+; CHECK-NEXT: punpkhi p0.h, p0.b
+; CHECK-NEXT: sel z2.s, p1, z2.s, z4.s
+; CHECK-NEXT: ptrue p1.s
+; CHECK-NEXT: sdiv z0.s, p1/m, z0.s, z2.s
+; CHECK-NEXT: sel z2.s, p0, z3.s, z4.s
+; CHECK-NEXT: sdiv z1.s, p1/m, z1.s, z2.s
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i32> @llvm.masked.sdiv(<vscale x 8 x i32> %x, <vscale x 8 x i32> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i32> %res
+}
+
+define <vscale x 8 x i16> @sdiv_nxv8i16(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: sdiv_nxv8i16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z2.h, #1 // =0x1
+; CHECK-NEXT: sel z1.h, p0, z1.h, z2.h
+; CHECK-NEXT: sunpkhi z2.s, z0.h
+; CHECK-NEXT: sunpklo z0.s, z0.h
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: sunpkhi z3.s, z1.h
+; CHECK-NEXT: sunpklo z1.s, z1.h
+; CHECK-NEXT: sdiv z2.s, p0/m, z2.s, z3.s
+; CHECK-NEXT: sdiv z0.s, p0/m, z0.s, z1.s
+; CHECK-NEXT: uzp1 z0.h, z0.h, z2.h
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i16> @llvm.masked.sdiv(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i16> %res
+}
+
+define <vscale x 2 x i64> @sdiv_nxv2i64(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %m) {
+; CHECK-LABEL: sdiv_nxv2i64:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z2.d, #1 // =0x1
+; CHECK-NEXT: sel z1.d, p0, z1.d, z2.d
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: sdiv z0.d, p0/m, z0.d, z1.d
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i64> @llvm.masked.sdiv(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %m)
+ ret <vscale x 2 x i64> %res
+}
diff --git a/llvm/test/CodeGen/AArch64/masked-srem-fixed-length.ll b/llvm/test/CodeGen/AArch64/masked-srem-fixed-length.ll
new file mode 100644
index 0000000000000..c822e9eb0afa8
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/masked-srem-fixed-length.ll
@@ -0,0 +1,502 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple aarch64 < %s | FileCheck %s --check-prefix=NEON
+; RUN: llc -mtriple aarch64 -mattr=+sve < %s | FileCheck %s --check-prefix=SVE
+
+; Legal
+define <4 x i32> @srem_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
+; NEON-LABEL: srem_v4i32:
+; NEON: // %bb.0:
+; NEON-NEXT: shl v2.4h, v2.4h, #15
+; NEON-NEXT: mov w9, v1.s[1]
+; NEON-NEXT: fmov w12, s1
+; NEON-NEXT: mov w10, v0.s[1]
+; NEON-NEXT: mov w15, v1.s[2]
+; NEON-NEXT: mov w16, v0.s[2]
+; NEON-NEXT: mov w18, v1.s[3]
+; NEON-NEXT: mov w0, v0.s[3]
+; NEON-NEXT: cmlt v2.4h, v2.4h, #0
+; NEON-NEXT: umov w8, v2.h[1]
+; NEON-NEXT: umov w11, v2.h[0]
+; NEON-NEXT: umov w14, v2.h[2]
+; NEON-NEXT: umov w17, v2.h[3]
+; NEON-NEXT: tst w8, #0xffff
+; NEON-NEXT: csinc w8, w9, wzr, ne
+; NEON-NEXT: tst w11, #0xffff
+; NEON-NEXT: fmov w11, s0
+; NEON-NEXT: csinc w12, w12, wzr, ne
+; NEON-NEXT: sdiv w9, w10, w8
+; NEON-NEXT: tst w14, #0xffff
+; NEON-NEXT: csinc w14, w15, wzr, ne
+; NEON-NEXT: tst w17, #0xffff
+; NEON-NEXT: sdiv w13, w11, w12
+; NEON-NEXT: msub w8, w9, w8, w10
+; NEON-NEXT: sdiv w15, w16, w14
+; NEON-NEXT: msub w11, w13, w12, w11
+; NEON-NEXT: csinc w12, w18, wzr, ne
+; NEON-NEXT: fmov s0, w11
+; NEON-NEXT: mov v0.s[1], w8
+; NEON-NEXT: sdiv w9, w0, w12
+; NEON-NEXT: msub w8, w15, w14, w16
+; NEON-NEXT: mov v0.s[2], w8
+; NEON-NEXT: msub w8, w9, w12, w0
+; NEON-NEXT: mov v0.s[3], w8
+; NEON-NEXT: ret
+;
+; SVE-LABEL: srem_v4i32:
+; SVE: // %bb.0:
+; SVE-NEXT: shl v2.4h, v2.4h, #15
+; SVE-NEXT: mov w9, v1.s[1]
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: mov w11, v1.s[2]
+; SVE-NEXT: ptrue p0.s, vl4
+; SVE-NEXT: cmlt v2.4h, v2.4h, #0
+; SVE-NEXT: umov w8, v2.h[1]
+; SVE-NEXT: umov w10, v2.h[0]
+; SVE-NEXT: tst w8, #0xffff
+; SVE-NEXT: csinc w8, w9, wzr, ne
+; SVE-NEXT: fmov w9, s1
+; SVE-NEXT: tst w10, #0xffff
+; SVE-NEXT: umov w10, v2.h[2]
+; SVE-NEXT: csinc w9, w9, wzr, ne
+; SVE-NEXT: fmov s3, w9
+; SVE-NEXT: mov w9, v1.s[3]
+; SVE-NEXT: tst w10, #0xffff
+; SVE-NEXT: csinc w10, w11, wzr, ne
+; SVE-NEXT: mov v3.s[1], w8
+; SVE-NEXT: umov w8, v2.h[3]
+; SVE-NEXT: mov v3.s[2], w10
+; SVE-NEXT: tst w8, #0xffff
+; SVE-NEXT: csinc w8, w9, wzr, ne
+; SVE-NEXT: mov v3.s[3], w8
+; SVE-NEXT: movprfx z1, z0
+; SVE-NEXT: sdiv z1.s, p0/m, z1.s, z3.s
+; SVE-NEXT: mls v0.4s, v1.4s, v3.4s
+; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
+; SVE-NEXT: ret
+ %res = call <4 x i32> @llvm.masked.srem(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
+ ret <4 x i32> %res
+}
+
+define <2 x i64> @srem_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
+; NEON-LABEL: srem_v2i64:
+; NEON: // %bb.0:
+; NEON-NEXT: shl v2.2s, v2.2s, #31
+; NEON-NEXT: mov x9, v1.d[1]
+; NEON-NEXT: mov x12, v0.d[1]
+; NEON-NEXT: cmlt v2.2s, v2.2s, #0
+; NEON-NEXT: mov w8, v2.s[1]
+; NEON-NEXT: fmov w10, s2
+; NEON-NEXT: cmp w8, #0
+; NEON-NEXT: fmov x8, d1
+; NEON-NEXT: csinc x9, x9, xzr, ne
+; NEON-NEXT: cmp w10, #0
+; NEON-NEXT: fmov x10, d0
+; NEON-NEXT: sdiv x13, x12, x9
+; NEON-NEXT: csinc x8, x8, xzr, ne
+; NEON-NEXT: sdiv x11, x10, x8
+; NEON-NEXT: msub x9, x13, x9, x12
+; NEON-NEXT: msub x8, x11, x8, x10
+; NEON-NEXT: fmov d0, x8
+; NEON-NEXT: mov v0.d[1], x9
+; NEON-NEXT: ret
+;
+; SVE-LABEL: srem_v2i64:
+; SVE: // %bb.0:
+; SVE-NEXT: shl v2.2s, v2.2s, #31
+; SVE-NEXT: mov x9, v1.d[1]
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: fmov x10, d1
+; SVE-NEXT: ptrue p0.d, vl2
+; SVE-NEXT: cmlt v2.2s, v2.2s, #0
+; SVE-NEXT: mov w8, v2.s[1]
+; SVE-NEXT: cmp w8, #0
+; SVE-NEXT: fmov w8, s2
+; SVE-NEXT: csinc x9, x9, xzr, ne
+; SVE-NEXT: cmp w8, #0
+; SVE-NEXT: csinc x8, x10, xzr, ne
+; SVE-NEXT: fmov d1, x8
+; SVE-NEXT: mov v1.d[1], x9
+; SVE-NEXT: movprfx z2, z0
+; SVE-NEXT: sdiv z2.d, p0/m, z2.d, z1.d
+; SVE-NEXT: mls z0.d, p0/m, z2.d, z1.d
+; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
+; SVE-NEXT: ret
+ %res = call <2 x i64> @llvm.masked.srem(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m)
+ ret <2 x i64> %res
+}
+
+; Splitting
+define <4 x i64> @srem_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
+; NEON-LABEL: srem_v4i64:
+; NEON: // %bb.0:
+; NEON-NEXT: ushll v4.4s, v4.4h, #0
+; NEON-NEXT: mov x9, v2.d[1]
+; NEON-NEXT: mov x10, v0.d[1]
+; NEON-NEXT: fmov x12, d2
+; NEON-NEXT: mov x15, v3.d[1]
+; NEON-NEXT: fmov x16, d3
+; NEON-NEXT: mov x18, v1.d[1]
+; NEON-NEXT: shl v5.2s, v4.2s, #31
+; NEON-NEXT: cmlt v5.2s, v5.2s, #0
+; NEON-NEXT: mov w8, v5.s[1]
+; NEON-NEXT: fmov w11, s5
+; NEON-NEXT: cmp w8, #0
+; NEON-NEXT: csinc x8, x9, xzr, ne
+; NEON-NEXT: cmp w11, #0
+; NEON-NEXT: fmov x11, d0
+; NEON-NEXT: ext v0.16b, v4.16b, v4.16b, #8
+; NEON-NEXT: csinc x12, x12, xzr, ne
+; NEON-NEXT: sdiv x9, x10, x8
+; NEON-NEXT: shl v0.2s, v0.2s, #31
+; NEON-NEXT: cmlt v0.2s, v0.2s, #0
+; NEON-NEXT: mov w14, v0.s[1]
+; NEON-NEXT: cmp w14, #0
+; NEON-NEXT: fmov w14, s0
+; NEON-NEXT: csinc x15, x15, xzr, ne
+; NEON-NEXT: sdiv x13, x11, x12
+; NEON-NEXT: msub x8, x9, x8, x10
+; NEON-NEXT: cmp w14, #0
+; NEON-NEXT: csinc x14, x16, xzr, ne
+; NEON-NEXT: fmov x16, d1
+; NEON-NEXT: sdiv x17, x16, x14
+; NEON-NEXT: msub x9, x13, x12, x11
+; NEON-NEXT: fmov d0, x9
+; NEON-NEXT: mov v0.d[1], x8
+; NEON-NEXT: sdiv x0, x18, x15
+; NEON-NEXT: msub x10, x17, x14, x16
+; NEON-NEXT: fmov d1, x10
+; NEON-NEXT: msub x11, x0, x15, x18
+; NEON-NEXT: mov v1.d[1], x11
+; NEON-NEXT: ret
+;
+; SVE-LABEL: srem_v4i64:
+; SVE: // %bb.0:
+; SVE-NEXT: ushll v4.4s, v4.4h, #0
+; SVE-NEXT: mov x9, v2.d[1]
+; SVE-NEXT: // kill: def $q1 killed $q1 def $z1
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: ptrue p0.d, vl2
+; SVE-NEXT: shl v5.2s, v4.2s, #31
+; SVE-NEXT: ext v4.16b, v4.16b, v4.16b, #8
+; SVE-NEXT: cmlt v5.2s, v5.2s, #0
+; SVE-NEXT: shl v4.2s, v4.2s, #31
+; SVE-NEXT: mov w8, v5.s[1]
+; SVE-NEXT: fmov w10, s5
+; SVE-NEXT: cmlt v4.2s, v4.2s, #0
+; SVE-NEXT: cmp w8, #0
+; SVE-NEXT: fmov x8, d2
+; SVE-NEXT: csinc x9, x9, xzr, ne
+; SVE-NEXT: cmp w10, #0
+; SVE-NEXT: fmov w10, s4
+; SVE-NEXT: csinc x8, x8, xzr, ne
+; SVE-NEXT: fmov d2, x8
+; SVE-NEXT: mov w8, v4.s[1]
+; SVE-NEXT: mov v2.d[1], x9
+; SVE-NEXT: mov x9, v3.d[1]
+; SVE-NEXT: cmp w8, #0
+; SVE-NEXT: fmov x8, d3
+; SVE-NEXT: csinc x9, x9, xzr, ne
+; SVE-NEXT: cmp w10, #0
+; SVE-NEXT: movprfx z5, z0
+; SVE-NEXT: sdiv z5.d, p0/m, z5.d, z2.d
+; SVE-NEXT: csinc x8, x8, xzr, ne
+; SVE-NEXT: fmov d3, x8
+; SVE-NEXT: mov v3.d[1], x9
+; SVE-NEXT: movprfx z4, z1
+; SVE-NEXT: sdiv z4.d, p0/m, z4.d, z3.d
+; SVE-NEXT: mls z0.d, p0/m, z5.d, z2.d
+; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
+; SVE-NEXT: mls z1.d, p0/m, z4.d, z3.d
+; SVE-NEXT: // kill: def $q1 killed $q1 killed $z1
+; SVE-NEXT: ret
+ %res = call <4 x i64> @llvm.masked.srem(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
+ ret <4 x i64> %res
+}
+
+; Widening
+define <2 x i32> @srem_v2i32(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m) {
+; NEON-LABEL: srem_v2i32:
+; NEON: // %bb.0:
+; NEON-NEXT: shl v2.2s, v2.2s, #31
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: fmov w8, s0
+; NEON-NEXT: mov w11, v0.s[1]
+; NEON-NEXT: cmlt v2.2s, v2.2s, #0
+; NEON-NEXT: and v1.8b, v1.8b, v2.8b
+; NEON-NEXT: mvn v2.8b, v2.8b
+; NEON-NEXT: sub v1.2s, v1.2s, v2.2s
+; NEON-NEXT: fmov w9, s1
+; NEON-NEXT: mov w12, v1.s[1]
+; NEON-NEXT: sdiv w10, w8, w9
+; NEON-NEXT: sdiv w13, w11, w12
+; NEON-NEXT: msub w8, w10, w9, w8
+; NEON-NEXT: fmov s0, w8
+; NEON-NEXT: msub w9, w13, w12, w11
+; NEON-NEXT: mov v0.s[1], w9
+; NEON-NEXT: // kill: def $d0 killed $d0 killed $q0
+; NEON-NEXT: ret
+;
+; SVE-LABEL: srem_v2i32:
+; SVE: // %bb.0:
+; SVE-NEXT: shl v2.2s, v2.2s, #31
+; SVE-NEXT: ptrue p0.s, vl2
+; SVE-NEXT: // kill: def $d0 killed $d0 def $z0
+; SVE-NEXT: cmlt v2.2s, v2.2s, #0
+; SVE-NEXT: and v1.8b, v1.8b, v2.8b
+; SVE-NEXT: mvn v2.8b, v2.8b
+; SVE-NEXT: sub v1.2s, v1.2s, v2.2s
+; SVE-NEXT: movprfx z2, z0
+; SVE-NEXT: sdiv z2.s, p0/m, z2.s, z1.s
+; SVE-NEXT: mls v0.2s, v2.2s, v1.2s
+; SVE-NEXT: // kill: def $d0 killed $d0 killed $z0
+; SVE-NEXT: ret
+ %res = call <2 x i32> @llvm.masked.srem(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m)
+ ret <2 x i32> %res
+}
+
+; Promotion
+define <4 x i16> @srem_v4i16(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m) {
+; NEON-LABEL: srem_v4i16:
+; NEON: // %bb.0:
+; NEON-NEXT: shl v2.4h, v2.4h, #15
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: smov w11, v0.h[0]
+; NEON-NEXT: smov w8, v0.h[1]
+; NEON-NEXT: smov w14, v0.h[2]
+; NEON-NEXT: smov w17, v0.h[3]
+; NEON-NEXT: cmlt v2.4h, v2.4h, #0
+; NEON-NEXT: and v1.8b, v1.8b, v2.8b
+; NEON-NEXT: mvn v2.8b, v2.8b
+; NEON-NEXT: sub v1.4h, v1.4h, v2.4h
+; NEON-NEXT: smov w12, v1.h[0]
+; NEON-NEXT: smov w9, v1.h[1]
+; NEON-NEXT: smov w15, v1.h[2]
+; NEON-NEXT: smov w18, v1.h[3]
+; NEON-NEXT: sdiv w13, w11, w12
+; NEON-NEXT: sdiv w10, w8, w9
+; NEON-NEXT: msub w11, w13, w12, w11
+; NEON-NEXT: fmov s0, w11
+; NEON-NEXT: sdiv w16, w14, w15
+; NEON-NEXT: msub w8, w10, w9, w8
+; NEON-NEXT: mov v0.h[1], w8
+; NEON-NEXT: sdiv w9, w17, w18
+; NEON-NEXT: msub w8, w16, w15, w14
+; NEON-NEXT: mov v0.h[2], w8
+; NEON-NEXT: msub w8, w9, w18, w17
+; NEON-NEXT: mov v0.h[3], w8
+; NEON-NEXT: // kill: def $d0 killed $d0 killed $q0
+; NEON-NEXT: ret
+;
+; SVE-LABEL: srem_v4i16:
+; SVE: // %bb.0:
+; SVE-NEXT: shl v2.4h, v2.4h, #15
+; SVE-NEXT: ptrue p0.s, vl4
+; SVE-NEXT: cmlt v2.4h, v2.4h, #0
+; SVE-NEXT: and v1.8b, v1.8b, v2.8b
+; SVE-NEXT: mvn v2.8b, v2.8b
+; SVE-NEXT: sub v1.4h, v1.4h, v2.4h
+; SVE-NEXT: sshll v2.4s, v0.4h, #0
+; SVE-NEXT: sshll v3.4s, v1.4h, #0
+; SVE-NEXT: sdiv z2.s, p0/m, z2.s, z3.s
+; SVE-NEXT: xtn v2.4h, v2.4s
+; SVE-NEXT: mls v0.4h, v2.4h, v1.4h
+; SVE-NEXT: ret
+ %res = call <4 x i16> @llvm.masked.srem(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m)
+ ret <4 x i16> %res
+}
+
+; Scalarization
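+; A single-element vector is scalarized outright: the one mask bit selects
+; the divisor or 1 (the tst + csinc below), then a scalar sdiv/msub pair
+; forms the remainder. Roughly (a sketch, illustrative names):
+;   %d = select i1 %m0, i64 %y0, i64 1
+;   %q = sdiv i64 %x0, %d
+;   %p = mul i64 %q, %d
+;   %r = sub i64 %x0, %p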
+define <1 x i64> @srem_v1i64(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m) {
+; NEON-LABEL: srem_v1i64:
+; NEON: // %bb.0:
+; NEON-NEXT: // kill: def $d1 killed $d1 def $q1
+; NEON-NEXT: fmov x8, d1
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: fmov x9, d0
+; NEON-NEXT: tst w0, #0x1
+; NEON-NEXT: csinc x8, x8, xzr, ne
+; NEON-NEXT: sdiv x10, x9, x8
+; NEON-NEXT: msub x8, x10, x8, x9
+; NEON-NEXT: fmov d0, x8
+; NEON-NEXT: ret
+;
+; SVE-LABEL: srem_v1i64:
+; SVE: // %bb.0:
+; SVE-NEXT: // kill: def $d1 killed $d1 def $q1
+; SVE-NEXT: fmov x8, d1
+; SVE-NEXT: // kill: def $d0 killed $d0 def $q0
+; SVE-NEXT: fmov x9, d0
+; SVE-NEXT: tst w0, #0x1
+; SVE-NEXT: csinc x8, x8, xzr, ne
+; SVE-NEXT: sdiv x10, x9, x8
+; SVE-NEXT: msub x8, x10, x8, x9
+; SVE-NEXT: fmov d0, x8
+; SVE-NEXT: ret
+ %res = call <1 x i64> @llvm.masked.srem(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m)
+ ret <1 x i64> %res
+}
+
+; Expansion
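+; i128 has no hardware divide, so each lane expands to a __modti3 libcall.
+; The csel/csinc pair builds the guarded 128-bit divisor (hi = 0, lo = 1
+; when the lane is masked off) before the call. Roughly (a sketch,
+; illustrative names):
+;   %d = select i1 %m0, i128 %y0, i128 1
+;   %r = srem i128 %x0, %d   ; lowered to the __modti3 call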
+define <2 x i128> @srem_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
+; NEON-LABEL: srem_v2i128:
+; NEON: // %bb.0:
+; NEON-NEXT: stp x30, x25, [sp, #-64]! // 16-byte Folded Spill
+; NEON-NEXT: stp x24, x23, [sp, #16] // 16-byte Folded Spill
+; NEON-NEXT: stp x22, x21, [sp, #32] // 16-byte Folded Spill
+; NEON-NEXT: stp x20, x19, [sp, #48] // 16-byte Folded Spill
+; NEON-NEXT: .cfi_def_cfa_offset 64
+; NEON-NEXT: .cfi_offset w19, -8
+; NEON-NEXT: .cfi_offset w20, -16
+; NEON-NEXT: .cfi_offset w21, -24
+; NEON-NEXT: .cfi_offset w22, -32
+; NEON-NEXT: .cfi_offset w23, -40
+; NEON-NEXT: .cfi_offset w24, -48
+; NEON-NEXT: .cfi_offset w25, -56
+; NEON-NEXT: .cfi_offset w30, -64
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: fmov w8, s0
+; NEON-NEXT: mov x21, x3
+; NEON-NEXT: mov x22, x2
+; NEON-NEXT: mov x19, x7
+; NEON-NEXT: mov x20, x6
+; NEON-NEXT: mov w25, v0.s[1]
+; NEON-NEXT: tst w8, #0x1
+; NEON-NEXT: csel x3, x5, xzr, ne
+; NEON-NEXT: csinc x2, x4, xzr, ne
+; NEON-NEXT: bl __modti3
+; NEON-NEXT: tst w25, #0x1
+; NEON-NEXT: mov x23, x0
+; NEON-NEXT: mov x24, x1
+; NEON-NEXT: csel x3, x19, xzr, ne
+; NEON-NEXT: csinc x2, x20, xzr, ne
+; NEON-NEXT: mov x0, x22
+; NEON-NEXT: mov x1, x21
+; NEON-NEXT: bl __modti3
+; NEON-NEXT: mov x2, x0
+; NEON-NEXT: mov x3, x1
+; NEON-NEXT: mov x0, x23
+; NEON-NEXT: mov x1, x24
+; NEON-NEXT: ldp x20, x19, [sp, #48] // 16-byte Folded Reload
+; NEON-NEXT: ldp x22, x21, [sp, #32] // 16-byte Folded Reload
+; NEON-NEXT: ldp x24, x23, [sp, #16] // 16-byte Folded Reload
+; NEON-NEXT: ldp x30, x25, [sp], #64 // 16-byte Folded Reload
+; NEON-NEXT: ret
+;
+; SVE-LABEL: srem_v2i128:
+; SVE: // %bb.0:
+; SVE-NEXT: stp x30, x25, [sp, #-64]! // 16-byte Folded Spill
+; SVE-NEXT: stp x24, x23, [sp, #16] // 16-byte Folded Spill
+; SVE-NEXT: stp x22, x21, [sp, #32] // 16-byte Folded Spill
+; SVE-NEXT: stp x20, x19, [sp, #48] // 16-byte Folded Spill
+; SVE-NEXT: .cfi_def_cfa_offset 64
+; SVE-NEXT: .cfi_offset w19, -8
+; SVE-NEXT: .cfi_offset w20, -16
+; SVE-NEXT: .cfi_offset w21, -24
+; SVE-NEXT: .cfi_offset w22, -32
+; SVE-NEXT: .cfi_offset w23, -40
+; SVE-NEXT: .cfi_offset w24, -48
+; SVE-NEXT: .cfi_offset w25, -56
+; SVE-NEXT: .cfi_offset w30, -64
+; SVE-NEXT: // kill: def $d0 killed $d0 def $q0
+; SVE-NEXT: fmov w8, s0
+; SVE-NEXT: mov x21, x3
+; SVE-NEXT: mov x22, x2
+; SVE-NEXT: mov x19, x7
+; SVE-NEXT: mov x20, x6
+; SVE-NEXT: mov w25, v0.s[1]
+; SVE-NEXT: tst w8, #0x1
+; SVE-NEXT: csel x3, x5, xzr, ne
+; SVE-NEXT: csinc x2, x4, xzr, ne
+; SVE-NEXT: bl __modti3
+; SVE-NEXT: tst w25, #0x1
+; SVE-NEXT: mov x23, x0
+; SVE-NEXT: mov x24, x1
+; SVE-NEXT: csel x3, x19, xzr, ne
+; SVE-NEXT: csinc x2, x20, xzr, ne
+; SVE-NEXT: mov x0, x22
+; SVE-NEXT: mov x1, x21
+; SVE-NEXT: bl __modti3
+; SVE-NEXT: mov x2, x0
+; SVE-NEXT: mov x3, x1
+; SVE-NEXT: mov x0, x23
+; SVE-NEXT: mov x1, x24
+; SVE-NEXT: ldp x20, x19, [sp, #48] // 16-byte Folded Reload
+; SVE-NEXT: ldp x22, x21, [sp, #32] // 16-byte Folded Reload
+; SVE-NEXT: ldp x24, x23, [sp, #16] // 16-byte Folded Reload
+; SVE-NEXT: ldp x30, x25, [sp], #64 // 16-byte Folded Reload
+; SVE-NEXT: ret
+ %res = call <2 x i128> @llvm.masked.srem(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m)
+ ret <2 x i128> %res
+}
+
+; Promotion and widening
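+; <3 x i10> needs both legalization steps: the vector is widened to four
+; lanes and the i10 elements are promoted to i16 (the shl #6 / sshr #6
+; pairs below sign-extend i10 within the wider lane). Roughly (a sketch):
+;   %xw = sext <3 x i10> %x to <3 x i16>   ; then widened to <4 x i16>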
+define <3 x i10> @srem_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
+; NEON-LABEL: srem_v3i10:
+; NEON: // %bb.0:
+; NEON-NEXT: fmov s0, w6
+; NEON-NEXT: fmov s1, w3
+; NEON-NEXT: ldr w8, [sp]
+; NEON-NEXT: fmov s2, w0
+; NEON-NEXT: mov v0.h[1], w7
+; NEON-NEXT: mov v1.h[1], w4
+; NEON-NEXT: mov v2.h[1], w1
+; NEON-NEXT: mov v0.h[2], w8
+; NEON-NEXT: mov v1.h[2], w5
+; NEON-NEXT: mov v2.h[2], w2
+; NEON-NEXT: shl v0.4h, v0.4h, #15
+; NEON-NEXT: shl v1.4h, v1.4h, #6
+; NEON-NEXT: shl v2.4h, v2.4h, #6
+; NEON-NEXT: cmlt v0.4h, v0.4h, #0
+; NEON-NEXT: sshr v1.4h, v1.4h, #6
+; NEON-NEXT: sshr v2.4h, v2.4h, #6
+; NEON-NEXT: and v1.8b, v1.8b, v0.8b
+; NEON-NEXT: mvn v0.8b, v0.8b
+; NEON-NEXT: smov w8, v2.h[0]
+; NEON-NEXT: smov w11, v2.h[1]
+; NEON-NEXT: smov w14, v2.h[2]
+; NEON-NEXT: sub v0.4h, v1.4h, v0.4h
+; NEON-NEXT: smov w9, v0.h[0]
+; NEON-NEXT: smov w12, v0.h[1]
+; NEON-NEXT: smov w15, v0.h[2]
+; NEON-NEXT: sdiv w10, w8, w9
+; NEON-NEXT: sdiv w13, w11, w12
+; NEON-NEXT: msub w0, w10, w9, w8
+; NEON-NEXT: sdiv w16, w14, w15
+; NEON-NEXT: msub w1, w13, w12, w11
+; NEON-NEXT: msub w2, w16, w15, w14
+; NEON-NEXT: ret
+;
+; SVE-LABEL: srem_v3i10:
+; SVE: // %bb.0:
+; SVE-NEXT: fmov s0, w6
+; SVE-NEXT: fmov s1, w3
+; SVE-NEXT: ldr w8, [sp]
+; SVE-NEXT: fmov s2, w0
+; SVE-NEXT: ptrue p0.s, vl4
+; SVE-NEXT: mov v0.h[1], w7
+; SVE-NEXT: mov v1.h[1], w4
+; SVE-NEXT: mov v2.h[1], w1
+; SVE-NEXT: mov v0.h[2], w8
+; SVE-NEXT: mov v1.h[2], w5
+; SVE-NEXT: mov v2.h[2], w2
+; SVE-NEXT: shl v0.4h, v0.4h, #15
+; SVE-NEXT: shl v1.4h, v1.4h, #6
+; SVE-NEXT: shl v2.4h, v2.4h, #6
+; SVE-NEXT: cmlt v0.4h, v0.4h, #0
+; SVE-NEXT: sshr v1.4h, v1.4h, #6
+; SVE-NEXT: sshr v2.4h, v2.4h, #6
+; SVE-NEXT: and v1.8b, v1.8b, v0.8b
+; SVE-NEXT: mvn v0.8b, v0.8b
+; SVE-NEXT: sshll v3.4s, v2.4h, #0
+; SVE-NEXT: sub v0.4h, v1.4h, v0.4h
+; SVE-NEXT: sshll v1.4s, v0.4h, #0
+; SVE-NEXT: sdivr z1.s, p0/m, z1.s, z3.s
+; SVE-NEXT: xtn v1.4h, v1.4s
+; SVE-NEXT: mls v2.4h, v1.4h, v0.4h
+; SVE-NEXT: umov w0, v2.h[0]
+; SVE-NEXT: umov w1, v2.h[1]
+; SVE-NEXT: umov w2, v2.h[2]
+; SVE-NEXT: ret
+ %res = call <3 x i10> @llvm.masked.srem(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m)
+ ret <3 x i10> %res
+}
diff --git a/llvm/test/CodeGen/AArch64/masked-srem-scalable.ll b/llvm/test/CodeGen/AArch64/masked-srem-scalable.ll
new file mode 100644
index 0000000000000..aebe7a7bf3b20
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/masked-srem-scalable.ll
@@ -0,0 +1,86 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple aarch64 -mattr=+sve < %s | FileCheck %s
+
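+; Throughout this file the lowering replaces masked-off divisor lanes with 1
+; and then performs a whole-vector divide (mov #1 + sel + sdiv in the
+; checks); srem is recovered as x - (x/y)*y via mls. For the nxv4i32 case
+; this is roughly (a sketch, illustrative names):
+;   %safe = select <vscale x 4 x i1> %m, <vscale x 4 x i32> %y,
+;                  <vscale x 4 x i32> splat (i32 1)
+;   %res  = srem <vscale x 4 x i32> %x, %safe
+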
+define <vscale x 4 x i16> @srem_nxv4i16(<vscale x 4 x i16> %x, <vscale x 4 x i16> %y, <vscale x 4 x i1> %m) {
+; CHECK-LABEL: srem_nxv4i16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ptrue p1.s
+; CHECK-NEXT: mov z2.s, #1 // =0x1
+; CHECK-NEXT: sxth z1.s, p1/m, z1.s
+; CHECK-NEXT: sxth z0.s, p1/m, z0.s
+; CHECK-NEXT: sel z1.s, p0, z1.s, z2.s
+; CHECK-NEXT: movprfx z2, z0
+; CHECK-NEXT: sdiv z2.s, p1/m, z2.s, z1.s
+; CHECK-NEXT: mls z0.s, p1/m, z2.s, z1.s
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i16> @llvm.masked.srem(<vscale x 4 x i16> %x, <vscale x 4 x i16> %y, <vscale x 4 x i1> %m)
+ ret <vscale x 4 x i16> %res
+}
+
+define <vscale x 4 x i32> @srem_nxv4i32(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %m) {
+; CHECK-LABEL: srem_nxv4i32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z2.s, #1 // =0x1
+; CHECK-NEXT: sel z1.s, p0, z1.s, z2.s
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: movprfx z2, z0
+; CHECK-NEXT: sdiv z2.s, p0/m, z2.s, z1.s
+; CHECK-NEXT: mls z0.s, p0/m, z2.s, z1.s
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i32> @llvm.masked.srem(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %m)
+ ret <vscale x 4 x i32> %res
+}
+
+define <vscale x 8 x i32> @srem_nxv8i32(<vscale x 8 x i32> %x, <vscale x 8 x i32> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: srem_nxv8i32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z4.s, #1 // =0x1
+; CHECK-NEXT: punpklo p1.h, p0.b
+; CHECK-NEXT: punpkhi p0.h, p0.b
+; CHECK-NEXT: sel z2.s, p1, z2.s, z4.s
+; CHECK-NEXT: sel z3.s, p0, z3.s, z4.s
+; CHECK-NEXT: ptrue p1.s
+; CHECK-NEXT: movprfx z5, z0
+; CHECK-NEXT: sdiv z5.s, p1/m, z5.s, z2.s
+; CHECK-NEXT: movprfx z4, z1
+; CHECK-NEXT: sdiv z4.s, p1/m, z4.s, z3.s
+; CHECK-NEXT: mls z0.s, p1/m, z5.s, z2.s
+; CHECK-NEXT: mls z1.s, p1/m, z4.s, z3.s
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i32> @llvm.masked.srem(<vscale x 8 x i32> %x, <vscale x 8 x i32> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i32> %res
+}
+
+define <vscale x 8 x i16> @srem_nxv8i16(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: srem_nxv8i16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z2.h, #1 // =0x1
+; CHECK-NEXT: sunpklo z4.s, z0.h
+; CHECK-NEXT: sel z1.h, p0, z1.h, z2.h
+; CHECK-NEXT: sunpkhi z2.s, z0.h
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: sunpkhi z3.s, z1.h
+; CHECK-NEXT: sdiv z2.s, p0/m, z2.s, z3.s
+; CHECK-NEXT: sunpklo z3.s, z1.h
+; CHECK-NEXT: sdivr z3.s, p0/m, z3.s, z4.s
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: uzp1 z2.h, z3.h, z2.h
+; CHECK-NEXT: mls z0.h, p0/m, z2.h, z1.h
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i16> @llvm.masked.srem(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i16> %res
+}
+
+define <vscale x 2 x i64> @srem_nxv2i64(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %m) {
+; CHECK-LABEL: srem_nxv2i64:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z2.d, #1 // =0x1
+; CHECK-NEXT: sel z1.d, p0, z1.d, z2.d
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: movprfx z2, z0
+; CHECK-NEXT: sdiv z2.d, p0/m, z2.d, z1.d
+; CHECK-NEXT: mls z0.d, p0/m, z2.d, z1.d
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i64> @llvm.masked.srem(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %m)
+ ret <vscale x 2 x i64> %res
+}
diff --git a/llvm/test/CodeGen/AArch64/masked-udiv-fixed-length.ll b/llvm/test/CodeGen/AArch64/masked-udiv-fixed-length.ll
new file mode 100644
index 0000000000000..950cebfb4b614
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/masked-udiv-fixed-length.ll
@@ -0,0 +1,468 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple aarch64 < %s | FileCheck %s --check-prefix=NEON
+; RUN: llc -mtriple aarch64 -mattr=+sve < %s | FileCheck %s --check-prefix=SVE
+
+; Legal
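+; "Legal" refers to the type: <4 x i32> needs no type legalization. NEON has
+; no vector integer divide, so the lanes are still divided one by one, while
+; SVE folds the whole operation into a single predicated udiv. Roughly
+; (a sketch, illustrative names):
+;   %safe = select <4 x i1> %m, <4 x i32> %y, <4 x i32> splat (i32 1)
+;   %res  = udiv <4 x i32> %x, %safe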
+define <4 x i32> @udiv_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
+; NEON-LABEL: udiv_v4i32:
+; NEON: // %bb.0:
+; NEON-NEXT: shl v2.4h, v2.4h, #15
+; NEON-NEXT: mov w9, v1.s[1]
+; NEON-NEXT: mov w10, v0.s[1]
+; NEON-NEXT: mov w11, v1.s[2]
+; NEON-NEXT: mov w12, v0.s[2]
+; NEON-NEXT: mov w13, v0.s[3]
+; NEON-NEXT: cmlt v2.4h, v2.4h, #0
+; NEON-NEXT: umov w8, v2.h[1]
+; NEON-NEXT: tst w8, #0xffff
+; NEON-NEXT: csinc w8, w9, wzr, ne
+; NEON-NEXT: umov w9, v2.h[0]
+; NEON-NEXT: udiv w8, w10, w8
+; NEON-NEXT: fmov w10, s1
+; NEON-NEXT: tst w9, #0xffff
+; NEON-NEXT: fmov w9, s0
+; NEON-NEXT: csinc w10, w10, wzr, ne
+; NEON-NEXT: udiv w9, w9, w10
+; NEON-NEXT: umov w10, v2.h[2]
+; NEON-NEXT: tst w10, #0xffff
+; NEON-NEXT: csinc w10, w11, wzr, ne
+; NEON-NEXT: umov w11, v2.h[3]
+; NEON-NEXT: udiv w10, w12, w10
+; NEON-NEXT: mov w12, v1.s[3]
+; NEON-NEXT: fmov s0, w9
+; NEON-NEXT: tst w11, #0xffff
+; NEON-NEXT: mov v0.s[1], w8
+; NEON-NEXT: csinc w9, w12, wzr, ne
+; NEON-NEXT: udiv w8, w13, w9
+; NEON-NEXT: mov v0.s[2], w10
+; NEON-NEXT: mov v0.s[3], w8
+; NEON-NEXT: ret
+;
+; SVE-LABEL: udiv_v4i32:
+; SVE: // %bb.0:
+; SVE-NEXT: shl v2.4h, v2.4h, #15
+; SVE-NEXT: mov w9, v1.s[1]
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: mov w11, v1.s[2]
+; SVE-NEXT: ptrue p0.s, vl4
+; SVE-NEXT: cmlt v2.4h, v2.4h, #0
+; SVE-NEXT: umov w8, v2.h[1]
+; SVE-NEXT: umov w10, v2.h[0]
+; SVE-NEXT: tst w8, #0xffff
+; SVE-NEXT: csinc w8, w9, wzr, ne
+; SVE-NEXT: fmov w9, s1
+; SVE-NEXT: tst w10, #0xffff
+; SVE-NEXT: umov w10, v2.h[2]
+; SVE-NEXT: csinc w9, w9, wzr, ne
+; SVE-NEXT: fmov s3, w9
+; SVE-NEXT: mov w9, v1.s[3]
+; SVE-NEXT: tst w10, #0xffff
+; SVE-NEXT: csinc w10, w11, wzr, ne
+; SVE-NEXT: mov v3.s[1], w8
+; SVE-NEXT: umov w8, v2.h[3]
+; SVE-NEXT: mov v3.s[2], w10
+; SVE-NEXT: tst w8, #0xffff
+; SVE-NEXT: csinc w8, w9, wzr, ne
+; SVE-NEXT: mov v3.s[3], w8
+; SVE-NEXT: udiv z0.s, p0/m, z0.s, z3.s
+; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
+; SVE-NEXT: ret
+ %res = call <4 x i32> @llvm.masked.udiv(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
+ ret <4 x i32> %res
+}
+
+define <2 x i64> @udiv_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
+; NEON-LABEL: udiv_v2i64:
+; NEON: // %bb.0:
+; NEON-NEXT: shl v2.2s, v2.2s, #31
+; NEON-NEXT: mov x9, v1.d[1]
+; NEON-NEXT: cmlt v2.2s, v2.2s, #0
+; NEON-NEXT: mov w8, v2.s[1]
+; NEON-NEXT: fmov w10, s2
+; NEON-NEXT: cmp w8, #0
+; NEON-NEXT: fmov x8, d1
+; NEON-NEXT: csinc x9, x9, xzr, ne
+; NEON-NEXT: cmp w10, #0
+; NEON-NEXT: fmov x10, d0
+; NEON-NEXT: csinc x8, x8, xzr, ne
+; NEON-NEXT: udiv x8, x10, x8
+; NEON-NEXT: mov x10, v0.d[1]
+; NEON-NEXT: udiv x9, x10, x9
+; NEON-NEXT: fmov d0, x8
+; NEON-NEXT: mov v0.d[1], x9
+; NEON-NEXT: ret
+;
+; SVE-LABEL: udiv_v2i64:
+; SVE: // %bb.0:
+; SVE-NEXT: shl v2.2s, v2.2s, #31
+; SVE-NEXT: mov x9, v1.d[1]
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: fmov x10, d1
+; SVE-NEXT: ptrue p0.d, vl2
+; SVE-NEXT: cmlt v2.2s, v2.2s, #0
+; SVE-NEXT: mov w8, v2.s[1]
+; SVE-NEXT: cmp w8, #0
+; SVE-NEXT: fmov w8, s2
+; SVE-NEXT: csinc x9, x9, xzr, ne
+; SVE-NEXT: cmp w8, #0
+; SVE-NEXT: csinc x8, x10, xzr, ne
+; SVE-NEXT: fmov d1, x8
+; SVE-NEXT: mov v1.d[1], x9
+; SVE-NEXT: udiv z0.d, p0/m, z0.d, z1.d
+; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
+; SVE-NEXT: ret
+ %res = call <2 x i64> @llvm.masked.udiv(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m)
+ ret <2 x i64> %res
+}
+
+; Splitting
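+; <4 x i64> does not fit in a 128-bit register, so the operation is split
+; into two <2 x i64> halves that are each legalized as above. Roughly
+; (a sketch, illustrative names):
+;   %lo = shufflevector <4 x i64> %x, <4 x i64> poison, <2 x i32> <i32 0, i32 1>
+;   %hi = shufflevector <4 x i64> %x, <4 x i64> poison, <2 x i32> <i32 2, i32 3>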
+define <4 x i64> @udiv_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
+; NEON-LABEL: udiv_v4i64:
+; NEON: // %bb.0:
+; NEON-NEXT: ushll v4.4s, v4.4h, #0
+; NEON-NEXT: mov x9, v2.d[1]
+; NEON-NEXT: mov x10, v0.d[1]
+; NEON-NEXT: mov x11, v3.d[1]
+; NEON-NEXT: fmov x12, d3
+; NEON-NEXT: shl v5.2s, v4.2s, #31
+; NEON-NEXT: cmlt v5.2s, v5.2s, #0
+; NEON-NEXT: mov w8, v5.s[1]
+; NEON-NEXT: cmp w8, #0
+; NEON-NEXT: csinc x8, x9, xzr, ne
+; NEON-NEXT: fmov w9, s5
+; NEON-NEXT: udiv x8, x10, x8
+; NEON-NEXT: fmov x10, d2
+; NEON-NEXT: cmp w9, #0
+; NEON-NEXT: fmov x9, d0
+; NEON-NEXT: ext v0.16b, v4.16b, v4.16b, #8
+; NEON-NEXT: csinc x10, x10, xzr, ne
+; NEON-NEXT: shl v0.2s, v0.2s, #31
+; NEON-NEXT: cmlt v0.2s, v0.2s, #0
+; NEON-NEXT: udiv x9, x9, x10
+; NEON-NEXT: mov w10, v0.s[1]
+; NEON-NEXT: cmp w10, #0
+; NEON-NEXT: fmov w10, s0
+; NEON-NEXT: csinc x11, x11, xzr, ne
+; NEON-NEXT: cmp w10, #0
+; NEON-NEXT: csinc x10, x12, xzr, ne
+; NEON-NEXT: fmov x12, d1
+; NEON-NEXT: udiv x10, x12, x10
+; NEON-NEXT: mov x12, v1.d[1]
+; NEON-NEXT: fmov d0, x9
+; NEON-NEXT: mov v0.d[1], x8
+; NEON-NEXT: udiv x11, x12, x11
+; NEON-NEXT: fmov d1, x10
+; NEON-NEXT: mov v1.d[1], x11
+; NEON-NEXT: ret
+;
+; SVE-LABEL: udiv_v4i64:
+; SVE: // %bb.0:
+; SVE-NEXT: ushll v4.4s, v4.4h, #0
+; SVE-NEXT: mov x9, v2.d[1]
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: // kill: def $q1 killed $q1 def $z1
+; SVE-NEXT: ptrue p0.d, vl2
+; SVE-NEXT: shl v5.2s, v4.2s, #31
+; SVE-NEXT: cmlt v5.2s, v5.2s, #0
+; SVE-NEXT: mov w8, v5.s[1]
+; SVE-NEXT: fmov w10, s5
+; SVE-NEXT: cmp w8, #0
+; SVE-NEXT: fmov x8, d2
+; SVE-NEXT: csinc x9, x9, xzr, ne
+; SVE-NEXT: cmp w10, #0
+; SVE-NEXT: csinc x8, x8, xzr, ne
+; SVE-NEXT: fmov d2, x8
+; SVE-NEXT: mov v2.d[1], x9
+; SVE-NEXT: mov x9, v3.d[1]
+; SVE-NEXT: udiv z0.d, p0/m, z0.d, z2.d
+; SVE-NEXT: ext v2.16b, v4.16b, v4.16b, #8
+; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
+; SVE-NEXT: shl v2.2s, v2.2s, #31
+; SVE-NEXT: cmlt v2.2s, v2.2s, #0
+; SVE-NEXT: mov w8, v2.s[1]
+; SVE-NEXT: fmov w10, s2
+; SVE-NEXT: cmp w8, #0
+; SVE-NEXT: fmov x8, d3
+; SVE-NEXT: csinc x9, x9, xzr, ne
+; SVE-NEXT: cmp w10, #0
+; SVE-NEXT: csinc x8, x8, xzr, ne
+; SVE-NEXT: fmov d2, x8
+; SVE-NEXT: mov v2.d[1], x9
+; SVE-NEXT: udiv z1.d, p0/m, z1.d, z2.d
+; SVE-NEXT: // kill: def $q1 killed $q1 killed $z1
+; SVE-NEXT: ret
+ %res = call <4 x i64> @llvm.masked.udiv(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
+ ret <4 x i64> %res
+}
+
+; Widening
+define <2 x i32> @udiv_v2i32(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m) {
+; NEON-LABEL: udiv_v2i32:
+; NEON: // %bb.0:
+; NEON-NEXT: shl v2.2s, v2.2s, #31
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: fmov w8, s0
+; NEON-NEXT: cmlt v2.2s, v2.2s, #0
+; NEON-NEXT: and v1.8b, v1.8b, v2.8b
+; NEON-NEXT: mvn v2.8b, v2.8b
+; NEON-NEXT: sub v1.2s, v1.2s, v2.2s
+; NEON-NEXT: fmov w9, s1
+; NEON-NEXT: mov w10, v1.s[1]
+; NEON-NEXT: udiv w8, w8, w9
+; NEON-NEXT: mov w9, v0.s[1]
+; NEON-NEXT: udiv w9, w9, w10
+; NEON-NEXT: fmov s0, w8
+; NEON-NEXT: mov v0.s[1], w9
+; NEON-NEXT: // kill: def $d0 killed $d0 killed $q0
+; NEON-NEXT: ret
+;
+; SVE-LABEL: udiv_v2i32:
+; SVE: // %bb.0:
+; SVE-NEXT: shl v2.2s, v2.2s, #31
+; SVE-NEXT: ptrue p0.s, vl2
+; SVE-NEXT: // kill: def $d0 killed $d0 def $z0
+; SVE-NEXT: cmlt v2.2s, v2.2s, #0
+; SVE-NEXT: and v1.8b, v1.8b, v2.8b
+; SVE-NEXT: mvn v2.8b, v2.8b
+; SVE-NEXT: sub v1.2s, v1.2s, v2.2s
+; SVE-NEXT: udiv z0.s, p0/m, z0.s, z1.s
+; SVE-NEXT: // kill: def $d0 killed $d0 killed $z0
+; SVE-NEXT: ret
+ %res = call <2 x i32> @llvm.masked.udiv(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m)
+ ret <2 x i32> %res
+}
+
+; Promotion
+define <4 x i16> @udiv_v4i16(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m) {
+; NEON-LABEL: udiv_v4i16:
+; NEON: // %bb.0:
+; NEON-NEXT: shl v2.4h, v2.4h, #15
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: umov w8, v0.h[1]
+; NEON-NEXT: cmlt v2.4h, v2.4h, #0
+; NEON-NEXT: and v1.8b, v1.8b, v2.8b
+; NEON-NEXT: mvn v2.8b, v2.8b
+; NEON-NEXT: sub v1.4h, v1.4h, v2.4h
+; NEON-NEXT: umov w9, v1.h[1]
+; NEON-NEXT: umov w10, v1.h[0]
+; NEON-NEXT: umov w11, v1.h[2]
+; NEON-NEXT: umov w12, v1.h[3]
+; NEON-NEXT: udiv w8, w8, w9
+; NEON-NEXT: umov w9, v0.h[0]
+; NEON-NEXT: udiv w9, w9, w10
+; NEON-NEXT: umov w10, v0.h[2]
+; NEON-NEXT: udiv w10, w10, w11
+; NEON-NEXT: umov w11, v0.h[3]
+; NEON-NEXT: fmov s0, w9
+; NEON-NEXT: mov v0.h[1], w8
+; NEON-NEXT: udiv w8, w11, w12
+; NEON-NEXT: mov v0.h[2], w10
+; NEON-NEXT: mov v0.h[3], w8
+; NEON-NEXT: // kill: def $d0 killed $d0 killed $q0
+; NEON-NEXT: ret
+;
+; SVE-LABEL: udiv_v4i16:
+; SVE: // %bb.0:
+; SVE-NEXT: shl v2.4h, v2.4h, #15
+; SVE-NEXT: ushll v0.4s, v0.4h, #0
+; SVE-NEXT: ptrue p0.s, vl4
+; SVE-NEXT: cmlt v2.4h, v2.4h, #0
+; SVE-NEXT: and v1.8b, v1.8b, v2.8b
+; SVE-NEXT: mvn v2.8b, v2.8b
+; SVE-NEXT: sub v1.4h, v1.4h, v2.4h
+; SVE-NEXT: ushll v1.4s, v1.4h, #0
+; SVE-NEXT: udiv z0.s, p0/m, z0.s, z1.s
+; SVE-NEXT: xtn v0.4h, v0.4s
+; SVE-NEXT: ret
+ %res = call <4 x i16> @llvm.masked.udiv(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m)
+ ret <4 x i16> %res
+}
+
+; Scalarization
+define <1 x i64> @udiv_v1i64(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m) {
+; NEON-LABEL: udiv_v1i64:
+; NEON: // %bb.0:
+; NEON-NEXT: // kill: def $d1 killed $d1 def $q1
+; NEON-NEXT: fmov x8, d1
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: fmov x9, d0
+; NEON-NEXT: tst w0, #0x1
+; NEON-NEXT: csinc x8, x8, xzr, ne
+; NEON-NEXT: udiv x8, x9, x8
+; NEON-NEXT: fmov d0, x8
+; NEON-NEXT: ret
+;
+; SVE-LABEL: udiv_v1i64:
+; SVE: // %bb.0:
+; SVE-NEXT: // kill: def $d1 killed $d1 def $q1
+; SVE-NEXT: fmov x8, d1
+; SVE-NEXT: // kill: def $d0 killed $d0 def $q0
+; SVE-NEXT: fmov x9, d0
+; SVE-NEXT: tst w0, #0x1
+; SVE-NEXT: csinc x8, x8, xzr, ne
+; SVE-NEXT: udiv x8, x9, x8
+; SVE-NEXT: fmov d0, x8
+; SVE-NEXT: ret
+ %res = call <1 x i64> @llvm.masked.udiv(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m)
+ ret <1 x i64> %res
+}
+
+; Expansion
+define <2 x i128> @udiv_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
+; NEON-LABEL: udiv_v2i128:
+; NEON: // %bb.0:
+; NEON-NEXT: stp x30, x25, [sp, #-64]! // 16-byte Folded Spill
+; NEON-NEXT: stp x24, x23, [sp, #16] // 16-byte Folded Spill
+; NEON-NEXT: stp x22, x21, [sp, #32] // 16-byte Folded Spill
+; NEON-NEXT: stp x20, x19, [sp, #48] // 16-byte Folded Spill
+; NEON-NEXT: .cfi_def_cfa_offset 64
+; NEON-NEXT: .cfi_offset w19, -8
+; NEON-NEXT: .cfi_offset w20, -16
+; NEON-NEXT: .cfi_offset w21, -24
+; NEON-NEXT: .cfi_offset w22, -32
+; NEON-NEXT: .cfi_offset w23, -40
+; NEON-NEXT: .cfi_offset w24, -48
+; NEON-NEXT: .cfi_offset w25, -56
+; NEON-NEXT: .cfi_offset w30, -64
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: fmov w8, s0
+; NEON-NEXT: mov x21, x3
+; NEON-NEXT: mov x22, x2
+; NEON-NEXT: mov x19, x7
+; NEON-NEXT: mov x20, x6
+; NEON-NEXT: mov w25, v0.s[1]
+; NEON-NEXT: tst w8, #0x1
+; NEON-NEXT: csel x3, x5, xzr, ne
+; NEON-NEXT: csinc x2, x4, xzr, ne
+; NEON-NEXT: bl __udivti3
+; NEON-NEXT: tst w25, #0x1
+; NEON-NEXT: mov x23, x0
+; NEON-NEXT: mov x24, x1
+; NEON-NEXT: csel x3, x19, xzr, ne
+; NEON-NEXT: csinc x2, x20, xzr, ne
+; NEON-NEXT: mov x0, x22
+; NEON-NEXT: mov x1, x21
+; NEON-NEXT: bl __udivti3
+; NEON-NEXT: mov x2, x0
+; NEON-NEXT: mov x3, x1
+; NEON-NEXT: mov x0, x23
+; NEON-NEXT: mov x1, x24
+; NEON-NEXT: ldp x20, x19, [sp, #48] // 16-byte Folded Reload
+; NEON-NEXT: ldp x22, x21, [sp, #32] // 16-byte Folded Reload
+; NEON-NEXT: ldp x24, x23, [sp, #16] // 16-byte Folded Reload
+; NEON-NEXT: ldp x30, x25, [sp], #64 // 16-byte Folded Reload
+; NEON-NEXT: ret
+;
+; SVE-LABEL: udiv_v2i128:
+; SVE: // %bb.0:
+; SVE-NEXT: stp x30, x25, [sp, #-64]! // 16-byte Folded Spill
+; SVE-NEXT: stp x24, x23, [sp, #16] // 16-byte Folded Spill
+; SVE-NEXT: stp x22, x21, [sp, #32] // 16-byte Folded Spill
+; SVE-NEXT: stp x20, x19, [sp, #48] // 16-byte Folded Spill
+; SVE-NEXT: .cfi_def_cfa_offset 64
+; SVE-NEXT: .cfi_offset w19, -8
+; SVE-NEXT: .cfi_offset w20, -16
+; SVE-NEXT: .cfi_offset w21, -24
+; SVE-NEXT: .cfi_offset w22, -32
+; SVE-NEXT: .cfi_offset w23, -40
+; SVE-NEXT: .cfi_offset w24, -48
+; SVE-NEXT: .cfi_offset w25, -56
+; SVE-NEXT: .cfi_offset w30, -64
+; SVE-NEXT: // kill: def $d0 killed $d0 def $q0
+; SVE-NEXT: fmov w8, s0
+; SVE-NEXT: mov x21, x3
+; SVE-NEXT: mov x22, x2
+; SVE-NEXT: mov x19, x7
+; SVE-NEXT: mov x20, x6
+; SVE-NEXT: mov w25, v0.s[1]
+; SVE-NEXT: tst w8, #0x1
+; SVE-NEXT: csel x3, x5, xzr, ne
+; SVE-NEXT: csinc x2, x4, xzr, ne
+; SVE-NEXT: bl __udivti3
+; SVE-NEXT: tst w25, #0x1
+; SVE-NEXT: mov x23, x0
+; SVE-NEXT: mov x24, x1
+; SVE-NEXT: csel x3, x19, xzr, ne
+; SVE-NEXT: csinc x2, x20, xzr, ne
+; SVE-NEXT: mov x0, x22
+; SVE-NEXT: mov x1, x21
+; SVE-NEXT: bl __udivti3
+; SVE-NEXT: mov x2, x0
+; SVE-NEXT: mov x3, x1
+; SVE-NEXT: mov x0, x23
+; SVE-NEXT: mov x1, x24
+; SVE-NEXT: ldp x20, x19, [sp, #48] // 16-byte Folded Reload
+; SVE-NEXT: ldp x22, x21, [sp, #32] // 16-byte Folded Reload
+; SVE-NEXT: ldp x24, x23, [sp, #16] // 16-byte Folded Reload
+; SVE-NEXT: ldp x30, x25, [sp], #64 // 16-byte Folded Reload
+; SVE-NEXT: ret
+ %res = call <2 x i128> @llvm.masked.udiv(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m)
+ ret <2 x i128> %res
+}
+
+; Promotion and widening
+define <3 x i10> @udiv_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
+; NEON-LABEL: udiv_v3i10:
+; NEON: // %bb.0:
+; NEON-NEXT: fmov s0, w6
+; NEON-NEXT: fmov s1, w3
+; NEON-NEXT: ldr w8, [sp]
+; NEON-NEXT: fmov s2, w0
+; NEON-NEXT: mov v0.h[1], w7
+; NEON-NEXT: mov v1.h[1], w4
+; NEON-NEXT: mov v2.h[1], w1
+; NEON-NEXT: mov v0.h[2], w8
+; NEON-NEXT: mov v1.h[2], w5
+; NEON-NEXT: mov v2.h[2], w2
+; NEON-NEXT: shl v0.4h, v0.4h, #15
+; NEON-NEXT: bic v1.4h, #252, lsl #8
+; NEON-NEXT: bic v2.4h, #252, lsl #8
+; NEON-NEXT: cmlt v0.4h, v0.4h, #0
+; NEON-NEXT: umov w9, v2.h[0]
+; NEON-NEXT: and v1.8b, v1.8b, v0.8b
+; NEON-NEXT: mvn v0.8b, v0.8b
+; NEON-NEXT: sub v0.4h, v1.4h, v0.4h
+; NEON-NEXT: umov w8, v0.h[0]
+; NEON-NEXT: and w8, w8, #0x3ff
+; NEON-NEXT: udiv w0, w9, w8
+; NEON-NEXT: umov w8, v0.h[1]
+; NEON-NEXT: umov w9, v2.h[1]
+; NEON-NEXT: and w8, w8, #0x3ff
+; NEON-NEXT: udiv w1, w9, w8
+; NEON-NEXT: umov w8, v0.h[2]
+; NEON-NEXT: umov w9, v2.h[2]
+; NEON-NEXT: and w8, w8, #0x3ff
+; NEON-NEXT: udiv w2, w9, w8
+; NEON-NEXT: ret
+;
+; SVE-LABEL: udiv_v3i10:
+; SVE: // %bb.0:
+; SVE-NEXT: fmov s0, w6
+; SVE-NEXT: fmov s1, w3
+; SVE-NEXT: ldr w8, [sp]
+; SVE-NEXT: fmov s2, w0
+; SVE-NEXT: ptrue p0.s, vl4
+; SVE-NEXT: mov v0.h[1], w7
+; SVE-NEXT: mov v1.h[1], w4
+; SVE-NEXT: mov v2.h[1], w1
+; SVE-NEXT: mov v0.h[2], w8
+; SVE-NEXT: mov v1.h[2], w5
+; SVE-NEXT: mov v2.h[2], w2
+; SVE-NEXT: shl v0.4h, v0.4h, #15
+; SVE-NEXT: bic v1.4h, #252, lsl #8
+; SVE-NEXT: bic v2.4h, #252, lsl #8
+; SVE-NEXT: cmlt v0.4h, v0.4h, #0
+; SVE-NEXT: and v1.8b, v1.8b, v0.8b
+; SVE-NEXT: mvn v0.8b, v0.8b
+; SVE-NEXT: sub v0.4h, v1.4h, v0.4h
+; SVE-NEXT: ushll v1.4s, v2.4h, #0
+; SVE-NEXT: ushll v0.4s, v0.4h, #0
+; SVE-NEXT: udivr z0.s, p0/m, z0.s, z1.s
+; SVE-NEXT: xtn v0.4h, v0.4s
+; SVE-NEXT: umov w0, v0.h[0]
+; SVE-NEXT: umov w1, v0.h[1]
+; SVE-NEXT: umov w2, v0.h[2]
+; SVE-NEXT: ret
+ %res = call <3 x i10> @llvm.masked.udiv(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m)
+ ret <3 x i10> %res
+}
diff --git a/llvm/test/CodeGen/AArch64/masked-udiv-scalable.ll b/llvm/test/CodeGen/AArch64/masked-udiv-scalable.ll
new file mode 100644
index 0000000000000..254320de423f3
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/masked-udiv-scalable.ll
@@ -0,0 +1,74 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple aarch64 -mattr=+sve < %s | FileCheck %s
+
+define <vscale x 4 x i16> @udiv_nxv4i16(<vscale x 4 x i16> %x, <vscale x 4 x i16> %y, <vscale x 4 x i1> %m) {
+; CHECK-LABEL: udiv_nxv4i16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: and z1.s, z1.s, #0xffff
+; CHECK-NEXT: mov z2.s, #1 // =0x1
+; CHECK-NEXT: and z0.s, z0.s, #0xffff
+; CHECK-NEXT: sel z1.s, p0, z1.s, z2.s
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: udiv z0.s, p0/m, z0.s, z1.s
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i16> @llvm.masked.udiv(<vscale x 4 x i16> %x, <vscale x 4 x i16> %y, <vscale x 4 x i1> %m)
+ ret <vscale x 4 x i16> %res
+}
+
+define <vscale x 4 x i32> @udiv_nxv4i32(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %m) {
+; CHECK-LABEL: udiv_nxv4i32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z2.s, #1 // =0x1
+; CHECK-NEXT: sel z1.s, p0, z1.s, z2.s
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: udiv z0.s, p0/m, z0.s, z1.s
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i32> @llvm.masked.udiv(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %m)
+ ret <vscale x 4 x i32> %res
+}
+
+define <vscale x 8 x i32> @udiv_nxv8i32(<vscale x 8 x i32> %x, <vscale x 8 x i32> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: udiv_nxv8i32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z4.s, #1 // =0x1
+; CHECK-NEXT: punpklo p1.h, p0.b
+; CHECK-NEXT: punpkhi p0.h, p0.b
+; CHECK-NEXT: sel z2.s, p1, z2.s, z4.s
+; CHECK-NEXT: ptrue p1.s
+; CHECK-NEXT: udiv z0.s, p1/m, z0.s, z2.s
+; CHECK-NEXT: sel z2.s, p0, z3.s, z4.s
+; CHECK-NEXT: udiv z1.s, p1/m, z1.s, z2.s
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i32> @llvm.masked.udiv(<vscale x 8 x i32> %x, <vscale x 8 x i32> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i32> %res
+}
+
+define <vscale x 8 x i16> @udiv_nxv8i16(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: udiv_nxv8i16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z2.h, #1 // =0x1
+; CHECK-NEXT: sel z1.h, p0, z1.h, z2.h
+; CHECK-NEXT: uunpkhi z2.s, z0.h
+; CHECK-NEXT: uunpklo z0.s, z0.h
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: uunpkhi z3.s, z1.h
+; CHECK-NEXT: uunpklo z1.s, z1.h
+; CHECK-NEXT: udiv z2.s, p0/m, z2.s, z3.s
+; CHECK-NEXT: udiv z0.s, p0/m, z0.s, z1.s
+; CHECK-NEXT: uzp1 z0.h, z0.h, z2.h
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i16> @llvm.masked.udiv(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i16> %res
+}
+
+define <vscale x 2 x i64> @udiv_nxv2i64(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %m) {
+; CHECK-LABEL: udiv_nxv2i64:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z2.d, #1 // =0x1
+; CHECK-NEXT: sel z1.d, p0, z1.d, z2.d
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: udiv z0.d, p0/m, z0.d, z1.d
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i64> @llvm.masked.udiv(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %m)
+ ret <vscale x 2 x i64> %res
+}
diff --git a/llvm/test/CodeGen/AArch64/masked-urem-fixed-length.ll b/llvm/test/CodeGen/AArch64/masked-urem-fixed-length.ll
new file mode 100644
index 0000000000000..07e635da011fc
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/masked-urem-fixed-length.ll
@@ -0,0 +1,501 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple aarch64 < %s | FileCheck %s --check-prefix=NEON
+; RUN: llc -mtriple aarch64 -mattr=+sve < %s | FileCheck %s --check-prefix=SVE
+
+; Legal
+define <4 x i32> @urem_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
+; NEON-LABEL: urem_v4i32:
+; NEON: // %bb.0:
+; NEON-NEXT: shl v2.4h, v2.4h, #15
+; NEON-NEXT: mov w9, v1.s[1]
+; NEON-NEXT: fmov w12, s1
+; NEON-NEXT: mov w10, v0.s[1]
+; NEON-NEXT: mov w15, v1.s[2]
+; NEON-NEXT: mov w16, v0.s[2]
+; NEON-NEXT: mov w18, v1.s[3]
+; NEON-NEXT: mov w0, v0.s[3]
+; NEON-NEXT: cmlt v2.4h, v2.4h, #0
+; NEON-NEXT: umov w8, v2.h[1]
+; NEON-NEXT: umov w11, v2.h[0]
+; NEON-NEXT: umov w14, v2.h[2]
+; NEON-NEXT: umov w17, v2.h[3]
+; NEON-NEXT: tst w8, #0xffff
+; NEON-NEXT: csinc w8, w9, wzr, ne
+; NEON-NEXT: tst w11, #0xffff
+; NEON-NEXT: fmov w11, s0
+; NEON-NEXT: csinc w12, w12, wzr, ne
+; NEON-NEXT: udiv w9, w10, w8
+; NEON-NEXT: tst w14, #0xffff
+; NEON-NEXT: csinc w14, w15, wzr, ne
+; NEON-NEXT: tst w17, #0xffff
+; NEON-NEXT: udiv w13, w11, w12
+; NEON-NEXT: msub w8, w9, w8, w10
+; NEON-NEXT: udiv w15, w16, w14
+; NEON-NEXT: msub w11, w13, w12, w11
+; NEON-NEXT: csinc w12, w18, wzr, ne
+; NEON-NEXT: fmov s0, w11
+; NEON-NEXT: mov v0.s[1], w8
+; NEON-NEXT: udiv w9, w0, w12
+; NEON-NEXT: msub w8, w15, w14, w16
+; NEON-NEXT: mov v0.s[2], w8
+; NEON-NEXT: msub w8, w9, w12, w0
+; NEON-NEXT: mov v0.s[3], w8
+; NEON-NEXT: ret
+;
+; SVE-LABEL: urem_v4i32:
+; SVE: // %bb.0:
+; SVE-NEXT: shl v2.4h, v2.4h, #15
+; SVE-NEXT: mov w9, v1.s[1]
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: mov w11, v1.s[2]
+; SVE-NEXT: ptrue p0.s, vl4
+; SVE-NEXT: cmlt v2.4h, v2.4h, #0
+; SVE-NEXT: umov w8, v2.h[1]
+; SVE-NEXT: umov w10, v2.h[0]
+; SVE-NEXT: tst w8, #0xffff
+; SVE-NEXT: csinc w8, w9, wzr, ne
+; SVE-NEXT: fmov w9, s1
+; SVE-NEXT: tst w10, #0xffff
+; SVE-NEXT: umov w10, v2.h[2]
+; SVE-NEXT: csinc w9, w9, wzr, ne
+; SVE-NEXT: fmov s3, w9
+; SVE-NEXT: mov w9, v1.s[3]
+; SVE-NEXT: tst w10, #0xffff
+; SVE-NEXT: csinc w10, w11, wzr, ne
+; SVE-NEXT: mov v3.s[1], w8
+; SVE-NEXT: umov w8, v2.h[3]
+; SVE-NEXT: mov v3.s[2], w10
+; SVE-NEXT: tst w8, #0xffff
+; SVE-NEXT: csinc w8, w9, wzr, ne
+; SVE-NEXT: mov v3.s[3], w8
+; SVE-NEXT: movprfx z1, z0
+; SVE-NEXT: udiv z1.s, p0/m, z1.s, z3.s
+; SVE-NEXT: mls v0.4s, v1.4s, v3.4s
+; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
+; SVE-NEXT: ret
+ %res = call <4 x i32> @llvm.masked.urem(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
+ ret <4 x i32> %res
+}
+
+define <2 x i64> @urem_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
+; NEON-LABEL: urem_v2i64:
+; NEON: // %bb.0:
+; NEON-NEXT: shl v2.2s, v2.2s, #31
+; NEON-NEXT: mov x9, v1.d[1]
+; NEON-NEXT: mov x12, v0.d[1]
+; NEON-NEXT: cmlt v2.2s, v2.2s, #0
+; NEON-NEXT: mov w8, v2.s[1]
+; NEON-NEXT: fmov w10, s2
+; NEON-NEXT: cmp w8, #0
+; NEON-NEXT: fmov x8, d1
+; NEON-NEXT: csinc x9, x9, xzr, ne
+; NEON-NEXT: cmp w10, #0
+; NEON-NEXT: fmov x10, d0
+; NEON-NEXT: udiv x13, x12, x9
+; NEON-NEXT: csinc x8, x8, xzr, ne
+; NEON-NEXT: udiv x11, x10, x8
+; NEON-NEXT: msub x9, x13, x9, x12
+; NEON-NEXT: msub x8, x11, x8, x10
+; NEON-NEXT: fmov d0, x8
+; NEON-NEXT: mov v0.d[1], x9
+; NEON-NEXT: ret
+;
+; SVE-LABEL: urem_v2i64:
+; SVE: // %bb.0:
+; SVE-NEXT: shl v2.2s, v2.2s, #31
+; SVE-NEXT: mov x9, v1.d[1]
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: fmov x10, d1
+; SVE-NEXT: ptrue p0.d, vl2
+; SVE-NEXT: cmlt v2.2s, v2.2s, #0
+; SVE-NEXT: mov w8, v2.s[1]
+; SVE-NEXT: cmp w8, #0
+; SVE-NEXT: fmov w8, s2
+; SVE-NEXT: csinc x9, x9, xzr, ne
+; SVE-NEXT: cmp w8, #0
+; SVE-NEXT: csinc x8, x10, xzr, ne
+; SVE-NEXT: fmov d1, x8
+; SVE-NEXT: mov v1.d[1], x9
+; SVE-NEXT: movprfx z2, z0
+; SVE-NEXT: udiv z2.d, p0/m, z2.d, z1.d
+; SVE-NEXT: mls z0.d, p0/m, z2.d, z1.d
+; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
+; SVE-NEXT: ret
+ %res = call <2 x i64> @llvm.masked.urem(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m)
+ ret <2 x i64> %res
+}
+
+; Splitting
+define <4 x i64> @urem_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
+; NEON-LABEL: urem_v4i64:
+; NEON: // %bb.0:
+; NEON-NEXT: ushll v4.4s, v4.4h, #0
+; NEON-NEXT: mov x9, v2.d[1]
+; NEON-NEXT: mov x10, v0.d[1]
+; NEON-NEXT: fmov x12, d2
+; NEON-NEXT: mov x15, v3.d[1]
+; NEON-NEXT: fmov x16, d3
+; NEON-NEXT: mov x18, v1.d[1]
+; NEON-NEXT: shl v5.2s, v4.2s, #31
+; NEON-NEXT: cmlt v5.2s, v5.2s, #0
+; NEON-NEXT: mov w8, v5.s[1]
+; NEON-NEXT: fmov w11, s5
+; NEON-NEXT: cmp w8, #0
+; NEON-NEXT: csinc x8, x9, xzr, ne
+; NEON-NEXT: cmp w11, #0
+; NEON-NEXT: fmov x11, d0
+; NEON-NEXT: ext v0.16b, v4.16b, v4.16b, #8
+; NEON-NEXT: csinc x12, x12, xzr, ne
+; NEON-NEXT: udiv x9, x10, x8
+; NEON-NEXT: shl v0.2s, v0.2s, #31
+; NEON-NEXT: cmlt v0.2s, v0.2s, #0
+; NEON-NEXT: mov w14, v0.s[1]
+; NEON-NEXT: cmp w14, #0
+; NEON-NEXT: fmov w14, s0
+; NEON-NEXT: csinc x15, x15, xzr, ne
+; NEON-NEXT: udiv x13, x11, x12
+; NEON-NEXT: msub x8, x9, x8, x10
+; NEON-NEXT: cmp w14, #0
+; NEON-NEXT: csinc x14, x16, xzr, ne
+; NEON-NEXT: fmov x16, d1
+; NEON-NEXT: udiv x17, x16, x14
+; NEON-NEXT: msub x9, x13, x12, x11
+; NEON-NEXT: fmov d0, x9
+; NEON-NEXT: mov v0.d[1], x8
+; NEON-NEXT: udiv x0, x18, x15
+; NEON-NEXT: msub x10, x17, x14, x16
+; NEON-NEXT: fmov d1, x10
+; NEON-NEXT: msub x11, x0, x15, x18
+; NEON-NEXT: mov v1.d[1], x11
+; NEON-NEXT: ret
+;
+; SVE-LABEL: urem_v4i64:
+; SVE: // %bb.0:
+; SVE-NEXT: ushll v4.4s, v4.4h, #0
+; SVE-NEXT: mov x9, v2.d[1]
+; SVE-NEXT: // kill: def $q1 killed $q1 def $z1
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: ptrue p0.d, vl2
+; SVE-NEXT: shl v5.2s, v4.2s, #31
+; SVE-NEXT: ext v4.16b, v4.16b, v4.16b, #8
+; SVE-NEXT: cmlt v5.2s, v5.2s, #0
+; SVE-NEXT: shl v4.2s, v4.2s, #31
+; SVE-NEXT: mov w8, v5.s[1]
+; SVE-NEXT: fmov w10, s5
+; SVE-NEXT: cmlt v4.2s, v4.2s, #0
+; SVE-NEXT: cmp w8, #0
+; SVE-NEXT: fmov x8, d2
+; SVE-NEXT: csinc x9, x9, xzr, ne
+; SVE-NEXT: cmp w10, #0
+; SVE-NEXT: fmov w10, s4
+; SVE-NEXT: csinc x8, x8, xzr, ne
+; SVE-NEXT: fmov d2, x8
+; SVE-NEXT: mov w8, v4.s[1]
+; SVE-NEXT: mov v2.d[1], x9
+; SVE-NEXT: mov x9, v3.d[1]
+; SVE-NEXT: cmp w8, #0
+; SVE-NEXT: fmov x8, d3
+; SVE-NEXT: csinc x9, x9, xzr, ne
+; SVE-NEXT: cmp w10, #0
+; SVE-NEXT: movprfx z5, z0
+; SVE-NEXT: udiv z5.d, p0/m, z5.d, z2.d
+; SVE-NEXT: csinc x8, x8, xzr, ne
+; SVE-NEXT: fmov d3, x8
+; SVE-NEXT: mov v3.d[1], x9
+; SVE-NEXT: movprfx z4, z1
+; SVE-NEXT: udiv z4.d, p0/m, z4.d, z3.d
+; SVE-NEXT: mls z0.d, p0/m, z5.d, z2.d
+; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
+; SVE-NEXT: mls z1.d, p0/m, z4.d, z3.d
+; SVE-NEXT: // kill: def $q1 killed $q1 killed $z1
+; SVE-NEXT: ret
+ %res = call <4 x i64> @llvm.masked.urem(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
+ ret <4 x i64> %res
+}
+
+; Widening
+define <2 x i32> @urem_v2i32(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m) {
+; NEON-LABEL: urem_v2i32:
+; NEON: // %bb.0:
+; NEON-NEXT: shl v2.2s, v2.2s, #31
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: fmov w8, s0
+; NEON-NEXT: mov w11, v0.s[1]
+; NEON-NEXT: cmlt v2.2s, v2.2s, #0
+; NEON-NEXT: and v1.8b, v1.8b, v2.8b
+; NEON-NEXT: mvn v2.8b, v2.8b
+; NEON-NEXT: sub v1.2s, v1.2s, v2.2s
+; NEON-NEXT: fmov w9, s1
+; NEON-NEXT: mov w12, v1.s[1]
+; NEON-NEXT: udiv w10, w8, w9
+; NEON-NEXT: udiv w13, w11, w12
+; NEON-NEXT: msub w8, w10, w9, w8
+; NEON-NEXT: fmov s0, w8
+; NEON-NEXT: msub w9, w13, w12, w11
+; NEON-NEXT: mov v0.s[1], w9
+; NEON-NEXT: // kill: def $d0 killed $d0 killed $q0
+; NEON-NEXT: ret
+;
+; SVE-LABEL: urem_v2i32:
+; SVE: // %bb.0:
+; SVE-NEXT: shl v2.2s, v2.2s, #31
+; SVE-NEXT: ptrue p0.s, vl2
+; SVE-NEXT: // kill: def $d0 killed $d0 def $z0
+; SVE-NEXT: cmlt v2.2s, v2.2s, #0
+; SVE-NEXT: and v1.8b, v1.8b, v2.8b
+; SVE-NEXT: mvn v2.8b, v2.8b
+; SVE-NEXT: sub v1.2s, v1.2s, v2.2s
+; SVE-NEXT: movprfx z2, z0
+; SVE-NEXT: udiv z2.s, p0/m, z2.s, z1.s
+; SVE-NEXT: mls v0.2s, v2.2s, v1.2s
+; SVE-NEXT: // kill: def $d0 killed $d0 killed $z0
+; SVE-NEXT: ret
+ %res = call <2 x i32> @llvm.masked.urem(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m)
+ ret <2 x i32> %res
+}
+
+; Promotion
+define <4 x i16> @urem_v4i16(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m) {
+; NEON-LABEL: urem_v4i16:
+; NEON: // %bb.0:
+; NEON-NEXT: shl v2.4h, v2.4h, #15
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: umov w11, v0.h[0]
+; NEON-NEXT: umov w8, v0.h[1]
+; NEON-NEXT: umov w14, v0.h[2]
+; NEON-NEXT: umov w17, v0.h[3]
+; NEON-NEXT: cmlt v2.4h, v2.4h, #0
+; NEON-NEXT: and v1.8b, v1.8b, v2.8b
+; NEON-NEXT: mvn v2.8b, v2.8b
+; NEON-NEXT: sub v1.4h, v1.4h, v2.4h
+; NEON-NEXT: umov w12, v1.h[0]
+; NEON-NEXT: umov w9, v1.h[1]
+; NEON-NEXT: umov w15, v1.h[2]
+; NEON-NEXT: umov w18, v1.h[3]
+; NEON-NEXT: udiv w13, w11, w12
+; NEON-NEXT: udiv w10, w8, w9
+; NEON-NEXT: msub w11, w13, w12, w11
+; NEON-NEXT: fmov s0, w11
+; NEON-NEXT: udiv w16, w14, w15
+; NEON-NEXT: msub w8, w10, w9, w8
+; NEON-NEXT: mov v0.h[1], w8
+; NEON-NEXT: udiv w9, w17, w18
+; NEON-NEXT: msub w8, w16, w15, w14
+; NEON-NEXT: mov v0.h[2], w8
+; NEON-NEXT: msub w8, w9, w18, w17
+; NEON-NEXT: mov v0.h[3], w8
+; NEON-NEXT: // kill: def $d0 killed $d0 killed $q0
+; NEON-NEXT: ret
+;
+; SVE-LABEL: urem_v4i16:
+; SVE: // %bb.0:
+; SVE-NEXT: shl v2.4h, v2.4h, #15
+; SVE-NEXT: ptrue p0.s, vl4
+; SVE-NEXT: cmlt v2.4h, v2.4h, #0
+; SVE-NEXT: and v1.8b, v1.8b, v2.8b
+; SVE-NEXT: mvn v2.8b, v2.8b
+; SVE-NEXT: sub v1.4h, v1.4h, v2.4h
+; SVE-NEXT: ushll v2.4s, v0.4h, #0
+; SVE-NEXT: ushll v3.4s, v1.4h, #0
+; SVE-NEXT: udiv z2.s, p0/m, z2.s, z3.s
+; SVE-NEXT: xtn v2.4h, v2.4s
+; SVE-NEXT: mls v0.4h, v2.4h, v1.4h
+; SVE-NEXT: ret
+ %res = call <4 x i16> @llvm.masked.urem(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m)
+ ret <4 x i16> %res
+}
+
+; Scalarization
+define <1 x i64> @urem_v1i64(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m) {
+; NEON-LABEL: urem_v1i64:
+; NEON: // %bb.0:
+; NEON-NEXT: // kill: def $d1 killed $d1 def $q1
+; NEON-NEXT: fmov x8, d1
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: fmov x9, d0
+; NEON-NEXT: tst w0, #0x1
+; NEON-NEXT: csinc x8, x8, xzr, ne
+; NEON-NEXT: udiv x10, x9, x8
+; NEON-NEXT: msub x8, x10, x8, x9
+; NEON-NEXT: fmov d0, x8
+; NEON-NEXT: ret
+;
+; SVE-LABEL: urem_v1i64:
+; SVE: // %bb.0:
+; SVE-NEXT: // kill: def $d1 killed $d1 def $q1
+; SVE-NEXT: fmov x8, d1
+; SVE-NEXT: // kill: def $d0 killed $d0 def $q0
+; SVE-NEXT: fmov x9, d0
+; SVE-NEXT: tst w0, #0x1
+; SVE-NEXT: csinc x8, x8, xzr, ne
+; SVE-NEXT: udiv x10, x9, x8
+; SVE-NEXT: msub x8, x10, x8, x9
+; SVE-NEXT: fmov d0, x8
+; SVE-NEXT: ret
+ %res = call <1 x i64> @llvm.masked.urem(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m)
+ ret <1 x i64> %res
+}
+
+; Expansion
+define <2 x i128> @urem_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
+; NEON-LABEL: urem_v2i128:
+; NEON: // %bb.0:
+; NEON-NEXT: stp x30, x25, [sp, #-64]! // 16-byte Folded Spill
+; NEON-NEXT: stp x24, x23, [sp, #16] // 16-byte Folded Spill
+; NEON-NEXT: stp x22, x21, [sp, #32] // 16-byte Folded Spill
+; NEON-NEXT: stp x20, x19, [sp, #48] // 16-byte Folded Spill
+; NEON-NEXT: .cfi_def_cfa_offset 64
+; NEON-NEXT: .cfi_offset w19, -8
+; NEON-NEXT: .cfi_offset w20, -16
+; NEON-NEXT: .cfi_offset w21, -24
+; NEON-NEXT: .cfi_offset w22, -32
+; NEON-NEXT: .cfi_offset w23, -40
+; NEON-NEXT: .cfi_offset w24, -48
+; NEON-NEXT: .cfi_offset w25, -56
+; NEON-NEXT: .cfi_offset w30, -64
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: fmov w8, s0
+; NEON-NEXT: mov x21, x3
+; NEON-NEXT: mov x22, x2
+; NEON-NEXT: mov x19, x7
+; NEON-NEXT: mov x20, x6
+; NEON-NEXT: mov w25, v0.s[1]
+; NEON-NEXT: tst w8, #0x1
+; NEON-NEXT: csel x3, x5, xzr, ne
+; NEON-NEXT: csinc x2, x4, xzr, ne
+; NEON-NEXT: bl __umodti3
+; NEON-NEXT: tst w25, #0x1
+; NEON-NEXT: mov x23, x0
+; NEON-NEXT: mov x24, x1
+; NEON-NEXT: csel x3, x19, xzr, ne
+; NEON-NEXT: csinc x2, x20, xzr, ne
+; NEON-NEXT: mov x0, x22
+; NEON-NEXT: mov x1, x21
+; NEON-NEXT: bl __umodti3
+; NEON-NEXT: mov x2, x0
+; NEON-NEXT: mov x3, x1
+; NEON-NEXT: mov x0, x23
+; NEON-NEXT: mov x1, x24
+; NEON-NEXT: ldp x20, x19, [sp, #48] // 16-byte Folded Reload
+; NEON-NEXT: ldp x22, x21, [sp, #32] // 16-byte Folded Reload
+; NEON-NEXT: ldp x24, x23, [sp, #16] // 16-byte Folded Reload
+; NEON-NEXT: ldp x30, x25, [sp], #64 // 16-byte Folded Reload
+; NEON-NEXT: ret
+;
+; SVE-LABEL: urem_v2i128:
+; SVE: // %bb.0:
+; SVE-NEXT: stp x30, x25, [sp, #-64]! // 16-byte Folded Spill
+; SVE-NEXT: stp x24, x23, [sp, #16] // 16-byte Folded Spill
+; SVE-NEXT: stp x22, x21, [sp, #32] // 16-byte Folded Spill
+; SVE-NEXT: stp x20, x19, [sp, #48] // 16-byte Folded Spill
+; SVE-NEXT: .cfi_def_cfa_offset 64
+; SVE-NEXT: .cfi_offset w19, -8
+; SVE-NEXT: .cfi_offset w20, -16
+; SVE-NEXT: .cfi_offset w21, -24
+; SVE-NEXT: .cfi_offset w22, -32
+; SVE-NEXT: .cfi_offset w23, -40
+; SVE-NEXT: .cfi_offset w24, -48
+; SVE-NEXT: .cfi_offset w25, -56
+; SVE-NEXT: .cfi_offset w30, -64
+; SVE-NEXT: // kill: def $d0 killed $d0 def $q0
+; SVE-NEXT: fmov w8, s0
+; SVE-NEXT: mov x21, x3
+; SVE-NEXT: mov x22, x2
+; SVE-NEXT: mov x19, x7
+; SVE-NEXT: mov x20, x6
+; SVE-NEXT: mov w25, v0.s[1]
+; SVE-NEXT: tst w8, #0x1
+; SVE-NEXT: csel x3, x5, xzr, ne
+; SVE-NEXT: csinc x2, x4, xzr, ne
+; SVE-NEXT: bl __umodti3
+; SVE-NEXT: tst w25, #0x1
+; SVE-NEXT: mov x23, x0
+; SVE-NEXT: mov x24, x1
+; SVE-NEXT: csel x3, x19, xzr, ne
+; SVE-NEXT: csinc x2, x20, xzr, ne
+; SVE-NEXT: mov x0, x22
+; SVE-NEXT: mov x1, x21
+; SVE-NEXT: bl __umodti3
+; SVE-NEXT: mov x2, x0
+; SVE-NEXT: mov x3, x1
+; SVE-NEXT: mov x0, x23
+; SVE-NEXT: mov x1, x24
+; SVE-NEXT: ldp x20, x19, [sp, #48] // 16-byte Folded Reload
+; SVE-NEXT: ldp x22, x21, [sp, #32] // 16-byte Folded Reload
+; SVE-NEXT: ldp x24, x23, [sp, #16] // 16-byte Folded Reload
+; SVE-NEXT: ldp x30, x25, [sp], #64 // 16-byte Folded Reload
+; SVE-NEXT: ret
+ %res = call <2 x i128> @llvm.masked.urem(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m)
+ ret <2 x i128> %res
+}
+
+; Promotion and widening
+define <3 x i10> @urem_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
+; NEON-LABEL: urem_v3i10:
+; NEON: // %bb.0:
+; NEON-NEXT: fmov s0, w6
+; NEON-NEXT: fmov s1, w3
+; NEON-NEXT: ldr w8, [sp]
+; NEON-NEXT: fmov s2, w0
+; NEON-NEXT: mov v0.h[1], w7
+; NEON-NEXT: mov v1.h[1], w4
+; NEON-NEXT: mov v2.h[1], w1
+; NEON-NEXT: mov v0.h[2], w8
+; NEON-NEXT: mov v1.h[2], w5
+; NEON-NEXT: mov v2.h[2], w2
+; NEON-NEXT: shl v0.4h, v0.4h, #15
+; NEON-NEXT: bic v1.4h, #252, lsl #8
+; NEON-NEXT: bic v2.4h, #252, lsl #8
+; NEON-NEXT: cmlt v0.4h, v0.4h, #0
+; NEON-NEXT: umov w9, v2.h[0]
+; NEON-NEXT: umov w12, v2.h[1]
+; NEON-NEXT: umov w15, v2.h[2]
+; NEON-NEXT: and v1.8b, v1.8b, v0.8b
+; NEON-NEXT: mvn v0.8b, v0.8b
+; NEON-NEXT: sub v0.4h, v1.4h, v0.4h
+; NEON-NEXT: umov w8, v0.h[0]
+; NEON-NEXT: umov w11, v0.h[1]
+; NEON-NEXT: umov w14, v0.h[2]
+; NEON-NEXT: and w8, w8, #0x3ff
+; NEON-NEXT: and w11, w11, #0x3ff
+; NEON-NEXT: and w14, w14, #0x3ff
+; NEON-NEXT: udiv w10, w9, w8
+; NEON-NEXT: udiv w13, w12, w11
+; NEON-NEXT: msub w0, w10, w8, w9
+; NEON-NEXT: udiv w16, w15, w14
+; NEON-NEXT: msub w1, w13, w11, w12
+; NEON-NEXT: msub w2, w16, w14, w15
+; NEON-NEXT: ret
+;
+; SVE-LABEL: urem_v3i10:
+; SVE: // %bb.0:
+; SVE-NEXT: fmov s0, w6
+; SVE-NEXT: fmov s1, w3
+; SVE-NEXT: ldr w8, [sp]
+; SVE-NEXT: fmov s2, w0
+; SVE-NEXT: ptrue p0.s, vl4
+; SVE-NEXT: mov v0.h[1], w7
+; SVE-NEXT: mov v1.h[1], w4
+; SVE-NEXT: mov v2.h[1], w1
+; SVE-NEXT: mov v0.h[2], w8
+; SVE-NEXT: mov v1.h[2], w5
+; SVE-NEXT: mov v2.h[2], w2
+; SVE-NEXT: shl v0.4h, v0.4h, #15
+; SVE-NEXT: bic v1.4h, #252, lsl #8
+; SVE-NEXT: bic v2.4h, #252, lsl #8
+; SVE-NEXT: cmlt v0.4h, v0.4h, #0
+; SVE-NEXT: ushll v3.4s, v2.4h, #0
+; SVE-NEXT: and v1.8b, v1.8b, v0.8b
+; SVE-NEXT: mvn v0.8b, v0.8b
+; SVE-NEXT: sub v0.4h, v1.4h, v0.4h
+; SVE-NEXT: ushll v1.4s, v0.4h, #0
+; SVE-NEXT: udivr z1.s, p0/m, z1.s, z3.s
+; SVE-NEXT: xtn v1.4h, v1.4s
+; SVE-NEXT: mls v2.4h, v1.4h, v0.4h
+; SVE-NEXT: umov w0, v2.h[0]
+; SVE-NEXT: umov w1, v2.h[1]
+; SVE-NEXT: umov w2, v2.h[2]
+; SVE-NEXT: ret
+ %res = call <3 x i10> @llvm.masked.urem(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m)
+ ret <3 x i10> %res
+}
diff --git a/llvm/test/CodeGen/AArch64/masked-urem-scalable.ll b/llvm/test/CodeGen/AArch64/masked-urem-scalable.ll
new file mode 100644
index 0000000000000..af6b372ce7eb7
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/masked-urem-scalable.ll
@@ -0,0 +1,86 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple aarch64 -mattr=+sve < %s | FileCheck %s
+
+define <vscale x 4 x i16> @urem_nxv4i16(<vscale x 4 x i16> %x, <vscale x 4 x i16> %y, <vscale x 4 x i1> %m) {
+; CHECK-LABEL: urem_nxv4i16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: and z1.s, z1.s, #0xffff
+; CHECK-NEXT: mov z2.s, #1 // =0x1
+; CHECK-NEXT: and z0.s, z0.s, #0xffff
+; CHECK-NEXT: sel z1.s, p0, z1.s, z2.s
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: movprfx z2, z0
+; CHECK-NEXT: udiv z2.s, p0/m, z2.s, z1.s
+; CHECK-NEXT: mls z0.s, p0/m, z2.s, z1.s
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i16> @llvm.masked.urem(<vscale x 4 x i16> %x, <vscale x 4 x i16> %y, <vscale x 4 x i1> %m)
+ ret <vscale x 4 x i16> %res
+}
+
+define <vscale x 4 x i32> @urem_nxv4i32(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %m) {
+; CHECK-LABEL: urem_nxv4i32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z2.s, #1 // =0x1
+; CHECK-NEXT: sel z1.s, p0, z1.s, z2.s
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: movprfx z2, z0
+; CHECK-NEXT: udiv z2.s, p0/m, z2.s, z1.s
+; CHECK-NEXT: mls z0.s, p0/m, z2.s, z1.s
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i32> @llvm.masked.urem(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %m)
+ ret <vscale x 4 x i32> %res
+}
+
+define <vscale x 8 x i32> @urem_nxv8i32(<vscale x 8 x i32> %x, <vscale x 8 x i32> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: urem_nxv8i32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z4.s, #1 // =0x1
+; CHECK-NEXT: punpklo p1.h, p0.b
+; CHECK-NEXT: punpkhi p0.h, p0.b
+; CHECK-NEXT: sel z2.s, p1, z2.s, z4.s
+; CHECK-NEXT: sel z3.s, p0, z3.s, z4.s
+; CHECK-NEXT: ptrue p1.s
+; CHECK-NEXT: movprfx z5, z0
+; CHECK-NEXT: udiv z5.s, p1/m, z5.s, z2.s
+; CHECK-NEXT: movprfx z4, z1
+; CHECK-NEXT: udiv z4.s, p1/m, z4.s, z3.s
+; CHECK-NEXT: mls z0.s, p1/m, z5.s, z2.s
+; CHECK-NEXT: mls z1.s, p1/m, z4.s, z3.s
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i32> @llvm.masked.urem(<vscale x 8 x i32> %x, <vscale x 8 x i32> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i32> %res
+}
+
+define <vscale x 8 x i16> @urem_nxv8i16(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: urem_nxv8i16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z2.h, #1 // =0x1
+; CHECK-NEXT: uunpklo z4.s, z0.h
+; CHECK-NEXT: sel z1.h, p0, z1.h, z2.h
+; CHECK-NEXT: uunpkhi z2.s, z0.h
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: uunpkhi z3.s, z1.h
+; CHECK-NEXT: udiv z2.s, p0/m, z2.s, z3.s
+; CHECK-NEXT: uunpklo z3.s, z1.h
+; CHECK-NEXT: udivr z3.s, p0/m, z3.s, z4.s
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: uzp1 z2.h, z3.h, z2.h
+; CHECK-NEXT: mls z0.h, p0/m, z2.h, z1.h
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i16> @llvm.masked.urem(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i16> %res
+}
+
+define <vscale x 2 x i64> @urem_nxv2i64(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %m) {
+; CHECK-LABEL: urem_nxv2i64:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov z2.d, #1 // =0x1
+; CHECK-NEXT: sel z1.d, p0, z1.d, z2.d
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: movprfx z2, z0
+; CHECK-NEXT: udiv z2.d, p0/m, z2.d, z1.d
+; CHECK-NEXT: mls z0.d, p0/m, z2.d, z1.d
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i64> @llvm.masked.urem(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %m)
+ ret <vscale x 2 x i64> %res
+}
diff --git a/llvm/test/CodeGen/PowerPC/masked-sdiv.ll b/llvm/test/CodeGen/PowerPC/masked-sdiv.ll
new file mode 100644
index 0000000000000..0d824bc79fec2
--- /dev/null
+++ b/llvm/test/CodeGen/PowerPC/masked-sdiv.ll
@@ -0,0 +1,399 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=powerpc64le < %s | FileCheck %s
+
+; Legal
+define <4 x i32> @sdiv_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
+; CHECK-LABEL: sdiv_v4i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: xxleqv 32, 32, 32
+; CHECK-NEXT: vspltisw 5, 1
+; CHECK-NEXT: xxsldwi 1, 34, 34, 1
+; CHECK-NEXT: vslw 4, 4, 0
+; CHECK-NEXT: xxswapd 4, 34
+; CHECK-NEXT: xxsldwi 6, 34, 34, 3
+; CHECK-NEXT: mffprwz 4, 1
+; CHECK-NEXT: vsraw 4, 4, 0
+; CHECK-NEXT: xxsel 0, 37, 35, 36
+; CHECK-NEXT: xxsldwi 2, 0, 0, 1
+; CHECK-NEXT: xxswapd 3, 0
+; CHECK-NEXT: xxsldwi 5, 0, 0, 3
+; CHECK-NEXT: mffprwz 3, 2
+; CHECK-NEXT: mffprwz 5, 3
+; CHECK-NEXT: divw 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 4
+; CHECK-NEXT: divw 4, 4, 5
+; CHECK-NEXT: mfvsrwz 5, 34
+; CHECK-NEXT: rldimi 4, 3, 32, 0
+; CHECK-NEXT: mffprwz 3, 5
+; CHECK-NEXT: mtfprd 1, 4
+; CHECK-NEXT: mffprwz 4, 6
+; CHECK-NEXT: divw 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 0
+; CHECK-NEXT: divw 4, 5, 4
+; CHECK-NEXT: rldimi 4, 3, 32, 0
+; CHECK-NEXT: mtfprd 0, 4
+; CHECK-NEXT: xxmrghd 34, 0, 1
+; CHECK-NEXT: blr
+ %res = call <4 x i32> @llvm.masked.sdiv(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
+ ret <4 x i32> %res
+}
+
+define <2 x i64> @sdiv_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
+; CHECK-LABEL: sdiv_v2i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: xxleqv 32, 32, 32
+; CHECK-NEXT: vspltisw 5, 1
+; CHECK-NEXT: mfvsrd 4, 34
+; CHECK-NEXT: xxswapd 2, 34
+; CHECK-NEXT: vsld 4, 4, 0
+; CHECK-NEXT: vsrad 4, 4, 0
+; CHECK-NEXT: vupklsw 5, 5
+; CHECK-NEXT: xxsel 0, 37, 35, 36
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: divd 3, 4, 3
+; CHECK-NEXT: mffprd 4, 2
+; CHECK-NEXT: xxswapd 1, 0
+; CHECK-NEXT: mtfprd 0, 3
+; CHECK-NEXT: mffprd 3, 1
+; CHECK-NEXT: divd 3, 4, 3
+; CHECK-NEXT: mtfprd 1, 3
+; CHECK-NEXT: xxmrghd 34, 0, 1
+; CHECK-NEXT: blr
+ %res = call <2 x i64> @llvm.masked.sdiv(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m)
+ ret <2 x i64> %res
+}
+
+; Splitting
+define <4 x i64> @sdiv_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
+; CHECK-LABEL: sdiv_v4i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: xxmrglw 32, 38, 38
+; CHECK-NEXT: xxleqv 39, 39, 39
+; CHECK-NEXT: xxmrghw 33, 38, 38
+; CHECK-NEXT: mfvsrd 3, 34
+; CHECK-NEXT: vspltisw 6, 1
+; CHECK-NEXT: mfvsrd 4, 35
+; CHECK-NEXT: xxswapd 2, 34
+; CHECK-NEXT: xxswapd 4, 35
+; CHECK-NEXT: vsld 0, 0, 7
+; CHECK-NEXT: mffprd 5, 2
+; CHECK-NEXT: vsrad 0, 0, 7
+; CHECK-NEXT: vupklsw 6, 6
+; CHECK-NEXT: xxsel 0, 38, 36, 32
+; CHECK-NEXT: vsld 4, 1, 7
+; CHECK-NEXT: mffprd 6, 0
+; CHECK-NEXT: vsrad 4, 4, 7
+; CHECK-NEXT: divd 3, 3, 6
+; CHECK-NEXT: xxswapd 3, 0
+; CHECK-NEXT: mtfprd 0, 3
+; CHECK-NEXT: xxsel 1, 38, 37, 36
+; CHECK-NEXT: mffprd 6, 1
+; CHECK-NEXT: divd 4, 4, 6
+; CHECK-NEXT: mffprd 6, 3
+; CHECK-NEXT: divd 5, 5, 6
+; CHECK-NEXT: mtfprd 2, 5
+; CHECK-NEXT: xxswapd 5, 1
+; CHECK-NEXT: mtfprd 1, 4
+; CHECK-NEXT: mffprd 3, 5
+; CHECK-NEXT: mffprd 4, 4
+; CHECK-NEXT: divd 3, 4, 3
+; CHECK-NEXT: xxmrghd 34, 0, 2
+; CHECK-NEXT: mtfprd 0, 3
+; CHECK-NEXT: xxmrghd 35, 1, 0
+; CHECK-NEXT: blr
+ %res = call <4 x i64> @llvm.masked.sdiv(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
+ ret <4 x i64> %res
+}
+
+; Widening
+define <2 x i32> @sdiv_v2i32(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m) {
+; CHECK-LABEL: sdiv_v2i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: addis 3, 2, .LCPI3_0@toc@ha
+; CHECK-NEXT: xxlxor 32, 32, 32
+; CHECK-NEXT: xxsldwi 1, 34, 34, 1
+; CHECK-NEXT: addi 3, 3, .LCPI3_0@toc@l
+; CHECK-NEXT: mffprwz 4, 1
+; CHECK-NEXT: xxswapd 4, 34
+; CHECK-NEXT: xxsldwi 6, 34, 34, 3
+; CHECK-NEXT: lxvd2x 0, 0, 3
+; CHECK-NEXT: xxswapd 37, 0
+; CHECK-NEXT: vperm 4, 0, 4, 5
+; CHECK-NEXT: xxleqv 32, 32, 32
+; CHECK-NEXT: vspltisw 5, 1
+; CHECK-NEXT: vslw 4, 4, 0
+; CHECK-NEXT: vsraw 4, 4, 0
+; CHECK-NEXT: xxsel 0, 37, 35, 36
+; CHECK-NEXT: xxsldwi 2, 0, 0, 1
+; CHECK-NEXT: xxswapd 3, 0
+; CHECK-NEXT: xxsldwi 5, 0, 0, 3
+; CHECK-NEXT: mffprwz 3, 2
+; CHECK-NEXT: mffprwz 5, 3
+; CHECK-NEXT: divw 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 4
+; CHECK-NEXT: divw 4, 4, 5
+; CHECK-NEXT: mfvsrwz 5, 34
+; CHECK-NEXT: rldimi 4, 3, 32, 0
+; CHECK-NEXT: mffprwz 3, 5
+; CHECK-NEXT: mtfprd 1, 4
+; CHECK-NEXT: mffprwz 4, 6
+; CHECK-NEXT: divw 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 0
+; CHECK-NEXT: divw 4, 5, 4
+; CHECK-NEXT: rldimi 4, 3, 32, 0
+; CHECK-NEXT: mtfprd 0, 4
+; CHECK-NEXT: xxmrghd 34, 0, 1
+; CHECK-NEXT: blr
+ %res = call <2 x i32> @llvm.masked.sdiv(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m)
+ ret <2 x i32> %res
+}
+
+; Promotion
+define <4 x i16> @sdiv_v4i16(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m) {
+; CHECK-LABEL: sdiv_v4i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: xxswapd 0, 36
+; CHECK-NEXT: xxsldwi 1, 36, 36, 1
+; CHECK-NEXT: mfvsrwz 3, 36
+; CHECK-NEXT: li 7, 0
+; CHECK-NEXT: xxsldwi 2, 36, 36, 3
+; CHECK-NEXT: mffprwz 4, 0
+; CHECK-NEXT: mffprwz 5, 1
+; CHECK-NEXT: mffprwz 6, 2
+; CHECK-NEXT: mtvsrd 36, 3
+; CHECK-NEXT: mtvsrd 37, 4
+; CHECK-NEXT: mtvsrd 32, 5
+; CHECK-NEXT: mfvsrd 5, 34
+; CHECK-NEXT: rldicl 8, 5, 48, 48
+; CHECK-NEXT: rldicl 9, 5, 32, 48
+; CHECK-NEXT: extsh 8, 8
+; CHECK-NEXT: extsh 9, 9
+; CHECK-NEXT: vmrghh 5, 0, 5
+; CHECK-NEXT: mtvsrd 32, 6
+; CHECK-NEXT: vmrghh 4, 0, 4
+; CHECK-NEXT: mtvsrd 32, 7
+; CHECK-NEXT: clrldi 7, 5, 48
+; CHECK-NEXT: rldicl 5, 5, 16, 48
+; CHECK-NEXT: extsh 7, 7
+; CHECK-NEXT: extsh 5, 5
+; CHECK-NEXT: xxmrglw 1, 36, 37
+; CHECK-NEXT: vspltish 4, 15
+; CHECK-NEXT: vsplth 0, 0, 3
+; CHECK-NEXT: xxspltw 0, 32, 3
+; CHECK-NEXT: vspltish 0, 1
+; CHECK-NEXT: xxmrgld 37, 0, 1
+; CHECK-NEXT: xxswapd 1, 34
+; CHECK-NEXT: vslh 5, 5, 4
+; CHECK-NEXT: mffprd 3, 1
+; CHECK-NEXT: vsrah 4, 5, 4
+; CHECK-NEXT: clrldi 10, 3, 48
+; CHECK-NEXT: rldicl 11, 3, 48, 48
+; CHECK-NEXT: extsh 10, 10
+; CHECK-NEXT: extsh 11, 11
+; CHECK-NEXT: xxsel 0, 32, 35, 36
+; CHECK-NEXT: mffprd 6, 0
+; CHECK-NEXT: clrldi 12, 6, 48
+; CHECK-NEXT: extsh 12, 12
+; CHECK-NEXT: divw 7, 7, 12
+; CHECK-NEXT: rldicl 12, 6, 48, 48
+; CHECK-NEXT: extsh 12, 12
+; CHECK-NEXT: divw 8, 8, 12
+; CHECK-NEXT: xxswapd 2, 0
+; CHECK-NEXT: mffprd 4, 2
+; CHECK-NEXT: rldicl 12, 6, 32, 48
+; CHECK-NEXT: rldicl 6, 6, 16, 48
+; CHECK-NEXT: extsh 6, 6
+; CHECK-NEXT: extsh 12, 12
+; CHECK-NEXT: divw 5, 5, 6
+; CHECK-NEXT: clrldi 6, 4, 48
+; CHECK-NEXT: divw 9, 9, 12
+; CHECK-NEXT: rldicl 12, 3, 32, 48
+; CHECK-NEXT: rldicl 3, 3, 16, 48
+; CHECK-NEXT: extsh 6, 6
+; CHECK-NEXT: extsh 12, 12
+; CHECK-NEXT: extsh 3, 3
+; CHECK-NEXT: divw 6, 10, 6
+; CHECK-NEXT: rldicl 10, 4, 48, 48
+; CHECK-NEXT: extsh 10, 10
+; CHECK-NEXT: mtvsrd 34, 7
+; CHECK-NEXT: divw 10, 11, 10
+; CHECK-NEXT: rldicl 11, 4, 32, 48
+; CHECK-NEXT: rldicl 4, 4, 16, 48
+; CHECK-NEXT: extsh 11, 11
+; CHECK-NEXT: extsh 4, 4
+; CHECK-NEXT: mtvsrd 35, 8
+; CHECK-NEXT: divw 11, 12, 11
+; CHECK-NEXT: divw 3, 3, 4
+; CHECK-NEXT: mtvsrd 36, 9
+; CHECK-NEXT: mtvsrd 37, 5
+; CHECK-NEXT: mtvsrd 32, 6
+; CHECK-NEXT: vmrghh 2, 3, 2
+; CHECK-NEXT: vmrghh 3, 5, 4
+; CHECK-NEXT: mtvsrd 36, 10
+; CHECK-NEXT: mtvsrd 37, 11
+; CHECK-NEXT: xxmrglw 0, 35, 34
+; CHECK-NEXT: vmrghh 4, 4, 0
+; CHECK-NEXT: mtvsrd 32, 3
+; CHECK-NEXT: vmrghh 5, 0, 5
+; CHECK-NEXT: xxmrglw 1, 37, 36
+; CHECK-NEXT: xxmrgld 34, 0, 1
+; CHECK-NEXT: blr
+ %res = call <4 x i16> @llvm.masked.sdiv(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m)
+ ret <4 x i16> %res
+}
+
+; Scalarization
+define <1 x i64> @sdiv_v1i64(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m) {
+; CHECK-LABEL: sdiv_v1i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: andi. 5, 5, 1
+; CHECK-NEXT: li 5, 1
+; CHECK-NEXT: iselgt 4, 4, 5
+; CHECK-NEXT: divd 3, 3, 4
+; CHECK-NEXT: blr
+ %res = call <1 x i64> @llvm.masked.sdiv(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m)
+ ret <1 x i64> %res
+}
+
+; Expansion
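+; Note: i128 division has no instruction, so each element is expanded to a
+; __divti3 libcall; the mask select against 1 (iselgt/isel) happens first.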
+define <2 x i128> @sdiv_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
+; CHECK-LABEL: sdiv_v2i128:
+; CHECK: # %bb.0:
+; CHECK-NEXT: mfocrf 12, 32
+; CHECK-NEXT: stw 12, 8(1)
+; CHECK-NEXT: mflr 0
+; CHECK-NEXT: stdu 1, -128(1)
+; CHECK-NEXT: std 0, 144(1)
+; CHECK-NEXT: .cfi_def_cfa_offset 128
+; CHECK-NEXT: .cfi_offset lr, 16
+; CHECK-NEXT: .cfi_offset r29, -24
+; CHECK-NEXT: .cfi_offset r30, -16
+; CHECK-NEXT: .cfi_offset cr2, 8
+; CHECK-NEXT: .cfi_offset v29, -80
+; CHECK-NEXT: .cfi_offset v30, -64
+; CHECK-NEXT: .cfi_offset v31, -48
+; CHECK-NEXT: li 3, 48
+; CHECK-NEXT: xxswapd 0, 38
+; CHECK-NEXT: xxswapd 1, 37
+; CHECK-NEXT: std 30, 112(1) # 8-byte Folded Spill
+; CHECK-NEXT: li 30, 1
+; CHECK-NEXT: std 29, 104(1) # 8-byte Folded Spill
+; CHECK-NEXT: li 29, 0
+; CHECK-NEXT: mfvsrd 4, 35
+; CHECK-NEXT: stvx 29, 1, 3 # 16-byte Folded Spill
+; CHECK-NEXT: li 3, 64
+; CHECK-NEXT: stvx 30, 1, 3 # 16-byte Folded Spill
+; CHECK-NEXT: li 3, 80
+; CHECK-NEXT: vmr 30, 2
+; CHECK-NEXT: stvx 31, 1, 3 # 16-byte Folded Spill
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: vmr 31, 4
+; CHECK-NEXT: andi. 3, 3, 1
+; CHECK-NEXT: mfvsrd 3, 38
+; CHECK-NEXT: crmove 8, 1
+; CHECK-NEXT: andi. 3, 3, 1
+; CHECK-NEXT: mffprd 3, 1
+; CHECK-NEXT: iselgt 5, 3, 30
+; CHECK-NEXT: mfvsrd 3, 37
+; CHECK-NEXT: iselgt 6, 3, 29
+; CHECK-NEXT: xxswapd 0, 35
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: bl __divti3
+; CHECK-NEXT: nop
+; CHECK-NEXT: xxswapd 0, 63
+; CHECK-NEXT: mtfprd 1, 3
+; CHECK-NEXT: mtfprd 2, 4
+; CHECK-NEXT: mfvsrd 4, 62
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: isel 5, 3, 30, 8
+; CHECK-NEXT: mfvsrd 3, 63
+; CHECK-NEXT: isel 6, 3, 29, 8
+; CHECK-NEXT: xxswapd 0, 62
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: xxmrghd 61, 2, 1
+; CHECK-NEXT: bl __divti3
+; CHECK-NEXT: nop
+; CHECK-NEXT: mtfprd 0, 3
+; CHECK-NEXT: li 3, 80
+; CHECK-NEXT: mtfprd 1, 4
+; CHECK-NEXT: ld 30, 112(1) # 8-byte Folded Reload
+; CHECK-NEXT: vmr 3, 29
+; CHECK-NEXT: ld 29, 104(1) # 8-byte Folded Reload
+; CHECK-NEXT: lvx 31, 1, 3 # 16-byte Folded Reload
+; CHECK-NEXT: li 3, 64
+; CHECK-NEXT: lvx 30, 1, 3 # 16-byte Folded Reload
+; CHECK-NEXT: li 3, 48
+; CHECK-NEXT: lvx 29, 1, 3 # 16-byte Folded Reload
+; CHECK-NEXT: xxmrghd 34, 1, 0
+; CHECK-NEXT: addi 1, 1, 128
+; CHECK-NEXT: ld 0, 16(1)
+; CHECK-NEXT: lwz 12, 8(1)
+; CHECK-NEXT: mtlr 0
+; CHECK-NEXT: mtocrf 32, 12
+; CHECK-NEXT: blr
+ %res = call <2 x i128> @llvm.masked.sdiv(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m)
+ ret <2 x i128> %res
+}
+
+; Promotion and widening
+define <3 x i10> @sdiv_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
+; CHECK-LABEL: sdiv_v3i10:
+; CHECK: # %bb.0:
+; CHECK-NEXT: mtfprwz 0, 9
+; CHECK-NEXT: mtfprwz 1, 10
+; CHECK-NEXT: addis 9, 2, .LCPI7_0@toc@ha
+; CHECK-NEXT: xxleqv 38, 38, 38
+; CHECK-NEXT: vspltisw 5, 11
+; CHECK-NEXT: addi 9, 9, .LCPI7_0@toc@l
+; CHECK-NEXT: vadduwm 5, 5, 5
+; CHECK-NEXT: xxmrghw 35, 1, 0
+; CHECK-NEXT: lxvd2x 0, 0, 9
+; CHECK-NEXT: mtfprwz 1, 7
+; CHECK-NEXT: xxswapd 34, 0
+; CHECK-NEXT: mtfprwz 0, 6
+; CHECK-NEXT: xxmrghw 36, 1, 0
+; CHECK-NEXT: mtfprwz 0, 3
+; CHECK-NEXT: lbz 3, 96(1)
+; CHECK-NEXT: mtfprwz 1, 4
+; CHECK-NEXT: mtvsrwz 33, 3
+; CHECK-NEXT: xxmrghw 32, 1, 0
+; CHECK-NEXT: vperm 3, 1, 3, 2
+; CHECK-NEXT: mtvsrwz 33, 8
+; CHECK-NEXT: vslw 3, 3, 6
+; CHECK-NEXT: vsraw 3, 3, 6
+; CHECK-NEXT: vperm 4, 1, 4, 2
+; CHECK-NEXT: mtvsrwz 33, 5
+; CHECK-NEXT: vslw 4, 4, 5
+; CHECK-NEXT: vsraw 4, 4, 5
+; CHECK-NEXT: vperm 0, 1, 0, 2
+; CHECK-NEXT: vspltisw 1, 1
+; CHECK-NEXT: xxsel 0, 33, 36, 35
+; CHECK-NEXT: vslw 3, 0, 5
+; CHECK-NEXT: vsraw 3, 3, 5
+; CHECK-NEXT: xxswapd 1, 0
+; CHECK-NEXT: xxsldwi 3, 0, 0, 1
+; CHECK-NEXT: mffprwz 3, 1
+; CHECK-NEXT: xxswapd 2, 35
+; CHECK-NEXT: xxsldwi 4, 35, 35, 1
+; CHECK-NEXT: mffprwz 4, 2
+; CHECK-NEXT: divw 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 4
+; CHECK-NEXT: mtfprwz 1, 3
+; CHECK-NEXT: mffprwz 3, 3
+; CHECK-NEXT: divw 3, 4, 3
+; CHECK-NEXT: mfvsrwz 4, 35
+; CHECK-NEXT: mtfprwz 2, 3
+; CHECK-NEXT: mffprwz 3, 0
+; CHECK-NEXT: divw 3, 4, 3
+; CHECK-NEXT: mtvsrwz 35, 3
+; CHECK-NEXT: xxmrghw 36, 2, 1
+; CHECK-NEXT: vperm 2, 3, 4, 2
+; CHECK-NEXT: mfvsrwz 5, 34
+; CHECK-NEXT: xxswapd 0, 34
+; CHECK-NEXT: xxsldwi 1, 34, 34, 1
+; CHECK-NEXT: mffprwz 3, 0
+; CHECK-NEXT: mffprwz 4, 1
+; CHECK-NEXT: blr
+ %res = call <3 x i10> @llvm.masked.sdiv(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m)
+ ret <3 x i10> %res
+}
diff --git a/llvm/test/CodeGen/PowerPC/masked-srem.ll b/llvm/test/CodeGen/PowerPC/masked-srem.ll
new file mode 100644
index 0000000000000..be2e1ed7c12e6
--- /dev/null
+++ b/llvm/test/CodeGen/PowerPC/masked-srem.ll
@@ -0,0 +1,463 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=powerpc64le < %s | FileCheck %s
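+; Note: in the tests below the mask is lowered by selecting a divisor of 1
+; for inactive lanes (xxsel against a splat of 1); srem itself is then
+; expanded to div + mull + sub per element.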
+
+; Legal
+define <4 x i32> @srem_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
+; CHECK-LABEL: srem_v4i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: xxleqv 32, 32, 32
+; CHECK-NEXT: vspltisw 5, 1
+; CHECK-NEXT: xxsldwi 1, 34, 34, 1
+; CHECK-NEXT: vslw 4, 4, 0
+; CHECK-NEXT: xxswapd 4, 34
+; CHECK-NEXT: xxsldwi 6, 34, 34, 3
+; CHECK-NEXT: mffprwz 4, 1
+; CHECK-NEXT: vsraw 4, 4, 0
+; CHECK-NEXT: xxsel 0, 37, 35, 36
+; CHECK-NEXT: xxsldwi 2, 0, 0, 1
+; CHECK-NEXT: xxswapd 3, 0
+; CHECK-NEXT: xxsldwi 5, 0, 0, 3
+; CHECK-NEXT: mffprwz 3, 2
+; CHECK-NEXT: mffprwz 5, 3
+; CHECK-NEXT: divw 6, 4, 3
+; CHECK-NEXT: mullw 3, 6, 3
+; CHECK-NEXT: mffprwz 6, 4
+; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: divw 4, 6, 5
+; CHECK-NEXT: mullw 4, 4, 5
+; CHECK-NEXT: sub 4, 6, 4
+; CHECK-NEXT: rldimi 4, 3, 32, 0
+; CHECK-NEXT: mffprwz 3, 5
+; CHECK-NEXT: mtfprd 1, 4
+; CHECK-NEXT: mffprwz 4, 6
+; CHECK-NEXT: divw 5, 4, 3
+; CHECK-NEXT: mullw 3, 5, 3
+; CHECK-NEXT: mfvsrwz 5, 34
+; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 0
+; CHECK-NEXT: divw 6, 5, 4
+; CHECK-NEXT: mullw 4, 6, 4
+; CHECK-NEXT: sub 4, 5, 4
+; CHECK-NEXT: rldimi 4, 3, 32, 0
+; CHECK-NEXT: mtfprd 0, 4
+; CHECK-NEXT: xxmrghd 34, 0, 1
+; CHECK-NEXT: blr
+ %res = call <4 x i32> @llvm.masked.srem(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
+ ret <4 x i32> %res
+}
+
+define <2 x i64> @srem_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
+; CHECK-LABEL: srem_v2i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: xxleqv 32, 32, 32
+; CHECK-NEXT: vspltisw 5, 1
+; CHECK-NEXT: mfvsrd 4, 34
+; CHECK-NEXT: xxswapd 2, 34
+; CHECK-NEXT: vsld 4, 4, 0
+; CHECK-NEXT: vsrad 4, 4, 0
+; CHECK-NEXT: vupklsw 5, 5
+; CHECK-NEXT: xxsel 0, 37, 35, 36
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: divd 5, 4, 3
+; CHECK-NEXT: mulld 3, 5, 3
+; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: mffprd 4, 2
+; CHECK-NEXT: xxswapd 1, 0
+; CHECK-NEXT: mtfprd 0, 3
+; CHECK-NEXT: mffprd 3, 1
+; CHECK-NEXT: divd 5, 4, 3
+; CHECK-NEXT: mulld 3, 5, 3
+; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: mtfprd 1, 3
+; CHECK-NEXT: xxmrghd 34, 0, 1
+; CHECK-NEXT: blr
+ %res = call <2 x i64> @llvm.masked.srem(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m)
+ ret <2 x i64> %res
+}
+
+; Splitting
+define <4 x i64> @srem_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
+; CHECK-LABEL: srem_v4i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: xxmrglw 32, 38, 38
+; CHECK-NEXT: xxleqv 39, 39, 39
+; CHECK-NEXT: xxmrghw 33, 38, 38
+; CHECK-NEXT: mfvsrd 3, 34
+; CHECK-NEXT: vspltisw 6, 1
+; CHECK-NEXT: mfvsrd 4, 35
+; CHECK-NEXT: xxswapd 2, 34
+; CHECK-NEXT: xxswapd 4, 35
+; CHECK-NEXT: vsld 0, 0, 7
+; CHECK-NEXT: mffprd 5, 2
+; CHECK-NEXT: vsrad 0, 0, 7
+; CHECK-NEXT: vupklsw 6, 6
+; CHECK-NEXT: xxsel 0, 38, 36, 32
+; CHECK-NEXT: vsld 4, 1, 7
+; CHECK-NEXT: mffprd 6, 0
+; CHECK-NEXT: vsrad 4, 4, 7
+; CHECK-NEXT: divd 9, 3, 6
+; CHECK-NEXT: mulld 6, 9, 6
+; CHECK-NEXT: sub 3, 3, 6
+; CHECK-NEXT: xxswapd 3, 0
+; CHECK-NEXT: mffprd 8, 3
+; CHECK-NEXT: mtfprd 0, 3
+; CHECK-NEXT: xxsel 1, 38, 37, 36
+; CHECK-NEXT: mffprd 7, 1
+; CHECK-NEXT: divd 9, 4, 7
+; CHECK-NEXT: mulld 7, 9, 7
+; CHECK-NEXT: divd 9, 5, 8
+; CHECK-NEXT: sub 4, 4, 7
+; CHECK-NEXT: mulld 8, 9, 8
+; CHECK-NEXT: sub 3, 5, 8
+; CHECK-NEXT: xxswapd 5, 1
+; CHECK-NEXT: mffprd 5, 4
+; CHECK-NEXT: mtfprd 1, 4
+; CHECK-NEXT: mffprd 4, 5
+; CHECK-NEXT: mtfprd 2, 3
+; CHECK-NEXT: divd 3, 5, 4
+; CHECK-NEXT: mulld 3, 3, 4
+; CHECK-NEXT: sub 3, 5, 3
+; CHECK-NEXT: xxmrghd 34, 0, 2
+; CHECK-NEXT: mtfprd 0, 3
+; CHECK-NEXT: xxmrghd 35, 1, 0
+; CHECK-NEXT: blr
+ %res = call <4 x i64> @llvm.masked.srem(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
+ ret <4 x i64> %res
+}
+
+; Widening
+define <2 x i32> @srem_v2i32(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m) {
+; CHECK-LABEL: srem_v2i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: addis 3, 2, .LCPI3_0@toc@ha
+; CHECK-NEXT: xxlxor 32, 32, 32
+; CHECK-NEXT: xxsldwi 1, 34, 34, 1
+; CHECK-NEXT: addi 3, 3, .LCPI3_0@toc@l
+; CHECK-NEXT: mffprwz 4, 1
+; CHECK-NEXT: xxswapd 4, 34
+; CHECK-NEXT: xxsldwi 6, 34, 34, 3
+; CHECK-NEXT: lxvd2x 0, 0, 3
+; CHECK-NEXT: xxswapd 37, 0
+; CHECK-NEXT: vperm 4, 0, 4, 5
+; CHECK-NEXT: xxleqv 32, 32, 32
+; CHECK-NEXT: vspltisw 5, 1
+; CHECK-NEXT: vslw 4, 4, 0
+; CHECK-NEXT: vsraw 4, 4, 0
+; CHECK-NEXT: xxsel 0, 37, 35, 36
+; CHECK-NEXT: xxsldwi 2, 0, 0, 1
+; CHECK-NEXT: xxswapd 3, 0
+; CHECK-NEXT: xxsldwi 5, 0, 0, 3
+; CHECK-NEXT: mffprwz 3, 2
+; CHECK-NEXT: mffprwz 5, 3
+; CHECK-NEXT: divw 6, 4, 3
+; CHECK-NEXT: mullw 3, 6, 3
+; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 4
+; CHECK-NEXT: divw 6, 4, 5
+; CHECK-NEXT: mullw 5, 6, 5
+; CHECK-NEXT: sub 4, 4, 5
+; CHECK-NEXT: rldimi 4, 3, 32, 0
+; CHECK-NEXT: mffprwz 3, 5
+; CHECK-NEXT: mtfprd 1, 4
+; CHECK-NEXT: mffprwz 4, 6
+; CHECK-NEXT: divw 5, 4, 3
+; CHECK-NEXT: mullw 3, 5, 3
+; CHECK-NEXT: mfvsrwz 5, 34
+; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 0
+; CHECK-NEXT: divw 6, 5, 4
+; CHECK-NEXT: mullw 4, 6, 4
+; CHECK-NEXT: sub 4, 5, 4
+; CHECK-NEXT: rldimi 4, 3, 32, 0
+; CHECK-NEXT: mtfprd 0, 4
+; CHECK-NEXT: xxmrghd 34, 0, 1
+; CHECK-NEXT: blr
+ %res = call <2 x i32> @llvm.masked.srem(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m)
+ ret <2 x i32> %res
+}
+
+; Promotion
+define <4 x i16> @srem_v4i16(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m) {
+; CHECK-LABEL: srem_v4i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: xxswapd 0, 36
+; CHECK-NEXT: xxsldwi 1, 36, 36, 1
+; CHECK-NEXT: mfvsrwz 3, 36
+; CHECK-NEXT: li 7, 0
+; CHECK-NEXT: xxsldwi 2, 36, 36, 3
+; CHECK-NEXT: std 25, -56(1) # 8-byte Folded Spill
+; CHECK-NEXT: std 30, -16(1) # 8-byte Folded Spill
+; CHECK-NEXT: std 29, -24(1) # 8-byte Folded Spill
+; CHECK-NEXT: mffprwz 4, 0
+; CHECK-NEXT: mffprwz 5, 1
+; CHECK-NEXT: std 28, -32(1) # 8-byte Folded Spill
+; CHECK-NEXT: std 27, -40(1) # 8-byte Folded Spill
+; CHECK-NEXT: mffprwz 6, 2
+; CHECK-NEXT: std 26, -48(1) # 8-byte Folded Spill
+; CHECK-NEXT: mtvsrd 36, 3
+; CHECK-NEXT: mfvsrd 3, 34
+; CHECK-NEXT: mtvsrd 37, 4
+; CHECK-NEXT: mtvsrd 32, 5
+; CHECK-NEXT: rldicl 8, 3, 48, 48
+; CHECK-NEXT: rldicl 9, 3, 32, 48
+; CHECK-NEXT: extsh 8, 8
+; CHECK-NEXT: extsh 9, 9
+; CHECK-NEXT: vmrghh 5, 0, 5
+; CHECK-NEXT: mtvsrd 32, 6
+; CHECK-NEXT: vmrghh 4, 0, 4
+; CHECK-NEXT: mtvsrd 32, 7
+; CHECK-NEXT: clrldi 7, 3, 48
+; CHECK-NEXT: rldicl 3, 3, 16, 48
+; CHECK-NEXT: extsh 7, 7
+; CHECK-NEXT: extsh 3, 3
+; CHECK-NEXT: xxmrglw 1, 36, 37
+; CHECK-NEXT: vspltish 4, 15
+; CHECK-NEXT: vsplth 0, 0, 3
+; CHECK-NEXT: xxspltw 0, 32, 3
+; CHECK-NEXT: vspltish 0, 1
+; CHECK-NEXT: xxmrgld 37, 0, 1
+; CHECK-NEXT: xxswapd 1, 34
+; CHECK-NEXT: vslh 5, 5, 4
+; CHECK-NEXT: mffprd 4, 1
+; CHECK-NEXT: vsrah 4, 5, 4
+; CHECK-NEXT: clrldi 10, 4, 48
+; CHECK-NEXT: rldicl 11, 4, 48, 48
+; CHECK-NEXT: rldicl 12, 4, 32, 48
+; CHECK-NEXT: rldicl 4, 4, 16, 48
+; CHECK-NEXT: extsh 10, 10
+; CHECK-NEXT: extsh 11, 11
+; CHECK-NEXT: extsh 12, 12
+; CHECK-NEXT: extsh 4, 4
+; CHECK-NEXT: xxsel 0, 32, 35, 36
+; CHECK-NEXT: mffprd 5, 0
+; CHECK-NEXT: clrldi 0, 5, 48
+; CHECK-NEXT: rldicl 30, 5, 48, 48
+; CHECK-NEXT: rldicl 29, 5, 32, 48
+; CHECK-NEXT: rldicl 5, 5, 16, 48
+; CHECK-NEXT: extsh 0, 0
+; CHECK-NEXT: extsh 30, 30
+; CHECK-NEXT: extsh 29, 29
+; CHECK-NEXT: extsh 5, 5
+; CHECK-NEXT: xxswapd 2, 0
+; CHECK-NEXT: mffprd 6, 2
+; CHECK-NEXT: clrldi 28, 6, 48
+; CHECK-NEXT: rldicl 27, 6, 48, 48
+; CHECK-NEXT: rldicl 26, 6, 32, 48
+; CHECK-NEXT: rldicl 6, 6, 16, 48
+; CHECK-NEXT: divw 25, 7, 0
+; CHECK-NEXT: extsh 28, 28
+; CHECK-NEXT: extsh 27, 27
+; CHECK-NEXT: extsh 26, 26
+; CHECK-NEXT: extsh 6, 6
+; CHECK-NEXT: mullw 0, 25, 0
+; CHECK-NEXT: divw 25, 8, 30
+; CHECK-NEXT: sub 7, 7, 0
+; CHECK-NEXT: mtvsrd 34, 7
+; CHECK-NEXT: mullw 30, 25, 30
+; CHECK-NEXT: divw 25, 9, 29
+; CHECK-NEXT: sub 8, 8, 30
+; CHECK-NEXT: ld 30, -16(1) # 8-byte Folded Reload
+; CHECK-NEXT: mtvsrd 35, 8
+; CHECK-NEXT: mullw 29, 25, 29
+; CHECK-NEXT: divw 25, 3, 5
+; CHECK-NEXT: sub 9, 9, 29
+; CHECK-NEXT: ld 29, -24(1) # 8-byte Folded Reload
+; CHECK-NEXT: mtvsrd 36, 9
+; CHECK-NEXT: mullw 5, 25, 5
+; CHECK-NEXT: divw 25, 10, 28
+; CHECK-NEXT: sub 3, 3, 5
+; CHECK-NEXT: mtvsrd 37, 3
+; CHECK-NEXT: mullw 28, 25, 28
+; CHECK-NEXT: divw 25, 11, 27
+; CHECK-NEXT: sub 3, 10, 28
+; CHECK-NEXT: ld 28, -32(1) # 8-byte Folded Reload
+; CHECK-NEXT: mullw 27, 25, 27
+; CHECK-NEXT: divw 25, 12, 26
+; CHECK-NEXT: sub 5, 11, 27
+; CHECK-NEXT: ld 27, -40(1) # 8-byte Folded Reload
+; CHECK-NEXT: mullw 26, 25, 26
+; CHECK-NEXT: divw 25, 4, 6
+; CHECK-NEXT: sub 7, 12, 26
+; CHECK-NEXT: ld 26, -48(1) # 8-byte Folded Reload
+; CHECK-NEXT: mullw 6, 25, 6
+; CHECK-NEXT: ld 25, -56(1) # 8-byte Folded Reload
+; CHECK-NEXT: vmrghh 2, 3, 2
+; CHECK-NEXT: vmrghh 3, 5, 4
+; CHECK-NEXT: mtvsrd 36, 3
+; CHECK-NEXT: mtvsrd 37, 5
+; CHECK-NEXT: sub 3, 4, 6
+; CHECK-NEXT: mtvsrd 32, 3
+; CHECK-NEXT: xxmrglw 0, 35, 34
+; CHECK-NEXT: vmrghh 4, 5, 4
+; CHECK-NEXT: mtvsrd 37, 7
+; CHECK-NEXT: vmrghh 5, 0, 5
+; CHECK-NEXT: xxmrglw 1, 37, 36
+; CHECK-NEXT: xxmrgld 34, 0, 1
+; CHECK-NEXT: blr
+ %res = call <4 x i16> @llvm.masked.srem(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m)
+ ret <4 x i16> %res
+}
+
+; Scalarization
+define <1 x i64> @srem_v1i64(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m) {
+; CHECK-LABEL: srem_v1i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: andi. 5, 5, 1
+; CHECK-NEXT: li 5, 1
+; CHECK-NEXT: iselgt 4, 4, 5
+; CHECK-NEXT: divd 5, 3, 4
+; CHECK-NEXT: mulld 4, 5, 4
+; CHECK-NEXT: sub 3, 3, 4
+; CHECK-NEXT: blr
+ %res = call <1 x i64> @llvm.masked.srem(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m)
+ ret <1 x i64> %res
+}
+
+; Expansion
+define <2 x i128> @srem_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
+; CHECK-LABEL: srem_v2i128:
+; CHECK: # %bb.0:
+; CHECK-NEXT: mfocrf 12, 32
+; CHECK-NEXT: stw 12, 8(1)
+; CHECK-NEXT: mflr 0
+; CHECK-NEXT: stdu 1, -128(1)
+; CHECK-NEXT: std 0, 144(1)
+; CHECK-NEXT: .cfi_def_cfa_offset 128
+; CHECK-NEXT: .cfi_offset lr, 16
+; CHECK-NEXT: .cfi_offset r29, -24
+; CHECK-NEXT: .cfi_offset r30, -16
+; CHECK-NEXT: .cfi_offset cr2, 8
+; CHECK-NEXT: .cfi_offset v29, -80
+; CHECK-NEXT: .cfi_offset v30, -64
+; CHECK-NEXT: .cfi_offset v31, -48
+; CHECK-NEXT: li 3, 48
+; CHECK-NEXT: xxswapd 0, 38
+; CHECK-NEXT: xxswapd 1, 37
+; CHECK-NEXT: std 30, 112(1) # 8-byte Folded Spill
+; CHECK-NEXT: li 30, 1
+; CHECK-NEXT: std 29, 104(1) # 8-byte Folded Spill
+; CHECK-NEXT: li 29, 0
+; CHECK-NEXT: mfvsrd 4, 35
+; CHECK-NEXT: stvx 29, 1, 3 # 16-byte Folded Spill
+; CHECK-NEXT: li 3, 64
+; CHECK-NEXT: stvx 30, 1, 3 # 16-byte Folded Spill
+; CHECK-NEXT: li 3, 80
+; CHECK-NEXT: vmr 30, 2
+; CHECK-NEXT: stvx 31, 1, 3 # 16-byte Folded Spill
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: vmr 31, 4
+; CHECK-NEXT: andi. 3, 3, 1
+; CHECK-NEXT: mfvsrd 3, 38
+; CHECK-NEXT: crmove 8, 1
+; CHECK-NEXT: andi. 3, 3, 1
+; CHECK-NEXT: mffprd 3, 1
+; CHECK-NEXT: iselgt 5, 3, 30
+; CHECK-NEXT: mfvsrd 3, 37
+; CHECK-NEXT: iselgt 6, 3, 29
+; CHECK-NEXT: xxswapd 0, 35
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: bl __modti3
+; CHECK-NEXT: nop
+; CHECK-NEXT: xxswapd 0, 63
+; CHECK-NEXT: mtfprd 1, 3
+; CHECK-NEXT: mtfprd 2, 4
+; CHECK-NEXT: mfvsrd 4, 62
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: isel 5, 3, 30, 8
+; CHECK-NEXT: mfvsrd 3, 63
+; CHECK-NEXT: isel 6, 3, 29, 8
+; CHECK-NEXT: xxswapd 0, 62
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: xxmrghd 61, 2, 1
+; CHECK-NEXT: bl __modti3
+; CHECK-NEXT: nop
+; CHECK-NEXT: mtfprd 0, 3
+; CHECK-NEXT: li 3, 80
+; CHECK-NEXT: mtfprd 1, 4
+; CHECK-NEXT: ld 30, 112(1) # 8-byte Folded Reload
+; CHECK-NEXT: vmr 3, 29
+; CHECK-NEXT: ld 29, 104(1) # 8-byte Folded Reload
+; CHECK-NEXT: lvx 31, 1, 3 # 16-byte Folded Reload
+; CHECK-NEXT: li 3, 64
+; CHECK-NEXT: lvx 30, 1, 3 # 16-byte Folded Reload
+; CHECK-NEXT: li 3, 48
+; CHECK-NEXT: lvx 29, 1, 3 # 16-byte Folded Reload
+; CHECK-NEXT: xxmrghd 34, 1, 0
+; CHECK-NEXT: addi 1, 1, 128
+; CHECK-NEXT: ld 0, 16(1)
+; CHECK-NEXT: lwz 12, 8(1)
+; CHECK-NEXT: mtlr 0
+; CHECK-NEXT: mtocrf 32, 12
+; CHECK-NEXT: blr
+ %res = call <2 x i128> @llvm.masked.srem(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m)
+ ret <2 x i128> %res
+}
+
+; Promotion and widening
+define <3 x i10> @srem_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
+; CHECK-LABEL: srem_v3i10:
+; CHECK: # %bb.0:
+; CHECK-NEXT: mtfprwz 0, 9
+; CHECK-NEXT: mtfprwz 1, 10
+; CHECK-NEXT: addis 9, 2, .LCPI7_0@toc@ha
+; CHECK-NEXT: xxleqv 38, 38, 38
+; CHECK-NEXT: vspltisw 5, 11
+; CHECK-NEXT: addi 9, 9, .LCPI7_0@toc@l
+; CHECK-NEXT: vadduwm 5, 5, 5
+; CHECK-NEXT: xxmrghw 35, 1, 0
+; CHECK-NEXT: lxvd2x 0, 0, 9
+; CHECK-NEXT: mtfprwz 1, 7
+; CHECK-NEXT: xxswapd 34, 0
+; CHECK-NEXT: mtfprwz 0, 6
+; CHECK-NEXT: xxmrghw 36, 1, 0
+; CHECK-NEXT: mtfprwz 0, 3
+; CHECK-NEXT: lbz 3, 96(1)
+; CHECK-NEXT: mtfprwz 1, 4
+; CHECK-NEXT: mtvsrwz 33, 3
+; CHECK-NEXT: xxmrghw 32, 1, 0
+; CHECK-NEXT: vperm 3, 1, 3, 2
+; CHECK-NEXT: mtvsrwz 33, 8
+; CHECK-NEXT: vslw 3, 3, 6
+; CHECK-NEXT: vsraw 3, 3, 6
+; CHECK-NEXT: vperm 4, 1, 4, 2
+; CHECK-NEXT: mtvsrwz 33, 5
+; CHECK-NEXT: vslw 4, 4, 5
+; CHECK-NEXT: vsraw 4, 4, 5
+; CHECK-NEXT: vperm 0, 1, 0, 2
+; CHECK-NEXT: vspltisw 1, 1
+; CHECK-NEXT: xxsel 0, 33, 36, 35
+; CHECK-NEXT: vslw 3, 0, 5
+; CHECK-NEXT: vsraw 3, 3, 5
+; CHECK-NEXT: xxswapd 1, 0
+; CHECK-NEXT: xxsldwi 3, 0, 0, 1
+; CHECK-NEXT: mffprwz 3, 1
+; CHECK-NEXT: xxswapd 2, 35
+; CHECK-NEXT: xxsldwi 4, 35, 35, 1
+; CHECK-NEXT: mffprwz 4, 2
+; CHECK-NEXT: divw 5, 4, 3
+; CHECK-NEXT: mullw 3, 5, 3
+; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 4
+; CHECK-NEXT: mtfprwz 1, 3
+; CHECK-NEXT: mffprwz 3, 3
+; CHECK-NEXT: divw 5, 4, 3
+; CHECK-NEXT: mullw 3, 5, 3
+; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: mfvsrwz 4, 35
+; CHECK-NEXT: mtfprwz 2, 3
+; CHECK-NEXT: mffprwz 3, 0
+; CHECK-NEXT: divw 5, 4, 3
+; CHECK-NEXT: mullw 3, 5, 3
+; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: mtvsrwz 35, 3
+; CHECK-NEXT: xxmrghw 36, 2, 1
+; CHECK-NEXT: vperm 2, 3, 4, 2
+; CHECK-NEXT: mfvsrwz 5, 34
+; CHECK-NEXT: xxswapd 0, 34
+; CHECK-NEXT: xxsldwi 1, 34, 34, 1
+; CHECK-NEXT: mffprwz 3, 0
+; CHECK-NEXT: mffprwz 4, 1
+; CHECK-NEXT: blr
+ %res = call <3 x i10> @llvm.masked.srem(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m)
+ ret <3 x i10> %res
+}
diff --git a/llvm/test/CodeGen/PowerPC/masked-udiv.ll b/llvm/test/CodeGen/PowerPC/masked-udiv.ll
new file mode 100644
index 0000000000000..c0d4fd8f4ddc2
--- /dev/null
+++ b/llvm/test/CodeGen/PowerPC/masked-udiv.ll
@@ -0,0 +1,397 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=powerpc64le < %s | FileCheck %s
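+; Note: as with the signed tests, inactive lanes receive a divisor of 1 via a
+; vector select before the unsigned divisions (divwu/divdu) are emitted.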
+
+; Legal
+define <4 x i32> @udiv_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
+; CHECK-LABEL: udiv_v4i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: xxleqv 32, 32, 32
+; CHECK-NEXT: vspltisw 5, 1
+; CHECK-NEXT: xxsldwi 1, 34, 34, 1
+; CHECK-NEXT: vslw 4, 4, 0
+; CHECK-NEXT: xxswapd 4, 34
+; CHECK-NEXT: xxsldwi 6, 34, 34, 3
+; CHECK-NEXT: mffprwz 4, 1
+; CHECK-NEXT: vsraw 4, 4, 0
+; CHECK-NEXT: xxsel 0, 37, 35, 36
+; CHECK-NEXT: xxsldwi 2, 0, 0, 1
+; CHECK-NEXT: xxswapd 3, 0
+; CHECK-NEXT: xxsldwi 5, 0, 0, 3
+; CHECK-NEXT: mffprwz 3, 2
+; CHECK-NEXT: mffprwz 5, 3
+; CHECK-NEXT: divwu 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 4
+; CHECK-NEXT: divwu 4, 4, 5
+; CHECK-NEXT: mfvsrwz 5, 34
+; CHECK-NEXT: rldimi 4, 3, 32, 0
+; CHECK-NEXT: mffprwz 3, 5
+; CHECK-NEXT: mtfprd 1, 4
+; CHECK-NEXT: mffprwz 4, 6
+; CHECK-NEXT: divwu 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 0
+; CHECK-NEXT: divwu 4, 5, 4
+; CHECK-NEXT: rldimi 4, 3, 32, 0
+; CHECK-NEXT: mtfprd 0, 4
+; CHECK-NEXT: xxmrghd 34, 0, 1
+; CHECK-NEXT: blr
+ %res = call <4 x i32> @llvm.masked.udiv(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
+ ret <4 x i32> %res
+}
+
+define <2 x i64> @udiv_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
+; CHECK-LABEL: udiv_v2i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: xxleqv 32, 32, 32
+; CHECK-NEXT: vspltisw 5, 1
+; CHECK-NEXT: mfvsrd 4, 34
+; CHECK-NEXT: xxswapd 2, 34
+; CHECK-NEXT: vsld 4, 4, 0
+; CHECK-NEXT: vsrad 4, 4, 0
+; CHECK-NEXT: vupklsw 5, 5
+; CHECK-NEXT: xxsel 0, 37, 35, 36
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: divdu 3, 4, 3
+; CHECK-NEXT: mffprd 4, 2
+; CHECK-NEXT: xxswapd 1, 0
+; CHECK-NEXT: mtfprd 0, 3
+; CHECK-NEXT: mffprd 3, 1
+; CHECK-NEXT: divdu 3, 4, 3
+; CHECK-NEXT: mtfprd 1, 3
+; CHECK-NEXT: xxmrghd 34, 0, 1
+; CHECK-NEXT: blr
+ %res = call <2 x i64> @llvm.masked.udiv(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m)
+ ret <2 x i64> %res
+}
+
+; Splitting
+define <4 x i64> @udiv_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
+; CHECK-LABEL: udiv_v4i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: xxmrglw 32, 38, 38
+; CHECK-NEXT: xxleqv 39, 39, 39
+; CHECK-NEXT: xxmrghw 33, 38, 38
+; CHECK-NEXT: mfvsrd 3, 34
+; CHECK-NEXT: vspltisw 6, 1
+; CHECK-NEXT: mfvsrd 4, 35
+; CHECK-NEXT: xxswapd 2, 34
+; CHECK-NEXT: xxswapd 4, 35
+; CHECK-NEXT: vsld 0, 0, 7
+; CHECK-NEXT: mffprd 5, 2
+; CHECK-NEXT: vsrad 0, 0, 7
+; CHECK-NEXT: vupklsw 6, 6
+; CHECK-NEXT: xxsel 0, 38, 36, 32
+; CHECK-NEXT: vsld 4, 1, 7
+; CHECK-NEXT: mffprd 6, 0
+; CHECK-NEXT: vsrad 4, 4, 7
+; CHECK-NEXT: divdu 3, 3, 6
+; CHECK-NEXT: xxswapd 3, 0
+; CHECK-NEXT: mtfprd 0, 3
+; CHECK-NEXT: xxsel 1, 38, 37, 36
+; CHECK-NEXT: mffprd 6, 1
+; CHECK-NEXT: divdu 4, 4, 6
+; CHECK-NEXT: mffprd 6, 3
+; CHECK-NEXT: divdu 5, 5, 6
+; CHECK-NEXT: mtfprd 2, 5
+; CHECK-NEXT: xxswapd 5, 1
+; CHECK-NEXT: mtfprd 1, 4
+; CHECK-NEXT: mffprd 3, 5
+; CHECK-NEXT: mffprd 4, 4
+; CHECK-NEXT: divdu 3, 4, 3
+; CHECK-NEXT: xxmrghd 34, 0, 2
+; CHECK-NEXT: mtfprd 0, 3
+; CHECK-NEXT: xxmrghd 35, 1, 0
+; CHECK-NEXT: blr
+ %res = call <4 x i64> @llvm.masked.udiv(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
+ ret <4 x i64> %res
+}
+
+; Widening
+define <2 x i32> @udiv_v2i32(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m) {
+; CHECK-LABEL: udiv_v2i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: addis 3, 2, .LCPI3_0@toc@ha
+; CHECK-NEXT: xxlxor 32, 32, 32
+; CHECK-NEXT: xxsldwi 1, 34, 34, 1
+; CHECK-NEXT: addi 3, 3, .LCPI3_0@toc@l
+; CHECK-NEXT: mffprwz 4, 1
+; CHECK-NEXT: xxswapd 4, 34
+; CHECK-NEXT: xxsldwi 6, 34, 34, 3
+; CHECK-NEXT: lxvd2x 0, 0, 3
+; CHECK-NEXT: xxswapd 37, 0
+; CHECK-NEXT: vperm 4, 0, 4, 5
+; CHECK-NEXT: xxleqv 32, 32, 32
+; CHECK-NEXT: vspltisw 5, 1
+; CHECK-NEXT: vslw 4, 4, 0
+; CHECK-NEXT: vsraw 4, 4, 0
+; CHECK-NEXT: xxsel 0, 37, 35, 36
+; CHECK-NEXT: xxsldwi 2, 0, 0, 1
+; CHECK-NEXT: xxswapd 3, 0
+; CHECK-NEXT: xxsldwi 5, 0, 0, 3
+; CHECK-NEXT: mffprwz 3, 2
+; CHECK-NEXT: mffprwz 5, 3
+; CHECK-NEXT: divwu 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 4
+; CHECK-NEXT: divwu 4, 4, 5
+; CHECK-NEXT: mfvsrwz 5, 34
+; CHECK-NEXT: rldimi 4, 3, 32, 0
+; CHECK-NEXT: mffprwz 3, 5
+; CHECK-NEXT: mtfprd 1, 4
+; CHECK-NEXT: mffprwz 4, 6
+; CHECK-NEXT: divwu 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 0
+; CHECK-NEXT: divwu 4, 5, 4
+; CHECK-NEXT: rldimi 4, 3, 32, 0
+; CHECK-NEXT: mtfprd 0, 4
+; CHECK-NEXT: xxmrghd 34, 0, 1
+; CHECK-NEXT: blr
+ %res = call <2 x i32> @llvm.masked.udiv(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m)
+ ret <2 x i32> %res
+}
+
+; Promotion
+define <4 x i16> @udiv_v4i16(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m) {
+; CHECK-LABEL: udiv_v4i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: xxswapd 0, 36
+; CHECK-NEXT: xxsldwi 1, 36, 36, 1
+; CHECK-NEXT: mfvsrwz 3, 36
+; CHECK-NEXT: li 7, 0
+; CHECK-NEXT: xxsldwi 2, 36, 36, 3
+; CHECK-NEXT: mffprwz 4, 0
+; CHECK-NEXT: mffprwz 5, 1
+; CHECK-NEXT: mffprwz 6, 2
+; CHECK-NEXT: mtvsrd 36, 3
+; CHECK-NEXT: mtvsrd 37, 4
+; CHECK-NEXT: mtvsrd 32, 5
+; CHECK-NEXT: mfvsrd 5, 34
+; CHECK-NEXT: rldicl 8, 5, 48, 48
+; CHECK-NEXT: rldicl 9, 5, 32, 48
+; CHECK-NEXT: clrlwi 8, 8, 16
+; CHECK-NEXT: clrlwi 9, 9, 16
+; CHECK-NEXT: vmrghh 5, 0, 5
+; CHECK-NEXT: mtvsrd 32, 6
+; CHECK-NEXT: vmrghh 4, 0, 4
+; CHECK-NEXT: mtvsrd 32, 7
+; CHECK-NEXT: clrldi 7, 5, 48
+; CHECK-NEXT: rldicl 5, 5, 16, 48
+; CHECK-NEXT: clrlwi 7, 7, 16
+; CHECK-NEXT: clrlwi 5, 5, 16
+; CHECK-NEXT: xxmrglw 1, 36, 37
+; CHECK-NEXT: vspltish 4, 15
+; CHECK-NEXT: vsplth 0, 0, 3
+; CHECK-NEXT: xxspltw 0, 32, 3
+; CHECK-NEXT: vspltish 0, 1
+; CHECK-NEXT: xxmrgld 37, 0, 1
+; CHECK-NEXT: xxswapd 1, 34
+; CHECK-NEXT: vslh 5, 5, 4
+; CHECK-NEXT: mffprd 3, 1
+; CHECK-NEXT: vsrah 4, 5, 4
+; CHECK-NEXT: clrldi 10, 3, 48
+; CHECK-NEXT: rldicl 11, 3, 48, 48
+; CHECK-NEXT: clrlwi 10, 10, 16
+; CHECK-NEXT: clrlwi 11, 11, 16
+; CHECK-NEXT: xxsel 0, 32, 35, 36
+; CHECK-NEXT: mffprd 6, 0
+; CHECK-NEXT: clrldi 12, 6, 48
+; CHECK-NEXT: clrlwi 12, 12, 16
+; CHECK-NEXT: divwu 7, 7, 12
+; CHECK-NEXT: rldicl 12, 6, 48, 48
+; CHECK-NEXT: clrlwi 12, 12, 16
+; CHECK-NEXT: divwu 8, 8, 12
+; CHECK-NEXT: xxswapd 2, 0
+; CHECK-NEXT: mffprd 4, 2
+; CHECK-NEXT: rldicl 12, 6, 32, 48
+; CHECK-NEXT: rldicl 6, 6, 16, 48
+; CHECK-NEXT: clrlwi 6, 6, 16
+; CHECK-NEXT: clrlwi 12, 12, 16
+; CHECK-NEXT: divwu 5, 5, 6
+; CHECK-NEXT: clrldi 6, 4, 48
+; CHECK-NEXT: divwu 9, 9, 12
+; CHECK-NEXT: rldicl 12, 3, 32, 48
+; CHECK-NEXT: rldicl 3, 3, 16, 48
+; CHECK-NEXT: clrlwi 6, 6, 16
+; CHECK-NEXT: clrlwi 12, 12, 16
+; CHECK-NEXT: clrlwi 3, 3, 16
+; CHECK-NEXT: divwu 6, 10, 6
+; CHECK-NEXT: rldicl 10, 4, 48, 48
+; CHECK-NEXT: clrlwi 10, 10, 16
+; CHECK-NEXT: mtvsrd 34, 7
+; CHECK-NEXT: divwu 10, 11, 10
+; CHECK-NEXT: rldicl 11, 4, 32, 48
+; CHECK-NEXT: rldicl 4, 4, 16, 48
+; CHECK-NEXT: clrlwi 11, 11, 16
+; CHECK-NEXT: clrlwi 4, 4, 16
+; CHECK-NEXT: mtvsrd 35, 8
+; CHECK-NEXT: divwu 11, 12, 11
+; CHECK-NEXT: divwu 3, 3, 4
+; CHECK-NEXT: mtvsrd 36, 9
+; CHECK-NEXT: mtvsrd 37, 5
+; CHECK-NEXT: mtvsrd 32, 6
+; CHECK-NEXT: vmrghh 2, 3, 2
+; CHECK-NEXT: vmrghh 3, 5, 4
+; CHECK-NEXT: mtvsrd 36, 10
+; CHECK-NEXT: mtvsrd 37, 11
+; CHECK-NEXT: xxmrglw 0, 35, 34
+; CHECK-NEXT: vmrghh 4, 4, 0
+; CHECK-NEXT: mtvsrd 32, 3
+; CHECK-NEXT: vmrghh 5, 0, 5
+; CHECK-NEXT: xxmrglw 1, 37, 36
+; CHECK-NEXT: xxmrgld 34, 0, 1
+; CHECK-NEXT: blr
+ %res = call <4 x i16> @llvm.masked.udiv(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m)
+ ret <4 x i16> %res
+}
+
+; Scalarization
+define <1 x i64> @udiv_v1i64(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m) {
+; CHECK-LABEL: udiv_v1i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: andi. 5, 5, 1
+; CHECK-NEXT: li 5, 1
+; CHECK-NEXT: iselgt 4, 4, 5
+; CHECK-NEXT: divdu 3, 3, 4
+; CHECK-NEXT: blr
+ %res = call <1 x i64> @llvm.masked.udiv(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m)
+ ret <1 x i64> %res
+}
+
+; Expansion
+define <2 x i128> @udiv_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
+; CHECK-LABEL: udiv_v2i128:
+; CHECK: # %bb.0:
+; CHECK-NEXT: mfocrf 12, 32
+; CHECK-NEXT: stw 12, 8(1)
+; CHECK-NEXT: mflr 0
+; CHECK-NEXT: stdu 1, -128(1)
+; CHECK-NEXT: std 0, 144(1)
+; CHECK-NEXT: .cfi_def_cfa_offset 128
+; CHECK-NEXT: .cfi_offset lr, 16
+; CHECK-NEXT: .cfi_offset r29, -24
+; CHECK-NEXT: .cfi_offset r30, -16
+; CHECK-NEXT: .cfi_offset cr2, 8
+; CHECK-NEXT: .cfi_offset v29, -80
+; CHECK-NEXT: .cfi_offset v30, -64
+; CHECK-NEXT: .cfi_offset v31, -48
+; CHECK-NEXT: li 3, 48
+; CHECK-NEXT: xxswapd 0, 38
+; CHECK-NEXT: xxswapd 1, 37
+; CHECK-NEXT: std 30, 112(1) # 8-byte Folded Spill
+; CHECK-NEXT: li 30, 1
+; CHECK-NEXT: std 29, 104(1) # 8-byte Folded Spill
+; CHECK-NEXT: li 29, 0
+; CHECK-NEXT: mfvsrd 4, 35
+; CHECK-NEXT: stvx 29, 1, 3 # 16-byte Folded Spill
+; CHECK-NEXT: li 3, 64
+; CHECK-NEXT: stvx 30, 1, 3 # 16-byte Folded Spill
+; CHECK-NEXT: li 3, 80
+; CHECK-NEXT: vmr 30, 2
+; CHECK-NEXT: stvx 31, 1, 3 # 16-byte Folded Spill
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: vmr 31, 4
+; CHECK-NEXT: andi. 3, 3, 1
+; CHECK-NEXT: mfvsrd 3, 38
+; CHECK-NEXT: crmove 8, 1
+; CHECK-NEXT: andi. 3, 3, 1
+; CHECK-NEXT: mffprd 3, 1
+; CHECK-NEXT: iselgt 5, 3, 30
+; CHECK-NEXT: mfvsrd 3, 37
+; CHECK-NEXT: iselgt 6, 3, 29
+; CHECK-NEXT: xxswapd 0, 35
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: bl __udivti3
+; CHECK-NEXT: nop
+; CHECK-NEXT: xxswapd 0, 63
+; CHECK-NEXT: mtfprd 1, 3
+; CHECK-NEXT: mtfprd 2, 4
+; CHECK-NEXT: mfvsrd 4, 62
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: isel 5, 3, 30, 8
+; CHECK-NEXT: mfvsrd 3, 63
+; CHECK-NEXT: isel 6, 3, 29, 8
+; CHECK-NEXT: xxswapd 0, 62
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: xxmrghd 61, 2, 1
+; CHECK-NEXT: bl __udivti3
+; CHECK-NEXT: nop
+; CHECK-NEXT: mtfprd 0, 3
+; CHECK-NEXT: li 3, 80
+; CHECK-NEXT: mtfprd 1, 4
+; CHECK-NEXT: ld 30, 112(1) # 8-byte Folded Reload
+; CHECK-NEXT: vmr 3, 29
+; CHECK-NEXT: ld 29, 104(1) # 8-byte Folded Reload
+; CHECK-NEXT: lvx 31, 1, 3 # 16-byte Folded Reload
+; CHECK-NEXT: li 3, 64
+; CHECK-NEXT: lvx 30, 1, 3 # 16-byte Folded Reload
+; CHECK-NEXT: li 3, 48
+; CHECK-NEXT: lvx 29, 1, 3 # 16-byte Folded Reload
+; CHECK-NEXT: xxmrghd 34, 1, 0
+; CHECK-NEXT: addi 1, 1, 128
+; CHECK-NEXT: ld 0, 16(1)
+; CHECK-NEXT: lwz 12, 8(1)
+; CHECK-NEXT: mtlr 0
+; CHECK-NEXT: mtocrf 32, 12
+; CHECK-NEXT: blr
+ %res = call <2 x i128> @llvm.masked.udiv(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m)
+ ret <2 x i128> %res
+}
+
+; Promotion and widening
+define <3 x i10> @udiv_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
+; CHECK-LABEL: udiv_v3i10:
+; CHECK: # %bb.0:
+; CHECK-NEXT: mtfprwz 0, 9
+; CHECK-NEXT: mtfprwz 1, 10
+; CHECK-NEXT: addis 9, 2, .LCPI7_0@toc@ha
+; CHECK-NEXT: addi 9, 9, .LCPI7_0@toc@l
+; CHECK-NEXT: mtvsrwz 33, 8
+; CHECK-NEXT: xxmrghw 34, 1, 0
+; CHECK-NEXT: lxvd2x 0, 0, 9
+; CHECK-NEXT: mtfprwz 1, 7
+; CHECK-NEXT: xxswapd 35, 0
+; CHECK-NEXT: mtfprwz 0, 6
+; CHECK-NEXT: lbz 6, 96(1)
+; CHECK-NEXT: mtvsrwz 37, 6
+; CHECK-NEXT: xxmrghw 36, 1, 0
+; CHECK-NEXT: mtfprwz 0, 3
+; CHECK-NEXT: mtfprwz 1, 4
+; CHECK-NEXT: vperm 4, 1, 4, 3
+; CHECK-NEXT: mtvsrwz 33, 5
+; CHECK-NEXT: vperm 2, 5, 2, 3
+; CHECK-NEXT: vspltisw 5, -10
+; CHECK-NEXT: vsrw 5, 5, 5
+; CHECK-NEXT: xxmrghw 32, 1, 0
+; CHECK-NEXT: xxland 0, 36, 37
+; CHECK-NEXT: xxleqv 36, 36, 36
+; CHECK-NEXT: vslw 2, 2, 4
+; CHECK-NEXT: vsraw 2, 2, 4
+; CHECK-NEXT: vperm 0, 1, 0, 3
+; CHECK-NEXT: vspltisw 1, 1
+; CHECK-NEXT: xxland 1, 32, 37
+; CHECK-NEXT: xxswapd 3, 1
+; CHECK-NEXT: xxsldwi 5, 1, 1, 1
+; CHECK-NEXT: mffprwz 4, 3
+; CHECK-NEXT: xxsel 0, 33, 0, 34
+; CHECK-NEXT: xxswapd 2, 0
+; CHECK-NEXT: xxsldwi 4, 0, 0, 1
+; CHECK-NEXT: mffprwz 3, 2
+; CHECK-NEXT: divwu 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 4
+; CHECK-NEXT: mtfprwz 2, 3
+; CHECK-NEXT: mffprwz 3, 5
+; CHECK-NEXT: divwu 3, 3, 4
+; CHECK-NEXT: mffprwz 4, 1
+; CHECK-NEXT: mtfprwz 3, 3
+; CHECK-NEXT: mffprwz 3, 0
+; CHECK-NEXT: divwu 3, 4, 3
+; CHECK-NEXT: mtvsrwz 36, 3
+; CHECK-NEXT: xxmrghw 34, 3, 2
+; CHECK-NEXT: vperm 2, 4, 2, 3
+; CHECK-NEXT: mfvsrwz 5, 34
+; CHECK-NEXT: xxswapd 0, 34
+; CHECK-NEXT: xxsldwi 1, 34, 34, 1
+; CHECK-NEXT: mffprwz 3, 0
+; CHECK-NEXT: mffprwz 4, 1
+; CHECK-NEXT: blr
+ %res = call <3 x i10> @llvm.masked.udiv(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m)
+ ret <3 x i10> %res
+}
diff --git a/llvm/test/CodeGen/PowerPC/masked-urem.ll b/llvm/test/CodeGen/PowerPC/masked-urem.ll
new file mode 100644
index 0000000000000..c76d57d572a8c
--- /dev/null
+++ b/llvm/test/CodeGen/PowerPC/masked-urem.ll
@@ -0,0 +1,461 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=powerpc64le < %s | FileCheck %s
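+; Note: inactive lanes again receive a divisor of 1; urem is then expanded to
+; divwu/divdu + mullw/mulld + sub per element.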
+
+; Legal
+define <4 x i32> @urem_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
+; CHECK-LABEL: urem_v4i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: xxleqv 32, 32, 32
+; CHECK-NEXT: vspltisw 5, 1
+; CHECK-NEXT: xxsldwi 1, 34, 34, 1
+; CHECK-NEXT: vslw 4, 4, 0
+; CHECK-NEXT: xxswapd 4, 34
+; CHECK-NEXT: xxsldwi 6, 34, 34, 3
+; CHECK-NEXT: mffprwz 4, 1
+; CHECK-NEXT: vsraw 4, 4, 0
+; CHECK-NEXT: xxsel 0, 37, 35, 36
+; CHECK-NEXT: xxsldwi 2, 0, 0, 1
+; CHECK-NEXT: xxswapd 3, 0
+; CHECK-NEXT: xxsldwi 5, 0, 0, 3
+; CHECK-NEXT: mffprwz 3, 2
+; CHECK-NEXT: mffprwz 5, 3
+; CHECK-NEXT: divwu 6, 4, 3
+; CHECK-NEXT: mullw 3, 6, 3
+; CHECK-NEXT: mffprwz 6, 4
+; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: divwu 4, 6, 5
+; CHECK-NEXT: mullw 4, 4, 5
+; CHECK-NEXT: sub 4, 6, 4
+; CHECK-NEXT: rldimi 4, 3, 32, 0
+; CHECK-NEXT: mffprwz 3, 5
+; CHECK-NEXT: mtfprd 1, 4
+; CHECK-NEXT: mffprwz 4, 6
+; CHECK-NEXT: divwu 5, 4, 3
+; CHECK-NEXT: mullw 3, 5, 3
+; CHECK-NEXT: mfvsrwz 5, 34
+; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 0
+; CHECK-NEXT: divwu 6, 5, 4
+; CHECK-NEXT: mullw 4, 6, 4
+; CHECK-NEXT: sub 4, 5, 4
+; CHECK-NEXT: rldimi 4, 3, 32, 0
+; CHECK-NEXT: mtfprd 0, 4
+; CHECK-NEXT: xxmrghd 34, 0, 1
+; CHECK-NEXT: blr
+ %res = call <4 x i32> @llvm.masked.urem(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
+ ret <4 x i32> %res
+}
+
+define <2 x i64> @urem_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
+; CHECK-LABEL: urem_v2i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: xxleqv 32, 32, 32
+; CHECK-NEXT: vspltisw 5, 1
+; CHECK-NEXT: mfvsrd 4, 34
+; CHECK-NEXT: xxswapd 2, 34
+; CHECK-NEXT: vsld 4, 4, 0
+; CHECK-NEXT: vsrad 4, 4, 0
+; CHECK-NEXT: vupklsw 5, 5
+; CHECK-NEXT: xxsel 0, 37, 35, 36
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: divdu 5, 4, 3
+; CHECK-NEXT: mulld 3, 5, 3
+; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: mffprd 4, 2
+; CHECK-NEXT: xxswapd 1, 0
+; CHECK-NEXT: mtfprd 0, 3
+; CHECK-NEXT: mffprd 3, 1
+; CHECK-NEXT: divdu 5, 4, 3
+; CHECK-NEXT: mulld 3, 5, 3
+; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: mtfprd 1, 3
+; CHECK-NEXT: xxmrghd 34, 0, 1
+; CHECK-NEXT: blr
+ %res = call <2 x i64> @llvm.masked.urem(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m)
+ ret <2 x i64> %res
+}
+
+; Splitting
+define <4 x i64> @urem_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
+; CHECK-LABEL: urem_v4i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: xxmrglw 32, 38, 38
+; CHECK-NEXT: xxleqv 39, 39, 39
+; CHECK-NEXT: xxmrghw 33, 38, 38
+; CHECK-NEXT: mfvsrd 3, 34
+; CHECK-NEXT: vspltisw 6, 1
+; CHECK-NEXT: mfvsrd 4, 35
+; CHECK-NEXT: xxswapd 2, 34
+; CHECK-NEXT: xxswapd 4, 35
+; CHECK-NEXT: vsld 0, 0, 7
+; CHECK-NEXT: mffprd 5, 2
+; CHECK-NEXT: vsrad 0, 0, 7
+; CHECK-NEXT: vupklsw 6, 6
+; CHECK-NEXT: xxsel 0, 38, 36, 32
+; CHECK-NEXT: vsld 4, 1, 7
+; CHECK-NEXT: mffprd 6, 0
+; CHECK-NEXT: vsrad 4, 4, 7
+; CHECK-NEXT: divdu 9, 3, 6
+; CHECK-NEXT: mulld 6, 9, 6
+; CHECK-NEXT: sub 3, 3, 6
+; CHECK-NEXT: xxswapd 3, 0
+; CHECK-NEXT: mffprd 8, 3
+; CHECK-NEXT: mtfprd 0, 3
+; CHECK-NEXT: xxsel 1, 38, 37, 36
+; CHECK-NEXT: mffprd 7, 1
+; CHECK-NEXT: divdu 9, 4, 7
+; CHECK-NEXT: mulld 7, 9, 7
+; CHECK-NEXT: divdu 9, 5, 8
+; CHECK-NEXT: sub 4, 4, 7
+; CHECK-NEXT: mulld 8, 9, 8
+; CHECK-NEXT: sub 3, 5, 8
+; CHECK-NEXT: xxswapd 5, 1
+; CHECK-NEXT: mffprd 5, 4
+; CHECK-NEXT: mtfprd 1, 4
+; CHECK-NEXT: mffprd 4, 5
+; CHECK-NEXT: mtfprd 2, 3
+; CHECK-NEXT: divdu 3, 5, 4
+; CHECK-NEXT: mulld 3, 3, 4
+; CHECK-NEXT: sub 3, 5, 3
+; CHECK-NEXT: xxmrghd 34, 0, 2
+; CHECK-NEXT: mtfprd 0, 3
+; CHECK-NEXT: xxmrghd 35, 1, 0
+; CHECK-NEXT: blr
+ %res = call <4 x i64> @llvm.masked.urem(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
+ ret <4 x i64> %res
+}
+
+; Widening
+define <2 x i32> @urem_v2i32(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m) {
+; CHECK-LABEL: urem_v2i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: addis 3, 2, .LCPI3_0@toc@ha
+; CHECK-NEXT: xxlxor 32, 32, 32
+; CHECK-NEXT: xxsldwi 1, 34, 34, 1
+; CHECK-NEXT: addi 3, 3, .LCPI3_0@toc@l
+; CHECK-NEXT: mffprwz 4, 1
+; CHECK-NEXT: xxswapd 4, 34
+; CHECK-NEXT: xxsldwi 6, 34, 34, 3
+; CHECK-NEXT: lxvd2x 0, 0, 3
+; CHECK-NEXT: xxswapd 37, 0
+; CHECK-NEXT: vperm 4, 0, 4, 5
+; CHECK-NEXT: xxleqv 32, 32, 32
+; CHECK-NEXT: vspltisw 5, 1
+; CHECK-NEXT: vslw 4, 4, 0
+; CHECK-NEXT: vsraw 4, 4, 0
+; CHECK-NEXT: xxsel 0, 37, 35, 36
+; CHECK-NEXT: xxsldwi 2, 0, 0, 1
+; CHECK-NEXT: xxswapd 3, 0
+; CHECK-NEXT: xxsldwi 5, 0, 0, 3
+; CHECK-NEXT: mffprwz 3, 2
+; CHECK-NEXT: mffprwz 5, 3
+; CHECK-NEXT: divwu 6, 4, 3
+; CHECK-NEXT: mullw 3, 6, 3
+; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 4
+; CHECK-NEXT: divwu 6, 4, 5
+; CHECK-NEXT: mullw 5, 6, 5
+; CHECK-NEXT: sub 4, 4, 5
+; CHECK-NEXT: rldimi 4, 3, 32, 0
+; CHECK-NEXT: mffprwz 3, 5
+; CHECK-NEXT: mtfprd 1, 4
+; CHECK-NEXT: mffprwz 4, 6
+; CHECK-NEXT: divwu 5, 4, 3
+; CHECK-NEXT: mullw 3, 5, 3
+; CHECK-NEXT: mfvsrwz 5, 34
+; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 0
+; CHECK-NEXT: divwu 6, 5, 4
+; CHECK-NEXT: mullw 4, 6, 4
+; CHECK-NEXT: sub 4, 5, 4
+; CHECK-NEXT: rldimi 4, 3, 32, 0
+; CHECK-NEXT: mtfprd 0, 4
+; CHECK-NEXT: xxmrghd 34, 0, 1
+; CHECK-NEXT: blr
+ %res = call <2 x i32> @llvm.masked.urem(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m)
+ ret <2 x i32> %res
+}
+
+; Promotion
+define <4 x i16> @urem_v4i16(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m) {
+; CHECK-LABEL: urem_v4i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: xxswapd 0, 36
+; CHECK-NEXT: xxsldwi 1, 36, 36, 1
+; CHECK-NEXT: mfvsrwz 3, 36
+; CHECK-NEXT: li 7, 0
+; CHECK-NEXT: xxsldwi 2, 36, 36, 3
+; CHECK-NEXT: std 25, -56(1) # 8-byte Folded Spill
+; CHECK-NEXT: std 30, -16(1) # 8-byte Folded Spill
+; CHECK-NEXT: std 29, -24(1) # 8-byte Folded Spill
+; CHECK-NEXT: mffprwz 4, 0
+; CHECK-NEXT: mffprwz 5, 1
+; CHECK-NEXT: std 28, -32(1) # 8-byte Folded Spill
+; CHECK-NEXT: std 27, -40(1) # 8-byte Folded Spill
+; CHECK-NEXT: mffprwz 6, 2
+; CHECK-NEXT: std 26, -48(1) # 8-byte Folded Spill
+; CHECK-NEXT: mtvsrd 36, 3
+; CHECK-NEXT: mfvsrd 3, 34
+; CHECK-NEXT: mtvsrd 37, 4
+; CHECK-NEXT: mtvsrd 32, 5
+; CHECK-NEXT: rldicl 8, 3, 48, 48
+; CHECK-NEXT: rldicl 9, 3, 32, 48
+; CHECK-NEXT: clrlwi 8, 8, 16
+; CHECK-NEXT: clrlwi 9, 9, 16
+; CHECK-NEXT: vmrghh 5, 0, 5
+; CHECK-NEXT: mtvsrd 32, 6
+; CHECK-NEXT: vmrghh 4, 0, 4
+; CHECK-NEXT: mtvsrd 32, 7
+; CHECK-NEXT: clrldi 7, 3, 48
+; CHECK-NEXT: rldicl 3, 3, 16, 48
+; CHECK-NEXT: clrlwi 7, 7, 16
+; CHECK-NEXT: clrlwi 3, 3, 16
+; CHECK-NEXT: xxmrglw 1, 36, 37
+; CHECK-NEXT: vspltish 4, 15
+; CHECK-NEXT: vsplth 0, 0, 3
+; CHECK-NEXT: xxspltw 0, 32, 3
+; CHECK-NEXT: vspltish 0, 1
+; CHECK-NEXT: xxmrgld 37, 0, 1
+; CHECK-NEXT: xxswapd 1, 34
+; CHECK-NEXT: vslh 5, 5, 4
+; CHECK-NEXT: mffprd 4, 1
+; CHECK-NEXT: vsrah 4, 5, 4
+; CHECK-NEXT: clrldi 10, 4, 48
+; CHECK-NEXT: rldicl 11, 4, 48, 48
+; CHECK-NEXT: rldicl 12, 4, 32, 48
+; CHECK-NEXT: rldicl 4, 4, 16, 48
+; CHECK-NEXT: clrlwi 10, 10, 16
+; CHECK-NEXT: clrlwi 11, 11, 16
+; CHECK-NEXT: clrlwi 12, 12, 16
+; CHECK-NEXT: clrlwi 4, 4, 16
+; CHECK-NEXT: xxsel 0, 32, 35, 36
+; CHECK-NEXT: mffprd 5, 0
+; CHECK-NEXT: clrldi 0, 5, 48
+; CHECK-NEXT: rldicl 30, 5, 48, 48
+; CHECK-NEXT: rldicl 29, 5, 32, 48
+; CHECK-NEXT: rldicl 5, 5, 16, 48
+; CHECK-NEXT: clrlwi 0, 0, 16
+; CHECK-NEXT: clrlwi 30, 30, 16
+; CHECK-NEXT: clrlwi 29, 29, 16
+; CHECK-NEXT: clrlwi 5, 5, 16
+; CHECK-NEXT: xxswapd 2, 0
+; CHECK-NEXT: mffprd 6, 2
+; CHECK-NEXT: clrldi 28, 6, 48
+; CHECK-NEXT: rldicl 27, 6, 48, 48
+; CHECK-NEXT: rldicl 26, 6, 32, 48
+; CHECK-NEXT: rldicl 6, 6, 16, 48
+; CHECK-NEXT: divwu 25, 7, 0
+; CHECK-NEXT: clrlwi 28, 28, 16
+; CHECK-NEXT: clrlwi 27, 27, 16
+; CHECK-NEXT: clrlwi 26, 26, 16
+; CHECK-NEXT: clrlwi 6, 6, 16
+; CHECK-NEXT: mullw 0, 25, 0
+; CHECK-NEXT: divwu 25, 8, 30
+; CHECK-NEXT: sub 7, 7, 0
+; CHECK-NEXT: mtvsrd 34, 7
+; CHECK-NEXT: mullw 30, 25, 30
+; CHECK-NEXT: divwu 25, 9, 29
+; CHECK-NEXT: sub 8, 8, 30
+; CHECK-NEXT: ld 30, -16(1) # 8-byte Folded Reload
+; CHECK-NEXT: mtvsrd 35, 8
+; CHECK-NEXT: mullw 29, 25, 29
+; CHECK-NEXT: divwu 25, 3, 5
+; CHECK-NEXT: sub 9, 9, 29
+; CHECK-NEXT: ld 29, -24(1) # 8-byte Folded Reload
+; CHECK-NEXT: mtvsrd 36, 9
+; CHECK-NEXT: mullw 5, 25, 5
+; CHECK-NEXT: divwu 25, 10, 28
+; CHECK-NEXT: sub 3, 3, 5
+; CHECK-NEXT: mtvsrd 37, 3
+; CHECK-NEXT: mullw 28, 25, 28
+; CHECK-NEXT: divwu 25, 11, 27
+; CHECK-NEXT: sub 3, 10, 28
+; CHECK-NEXT: ld 28, -32(1) # 8-byte Folded Reload
+; CHECK-NEXT: mullw 27, 25, 27
+; CHECK-NEXT: divwu 25, 12, 26
+; CHECK-NEXT: sub 5, 11, 27
+; CHECK-NEXT: ld 27, -40(1) # 8-byte Folded Reload
+; CHECK-NEXT: mullw 26, 25, 26
+; CHECK-NEXT: divwu 25, 4, 6
+; CHECK-NEXT: sub 7, 12, 26
+; CHECK-NEXT: ld 26, -48(1) # 8-byte Folded Reload
+; CHECK-NEXT: mullw 6, 25, 6
+; CHECK-NEXT: ld 25, -56(1) # 8-byte Folded Reload
+; CHECK-NEXT: vmrghh 2, 3, 2
+; CHECK-NEXT: vmrghh 3, 5, 4
+; CHECK-NEXT: mtvsrd 36, 3
+; CHECK-NEXT: mtvsrd 37, 5
+; CHECK-NEXT: sub 3, 4, 6
+; CHECK-NEXT: mtvsrd 32, 3
+; CHECK-NEXT: xxmrglw 0, 35, 34
+; CHECK-NEXT: vmrghh 4, 5, 4
+; CHECK-NEXT: mtvsrd 37, 7
+; CHECK-NEXT: vmrghh 5, 0, 5
+; CHECK-NEXT: xxmrglw 1, 37, 36
+; CHECK-NEXT: xxmrgld 34, 0, 1
+; CHECK-NEXT: blr
+ %res = call <4 x i16> @llvm.masked.urem(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m)
+ ret <4 x i16> %res
+}
+
+; Scalarization
+define <1 x i64> @urem_v1i64(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m) {
+; CHECK-LABEL: urem_v1i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: andi. 5, 5, 1
+; CHECK-NEXT: li 5, 1
+; CHECK-NEXT: iselgt 4, 4, 5
+; CHECK-NEXT: divdu 5, 3, 4
+; CHECK-NEXT: mulld 4, 5, 4
+; CHECK-NEXT: sub 3, 3, 4
+; CHECK-NEXT: blr
+ %res = call <1 x i64> @llvm.masked.urem(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m)
+ ret <1 x i64> %res
+}
+
+; Expansion
+define <2 x i128> @urem_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
+; CHECK-LABEL: urem_v2i128:
+; CHECK: # %bb.0:
+; CHECK-NEXT: mfocrf 12, 32
+; CHECK-NEXT: stw 12, 8(1)
+; CHECK-NEXT: mflr 0
+; CHECK-NEXT: stdu 1, -128(1)
+; CHECK-NEXT: std 0, 144(1)
+; CHECK-NEXT: .cfi_def_cfa_offset 128
+; CHECK-NEXT: .cfi_offset lr, 16
+; CHECK-NEXT: .cfi_offset r29, -24
+; CHECK-NEXT: .cfi_offset r30, -16
+; CHECK-NEXT: .cfi_offset cr2, 8
+; CHECK-NEXT: .cfi_offset v29, -80
+; CHECK-NEXT: .cfi_offset v30, -64
+; CHECK-NEXT: .cfi_offset v31, -48
+; CHECK-NEXT: li 3, 48
+; CHECK-NEXT: xxswapd 0, 38
+; CHECK-NEXT: xxswapd 1, 37
+; CHECK-NEXT: std 30, 112(1) # 8-byte Folded Spill
+; CHECK-NEXT: li 30, 1
+; CHECK-NEXT: std 29, 104(1) # 8-byte Folded Spill
+; CHECK-NEXT: li 29, 0
+; CHECK-NEXT: mfvsrd 4, 35
+; CHECK-NEXT: stvx 29, 1, 3 # 16-byte Folded Spill
+; CHECK-NEXT: li 3, 64
+; CHECK-NEXT: stvx 30, 1, 3 # 16-byte Folded Spill
+; CHECK-NEXT: li 3, 80
+; CHECK-NEXT: vmr 30, 2
+; CHECK-NEXT: stvx 31, 1, 3 # 16-byte Folded Spill
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: vmr 31, 4
+; CHECK-NEXT: andi. 3, 3, 1
+; CHECK-NEXT: mfvsrd 3, 38
+; CHECK-NEXT: crmove 8, 1
+; CHECK-NEXT: andi. 3, 3, 1
+; CHECK-NEXT: mffprd 3, 1
+; CHECK-NEXT: iselgt 5, 3, 30
+; CHECK-NEXT: mfvsrd 3, 37
+; CHECK-NEXT: iselgt 6, 3, 29
+; CHECK-NEXT: xxswapd 0, 35
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: bl __umodti3
+; CHECK-NEXT: nop
+; CHECK-NEXT: xxswapd 0, 63
+; CHECK-NEXT: mtfprd 1, 3
+; CHECK-NEXT: mtfprd 2, 4
+; CHECK-NEXT: mfvsrd 4, 62
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: isel 5, 3, 30, 8
+; CHECK-NEXT: mfvsrd 3, 63
+; CHECK-NEXT: isel 6, 3, 29, 8
+; CHECK-NEXT: xxswapd 0, 62
+; CHECK-NEXT: mffprd 3, 0
+; CHECK-NEXT: xxmrghd 61, 2, 1
+; CHECK-NEXT: bl __umodti3
+; CHECK-NEXT: nop
+; CHECK-NEXT: mtfprd 0, 3
+; CHECK-NEXT: li 3, 80
+; CHECK-NEXT: mtfprd 1, 4
+; CHECK-NEXT: ld 30, 112(1) # 8-byte Folded Reload
+; CHECK-NEXT: vmr 3, 29
+; CHECK-NEXT: ld 29, 104(1) # 8-byte Folded Reload
+; CHECK-NEXT: lvx 31, 1, 3 # 16-byte Folded Reload
+; CHECK-NEXT: li 3, 64
+; CHECK-NEXT: lvx 30, 1, 3 # 16-byte Folded Reload
+; CHECK-NEXT: li 3, 48
+; CHECK-NEXT: lvx 29, 1, 3 # 16-byte Folded Reload
+; CHECK-NEXT: xxmrghd 34, 1, 0
+; CHECK-NEXT: addi 1, 1, 128
+; CHECK-NEXT: ld 0, 16(1)
+; CHECK-NEXT: lwz 12, 8(1)
+; CHECK-NEXT: mtlr 0
+; CHECK-NEXT: mtocrf 32, 12
+; CHECK-NEXT: blr
+ %res = call <2 x i128> @llvm.masked.urem(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m)
+ ret <2 x i128> %res
+}
+
+; Promotion and widening
+define <3 x i10> @urem_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
+; CHECK-LABEL: urem_v3i10:
+; CHECK: # %bb.0:
+; CHECK-NEXT: mtfprwz 0, 9
+; CHECK-NEXT: mtfprwz 1, 10
+; CHECK-NEXT: addis 9, 2, .LCPI7_0@toc@ha
+; CHECK-NEXT: addi 9, 9, .LCPI7_0@toc@l
+; CHECK-NEXT: mtvsrwz 33, 8
+; CHECK-NEXT: xxmrghw 35, 1, 0
+; CHECK-NEXT: lxvd2x 0, 0, 9
+; CHECK-NEXT: mtfprwz 1, 7
+; CHECK-NEXT: xxswapd 34, 0
+; CHECK-NEXT: mtfprwz 0, 6
+; CHECK-NEXT: lbz 6, 96(1)
+; CHECK-NEXT: mtvsrwz 37, 6
+; CHECK-NEXT: xxmrghw 36, 1, 0
+; CHECK-NEXT: mtfprwz 0, 3
+; CHECK-NEXT: mtfprwz 1, 4
+; CHECK-NEXT: vperm 4, 1, 4, 2
+; CHECK-NEXT: mtvsrwz 33, 5
+; CHECK-NEXT: vperm 3, 5, 3, 2
+; CHECK-NEXT: vspltisw 5, -10
+; CHECK-NEXT: vsrw 5, 5, 5
+; CHECK-NEXT: xxmrghw 32, 1, 0
+; CHECK-NEXT: xxland 0, 36, 37
+; CHECK-NEXT: xxleqv 36, 36, 36
+; CHECK-NEXT: vslw 3, 3, 4
+; CHECK-NEXT: vsraw 3, 3, 4
+; CHECK-NEXT: vperm 0, 1, 0, 2
+; CHECK-NEXT: vspltisw 1, 1
+; CHECK-NEXT: xxland 1, 32, 37
+; CHECK-NEXT: xxswapd 3, 1
+; CHECK-NEXT: xxsldwi 5, 1, 1, 1
+; CHECK-NEXT: mffprwz 4, 3
+; CHECK-NEXT: xxsel 0, 33, 0, 35
+; CHECK-NEXT: xxswapd 2, 0
+; CHECK-NEXT: xxsldwi 4, 0, 0, 1
+; CHECK-NEXT: mffprwz 3, 2
+; CHECK-NEXT: divwu 5, 4, 3
+; CHECK-NEXT: mullw 3, 5, 3
+; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: mffprwz 4, 4
+; CHECK-NEXT: mtfprwz 2, 3
+; CHECK-NEXT: mffprwz 3, 5
+; CHECK-NEXT: divwu 5, 3, 4
+; CHECK-NEXT: mullw 4, 5, 4
+; CHECK-NEXT: sub 3, 3, 4
+; CHECK-NEXT: mffprwz 4, 1
+; CHECK-NEXT: mtfprwz 3, 3
+; CHECK-NEXT: mffprwz 3, 0
+; CHECK-NEXT: divwu 5, 4, 3
+; CHECK-NEXT: mullw 3, 5, 3
+; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: mtvsrwz 36, 3
+; CHECK-NEXT: xxmrghw 35, 3, 2
+; CHECK-NEXT: vperm 2, 4, 3, 2
+; CHECK-NEXT: mfvsrwz 5, 34
+; CHECK-NEXT: xxswapd 0, 34
+; CHECK-NEXT: xxsldwi 1, 34, 34, 1
+; CHECK-NEXT: mffprwz 3, 0
+; CHECK-NEXT: mffprwz 4, 1
+; CHECK-NEXT: blr
+ %res = call <3 x i10> @llvm.masked.urem(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m)
+ ret <3 x i10> %res
+}
diff --git a/llvm/test/CodeGen/RISCV/rvv/masked-sdiv.ll b/llvm/test/CodeGen/RISCV/rvv/masked-sdiv.ll
new file mode 100644
index 0000000000000..fc6e4afdbc31e
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/masked-sdiv.ll
@@ -0,0 +1,303 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple riscv64 -mattr=+v < %s | FileCheck %s
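+; Note: with RVV the mask maps directly onto vmerge: a splat of 1 (vmv.v.i)
+; is merged with the divisor under v0, then a single vdiv.vv does the work.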
+
+define <vscale x 8 x i8> @sdiv_nxv8i8(<vscale x 8 x i8> %x, <vscale x 8 x i8> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: sdiv_nxv8i8:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e8, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vdiv.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i8> @llvm.masked.sdiv(<vscale x 8 x i8> %x, <vscale x 8 x i8> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i8> %res
+}
+
+define <vscale x 4 x i16> @sdiv_nxv4i16(<vscale x 4 x i16> %x, <vscale x 4 x i16> %y, <vscale x 4 x i1> %m) {
+; CHECK-LABEL: sdiv_nxv4i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e16, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vdiv.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i16> @llvm.masked.sdiv(<vscale x 4 x i16> %x, <vscale x 4 x i16> %y, <vscale x 4 x i1> %m)
+ ret <vscale x 4 x i16> %res
+}
+
+define <vscale x 4 x i32> @sdiv_nxv4i32(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %m) {
+; CHECK-LABEL: sdiv_nxv4i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e32, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vvm v10, v12, v10, v0
+; CHECK-NEXT: vdiv.vv v8, v8, v10
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i32> @llvm.masked.sdiv(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %m)
+ ret <vscale x 4 x i32> %res
+}
+
+define <vscale x 8 x i32> @sdiv_nxv8i32(<vscale x 8 x i32> %x, <vscale x 8 x i32> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: sdiv_nxv8i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e32, m4, ta, ma
+; CHECK-NEXT: vmv.v.i v16, 1
+; CHECK-NEXT: vmerge.vvm v12, v16, v12, v0
+; CHECK-NEXT: vdiv.vv v8, v8, v12
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i32> @llvm.masked.sdiv(<vscale x 8 x i32> %x, <vscale x 8 x i32> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i32> %res
+}
+
+define <vscale x 8 x i32> @sdiv_nxv8i32_splat_rhs(<vscale x 8 x i32> %x, i32 %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: sdiv_nxv8i32_splat_rhs:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a1, zero, e32, m4, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vxm v12, v12, a0, v0
+; CHECK-NEXT: vdiv.vv v8, v8, v12
+; CHECK-NEXT: ret
+ %head = insertelement <vscale x 8 x i32> poison, i32 %y, i32 0
+ %splat = shufflevector <vscale x 8 x i32> %head, <vscale x 8 x i32> poison, <vscale x 8 x i32> zeroinitializer
+ %res = call <vscale x 8 x i32> @llvm.masked.sdiv(<vscale x 8 x i32> %x, <vscale x 8 x i32> %splat, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i32> %res
+}
+
+define <vscale x 8 x i16> @sdiv_nxv8i16(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: sdiv_nxv8i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e16, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vvm v10, v12, v10, v0
+; CHECK-NEXT: vdiv.vv v8, v8, v10
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i16> @llvm.masked.sdiv(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i16> %res
+}
+
+define <vscale x 2 x i64> @sdiv_nxv2i64(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %m) {
+; CHECK-LABEL: sdiv_nxv2i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e64, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vvm v10, v12, v10, v0
+; CHECK-NEXT: vdiv.vv v8, v8, v10
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i64> @llvm.masked.sdiv(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %m)
+ ret <vscale x 2 x i64> %res
+}
+
+define <4 x i32> @sdiv_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
+; CHECK-LABEL: sdiv_v4i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vdiv.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <4 x i32> @llvm.masked.sdiv(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
+ ret <4 x i32> %res
+}
+
+define <2 x i64> @sdiv_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
+; CHECK-LABEL: sdiv_v2i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vdiv.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <2 x i64> @llvm.masked.sdiv(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m)
+ ret <2 x i64> %res
+}
+
+define <4 x i64> @sdiv_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
+; CHECK-LABEL: sdiv_v4i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e64, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vvm v10, v12, v10, v0
+; CHECK-NEXT: vdiv.vv v8, v8, v10
+; CHECK-NEXT: ret
+ %res = call <4 x i64> @llvm.masked.sdiv(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
+ ret <4 x i64> %res
+}
+
+define <2 x i32> @sdiv_v2i32(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m) {
+; CHECK-LABEL: sdiv_v2i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 2, e32, mf2, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vdiv.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <2 x i32> @llvm.masked.sdiv(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m)
+ ret <2 x i32> %res
+}
+
+define <4 x i16> @sdiv_v4i16(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m) {
+; CHECK-LABEL: sdiv_v4i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vdiv.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <4 x i16> @llvm.masked.sdiv(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m)
+ ret <4 x i16> %res
+}
+
+define <1 x i64> @sdiv_v1i64(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m) {
+; CHECK-LABEL: sdiv_v1i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 1, e64, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vdiv.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <1 x i64> @llvm.masked.sdiv(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m)
+ ret <1 x i64> %res
+}
+
+define <2 x i128> @sdiv_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
+; CHECK-LABEL: sdiv_v2i128:
+; CHECK: # %bb.0:
+; CHECK-NEXT: addi sp, sp, -80
+; CHECK-NEXT: .cfi_def_cfa_offset 80
+; CHECK-NEXT: sd ra, 72(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s0, 64(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s1, 56(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s2, 48(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s3, 40(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s4, 32(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s5, 24(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s6, 16(sp) # 8-byte Folded Spill
+; CHECK-NEXT: .cfi_offset ra, -8
+; CHECK-NEXT: .cfi_offset s0, -16
+; CHECK-NEXT: .cfi_offset s1, -24
+; CHECK-NEXT: .cfi_offset s2, -32
+; CHECK-NEXT: .cfi_offset s3, -40
+; CHECK-NEXT: .cfi_offset s4, -48
+; CHECK-NEXT: .cfi_offset s5, -56
+; CHECK-NEXT: .cfi_offset s6, -64
+; CHECK-NEXT: csrr a3, vlenb
+; CHECK-NEXT: sub sp, sp, a3
+; CHECK-NEXT: .cfi_escape 0x0f, 0x0e, 0x72, 0x00, 0x11, 0xd0, 0x00, 0x22, 0x11, 0x01, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 80 + 1 * vlenb
+; CHECK-NEXT: mv a4, a1
+; CHECK-NEXT: addi a3, sp, 16
+; CHECK-NEXT: vs1r.v v0, (a3) # vscale x 8-byte Folded Spill
+; CHECK-NEXT: ld a6, 16(a1)
+; CHECK-NEXT: ld a1, 24(a1)
+; CHECK-NEXT: ld a3, 24(a2)
+; CHECK-NEXT: vsetivli zero, 2, e8, mf8, ta, ma
+; CHECK-NEXT: vmv.v.i v8, 0
+; CHECK-NEXT: vmerge.vim v8, v8, 1, v0
+; CHECK-NEXT: vslidedown.vi v8, v8, 1
+; CHECK-NEXT: vmv.x.s a5, v8
+; CHECK-NEXT: andi a7, a5, 1
+; CHECK-NEXT: mv s0, a0
+; CHECK-NEXT: bnez a7, .LBB13_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: li a5, 1
+; CHECK-NEXT: j .LBB13_3
+; CHECK-NEXT: .LBB13_2:
+; CHECK-NEXT: ld a5, 16(a2)
+; CHECK-NEXT: .LBB13_3:
+; CHECK-NEXT: ld s1, 0(a4)
+; CHECK-NEXT: ld s2, 8(a4)
+; CHECK-NEXT: ld s3, 0(a2)
+; CHECK-NEXT: ld s6, 8(a2)
+; CHECK-NEXT: neg a0, a7
+; CHECK-NEXT: and a3, a0, a3
+; CHECK-NEXT: mv a0, a6
+; CHECK-NEXT: mv a2, a5
+; CHECK-NEXT: call __divti3
+; CHECK-NEXT: mv s4, a0
+; CHECK-NEXT: addi a0, sp, 16
+; CHECK-NEXT: vl1r.v v8, (a0) # vscale x 8-byte Folded Reload
+; CHECK-NEXT: vsetivli zero, 2, e8, mf8, ta, ma
+; CHECK-NEXT: vfirst.m a0, v8
+; CHECK-NEXT: mv s5, a1
+; CHECK-NEXT: beqz a0, .LBB13_5
+; CHECK-NEXT: # %bb.4:
+; CHECK-NEXT: li s3, 1
+; CHECK-NEXT: .LBB13_5:
+; CHECK-NEXT: snez a0, a0
+; CHECK-NEXT: addi a0, a0, -1
+; CHECK-NEXT: and a3, a0, s6
+; CHECK-NEXT: mv a0, s1
+; CHECK-NEXT: mv a1, s2
+; CHECK-NEXT: mv a2, s3
+; CHECK-NEXT: call __divti3
+; CHECK-NEXT: sd a0, 0(s0)
+; CHECK-NEXT: sd a1, 8(s0)
+; CHECK-NEXT: sd s4, 16(s0)
+; CHECK-NEXT: sd s5, 24(s0)
+; CHECK-NEXT: csrr a0, vlenb
+; CHECK-NEXT: add sp, sp, a0
+; CHECK-NEXT: .cfi_def_cfa sp, 80
+; CHECK-NEXT: ld ra, 72(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s0, 64(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s1, 56(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s2, 48(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s3, 40(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s4, 32(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s5, 24(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s6, 16(sp) # 8-byte Folded Reload
+; CHECK-NEXT: .cfi_restore ra
+; CHECK-NEXT: .cfi_restore s0
+; CHECK-NEXT: .cfi_restore s1
+; CHECK-NEXT: .cfi_restore s2
+; CHECK-NEXT: .cfi_restore s3
+; CHECK-NEXT: .cfi_restore s4
+; CHECK-NEXT: .cfi_restore s5
+; CHECK-NEXT: .cfi_restore s6
+; CHECK-NEXT: addi sp, sp, 80
+; CHECK-NEXT: .cfi_def_cfa_offset 0
+; CHECK-NEXT: ret
+ %res = call <2 x i128> @llvm.masked.sdiv(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m)
+ ret <2 x i128> %res
+}
+
+define <3 x i10> @sdiv_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
+; CHECK-LABEL: sdiv_v3i10:
+; CHECK: # %bb.0:
+; CHECK-NEXT: ld a3, 0(a1)
+; CHECK-NEXT: ld a4, 8(a1)
+; CHECK-NEXT: ld a1, 16(a1)
+; CHECK-NEXT: ld a5, 0(a2)
+; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, mu
+; CHECK-NEXT: vmv.v.x v8, a3
+; CHECK-NEXT: ld a3, 8(a2)
+; CHECK-NEXT: ld a2, 16(a2)
+; CHECK-NEXT: vmv.v.x v9, a5
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vslide1down.vx v8, v8, a4
+; CHECK-NEXT: vslide1down.vx v9, v9, a3
+; CHECK-NEXT: vslide1down.vx v8, v8, a1
+; CHECK-NEXT: vslide1down.vx v9, v9, a2
+; CHECK-NEXT: vslidedown.vi v8, v8, 1
+; CHECK-NEXT: vslidedown.vi v9, v9, 1
+; CHECK-NEXT: vsll.vi v8, v8, 6
+; CHECK-NEXT: vsll.vi v9, v9, 6
+; CHECK-NEXT: vsra.vi v8, v8, 6
+; CHECK-NEXT: vsra.vi v10, v9, 6, v0.t
+; CHECK-NEXT: vdiv.vv v8, v8, v10
+; CHECK-NEXT: vmv.x.s a1, v8
+; CHECK-NEXT: vslidedown.vi v9, v8, 1
+; CHECK-NEXT: vslidedown.vi v8, v8, 2
+; CHECK-NEXT: andi a1, a1, 1023
+; CHECK-NEXT: vmv.x.s a2, v9
+; CHECK-NEXT: vmv.x.s a3, v8
+; CHECK-NEXT: andi a2, a2, 1023
+; CHECK-NEXT: slli a3, a3, 20
+; CHECK-NEXT: slli a2, a2, 10
+; CHECK-NEXT: or a1, a1, a3
+; CHECK-NEXT: or a1, a1, a2
+; CHECK-NEXT: slli a1, a1, 34
+; CHECK-NEXT: srli a1, a1, 34
+; CHECK-NEXT: sw a1, 0(a0)
+; CHECK-NEXT: ret
+ %res = call <3 x i10> @llvm.masked.sdiv(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m)
+ ret <3 x i10> %res
+}
diff --git a/llvm/test/CodeGen/RISCV/rvv/masked-srem.ll b/llvm/test/CodeGen/RISCV/rvv/masked-srem.ll
new file mode 100644
index 0000000000000..eb0c9e97b023a
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/masked-srem.ll
@@ -0,0 +1,303 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple riscv64 -mattr=+v < %s | FileCheck %s
+
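+; These tests check the common lowering strategy: a divisor of 1 is merged
+; into the masked-off lanes (vmv.v.i + vmerge) so that the unmasked vrem.vv
+; which follows never divides by a masked-off element.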
+define <vscale x 8 x i8> @srem_nxv8i8(<vscale x 8 x i8> %x, <vscale x 8 x i8> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: srem_nxv8i8:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e8, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vrem.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i8> @llvm.masked.srem(<vscale x 8 x i8> %x, <vscale x 8 x i8> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i8> %res
+}
+
+define <vscale x 4 x i16> @srem_nxv4i16(<vscale x 4 x i16> %x, <vscale x 4 x i16> %y, <vscale x 4 x i1> %m) {
+; CHECK-LABEL: srem_nxv4i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e16, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vrem.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i16> @llvm.masked.srem(<vscale x 4 x i16> %x, <vscale x 4 x i16> %y, <vscale x 4 x i1> %m)
+ ret <vscale x 4 x i16> %res
+}
+
+define <vscale x 4 x i32> @srem_nxv4i32(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %m) {
+; CHECK-LABEL: srem_nxv4i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e32, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vvm v10, v12, v10, v0
+; CHECK-NEXT: vrem.vv v8, v8, v10
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i32> @llvm.masked.srem(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %m)
+ ret <vscale x 4 x i32> %res
+}
+
+define <vscale x 8 x i32> @srem_nxv8i32(<vscale x 8 x i32> %x, <vscale x 8 x i32> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: srem_nxv8i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e32, m4, ta, ma
+; CHECK-NEXT: vmv.v.i v16, 1
+; CHECK-NEXT: vmerge.vvm v12, v16, v12, v0
+; CHECK-NEXT: vrem.vv v8, v8, v12
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i32> @llvm.masked.srem(<vscale x 8 x i32> %x, <vscale x 8 x i32> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i32> %res
+}
+
+define <vscale x 8 x i32> @srem_nxv8i32_splat_rhs(<vscale x 8 x i32> %x, i32 %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: srem_nxv8i32_splat_rhs:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a1, zero, e32, m4, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vxm v12, v12, a0, v0
+; CHECK-NEXT: vrem.vv v8, v8, v12
+; CHECK-NEXT: ret
+ %head = insertelement <vscale x 8 x i32> poison, i32 %y, i32 0
+ %splat = shufflevector <vscale x 8 x i32> %head, <vscale x 8 x i32> poison, <vscale x 8 x i32> zeroinitializer
+ %res = call <vscale x 8 x i32> @llvm.masked.srem(<vscale x 8 x i32> %x, <vscale x 8 x i32> %splat, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i32> %res
+}
+
+define <vscale x 8 x i16> @srem_nxv8i16(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: srem_nxv8i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e16, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vvm v10, v12, v10, v0
+; CHECK-NEXT: vrem.vv v8, v8, v10
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i16> @llvm.masked.srem(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i16> %res
+}
+
+define <vscale x 2 x i64> @srem_nxv2i64(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %m) {
+; CHECK-LABEL: srem_nxv2i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e64, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vvm v10, v12, v10, v0
+; CHECK-NEXT: vrem.vv v8, v8, v10
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i64> @llvm.masked.srem(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %m)
+ ret <vscale x 2 x i64> %res
+}
+
+define <4 x i32> @srem_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
+; CHECK-LABEL: srem_v4i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vrem.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <4 x i32> @llvm.masked.srem(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
+ ret <4 x i32> %res
+}
+
+define <2 x i64> @srem_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
+; CHECK-LABEL: srem_v2i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vrem.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <2 x i64> @llvm.masked.srem(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m)
+ ret <2 x i64> %res
+}
+
+define <4 x i64> @srem_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
+; CHECK-LABEL: srem_v4i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e64, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vvm v10, v12, v10, v0
+; CHECK-NEXT: vrem.vv v8, v8, v10
+; CHECK-NEXT: ret
+ %res = call <4 x i64> @llvm.masked.srem(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
+ ret <4 x i64> %res
+}
+
+define <2 x i32> @srem_v2i32(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m) {
+; CHECK-LABEL: srem_v2i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 2, e32, mf2, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vrem.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <2 x i32> @llvm.masked.srem(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m)
+ ret <2 x i32> %res
+}
+
+define <4 x i16> @srem_v4i16(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m) {
+; CHECK-LABEL: srem_v4i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vrem.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <4 x i16> @llvm.masked.srem(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m)
+ ret <4 x i16> %res
+}
+
+define <1 x i64> @srem_v1i64(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m) {
+; CHECK-LABEL: srem_v1i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 1, e64, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vrem.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <1 x i64> @llvm.masked.srem(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m)
+ ret <1 x i64> %res
+}
+
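+; i128 elements are scalarized into __modti3 libcalls; the branches and
+; "li ..., 1" instructions below substitute a divisor of 1 for masked-off
+; lanes.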
+define <2 x i128> @srem_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
+; CHECK-LABEL: srem_v2i128:
+; CHECK: # %bb.0:
+; CHECK-NEXT: addi sp, sp, -80
+; CHECK-NEXT: .cfi_def_cfa_offset 80
+; CHECK-NEXT: sd ra, 72(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s0, 64(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s1, 56(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s2, 48(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s3, 40(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s4, 32(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s5, 24(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s6, 16(sp) # 8-byte Folded Spill
+; CHECK-NEXT: .cfi_offset ra, -8
+; CHECK-NEXT: .cfi_offset s0, -16
+; CHECK-NEXT: .cfi_offset s1, -24
+; CHECK-NEXT: .cfi_offset s2, -32
+; CHECK-NEXT: .cfi_offset s3, -40
+; CHECK-NEXT: .cfi_offset s4, -48
+; CHECK-NEXT: .cfi_offset s5, -56
+; CHECK-NEXT: .cfi_offset s6, -64
+; CHECK-NEXT: csrr a3, vlenb
+; CHECK-NEXT: sub sp, sp, a3
+; CHECK-NEXT: .cfi_escape 0x0f, 0x0e, 0x72, 0x00, 0x11, 0xd0, 0x00, 0x22, 0x11, 0x01, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 80 + 1 * vlenb
+; CHECK-NEXT: mv a4, a1
+; CHECK-NEXT: addi a3, sp, 16
+; CHECK-NEXT: vs1r.v v0, (a3) # vscale x 8-byte Folded Spill
+; CHECK-NEXT: ld a6, 16(a1)
+; CHECK-NEXT: ld a1, 24(a1)
+; CHECK-NEXT: ld a3, 24(a2)
+; CHECK-NEXT: vsetivli zero, 2, e8, mf8, ta, ma
+; CHECK-NEXT: vmv.v.i v8, 0
+; CHECK-NEXT: vmerge.vim v8, v8, 1, v0
+; CHECK-NEXT: vslidedown.vi v8, v8, 1
+; CHECK-NEXT: vmv.x.s a5, v8
+; CHECK-NEXT: andi a7, a5, 1
+; CHECK-NEXT: mv s0, a0
+; CHECK-NEXT: bnez a7, .LBB13_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: li a5, 1
+; CHECK-NEXT: j .LBB13_3
+; CHECK-NEXT: .LBB13_2:
+; CHECK-NEXT: ld a5, 16(a2)
+; CHECK-NEXT: .LBB13_3:
+; CHECK-NEXT: ld s1, 0(a4)
+; CHECK-NEXT: ld s2, 8(a4)
+; CHECK-NEXT: ld s3, 0(a2)
+; CHECK-NEXT: ld s6, 8(a2)
+; CHECK-NEXT: neg a0, a7
+; CHECK-NEXT: and a3, a0, a3
+; CHECK-NEXT: mv a0, a6
+; CHECK-NEXT: mv a2, a5
+; CHECK-NEXT: call __modti3
+; CHECK-NEXT: mv s4, a0
+; CHECK-NEXT: addi a0, sp, 16
+; CHECK-NEXT: vl1r.v v8, (a0) # vscale x 8-byte Folded Reload
+; CHECK-NEXT: vsetivli zero, 2, e8, mf8, ta, ma
+; CHECK-NEXT: vfirst.m a0, v8
+; CHECK-NEXT: mv s5, a1
+; CHECK-NEXT: beqz a0, .LBB13_5
+; CHECK-NEXT: # %bb.4:
+; CHECK-NEXT: li s3, 1
+; CHECK-NEXT: .LBB13_5:
+; CHECK-NEXT: snez a0, a0
+; CHECK-NEXT: addi a0, a0, -1
+; CHECK-NEXT: and a3, a0, s6
+; CHECK-NEXT: mv a0, s1
+; CHECK-NEXT: mv a1, s2
+; CHECK-NEXT: mv a2, s3
+; CHECK-NEXT: call __modti3
+; CHECK-NEXT: sd a0, 0(s0)
+; CHECK-NEXT: sd a1, 8(s0)
+; CHECK-NEXT: sd s4, 16(s0)
+; CHECK-NEXT: sd s5, 24(s0)
+; CHECK-NEXT: csrr a0, vlenb
+; CHECK-NEXT: add sp, sp, a0
+; CHECK-NEXT: .cfi_def_cfa sp, 80
+; CHECK-NEXT: ld ra, 72(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s0, 64(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s1, 56(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s2, 48(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s3, 40(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s4, 32(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s5, 24(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s6, 16(sp) # 8-byte Folded Reload
+; CHECK-NEXT: .cfi_restore ra
+; CHECK-NEXT: .cfi_restore s0
+; CHECK-NEXT: .cfi_restore s1
+; CHECK-NEXT: .cfi_restore s2
+; CHECK-NEXT: .cfi_restore s3
+; CHECK-NEXT: .cfi_restore s4
+; CHECK-NEXT: .cfi_restore s5
+; CHECK-NEXT: .cfi_restore s6
+; CHECK-NEXT: addi sp, sp, 80
+; CHECK-NEXT: .cfi_def_cfa_offset 0
+; CHECK-NEXT: ret
+ %res = call <2 x i128> @llvm.masked.srem(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m)
+ ret <2 x i128> %res
+}
+
+define <3 x i10> @srem_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
+; CHECK-LABEL: srem_v3i10:
+; CHECK: # %bb.0:
+; CHECK-NEXT: ld a3, 0(a1)
+; CHECK-NEXT: ld a4, 8(a1)
+; CHECK-NEXT: ld a1, 16(a1)
+; CHECK-NEXT: ld a5, 0(a2)
+; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, mu
+; CHECK-NEXT: vmv.v.x v8, a3
+; CHECK-NEXT: ld a3, 8(a2)
+; CHECK-NEXT: ld a2, 16(a2)
+; CHECK-NEXT: vmv.v.x v9, a5
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vslide1down.vx v8, v8, a4
+; CHECK-NEXT: vslide1down.vx v9, v9, a3
+; CHECK-NEXT: vslide1down.vx v8, v8, a1
+; CHECK-NEXT: vslide1down.vx v9, v9, a2
+; CHECK-NEXT: vslidedown.vi v8, v8, 1
+; CHECK-NEXT: vslidedown.vi v9, v9, 1
+; CHECK-NEXT: vsll.vi v8, v8, 6
+; CHECK-NEXT: vsll.vi v9, v9, 6
+; CHECK-NEXT: vsra.vi v8, v8, 6
+; CHECK-NEXT: vsra.vi v10, v9, 6, v0.t
+; CHECK-NEXT: vrem.vv v8, v8, v10
+; CHECK-NEXT: vmv.x.s a1, v8
+; CHECK-NEXT: vslidedown.vi v9, v8, 1
+; CHECK-NEXT: vslidedown.vi v8, v8, 2
+; CHECK-NEXT: andi a1, a1, 1023
+; CHECK-NEXT: vmv.x.s a2, v9
+; CHECK-NEXT: vmv.x.s a3, v8
+; CHECK-NEXT: andi a2, a2, 1023
+; CHECK-NEXT: slli a3, a3, 20
+; CHECK-NEXT: slli a2, a2, 10
+; CHECK-NEXT: or a1, a1, a3
+; CHECK-NEXT: or a1, a1, a2
+; CHECK-NEXT: slli a1, a1, 34
+; CHECK-NEXT: srli a1, a1, 34
+; CHECK-NEXT: sw a1, 0(a0)
+; CHECK-NEXT: ret
+ %res = call <3 x i10> @llvm.masked.srem(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m)
+ ret <3 x i10> %res
+}
diff --git a/llvm/test/CodeGen/RISCV/rvv/masked-udiv.ll b/llvm/test/CodeGen/RISCV/rvv/masked-udiv.ll
new file mode 100644
index 0000000000000..c2f151d8fd47e
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/masked-udiv.ll
@@ -0,0 +1,302 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple riscv64 -mattr=+v < %s | FileCheck %s
+
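+; As with the signed tests, masked-off divisor lanes are replaced with 1
+; via vmerge before the unmasked vdivu.vv.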
+define <vscale x 8 x i8> @udiv_nxv8i8(<vscale x 8 x i8> %x, <vscale x 8 x i8> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: udiv_nxv8i8:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e8, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vdivu.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i8> @llvm.masked.udiv(<vscale x 8 x i8> %x, <vscale x 8 x i8> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i8> %res
+}
+
+define <vscale x 4 x i16> @udiv_nxv4i16(<vscale x 4 x i16> %x, <vscale x 4 x i16> %y, <vscale x 4 x i1> %m) {
+; CHECK-LABEL: udiv_nxv4i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e16, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vdivu.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i16> @llvm.masked.udiv(<vscale x 4 x i16> %x, <vscale x 4 x i16> %y, <vscale x 4 x i1> %m)
+ ret <vscale x 4 x i16> %res
+}
+
+define <vscale x 4 x i32> @udiv_nxv4i32(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %m) {
+; CHECK-LABEL: udiv_nxv4i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e32, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vvm v10, v12, v10, v0
+; CHECK-NEXT: vdivu.vv v8, v8, v10
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i32> @llvm.masked.udiv(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %m)
+ ret <vscale x 4 x i32> %res
+}
+
+define <vscale x 8 x i32> @udiv_nxv8i32(<vscale x 8 x i32> %x, <vscale x 8 x i32> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: udiv_nxv8i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e32, m4, ta, ma
+; CHECK-NEXT: vmv.v.i v16, 1
+; CHECK-NEXT: vmerge.vvm v12, v16, v12, v0
+; CHECK-NEXT: vdivu.vv v8, v8, v12
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i32> @llvm.masked.udiv(<vscale x 8 x i32> %x, <vscale x 8 x i32> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i32> %res
+}
+
+define <vscale x 8 x i32> @udiv_nxv8i32_splat_rhs(<vscale x 8 x i32> %x, i32 %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: udiv_nxv8i32_splat_rhs:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a1, zero, e32, m4, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vxm v12, v12, a0, v0
+; CHECK-NEXT: vdivu.vv v8, v8, v12
+; CHECK-NEXT: ret
+ %head = insertelement <vscale x 8 x i32> poison, i32 %y, i32 0
+ %splat = shufflevector <vscale x 8 x i32> %head, <vscale x 8 x i32> poison, <vscale x 8 x i32> zeroinitializer
+ %res = call <vscale x 8 x i32> @llvm.masked.udiv(<vscale x 8 x i32> %x, <vscale x 8 x i32> %splat, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i32> %res
+}
+
+define <vscale x 8 x i16> @udiv_nxv8i16(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: udiv_nxv8i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e16, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vvm v10, v12, v10, v0
+; CHECK-NEXT: vdivu.vv v8, v8, v10
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i16> @llvm.masked.udiv(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i16> %res
+}
+
+define <vscale x 2 x i64> @udiv_nxv2i64(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %m) {
+; CHECK-LABEL: udiv_nxv2i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e64, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vvm v10, v12, v10, v0
+; CHECK-NEXT: vdivu.vv v8, v8, v10
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i64> @llvm.masked.udiv(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %m)
+ ret <vscale x 2 x i64> %res
+}
+
+define <4 x i32> @udiv_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
+; CHECK-LABEL: udiv_v4i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vdivu.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <4 x i32> @llvm.masked.udiv(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
+ ret <4 x i32> %res
+}
+
+define <2 x i64> @udiv_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
+; CHECK-LABEL: udiv_v2i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vdivu.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <2 x i64> @llvm.masked.udiv(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m)
+ ret <2 x i64> %res
+}
+
+define <4 x i64> @udiv_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
+; CHECK-LABEL: udiv_v4i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e64, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vvm v10, v12, v10, v0
+; CHECK-NEXT: vdivu.vv v8, v8, v10
+; CHECK-NEXT: ret
+ %res = call <4 x i64> @llvm.masked.udiv(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
+ ret <4 x i64> %res
+}
+
+define <2 x i32> @udiv_v2i32(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m) {
+; CHECK-LABEL: udiv_v2i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 2, e32, mf2, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vdivu.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <2 x i32> @llvm.masked.udiv(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m)
+ ret <2 x i32> %res
+}
+
+define <4 x i16> @udiv_v4i16(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m) {
+; CHECK-LABEL: udiv_v4i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vdivu.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <4 x i16> @llvm.masked.udiv(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m)
+ ret <4 x i16> %res
+}
+
+define <1 x i64> @udiv_v1i64(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m) {
+; CHECK-LABEL: udiv_v1i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 1, e64, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vdivu.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <1 x i64> @llvm.masked.udiv(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m)
+ ret <1 x i64> %res
+}
+
+define <2 x i128> @udiv_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
+; CHECK-LABEL: udiv_v2i128:
+; CHECK: # %bb.0:
+; CHECK-NEXT: addi sp, sp, -80
+; CHECK-NEXT: .cfi_def_cfa_offset 80
+; CHECK-NEXT: sd ra, 72(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s0, 64(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s1, 56(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s2, 48(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s3, 40(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s4, 32(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s5, 24(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s6, 16(sp) # 8-byte Folded Spill
+; CHECK-NEXT: .cfi_offset ra, -8
+; CHECK-NEXT: .cfi_offset s0, -16
+; CHECK-NEXT: .cfi_offset s1, -24
+; CHECK-NEXT: .cfi_offset s2, -32
+; CHECK-NEXT: .cfi_offset s3, -40
+; CHECK-NEXT: .cfi_offset s4, -48
+; CHECK-NEXT: .cfi_offset s5, -56
+; CHECK-NEXT: .cfi_offset s6, -64
+; CHECK-NEXT: csrr a3, vlenb
+; CHECK-NEXT: sub sp, sp, a3
+; CHECK-NEXT: .cfi_escape 0x0f, 0x0e, 0x72, 0x00, 0x11, 0xd0, 0x00, 0x22, 0x11, 0x01, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 80 + 1 * vlenb
+; CHECK-NEXT: mv a4, a1
+; CHECK-NEXT: addi a3, sp, 16
+; CHECK-NEXT: vs1r.v v0, (a3) # vscale x 8-byte Folded Spill
+; CHECK-NEXT: ld a6, 16(a1)
+; CHECK-NEXT: ld a1, 24(a1)
+; CHECK-NEXT: ld a3, 24(a2)
+; CHECK-NEXT: vsetivli zero, 2, e8, mf8, ta, ma
+; CHECK-NEXT: vmv.v.i v8, 0
+; CHECK-NEXT: vmerge.vim v8, v8, 1, v0
+; CHECK-NEXT: vslidedown.vi v8, v8, 1
+; CHECK-NEXT: vmv.x.s a5, v8
+; CHECK-NEXT: andi a7, a5, 1
+; CHECK-NEXT: mv s0, a0
+; CHECK-NEXT: bnez a7, .LBB13_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: li a5, 1
+; CHECK-NEXT: j .LBB13_3
+; CHECK-NEXT: .LBB13_2:
+; CHECK-NEXT: ld a5, 16(a2)
+; CHECK-NEXT: .LBB13_3:
+; CHECK-NEXT: ld s1, 0(a4)
+; CHECK-NEXT: ld s2, 8(a4)
+; CHECK-NEXT: ld s3, 0(a2)
+; CHECK-NEXT: ld s6, 8(a2)
+; CHECK-NEXT: neg a0, a7
+; CHECK-NEXT: and a3, a0, a3
+; CHECK-NEXT: mv a0, a6
+; CHECK-NEXT: mv a2, a5
+; CHECK-NEXT: call __udivti3
+; CHECK-NEXT: mv s4, a0
+; CHECK-NEXT: addi a0, sp, 16
+; CHECK-NEXT: vl1r.v v8, (a0) # vscale x 8-byte Folded Reload
+; CHECK-NEXT: vsetivli zero, 2, e8, mf8, ta, ma
+; CHECK-NEXT: vfirst.m a0, v8
+; CHECK-NEXT: mv s5, a1
+; CHECK-NEXT: beqz a0, .LBB13_5
+; CHECK-NEXT: # %bb.4:
+; CHECK-NEXT: li s3, 1
+; CHECK-NEXT: .LBB13_5:
+; CHECK-NEXT: snez a0, a0
+; CHECK-NEXT: addi a0, a0, -1
+; CHECK-NEXT: and a3, a0, s6
+; CHECK-NEXT: mv a0, s1
+; CHECK-NEXT: mv a1, s2
+; CHECK-NEXT: mv a2, s3
+; CHECK-NEXT: call __udivti3
+; CHECK-NEXT: sd a0, 0(s0)
+; CHECK-NEXT: sd a1, 8(s0)
+; CHECK-NEXT: sd s4, 16(s0)
+; CHECK-NEXT: sd s5, 24(s0)
+; CHECK-NEXT: csrr a0, vlenb
+; CHECK-NEXT: add sp, sp, a0
+; CHECK-NEXT: .cfi_def_cfa sp, 80
+; CHECK-NEXT: ld ra, 72(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s0, 64(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s1, 56(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s2, 48(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s3, 40(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s4, 32(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s5, 24(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s6, 16(sp) # 8-byte Folded Reload
+; CHECK-NEXT: .cfi_restore ra
+; CHECK-NEXT: .cfi_restore s0
+; CHECK-NEXT: .cfi_restore s1
+; CHECK-NEXT: .cfi_restore s2
+; CHECK-NEXT: .cfi_restore s3
+; CHECK-NEXT: .cfi_restore s4
+; CHECK-NEXT: .cfi_restore s5
+; CHECK-NEXT: .cfi_restore s6
+; CHECK-NEXT: addi sp, sp, 80
+; CHECK-NEXT: .cfi_def_cfa_offset 0
+; CHECK-NEXT: ret
+ %res = call <2 x i128> @llvm.masked.udiv(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m)
+ ret <2 x i128> %res
+}
+
+define <3 x i10> @udiv_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
+; CHECK-LABEL: udiv_v3i10:
+; CHECK: # %bb.0:
+; CHECK-NEXT: ld a3, 0(a1)
+; CHECK-NEXT: ld a4, 8(a1)
+; CHECK-NEXT: ld a1, 16(a1)
+; CHECK-NEXT: ld a5, 0(a2)
+; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, mu
+; CHECK-NEXT: vmv.v.x v8, a3
+; CHECK-NEXT: ld a3, 8(a2)
+; CHECK-NEXT: ld a2, 16(a2)
+; CHECK-NEXT: vmv.v.x v9, a5
+; CHECK-NEXT: vslide1down.vx v8, v8, a4
+; CHECK-NEXT: li a4, 1023
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vslide1down.vx v9, v9, a3
+; CHECK-NEXT: vslide1down.vx v8, v8, a1
+; CHECK-NEXT: vslide1down.vx v9, v9, a2
+; CHECK-NEXT: vslidedown.vi v8, v8, 1
+; CHECK-NEXT: vslidedown.vi v9, v9, 1
+; CHECK-NEXT: vand.vx v8, v8, a4
+; CHECK-NEXT: vand.vx v10, v9, a4, v0.t
+; CHECK-NEXT: vdivu.vv v8, v8, v10
+; CHECK-NEXT: vmv.x.s a1, v8
+; CHECK-NEXT: vslidedown.vi v9, v8, 1
+; CHECK-NEXT: vslidedown.vi v8, v8, 2
+; CHECK-NEXT: andi a1, a1, 1023
+; CHECK-NEXT: vmv.x.s a2, v9
+; CHECK-NEXT: vmv.x.s a3, v8
+; CHECK-NEXT: andi a2, a2, 1023
+; CHECK-NEXT: slli a3, a3, 20
+; CHECK-NEXT: slli a2, a2, 10
+; CHECK-NEXT: or a1, a1, a3
+; CHECK-NEXT: or a1, a1, a2
+; CHECK-NEXT: slli a1, a1, 34
+; CHECK-NEXT: srli a1, a1, 34
+; CHECK-NEXT: sw a1, 0(a0)
+; CHECK-NEXT: ret
+ %res = call <3 x i10> @llvm.masked.udiv(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m)
+ ret <3 x i10> %res
+}
diff --git a/llvm/test/CodeGen/RISCV/rvv/masked-urem.ll b/llvm/test/CodeGen/RISCV/rvv/masked-urem.ll
new file mode 100644
index 0000000000000..b0d2bdae583b0
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/masked-urem.ll
@@ -0,0 +1,302 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple riscv64 -mattr=+v < %s | FileCheck %s
+
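+; Masked-off divisor lanes again become 1 (vmv.v.i + vmerge) so that the
+; plain vremu.vv is safe to execute on all lanes.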
+define <vscale x 8 x i8> @urem_nxv8i8(<vscale x 8 x i8> %x, <vscale x 8 x i8> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: urem_nxv8i8:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e8, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vremu.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i8> @llvm.masked.urem(<vscale x 8 x i8> %x, <vscale x 8 x i8> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i8> %res
+}
+
+define <vscale x 4 x i16> @urem_nxv4i16(<vscale x 4 x i16> %x, <vscale x 4 x i16> %y, <vscale x 4 x i1> %m) {
+; CHECK-LABEL: urem_nxv4i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e16, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vremu.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i16> @llvm.masked.urem(<vscale x 4 x i16> %x, <vscale x 4 x i16> %y, <vscale x 4 x i1> %m)
+ ret <vscale x 4 x i16> %res
+}
+
+define <vscale x 4 x i32> @urem_nxv4i32(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %m) {
+; CHECK-LABEL: urem_nxv4i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e32, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vvm v10, v12, v10, v0
+; CHECK-NEXT: vremu.vv v8, v8, v10
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i32> @llvm.masked.urem(<vscale x 4 x i32> %x, <vscale x 4 x i32> %y, <vscale x 4 x i1> %m)
+ ret <vscale x 4 x i32> %res
+}
+
+define <vscale x 8 x i32> @urem_nxv8i32(<vscale x 8 x i32> %x, <vscale x 8 x i32> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: urem_nxv8i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e32, m4, ta, ma
+; CHECK-NEXT: vmv.v.i v16, 1
+; CHECK-NEXT: vmerge.vvm v12, v16, v12, v0
+; CHECK-NEXT: vremu.vv v8, v8, v12
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i32> @llvm.masked.urem(<vscale x 8 x i32> %x, <vscale x 8 x i32> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i32> %res
+}
+
+define <vscale x 8 x i32> @urem_nxv8i32_splat_rhs(<vscale x 8 x i32> %x, i32 %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: urem_nxv8i32_splat_rhs:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a1, zero, e32, m4, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vxm v12, v12, a0, v0
+; CHECK-NEXT: vremu.vv v8, v8, v12
+; CHECK-NEXT: ret
+ %head = insertelement <vscale x 8 x i32> poison, i32 %y, i32 0
+ %splat = shufflevector <vscale x 8 x i32> %head, <vscale x 8 x i32> poison, <vscale x 8 x i32> zeroinitializer
+ %res = call <vscale x 8 x i32> @llvm.masked.urem(<vscale x 8 x i32> %x, <vscale x 8 x i32> %splat, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i32> %res
+}
+
+define <vscale x 8 x i16> @urem_nxv8i16(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %m) {
+; CHECK-LABEL: urem_nxv8i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e16, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vvm v10, v12, v10, v0
+; CHECK-NEXT: vremu.vv v8, v8, v10
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i16> @llvm.masked.urem(<vscale x 8 x i16> %x, <vscale x 8 x i16> %y, <vscale x 8 x i1> %m)
+ ret <vscale x 8 x i16> %res
+}
+
+define <vscale x 2 x i64> @urem_nxv2i64(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %m) {
+; CHECK-LABEL: urem_nxv2i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a0, zero, e64, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vvm v10, v12, v10, v0
+; CHECK-NEXT: vremu.vv v8, v8, v10
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i64> @llvm.masked.urem(<vscale x 2 x i64> %x, <vscale x 2 x i64> %y, <vscale x 2 x i1> %m)
+ ret <vscale x 2 x i64> %res
+}
+
+define <4 x i32> @urem_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
+; CHECK-LABEL: urem_v4i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vremu.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <4 x i32> @llvm.masked.urem(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
+ ret <4 x i32> %res
+}
+
+define <2 x i64> @urem_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
+; CHECK-LABEL: urem_v2i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vremu.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <2 x i64> @llvm.masked.urem(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m)
+ ret <2 x i64> %res
+}
+
+define <4 x i64> @urem_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
+; CHECK-LABEL: urem_v4i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e64, m2, ta, ma
+; CHECK-NEXT: vmv.v.i v12, 1
+; CHECK-NEXT: vmerge.vvm v10, v12, v10, v0
+; CHECK-NEXT: vremu.vv v8, v8, v10
+; CHECK-NEXT: ret
+ %res = call <4 x i64> @llvm.masked.urem(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
+ ret <4 x i64> %res
+}
+
+define <2 x i32> @urem_v2i32(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m) {
+; CHECK-LABEL: urem_v2i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 2, e32, mf2, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vremu.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <2 x i32> @llvm.masked.urem(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m)
+ ret <2 x i32> %res
+}
+
+define <4 x i16> @urem_v4i16(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m) {
+; CHECK-LABEL: urem_v4i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vremu.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <4 x i16> @llvm.masked.urem(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m)
+ ret <4 x i16> %res
+}
+
+define <1 x i64> @urem_v1i64(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m) {
+; CHECK-LABEL: urem_v1i64:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 1, e64, m1, ta, ma
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vmerge.vvm v9, v10, v9, v0
+; CHECK-NEXT: vremu.vv v8, v8, v9
+; CHECK-NEXT: ret
+ %res = call <1 x i64> @llvm.masked.urem(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m)
+ ret <1 x i64> %res
+}
+
+define <2 x i128> @urem_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
+; CHECK-LABEL: urem_v2i128:
+; CHECK: # %bb.0:
+; CHECK-NEXT: addi sp, sp, -80
+; CHECK-NEXT: .cfi_def_cfa_offset 80
+; CHECK-NEXT: sd ra, 72(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s0, 64(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s1, 56(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s2, 48(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s3, 40(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s4, 32(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s5, 24(sp) # 8-byte Folded Spill
+; CHECK-NEXT: sd s6, 16(sp) # 8-byte Folded Spill
+; CHECK-NEXT: .cfi_offset ra, -8
+; CHECK-NEXT: .cfi_offset s0, -16
+; CHECK-NEXT: .cfi_offset s1, -24
+; CHECK-NEXT: .cfi_offset s2, -32
+; CHECK-NEXT: .cfi_offset s3, -40
+; CHECK-NEXT: .cfi_offset s4, -48
+; CHECK-NEXT: .cfi_offset s5, -56
+; CHECK-NEXT: .cfi_offset s6, -64
+; CHECK-NEXT: csrr a3, vlenb
+; CHECK-NEXT: sub sp, sp, a3
+; CHECK-NEXT: .cfi_escape 0x0f, 0x0e, 0x72, 0x00, 0x11, 0xd0, 0x00, 0x22, 0x11, 0x01, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 80 + 1 * vlenb
+; CHECK-NEXT: mv a4, a1
+; CHECK-NEXT: addi a3, sp, 16
+; CHECK-NEXT: vs1r.v v0, (a3) # vscale x 8-byte Folded Spill
+; CHECK-NEXT: ld a6, 16(a1)
+; CHECK-NEXT: ld a1, 24(a1)
+; CHECK-NEXT: ld a3, 24(a2)
+; CHECK-NEXT: vsetivli zero, 2, e8, mf8, ta, ma
+; CHECK-NEXT: vmv.v.i v8, 0
+; CHECK-NEXT: vmerge.vim v8, v8, 1, v0
+; CHECK-NEXT: vslidedown.vi v8, v8, 1
+; CHECK-NEXT: vmv.x.s a5, v8
+; CHECK-NEXT: andi a7, a5, 1
+; CHECK-NEXT: mv s0, a0
+; CHECK-NEXT: bnez a7, .LBB13_2
+; CHECK-NEXT: # %bb.1:
+; CHECK-NEXT: li a5, 1
+; CHECK-NEXT: j .LBB13_3
+; CHECK-NEXT: .LBB13_2:
+; CHECK-NEXT: ld a5, 16(a2)
+; CHECK-NEXT: .LBB13_3:
+; CHECK-NEXT: ld s1, 0(a4)
+; CHECK-NEXT: ld s2, 8(a4)
+; CHECK-NEXT: ld s3, 0(a2)
+; CHECK-NEXT: ld s6, 8(a2)
+; CHECK-NEXT: neg a0, a7
+; CHECK-NEXT: and a3, a0, a3
+; CHECK-NEXT: mv a0, a6
+; CHECK-NEXT: mv a2, a5
+; CHECK-NEXT: call __umodti3
+; CHECK-NEXT: mv s4, a0
+; CHECK-NEXT: addi a0, sp, 16
+; CHECK-NEXT: vl1r.v v8, (a0) # vscale x 8-byte Folded Reload
+; CHECK-NEXT: vsetivli zero, 2, e8, mf8, ta, ma
+; CHECK-NEXT: vfirst.m a0, v8
+; CHECK-NEXT: mv s5, a1
+; CHECK-NEXT: beqz a0, .LBB13_5
+; CHECK-NEXT: # %bb.4:
+; CHECK-NEXT: li s3, 1
+; CHECK-NEXT: .LBB13_5:
+; CHECK-NEXT: snez a0, a0
+; CHECK-NEXT: addi a0, a0, -1
+; CHECK-NEXT: and a3, a0, s6
+; CHECK-NEXT: mv a0, s1
+; CHECK-NEXT: mv a1, s2
+; CHECK-NEXT: mv a2, s3
+; CHECK-NEXT: call __umodti3
+; CHECK-NEXT: sd a0, 0(s0)
+; CHECK-NEXT: sd a1, 8(s0)
+; CHECK-NEXT: sd s4, 16(s0)
+; CHECK-NEXT: sd s5, 24(s0)
+; CHECK-NEXT: csrr a0, vlenb
+; CHECK-NEXT: add sp, sp, a0
+; CHECK-NEXT: .cfi_def_cfa sp, 80
+; CHECK-NEXT: ld ra, 72(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s0, 64(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s1, 56(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s2, 48(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s3, 40(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s4, 32(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s5, 24(sp) # 8-byte Folded Reload
+; CHECK-NEXT: ld s6, 16(sp) # 8-byte Folded Reload
+; CHECK-NEXT: .cfi_restore ra
+; CHECK-NEXT: .cfi_restore s0
+; CHECK-NEXT: .cfi_restore s1
+; CHECK-NEXT: .cfi_restore s2
+; CHECK-NEXT: .cfi_restore s3
+; CHECK-NEXT: .cfi_restore s4
+; CHECK-NEXT: .cfi_restore s5
+; CHECK-NEXT: .cfi_restore s6
+; CHECK-NEXT: addi sp, sp, 80
+; CHECK-NEXT: .cfi_def_cfa_offset 0
+; CHECK-NEXT: ret
+ %res = call <2 x i128> @llvm.masked.urem(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m)
+ ret <2 x i128> %res
+}
+
+define <3 x i10> @urem_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
+; CHECK-LABEL: urem_v3i10:
+; CHECK: # %bb.0:
+; CHECK-NEXT: ld a3, 0(a1)
+; CHECK-NEXT: ld a4, 8(a1)
+; CHECK-NEXT: ld a1, 16(a1)
+; CHECK-NEXT: ld a5, 0(a2)
+; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, mu
+; CHECK-NEXT: vmv.v.x v8, a3
+; CHECK-NEXT: ld a3, 8(a2)
+; CHECK-NEXT: ld a2, 16(a2)
+; CHECK-NEXT: vmv.v.x v9, a5
+; CHECK-NEXT: vslide1down.vx v8, v8, a4
+; CHECK-NEXT: li a4, 1023
+; CHECK-NEXT: vmv.v.i v10, 1
+; CHECK-NEXT: vslide1down.vx v9, v9, a3
+; CHECK-NEXT: vslide1down.vx v8, v8, a1
+; CHECK-NEXT: vslide1down.vx v9, v9, a2
+; CHECK-NEXT: vslidedown.vi v8, v8, 1
+; CHECK-NEXT: vslidedown.vi v9, v9, 1
+; CHECK-NEXT: vand.vx v8, v8, a4
+; CHECK-NEXT: vand.vx v10, v9, a4, v0.t
+; CHECK-NEXT: vremu.vv v8, v8, v10
+; CHECK-NEXT: vmv.x.s a1, v8
+; CHECK-NEXT: vslidedown.vi v9, v8, 1
+; CHECK-NEXT: vslidedown.vi v8, v8, 2
+; CHECK-NEXT: andi a1, a1, 1023
+; CHECK-NEXT: vmv.x.s a2, v9
+; CHECK-NEXT: vmv.x.s a3, v8
+; CHECK-NEXT: andi a2, a2, 1023
+; CHECK-NEXT: slli a3, a3, 20
+; CHECK-NEXT: slli a2, a2, 10
+; CHECK-NEXT: or a1, a1, a3
+; CHECK-NEXT: or a1, a1, a2
+; CHECK-NEXT: slli a1, a1, 34
+; CHECK-NEXT: srli a1, a1, 34
+; CHECK-NEXT: sw a1, 0(a0)
+; CHECK-NEXT: ret
+ %res = call <3 x i10> @llvm.masked.urem(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m)
+ ret <3 x i10> %res
+}
diff --git a/llvm/test/CodeGen/X86/masked-sdiv.ll b/llvm/test/CodeGen/X86/masked-sdiv.ll
new file mode 100644
index 0000000000000..189cfcab4bd20
--- /dev/null
+++ b/llvm/test/CodeGen/X86/masked-sdiv.ll
@@ -0,0 +1,758 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple x86_64 -mattr=+sse2 < %s | FileCheck %s --check-prefix=SSE2
+; RUN: llc -mtriple x86_64 -mattr=+avx512 < %s | FileCheck %s --check-prefix=AVX512
+
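+; x86 has no vector integer division, so the mask is sign-extended and the
+; divisor is rewritten as (y & m) + m + 1, i.e. y in active lanes and 1 in
+; masked-off lanes, before each element is divided with scalar idiv.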
+; Legal
+define <4 x i32> @sdiv_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
+; SSE2-LABEL: sdiv_v4i32:
+; SSE2: # %bb.0:
+; SSE2-NEXT: pslld $31, %xmm2
+; SSE2-NEXT: psrad $31, %xmm2
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: paddd %xmm2, %xmm1
+; SSE2-NEXT: pcmpeqd %xmm2, %xmm2
+; SSE2-NEXT: psubd %xmm2, %xmm1
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; SSE2-NEXT: movd %xmm2, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; SSE2-NEXT: movd %xmm2, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movd %eax, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,2,3]
+; SSE2-NEXT: movd %xmm3, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm0[2,3,2,3]
+; SSE2-NEXT: movd %xmm3, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movd %eax, %xmm3
+; SSE2-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movd %eax, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,1,1]
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,1,1]
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movd %eax, %xmm0
+; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
+; SSE2-NEXT: movdqa %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: sdiv_v4i32:
+; AVX512: # %bb.0:
+; AVX512-NEXT: pslld $31, %xmm2
+; AVX512-NEXT: psrad $31, %xmm2
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: paddd %xmm2, %xmm1
+; AVX512-NEXT: pcmpeqd %xmm2, %xmm2
+; AVX512-NEXT: psubd %xmm2, %xmm1
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; AVX512-NEXT: movd %xmm2, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; AVX512-NEXT: movd %xmm2, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movd %eax, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,2,3]
+; AVX512-NEXT: movd %xmm3, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm0[2,3,2,3]
+; AVX512-NEXT: movd %xmm3, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movd %eax, %xmm3
+; AVX512-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movd %eax, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,1,1]
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,1,1]
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movd %eax, %xmm0
+; AVX512-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
+; AVX512-NEXT: movdqa %xmm2, %xmm0
+; AVX512-NEXT: retq
+ %res = call <4 x i32> @llvm.masked.sdiv(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
+ ret <4 x i32> %res
+}
+
+define <2 x i64> @sdiv_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
+; SSE2-LABEL: sdiv_v2i64:
+; SSE2: # %bb.0:
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,0,2,2]
+; SSE2-NEXT: pslld $31, %xmm2
+; SSE2-NEXT: psrad $31, %xmm2
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: pandn {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2
+; SSE2-NEXT: por %xmm1, %xmm2
+; SSE2-NEXT: movq %xmm2, %rcx
+; SSE2-NEXT: movq %xmm0, %rax
+; SSE2-NEXT: cqto
+; SSE2-NEXT: idivq %rcx
+; SSE2-NEXT: movq %rax, %xmm1
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[2,3,2,3]
+; SSE2-NEXT: movq %xmm2, %rcx
+; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,2,3]
+; SSE2-NEXT: movq %xmm0, %rax
+; SSE2-NEXT: cqto
+; SSE2-NEXT: idivq %rcx
+; SSE2-NEXT: movq %rax, %xmm0
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
+; SSE2-NEXT: movdqa %xmm1, %xmm0
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: sdiv_v2i64:
+; AVX512: # %bb.0:
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,0,2,2]
+; AVX512-NEXT: pslld $31, %xmm2
+; AVX512-NEXT: psrad $31, %xmm2
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: pandn {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2
+; AVX512-NEXT: por %xmm1, %xmm2
+; AVX512-NEXT: movq %xmm2, %rcx
+; AVX512-NEXT: movq %xmm0, %rax
+; AVX512-NEXT: cqto
+; AVX512-NEXT: idivq %rcx
+; AVX512-NEXT: movq %rax, %xmm1
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm2[2,3,2,3]
+; AVX512-NEXT: movq %xmm2, %rcx
+; AVX512-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,2,3]
+; AVX512-NEXT: movq %xmm0, %rax
+; AVX512-NEXT: cqto
+; AVX512-NEXT: idivq %rcx
+; AVX512-NEXT: movq %rax, %xmm0
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
+; AVX512-NEXT: movdqa %xmm1, %xmm0
+; AVX512-NEXT: retq
+ %res = call <2 x i64> @llvm.masked.sdiv(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m)
+ ret <2 x i64> %res
+}
+
+; Splitting
+define <4 x i64> @sdiv_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
+; SSE2-LABEL: sdiv_v4i64:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movdqa %xmm0, %xmm5
+; SSE2-NEXT: pshufd {{.*#+}} xmm6 = xmm4[2,2,3,3]
+; SSE2-NEXT: pslld $31, %xmm6
+; SSE2-NEXT: psrad $31, %xmm6
+; SSE2-NEXT: pshufd {{.*#+}} xmm4 = xmm4[0,0,1,1]
+; SSE2-NEXT: pslld $31, %xmm4
+; SSE2-NEXT: psrad $31, %xmm4
+; SSE2-NEXT: movdqa {{.*#+}} xmm7 = [1,1]
+; SSE2-NEXT: pand %xmm4, %xmm2
+; SSE2-NEXT: pandn %xmm7, %xmm4
+; SSE2-NEXT: por %xmm2, %xmm4
+; SSE2-NEXT: movq %xmm4, %rcx
+; SSE2-NEXT: movq %xmm0, %rax
+; SSE2-NEXT: cqto
+; SSE2-NEXT: idivq %rcx
+; SSE2-NEXT: movq %rax, %xmm0
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm4[2,3,2,3]
+; SSE2-NEXT: movq %xmm2, %rcx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm5[2,3,2,3]
+; SSE2-NEXT: movq %xmm2, %rax
+; SSE2-NEXT: cqto
+; SSE2-NEXT: idivq %rcx
+; SSE2-NEXT: movq %rax, %xmm2
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; SSE2-NEXT: pand %xmm6, %xmm3
+; SSE2-NEXT: pandn %xmm7, %xmm6
+; SSE2-NEXT: por %xmm3, %xmm6
+; SSE2-NEXT: movq %xmm6, %rcx
+; SSE2-NEXT: movq %xmm1, %rax
+; SSE2-NEXT: cqto
+; SSE2-NEXT: idivq %rcx
+; SSE2-NEXT: movq %rax, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm6[2,3,2,3]
+; SSE2-NEXT: movq %xmm3, %rcx
+; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,3,2,3]
+; SSE2-NEXT: movq %xmm1, %rax
+; SSE2-NEXT: cqto
+; SSE2-NEXT: idivq %rcx
+; SSE2-NEXT: movq %rax, %xmm1
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
+; SSE2-NEXT: movdqa %xmm2, %xmm1
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: sdiv_v4i64:
+; AVX512: # %bb.0:
+; AVX512-NEXT: movdqa %xmm0, %xmm5
+; AVX512-NEXT: pshufd {{.*#+}} xmm6 = xmm4[2,2,3,3]
+; AVX512-NEXT: pslld $31, %xmm6
+; AVX512-NEXT: psrad $31, %xmm6
+; AVX512-NEXT: pshufd {{.*#+}} xmm4 = xmm4[0,0,1,1]
+; AVX512-NEXT: pslld $31, %xmm4
+; AVX512-NEXT: psrad $31, %xmm4
+; AVX512-NEXT: movdqa {{.*#+}} xmm7 = [1,1]
+; AVX512-NEXT: pand %xmm4, %xmm2
+; AVX512-NEXT: pandn %xmm7, %xmm4
+; AVX512-NEXT: por %xmm2, %xmm4
+; AVX512-NEXT: movq %xmm4, %rcx
+; AVX512-NEXT: movq %xmm0, %rax
+; AVX512-NEXT: cqto
+; AVX512-NEXT: idivq %rcx
+; AVX512-NEXT: movq %rax, %xmm0
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm4[2,3,2,3]
+; AVX512-NEXT: movq %xmm2, %rcx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm5[2,3,2,3]
+; AVX512-NEXT: movq %xmm2, %rax
+; AVX512-NEXT: cqto
+; AVX512-NEXT: idivq %rcx
+; AVX512-NEXT: movq %rax, %xmm2
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; AVX512-NEXT: pand %xmm6, %xmm3
+; AVX512-NEXT: pandn %xmm7, %xmm6
+; AVX512-NEXT: por %xmm3, %xmm6
+; AVX512-NEXT: movq %xmm6, %rcx
+; AVX512-NEXT: movq %xmm1, %rax
+; AVX512-NEXT: cqto
+; AVX512-NEXT: idivq %rcx
+; AVX512-NEXT: movq %rax, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm6[2,3,2,3]
+; AVX512-NEXT: movq %xmm3, %rcx
+; AVX512-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,3,2,3]
+; AVX512-NEXT: movq %xmm1, %rax
+; AVX512-NEXT: cqto
+; AVX512-NEXT: idivq %rcx
+; AVX512-NEXT: movq %rax, %xmm1
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
+; AVX512-NEXT: movdqa %xmm2, %xmm1
+; AVX512-NEXT: retq
+ %res = call <4 x i64> @llvm.masked.sdiv(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
+ ret <4 x i64> %res
+}
+
+; Widening
+define <2 x i32> @sdiv_v2i32(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m) {
+; SSE2-LABEL: sdiv_v2i32:
+; SSE2: # %bb.0:
+; SSE2-NEXT: xorps %xmm3, %xmm3
+; SSE2-NEXT: shufps {{.*#+}} xmm2 = xmm2[0,2],xmm3[2,3]
+; SSE2-NEXT: pslld $31, %xmm2
+; SSE2-NEXT: psrad $31, %xmm2
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: paddd %xmm2, %xmm1
+; SSE2-NEXT: pcmpeqd %xmm2, %xmm2
+; SSE2-NEXT: psubd %xmm2, %xmm1
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; SSE2-NEXT: movd %xmm2, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; SSE2-NEXT: movd %xmm2, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movd %eax, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,2,3]
+; SSE2-NEXT: movd %xmm3, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm0[2,3,2,3]
+; SSE2-NEXT: movd %xmm3, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movd %eax, %xmm3
+; SSE2-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movd %eax, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,1,1]
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,1,1]
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movd %eax, %xmm0
+; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
+; SSE2-NEXT: movdqa %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: sdiv_v2i32:
+; AVX512: # %bb.0:
+; AVX512-NEXT: xorps %xmm3, %xmm3
+; AVX512-NEXT: shufps {{.*#+}} xmm2 = xmm2[0,2],xmm3[2,3]
+; AVX512-NEXT: pslld $31, %xmm2
+; AVX512-NEXT: psrad $31, %xmm2
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: paddd %xmm2, %xmm1
+; AVX512-NEXT: pcmpeqd %xmm2, %xmm2
+; AVX512-NEXT: psubd %xmm2, %xmm1
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; AVX512-NEXT: movd %xmm2, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; AVX512-NEXT: movd %xmm2, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movd %eax, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,2,3]
+; AVX512-NEXT: movd %xmm3, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm0[2,3,2,3]
+; AVX512-NEXT: movd %xmm3, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movd %eax, %xmm3
+; AVX512-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movd %eax, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,1,1]
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,1,1]
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movd %eax, %xmm0
+; AVX512-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
+; AVX512-NEXT: movdqa %xmm2, %xmm0
+; AVX512-NEXT: retq
+ %res = call <2 x i32> @llvm.masked.sdiv(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m)
+ ret <2 x i32> %res
+}
+
+; Promotion
+define <4 x i16> @sdiv_v4i16(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m) {
+; SSE2-LABEL: sdiv_v4i16:
+; SSE2: # %bb.0:
+; SSE2-NEXT: pshuflw {{.*#+}} xmm2 = xmm2[0,2,2,3,4,5,6,7]
+; SSE2-NEXT: pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,4,6,6,7]
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,1,0,2]
+; SSE2-NEXT: psrldq {{.*#+}} xmm2 = xmm2[8,9,10,11,12,13,14,15],zero,zero,zero,zero,zero,zero,zero,zero
+; SSE2-NEXT: psllw $15, %xmm2
+; SSE2-NEXT: psraw $15, %xmm2
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: paddw %xmm2, %xmm1
+; SSE2-NEXT: pcmpeqd %xmm2, %xmm2
+; SSE2-NEXT: psubw %xmm2, %xmm1
+; SSE2-NEXT: pextrw $7, %xmm1, %ecx
+; SSE2-NEXT: pextrw $7, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: cwtd
+; SSE2-NEXT: idivw %cx
+; SSE2-NEXT: # kill: def $ax killed $ax def $eax
+; SSE2-NEXT: movd %eax, %xmm2
+; SSE2-NEXT: pextrw $6, %xmm1, %ecx
+; SSE2-NEXT: pextrw $6, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: cwtd
+; SSE2-NEXT: idivw %cx
+; SSE2-NEXT: # kill: def $ax killed $ax def $eax
+; SSE2-NEXT: movd %eax, %xmm3
+; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1],xmm3[2],xmm2[2],xmm3[3],xmm2[3]
+; SSE2-NEXT: pextrw $5, %xmm1, %ecx
+; SSE2-NEXT: pextrw $5, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: cwtd
+; SSE2-NEXT: idivw %cx
+; SSE2-NEXT: # kill: def $ax killed $ax def $eax
+; SSE2-NEXT: movd %eax, %xmm4
+; SSE2-NEXT: pextrw $4, %xmm1, %ecx
+; SSE2-NEXT: pextrw $4, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: cwtd
+; SSE2-NEXT: idivw %cx
+; SSE2-NEXT: # kill: def $ax killed $ax def $eax
+; SSE2-NEXT: movd %eax, %xmm2
+; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
+; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
+; SSE2-NEXT: pextrw $3, %xmm1, %ecx
+; SSE2-NEXT: pextrw $3, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: cwtd
+; SSE2-NEXT: idivw %cx
+; SSE2-NEXT: # kill: def $ax killed $ax def $eax
+; SSE2-NEXT: movd %eax, %xmm3
+; SSE2-NEXT: pextrw $2, %xmm1, %ecx
+; SSE2-NEXT: pextrw $2, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: cwtd
+; SSE2-NEXT: idivw %cx
+; SSE2-NEXT: # kill: def $ax killed $ax def $eax
+; SSE2-NEXT: movd %eax, %xmm4
+; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm3[0],xmm4[1],xmm3[1],xmm4[2],xmm3[2],xmm4[3],xmm3[3]
+; SSE2-NEXT: pextrw $1, %xmm1, %ecx
+; SSE2-NEXT: pextrw $1, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: cwtd
+; SSE2-NEXT: idivw %cx
+; SSE2-NEXT: # kill: def $ax killed $ax def $eax
+; SSE2-NEXT: movd %eax, %xmm3
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: cwtd
+; SSE2-NEXT: idivw %cx
+; SSE2-NEXT: # kill: def $ax killed $ax def $eax
+; SSE2-NEXT: movd %eax, %xmm0
+; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
+; SSE2-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1]
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: sdiv_v4i16:
+; AVX512: # %bb.0:
+; AVX512-NEXT: pshuflw {{.*#+}} xmm2 = xmm2[0,2,2,3,4,5,6,7]
+; AVX512-NEXT: pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,4,6,6,7]
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,1,0,2]
+; AVX512-NEXT: psrldq {{.*#+}} xmm2 = xmm2[8,9,10,11,12,13,14,15],zero,zero,zero,zero,zero,zero,zero,zero
+; AVX512-NEXT: psllw $15, %xmm2
+; AVX512-NEXT: psraw $15, %xmm2
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: paddw %xmm2, %xmm1
+; AVX512-NEXT: pcmpeqd %xmm2, %xmm2
+; AVX512-NEXT: psubw %xmm2, %xmm1
+; AVX512-NEXT: pextrw $7, %xmm1, %ecx
+; AVX512-NEXT: pextrw $7, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: cwtd
+; AVX512-NEXT: idivw %cx
+; AVX512-NEXT: # kill: def $ax killed $ax def $eax
+; AVX512-NEXT: movd %eax, %xmm2
+; AVX512-NEXT: pextrw $6, %xmm1, %ecx
+; AVX512-NEXT: pextrw $6, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: cwtd
+; AVX512-NEXT: idivw %cx
+; AVX512-NEXT: # kill: def $ax killed $ax def $eax
+; AVX512-NEXT: movd %eax, %xmm3
+; AVX512-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1],xmm3[2],xmm2[2],xmm3[3],xmm2[3]
+; AVX512-NEXT: pextrw $5, %xmm1, %ecx
+; AVX512-NEXT: pextrw $5, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: cwtd
+; AVX512-NEXT: idivw %cx
+; AVX512-NEXT: # kill: def $ax killed $ax def $eax
+; AVX512-NEXT: movd %eax, %xmm4
+; AVX512-NEXT: pextrw $4, %xmm1, %ecx
+; AVX512-NEXT: pextrw $4, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: cwtd
+; AVX512-NEXT: idivw %cx
+; AVX512-NEXT: # kill: def $ax killed $ax def $eax
+; AVX512-NEXT: movd %eax, %xmm2
+; AVX512-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
+; AVX512-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
+; AVX512-NEXT: pextrw $3, %xmm1, %ecx
+; AVX512-NEXT: pextrw $3, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: cwtd
+; AVX512-NEXT: idivw %cx
+; AVX512-NEXT: # kill: def $ax killed $ax def $eax
+; AVX512-NEXT: movd %eax, %xmm3
+; AVX512-NEXT: pextrw $2, %xmm1, %ecx
+; AVX512-NEXT: pextrw $2, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: cwtd
+; AVX512-NEXT: idivw %cx
+; AVX512-NEXT: # kill: def $ax killed $ax def $eax
+; AVX512-NEXT: movd %eax, %xmm4
+; AVX512-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm3[0],xmm4[1],xmm3[1],xmm4[2],xmm3[2],xmm4[3],xmm3[3]
+; AVX512-NEXT: pextrw $1, %xmm1, %ecx
+; AVX512-NEXT: pextrw $1, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: cwtd
+; AVX512-NEXT: idivw %cx
+; AVX512-NEXT: # kill: def $ax killed $ax def $eax
+; AVX512-NEXT: movd %eax, %xmm3
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: cwtd
+; AVX512-NEXT: idivw %cx
+; AVX512-NEXT: # kill: def $ax killed $ax def $eax
+; AVX512-NEXT: movd %eax, %xmm0
+; AVX512-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
+; AVX512-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1]
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; AVX512-NEXT: retq
+ %res = call <4 x i16> @llvm.masked.sdiv(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m)
+ ret <4 x i16> %res
+}
+
+; Scalarization
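+; A <1 x i64> operand is already scalar: a cmov selects either %y or the
+; constant 1 as the divisor before a single idivq.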
+define <1 x i64> @sdiv_v1i64(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m) {
+; SSE2-LABEL: sdiv_v1i64:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movq %rdi, %rax
+; SSE2-NEXT: testb $1, %dl
+; SSE2-NEXT: movl $1, %ecx
+; SSE2-NEXT: cmovneq %rsi, %rcx
+; SSE2-NEXT: cqto
+; SSE2-NEXT: idivq %rcx
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: sdiv_v1i64:
+; AVX512: # %bb.0:
+; AVX512-NEXT: movq %rdi, %rax
+; AVX512-NEXT: testb $1, %dl
+; AVX512-NEXT: movl $1, %ecx
+; AVX512-NEXT: cmovneq %rsi, %rcx
+; AVX512-NEXT: cqto
+; AVX512-NEXT: idivq %rcx
+; AVX512-NEXT: retq
+ %res = call <1 x i64> @llvm.masked.sdiv(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m)
+ ret <1 x i64> %res
+}
+
+; Expansion
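+; Each i128 element becomes a __divti3 libcall; cmoves force the divisor to
+; 1 when the corresponding mask bit is clear.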
+define <2 x i128> @sdiv_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
+; SSE2-LABEL: sdiv_v2i128:
+; SSE2: # %bb.0:
+; SSE2-NEXT: pushq %rbp
+; SSE2-NEXT: .cfi_def_cfa_offset 16
+; SSE2-NEXT: pushq %r15
+; SSE2-NEXT: .cfi_def_cfa_offset 24
+; SSE2-NEXT: pushq %r14
+; SSE2-NEXT: .cfi_def_cfa_offset 32
+; SSE2-NEXT: pushq %r13
+; SSE2-NEXT: .cfi_def_cfa_offset 40
+; SSE2-NEXT: pushq %r12
+; SSE2-NEXT: .cfi_def_cfa_offset 48
+; SSE2-NEXT: pushq %rbx
+; SSE2-NEXT: .cfi_def_cfa_offset 56
+; SSE2-NEXT: subq $40, %rsp
+; SSE2-NEXT: .cfi_def_cfa_offset 96
+; SSE2-NEXT: .cfi_offset %rbx, -56
+; SSE2-NEXT: .cfi_offset %r12, -48
+; SSE2-NEXT: .cfi_offset %r13, -40
+; SSE2-NEXT: .cfi_offset %r14, -32
+; SSE2-NEXT: .cfi_offset %r15, -24
+; SSE2-NEXT: .cfi_offset %rbp, -16
+; SSE2-NEXT: movq %r8, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
+; SSE2-NEXT: movq %rcx, %r15
+; SSE2-NEXT: movq %rdi, %rbx
+; SSE2-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp)
+; SSE2-NEXT: xorl %r12d, %r12d
+; SSE2-NEXT: testb $1, {{[0-9]+}}(%rsp)
+; SSE2-NEXT: movl $1, %r13d
+; SSE2-NEXT: cmoveq %r13, %r9
+; SSE2-NEXT: movq {{[0-9]+}}(%rsp), %rcx
+; SSE2-NEXT: cmoveq %r12, %rcx
+; SSE2-NEXT: movq %rsi, %rdi
+; SSE2-NEXT: movq %rdx, %rsi
+; SSE2-NEXT: movq %r9, %rdx
+; SSE2-NEXT: callq __divti3@PLT
+; SSE2-NEXT: movq %rax, %rbp
+; SSE2-NEXT: movq %rdx, %r14
+; SSE2-NEXT: testb $1, {{[0-9]+}}(%rsp)
+; SSE2-NEXT: je .LBB6_2
+; SSE2-NEXT: # %bb.1:
+; SSE2-NEXT: movq {{[0-9]+}}(%rsp), %r13
+; SSE2-NEXT: movq {{[0-9]+}}(%rsp), %r12
+; SSE2-NEXT: .LBB6_2:
+; SSE2-NEXT: movq %r15, %rdi
+; SSE2-NEXT: movq {{[-0-9]+}}(%r{{[sb]}}p), %rsi # 8-byte Reload
+; SSE2-NEXT: movq %r13, %rdx
+; SSE2-NEXT: movq %r12, %rcx
+; SSE2-NEXT: callq __divti3@PLT
+; SSE2-NEXT: movq %rdx, 24(%rbx)
+; SSE2-NEXT: movq %rax, 16(%rbx)
+; SSE2-NEXT: movq %r14, 8(%rbx)
+; SSE2-NEXT: movq %rbp, (%rbx)
+; SSE2-NEXT: movq %rbx, %rax
+; SSE2-NEXT: addq $40, %rsp
+; SSE2-NEXT: .cfi_def_cfa_offset 56
+; SSE2-NEXT: popq %rbx
+; SSE2-NEXT: .cfi_def_cfa_offset 48
+; SSE2-NEXT: popq %r12
+; SSE2-NEXT: .cfi_def_cfa_offset 40
+; SSE2-NEXT: popq %r13
+; SSE2-NEXT: .cfi_def_cfa_offset 32
+; SSE2-NEXT: popq %r14
+; SSE2-NEXT: .cfi_def_cfa_offset 24
+; SSE2-NEXT: popq %r15
+; SSE2-NEXT: .cfi_def_cfa_offset 16
+; SSE2-NEXT: popq %rbp
+; SSE2-NEXT: .cfi_def_cfa_offset 8
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: sdiv_v2i128:
+; AVX512: # %bb.0:
+; AVX512-NEXT: pushq %rbp
+; AVX512-NEXT: .cfi_def_cfa_offset 16
+; AVX512-NEXT: pushq %r15
+; AVX512-NEXT: .cfi_def_cfa_offset 24
+; AVX512-NEXT: pushq %r14
+; AVX512-NEXT: .cfi_def_cfa_offset 32
+; AVX512-NEXT: pushq %r13
+; AVX512-NEXT: .cfi_def_cfa_offset 40
+; AVX512-NEXT: pushq %r12
+; AVX512-NEXT: .cfi_def_cfa_offset 48
+; AVX512-NEXT: pushq %rbx
+; AVX512-NEXT: .cfi_def_cfa_offset 56
+; AVX512-NEXT: subq $40, %rsp
+; AVX512-NEXT: .cfi_def_cfa_offset 96
+; AVX512-NEXT: .cfi_offset %rbx, -56
+; AVX512-NEXT: .cfi_offset %r12, -48
+; AVX512-NEXT: .cfi_offset %r13, -40
+; AVX512-NEXT: .cfi_offset %r14, -32
+; AVX512-NEXT: .cfi_offset %r15, -24
+; AVX512-NEXT: .cfi_offset %rbp, -16
+; AVX512-NEXT: movq %r8, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
+; AVX512-NEXT: movq %rcx, %r15
+; AVX512-NEXT: movq %rdi, %rbx
+; AVX512-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp)
+; AVX512-NEXT: xorl %r12d, %r12d
+; AVX512-NEXT: testb $1, {{[0-9]+}}(%rsp)
+; AVX512-NEXT: movl $1, %r13d
+; AVX512-NEXT: cmoveq %r13, %r9
+; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rcx
+; AVX512-NEXT: cmoveq %r12, %rcx
+; AVX512-NEXT: movq %rsi, %rdi
+; AVX512-NEXT: movq %rdx, %rsi
+; AVX512-NEXT: movq %r9, %rdx
+; AVX512-NEXT: callq __divti3@PLT
+; AVX512-NEXT: movq %rax, %rbp
+; AVX512-NEXT: movq %rdx, %r14
+; AVX512-NEXT: testb $1, {{[0-9]+}}(%rsp)
+; AVX512-NEXT: je .LBB6_2
+; AVX512-NEXT: # %bb.1:
+; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %r13
+; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %r12
+; AVX512-NEXT: .LBB6_2:
+; AVX512-NEXT: movq %r15, %rdi
+; AVX512-NEXT: movq {{[-0-9]+}}(%r{{[sb]}}p), %rsi # 8-byte Reload
+; AVX512-NEXT: movq %r13, %rdx
+; AVX512-NEXT: movq %r12, %rcx
+; AVX512-NEXT: callq __divti3@PLT
+; AVX512-NEXT: movq %rdx, 24(%rbx)
+; AVX512-NEXT: movq %rax, 16(%rbx)
+; AVX512-NEXT: movq %r14, 8(%rbx)
+; AVX512-NEXT: movq %rbp, (%rbx)
+; AVX512-NEXT: movq %rbx, %rax
+; AVX512-NEXT: addq $40, %rsp
+; AVX512-NEXT: .cfi_def_cfa_offset 56
+; AVX512-NEXT: popq %rbx
+; AVX512-NEXT: .cfi_def_cfa_offset 48
+; AVX512-NEXT: popq %r12
+; AVX512-NEXT: .cfi_def_cfa_offset 40
+; AVX512-NEXT: popq %r13
+; AVX512-NEXT: .cfi_def_cfa_offset 32
+; AVX512-NEXT: popq %r14
+; AVX512-NEXT: .cfi_def_cfa_offset 24
+; AVX512-NEXT: popq %r15
+; AVX512-NEXT: .cfi_def_cfa_offset 16
+; AVX512-NEXT: popq %rbp
+; AVX512-NEXT: .cfi_def_cfa_offset 8
+; AVX512-NEXT: retq
+ %res = call <2 x i128> @llvm.masked.sdiv(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m)
+ ret <2 x i128> %res
+}
+
+; Promotion and widening
+define <3 x i10> @sdiv_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
+; SSE2-LABEL: sdiv_v3i10:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movd %esi, %xmm1
+; SSE2-NEXT: movd %edi, %xmm0
+; SSE2-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
+; SSE2-NEXT: movd %edx, %xmm1
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; SSE2-NEXT: pslld $22, %xmm0
+; SSE2-NEXT: psrad $22, %xmm0
+; SSE2-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; SSE2-NEXT: movd {{.*#+}} xmm2 = mem[0],zero,zero,zero
+; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
+; SSE2-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
+; SSE2-NEXT: pslld $31, %xmm2
+; SSE2-NEXT: psrad $31, %xmm2
+; SSE2-NEXT: movd %r8d, %xmm3
+; SSE2-NEXT: movd %ecx, %xmm1
+; SSE2-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm3[0],xmm1[1],xmm3[1]
+; SSE2-NEXT: movd %r9d, %xmm3
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm3[0]
+; SSE2-NEXT: pslld $22, %xmm1
+; SSE2-NEXT: psrad $22, %xmm1
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: paddd %xmm2, %xmm1
+; SSE2-NEXT: pcmpeqd %xmm2, %xmm2
+; SSE2-NEXT: psubd %xmm2, %xmm1
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,3,2,3]
+; SSE2-NEXT: movd %xmm2, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; SSE2-NEXT: movd %xmm2, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movl %eax, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,1,1,1]
+; SSE2-NEXT: movd %xmm2, %esi
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[1,1,1,1]
+; SSE2-NEXT: movd %xmm2, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %esi
+; SSE2-NEXT: movl %eax, %esi
+; SSE2-NEXT: movd %xmm1, %edi
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %edi
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: movl %esi, %edx
+; SSE2-NEXT: # kill: def $cx killed $cx killed $ecx
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: sdiv_v3i10:
+; AVX512: # %bb.0:
+; AVX512-NEXT: movd %esi, %xmm1
+; AVX512-NEXT: movd %edi, %xmm0
+; AVX512-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
+; AVX512-NEXT: movd %edx, %xmm1
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; AVX512-NEXT: pslld $22, %xmm0
+; AVX512-NEXT: psrad $22, %xmm0
+; AVX512-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; AVX512-NEXT: movd {{.*#+}} xmm2 = mem[0],zero,zero,zero
+; AVX512-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
+; AVX512-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
+; AVX512-NEXT: pslld $31, %xmm2
+; AVX512-NEXT: psrad $31, %xmm2
+; AVX512-NEXT: movd %r8d, %xmm3
+; AVX512-NEXT: movd %ecx, %xmm1
+; AVX512-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm3[0],xmm1[1],xmm3[1]
+; AVX512-NEXT: movd %r9d, %xmm3
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm3[0]
+; AVX512-NEXT: pslld $22, %xmm1
+; AVX512-NEXT: psrad $22, %xmm1
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: paddd %xmm2, %xmm1
+; AVX512-NEXT: pcmpeqd %xmm2, %xmm2
+; AVX512-NEXT: psubd %xmm2, %xmm1
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,3,2,3]
+; AVX512-NEXT: movd %xmm2, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; AVX512-NEXT: movd %xmm2, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movl %eax, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,1,1,1]
+; AVX512-NEXT: movd %xmm2, %esi
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm0[1,1,1,1]
+; AVX512-NEXT: movd %xmm2, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %esi
+; AVX512-NEXT: movl %eax, %esi
+; AVX512-NEXT: movd %xmm1, %edi
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %edi
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: movl %esi, %edx
+; AVX512-NEXT: # kill: def $cx killed $cx killed $ecx
+; AVX512-NEXT: retq
+ %res = call <3 x i10> @llvm.masked.sdiv(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m)
+ ret <3 x i10> %res
+}
diff --git a/llvm/test/CodeGen/X86/masked-srem.ll b/llvm/test/CodeGen/X86/masked-srem.ll
new file mode 100644
index 0000000000000..af9fee38e46ff
--- /dev/null
+++ b/llvm/test/CodeGen/X86/masked-srem.ll
@@ -0,0 +1,762 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple x86_64 -mattr=+sse2 < %s | FileCheck %s --check-prefix=SSE2
+; RUN: llc -mtriple x86_64 -mattr=+avx512 < %s | FileCheck %s --check-prefix=AVX512
+
+; Legal
+define <4 x i32> @srem_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
+; SSE2-LABEL: srem_v4i32:
+; SSE2: # %bb.0:
+; SSE2-NEXT: pslld $31, %xmm2
+; SSE2-NEXT: psrad $31, %xmm2
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: paddd %xmm2, %xmm1
+; SSE2-NEXT: pcmpeqd %xmm2, %xmm2
+; SSE2-NEXT: psubd %xmm2, %xmm1
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; SSE2-NEXT: movd %xmm2, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; SSE2-NEXT: movd %xmm2, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movd %edx, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,2,3]
+; SSE2-NEXT: movd %xmm3, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm0[2,3,2,3]
+; SSE2-NEXT: movd %xmm3, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movd %edx, %xmm3
+; SSE2-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movd %edx, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,1,1]
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,1,1]
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movd %edx, %xmm0
+; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
+; SSE2-NEXT: movdqa %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: srem_v4i32:
+; AVX512: # %bb.0:
+; AVX512-NEXT: pslld $31, %xmm2
+; AVX512-NEXT: psrad $31, %xmm2
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: paddd %xmm2, %xmm1
+; AVX512-NEXT: pcmpeqd %xmm2, %xmm2
+; AVX512-NEXT: psubd %xmm2, %xmm1
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; AVX512-NEXT: movd %xmm2, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; AVX512-NEXT: movd %xmm2, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movd %edx, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,2,3]
+; AVX512-NEXT: movd %xmm3, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm0[2,3,2,3]
+; AVX512-NEXT: movd %xmm3, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movd %edx, %xmm3
+; AVX512-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movd %edx, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,1,1]
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,1,1]
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movd %edx, %xmm0
+; AVX512-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
+; AVX512-NEXT: movdqa %xmm2, %xmm0
+; AVX512-NEXT: retq
+ %res = call <4 x i32> @llvm.masked.srem(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
+ ret <4 x i32> %res
+}
+
+define <2 x i64> @srem_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
+; SSE2-LABEL: srem_v2i64:
+; SSE2: # %bb.0:
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,0,2,2]
+; SSE2-NEXT: pslld $31, %xmm2
+; SSE2-NEXT: psrad $31, %xmm2
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: pandn {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2
+; SSE2-NEXT: por %xmm1, %xmm2
+; SSE2-NEXT: movq %xmm2, %rcx
+; SSE2-NEXT: movq %xmm0, %rax
+; SSE2-NEXT: cqto
+; SSE2-NEXT: idivq %rcx
+; SSE2-NEXT: movq %rdx, %xmm1
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[2,3,2,3]
+; SSE2-NEXT: movq %xmm2, %rcx
+; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,2,3]
+; SSE2-NEXT: movq %xmm0, %rax
+; SSE2-NEXT: cqto
+; SSE2-NEXT: idivq %rcx
+; SSE2-NEXT: movq %rdx, %xmm0
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
+; SSE2-NEXT: movdqa %xmm1, %xmm0
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: srem_v2i64:
+; AVX512: # %bb.0:
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,0,2,2]
+; AVX512-NEXT: pslld $31, %xmm2
+; AVX512-NEXT: psrad $31, %xmm2
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: pandn {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2
+; AVX512-NEXT: por %xmm1, %xmm2
+; AVX512-NEXT: movq %xmm2, %rcx
+; AVX512-NEXT: movq %xmm0, %rax
+; AVX512-NEXT: cqto
+; AVX512-NEXT: idivq %rcx
+; AVX512-NEXT: movq %rdx, %xmm1
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm2[2,3,2,3]
+; AVX512-NEXT: movq %xmm2, %rcx
+; AVX512-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,2,3]
+; AVX512-NEXT: movq %xmm0, %rax
+; AVX512-NEXT: cqto
+; AVX512-NEXT: idivq %rcx
+; AVX512-NEXT: movq %rdx, %xmm0
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
+; AVX512-NEXT: movdqa %xmm1, %xmm0
+; AVX512-NEXT: retq
+ %res = call <2 x i64> @llvm.masked.srem(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m)
+ ret <2 x i64> %res
+}
+
+; Splitting
+define <4 x i64> @srem_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
+; SSE2-LABEL: srem_v4i64:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movdqa %xmm0, %xmm5
+; SSE2-NEXT: pshufd {{.*#+}} xmm6 = xmm4[2,2,3,3]
+; SSE2-NEXT: pslld $31, %xmm6
+; SSE2-NEXT: psrad $31, %xmm6
+; SSE2-NEXT: pshufd {{.*#+}} xmm4 = xmm4[0,0,1,1]
+; SSE2-NEXT: pslld $31, %xmm4
+; SSE2-NEXT: psrad $31, %xmm4
+; SSE2-NEXT: movdqa {{.*#+}} xmm7 = [1,1]
+; SSE2-NEXT: pand %xmm4, %xmm2
+; SSE2-NEXT: pandn %xmm7, %xmm4
+; SSE2-NEXT: por %xmm2, %xmm4
+; SSE2-NEXT: movq %xmm4, %rcx
+; SSE2-NEXT: movq %xmm0, %rax
+; SSE2-NEXT: cqto
+; SSE2-NEXT: idivq %rcx
+; SSE2-NEXT: movq %rdx, %xmm0
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm4[2,3,2,3]
+; SSE2-NEXT: movq %xmm2, %rcx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm5[2,3,2,3]
+; SSE2-NEXT: movq %xmm2, %rax
+; SSE2-NEXT: cqto
+; SSE2-NEXT: idivq %rcx
+; SSE2-NEXT: movq %rdx, %xmm2
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; SSE2-NEXT: pand %xmm6, %xmm3
+; SSE2-NEXT: pandn %xmm7, %xmm6
+; SSE2-NEXT: por %xmm3, %xmm6
+; SSE2-NEXT: movq %xmm6, %rcx
+; SSE2-NEXT: movq %xmm1, %rax
+; SSE2-NEXT: cqto
+; SSE2-NEXT: idivq %rcx
+; SSE2-NEXT: movq %rdx, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm6[2,3,2,3]
+; SSE2-NEXT: movq %xmm3, %rcx
+; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,3,2,3]
+; SSE2-NEXT: movq %xmm1, %rax
+; SSE2-NEXT: cqto
+; SSE2-NEXT: idivq %rcx
+; SSE2-NEXT: movq %rdx, %xmm1
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
+; SSE2-NEXT: movdqa %xmm2, %xmm1
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: srem_v4i64:
+; AVX512: # %bb.0:
+; AVX512-NEXT: movdqa %xmm0, %xmm5
+; AVX512-NEXT: pshufd {{.*#+}} xmm6 = xmm4[2,2,3,3]
+; AVX512-NEXT: pslld $31, %xmm6
+; AVX512-NEXT: psrad $31, %xmm6
+; AVX512-NEXT: pshufd {{.*#+}} xmm4 = xmm4[0,0,1,1]
+; AVX512-NEXT: pslld $31, %xmm4
+; AVX512-NEXT: psrad $31, %xmm4
+; AVX512-NEXT: movdqa {{.*#+}} xmm7 = [1,1]
+; AVX512-NEXT: pand %xmm4, %xmm2
+; AVX512-NEXT: pandn %xmm7, %xmm4
+; AVX512-NEXT: por %xmm2, %xmm4
+; AVX512-NEXT: movq %xmm4, %rcx
+; AVX512-NEXT: movq %xmm0, %rax
+; AVX512-NEXT: cqto
+; AVX512-NEXT: idivq %rcx
+; AVX512-NEXT: movq %rdx, %xmm0
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm4[2,3,2,3]
+; AVX512-NEXT: movq %xmm2, %rcx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm5[2,3,2,3]
+; AVX512-NEXT: movq %xmm2, %rax
+; AVX512-NEXT: cqto
+; AVX512-NEXT: idivq %rcx
+; AVX512-NEXT: movq %rdx, %xmm2
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; AVX512-NEXT: pand %xmm6, %xmm3
+; AVX512-NEXT: pandn %xmm7, %xmm6
+; AVX512-NEXT: por %xmm3, %xmm6
+; AVX512-NEXT: movq %xmm6, %rcx
+; AVX512-NEXT: movq %xmm1, %rax
+; AVX512-NEXT: cqto
+; AVX512-NEXT: idivq %rcx
+; AVX512-NEXT: movq %rdx, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm6[2,3,2,3]
+; AVX512-NEXT: movq %xmm3, %rcx
+; AVX512-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,3,2,3]
+; AVX512-NEXT: movq %xmm1, %rax
+; AVX512-NEXT: cqto
+; AVX512-NEXT: idivq %rcx
+; AVX512-NEXT: movq %rdx, %xmm1
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
+; AVX512-NEXT: movdqa %xmm2, %xmm1
+; AVX512-NEXT: retq
+ %res = call <4 x i64> @llvm.masked.srem(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
+ ret <4 x i64> %res
+}
+
+; Widening
+define <2 x i32> @srem_v2i32(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m) {
+; SSE2-LABEL: srem_v2i32:
+; SSE2: # %bb.0:
+; SSE2-NEXT: xorps %xmm3, %xmm3
+; SSE2-NEXT: shufps {{.*#+}} xmm2 = xmm2[0,2],xmm3[2,3]
+; SSE2-NEXT: pslld $31, %xmm2
+; SSE2-NEXT: psrad $31, %xmm2
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: paddd %xmm2, %xmm1
+; SSE2-NEXT: pcmpeqd %xmm2, %xmm2
+; SSE2-NEXT: psubd %xmm2, %xmm1
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; SSE2-NEXT: movd %xmm2, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; SSE2-NEXT: movd %xmm2, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movd %edx, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,2,3]
+; SSE2-NEXT: movd %xmm3, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm0[2,3,2,3]
+; SSE2-NEXT: movd %xmm3, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movd %edx, %xmm3
+; SSE2-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movd %edx, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,1,1]
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,1,1]
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movd %edx, %xmm0
+; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
+; SSE2-NEXT: movdqa %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: srem_v2i32:
+; AVX512: # %bb.0:
+; AVX512-NEXT: xorps %xmm3, %xmm3
+; AVX512-NEXT: shufps {{.*#+}} xmm2 = xmm2[0,2],xmm3[2,3]
+; AVX512-NEXT: pslld $31, %xmm2
+; AVX512-NEXT: psrad $31, %xmm2
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: paddd %xmm2, %xmm1
+; AVX512-NEXT: pcmpeqd %xmm2, %xmm2
+; AVX512-NEXT: psubd %xmm2, %xmm1
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; AVX512-NEXT: movd %xmm2, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; AVX512-NEXT: movd %xmm2, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movd %edx, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,2,3]
+; AVX512-NEXT: movd %xmm3, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm0[2,3,2,3]
+; AVX512-NEXT: movd %xmm3, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movd %edx, %xmm3
+; AVX512-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movd %edx, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,1,1]
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,1,1]
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movd %edx, %xmm0
+; AVX512-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
+; AVX512-NEXT: movdqa %xmm2, %xmm0
+; AVX512-NEXT: retq
+ %res = call <2 x i32> @llvm.masked.srem(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m)
+ ret <2 x i32> %res
+}
+
+; Promotion
+define <4 x i16> @srem_v4i16(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m) {
+; SSE2-LABEL: srem_v4i16:
+; SSE2: # %bb.0:
+; SSE2-NEXT: pshuflw {{.*#+}} xmm2 = xmm2[0,2,2,3,4,5,6,7]
+; SSE2-NEXT: pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,4,6,6,7]
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,1,0,2]
+; SSE2-NEXT: psrldq {{.*#+}} xmm2 = xmm2[8,9,10,11,12,13,14,15],zero,zero,zero,zero,zero,zero,zero,zero
+; SSE2-NEXT: psllw $15, %xmm2
+; SSE2-NEXT: psraw $15, %xmm2
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: paddw %xmm2, %xmm1
+; SSE2-NEXT: pcmpeqd %xmm2, %xmm2
+; SSE2-NEXT: psubw %xmm2, %xmm1
+; SSE2-NEXT: pextrw $7, %xmm1, %ecx
+; SSE2-NEXT: pextrw $7, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: cwtd
+; SSE2-NEXT: idivw %cx
+; SSE2-NEXT: # kill: def $dx killed $dx def $edx
+; SSE2-NEXT: movd %edx, %xmm2
+; SSE2-NEXT: pextrw $6, %xmm1, %ecx
+; SSE2-NEXT: pextrw $6, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: cwtd
+; SSE2-NEXT: idivw %cx
+; SSE2-NEXT: # kill: def $dx killed $dx def $edx
+; SSE2-NEXT: movd %edx, %xmm3
+; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1],xmm3[2],xmm2[2],xmm3[3],xmm2[3]
+; SSE2-NEXT: pextrw $5, %xmm1, %ecx
+; SSE2-NEXT: pextrw $5, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: cwtd
+; SSE2-NEXT: idivw %cx
+; SSE2-NEXT: # kill: def $dx killed $dx def $edx
+; SSE2-NEXT: movd %edx, %xmm4
+; SSE2-NEXT: pextrw $4, %xmm1, %ecx
+; SSE2-NEXT: pextrw $4, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: cwtd
+; SSE2-NEXT: idivw %cx
+; SSE2-NEXT: # kill: def $dx killed $dx def $edx
+; SSE2-NEXT: movd %edx, %xmm2
+; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
+; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
+; SSE2-NEXT: pextrw $3, %xmm1, %ecx
+; SSE2-NEXT: pextrw $3, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: cwtd
+; SSE2-NEXT: idivw %cx
+; SSE2-NEXT: # kill: def $dx killed $dx def $edx
+; SSE2-NEXT: movd %edx, %xmm3
+; SSE2-NEXT: pextrw $2, %xmm1, %ecx
+; SSE2-NEXT: pextrw $2, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: cwtd
+; SSE2-NEXT: idivw %cx
+; SSE2-NEXT: # kill: def $dx killed $dx def $edx
+; SSE2-NEXT: movd %edx, %xmm4
+; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm3[0],xmm4[1],xmm3[1],xmm4[2],xmm3[2],xmm4[3],xmm3[3]
+; SSE2-NEXT: pextrw $1, %xmm1, %ecx
+; SSE2-NEXT: pextrw $1, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: cwtd
+; SSE2-NEXT: idivw %cx
+; SSE2-NEXT: # kill: def $dx killed $dx def $edx
+; SSE2-NEXT: movd %edx, %xmm3
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: cwtd
+; SSE2-NEXT: idivw %cx
+; SSE2-NEXT: # kill: def $dx killed $dx def $edx
+; SSE2-NEXT: movd %edx, %xmm0
+; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
+; SSE2-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1]
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: srem_v4i16:
+; AVX512: # %bb.0:
+; AVX512-NEXT: pshuflw {{.*#+}} xmm2 = xmm2[0,2,2,3,4,5,6,7]
+; AVX512-NEXT: pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,4,6,6,7]
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,1,0,2]
+; AVX512-NEXT: psrldq {{.*#+}} xmm2 = xmm2[8,9,10,11,12,13,14,15],zero,zero,zero,zero,zero,zero,zero,zero
+; AVX512-NEXT: psllw $15, %xmm2
+; AVX512-NEXT: psraw $15, %xmm2
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: paddw %xmm2, %xmm1
+; AVX512-NEXT: pcmpeqd %xmm2, %xmm2
+; AVX512-NEXT: psubw %xmm2, %xmm1
+; AVX512-NEXT: pextrw $7, %xmm1, %ecx
+; AVX512-NEXT: pextrw $7, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: cwtd
+; AVX512-NEXT: idivw %cx
+; AVX512-NEXT: # kill: def $dx killed $dx def $edx
+; AVX512-NEXT: movd %edx, %xmm2
+; AVX512-NEXT: pextrw $6, %xmm1, %ecx
+; AVX512-NEXT: pextrw $6, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: cwtd
+; AVX512-NEXT: idivw %cx
+; AVX512-NEXT: # kill: def $dx killed $dx def $edx
+; AVX512-NEXT: movd %edx, %xmm3
+; AVX512-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1],xmm3[2],xmm2[2],xmm3[3],xmm2[3]
+; AVX512-NEXT: pextrw $5, %xmm1, %ecx
+; AVX512-NEXT: pextrw $5, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: cwtd
+; AVX512-NEXT: idivw %cx
+; AVX512-NEXT: # kill: def $dx killed $dx def $edx
+; AVX512-NEXT: movd %edx, %xmm4
+; AVX512-NEXT: pextrw $4, %xmm1, %ecx
+; AVX512-NEXT: pextrw $4, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: cwtd
+; AVX512-NEXT: idivw %cx
+; AVX512-NEXT: # kill: def $dx killed $dx def $edx
+; AVX512-NEXT: movd %edx, %xmm2
+; AVX512-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
+; AVX512-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
+; AVX512-NEXT: pextrw $3, %xmm1, %ecx
+; AVX512-NEXT: pextrw $3, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: cwtd
+; AVX512-NEXT: idivw %cx
+; AVX512-NEXT: # kill: def $dx killed $dx def $edx
+; AVX512-NEXT: movd %edx, %xmm3
+; AVX512-NEXT: pextrw $2, %xmm1, %ecx
+; AVX512-NEXT: pextrw $2, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: cwtd
+; AVX512-NEXT: idivw %cx
+; AVX512-NEXT: # kill: def $dx killed $dx def $edx
+; AVX512-NEXT: movd %edx, %xmm4
+; AVX512-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm3[0],xmm4[1],xmm3[1],xmm4[2],xmm3[2],xmm4[3],xmm3[3]
+; AVX512-NEXT: pextrw $1, %xmm1, %ecx
+; AVX512-NEXT: pextrw $1, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: cwtd
+; AVX512-NEXT: idivw %cx
+; AVX512-NEXT: # kill: def $dx killed $dx def $edx
+; AVX512-NEXT: movd %edx, %xmm3
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: cwtd
+; AVX512-NEXT: idivw %cx
+; AVX512-NEXT: # kill: def $dx killed $dx def $edx
+; AVX512-NEXT: movd %edx, %xmm0
+; AVX512-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
+; AVX512-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1]
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; AVX512-NEXT: retq
+ %res = call <4 x i16> @llvm.masked.srem(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m)
+ ret <4 x i16> %res
+}
+
+; Scalarization
+define <1 x i64> @srem_v1i64(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m) {
+; SSE2-LABEL: srem_v1i64:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movq %rdi, %rax
+; SSE2-NEXT: testb $1, %dl
+; SSE2-NEXT: movl $1, %ecx
+; SSE2-NEXT: cmovneq %rsi, %rcx
+; SSE2-NEXT: cqto
+; SSE2-NEXT: idivq %rcx
+; SSE2-NEXT: movq %rdx, %rax
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: srem_v1i64:
+; AVX512: # %bb.0:
+; AVX512-NEXT: movq %rdi, %rax
+; AVX512-NEXT: testb $1, %dl
+; AVX512-NEXT: movl $1, %ecx
+; AVX512-NEXT: cmovneq %rsi, %rcx
+; AVX512-NEXT: cqto
+; AVX512-NEXT: idivq %rcx
+; AVX512-NEXT: movq %rdx, %rax
+; AVX512-NEXT: retq
+ %res = call <1 x i64> @llvm.masked.srem(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m)
+ ret <1 x i64> %res
+}
+
+; Expansion
+define <2 x i128> @srem_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
+; SSE2-LABEL: srem_v2i128:
+; SSE2: # %bb.0:
+; SSE2-NEXT: pushq %rbp
+; SSE2-NEXT: .cfi_def_cfa_offset 16
+; SSE2-NEXT: pushq %r15
+; SSE2-NEXT: .cfi_def_cfa_offset 24
+; SSE2-NEXT: pushq %r14
+; SSE2-NEXT: .cfi_def_cfa_offset 32
+; SSE2-NEXT: pushq %r13
+; SSE2-NEXT: .cfi_def_cfa_offset 40
+; SSE2-NEXT: pushq %r12
+; SSE2-NEXT: .cfi_def_cfa_offset 48
+; SSE2-NEXT: pushq %rbx
+; SSE2-NEXT: .cfi_def_cfa_offset 56
+; SSE2-NEXT: subq $40, %rsp
+; SSE2-NEXT: .cfi_def_cfa_offset 96
+; SSE2-NEXT: .cfi_offset %rbx, -56
+; SSE2-NEXT: .cfi_offset %r12, -48
+; SSE2-NEXT: .cfi_offset %r13, -40
+; SSE2-NEXT: .cfi_offset %r14, -32
+; SSE2-NEXT: .cfi_offset %r15, -24
+; SSE2-NEXT: .cfi_offset %rbp, -16
+; SSE2-NEXT: movq %r8, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
+; SSE2-NEXT: movq %rcx, %r15
+; SSE2-NEXT: movq %rdi, %rbx
+; SSE2-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp)
+; SSE2-NEXT: xorl %r12d, %r12d
+; SSE2-NEXT: testb $1, {{[0-9]+}}(%rsp)
+; SSE2-NEXT: movl $1, %r13d
+; SSE2-NEXT: cmoveq %r13, %r9
+; SSE2-NEXT: movq {{[0-9]+}}(%rsp), %rcx
+; SSE2-NEXT: cmoveq %r12, %rcx
+; SSE2-NEXT: movq %rsi, %rdi
+; SSE2-NEXT: movq %rdx, %rsi
+; SSE2-NEXT: movq %r9, %rdx
+; SSE2-NEXT: callq __modti3@PLT
+; SSE2-NEXT: movq %rax, %rbp
+; SSE2-NEXT: movq %rdx, %r14
+; SSE2-NEXT: testb $1, {{[0-9]+}}(%rsp)
+; SSE2-NEXT: je .LBB6_2
+; SSE2-NEXT: # %bb.1:
+; SSE2-NEXT: movq {{[0-9]+}}(%rsp), %r13
+; SSE2-NEXT: movq {{[0-9]+}}(%rsp), %r12
+; SSE2-NEXT: .LBB6_2:
+; SSE2-NEXT: movq %r15, %rdi
+; SSE2-NEXT: movq {{[-0-9]+}}(%r{{[sb]}}p), %rsi # 8-byte Reload
+; SSE2-NEXT: movq %r13, %rdx
+; SSE2-NEXT: movq %r12, %rcx
+; SSE2-NEXT: callq __modti3@PLT
+; SSE2-NEXT: movq %rdx, 24(%rbx)
+; SSE2-NEXT: movq %rax, 16(%rbx)
+; SSE2-NEXT: movq %r14, 8(%rbx)
+; SSE2-NEXT: movq %rbp, (%rbx)
+; SSE2-NEXT: movq %rbx, %rax
+; SSE2-NEXT: addq $40, %rsp
+; SSE2-NEXT: .cfi_def_cfa_offset 56
+; SSE2-NEXT: popq %rbx
+; SSE2-NEXT: .cfi_def_cfa_offset 48
+; SSE2-NEXT: popq %r12
+; SSE2-NEXT: .cfi_def_cfa_offset 40
+; SSE2-NEXT: popq %r13
+; SSE2-NEXT: .cfi_def_cfa_offset 32
+; SSE2-NEXT: popq %r14
+; SSE2-NEXT: .cfi_def_cfa_offset 24
+; SSE2-NEXT: popq %r15
+; SSE2-NEXT: .cfi_def_cfa_offset 16
+; SSE2-NEXT: popq %rbp
+; SSE2-NEXT: .cfi_def_cfa_offset 8
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: srem_v2i128:
+; AVX512: # %bb.0:
+; AVX512-NEXT: pushq %rbp
+; AVX512-NEXT: .cfi_def_cfa_offset 16
+; AVX512-NEXT: pushq %r15
+; AVX512-NEXT: .cfi_def_cfa_offset 24
+; AVX512-NEXT: pushq %r14
+; AVX512-NEXT: .cfi_def_cfa_offset 32
+; AVX512-NEXT: pushq %r13
+; AVX512-NEXT: .cfi_def_cfa_offset 40
+; AVX512-NEXT: pushq %r12
+; AVX512-NEXT: .cfi_def_cfa_offset 48
+; AVX512-NEXT: pushq %rbx
+; AVX512-NEXT: .cfi_def_cfa_offset 56
+; AVX512-NEXT: subq $40, %rsp
+; AVX512-NEXT: .cfi_def_cfa_offset 96
+; AVX512-NEXT: .cfi_offset %rbx, -56
+; AVX512-NEXT: .cfi_offset %r12, -48
+; AVX512-NEXT: .cfi_offset %r13, -40
+; AVX512-NEXT: .cfi_offset %r14, -32
+; AVX512-NEXT: .cfi_offset %r15, -24
+; AVX512-NEXT: .cfi_offset %rbp, -16
+; AVX512-NEXT: movq %r8, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
+; AVX512-NEXT: movq %rcx, %r15
+; AVX512-NEXT: movq %rdi, %rbx
+; AVX512-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp)
+; AVX512-NEXT: xorl %r12d, %r12d
+; AVX512-NEXT: testb $1, {{[0-9]+}}(%rsp)
+; AVX512-NEXT: movl $1, %r13d
+; AVX512-NEXT: cmoveq %r13, %r9
+; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rcx
+; AVX512-NEXT: cmoveq %r12, %rcx
+; AVX512-NEXT: movq %rsi, %rdi
+; AVX512-NEXT: movq %rdx, %rsi
+; AVX512-NEXT: movq %r9, %rdx
+; AVX512-NEXT: callq __modti3@PLT
+; AVX512-NEXT: movq %rax, %rbp
+; AVX512-NEXT: movq %rdx, %r14
+; AVX512-NEXT: testb $1, {{[0-9]+}}(%rsp)
+; AVX512-NEXT: je .LBB6_2
+; AVX512-NEXT: # %bb.1:
+; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %r13
+; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %r12
+; AVX512-NEXT: .LBB6_2:
+; AVX512-NEXT: movq %r15, %rdi
+; AVX512-NEXT: movq {{[-0-9]+}}(%r{{[sb]}}p), %rsi # 8-byte Reload
+; AVX512-NEXT: movq %r13, %rdx
+; AVX512-NEXT: movq %r12, %rcx
+; AVX512-NEXT: callq __modti3@PLT
+; AVX512-NEXT: movq %rdx, 24(%rbx)
+; AVX512-NEXT: movq %rax, 16(%rbx)
+; AVX512-NEXT: movq %r14, 8(%rbx)
+; AVX512-NEXT: movq %rbp, (%rbx)
+; AVX512-NEXT: movq %rbx, %rax
+; AVX512-NEXT: addq $40, %rsp
+; AVX512-NEXT: .cfi_def_cfa_offset 56
+; AVX512-NEXT: popq %rbx
+; AVX512-NEXT: .cfi_def_cfa_offset 48
+; AVX512-NEXT: popq %r12
+; AVX512-NEXT: .cfi_def_cfa_offset 40
+; AVX512-NEXT: popq %r13
+; AVX512-NEXT: .cfi_def_cfa_offset 32
+; AVX512-NEXT: popq %r14
+; AVX512-NEXT: .cfi_def_cfa_offset 24
+; AVX512-NEXT: popq %r15
+; AVX512-NEXT: .cfi_def_cfa_offset 16
+; AVX512-NEXT: popq %rbp
+; AVX512-NEXT: .cfi_def_cfa_offset 8
+; AVX512-NEXT: retq
+ %res = call <2 x i128> @llvm.masked.srem(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m)
+ ret <2 x i128> %res
+}
+
+; Promotion and widening
+define <3 x i10> @srem_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
+; SSE2-LABEL: srem_v3i10:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movd %esi, %xmm1
+; SSE2-NEXT: movd %edi, %xmm0
+; SSE2-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
+; SSE2-NEXT: movd %edx, %xmm1
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; SSE2-NEXT: pslld $22, %xmm0
+; SSE2-NEXT: psrad $22, %xmm0
+; SSE2-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; SSE2-NEXT: movd {{.*#+}} xmm2 = mem[0],zero,zero,zero
+; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
+; SSE2-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
+; SSE2-NEXT: pslld $31, %xmm2
+; SSE2-NEXT: psrad $31, %xmm2
+; SSE2-NEXT: movd %r8d, %xmm3
+; SSE2-NEXT: movd %ecx, %xmm1
+; SSE2-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm3[0],xmm1[1],xmm3[1]
+; SSE2-NEXT: movd %r9d, %xmm3
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm3[0]
+; SSE2-NEXT: pslld $22, %xmm1
+; SSE2-NEXT: psrad $22, %xmm1
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: paddd %xmm2, %xmm1
+; SSE2-NEXT: pcmpeqd %xmm2, %xmm2
+; SSE2-NEXT: psubd %xmm2, %xmm1
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,3,2,3]
+; SSE2-NEXT: movd %xmm2, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; SSE2-NEXT: movd %xmm2, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %ecx
+; SSE2-NEXT: movl %edx, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,1,1,1]
+; SSE2-NEXT: movd %xmm2, %esi
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[1,1,1,1]
+; SSE2-NEXT: movd %xmm2, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %esi
+; SSE2-NEXT: movl %edx, %esi
+; SSE2-NEXT: movd %xmm1, %edi
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: cltd
+; SSE2-NEXT: idivl %edi
+; SSE2-NEXT: movl %edx, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: movl %esi, %edx
+; SSE2-NEXT: # kill: def $cx killed $cx killed $ecx
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: srem_v3i10:
+; AVX512: # %bb.0:
+; AVX512-NEXT: movd %esi, %xmm1
+; AVX512-NEXT: movd %edi, %xmm0
+; AVX512-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
+; AVX512-NEXT: movd %edx, %xmm1
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; AVX512-NEXT: pslld $22, %xmm0
+; AVX512-NEXT: psrad $22, %xmm0
+; AVX512-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; AVX512-NEXT: movd {{.*#+}} xmm2 = mem[0],zero,zero,zero
+; AVX512-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
+; AVX512-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
+; AVX512-NEXT: pslld $31, %xmm2
+; AVX512-NEXT: psrad $31, %xmm2
+; AVX512-NEXT: movd %r8d, %xmm3
+; AVX512-NEXT: movd %ecx, %xmm1
+; AVX512-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm3[0],xmm1[1],xmm3[1]
+; AVX512-NEXT: movd %r9d, %xmm3
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm3[0]
+; AVX512-NEXT: pslld $22, %xmm1
+; AVX512-NEXT: psrad $22, %xmm1
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: paddd %xmm2, %xmm1
+; AVX512-NEXT: pcmpeqd %xmm2, %xmm2
+; AVX512-NEXT: psubd %xmm2, %xmm1
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,3,2,3]
+; AVX512-NEXT: movd %xmm2, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; AVX512-NEXT: movd %xmm2, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %ecx
+; AVX512-NEXT: movl %edx, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,1,1,1]
+; AVX512-NEXT: movd %xmm2, %esi
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm0[1,1,1,1]
+; AVX512-NEXT: movd %xmm2, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %esi
+; AVX512-NEXT: movl %edx, %esi
+; AVX512-NEXT: movd %xmm1, %edi
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: cltd
+; AVX512-NEXT: idivl %edi
+; AVX512-NEXT: movl %edx, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: movl %esi, %edx
+; AVX512-NEXT: # kill: def $cx killed $cx killed $ecx
+; AVX512-NEXT: retq
+ %res = call <3 x i10> @llvm.masked.srem(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m)
+ ret <3 x i10> %res
+}
diff --git a/llvm/test/CodeGen/X86/masked-udiv.ll b/llvm/test/CodeGen/X86/masked-udiv.ll
new file mode 100644
index 0000000000000..2f7c2a4316329
--- /dev/null
+++ b/llvm/test/CodeGen/X86/masked-udiv.ll
@@ -0,0 +1,756 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple x86_64 -mattr=+sse2 < %s | FileCheck %s --check-prefix=SSE2
+; RUN: llc -mtriple x86_64 -mattr=+avx512 < %s | FileCheck %s --check-prefix=AVX512
+
+; Legal
+define <4 x i32> @udiv_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
+; SSE2-LABEL: udiv_v4i32:
+; SSE2: # %bb.0:
+; SSE2-NEXT: pslld $31, %xmm2
+; SSE2-NEXT: psrad $31, %xmm2
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: paddd %xmm2, %xmm1
+; SSE2-NEXT: pcmpeqd %xmm2, %xmm2
+; SSE2-NEXT: psubd %xmm2, %xmm1
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; SSE2-NEXT: movd %xmm2, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; SSE2-NEXT: movd %xmm2, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movd %eax, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,2,3]
+; SSE2-NEXT: movd %xmm3, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm0[2,3,2,3]
+; SSE2-NEXT: movd %xmm3, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movd %eax, %xmm3
+; SSE2-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movd %eax, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,1,1]
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,1,1]
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movd %eax, %xmm0
+; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
+; SSE2-NEXT: movdqa %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: udiv_v4i32:
+; AVX512: # %bb.0:
+; AVX512-NEXT: pslld $31, %xmm2
+; AVX512-NEXT: psrad $31, %xmm2
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: paddd %xmm2, %xmm1
+; AVX512-NEXT: pcmpeqd %xmm2, %xmm2
+; AVX512-NEXT: psubd %xmm2, %xmm1
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; AVX512-NEXT: movd %xmm2, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; AVX512-NEXT: movd %xmm2, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movd %eax, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,2,3]
+; AVX512-NEXT: movd %xmm3, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm0[2,3,2,3]
+; AVX512-NEXT: movd %xmm3, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movd %eax, %xmm3
+; AVX512-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movd %eax, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,1,1]
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,1,1]
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movd %eax, %xmm0
+; AVX512-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
+; AVX512-NEXT: movdqa %xmm2, %xmm0
+; AVX512-NEXT: retq
+ %res = call <4 x i32> @llvm.masked.udiv(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
+ ret <4 x i32> %res
+}
+
+define <2 x i64> @udiv_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
+; SSE2-LABEL: udiv_v2i64:
+; SSE2: # %bb.0:
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,0,2,2]
+; SSE2-NEXT: pslld $31, %xmm2
+; SSE2-NEXT: psrad $31, %xmm2
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: pandn {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2
+; SSE2-NEXT: por %xmm1, %xmm2
+; SSE2-NEXT: movq %xmm2, %rcx
+; SSE2-NEXT: movq %xmm0, %rax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divq %rcx
+; SSE2-NEXT: movq %rax, %xmm1
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[2,3,2,3]
+; SSE2-NEXT: movq %xmm2, %rcx
+; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,2,3]
+; SSE2-NEXT: movq %xmm0, %rax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divq %rcx
+; SSE2-NEXT: movq %rax, %xmm0
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
+; SSE2-NEXT: movdqa %xmm1, %xmm0
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: udiv_v2i64:
+; AVX512: # %bb.0:
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,0,2,2]
+; AVX512-NEXT: pslld $31, %xmm2
+; AVX512-NEXT: psrad $31, %xmm2
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: pandn {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2
+; AVX512-NEXT: por %xmm1, %xmm2
+; AVX512-NEXT: movq %xmm2, %rcx
+; AVX512-NEXT: movq %xmm0, %rax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divq %rcx
+; AVX512-NEXT: movq %rax, %xmm1
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm2[2,3,2,3]
+; AVX512-NEXT: movq %xmm2, %rcx
+; AVX512-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,2,3]
+; AVX512-NEXT: movq %xmm0, %rax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divq %rcx
+; AVX512-NEXT: movq %rax, %xmm0
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
+; AVX512-NEXT: movdqa %xmm1, %xmm0
+; AVX512-NEXT: retq
+ %res = call <2 x i64> @llvm.masked.udiv(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m)
+ ret <2 x i64> %res
+}
+
+; Splitting
+define <4 x i64> @udiv_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
+; SSE2-LABEL: udiv_v4i64:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movdqa %xmm0, %xmm5
+; SSE2-NEXT: pshufd {{.*#+}} xmm6 = xmm4[2,2,3,3]
+; SSE2-NEXT: pslld $31, %xmm6
+; SSE2-NEXT: psrad $31, %xmm6
+; SSE2-NEXT: pshufd {{.*#+}} xmm4 = xmm4[0,0,1,1]
+; SSE2-NEXT: pslld $31, %xmm4
+; SSE2-NEXT: psrad $31, %xmm4
+; SSE2-NEXT: movdqa {{.*#+}} xmm7 = [1,1]
+; SSE2-NEXT: pand %xmm4, %xmm2
+; SSE2-NEXT: pandn %xmm7, %xmm4
+; SSE2-NEXT: por %xmm2, %xmm4
+; SSE2-NEXT: movq %xmm4, %rcx
+; SSE2-NEXT: movq %xmm0, %rax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divq %rcx
+; SSE2-NEXT: movq %rax, %xmm0
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm4[2,3,2,3]
+; SSE2-NEXT: movq %xmm2, %rcx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm5[2,3,2,3]
+; SSE2-NEXT: movq %xmm2, %rax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divq %rcx
+; SSE2-NEXT: movq %rax, %xmm2
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; SSE2-NEXT: pand %xmm6, %xmm3
+; SSE2-NEXT: pandn %xmm7, %xmm6
+; SSE2-NEXT: por %xmm3, %xmm6
+; SSE2-NEXT: movq %xmm6, %rcx
+; SSE2-NEXT: movq %xmm1, %rax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divq %rcx
+; SSE2-NEXT: movq %rax, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm6[2,3,2,3]
+; SSE2-NEXT: movq %xmm3, %rcx
+; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,3,2,3]
+; SSE2-NEXT: movq %xmm1, %rax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divq %rcx
+; SSE2-NEXT: movq %rax, %xmm1
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
+; SSE2-NEXT: movdqa %xmm2, %xmm1
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: udiv_v4i64:
+; AVX512: # %bb.0:
+; AVX512-NEXT: movdqa %xmm0, %xmm5
+; AVX512-NEXT: pshufd {{.*#+}} xmm6 = xmm4[2,2,3,3]
+; AVX512-NEXT: pslld $31, %xmm6
+; AVX512-NEXT: psrad $31, %xmm6
+; AVX512-NEXT: pshufd {{.*#+}} xmm4 = xmm4[0,0,1,1]
+; AVX512-NEXT: pslld $31, %xmm4
+; AVX512-NEXT: psrad $31, %xmm4
+; AVX512-NEXT: movdqa {{.*#+}} xmm7 = [1,1]
+; AVX512-NEXT: pand %xmm4, %xmm2
+; AVX512-NEXT: pandn %xmm7, %xmm4
+; AVX512-NEXT: por %xmm2, %xmm4
+; AVX512-NEXT: movq %xmm4, %rcx
+; AVX512-NEXT: movq %xmm0, %rax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divq %rcx
+; AVX512-NEXT: movq %rax, %xmm0
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm4[2,3,2,3]
+; AVX512-NEXT: movq %xmm2, %rcx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm5[2,3,2,3]
+; AVX512-NEXT: movq %xmm2, %rax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divq %rcx
+; AVX512-NEXT: movq %rax, %xmm2
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; AVX512-NEXT: pand %xmm6, %xmm3
+; AVX512-NEXT: pandn %xmm7, %xmm6
+; AVX512-NEXT: por %xmm3, %xmm6
+; AVX512-NEXT: movq %xmm6, %rcx
+; AVX512-NEXT: movq %xmm1, %rax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divq %rcx
+; AVX512-NEXT: movq %rax, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm6[2,3,2,3]
+; AVX512-NEXT: movq %xmm3, %rcx
+; AVX512-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,3,2,3]
+; AVX512-NEXT: movq %xmm1, %rax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divq %rcx
+; AVX512-NEXT: movq %rax, %xmm1
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
+; AVX512-NEXT: movdqa %xmm2, %xmm1
+; AVX512-NEXT: retq
+ %res = call <4 x i64> @llvm.masked.udiv(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
+ ret <4 x i64> %res
+}
+
+; Widening
+define <2 x i32> @udiv_v2i32(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m) {
+; SSE2-LABEL: udiv_v2i32:
+; SSE2: # %bb.0:
+; SSE2-NEXT: xorps %xmm3, %xmm3
+; SSE2-NEXT: shufps {{.*#+}} xmm2 = xmm2[0,2],xmm3[2,3]
+; SSE2-NEXT: pslld $31, %xmm2
+; SSE2-NEXT: psrad $31, %xmm2
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: paddd %xmm2, %xmm1
+; SSE2-NEXT: pcmpeqd %xmm2, %xmm2
+; SSE2-NEXT: psubd %xmm2, %xmm1
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; SSE2-NEXT: movd %xmm2, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; SSE2-NEXT: movd %xmm2, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movd %eax, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,2,3]
+; SSE2-NEXT: movd %xmm3, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm0[2,3,2,3]
+; SSE2-NEXT: movd %xmm3, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movd %eax, %xmm3
+; SSE2-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movd %eax, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,1,1]
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,1,1]
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movd %eax, %xmm0
+; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
+; SSE2-NEXT: movdqa %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: udiv_v2i32:
+; AVX512: # %bb.0:
+; AVX512-NEXT: xorps %xmm3, %xmm3
+; AVX512-NEXT: shufps {{.*#+}} xmm2 = xmm2[0,2],xmm3[2,3]
+; AVX512-NEXT: pslld $31, %xmm2
+; AVX512-NEXT: psrad $31, %xmm2
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: paddd %xmm2, %xmm1
+; AVX512-NEXT: pcmpeqd %xmm2, %xmm2
+; AVX512-NEXT: psubd %xmm2, %xmm1
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; AVX512-NEXT: movd %xmm2, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; AVX512-NEXT: movd %xmm2, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movd %eax, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,2,3]
+; AVX512-NEXT: movd %xmm3, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm0[2,3,2,3]
+; AVX512-NEXT: movd %xmm3, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movd %eax, %xmm3
+; AVX512-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movd %eax, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,1,1]
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,1,1]
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movd %eax, %xmm0
+; AVX512-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
+; AVX512-NEXT: movdqa %xmm2, %xmm0
+; AVX512-NEXT: retq
+ %res = call <2 x i32> @llvm.masked.udiv(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m)
+ ret <2 x i32> %res
+}
+
+; Promotion
+define <4 x i16> @udiv_v4i16(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m) {
+; SSE2-LABEL: udiv_v4i16:
+; SSE2: # %bb.0:
+; SSE2-NEXT: pshuflw {{.*#+}} xmm2 = xmm2[0,2,2,3,4,5,6,7]
+; SSE2-NEXT: pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,4,6,6,7]
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,1,0,2]
+; SSE2-NEXT: psrldq {{.*#+}} xmm2 = xmm2[8,9,10,11,12,13,14,15],zero,zero,zero,zero,zero,zero,zero,zero
+; SSE2-NEXT: psllw $15, %xmm2
+; SSE2-NEXT: psraw $15, %xmm2
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: paddw %xmm2, %xmm1
+; SSE2-NEXT: pcmpeqd %xmm2, %xmm2
+; SSE2-NEXT: psubw %xmm2, %xmm1
+; SSE2-NEXT: pextrw $7, %xmm1, %ecx
+; SSE2-NEXT: pextrw $7, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divw %cx
+; SSE2-NEXT: # kill: def $ax killed $ax def $eax
+; SSE2-NEXT: movd %eax, %xmm2
+; SSE2-NEXT: pextrw $6, %xmm1, %ecx
+; SSE2-NEXT: pextrw $6, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divw %cx
+; SSE2-NEXT: # kill: def $ax killed $ax def $eax
+; SSE2-NEXT: movd %eax, %xmm3
+; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1],xmm3[2],xmm2[2],xmm3[3],xmm2[3]
+; SSE2-NEXT: pextrw $5, %xmm1, %ecx
+; SSE2-NEXT: pextrw $5, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divw %cx
+; SSE2-NEXT: # kill: def $ax killed $ax def $eax
+; SSE2-NEXT: movd %eax, %xmm4
+; SSE2-NEXT: pextrw $4, %xmm1, %ecx
+; SSE2-NEXT: pextrw $4, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divw %cx
+; SSE2-NEXT: # kill: def $ax killed $ax def $eax
+; SSE2-NEXT: movd %eax, %xmm2
+; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
+; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
+; SSE2-NEXT: pextrw $3, %xmm1, %ecx
+; SSE2-NEXT: pextrw $3, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divw %cx
+; SSE2-NEXT: # kill: def $ax killed $ax def $eax
+; SSE2-NEXT: movd %eax, %xmm3
+; SSE2-NEXT: pextrw $2, %xmm1, %ecx
+; SSE2-NEXT: pextrw $2, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divw %cx
+; SSE2-NEXT: # kill: def $ax killed $ax def $eax
+; SSE2-NEXT: movd %eax, %xmm4
+; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm3[0],xmm4[1],xmm3[1],xmm4[2],xmm3[2],xmm4[3],xmm3[3]
+; SSE2-NEXT: pextrw $1, %xmm1, %ecx
+; SSE2-NEXT: pextrw $1, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divw %cx
+; SSE2-NEXT: # kill: def $ax killed $ax def $eax
+; SSE2-NEXT: movd %eax, %xmm3
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divw %cx
+; SSE2-NEXT: # kill: def $ax killed $ax def $eax
+; SSE2-NEXT: movd %eax, %xmm0
+; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
+; SSE2-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1]
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: udiv_v4i16:
+; AVX512: # %bb.0:
+; AVX512-NEXT: pshuflw {{.*#+}} xmm2 = xmm2[0,2,2,3,4,5,6,7]
+; AVX512-NEXT: pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,4,6,6,7]
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,1,0,2]
+; AVX512-NEXT: psrldq {{.*#+}} xmm2 = xmm2[8,9,10,11,12,13,14,15],zero,zero,zero,zero,zero,zero,zero,zero
+; AVX512-NEXT: psllw $15, %xmm2
+; AVX512-NEXT: psraw $15, %xmm2
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: paddw %xmm2, %xmm1
+; AVX512-NEXT: pcmpeqd %xmm2, %xmm2
+; AVX512-NEXT: psubw %xmm2, %xmm1
+; AVX512-NEXT: pextrw $7, %xmm1, %ecx
+; AVX512-NEXT: pextrw $7, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divw %cx
+; AVX512-NEXT: # kill: def $ax killed $ax def $eax
+; AVX512-NEXT: movd %eax, %xmm2
+; AVX512-NEXT: pextrw $6, %xmm1, %ecx
+; AVX512-NEXT: pextrw $6, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divw %cx
+; AVX512-NEXT: # kill: def $ax killed $ax def $eax
+; AVX512-NEXT: movd %eax, %xmm3
+; AVX512-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1],xmm3[2],xmm2[2],xmm3[3],xmm2[3]
+; AVX512-NEXT: pextrw $5, %xmm1, %ecx
+; AVX512-NEXT: pextrw $5, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divw %cx
+; AVX512-NEXT: # kill: def $ax killed $ax def $eax
+; AVX512-NEXT: movd %eax, %xmm4
+; AVX512-NEXT: pextrw $4, %xmm1, %ecx
+; AVX512-NEXT: pextrw $4, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divw %cx
+; AVX512-NEXT: # kill: def $ax killed $ax def $eax
+; AVX512-NEXT: movd %eax, %xmm2
+; AVX512-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
+; AVX512-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
+; AVX512-NEXT: pextrw $3, %xmm1, %ecx
+; AVX512-NEXT: pextrw $3, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divw %cx
+; AVX512-NEXT: # kill: def $ax killed $ax def $eax
+; AVX512-NEXT: movd %eax, %xmm3
+; AVX512-NEXT: pextrw $2, %xmm1, %ecx
+; AVX512-NEXT: pextrw $2, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divw %cx
+; AVX512-NEXT: # kill: def $ax killed $ax def $eax
+; AVX512-NEXT: movd %eax, %xmm4
+; AVX512-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm3[0],xmm4[1],xmm3[1],xmm4[2],xmm3[2],xmm4[3],xmm3[3]
+; AVX512-NEXT: pextrw $1, %xmm1, %ecx
+; AVX512-NEXT: pextrw $1, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divw %cx
+; AVX512-NEXT: # kill: def $ax killed $ax def $eax
+; AVX512-NEXT: movd %eax, %xmm3
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divw %cx
+; AVX512-NEXT: # kill: def $ax killed $ax def $eax
+; AVX512-NEXT: movd %eax, %xmm0
+; AVX512-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
+; AVX512-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1]
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; AVX512-NEXT: retq
+ %res = call <4 x i16> @llvm.masked.udiv(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m)
+ ret <4 x i16> %res
+}
+
+; Scalarization
+define <1 x i64> @udiv_v1i64(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m) {
+; SSE2-LABEL: udiv_v1i64:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movq %rdi, %rax
+; SSE2-NEXT: testb $1, %dl
+; SSE2-NEXT: movl $1, %ecx
+; SSE2-NEXT: cmovneq %rsi, %rcx
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divq %rcx
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: udiv_v1i64:
+; AVX512: # %bb.0:
+; AVX512-NEXT: movq %rdi, %rax
+; AVX512-NEXT: testb $1, %dl
+; AVX512-NEXT: movl $1, %ecx
+; AVX512-NEXT: cmovneq %rsi, %rcx
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divq %rcx
+; AVX512-NEXT: retq
+ %res = call <1 x i64> @llvm.masked.udiv(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m)
+ ret <1 x i64> %res
+}
+
+; Expansion
+define <2 x i128> @udiv_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
+; SSE2-LABEL: udiv_v2i128:
+; SSE2: # %bb.0:
+; SSE2-NEXT: pushq %rbp
+; SSE2-NEXT: .cfi_def_cfa_offset 16
+; SSE2-NEXT: pushq %r15
+; SSE2-NEXT: .cfi_def_cfa_offset 24
+; SSE2-NEXT: pushq %r14
+; SSE2-NEXT: .cfi_def_cfa_offset 32
+; SSE2-NEXT: pushq %r13
+; SSE2-NEXT: .cfi_def_cfa_offset 40
+; SSE2-NEXT: pushq %r12
+; SSE2-NEXT: .cfi_def_cfa_offset 48
+; SSE2-NEXT: pushq %rbx
+; SSE2-NEXT: .cfi_def_cfa_offset 56
+; SSE2-NEXT: subq $40, %rsp
+; SSE2-NEXT: .cfi_def_cfa_offset 96
+; SSE2-NEXT: .cfi_offset %rbx, -56
+; SSE2-NEXT: .cfi_offset %r12, -48
+; SSE2-NEXT: .cfi_offset %r13, -40
+; SSE2-NEXT: .cfi_offset %r14, -32
+; SSE2-NEXT: .cfi_offset %r15, -24
+; SSE2-NEXT: .cfi_offset %rbp, -16
+; SSE2-NEXT: movq %r8, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
+; SSE2-NEXT: movq %rcx, %r15
+; SSE2-NEXT: movq %rdi, %rbx
+; SSE2-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp)
+; SSE2-NEXT: xorl %r12d, %r12d
+; SSE2-NEXT: testb $1, {{[0-9]+}}(%rsp)
+; SSE2-NEXT: movl $1, %r13d
+; SSE2-NEXT: cmoveq %r13, %r9
+; SSE2-NEXT: movq {{[0-9]+}}(%rsp), %rcx
+; SSE2-NEXT: cmoveq %r12, %rcx
+; SSE2-NEXT: movq %rsi, %rdi
+; SSE2-NEXT: movq %rdx, %rsi
+; SSE2-NEXT: movq %r9, %rdx
+; SSE2-NEXT: callq __udivti3@PLT
+; SSE2-NEXT: movq %rax, %rbp
+; SSE2-NEXT: movq %rdx, %r14
+; SSE2-NEXT: testb $1, {{[0-9]+}}(%rsp)
+; SSE2-NEXT: je .LBB6_2
+; SSE2-NEXT: # %bb.1:
+; SSE2-NEXT: movq {{[0-9]+}}(%rsp), %r13
+; SSE2-NEXT: movq {{[0-9]+}}(%rsp), %r12
+; SSE2-NEXT: .LBB6_2:
+; SSE2-NEXT: movq %r15, %rdi
+; SSE2-NEXT: movq {{[-0-9]+}}(%r{{[sb]}}p), %rsi # 8-byte Reload
+; SSE2-NEXT: movq %r13, %rdx
+; SSE2-NEXT: movq %r12, %rcx
+; SSE2-NEXT: callq __udivti3@PLT
+; SSE2-NEXT: movq %rdx, 24(%rbx)
+; SSE2-NEXT: movq %rax, 16(%rbx)
+; SSE2-NEXT: movq %r14, 8(%rbx)
+; SSE2-NEXT: movq %rbp, (%rbx)
+; SSE2-NEXT: movq %rbx, %rax
+; SSE2-NEXT: addq $40, %rsp
+; SSE2-NEXT: .cfi_def_cfa_offset 56
+; SSE2-NEXT: popq %rbx
+; SSE2-NEXT: .cfi_def_cfa_offset 48
+; SSE2-NEXT: popq %r12
+; SSE2-NEXT: .cfi_def_cfa_offset 40
+; SSE2-NEXT: popq %r13
+; SSE2-NEXT: .cfi_def_cfa_offset 32
+; SSE2-NEXT: popq %r14
+; SSE2-NEXT: .cfi_def_cfa_offset 24
+; SSE2-NEXT: popq %r15
+; SSE2-NEXT: .cfi_def_cfa_offset 16
+; SSE2-NEXT: popq %rbp
+; SSE2-NEXT: .cfi_def_cfa_offset 8
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: udiv_v2i128:
+; AVX512: # %bb.0:
+; AVX512-NEXT: pushq %rbp
+; AVX512-NEXT: .cfi_def_cfa_offset 16
+; AVX512-NEXT: pushq %r15
+; AVX512-NEXT: .cfi_def_cfa_offset 24
+; AVX512-NEXT: pushq %r14
+; AVX512-NEXT: .cfi_def_cfa_offset 32
+; AVX512-NEXT: pushq %r13
+; AVX512-NEXT: .cfi_def_cfa_offset 40
+; AVX512-NEXT: pushq %r12
+; AVX512-NEXT: .cfi_def_cfa_offset 48
+; AVX512-NEXT: pushq %rbx
+; AVX512-NEXT: .cfi_def_cfa_offset 56
+; AVX512-NEXT: subq $40, %rsp
+; AVX512-NEXT: .cfi_def_cfa_offset 96
+; AVX512-NEXT: .cfi_offset %rbx, -56
+; AVX512-NEXT: .cfi_offset %r12, -48
+; AVX512-NEXT: .cfi_offset %r13, -40
+; AVX512-NEXT: .cfi_offset %r14, -32
+; AVX512-NEXT: .cfi_offset %r15, -24
+; AVX512-NEXT: .cfi_offset %rbp, -16
+; AVX512-NEXT: movq %r8, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
+; AVX512-NEXT: movq %rcx, %r15
+; AVX512-NEXT: movq %rdi, %rbx
+; AVX512-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp)
+; AVX512-NEXT: xorl %r12d, %r12d
+; AVX512-NEXT: testb $1, {{[0-9]+}}(%rsp)
+; AVX512-NEXT: movl $1, %r13d
+; AVX512-NEXT: cmoveq %r13, %r9
+; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rcx
+; AVX512-NEXT: cmoveq %r12, %rcx
+; AVX512-NEXT: movq %rsi, %rdi
+; AVX512-NEXT: movq %rdx, %rsi
+; AVX512-NEXT: movq %r9, %rdx
+; AVX512-NEXT: callq __udivti3@PLT
+; AVX512-NEXT: movq %rax, %rbp
+; AVX512-NEXT: movq %rdx, %r14
+; AVX512-NEXT: testb $1, {{[0-9]+}}(%rsp)
+; AVX512-NEXT: je .LBB6_2
+; AVX512-NEXT: # %bb.1:
+; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %r13
+; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %r12
+; AVX512-NEXT: .LBB6_2:
+; AVX512-NEXT: movq %r15, %rdi
+; AVX512-NEXT: movq {{[-0-9]+}}(%r{{[sb]}}p), %rsi # 8-byte Reload
+; AVX512-NEXT: movq %r13, %rdx
+; AVX512-NEXT: movq %r12, %rcx
+; AVX512-NEXT: callq __udivti3@PLT
+; AVX512-NEXT: movq %rdx, 24(%rbx)
+; AVX512-NEXT: movq %rax, 16(%rbx)
+; AVX512-NEXT: movq %r14, 8(%rbx)
+; AVX512-NEXT: movq %rbp, (%rbx)
+; AVX512-NEXT: movq %rbx, %rax
+; AVX512-NEXT: addq $40, %rsp
+; AVX512-NEXT: .cfi_def_cfa_offset 56
+; AVX512-NEXT: popq %rbx
+; AVX512-NEXT: .cfi_def_cfa_offset 48
+; AVX512-NEXT: popq %r12
+; AVX512-NEXT: .cfi_def_cfa_offset 40
+; AVX512-NEXT: popq %r13
+; AVX512-NEXT: .cfi_def_cfa_offset 32
+; AVX512-NEXT: popq %r14
+; AVX512-NEXT: .cfi_def_cfa_offset 24
+; AVX512-NEXT: popq %r15
+; AVX512-NEXT: .cfi_def_cfa_offset 16
+; AVX512-NEXT: popq %rbp
+; AVX512-NEXT: .cfi_def_cfa_offset 8
+; AVX512-NEXT: retq
+ %res = call <2 x i128> @llvm.masked.udiv(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m)
+ ret <2 x i128> %res
+}
+
+; Promotion and widening
+define <3 x i10> @udiv_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
+; SSE2-LABEL: udiv_v3i10:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movd %r8d, %xmm1
+; SSE2-NEXT: movd %ecx, %xmm0
+; SSE2-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
+; SSE2-NEXT: movd %r9d, %xmm1
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; SSE2-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; SSE2-NEXT: movd {{.*#+}} xmm2 = mem[0],zero,zero,zero
+; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
+; SSE2-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
+; SSE2-NEXT: pslld $31, %xmm2
+; SSE2-NEXT: psrad $31, %xmm2
+; SSE2-NEXT: movd %esi, %xmm3
+; SSE2-NEXT: movd %edi, %xmm1
+; SSE2-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm3[0],xmm1[1],xmm3[1]
+; SSE2-NEXT: movd %edx, %xmm3
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm3[0]
+; SSE2-NEXT: movdqa {{.*#+}} xmm3 = [1023,1023,1023,1023]
+; SSE2-NEXT: pand %xmm3, %xmm1
+; SSE2-NEXT: pand %xmm3, %xmm0
+; SSE2-NEXT: pand %xmm2, %xmm0
+; SSE2-NEXT: paddd %xmm2, %xmm0
+; SSE2-NEXT: pcmpeqd %xmm2, %xmm2
+; SSE2-NEXT: psubd %xmm2, %xmm0
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; SSE2-NEXT: movd %xmm2, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,3,2,3]
+; SSE2-NEXT: movd %xmm2, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movl %eax, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[1,1,1,1]
+; SSE2-NEXT: movd %xmm2, %esi
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,1,1,1]
+; SSE2-NEXT: movd %xmm2, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %esi
+; SSE2-NEXT: movl %eax, %esi
+; SSE2-NEXT: movd %xmm0, %edi
+; SSE2-NEXT: movd %xmm1, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %edi
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: movl %esi, %edx
+; SSE2-NEXT: # kill: def $cx killed $cx killed $ecx
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: udiv_v3i10:
+; AVX512: # %bb.0:
+; AVX512-NEXT: movd %r8d, %xmm1
+; AVX512-NEXT: movd %ecx, %xmm0
+; AVX512-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
+; AVX512-NEXT: movd %r9d, %xmm1
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; AVX512-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; AVX512-NEXT: movd {{.*#+}} xmm2 = mem[0],zero,zero,zero
+; AVX512-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
+; AVX512-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
+; AVX512-NEXT: pslld $31, %xmm2
+; AVX512-NEXT: psrad $31, %xmm2
+; AVX512-NEXT: movd %esi, %xmm3
+; AVX512-NEXT: movd %edi, %xmm1
+; AVX512-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm3[0],xmm1[1],xmm3[1]
+; AVX512-NEXT: movd %edx, %xmm3
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm3[0]
+; AVX512-NEXT: movdqa {{.*#+}} xmm3 = [1023,1023,1023,1023]
+; AVX512-NEXT: pand %xmm3, %xmm1
+; AVX512-NEXT: pand %xmm3, %xmm0
+; AVX512-NEXT: pand %xmm2, %xmm0
+; AVX512-NEXT: paddd %xmm2, %xmm0
+; AVX512-NEXT: pcmpeqd %xmm2, %xmm2
+; AVX512-NEXT: psubd %xmm2, %xmm0
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; AVX512-NEXT: movd %xmm2, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,3,2,3]
+; AVX512-NEXT: movd %xmm2, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movl %eax, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm0[1,1,1,1]
+; AVX512-NEXT: movd %xmm2, %esi
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,1,1,1]
+; AVX512-NEXT: movd %xmm2, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %esi
+; AVX512-NEXT: movl %eax, %esi
+; AVX512-NEXT: movd %xmm0, %edi
+; AVX512-NEXT: movd %xmm1, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %edi
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: movl %esi, %edx
+; AVX512-NEXT: # kill: def $cx killed $cx killed $ecx
+; AVX512-NEXT: retq
+ %res = call <3 x i10> @llvm.masked.udiv(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m)
+ ret <3 x i10> %res
+}
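
The SSE2/AVX512 sequences above implement the safe-divisor select without a
blend: with a lane mask m that is all-ones or all-zeros, (y & m) + m + 1
equals y on enabled lanes and 1 on disabled lanes, which is exactly what the
pand / paddd / psubd-of-minus-one triple computes. A minimal IR sketch of that
identity (illustrative name, not part of the patch):

  define <4 x i32> @safe_divisor(<4 x i32> %y, <4 x i32> %m.sext) {
    ; Per lane: m == -1 gives (y & -1) + (-1) + 1 == y; m == 0 gives 0 + 0 + 1 == 1.
    %and = and <4 x i32> %y, %m.sext
    %add = add <4 x i32> %and, %m.sext
    %safe = add <4 x i32> %add, <i32 1, i32 1, i32 1, i32 1>
    ret <4 x i32> %safe
  }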
diff --git a/llvm/test/CodeGen/X86/masked-urem.ll b/llvm/test/CodeGen/X86/masked-urem.ll
new file mode 100644
index 0000000000000..e5451e835efd4
--- /dev/null
+++ b/llvm/test/CodeGen/X86/masked-urem.ll
@@ -0,0 +1,760 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple x86_64 -mattr=+sse2 < %s | FileCheck %s --check-prefix=SSE2
+; RUN: llc -mtriple x86_64 -mattr=+avx512 < %s | FileCheck %s --check-prefix=AVX512
+
+; Legal
+define <4 x i32> @urem_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
+; SSE2-LABEL: urem_v4i32:
+; SSE2: # %bb.0:
+; SSE2-NEXT: pslld $31, %xmm2
+; SSE2-NEXT: psrad $31, %xmm2
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: paddd %xmm2, %xmm1
+; SSE2-NEXT: pcmpeqd %xmm2, %xmm2
+; SSE2-NEXT: psubd %xmm2, %xmm1
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; SSE2-NEXT: movd %xmm2, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; SSE2-NEXT: movd %xmm2, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movd %edx, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,2,3]
+; SSE2-NEXT: movd %xmm3, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm0[2,3,2,3]
+; SSE2-NEXT: movd %xmm3, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movd %edx, %xmm3
+; SSE2-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movd %edx, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,1,1]
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,1,1]
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movd %edx, %xmm0
+; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
+; SSE2-NEXT: movdqa %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: urem_v4i32:
+; AVX512: # %bb.0:
+; AVX512-NEXT: pslld $31, %xmm2
+; AVX512-NEXT: psrad $31, %xmm2
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: paddd %xmm2, %xmm1
+; AVX512-NEXT: pcmpeqd %xmm2, %xmm2
+; AVX512-NEXT: psubd %xmm2, %xmm1
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; AVX512-NEXT: movd %xmm2, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; AVX512-NEXT: movd %xmm2, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movd %edx, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,2,3]
+; AVX512-NEXT: movd %xmm3, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm0[2,3,2,3]
+; AVX512-NEXT: movd %xmm3, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movd %edx, %xmm3
+; AVX512-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movd %edx, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,1,1]
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,1,1]
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movd %edx, %xmm0
+; AVX512-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
+; AVX512-NEXT: movdqa %xmm2, %xmm0
+; AVX512-NEXT: retq
+ %res = call <4 x i32> @llvm.masked.urem(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
+ ret <4 x i32> %res
+}
+
+define <2 x i64> @urem_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
+; SSE2-LABEL: urem_v2i64:
+; SSE2: # %bb.0:
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,0,2,2]
+; SSE2-NEXT: pslld $31, %xmm2
+; SSE2-NEXT: psrad $31, %xmm2
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: pandn {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2
+; SSE2-NEXT: por %xmm1, %xmm2
+; SSE2-NEXT: movq %xmm2, %rcx
+; SSE2-NEXT: movq %xmm0, %rax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divq %rcx
+; SSE2-NEXT: movq %rdx, %xmm1
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[2,3,2,3]
+; SSE2-NEXT: movq %xmm2, %rcx
+; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,2,3]
+; SSE2-NEXT: movq %xmm0, %rax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divq %rcx
+; SSE2-NEXT: movq %rdx, %xmm0
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
+; SSE2-NEXT: movdqa %xmm1, %xmm0
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: urem_v2i64:
+; AVX512: # %bb.0:
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,0,2,2]
+; AVX512-NEXT: pslld $31, %xmm2
+; AVX512-NEXT: psrad $31, %xmm2
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: pandn {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2
+; AVX512-NEXT: por %xmm1, %xmm2
+; AVX512-NEXT: movq %xmm2, %rcx
+; AVX512-NEXT: movq %xmm0, %rax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divq %rcx
+; AVX512-NEXT: movq %rdx, %xmm1
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm2[2,3,2,3]
+; AVX512-NEXT: movq %xmm2, %rcx
+; AVX512-NEXT: pshufd {{.*#+}} xmm0 = xmm0[2,3,2,3]
+; AVX512-NEXT: movq %xmm0, %rax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divq %rcx
+; AVX512-NEXT: movq %rdx, %xmm0
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm0[0]
+; AVX512-NEXT: movdqa %xmm1, %xmm0
+; AVX512-NEXT: retq
+ %res = call <2 x i64> @llvm.masked.urem(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m)
+ ret <2 x i64> %res
+}
+
+; Splitting
+define <4 x i64> @urem_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
+; SSE2-LABEL: urem_v4i64:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movdqa %xmm0, %xmm5
+; SSE2-NEXT: pshufd {{.*#+}} xmm6 = xmm4[2,2,3,3]
+; SSE2-NEXT: pslld $31, %xmm6
+; SSE2-NEXT: psrad $31, %xmm6
+; SSE2-NEXT: pshufd {{.*#+}} xmm4 = xmm4[0,0,1,1]
+; SSE2-NEXT: pslld $31, %xmm4
+; SSE2-NEXT: psrad $31, %xmm4
+; SSE2-NEXT: movdqa {{.*#+}} xmm7 = [1,1]
+; SSE2-NEXT: pand %xmm4, %xmm2
+; SSE2-NEXT: pandn %xmm7, %xmm4
+; SSE2-NEXT: por %xmm2, %xmm4
+; SSE2-NEXT: movq %xmm4, %rcx
+; SSE2-NEXT: movq %xmm0, %rax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divq %rcx
+; SSE2-NEXT: movq %rdx, %xmm0
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm4[2,3,2,3]
+; SSE2-NEXT: movq %xmm2, %rcx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm5[2,3,2,3]
+; SSE2-NEXT: movq %xmm2, %rax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divq %rcx
+; SSE2-NEXT: movq %rdx, %xmm2
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; SSE2-NEXT: pand %xmm6, %xmm3
+; SSE2-NEXT: pandn %xmm7, %xmm6
+; SSE2-NEXT: por %xmm3, %xmm6
+; SSE2-NEXT: movq %xmm6, %rcx
+; SSE2-NEXT: movq %xmm1, %rax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divq %rcx
+; SSE2-NEXT: movq %rdx, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm6[2,3,2,3]
+; SSE2-NEXT: movq %xmm3, %rcx
+; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,3,2,3]
+; SSE2-NEXT: movq %xmm1, %rax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divq %rcx
+; SSE2-NEXT: movq %rdx, %xmm1
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
+; SSE2-NEXT: movdqa %xmm2, %xmm1
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: urem_v4i64:
+; AVX512: # %bb.0:
+; AVX512-NEXT: movdqa %xmm0, %xmm5
+; AVX512-NEXT: pshufd {{.*#+}} xmm6 = xmm4[2,2,3,3]
+; AVX512-NEXT: pslld $31, %xmm6
+; AVX512-NEXT: psrad $31, %xmm6
+; AVX512-NEXT: pshufd {{.*#+}} xmm4 = xmm4[0,0,1,1]
+; AVX512-NEXT: pslld $31, %xmm4
+; AVX512-NEXT: psrad $31, %xmm4
+; AVX512-NEXT: movdqa {{.*#+}} xmm7 = [1,1]
+; AVX512-NEXT: pand %xmm4, %xmm2
+; AVX512-NEXT: pandn %xmm7, %xmm4
+; AVX512-NEXT: por %xmm2, %xmm4
+; AVX512-NEXT: movq %xmm4, %rcx
+; AVX512-NEXT: movq %xmm0, %rax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divq %rcx
+; AVX512-NEXT: movq %rdx, %xmm0
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm4[2,3,2,3]
+; AVX512-NEXT: movq %xmm2, %rcx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm5[2,3,2,3]
+; AVX512-NEXT: movq %xmm2, %rax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divq %rcx
+; AVX512-NEXT: movq %rdx, %xmm2
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; AVX512-NEXT: pand %xmm6, %xmm3
+; AVX512-NEXT: pandn %xmm7, %xmm6
+; AVX512-NEXT: por %xmm3, %xmm6
+; AVX512-NEXT: movq %xmm6, %rcx
+; AVX512-NEXT: movq %xmm1, %rax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divq %rcx
+; AVX512-NEXT: movq %rdx, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm6[2,3,2,3]
+; AVX512-NEXT: movq %xmm3, %rcx
+; AVX512-NEXT: pshufd {{.*#+}} xmm1 = xmm1[2,3,2,3]
+; AVX512-NEXT: movq %xmm1, %rax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divq %rcx
+; AVX512-NEXT: movq %rdx, %xmm1
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
+; AVX512-NEXT: movdqa %xmm2, %xmm1
+; AVX512-NEXT: retq
+ %res = call <4 x i64> @llvm.masked.urem(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
+ ret <4 x i64> %res
+}
+
+; Widening
+define <2 x i32> @urem_v2i32(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m) {
+; SSE2-LABEL: urem_v2i32:
+; SSE2: # %bb.0:
+; SSE2-NEXT: xorps %xmm3, %xmm3
+; SSE2-NEXT: shufps {{.*#+}} xmm2 = xmm2[0,2],xmm3[2,3]
+; SSE2-NEXT: pslld $31, %xmm2
+; SSE2-NEXT: psrad $31, %xmm2
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: paddd %xmm2, %xmm1
+; SSE2-NEXT: pcmpeqd %xmm2, %xmm2
+; SSE2-NEXT: psubd %xmm2, %xmm1
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; SSE2-NEXT: movd %xmm2, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; SSE2-NEXT: movd %xmm2, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movd %edx, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,2,3]
+; SSE2-NEXT: movd %xmm3, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm0[2,3,2,3]
+; SSE2-NEXT: movd %xmm3, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movd %edx, %xmm3
+; SSE2-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movd %edx, %xmm2
+; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,1,1]
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,1,1]
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movd %edx, %xmm0
+; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
+; SSE2-NEXT: movdqa %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: urem_v2i32:
+; AVX512: # %bb.0:
+; AVX512-NEXT: xorps %xmm3, %xmm3
+; AVX512-NEXT: shufps {{.*#+}} xmm2 = xmm2[0,2],xmm3[2,3]
+; AVX512-NEXT: pslld $31, %xmm2
+; AVX512-NEXT: psrad $31, %xmm2
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: paddd %xmm2, %xmm1
+; AVX512-NEXT: pcmpeqd %xmm2, %xmm2
+; AVX512-NEXT: psubd %xmm2, %xmm1
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; AVX512-NEXT: movd %xmm2, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; AVX512-NEXT: movd %xmm2, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movd %edx, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,2,3]
+; AVX512-NEXT: movd %xmm3, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm3 = xmm0[2,3,2,3]
+; AVX512-NEXT: movd %xmm3, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movd %edx, %xmm3
+; AVX512-NEXT: punpckldq {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movd %edx, %xmm2
+; AVX512-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,1,1]
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,1,1]
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movd %edx, %xmm0
+; AVX512-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
+; AVX512-NEXT: movdqa %xmm2, %xmm0
+; AVX512-NEXT: retq
+ %res = call <2 x i32> @llvm.masked.urem(<2 x i32> %x, <2 x i32> %y, <2 x i1> %m)
+ ret <2 x i32> %res
+}
+
+; Promotion
+define <4 x i16> @urem_v4i16(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m) {
+; SSE2-LABEL: urem_v4i16:
+; SSE2: # %bb.0:
+; SSE2-NEXT: pshuflw {{.*#+}} xmm2 = xmm2[0,2,2,3,4,5,6,7]
+; SSE2-NEXT: pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,4,6,6,7]
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,1,0,2]
+; SSE2-NEXT: psrldq {{.*#+}} xmm2 = xmm2[8,9,10,11,12,13,14,15],zero,zero,zero,zero,zero,zero,zero,zero
+; SSE2-NEXT: psllw $15, %xmm2
+; SSE2-NEXT: psraw $15, %xmm2
+; SSE2-NEXT: pand %xmm2, %xmm1
+; SSE2-NEXT: paddw %xmm2, %xmm1
+; SSE2-NEXT: pcmpeqd %xmm2, %xmm2
+; SSE2-NEXT: psubw %xmm2, %xmm1
+; SSE2-NEXT: pextrw $7, %xmm1, %ecx
+; SSE2-NEXT: pextrw $7, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divw %cx
+; SSE2-NEXT: # kill: def $dx killed $dx def $edx
+; SSE2-NEXT: movd %edx, %xmm2
+; SSE2-NEXT: pextrw $6, %xmm1, %ecx
+; SSE2-NEXT: pextrw $6, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divw %cx
+; SSE2-NEXT: # kill: def $dx killed $dx def $edx
+; SSE2-NEXT: movd %edx, %xmm3
+; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1],xmm3[2],xmm2[2],xmm3[3],xmm2[3]
+; SSE2-NEXT: pextrw $5, %xmm1, %ecx
+; SSE2-NEXT: pextrw $5, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divw %cx
+; SSE2-NEXT: # kill: def $dx killed $dx def $edx
+; SSE2-NEXT: movd %edx, %xmm4
+; SSE2-NEXT: pextrw $4, %xmm1, %ecx
+; SSE2-NEXT: pextrw $4, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divw %cx
+; SSE2-NEXT: # kill: def $dx killed $dx def $edx
+; SSE2-NEXT: movd %edx, %xmm2
+; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
+; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
+; SSE2-NEXT: pextrw $3, %xmm1, %ecx
+; SSE2-NEXT: pextrw $3, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divw %cx
+; SSE2-NEXT: # kill: def $dx killed $dx def $edx
+; SSE2-NEXT: movd %edx, %xmm3
+; SSE2-NEXT: pextrw $2, %xmm1, %ecx
+; SSE2-NEXT: pextrw $2, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divw %cx
+; SSE2-NEXT: # kill: def $dx killed $dx def $edx
+; SSE2-NEXT: movd %edx, %xmm4
+; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm3[0],xmm4[1],xmm3[1],xmm4[2],xmm3[2],xmm4[3],xmm3[3]
+; SSE2-NEXT: pextrw $1, %xmm1, %ecx
+; SSE2-NEXT: pextrw $1, %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divw %cx
+; SSE2-NEXT: # kill: def $dx killed $dx def $edx
+; SSE2-NEXT: movd %edx, %xmm3
+; SSE2-NEXT: movd %xmm1, %ecx
+; SSE2-NEXT: movd %xmm0, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divw %cx
+; SSE2-NEXT: # kill: def $dx killed $dx def $edx
+; SSE2-NEXT: movd %edx, %xmm0
+; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
+; SSE2-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1]
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: urem_v4i16:
+; AVX512: # %bb.0:
+; AVX512-NEXT: pshuflw {{.*#+}} xmm2 = xmm2[0,2,2,3,4,5,6,7]
+; AVX512-NEXT: pshufhw {{.*#+}} xmm2 = xmm2[0,1,2,3,4,6,6,7]
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,1,0,2]
+; AVX512-NEXT: psrldq {{.*#+}} xmm2 = xmm2[8,9,10,11,12,13,14,15],zero,zero,zero,zero,zero,zero,zero,zero
+; AVX512-NEXT: psllw $15, %xmm2
+; AVX512-NEXT: psraw $15, %xmm2
+; AVX512-NEXT: pand %xmm2, %xmm1
+; AVX512-NEXT: paddw %xmm2, %xmm1
+; AVX512-NEXT: pcmpeqd %xmm2, %xmm2
+; AVX512-NEXT: psubw %xmm2, %xmm1
+; AVX512-NEXT: pextrw $7, %xmm1, %ecx
+; AVX512-NEXT: pextrw $7, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divw %cx
+; AVX512-NEXT: # kill: def $dx killed $dx def $edx
+; AVX512-NEXT: movd %edx, %xmm2
+; AVX512-NEXT: pextrw $6, %xmm1, %ecx
+; AVX512-NEXT: pextrw $6, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divw %cx
+; AVX512-NEXT: # kill: def $dx killed $dx def $edx
+; AVX512-NEXT: movd %edx, %xmm3
+; AVX512-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm2[0],xmm3[1],xmm2[1],xmm3[2],xmm2[2],xmm3[3],xmm2[3]
+; AVX512-NEXT: pextrw $5, %xmm1, %ecx
+; AVX512-NEXT: pextrw $5, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divw %cx
+; AVX512-NEXT: # kill: def $dx killed $dx def $edx
+; AVX512-NEXT: movd %edx, %xmm4
+; AVX512-NEXT: pextrw $4, %xmm1, %ecx
+; AVX512-NEXT: pextrw $4, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divw %cx
+; AVX512-NEXT: # kill: def $dx killed $dx def $edx
+; AVX512-NEXT: movd %edx, %xmm2
+; AVX512-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
+; AVX512-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
+; AVX512-NEXT: pextrw $3, %xmm1, %ecx
+; AVX512-NEXT: pextrw $3, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divw %cx
+; AVX512-NEXT: # kill: def $dx killed $dx def $edx
+; AVX512-NEXT: movd %edx, %xmm3
+; AVX512-NEXT: pextrw $2, %xmm1, %ecx
+; AVX512-NEXT: pextrw $2, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divw %cx
+; AVX512-NEXT: # kill: def $dx killed $dx def $edx
+; AVX512-NEXT: movd %edx, %xmm4
+; AVX512-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm3[0],xmm4[1],xmm3[1],xmm4[2],xmm3[2],xmm4[3],xmm3[3]
+; AVX512-NEXT: pextrw $1, %xmm1, %ecx
+; AVX512-NEXT: pextrw $1, %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divw %cx
+; AVX512-NEXT: # kill: def $dx killed $dx def $edx
+; AVX512-NEXT: movd %edx, %xmm3
+; AVX512-NEXT: movd %xmm1, %ecx
+; AVX512-NEXT: movd %xmm0, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divw %cx
+; AVX512-NEXT: # kill: def $dx killed $dx def $edx
+; AVX512-NEXT: movd %edx, %xmm0
+; AVX512-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1],xmm0[2],xmm3[2],xmm0[3],xmm3[3]
+; AVX512-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1]
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; AVX512-NEXT: retq
+ %res = call <4 x i16> @llvm.masked.urem(<4 x i16> %x, <4 x i16> %y, <4 x i1> %m)
+ ret <4 x i16> %res
+}
+
+; Scalarization
+define <1 x i64> @urem_v1i64(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m) {
+; SSE2-LABEL: urem_v1i64:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movq %rdi, %rax
+; SSE2-NEXT: testb $1, %dl
+; SSE2-NEXT: movl $1, %ecx
+; SSE2-NEXT: cmovneq %rsi, %rcx
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divq %rcx
+; SSE2-NEXT: movq %rdx, %rax
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: urem_v1i64:
+; AVX512: # %bb.0:
+; AVX512-NEXT: movq %rdi, %rax
+; AVX512-NEXT: testb $1, %dl
+; AVX512-NEXT: movl $1, %ecx
+; AVX512-NEXT: cmovneq %rsi, %rcx
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divq %rcx
+; AVX512-NEXT: movq %rdx, %rax
+; AVX512-NEXT: retq
+ %res = call <1 x i64> @llvm.masked.urem(<1 x i64> %x, <1 x i64> %y, <1 x i1> %m)
+ ret <1 x i64> %res
+}
+
+; Expansion
+define <2 x i128> @urem_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
+; SSE2-LABEL: urem_v2i128:
+; SSE2: # %bb.0:
+; SSE2-NEXT: pushq %rbp
+; SSE2-NEXT: .cfi_def_cfa_offset 16
+; SSE2-NEXT: pushq %r15
+; SSE2-NEXT: .cfi_def_cfa_offset 24
+; SSE2-NEXT: pushq %r14
+; SSE2-NEXT: .cfi_def_cfa_offset 32
+; SSE2-NEXT: pushq %r13
+; SSE2-NEXT: .cfi_def_cfa_offset 40
+; SSE2-NEXT: pushq %r12
+; SSE2-NEXT: .cfi_def_cfa_offset 48
+; SSE2-NEXT: pushq %rbx
+; SSE2-NEXT: .cfi_def_cfa_offset 56
+; SSE2-NEXT: subq $40, %rsp
+; SSE2-NEXT: .cfi_def_cfa_offset 96
+; SSE2-NEXT: .cfi_offset %rbx, -56
+; SSE2-NEXT: .cfi_offset %r12, -48
+; SSE2-NEXT: .cfi_offset %r13, -40
+; SSE2-NEXT: .cfi_offset %r14, -32
+; SSE2-NEXT: .cfi_offset %r15, -24
+; SSE2-NEXT: .cfi_offset %rbp, -16
+; SSE2-NEXT: movq %r8, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
+; SSE2-NEXT: movq %rcx, %r15
+; SSE2-NEXT: movq %rdi, %rbx
+; SSE2-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp)
+; SSE2-NEXT: xorl %r12d, %r12d
+; SSE2-NEXT: testb $1, {{[0-9]+}}(%rsp)
+; SSE2-NEXT: movl $1, %r13d
+; SSE2-NEXT: cmoveq %r13, %r9
+; SSE2-NEXT: movq {{[0-9]+}}(%rsp), %rcx
+; SSE2-NEXT: cmoveq %r12, %rcx
+; SSE2-NEXT: movq %rsi, %rdi
+; SSE2-NEXT: movq %rdx, %rsi
+; SSE2-NEXT: movq %r9, %rdx
+; SSE2-NEXT: callq __umodti3@PLT
+; SSE2-NEXT: movq %rax, %rbp
+; SSE2-NEXT: movq %rdx, %r14
+; SSE2-NEXT: testb $1, {{[0-9]+}}(%rsp)
+; SSE2-NEXT: je .LBB6_2
+; SSE2-NEXT: # %bb.1:
+; SSE2-NEXT: movq {{[0-9]+}}(%rsp), %r13
+; SSE2-NEXT: movq {{[0-9]+}}(%rsp), %r12
+; SSE2-NEXT: .LBB6_2:
+; SSE2-NEXT: movq %r15, %rdi
+; SSE2-NEXT: movq {{[-0-9]+}}(%r{{[sb]}}p), %rsi # 8-byte Reload
+; SSE2-NEXT: movq %r13, %rdx
+; SSE2-NEXT: movq %r12, %rcx
+; SSE2-NEXT: callq __umodti3@PLT
+; SSE2-NEXT: movq %rdx, 24(%rbx)
+; SSE2-NEXT: movq %rax, 16(%rbx)
+; SSE2-NEXT: movq %r14, 8(%rbx)
+; SSE2-NEXT: movq %rbp, (%rbx)
+; SSE2-NEXT: movq %rbx, %rax
+; SSE2-NEXT: addq $40, %rsp
+; SSE2-NEXT: .cfi_def_cfa_offset 56
+; SSE2-NEXT: popq %rbx
+; SSE2-NEXT: .cfi_def_cfa_offset 48
+; SSE2-NEXT: popq %r12
+; SSE2-NEXT: .cfi_def_cfa_offset 40
+; SSE2-NEXT: popq %r13
+; SSE2-NEXT: .cfi_def_cfa_offset 32
+; SSE2-NEXT: popq %r14
+; SSE2-NEXT: .cfi_def_cfa_offset 24
+; SSE2-NEXT: popq %r15
+; SSE2-NEXT: .cfi_def_cfa_offset 16
+; SSE2-NEXT: popq %rbp
+; SSE2-NEXT: .cfi_def_cfa_offset 8
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: urem_v2i128:
+; AVX512: # %bb.0:
+; AVX512-NEXT: pushq %rbp
+; AVX512-NEXT: .cfi_def_cfa_offset 16
+; AVX512-NEXT: pushq %r15
+; AVX512-NEXT: .cfi_def_cfa_offset 24
+; AVX512-NEXT: pushq %r14
+; AVX512-NEXT: .cfi_def_cfa_offset 32
+; AVX512-NEXT: pushq %r13
+; AVX512-NEXT: .cfi_def_cfa_offset 40
+; AVX512-NEXT: pushq %r12
+; AVX512-NEXT: .cfi_def_cfa_offset 48
+; AVX512-NEXT: pushq %rbx
+; AVX512-NEXT: .cfi_def_cfa_offset 56
+; AVX512-NEXT: subq $40, %rsp
+; AVX512-NEXT: .cfi_def_cfa_offset 96
+; AVX512-NEXT: .cfi_offset %rbx, -56
+; AVX512-NEXT: .cfi_offset %r12, -48
+; AVX512-NEXT: .cfi_offset %r13, -40
+; AVX512-NEXT: .cfi_offset %r14, -32
+; AVX512-NEXT: .cfi_offset %r15, -24
+; AVX512-NEXT: .cfi_offset %rbp, -16
+; AVX512-NEXT: movq %r8, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
+; AVX512-NEXT: movq %rcx, %r15
+; AVX512-NEXT: movq %rdi, %rbx
+; AVX512-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp)
+; AVX512-NEXT: xorl %r12d, %r12d
+; AVX512-NEXT: testb $1, {{[0-9]+}}(%rsp)
+; AVX512-NEXT: movl $1, %r13d
+; AVX512-NEXT: cmoveq %r13, %r9
+; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rcx
+; AVX512-NEXT: cmoveq %r12, %rcx
+; AVX512-NEXT: movq %rsi, %rdi
+; AVX512-NEXT: movq %rdx, %rsi
+; AVX512-NEXT: movq %r9, %rdx
+; AVX512-NEXT: callq __umodti3@PLT
+; AVX512-NEXT: movq %rax, %rbp
+; AVX512-NEXT: movq %rdx, %r14
+; AVX512-NEXT: testb $1, {{[0-9]+}}(%rsp)
+; AVX512-NEXT: je .LBB6_2
+; AVX512-NEXT: # %bb.1:
+; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %r13
+; AVX512-NEXT: movq {{[0-9]+}}(%rsp), %r12
+; AVX512-NEXT: .LBB6_2:
+; AVX512-NEXT: movq %r15, %rdi
+; AVX512-NEXT: movq {{[-0-9]+}}(%r{{[sb]}}p), %rsi # 8-byte Reload
+; AVX512-NEXT: movq %r13, %rdx
+; AVX512-NEXT: movq %r12, %rcx
+; AVX512-NEXT: callq __umodti3@PLT
+; AVX512-NEXT: movq %rdx, 24(%rbx)
+; AVX512-NEXT: movq %rax, 16(%rbx)
+; AVX512-NEXT: movq %r14, 8(%rbx)
+; AVX512-NEXT: movq %rbp, (%rbx)
+; AVX512-NEXT: movq %rbx, %rax
+; AVX512-NEXT: addq $40, %rsp
+; AVX512-NEXT: .cfi_def_cfa_offset 56
+; AVX512-NEXT: popq %rbx
+; AVX512-NEXT: .cfi_def_cfa_offset 48
+; AVX512-NEXT: popq %r12
+; AVX512-NEXT: .cfi_def_cfa_offset 40
+; AVX512-NEXT: popq %r13
+; AVX512-NEXT: .cfi_def_cfa_offset 32
+; AVX512-NEXT: popq %r14
+; AVX512-NEXT: .cfi_def_cfa_offset 24
+; AVX512-NEXT: popq %r15
+; AVX512-NEXT: .cfi_def_cfa_offset 16
+; AVX512-NEXT: popq %rbp
+; AVX512-NEXT: .cfi_def_cfa_offset 8
+; AVX512-NEXT: retq
+ %res = call <2 x i128> @llvm.masked.urem(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m)
+ ret <2 x i128> %res
+}
+
+; Promotion and widening
+define <3 x i10> @urem_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
+; SSE2-LABEL: urem_v3i10:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movd %r8d, %xmm1
+; SSE2-NEXT: movd %ecx, %xmm0
+; SSE2-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
+; SSE2-NEXT: movd %r9d, %xmm1
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; SSE2-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; SSE2-NEXT: movd {{.*#+}} xmm2 = mem[0],zero,zero,zero
+; SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
+; SSE2-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
+; SSE2-NEXT: pslld $31, %xmm2
+; SSE2-NEXT: psrad $31, %xmm2
+; SSE2-NEXT: movd %esi, %xmm3
+; SSE2-NEXT: movd %edi, %xmm1
+; SSE2-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm3[0],xmm1[1],xmm3[1]
+; SSE2-NEXT: movd %edx, %xmm3
+; SSE2-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm3[0]
+; SSE2-NEXT: movdqa {{.*#+}} xmm3 = [1023,1023,1023,1023]
+; SSE2-NEXT: pand %xmm3, %xmm1
+; SSE2-NEXT: pand %xmm3, %xmm0
+; SSE2-NEXT: pand %xmm2, %xmm0
+; SSE2-NEXT: paddd %xmm2, %xmm0
+; SSE2-NEXT: pcmpeqd %xmm2, %xmm2
+; SSE2-NEXT: psubd %xmm2, %xmm0
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; SSE2-NEXT: movd %xmm2, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,3,2,3]
+; SSE2-NEXT: movd %xmm2, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %ecx
+; SSE2-NEXT: movl %edx, %ecx
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[1,1,1,1]
+; SSE2-NEXT: movd %xmm2, %esi
+; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,1,1,1]
+; SSE2-NEXT: movd %xmm2, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %esi
+; SSE2-NEXT: movl %edx, %esi
+; SSE2-NEXT: movd %xmm0, %edi
+; SSE2-NEXT: movd %xmm1, %eax
+; SSE2-NEXT: xorl %edx, %edx
+; SSE2-NEXT: divl %edi
+; SSE2-NEXT: movl %edx, %eax
+; SSE2-NEXT: # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT: movl %esi, %edx
+; SSE2-NEXT: # kill: def $cx killed $cx killed $ecx
+; SSE2-NEXT: retq
+;
+; AVX512-LABEL: urem_v3i10:
+; AVX512: # %bb.0:
+; AVX512-NEXT: movd %r8d, %xmm1
+; AVX512-NEXT: movd %ecx, %xmm0
+; AVX512-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
+; AVX512-NEXT: movd %r9d, %xmm1
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; AVX512-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; AVX512-NEXT: movd {{.*#+}} xmm2 = mem[0],zero,zero,zero
+; AVX512-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
+; AVX512-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
+; AVX512-NEXT: pslld $31, %xmm2
+; AVX512-NEXT: psrad $31, %xmm2
+; AVX512-NEXT: movd %esi, %xmm3
+; AVX512-NEXT: movd %edi, %xmm1
+; AVX512-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm3[0],xmm1[1],xmm3[1]
+; AVX512-NEXT: movd %edx, %xmm3
+; AVX512-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm3[0]
+; AVX512-NEXT: movdqa {{.*#+}} xmm3 = [1023,1023,1023,1023]
+; AVX512-NEXT: pand %xmm3, %xmm1
+; AVX512-NEXT: pand %xmm3, %xmm0
+; AVX512-NEXT: pand %xmm2, %xmm0
+; AVX512-NEXT: paddd %xmm2, %xmm0
+; AVX512-NEXT: pcmpeqd %xmm2, %xmm2
+; AVX512-NEXT: psubd %xmm2, %xmm0
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; AVX512-NEXT: movd %xmm2, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,3,2,3]
+; AVX512-NEXT: movd %xmm2, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %ecx
+; AVX512-NEXT: movl %edx, %ecx
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm0[1,1,1,1]
+; AVX512-NEXT: movd %xmm2, %esi
+; AVX512-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,1,1,1]
+; AVX512-NEXT: movd %xmm2, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %esi
+; AVX512-NEXT: movl %edx, %esi
+; AVX512-NEXT: movd %xmm0, %edi
+; AVX512-NEXT: movd %xmm1, %eax
+; AVX512-NEXT: xorl %edx, %edx
+; AVX512-NEXT: divl %edi
+; AVX512-NEXT: movl %edx, %eax
+; AVX512-NEXT: # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT: movl %esi, %edx
+; AVX512-NEXT: # kill: def $cx killed $cx killed $ecx
+; AVX512-NEXT: retq
+ %res = call <3 x i10> @llvm.masked.urem(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m)
+ ret <3 x i10> %res
+}
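
Taken together, the tests in this patch all lower through the same expansion:
the mask selects a divisor of 1 on disabled lanes so the unmasked divide can
never trap, and the masked-off lanes of the result are simply poison. A
minimal IR sketch of the expansion (illustrative name, not part of the patch):

  define <4 x i32> @expanded_masked_urem(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
    ; Disabled lanes divide by 1 instead of by %y.
    %safe.y = select <4 x i1> %m, <4 x i32> %y, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
    %res = urem <4 x i32> %x, %safe.y
    ret <4 x i32> %res
  }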
>From 041c5667252b948f8e4b877bcf15d9b987804f90 Mon Sep 17 00:00:00 2001
From: Luke Lau <luke at igalia.com>
Date: Wed, 1 Apr 2026 16:58:06 +0800
Subject: [PATCH 2/7] Mention overflow UB in sdiv/srem
---
llvm/docs/LangRef.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 6850160788ab9..3bbbf498dcd7f 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -27950,7 +27950,7 @@ The first two arguments and the result have the same vector of integer type. The
Semantics:
""""""""""
-Unlike :ref:`sdiv <i_sdiv>`, disabled lanes produce poison and division by zero on disabled lanes is not undefined behavior. Division by zero on enabled lanes is still undefined behavior.
+Unlike :ref:`sdiv <i_sdiv>`, disabled lanes produce poison, and neither overflow nor division by zero on disabled lanes is undefined behavior. Overflow and division by zero on enabled lanes are still undefined behavior.
'``llvm.masked.urem.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -28004,7 +28004,7 @@ The first two arguments and the result have the same vector of integer type. The
Semantics:
""""""""""
-Unlike :ref:`srem <i_srem>`, disabled lanes produce poison and taking the remainder of a division by zero on disabled lanes is not undefined behavior. Taking the remainder of a division by zero on enabled lanes is still undefined behavior.
+Unlike :ref:`srem <i_srem>`, disabled lanes produce poison, and neither overflow nor taking the remainder of a division by zero on disabled lanes is undefined behavior. Overflow and taking the remainder of a division by zero on enabled lanes are still undefined behavior.
Memory Use Markers
------------------
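
To make the documented semantics concrete, here is a hedged example of my own
(not part of the patch): lane 1 below is disabled, so the INT64_MIN / -1
overflow in that lane is not undefined behavior; lane 1 of %r is poison and
must not be used.

  define <2 x i64> @masked_overflow_ok() {
    ; Only lane 0 is enabled; lane 1 would otherwise overflow.
    %r = call <2 x i64> @llvm.masked.sdiv(
        <2 x i64> <i64 -9223372036854775808, i64 -9223372036854775808>,
        <2 x i64> <i64 2, i64 -1>,
        <2 x i1> <i1 true, i1 false>)
    ret <2 x i64> %r
  }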
>From c93ca5d9c6b3a153970e2e3d23f48160b1137cce Mon Sep 17 00:00:00 2001
From: Luke Lau <luke at igalia.com>
Date: Wed, 1 Apr 2026 17:09:59 +0800
Subject: [PATCH 3/7] Update comments to mention safe divisor prevents overflow
too
---
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp | 2 +-
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
index 131e24ad73c06..b07d2ba7ebde3 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
@@ -1931,7 +1931,7 @@ SDValue VectorLegalizer::ExpandLOOP_DEPENDENCE_MASK(SDNode *N) {
SDValue VectorLegalizer::ExpandMaskedBinOp(SDNode *N) {
// Masked bin ops don't have undefined behaviour when dividing by zero
// on disabled lanes and produce poison instead. Replace the divisor on the
- // disabled lanes with 1 to avoid division by zero.
+ // disabled lanes with 1 to avoid division by zero or overflow.
SDLoc dl(N);
EVT VT = N->getValueType(0);
SDValue SafeDivisor = DAG.getSelect(
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
index 4d6172bcfbdaa..6a5e95cf626b6 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
@@ -284,7 +284,7 @@ SDValue DAGTypeLegalizer::ScalarizeVecRes_MaskedBinOp(SDNode *N) {
else
Mask = DAG.getExtractVectorElt(DL, MaskVT.getVectorElementType(), Mask, 0);
// Masked binary ops don't have UB on disabled lanes but produce poison, so
- // use 1 as the divisor to avoid division by zero.
+ // use 1 as the divisor to avoid division by zero and overflow.
SDValue Divisor = DAG.getSelect(DL, LHS.getValueType(), Mask, RHS,
DAG.getConstant(1, DL, LHS.getValueType()));
return DAG.getNode(ISD::getUnmaskedBinOpOpcode(N->getOpcode()), DL,
@@ -1279,7 +1279,7 @@ SDValue DAGTypeLegalizer::ScalarizeVecOp_MaskedBinOp(SDNode *N, unsigned OpNo) {
SDValue RHS = DAG.getExtractVectorElt(DL, VT, N->getOperand(1), 0);
SDValue Mask = GetScalarizedVector(N->getOperand(2));
// Masked binary ops don't have UB on disabled lanes but produce poison, so
- // use 1 as the divisor to avoid division by zero.
+ // use 1 as the divisor to avoid division by zero and overflow.
SDValue BinOp =
DAG.getNode(ISD::getUnmaskedBinOpOpcode(N->getOpcode()), DL, VT, LHS,
DAG.getSelect(DL, VT, Mask, RHS, DAG.getConstant(1, DL, VT)));
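
A divisor of 1 neutralizes both hazards at once: division by 1 cannot trap,
and since INT_MIN sdiv 1 is just INT_MIN, the INT_MIN / -1 overflow case
cannot arise either. A sketch of the scalarized form these comments describe
(illustrative names, not the patch's literal output):

  define i64 @scalarized_masked_sdiv(i64 %x, i64 %y, i1 %m) {
    ; A disabled lane divides by 1, which can neither be zero nor
    ; turn INT64_MIN / -1 into a signed overflow.
    %safe.y = select i1 %m, i64 %y, i64 1
    %r = sdiv i64 %x, %safe.y
    ret i64 %r
  }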
>From 24eccf48df662a383a7bddf16f374e1f5340de6f Mon Sep 17 00:00:00 2001
From: Luke Lau <luke at igalia.com>
Date: Wed, 1 Apr 2026 17:10:31 +0800
Subject: [PATCH 4/7] Use PromoteTargetBoolean
---
.../SelectionDAG/LegalizeIntegerTypes.cpp | 7 +-
.../AArch64/masked-sdiv-fixed-length.ll | 224 ++++++---------
.../AArch64/masked-srem-fixed-length.ll | 256 +++++++-----------
.../AArch64/masked-udiv-fixed-length.ll | 224 ++++++---------
.../AArch64/masked-urem-fixed-length.ll | 256 +++++++-----------
5 files changed, 381 insertions(+), 586 deletions(-)
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
index 8b42f64927bce..89fb102042d1f 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
@@ -3050,12 +3050,7 @@ SDValue DAGTypeLegalizer::PromoteIntOp_GET_ACTIVE_LANE_MASK(SDNode *N) {
SDValue DAGTypeLegalizer::PromoteIntOp_MaskedBinOp(SDNode *N, unsigned OpNo) {
assert(OpNo == 2);
SmallVector<SDValue, 3> NewOps(N->ops());
-
- if (TLI.getBooleanContents(NewOps[2].getValueType()) ==
- TargetLowering::ZeroOrNegativeOneBooleanContent)
- NewOps[2] = SExtPromotedInteger(NewOps[2]);
- else
- NewOps[2] = ZExtPromotedInteger(NewOps[2]);
+ NewOps[2] = PromoteTargetBoolean(NewOps[2], N->getValueType(0));
return SDValue(DAG.UpdateNodeOperands(N, NewOps), 0);
}
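
On AArch64 this shows up in the regenerated checks below: the promoted mask is
widened to the element type and turned into an all-ones/all-zeros lane mask by
shifting the bit into the sign position and comparing against zero (the
ushll + shl #31 + cmlt #0 sequences). An IR approximation of that lane-mask
computation (my sketch, for illustration only):

  define <4 x i32> @lane_mask(<4 x i1> %m) {
    ; shl #31 then arithmetic shift right #31 sign-extends bit 0 of each lane,
    ; matching the shl + cmlt-against-zero pair in the generated code.
    %ext = zext <4 x i1> %m to <4 x i32>
    %shl = shl <4 x i32> %ext, <i32 31, i32 31, i32 31, i32 31>
    %sext = ashr <4 x i32> %shl, <i32 31, i32 31, i32 31, i32 31>
    ret <4 x i32> %sext
  }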
diff --git a/llvm/test/CodeGen/AArch64/masked-sdiv-fixed-length.ll b/llvm/test/CodeGen/AArch64/masked-sdiv-fixed-length.ll
index 44d56c2a2afe7..bbb19d3a45265 100644
--- a/llvm/test/CodeGen/AArch64/masked-sdiv-fixed-length.ll
+++ b/llvm/test/CodeGen/AArch64/masked-sdiv-fixed-length.ll
@@ -6,65 +6,41 @@
define <4 x i32> @sdiv_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
; NEON-LABEL: sdiv_v4i32:
; NEON: // %bb.0:
-; NEON-NEXT: shl v2.4h, v2.4h, #15
-; NEON-NEXT: mov w9, v1.s[1]
-; NEON-NEXT: mov w10, v0.s[1]
+; NEON-NEXT: ushll v2.4s, v2.4h, #0
+; NEON-NEXT: fmov w8, s0
+; NEON-NEXT: mov w12, v0.s[3]
+; NEON-NEXT: shl v2.4s, v2.4s, #31
+; NEON-NEXT: cmlt v2.4s, v2.4s, #0
+; NEON-NEXT: and v1.16b, v1.16b, v2.16b
+; NEON-NEXT: mvn v2.16b, v2.16b
+; NEON-NEXT: sub v1.4s, v1.4s, v2.4s
+; NEON-NEXT: fmov w9, s1
+; NEON-NEXT: mov w10, v1.s[1]
; NEON-NEXT: mov w11, v1.s[2]
-; NEON-NEXT: mov w12, v0.s[2]
-; NEON-NEXT: mov w13, v0.s[3]
-; NEON-NEXT: cmlt v2.4h, v2.4h, #0
-; NEON-NEXT: umov w8, v2.h[1]
-; NEON-NEXT: tst w8, #0xffff
-; NEON-NEXT: csinc w8, w9, wzr, ne
-; NEON-NEXT: umov w9, v2.h[0]
-; NEON-NEXT: sdiv w8, w10, w8
-; NEON-NEXT: fmov w10, s1
-; NEON-NEXT: tst w9, #0xffff
-; NEON-NEXT: fmov w9, s0
-; NEON-NEXT: csinc w10, w10, wzr, ne
+; NEON-NEXT: sdiv w8, w8, w9
+; NEON-NEXT: mov w9, v0.s[1]
; NEON-NEXT: sdiv w9, w9, w10
-; NEON-NEXT: umov w10, v2.h[2]
-; NEON-NEXT: tst w10, #0xffff
-; NEON-NEXT: csinc w10, w11, wzr, ne
-; NEON-NEXT: umov w11, v2.h[3]
-; NEON-NEXT: sdiv w10, w12, w10
-; NEON-NEXT: mov w12, v1.s[3]
-; NEON-NEXT: fmov s0, w9
-; NEON-NEXT: tst w11, #0xffff
-; NEON-NEXT: mov v0.s[1], w8
-; NEON-NEXT: csinc w9, w12, wzr, ne
-; NEON-NEXT: sdiv w8, w13, w9
+; NEON-NEXT: mov w10, v0.s[2]
+; NEON-NEXT: fmov s0, w8
+; NEON-NEXT: sdiv w10, w10, w11
+; NEON-NEXT: mov w11, v1.s[3]
+; NEON-NEXT: mov v0.s[1], w9
+; NEON-NEXT: sdiv w8, w12, w11
; NEON-NEXT: mov v0.s[2], w10
; NEON-NEXT: mov v0.s[3], w8
; NEON-NEXT: ret
;
; SVE-LABEL: sdiv_v4i32:
; SVE: // %bb.0:
-; SVE-NEXT: shl v2.4h, v2.4h, #15
-; SVE-NEXT: mov w9, v1.s[1]
-; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
-; SVE-NEXT: mov w11, v1.s[2]
+; SVE-NEXT: ushll v2.4s, v2.4h, #0
; SVE-NEXT: ptrue p0.s, vl4
-; SVE-NEXT: cmlt v2.4h, v2.4h, #0
-; SVE-NEXT: umov w8, v2.h[1]
-; SVE-NEXT: umov w10, v2.h[0]
-; SVE-NEXT: tst w8, #0xffff
-; SVE-NEXT: csinc w8, w9, wzr, ne
-; SVE-NEXT: fmov w9, s1
-; SVE-NEXT: tst w10, #0xffff
-; SVE-NEXT: umov w10, v2.h[2]
-; SVE-NEXT: csinc w9, w9, wzr, ne
-; SVE-NEXT: fmov s3, w9
-; SVE-NEXT: mov w9, v1.s[3]
-; SVE-NEXT: tst w10, #0xffff
-; SVE-NEXT: csinc w10, w11, wzr, ne
-; SVE-NEXT: mov v3.s[1], w8
-; SVE-NEXT: umov w8, v2.h[3]
-; SVE-NEXT: mov v3.s[2], w10
-; SVE-NEXT: tst w8, #0xffff
-; SVE-NEXT: csinc w8, w9, wzr, ne
-; SVE-NEXT: mov v3.s[3], w8
-; SVE-NEXT: sdiv z0.s, p0/m, z0.s, z3.s
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: shl v2.4s, v2.4s, #31
+; SVE-NEXT: cmlt v2.4s, v2.4s, #0
+; SVE-NEXT: and v1.16b, v1.16b, v2.16b
+; SVE-NEXT: mvn v2.16b, v2.16b
+; SVE-NEXT: sub v1.4s, v1.4s, v2.4s
+; SVE-NEXT: sdiv z0.s, p0/m, z0.s, z1.s
; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
; SVE-NEXT: ret
%res = call <4 x i32> @llvm.masked.sdiv(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
@@ -74,40 +50,32 @@ define <4 x i32> @sdiv_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
define <2 x i64> @sdiv_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
; NEON-LABEL: sdiv_v2i64:
; NEON: // %bb.0:
-; NEON-NEXT: shl v2.2s, v2.2s, #31
-; NEON-NEXT: mov x9, v1.d[1]
-; NEON-NEXT: cmlt v2.2s, v2.2s, #0
-; NEON-NEXT: mov w8, v2.s[1]
-; NEON-NEXT: fmov w10, s2
-; NEON-NEXT: cmp w8, #0
-; NEON-NEXT: fmov x8, d1
-; NEON-NEXT: csinc x9, x9, xzr, ne
-; NEON-NEXT: cmp w10, #0
-; NEON-NEXT: fmov x10, d0
-; NEON-NEXT: csinc x8, x8, xzr, ne
-; NEON-NEXT: sdiv x8, x10, x8
-; NEON-NEXT: mov x10, v0.d[1]
-; NEON-NEXT: sdiv x9, x10, x9
+; NEON-NEXT: ushll v2.2d, v2.2s, #0
+; NEON-NEXT: fmov x8, d0
+; NEON-NEXT: shl v2.2d, v2.2d, #63
+; NEON-NEXT: cmlt v2.2d, v2.2d, #0
+; NEON-NEXT: and v1.16b, v1.16b, v2.16b
+; NEON-NEXT: mvn v2.16b, v2.16b
+; NEON-NEXT: sub v1.2d, v1.2d, v2.2d
+; NEON-NEXT: fmov x9, d1
+; NEON-NEXT: mov x10, v1.d[1]
+; NEON-NEXT: sdiv x8, x8, x9
+; NEON-NEXT: mov x9, v0.d[1]
+; NEON-NEXT: sdiv x9, x9, x10
; NEON-NEXT: fmov d0, x8
; NEON-NEXT: mov v0.d[1], x9
; NEON-NEXT: ret
;
; SVE-LABEL: sdiv_v2i64:
; SVE: // %bb.0:
-; SVE-NEXT: shl v2.2s, v2.2s, #31
-; SVE-NEXT: mov x9, v1.d[1]
-; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
-; SVE-NEXT: fmov x10, d1
+; SVE-NEXT: ushll v2.2d, v2.2s, #0
; SVE-NEXT: ptrue p0.d, vl2
-; SVE-NEXT: cmlt v2.2s, v2.2s, #0
-; SVE-NEXT: mov w8, v2.s[1]
-; SVE-NEXT: cmp w8, #0
-; SVE-NEXT: fmov w8, s2
-; SVE-NEXT: csinc x9, x9, xzr, ne
-; SVE-NEXT: cmp w8, #0
-; SVE-NEXT: csinc x8, x10, xzr, ne
-; SVE-NEXT: fmov d1, x8
-; SVE-NEXT: mov v1.d[1], x9
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: shl v2.2d, v2.2d, #63
+; SVE-NEXT: cmlt v2.2d, v2.2d, #0
+; SVE-NEXT: and v1.16b, v1.16b, v2.16b
+; SVE-NEXT: mvn v2.16b, v2.16b
+; SVE-NEXT: sub v1.2d, v1.2d, v2.2d
; SVE-NEXT: sdiv z0.d, p0/m, z0.d, z1.d
; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
; SVE-NEXT: ret
@@ -120,76 +88,58 @@ define <4 x i64> @sdiv_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
; NEON-LABEL: sdiv_v4i64:
; NEON: // %bb.0:
; NEON-NEXT: ushll v4.4s, v4.4h, #0
-; NEON-NEXT: mov x9, v2.d[1]
-; NEON-NEXT: mov x10, v0.d[1]
-; NEON-NEXT: mov x11, v3.d[1]
-; NEON-NEXT: fmov x12, d3
-; NEON-NEXT: shl v5.2s, v4.2s, #31
-; NEON-NEXT: cmlt v5.2s, v5.2s, #0
-; NEON-NEXT: mov w8, v5.s[1]
-; NEON-NEXT: cmp w8, #0
-; NEON-NEXT: csinc x8, x9, xzr, ne
-; NEON-NEXT: fmov w9, s5
-; NEON-NEXT: sdiv x8, x10, x8
-; NEON-NEXT: fmov x10, d2
-; NEON-NEXT: cmp w9, #0
-; NEON-NEXT: fmov x9, d0
-; NEON-NEXT: ext v0.16b, v4.16b, v4.16b, #8
-; NEON-NEXT: csinc x10, x10, xzr, ne
-; NEON-NEXT: shl v0.2s, v0.2s, #31
-; NEON-NEXT: cmlt v0.2s, v0.2s, #0
+; NEON-NEXT: mov x8, v1.d[1]
+; NEON-NEXT: fmov x11, d0
+; NEON-NEXT: mov x12, v0.d[1]
+; NEON-NEXT: ushll2 v5.2d, v4.4s, #0
+; NEON-NEXT: shl v5.2d, v5.2d, #63
+; NEON-NEXT: cmlt v5.2d, v5.2d, #0
+; NEON-NEXT: and v3.16b, v3.16b, v5.16b
+; NEON-NEXT: mvn v5.16b, v5.16b
+; NEON-NEXT: sub v3.2d, v3.2d, v5.2d
+; NEON-NEXT: mov x9, v3.d[1]
+; NEON-NEXT: fmov x10, d3
+; NEON-NEXT: sdiv x8, x8, x9
+; NEON-NEXT: fmov x9, d1
+; NEON-NEXT: ushll v1.2d, v4.2s, #0
+; NEON-NEXT: shl v1.2d, v1.2d, #63
+; NEON-NEXT: cmlt v1.2d, v1.2d, #0
+; NEON-NEXT: and v2.16b, v2.16b, v1.16b
+; NEON-NEXT: mvn v1.16b, v1.16b
+; NEON-NEXT: sub v1.2d, v2.2d, v1.2d
; NEON-NEXT: sdiv x9, x9, x10
-; NEON-NEXT: mov w10, v0.s[1]
-; NEON-NEXT: cmp w10, #0
-; NEON-NEXT: fmov w10, s0
-; NEON-NEXT: csinc x11, x11, xzr, ne
-; NEON-NEXT: cmp w10, #0
-; NEON-NEXT: csinc x10, x12, xzr, ne
-; NEON-NEXT: fmov x12, d1
-; NEON-NEXT: sdiv x10, x12, x10
-; NEON-NEXT: mov x12, v1.d[1]
-; NEON-NEXT: fmov d0, x9
-; NEON-NEXT: mov v0.d[1], x8
+; NEON-NEXT: fmov x10, d1
+; NEON-NEXT: sdiv x10, x11, x10
+; NEON-NEXT: mov x11, v1.d[1]
+; NEON-NEXT: fmov d1, x9
+; NEON-NEXT: mov v1.d[1], x8
; NEON-NEXT: sdiv x11, x12, x11
-; NEON-NEXT: fmov d1, x10
-; NEON-NEXT: mov v1.d[1], x11
+; NEON-NEXT: fmov d0, x10
+; NEON-NEXT: mov v0.d[1], x11
; NEON-NEXT: ret
;
; SVE-LABEL: sdiv_v4i64:
; SVE: // %bb.0:
; SVE-NEXT: ushll v4.4s, v4.4h, #0
-; SVE-NEXT: mov x9, v2.d[1]
-; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
-; SVE-NEXT: // kill: def $q1 killed $q1 def $z1
; SVE-NEXT: ptrue p0.d, vl2
-; SVE-NEXT: shl v5.2s, v4.2s, #31
-; SVE-NEXT: cmlt v5.2s, v5.2s, #0
-; SVE-NEXT: mov w8, v5.s[1]
-; SVE-NEXT: fmov w10, s5
-; SVE-NEXT: cmp w8, #0
-; SVE-NEXT: fmov x8, d2
-; SVE-NEXT: csinc x9, x9, xzr, ne
-; SVE-NEXT: cmp w10, #0
-; SVE-NEXT: csinc x8, x8, xzr, ne
-; SVE-NEXT: fmov d2, x8
-; SVE-NEXT: mov v2.d[1], x9
-; SVE-NEXT: mov x9, v3.d[1]
+; SVE-NEXT: // kill: def $q1 killed $q1 def $z1
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: ushll2 v5.2d, v4.4s, #0
+; SVE-NEXT: shl v5.2d, v5.2d, #63
+; SVE-NEXT: cmlt v5.2d, v5.2d, #0
+; SVE-NEXT: and v3.16b, v3.16b, v5.16b
+; SVE-NEXT: mvn v5.16b, v5.16b
+; SVE-NEXT: sub v3.2d, v3.2d, v5.2d
+; SVE-NEXT: sdiv z1.d, p0/m, z1.d, z3.d
+; SVE-NEXT: ushll v3.2d, v4.2s, #0
+; SVE-NEXT: // kill: def $q1 killed $q1 killed $z1
+; SVE-NEXT: shl v3.2d, v3.2d, #63
+; SVE-NEXT: cmlt v3.2d, v3.2d, #0
+; SVE-NEXT: and v2.16b, v2.16b, v3.16b
+; SVE-NEXT: mvn v3.16b, v3.16b
+; SVE-NEXT: sub v2.2d, v2.2d, v3.2d
; SVE-NEXT: sdiv z0.d, p0/m, z0.d, z2.d
-; SVE-NEXT: ext v2.16b, v4.16b, v4.16b, #8
; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
-; SVE-NEXT: shl v2.2s, v2.2s, #31
-; SVE-NEXT: cmlt v2.2s, v2.2s, #0
-; SVE-NEXT: mov w8, v2.s[1]
-; SVE-NEXT: fmov w10, s2
-; SVE-NEXT: cmp w8, #0
-; SVE-NEXT: fmov x8, d3
-; SVE-NEXT: csinc x9, x9, xzr, ne
-; SVE-NEXT: cmp w10, #0
-; SVE-NEXT: csinc x8, x8, xzr, ne
-; SVE-NEXT: fmov d2, x8
-; SVE-NEXT: mov v2.d[1], x9
-; SVE-NEXT: sdiv z1.d, p0/m, z1.d, z2.d
-; SVE-NEXT: // kill: def $q1 killed $q1 killed $z1
; SVE-NEXT: ret
%res = call <4 x i64> @llvm.masked.sdiv(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
ret <4 x i64> %res
diff --git a/llvm/test/CodeGen/AArch64/masked-srem-fixed-length.ll b/llvm/test/CodeGen/AArch64/masked-srem-fixed-length.ll
index c822e9eb0afa8..438e0319dd33e 100644
--- a/llvm/test/CodeGen/AArch64/masked-srem-fixed-length.ll
+++ b/llvm/test/CodeGen/AArch64/masked-srem-fixed-length.ll
@@ -6,71 +6,47 @@
define <4 x i32> @srem_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
; NEON-LABEL: srem_v4i32:
; NEON: // %bb.0:
-; NEON-NEXT: shl v2.4h, v2.4h, #15
-; NEON-NEXT: mov w9, v1.s[1]
-; NEON-NEXT: fmov w12, s1
-; NEON-NEXT: mov w10, v0.s[1]
+; NEON-NEXT: ushll v2.4s, v2.4h, #0
+; NEON-NEXT: fmov w8, s0
+; NEON-NEXT: mov w11, v0.s[1]
+; NEON-NEXT: mov w14, v0.s[2]
+; NEON-NEXT: mov w18, v0.s[3]
+; NEON-NEXT: shl v2.4s, v2.4s, #31
+; NEON-NEXT: cmlt v2.4s, v2.4s, #0
+; NEON-NEXT: and v1.16b, v1.16b, v2.16b
+; NEON-NEXT: mvn v2.16b, v2.16b
+; NEON-NEXT: sub v1.4s, v1.4s, v2.4s
+; NEON-NEXT: fmov w9, s1
+; NEON-NEXT: mov w12, v1.s[1]
; NEON-NEXT: mov w15, v1.s[2]
-; NEON-NEXT: mov w16, v0.s[2]
-; NEON-NEXT: mov w18, v1.s[3]
-; NEON-NEXT: mov w0, v0.s[3]
-; NEON-NEXT: cmlt v2.4h, v2.4h, #0
-; NEON-NEXT: umov w8, v2.h[1]
-; NEON-NEXT: umov w11, v2.h[0]
-; NEON-NEXT: umov w14, v2.h[2]
-; NEON-NEXT: umov w17, v2.h[3]
-; NEON-NEXT: tst w8, #0xffff
-; NEON-NEXT: csinc w8, w9, wzr, ne
-; NEON-NEXT: tst w11, #0xffff
-; NEON-NEXT: fmov w11, s0
-; NEON-NEXT: csinc w12, w12, wzr, ne
-; NEON-NEXT: sdiv w9, w10, w8
-; NEON-NEXT: tst w14, #0xffff
-; NEON-NEXT: csinc w14, w15, wzr, ne
-; NEON-NEXT: tst w17, #0xffff
+; NEON-NEXT: mov w17, v1.s[3]
+; NEON-NEXT: sdiv w10, w8, w9
; NEON-NEXT: sdiv w13, w11, w12
-; NEON-NEXT: msub w8, w9, w8, w10
-; NEON-NEXT: sdiv w15, w16, w14
-; NEON-NEXT: msub w11, w13, w12, w11
-; NEON-NEXT: csinc w12, w18, wzr, ne
-; NEON-NEXT: fmov s0, w11
-; NEON-NEXT: mov v0.s[1], w8
-; NEON-NEXT: sdiv w9, w0, w12
-; NEON-NEXT: msub w8, w15, w14, w16
+; NEON-NEXT: msub w8, w10, w9, w8
+; NEON-NEXT: fmov s0, w8
+; NEON-NEXT: sdiv w16, w14, w15
+; NEON-NEXT: msub w9, w13, w12, w11
+; NEON-NEXT: mov v0.s[1], w9
+; NEON-NEXT: sdiv w10, w18, w17
+; NEON-NEXT: msub w8, w16, w15, w14
; NEON-NEXT: mov v0.s[2], w8
-; NEON-NEXT: msub w8, w9, w12, w0
+; NEON-NEXT: msub w8, w10, w17, w18
; NEON-NEXT: mov v0.s[3], w8
; NEON-NEXT: ret
;
; SVE-LABEL: srem_v4i32:
; SVE: // %bb.0:
-; SVE-NEXT: shl v2.4h, v2.4h, #15
-; SVE-NEXT: mov w9, v1.s[1]
-; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
-; SVE-NEXT: mov w11, v1.s[2]
+; SVE-NEXT: ushll v2.4s, v2.4h, #0
; SVE-NEXT: ptrue p0.s, vl4
-; SVE-NEXT: cmlt v2.4h, v2.4h, #0
-; SVE-NEXT: umov w8, v2.h[1]
-; SVE-NEXT: umov w10, v2.h[0]
-; SVE-NEXT: tst w8, #0xffff
-; SVE-NEXT: csinc w8, w9, wzr, ne
-; SVE-NEXT: fmov w9, s1
-; SVE-NEXT: tst w10, #0xffff
-; SVE-NEXT: umov w10, v2.h[2]
-; SVE-NEXT: csinc w9, w9, wzr, ne
-; SVE-NEXT: fmov s3, w9
-; SVE-NEXT: mov w9, v1.s[3]
-; SVE-NEXT: tst w10, #0xffff
-; SVE-NEXT: csinc w10, w11, wzr, ne
-; SVE-NEXT: mov v3.s[1], w8
-; SVE-NEXT: umov w8, v2.h[3]
-; SVE-NEXT: mov v3.s[2], w10
-; SVE-NEXT: tst w8, #0xffff
-; SVE-NEXT: csinc w8, w9, wzr, ne
-; SVE-NEXT: mov v3.s[3], w8
-; SVE-NEXT: movprfx z1, z0
-; SVE-NEXT: sdiv z1.s, p0/m, z1.s, z3.s
-; SVE-NEXT: mls v0.4s, v1.4s, v3.4s
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: shl v2.4s, v2.4s, #31
+; SVE-NEXT: cmlt v2.4s, v2.4s, #0
+; SVE-NEXT: and v1.16b, v1.16b, v2.16b
+; SVE-NEXT: mvn v2.16b, v2.16b
+; SVE-NEXT: sub v1.4s, v1.4s, v2.4s
+; SVE-NEXT: movprfx z2, z0
+; SVE-NEXT: sdiv z2.s, p0/m, z2.s, z1.s
+; SVE-NEXT: mls v0.4s, v2.4s, v1.4s
; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
; SVE-NEXT: ret
%res = call <4 x i32> @llvm.masked.srem(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
@@ -80,42 +56,34 @@ define <4 x i32> @srem_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
define <2 x i64> @srem_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
; NEON-LABEL: srem_v2i64:
; NEON: // %bb.0:
-; NEON-NEXT: shl v2.2s, v2.2s, #31
-; NEON-NEXT: mov x9, v1.d[1]
-; NEON-NEXT: mov x12, v0.d[1]
-; NEON-NEXT: cmlt v2.2s, v2.2s, #0
-; NEON-NEXT: mov w8, v2.s[1]
-; NEON-NEXT: fmov w10, s2
-; NEON-NEXT: cmp w8, #0
-; NEON-NEXT: fmov x8, d1
-; NEON-NEXT: csinc x9, x9, xzr, ne
-; NEON-NEXT: cmp w10, #0
-; NEON-NEXT: fmov x10, d0
-; NEON-NEXT: sdiv x13, x12, x9
-; NEON-NEXT: csinc x8, x8, xzr, ne
-; NEON-NEXT: sdiv x11, x10, x8
-; NEON-NEXT: msub x9, x13, x9, x12
-; NEON-NEXT: msub x8, x11, x8, x10
+; NEON-NEXT: ushll v2.2d, v2.2s, #0
+; NEON-NEXT: fmov x8, d0
+; NEON-NEXT: mov x11, v0.d[1]
+; NEON-NEXT: shl v2.2d, v2.2d, #63
+; NEON-NEXT: cmlt v2.2d, v2.2d, #0
+; NEON-NEXT: and v1.16b, v1.16b, v2.16b
+; NEON-NEXT: mvn v2.16b, v2.16b
+; NEON-NEXT: sub v1.2d, v1.2d, v2.2d
+; NEON-NEXT: fmov x9, d1
+; NEON-NEXT: mov x12, v1.d[1]
+; NEON-NEXT: sdiv x10, x8, x9
+; NEON-NEXT: sdiv x13, x11, x12
+; NEON-NEXT: msub x8, x10, x9, x8
; NEON-NEXT: fmov d0, x8
+; NEON-NEXT: msub x9, x13, x12, x11
; NEON-NEXT: mov v0.d[1], x9
; NEON-NEXT: ret
;
; SVE-LABEL: srem_v2i64:
; SVE: // %bb.0:
-; SVE-NEXT: shl v2.2s, v2.2s, #31
-; SVE-NEXT: mov x9, v1.d[1]
-; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
-; SVE-NEXT: fmov x10, d1
+; SVE-NEXT: ushll v2.2d, v2.2s, #0
; SVE-NEXT: ptrue p0.d, vl2
-; SVE-NEXT: cmlt v2.2s, v2.2s, #0
-; SVE-NEXT: mov w8, v2.s[1]
-; SVE-NEXT: cmp w8, #0
-; SVE-NEXT: fmov w8, s2
-; SVE-NEXT: csinc x9, x9, xzr, ne
-; SVE-NEXT: cmp w8, #0
-; SVE-NEXT: csinc x8, x10, xzr, ne
-; SVE-NEXT: fmov d1, x8
-; SVE-NEXT: mov v1.d[1], x9
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: shl v2.2d, v2.2d, #63
+; SVE-NEXT: cmlt v2.2d, v2.2d, #0
+; SVE-NEXT: and v1.16b, v1.16b, v2.16b
+; SVE-NEXT: mvn v2.16b, v2.16b
+; SVE-NEXT: sub v1.2d, v1.2d, v2.2d
; SVE-NEXT: movprfx z2, z0
; SVE-NEXT: sdiv z2.d, p0/m, z2.d, z1.d
; SVE-NEXT: mls z0.d, p0/m, z2.d, z1.d
@@ -130,84 +98,66 @@ define <4 x i64> @srem_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
; NEON-LABEL: srem_v4i64:
; NEON: // %bb.0:
; NEON-NEXT: ushll v4.4s, v4.4h, #0
-; NEON-NEXT: mov x9, v2.d[1]
-; NEON-NEXT: mov x10, v0.d[1]
-; NEON-NEXT: fmov x12, d2
-; NEON-NEXT: mov x15, v3.d[1]
-; NEON-NEXT: fmov x16, d3
-; NEON-NEXT: mov x18, v1.d[1]
-; NEON-NEXT: shl v5.2s, v4.2s, #31
-; NEON-NEXT: cmlt v5.2s, v5.2s, #0
-; NEON-NEXT: mov w8, v5.s[1]
-; NEON-NEXT: fmov w11, s5
-; NEON-NEXT: cmp w8, #0
-; NEON-NEXT: csinc x8, x9, xzr, ne
-; NEON-NEXT: cmp w11, #0
-; NEON-NEXT: fmov x11, d0
-; NEON-NEXT: ext v0.16b, v4.16b, v4.16b, #8
-; NEON-NEXT: csinc x12, x12, xzr, ne
-; NEON-NEXT: sdiv x9, x10, x8
-; NEON-NEXT: shl v0.2s, v0.2s, #31
-; NEON-NEXT: cmlt v0.2s, v0.2s, #0
-; NEON-NEXT: mov w14, v0.s[1]
-; NEON-NEXT: cmp w14, #0
-; NEON-NEXT: fmov w14, s0
-; NEON-NEXT: csinc x15, x15, xzr, ne
+; NEON-NEXT: mov x8, v1.d[1]
+; NEON-NEXT: fmov x11, d1
+; NEON-NEXT: fmov x15, d0
+; NEON-NEXT: mov x18, v0.d[1]
+; NEON-NEXT: ushll2 v5.2d, v4.4s, #0
+; NEON-NEXT: ushll v1.2d, v4.2s, #0
+; NEON-NEXT: shl v5.2d, v5.2d, #63
+; NEON-NEXT: shl v1.2d, v1.2d, #63
+; NEON-NEXT: cmlt v5.2d, v5.2d, #0
+; NEON-NEXT: cmlt v1.2d, v1.2d, #0
+; NEON-NEXT: and v3.16b, v3.16b, v5.16b
+; NEON-NEXT: mvn v5.16b, v5.16b
+; NEON-NEXT: and v2.16b, v2.16b, v1.16b
+; NEON-NEXT: mvn v1.16b, v1.16b
+; NEON-NEXT: sub v3.2d, v3.2d, v5.2d
+; NEON-NEXT: sub v1.2d, v2.2d, v1.2d
+; NEON-NEXT: mov x9, v3.d[1]
+; NEON-NEXT: fmov x12, d3
+; NEON-NEXT: fmov x14, d1
+; NEON-NEXT: mov x17, v1.d[1]
; NEON-NEXT: sdiv x13, x11, x12
-; NEON-NEXT: msub x8, x9, x8, x10
-; NEON-NEXT: cmp w14, #0
-; NEON-NEXT: csinc x14, x16, xzr, ne
-; NEON-NEXT: fmov x16, d1
-; NEON-NEXT: sdiv x17, x16, x14
+; NEON-NEXT: sdiv x10, x8, x9
+; NEON-NEXT: sdiv x16, x15, x14
+; NEON-NEXT: msub x8, x10, x9, x8
; NEON-NEXT: msub x9, x13, x12, x11
-; NEON-NEXT: fmov d0, x9
-; NEON-NEXT: mov v0.d[1], x8
-; NEON-NEXT: sdiv x0, x18, x15
-; NEON-NEXT: msub x10, x17, x14, x16
-; NEON-NEXT: fmov d1, x10
-; NEON-NEXT: msub x11, x0, x15, x18
-; NEON-NEXT: mov v1.d[1], x11
+; NEON-NEXT: fmov d1, x9
+; NEON-NEXT: mov v1.d[1], x8
+; NEON-NEXT: sdiv x0, x18, x17
+; NEON-NEXT: msub x10, x16, x14, x15
+; NEON-NEXT: fmov d0, x10
+; NEON-NEXT: msub x11, x0, x17, x18
+; NEON-NEXT: mov v0.d[1], x11
; NEON-NEXT: ret
;
; SVE-LABEL: srem_v4i64:
; SVE: // %bb.0:
; SVE-NEXT: ushll v4.4s, v4.4h, #0
-; SVE-NEXT: mov x9, v2.d[1]
+; SVE-NEXT: ptrue p0.d, vl2
; SVE-NEXT: // kill: def $q1 killed $q1 def $z1
; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
-; SVE-NEXT: ptrue p0.d, vl2
-; SVE-NEXT: shl v5.2s, v4.2s, #31
-; SVE-NEXT: ext v4.16b, v4.16b, v4.16b, #8
-; SVE-NEXT: cmlt v5.2s, v5.2s, #0
-; SVE-NEXT: shl v4.2s, v4.2s, #31
-; SVE-NEXT: mov w8, v5.s[1]
-; SVE-NEXT: fmov w10, s5
-; SVE-NEXT: cmlt v4.2s, v4.2s, #0
-; SVE-NEXT: cmp w8, #0
-; SVE-NEXT: fmov x8, d2
-; SVE-NEXT: csinc x9, x9, xzr, ne
-; SVE-NEXT: cmp w10, #0
-; SVE-NEXT: fmov w10, s4
-; SVE-NEXT: csinc x8, x8, xzr, ne
-; SVE-NEXT: fmov d2, x8
-; SVE-NEXT: mov w8, v4.s[1]
-; SVE-NEXT: mov v2.d[1], x9
-; SVE-NEXT: mov x9, v3.d[1]
-; SVE-NEXT: cmp w8, #0
-; SVE-NEXT: fmov x8, d3
-; SVE-NEXT: csinc x9, x9, xzr, ne
-; SVE-NEXT: cmp w10, #0
-; SVE-NEXT: movprfx z5, z0
-; SVE-NEXT: sdiv z5.d, p0/m, z5.d, z2.d
-; SVE-NEXT: csinc x8, x8, xzr, ne
-; SVE-NEXT: fmov d3, x8
-; SVE-NEXT: mov v3.d[1], x9
-; SVE-NEXT: movprfx z4, z1
-; SVE-NEXT: sdiv z4.d, p0/m, z4.d, z3.d
-; SVE-NEXT: mls z0.d, p0/m, z5.d, z2.d
-; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
-; SVE-NEXT: mls z1.d, p0/m, z4.d, z3.d
+; SVE-NEXT: ushll2 v5.2d, v4.4s, #0
+; SVE-NEXT: ushll v4.2d, v4.2s, #0
+; SVE-NEXT: shl v5.2d, v5.2d, #63
+; SVE-NEXT: shl v4.2d, v4.2d, #63
+; SVE-NEXT: cmlt v5.2d, v5.2d, #0
+; SVE-NEXT: cmlt v4.2d, v4.2d, #0
+; SVE-NEXT: and v3.16b, v3.16b, v5.16b
+; SVE-NEXT: mvn v5.16b, v5.16b
+; SVE-NEXT: and v2.16b, v2.16b, v4.16b
+; SVE-NEXT: mvn v4.16b, v4.16b
+; SVE-NEXT: sub v3.2d, v3.2d, v5.2d
+; SVE-NEXT: sub v2.2d, v2.2d, v4.2d
+; SVE-NEXT: movprfx z5, z1
+; SVE-NEXT: sdiv z5.d, p0/m, z5.d, z3.d
+; SVE-NEXT: movprfx z4, z0
+; SVE-NEXT: sdiv z4.d, p0/m, z4.d, z2.d
+; SVE-NEXT: mls z1.d, p0/m, z5.d, z3.d
+; SVE-NEXT: mls z0.d, p0/m, z4.d, z2.d
; SVE-NEXT: // kill: def $q1 killed $q1 killed $z1
+; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
; SVE-NEXT: ret
%res = call <4 x i64> @llvm.masked.srem(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
ret <4 x i64> %res
diff --git a/llvm/test/CodeGen/AArch64/masked-udiv-fixed-length.ll b/llvm/test/CodeGen/AArch64/masked-udiv-fixed-length.ll
index 950cebfb4b614..65fb9d6c163d4 100644
--- a/llvm/test/CodeGen/AArch64/masked-udiv-fixed-length.ll
+++ b/llvm/test/CodeGen/AArch64/masked-udiv-fixed-length.ll
@@ -6,65 +6,41 @@
define <4 x i32> @udiv_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
; NEON-LABEL: udiv_v4i32:
; NEON: // %bb.0:
-; NEON-NEXT: shl v2.4h, v2.4h, #15
-; NEON-NEXT: mov w9, v1.s[1]
-; NEON-NEXT: mov w10, v0.s[1]
+; NEON-NEXT: ushll v2.4s, v2.4h, #0
+; NEON-NEXT: fmov w8, s0
+; NEON-NEXT: mov w12, v0.s[3]
+; NEON-NEXT: shl v2.4s, v2.4s, #31
+; NEON-NEXT: cmlt v2.4s, v2.4s, #0
+; NEON-NEXT: and v1.16b, v1.16b, v2.16b
+; NEON-NEXT: mvn v2.16b, v2.16b
+; NEON-NEXT: sub v1.4s, v1.4s, v2.4s
+; NEON-NEXT: fmov w9, s1
+; NEON-NEXT: mov w10, v1.s[1]
; NEON-NEXT: mov w11, v1.s[2]
-; NEON-NEXT: mov w12, v0.s[2]
-; NEON-NEXT: mov w13, v0.s[3]
-; NEON-NEXT: cmlt v2.4h, v2.4h, #0
-; NEON-NEXT: umov w8, v2.h[1]
-; NEON-NEXT: tst w8, #0xffff
-; NEON-NEXT: csinc w8, w9, wzr, ne
-; NEON-NEXT: umov w9, v2.h[0]
-; NEON-NEXT: udiv w8, w10, w8
-; NEON-NEXT: fmov w10, s1
-; NEON-NEXT: tst w9, #0xffff
-; NEON-NEXT: fmov w9, s0
-; NEON-NEXT: csinc w10, w10, wzr, ne
+; NEON-NEXT: udiv w8, w8, w9
+; NEON-NEXT: mov w9, v0.s[1]
; NEON-NEXT: udiv w9, w9, w10
-; NEON-NEXT: umov w10, v2.h[2]
-; NEON-NEXT: tst w10, #0xffff
-; NEON-NEXT: csinc w10, w11, wzr, ne
-; NEON-NEXT: umov w11, v2.h[3]
-; NEON-NEXT: udiv w10, w12, w10
-; NEON-NEXT: mov w12, v1.s[3]
-; NEON-NEXT: fmov s0, w9
-; NEON-NEXT: tst w11, #0xffff
-; NEON-NEXT: mov v0.s[1], w8
-; NEON-NEXT: csinc w9, w12, wzr, ne
-; NEON-NEXT: udiv w8, w13, w9
+; NEON-NEXT: mov w10, v0.s[2]
+; NEON-NEXT: fmov s0, w8
+; NEON-NEXT: udiv w10, w10, w11
+; NEON-NEXT: mov w11, v1.s[3]
+; NEON-NEXT: mov v0.s[1], w9
+; NEON-NEXT: udiv w8, w12, w11
; NEON-NEXT: mov v0.s[2], w10
; NEON-NEXT: mov v0.s[3], w8
; NEON-NEXT: ret
;
; SVE-LABEL: udiv_v4i32:
; SVE: // %bb.0:
-; SVE-NEXT: shl v2.4h, v2.4h, #15
-; SVE-NEXT: mov w9, v1.s[1]
-; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
-; SVE-NEXT: mov w11, v1.s[2]
+; SVE-NEXT: ushll v2.4s, v2.4h, #0
; SVE-NEXT: ptrue p0.s, vl4
-; SVE-NEXT: cmlt v2.4h, v2.4h, #0
-; SVE-NEXT: umov w8, v2.h[1]
-; SVE-NEXT: umov w10, v2.h[0]
-; SVE-NEXT: tst w8, #0xffff
-; SVE-NEXT: csinc w8, w9, wzr, ne
-; SVE-NEXT: fmov w9, s1
-; SVE-NEXT: tst w10, #0xffff
-; SVE-NEXT: umov w10, v2.h[2]
-; SVE-NEXT: csinc w9, w9, wzr, ne
-; SVE-NEXT: fmov s3, w9
-; SVE-NEXT: mov w9, v1.s[3]
-; SVE-NEXT: tst w10, #0xffff
-; SVE-NEXT: csinc w10, w11, wzr, ne
-; SVE-NEXT: mov v3.s[1], w8
-; SVE-NEXT: umov w8, v2.h[3]
-; SVE-NEXT: mov v3.s[2], w10
-; SVE-NEXT: tst w8, #0xffff
-; SVE-NEXT: csinc w8, w9, wzr, ne
-; SVE-NEXT: mov v3.s[3], w8
-; SVE-NEXT: udiv z0.s, p0/m, z0.s, z3.s
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: shl v2.4s, v2.4s, #31
+; SVE-NEXT: cmlt v2.4s, v2.4s, #0
+; SVE-NEXT: and v1.16b, v1.16b, v2.16b
+; SVE-NEXT: mvn v2.16b, v2.16b
+; SVE-NEXT: sub v1.4s, v1.4s, v2.4s
+; SVE-NEXT: udiv z0.s, p0/m, z0.s, z1.s
; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
; SVE-NEXT: ret
%res = call <4 x i32> @llvm.masked.udiv(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
@@ -74,40 +50,32 @@ define <4 x i32> @udiv_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
define <2 x i64> @udiv_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
; NEON-LABEL: udiv_v2i64:
; NEON: // %bb.0:
-; NEON-NEXT: shl v2.2s, v2.2s, #31
-; NEON-NEXT: mov x9, v1.d[1]
-; NEON-NEXT: cmlt v2.2s, v2.2s, #0
-; NEON-NEXT: mov w8, v2.s[1]
-; NEON-NEXT: fmov w10, s2
-; NEON-NEXT: cmp w8, #0
-; NEON-NEXT: fmov x8, d1
-; NEON-NEXT: csinc x9, x9, xzr, ne
-; NEON-NEXT: cmp w10, #0
-; NEON-NEXT: fmov x10, d0
-; NEON-NEXT: csinc x8, x8, xzr, ne
-; NEON-NEXT: udiv x8, x10, x8
-; NEON-NEXT: mov x10, v0.d[1]
-; NEON-NEXT: udiv x9, x10, x9
+; NEON-NEXT: ushll v2.2d, v2.2s, #0
+; NEON-NEXT: fmov x8, d0
+; NEON-NEXT: shl v2.2d, v2.2d, #63
+; NEON-NEXT: cmlt v2.2d, v2.2d, #0
+; NEON-NEXT: and v1.16b, v1.16b, v2.16b
+; NEON-NEXT: mvn v2.16b, v2.16b
+; NEON-NEXT: sub v1.2d, v1.2d, v2.2d
+; NEON-NEXT: fmov x9, d1
+; NEON-NEXT: mov x10, v1.d[1]
+; NEON-NEXT: udiv x8, x8, x9
+; NEON-NEXT: mov x9, v0.d[1]
+; NEON-NEXT: udiv x9, x9, x10
; NEON-NEXT: fmov d0, x8
; NEON-NEXT: mov v0.d[1], x9
; NEON-NEXT: ret
;
; SVE-LABEL: udiv_v2i64:
; SVE: // %bb.0:
-; SVE-NEXT: shl v2.2s, v2.2s, #31
-; SVE-NEXT: mov x9, v1.d[1]
-; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
-; SVE-NEXT: fmov x10, d1
+; SVE-NEXT: ushll v2.2d, v2.2s, #0
; SVE-NEXT: ptrue p0.d, vl2
-; SVE-NEXT: cmlt v2.2s, v2.2s, #0
-; SVE-NEXT: mov w8, v2.s[1]
-; SVE-NEXT: cmp w8, #0
-; SVE-NEXT: fmov w8, s2
-; SVE-NEXT: csinc x9, x9, xzr, ne
-; SVE-NEXT: cmp w8, #0
-; SVE-NEXT: csinc x8, x10, xzr, ne
-; SVE-NEXT: fmov d1, x8
-; SVE-NEXT: mov v1.d[1], x9
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: shl v2.2d, v2.2d, #63
+; SVE-NEXT: cmlt v2.2d, v2.2d, #0
+; SVE-NEXT: and v1.16b, v1.16b, v2.16b
+; SVE-NEXT: mvn v2.16b, v2.16b
+; SVE-NEXT: sub v1.2d, v1.2d, v2.2d
; SVE-NEXT: udiv z0.d, p0/m, z0.d, z1.d
; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
; SVE-NEXT: ret
@@ -120,76 +88,58 @@ define <4 x i64> @udiv_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
; NEON-LABEL: udiv_v4i64:
; NEON: // %bb.0:
; NEON-NEXT: ushll v4.4s, v4.4h, #0
-; NEON-NEXT: mov x9, v2.d[1]
-; NEON-NEXT: mov x10, v0.d[1]
-; NEON-NEXT: mov x11, v3.d[1]
-; NEON-NEXT: fmov x12, d3
-; NEON-NEXT: shl v5.2s, v4.2s, #31
-; NEON-NEXT: cmlt v5.2s, v5.2s, #0
-; NEON-NEXT: mov w8, v5.s[1]
-; NEON-NEXT: cmp w8, #0
-; NEON-NEXT: csinc x8, x9, xzr, ne
-; NEON-NEXT: fmov w9, s5
-; NEON-NEXT: udiv x8, x10, x8
-; NEON-NEXT: fmov x10, d2
-; NEON-NEXT: cmp w9, #0
-; NEON-NEXT: fmov x9, d0
-; NEON-NEXT: ext v0.16b, v4.16b, v4.16b, #8
-; NEON-NEXT: csinc x10, x10, xzr, ne
-; NEON-NEXT: shl v0.2s, v0.2s, #31
-; NEON-NEXT: cmlt v0.2s, v0.2s, #0
+; NEON-NEXT: mov x8, v1.d[1]
+; NEON-NEXT: fmov x11, d0
+; NEON-NEXT: mov x12, v0.d[1]
+; NEON-NEXT: ushll2 v5.2d, v4.4s, #0
+; NEON-NEXT: shl v5.2d, v5.2d, #63
+; NEON-NEXT: cmlt v5.2d, v5.2d, #0
+; NEON-NEXT: and v3.16b, v3.16b, v5.16b
+; NEON-NEXT: mvn v5.16b, v5.16b
+; NEON-NEXT: sub v3.2d, v3.2d, v5.2d
+; NEON-NEXT: mov x9, v3.d[1]
+; NEON-NEXT: fmov x10, d3
+; NEON-NEXT: udiv x8, x8, x9
+; NEON-NEXT: fmov x9, d1
+; NEON-NEXT: ushll v1.2d, v4.2s, #0
+; NEON-NEXT: shl v1.2d, v1.2d, #63
+; NEON-NEXT: cmlt v1.2d, v1.2d, #0
+; NEON-NEXT: and v2.16b, v2.16b, v1.16b
+; NEON-NEXT: mvn v1.16b, v1.16b
+; NEON-NEXT: sub v1.2d, v2.2d, v1.2d
; NEON-NEXT: udiv x9, x9, x10
-; NEON-NEXT: mov w10, v0.s[1]
-; NEON-NEXT: cmp w10, #0
-; NEON-NEXT: fmov w10, s0
-; NEON-NEXT: csinc x11, x11, xzr, ne
-; NEON-NEXT: cmp w10, #0
-; NEON-NEXT: csinc x10, x12, xzr, ne
-; NEON-NEXT: fmov x12, d1
-; NEON-NEXT: udiv x10, x12, x10
-; NEON-NEXT: mov x12, v1.d[1]
-; NEON-NEXT: fmov d0, x9
-; NEON-NEXT: mov v0.d[1], x8
+; NEON-NEXT: fmov x10, d1
+; NEON-NEXT: udiv x10, x11, x10
+; NEON-NEXT: mov x11, v1.d[1]
+; NEON-NEXT: fmov d1, x9
+; NEON-NEXT: mov v1.d[1], x8
; NEON-NEXT: udiv x11, x12, x11
-; NEON-NEXT: fmov d1, x10
-; NEON-NEXT: mov v1.d[1], x11
+; NEON-NEXT: fmov d0, x10
+; NEON-NEXT: mov v0.d[1], x11
; NEON-NEXT: ret
;
; SVE-LABEL: udiv_v4i64:
; SVE: // %bb.0:
; SVE-NEXT: ushll v4.4s, v4.4h, #0
-; SVE-NEXT: mov x9, v2.d[1]
-; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
-; SVE-NEXT: // kill: def $q1 killed $q1 def $z1
; SVE-NEXT: ptrue p0.d, vl2
-; SVE-NEXT: shl v5.2s, v4.2s, #31
-; SVE-NEXT: cmlt v5.2s, v5.2s, #0
-; SVE-NEXT: mov w8, v5.s[1]
-; SVE-NEXT: fmov w10, s5
-; SVE-NEXT: cmp w8, #0
-; SVE-NEXT: fmov x8, d2
-; SVE-NEXT: csinc x9, x9, xzr, ne
-; SVE-NEXT: cmp w10, #0
-; SVE-NEXT: csinc x8, x8, xzr, ne
-; SVE-NEXT: fmov d2, x8
-; SVE-NEXT: mov v2.d[1], x9
-; SVE-NEXT: mov x9, v3.d[1]
+; SVE-NEXT: // kill: def $q1 killed $q1 def $z1
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: ushll2 v5.2d, v4.4s, #0
+; SVE-NEXT: shl v5.2d, v5.2d, #63
+; SVE-NEXT: cmlt v5.2d, v5.2d, #0
+; SVE-NEXT: and v3.16b, v3.16b, v5.16b
+; SVE-NEXT: mvn v5.16b, v5.16b
+; SVE-NEXT: sub v3.2d, v3.2d, v5.2d
+; SVE-NEXT: udiv z1.d, p0/m, z1.d, z3.d
+; SVE-NEXT: ushll v3.2d, v4.2s, #0
+; SVE-NEXT: // kill: def $q1 killed $q1 killed $z1
+; SVE-NEXT: shl v3.2d, v3.2d, #63
+; SVE-NEXT: cmlt v3.2d, v3.2d, #0
+; SVE-NEXT: and v2.16b, v2.16b, v3.16b
+; SVE-NEXT: mvn v3.16b, v3.16b
+; SVE-NEXT: sub v2.2d, v2.2d, v3.2d
; SVE-NEXT: udiv z0.d, p0/m, z0.d, z2.d
-; SVE-NEXT: ext v2.16b, v4.16b, v4.16b, #8
; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
-; SVE-NEXT: shl v2.2s, v2.2s, #31
-; SVE-NEXT: cmlt v2.2s, v2.2s, #0
-; SVE-NEXT: mov w8, v2.s[1]
-; SVE-NEXT: fmov w10, s2
-; SVE-NEXT: cmp w8, #0
-; SVE-NEXT: fmov x8, d3
-; SVE-NEXT: csinc x9, x9, xzr, ne
-; SVE-NEXT: cmp w10, #0
-; SVE-NEXT: csinc x8, x8, xzr, ne
-; SVE-NEXT: fmov d2, x8
-; SVE-NEXT: mov v2.d[1], x9
-; SVE-NEXT: udiv z1.d, p0/m, z1.d, z2.d
-; SVE-NEXT: // kill: def $q1 killed $q1 killed $z1
; SVE-NEXT: ret
%res = call <4 x i64> @llvm.masked.udiv(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
ret <4 x i64> %res
diff --git a/llvm/test/CodeGen/AArch64/masked-urem-fixed-length.ll b/llvm/test/CodeGen/AArch64/masked-urem-fixed-length.ll
index 07e635da011fc..1067c464a787f 100644
--- a/llvm/test/CodeGen/AArch64/masked-urem-fixed-length.ll
+++ b/llvm/test/CodeGen/AArch64/masked-urem-fixed-length.ll
@@ -6,71 +6,47 @@
define <4 x i32> @urem_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
; NEON-LABEL: urem_v4i32:
; NEON: // %bb.0:
-; NEON-NEXT: shl v2.4h, v2.4h, #15
-; NEON-NEXT: mov w9, v1.s[1]
-; NEON-NEXT: fmov w12, s1
-; NEON-NEXT: mov w10, v0.s[1]
+; NEON-NEXT: ushll v2.4s, v2.4h, #0
+; NEON-NEXT: fmov w8, s0
+; NEON-NEXT: mov w11, v0.s[1]
+; NEON-NEXT: mov w14, v0.s[2]
+; NEON-NEXT: mov w18, v0.s[3]
+; NEON-NEXT: shl v2.4s, v2.4s, #31
+; NEON-NEXT: cmlt v2.4s, v2.4s, #0
+; NEON-NEXT: and v1.16b, v1.16b, v2.16b
+; NEON-NEXT: mvn v2.16b, v2.16b
+; NEON-NEXT: sub v1.4s, v1.4s, v2.4s
+; NEON-NEXT: fmov w9, s1
+; NEON-NEXT: mov w12, v1.s[1]
; NEON-NEXT: mov w15, v1.s[2]
-; NEON-NEXT: mov w16, v0.s[2]
-; NEON-NEXT: mov w18, v1.s[3]
-; NEON-NEXT: mov w0, v0.s[3]
-; NEON-NEXT: cmlt v2.4h, v2.4h, #0
-; NEON-NEXT: umov w8, v2.h[1]
-; NEON-NEXT: umov w11, v2.h[0]
-; NEON-NEXT: umov w14, v2.h[2]
-; NEON-NEXT: umov w17, v2.h[3]
-; NEON-NEXT: tst w8, #0xffff
-; NEON-NEXT: csinc w8, w9, wzr, ne
-; NEON-NEXT: tst w11, #0xffff
-; NEON-NEXT: fmov w11, s0
-; NEON-NEXT: csinc w12, w12, wzr, ne
-; NEON-NEXT: udiv w9, w10, w8
-; NEON-NEXT: tst w14, #0xffff
-; NEON-NEXT: csinc w14, w15, wzr, ne
-; NEON-NEXT: tst w17, #0xffff
+; NEON-NEXT: mov w17, v1.s[3]
+; NEON-NEXT: udiv w10, w8, w9
; NEON-NEXT: udiv w13, w11, w12
-; NEON-NEXT: msub w8, w9, w8, w10
-; NEON-NEXT: udiv w15, w16, w14
-; NEON-NEXT: msub w11, w13, w12, w11
-; NEON-NEXT: csinc w12, w18, wzr, ne
-; NEON-NEXT: fmov s0, w11
-; NEON-NEXT: mov v0.s[1], w8
-; NEON-NEXT: udiv w9, w0, w12
-; NEON-NEXT: msub w8, w15, w14, w16
+; NEON-NEXT: msub w8, w10, w9, w8
+; NEON-NEXT: fmov s0, w8
+; NEON-NEXT: udiv w16, w14, w15
+; NEON-NEXT: msub w9, w13, w12, w11
+; NEON-NEXT: mov v0.s[1], w9
+; NEON-NEXT: udiv w10, w18, w17
+; NEON-NEXT: msub w8, w16, w15, w14
; NEON-NEXT: mov v0.s[2], w8
-; NEON-NEXT: msub w8, w9, w12, w0
+; NEON-NEXT: msub w8, w10, w17, w18
; NEON-NEXT: mov v0.s[3], w8
; NEON-NEXT: ret
;
; SVE-LABEL: urem_v4i32:
; SVE: // %bb.0:
-; SVE-NEXT: shl v2.4h, v2.4h, #15
-; SVE-NEXT: mov w9, v1.s[1]
-; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
-; SVE-NEXT: mov w11, v1.s[2]
+; SVE-NEXT: ushll v2.4s, v2.4h, #0
; SVE-NEXT: ptrue p0.s, vl4
-; SVE-NEXT: cmlt v2.4h, v2.4h, #0
-; SVE-NEXT: umov w8, v2.h[1]
-; SVE-NEXT: umov w10, v2.h[0]
-; SVE-NEXT: tst w8, #0xffff
-; SVE-NEXT: csinc w8, w9, wzr, ne
-; SVE-NEXT: fmov w9, s1
-; SVE-NEXT: tst w10, #0xffff
-; SVE-NEXT: umov w10, v2.h[2]
-; SVE-NEXT: csinc w9, w9, wzr, ne
-; SVE-NEXT: fmov s3, w9
-; SVE-NEXT: mov w9, v1.s[3]
-; SVE-NEXT: tst w10, #0xffff
-; SVE-NEXT: csinc w10, w11, wzr, ne
-; SVE-NEXT: mov v3.s[1], w8
-; SVE-NEXT: umov w8, v2.h[3]
-; SVE-NEXT: mov v3.s[2], w10
-; SVE-NEXT: tst w8, #0xffff
-; SVE-NEXT: csinc w8, w9, wzr, ne
-; SVE-NEXT: mov v3.s[3], w8
-; SVE-NEXT: movprfx z1, z0
-; SVE-NEXT: udiv z1.s, p0/m, z1.s, z3.s
-; SVE-NEXT: mls v0.4s, v1.4s, v3.4s
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: shl v2.4s, v2.4s, #31
+; SVE-NEXT: cmlt v2.4s, v2.4s, #0
+; SVE-NEXT: and v1.16b, v1.16b, v2.16b
+; SVE-NEXT: mvn v2.16b, v2.16b
+; SVE-NEXT: sub v1.4s, v1.4s, v2.4s
+; SVE-NEXT: movprfx z2, z0
+; SVE-NEXT: udiv z2.s, p0/m, z2.s, z1.s
+; SVE-NEXT: mls v0.4s, v2.4s, v1.4s
; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
; SVE-NEXT: ret
%res = call <4 x i32> @llvm.masked.urem(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m)
@@ -80,42 +56,34 @@ define <4 x i32> @urem_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i1> %m) {
define <2 x i64> @urem_v2i64(<2 x i64> %x, <2 x i64> %y, <2 x i1> %m) {
; NEON-LABEL: urem_v2i64:
; NEON: // %bb.0:
-; NEON-NEXT: shl v2.2s, v2.2s, #31
-; NEON-NEXT: mov x9, v1.d[1]
-; NEON-NEXT: mov x12, v0.d[1]
-; NEON-NEXT: cmlt v2.2s, v2.2s, #0
-; NEON-NEXT: mov w8, v2.s[1]
-; NEON-NEXT: fmov w10, s2
-; NEON-NEXT: cmp w8, #0
-; NEON-NEXT: fmov x8, d1
-; NEON-NEXT: csinc x9, x9, xzr, ne
-; NEON-NEXT: cmp w10, #0
-; NEON-NEXT: fmov x10, d0
-; NEON-NEXT: udiv x13, x12, x9
-; NEON-NEXT: csinc x8, x8, xzr, ne
-; NEON-NEXT: udiv x11, x10, x8
-; NEON-NEXT: msub x9, x13, x9, x12
-; NEON-NEXT: msub x8, x11, x8, x10
+; NEON-NEXT: ushll v2.2d, v2.2s, #0
+; NEON-NEXT: fmov x8, d0
+; NEON-NEXT: mov x11, v0.d[1]
+; NEON-NEXT: shl v2.2d, v2.2d, #63
+; NEON-NEXT: cmlt v2.2d, v2.2d, #0
+; NEON-NEXT: and v1.16b, v1.16b, v2.16b
+; NEON-NEXT: mvn v2.16b, v2.16b
+; NEON-NEXT: sub v1.2d, v1.2d, v2.2d
+; NEON-NEXT: fmov x9, d1
+; NEON-NEXT: mov x12, v1.d[1]
+; NEON-NEXT: udiv x10, x8, x9
+; NEON-NEXT: udiv x13, x11, x12
+; NEON-NEXT: msub x8, x10, x9, x8
; NEON-NEXT: fmov d0, x8
+; NEON-NEXT: msub x9, x13, x12, x11
; NEON-NEXT: mov v0.d[1], x9
; NEON-NEXT: ret
;
; SVE-LABEL: urem_v2i64:
; SVE: // %bb.0:
-; SVE-NEXT: shl v2.2s, v2.2s, #31
-; SVE-NEXT: mov x9, v1.d[1]
-; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
-; SVE-NEXT: fmov x10, d1
+; SVE-NEXT: ushll v2.2d, v2.2s, #0
; SVE-NEXT: ptrue p0.d, vl2
-; SVE-NEXT: cmlt v2.2s, v2.2s, #0
-; SVE-NEXT: mov w8, v2.s[1]
-; SVE-NEXT: cmp w8, #0
-; SVE-NEXT: fmov w8, s2
-; SVE-NEXT: csinc x9, x9, xzr, ne
-; SVE-NEXT: cmp w8, #0
-; SVE-NEXT: csinc x8, x10, xzr, ne
-; SVE-NEXT: fmov d1, x8
-; SVE-NEXT: mov v1.d[1], x9
+; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT: shl v2.2d, v2.2d, #63
+; SVE-NEXT: cmlt v2.2d, v2.2d, #0
+; SVE-NEXT: and v1.16b, v1.16b, v2.16b
+; SVE-NEXT: mvn v2.16b, v2.16b
+; SVE-NEXT: sub v1.2d, v1.2d, v2.2d
; SVE-NEXT: movprfx z2, z0
; SVE-NEXT: udiv z2.d, p0/m, z2.d, z1.d
; SVE-NEXT: mls z0.d, p0/m, z2.d, z1.d
@@ -130,84 +98,66 @@ define <4 x i64> @urem_v4i64(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m) {
; NEON-LABEL: urem_v4i64:
; NEON: // %bb.0:
; NEON-NEXT: ushll v4.4s, v4.4h, #0
-; NEON-NEXT: mov x9, v2.d[1]
-; NEON-NEXT: mov x10, v0.d[1]
-; NEON-NEXT: fmov x12, d2
-; NEON-NEXT: mov x15, v3.d[1]
-; NEON-NEXT: fmov x16, d3
-; NEON-NEXT: mov x18, v1.d[1]
-; NEON-NEXT: shl v5.2s, v4.2s, #31
-; NEON-NEXT: cmlt v5.2s, v5.2s, #0
-; NEON-NEXT: mov w8, v5.s[1]
-; NEON-NEXT: fmov w11, s5
-; NEON-NEXT: cmp w8, #0
-; NEON-NEXT: csinc x8, x9, xzr, ne
-; NEON-NEXT: cmp w11, #0
-; NEON-NEXT: fmov x11, d0
-; NEON-NEXT: ext v0.16b, v4.16b, v4.16b, #8
-; NEON-NEXT: csinc x12, x12, xzr, ne
-; NEON-NEXT: udiv x9, x10, x8
-; NEON-NEXT: shl v0.2s, v0.2s, #31
-; NEON-NEXT: cmlt v0.2s, v0.2s, #0
-; NEON-NEXT: mov w14, v0.s[1]
-; NEON-NEXT: cmp w14, #0
-; NEON-NEXT: fmov w14, s0
-; NEON-NEXT: csinc x15, x15, xzr, ne
+; NEON-NEXT: mov x8, v1.d[1]
+; NEON-NEXT: fmov x11, d1
+; NEON-NEXT: fmov x15, d0
+; NEON-NEXT: mov x18, v0.d[1]
+; NEON-NEXT: ushll2 v5.2d, v4.4s, #0
+; NEON-NEXT: ushll v1.2d, v4.2s, #0
+; NEON-NEXT: shl v5.2d, v5.2d, #63
+; NEON-NEXT: shl v1.2d, v1.2d, #63
+; NEON-NEXT: cmlt v5.2d, v5.2d, #0
+; NEON-NEXT: cmlt v1.2d, v1.2d, #0
+; NEON-NEXT: and v3.16b, v3.16b, v5.16b
+; NEON-NEXT: mvn v5.16b, v5.16b
+; NEON-NEXT: and v2.16b, v2.16b, v1.16b
+; NEON-NEXT: mvn v1.16b, v1.16b
+; NEON-NEXT: sub v3.2d, v3.2d, v5.2d
+; NEON-NEXT: sub v1.2d, v2.2d, v1.2d
+; NEON-NEXT: mov x9, v3.d[1]
+; NEON-NEXT: fmov x12, d3
+; NEON-NEXT: fmov x14, d1
+; NEON-NEXT: mov x17, v1.d[1]
; NEON-NEXT: udiv x13, x11, x12
-; NEON-NEXT: msub x8, x9, x8, x10
-; NEON-NEXT: cmp w14, #0
-; NEON-NEXT: csinc x14, x16, xzr, ne
-; NEON-NEXT: fmov x16, d1
-; NEON-NEXT: udiv x17, x16, x14
+; NEON-NEXT: udiv x10, x8, x9
+; NEON-NEXT: udiv x16, x15, x14
+; NEON-NEXT: msub x8, x10, x9, x8
; NEON-NEXT: msub x9, x13, x12, x11
-; NEON-NEXT: fmov d0, x9
-; NEON-NEXT: mov v0.d[1], x8
-; NEON-NEXT: udiv x0, x18, x15
-; NEON-NEXT: msub x10, x17, x14, x16
-; NEON-NEXT: fmov d1, x10
-; NEON-NEXT: msub x11, x0, x15, x18
-; NEON-NEXT: mov v1.d[1], x11
+; NEON-NEXT: fmov d1, x9
+; NEON-NEXT: mov v1.d[1], x8
+; NEON-NEXT: udiv x0, x18, x17
+; NEON-NEXT: msub x10, x16, x14, x15
+; NEON-NEXT: fmov d0, x10
+; NEON-NEXT: msub x11, x0, x17, x18
+; NEON-NEXT: mov v0.d[1], x11
; NEON-NEXT: ret
;
; SVE-LABEL: urem_v4i64:
; SVE: // %bb.0:
; SVE-NEXT: ushll v4.4s, v4.4h, #0
-; SVE-NEXT: mov x9, v2.d[1]
+; SVE-NEXT: ptrue p0.d, vl2
; SVE-NEXT: // kill: def $q1 killed $q1 def $z1
; SVE-NEXT: // kill: def $q0 killed $q0 def $z0
-; SVE-NEXT: ptrue p0.d, vl2
-; SVE-NEXT: shl v5.2s, v4.2s, #31
-; SVE-NEXT: ext v4.16b, v4.16b, v4.16b, #8
-; SVE-NEXT: cmlt v5.2s, v5.2s, #0
-; SVE-NEXT: shl v4.2s, v4.2s, #31
-; SVE-NEXT: mov w8, v5.s[1]
-; SVE-NEXT: fmov w10, s5
-; SVE-NEXT: cmlt v4.2s, v4.2s, #0
-; SVE-NEXT: cmp w8, #0
-; SVE-NEXT: fmov x8, d2
-; SVE-NEXT: csinc x9, x9, xzr, ne
-; SVE-NEXT: cmp w10, #0
-; SVE-NEXT: fmov w10, s4
-; SVE-NEXT: csinc x8, x8, xzr, ne
-; SVE-NEXT: fmov d2, x8
-; SVE-NEXT: mov w8, v4.s[1]
-; SVE-NEXT: mov v2.d[1], x9
-; SVE-NEXT: mov x9, v3.d[1]
-; SVE-NEXT: cmp w8, #0
-; SVE-NEXT: fmov x8, d3
-; SVE-NEXT: csinc x9, x9, xzr, ne
-; SVE-NEXT: cmp w10, #0
-; SVE-NEXT: movprfx z5, z0
-; SVE-NEXT: udiv z5.d, p0/m, z5.d, z2.d
-; SVE-NEXT: csinc x8, x8, xzr, ne
-; SVE-NEXT: fmov d3, x8
-; SVE-NEXT: mov v3.d[1], x9
-; SVE-NEXT: movprfx z4, z1
-; SVE-NEXT: udiv z4.d, p0/m, z4.d, z3.d
-; SVE-NEXT: mls z0.d, p0/m, z5.d, z2.d
-; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
-; SVE-NEXT: mls z1.d, p0/m, z4.d, z3.d
+; SVE-NEXT: ushll2 v5.2d, v4.4s, #0
+; SVE-NEXT: ushll v4.2d, v4.2s, #0
+; SVE-NEXT: shl v5.2d, v5.2d, #63
+; SVE-NEXT: shl v4.2d, v4.2d, #63
+; SVE-NEXT: cmlt v5.2d, v5.2d, #0
+; SVE-NEXT: cmlt v4.2d, v4.2d, #0
+; SVE-NEXT: and v3.16b, v3.16b, v5.16b
+; SVE-NEXT: mvn v5.16b, v5.16b
+; SVE-NEXT: and v2.16b, v2.16b, v4.16b
+; SVE-NEXT: mvn v4.16b, v4.16b
+; SVE-NEXT: sub v3.2d, v3.2d, v5.2d
+; SVE-NEXT: sub v2.2d, v2.2d, v4.2d
+; SVE-NEXT: movprfx z5, z1
+; SVE-NEXT: udiv z5.d, p0/m, z5.d, z3.d
+; SVE-NEXT: movprfx z4, z0
+; SVE-NEXT: udiv z4.d, p0/m, z4.d, z2.d
+; SVE-NEXT: mls z1.d, p0/m, z5.d, z3.d
+; SVE-NEXT: mls z0.d, p0/m, z4.d, z2.d
; SVE-NEXT: // kill: def $q1 killed $q1 killed $z1
+; SVE-NEXT: // kill: def $q0 killed $q0 killed $z0
; SVE-NEXT: ret
%res = call <4 x i64> @llvm.masked.urem(<4 x i64> %x, <4 x i64> %y, <4 x i1> %m)
ret <4 x i64> %res
>From b79ad48905152288d384150c5e97de6389b229cb Mon Sep 17 00:00:00 2001
From: Luke Lau <luke at igalia.com>
Date: Wed, 1 Apr 2026 17:54:40 +0800
Subject: [PATCH 5/7] Truncate masks to i1 to handle difference between vector
+ scalar boolean contents
---
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
index 6a5e95cf626b6..2a9667e944319 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
@@ -283,6 +283,9 @@ SDValue DAGTypeLegalizer::ScalarizeVecRes_MaskedBinOp(SDNode *N) {
Mask = GetScalarizedVector(Mask);
else
Mask = DAG.getExtractVectorElt(DL, MaskVT.getVectorElementType(), Mask, 0);
+ // Vectors may have different boolean contents than scalars, so truncate to
+ // i1 and let type legalization promote appropriately.
+ Mask = DAG.getNode(ISD::TRUNCATE, DL, MVT::i1, Mask);
// Masked binary ops don't have UB on disabled lanes but produce poison, so
// use 1 as the divisor to avoid division by zero and overflow.
SDValue Divisor = DAG.getSelect(DL, LHS.getValueType(), Mask, RHS,
@@ -1278,6 +1281,9 @@ SDValue DAGTypeLegalizer::ScalarizeVecOp_MaskedBinOp(SDNode *N, unsigned OpNo) {
SDValue LHS = DAG.getExtractVectorElt(DL, VT, N->getOperand(0), 0);
SDValue RHS = DAG.getExtractVectorElt(DL, VT, N->getOperand(1), 0);
SDValue Mask = GetScalarizedVector(N->getOperand(2));
+ // Vectors may have different boolean contents than scalars, so truncate to
+ // i1 and let type legalization promote appropriately.
+ Mask = DAG.getNode(ISD::TRUNCATE, DL, MVT::i1, Mask);
// Masked binary ops don't have UB on disabled lanes but produce poison, so
// use 1 as the divisor to avoid division by zero and overflow.
SDValue BinOp =
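Both hunks above rely on the same trick, which a hedged, standalone IR sketch (hypothetical function name, not taken from the patch) makes concrete: on a disabled lane the divisor is replaced with 1, so the division can neither divide by zero nor overflow, and the lane's result is allowed to be poison anyway.

  define i32 @scalarized_lane(i32 %x, i32 %y, i1 %m) {
    ; Divisor becomes 1 when the lane is disabled, so sdiv cannot trap
    ; on division by zero or INT_MIN / -1 overflow.
    %safe.y = select i1 %m, i32 %y, i32 1
    %q = sdiv i32 %x, %safe.y
    ret i32 %q
  }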
>From b6b3f78449cad565282599a0f5cd0c3ec06fc47e Mon Sep 17 00:00:00 2001
From: Luke Lau <luke at igalia.com>
Date: Wed, 1 Apr 2026 18:06:21 +0800
Subject: [PATCH 6/7] Widen mask with zeros
---
.../SelectionDAG/LegalizeVectorTypes.cpp | 12 +---
.../AArch64/masked-sdiv-fixed-length.ll | 24 +++----
.../AArch64/masked-srem-fixed-length.ll | 28 +++++----
.../AArch64/masked-udiv-fixed-length.ll | 20 +++---
.../AArch64/masked-urem-fixed-length.ll | 22 ++++---
llvm/test/CodeGen/PowerPC/masked-sdiv.ll | 59 +++++++++--------
llvm/test/CodeGen/PowerPC/masked-srem.ll | 63 ++++++++++---------
llvm/test/CodeGen/PowerPC/masked-udiv.ll | 51 ++++++++-------
llvm/test/CodeGen/PowerPC/masked-urem.ll | 51 ++++++++-------
llvm/test/CodeGen/RISCV/rvv/masked-sdiv.ll | 2 +
llvm/test/CodeGen/RISCV/rvv/masked-srem.ll | 2 +
llvm/test/CodeGen/RISCV/rvv/masked-udiv.ll | 2 +
llvm/test/CodeGen/RISCV/rvv/masked-urem.ll | 2 +
13 files changed, 184 insertions(+), 154 deletions(-)
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
index 2a9667e944319..fb09309fe9f83 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
@@ -5439,15 +5439,9 @@ SDValue DAGTypeLegalizer::WidenVecRes_MaskedBinary(SDNode *N) {
SDValue InOp1 = GetWidenedVector(N->getOperand(0));
SDValue InOp2 = GetWidenedVector(N->getOperand(1));
SDValue Mask = N->getOperand(2);
- EVT MaskVT = Mask.getValueType();
- if (getTypeAction(MaskVT) == TargetLowering::TypeWidenVector)
- Mask = GetWidenedMask(Mask, WidenVT.getVectorElementCount());
- else {
- EVT WidenMaskVT = WidenVT.changeVectorElementType(
- *DAG.getContext(), MaskVT.getVectorElementType());
- Mask = DAG.getInsertSubvector(dl, DAG.getConstant(0, dl, WidenMaskVT), Mask,
- 0);
- }
+ EVT WideMaskVT = WidenVT.changeVectorElementType(
+ *DAG.getContext(), Mask.getValueType().getVectorElementType());
+ Mask = ModifyToType(Mask, WideMaskVT, /*FillWithZeroes=*/true);
return DAG.getNode(N->getOpcode(), dl, WidenVT, InOp1, InOp2, Mask,
N->getFlags());
}
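To make the zero-filling concrete, here is a hedged IR-level illustration (hypothetical names, not from the patch) of what widening a <3 x i1> mask to <4 x i1> amounts to: the pad lane is disabled, so the widened masked op cannot trap there, and its poison result is simply dropped when the vector is narrowed back.

  define <4 x i1> @widen_mask_v3i1(<3 x i1> %m) {
    ; Lanes 0-2 come from %m; lane 3 selects element 0 of
    ; zeroinitializer, i.e. the pad lane is disabled.
    %wide = shufflevector <3 x i1> %m, <3 x i1> zeroinitializer,
                          <4 x i32> <i32 0, i32 1, i32 2, i32 3>
    ret <4 x i1> %wide
  }

This corresponds to the new vmv.v.i v10, 7 / vmand.mm sequence in the RISC-V v3i10 tests below: ANDing the mask with 0b0111 clears the pad lane.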
diff --git a/llvm/test/CodeGen/AArch64/masked-sdiv-fixed-length.ll b/llvm/test/CodeGen/AArch64/masked-sdiv-fixed-length.ll
index bbb19d3a45265..733ef4f39d5d9 100644
--- a/llvm/test/CodeGen/AArch64/masked-sdiv-fixed-length.ll
+++ b/llvm/test/CodeGen/AArch64/masked-sdiv-fixed-length.ll
@@ -354,25 +354,26 @@ define <2 x i128> @sdiv_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
define <3 x i10> @sdiv_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
; NEON-LABEL: sdiv_v3i10:
; NEON: // %bb.0:
-; NEON-NEXT: fmov s0, w6
+; NEON-NEXT: movi v0.2d, #0000000000000000
; NEON-NEXT: fmov s1, w3
; NEON-NEXT: ldr w8, [sp]
; NEON-NEXT: fmov s2, w0
-; NEON-NEXT: mov v0.h[1], w7
; NEON-NEXT: mov v1.h[1], w4
+; NEON-NEXT: mov v0.h[0], w6
; NEON-NEXT: mov v2.h[1], w1
-; NEON-NEXT: mov v0.h[2], w8
; NEON-NEXT: mov v1.h[2], w5
+; NEON-NEXT: mov v0.h[1], w7
; NEON-NEXT: mov v2.h[2], w2
-; NEON-NEXT: shl v0.4h, v0.4h, #15
; NEON-NEXT: shl v1.4h, v1.4h, #6
+; NEON-NEXT: mov v0.h[2], w8
; NEON-NEXT: shl v2.4h, v2.4h, #6
-; NEON-NEXT: cmlt v0.4h, v0.4h, #0
; NEON-NEXT: sshr v1.4h, v1.4h, #6
; NEON-NEXT: sshr v2.4h, v2.4h, #6
+; NEON-NEXT: shl v0.4h, v0.4h, #15
+; NEON-NEXT: smov w8, v2.h[0]
+; NEON-NEXT: cmlt v0.4h, v0.4h, #0
; NEON-NEXT: and v1.8b, v1.8b, v0.8b
; NEON-NEXT: mvn v0.8b, v0.8b
-; NEON-NEXT: smov w8, v2.h[0]
; NEON-NEXT: sub v0.4h, v1.4h, v0.4h
; NEON-NEXT: smov w9, v0.h[0]
; NEON-NEXT: sdiv w0, w8, w9
@@ -386,23 +387,24 @@ define <3 x i10> @sdiv_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
;
; SVE-LABEL: sdiv_v3i10:
; SVE: // %bb.0:
-; SVE-NEXT: fmov s0, w6
+; SVE-NEXT: movi v0.2d, #0000000000000000
; SVE-NEXT: fmov s1, w3
; SVE-NEXT: ldr w8, [sp]
; SVE-NEXT: fmov s2, w0
; SVE-NEXT: ptrue p0.s, vl4
-; SVE-NEXT: mov v0.h[1], w7
; SVE-NEXT: mov v1.h[1], w4
+; SVE-NEXT: mov v0.h[0], w6
; SVE-NEXT: mov v2.h[1], w1
-; SVE-NEXT: mov v0.h[2], w8
; SVE-NEXT: mov v1.h[2], w5
+; SVE-NEXT: mov v0.h[1], w7
; SVE-NEXT: mov v2.h[2], w2
-; SVE-NEXT: shl v0.4h, v0.4h, #15
; SVE-NEXT: shl v1.4h, v1.4h, #6
+; SVE-NEXT: mov v0.h[2], w8
; SVE-NEXT: shl v2.4h, v2.4h, #6
-; SVE-NEXT: cmlt v0.4h, v0.4h, #0
; SVE-NEXT: sshr v1.4h, v1.4h, #6
; SVE-NEXT: sshr v2.4h, v2.4h, #6
+; SVE-NEXT: shl v0.4h, v0.4h, #15
+; SVE-NEXT: cmlt v0.4h, v0.4h, #0
; SVE-NEXT: and v1.8b, v1.8b, v0.8b
; SVE-NEXT: mvn v0.8b, v0.8b
; SVE-NEXT: sub v0.4h, v1.4h, v0.4h
diff --git a/llvm/test/CodeGen/AArch64/masked-srem-fixed-length.ll b/llvm/test/CodeGen/AArch64/masked-srem-fixed-length.ll
index 438e0319dd33e..2fbd21fb45d10 100644
--- a/llvm/test/CodeGen/AArch64/masked-srem-fixed-length.ll
+++ b/llvm/test/CodeGen/AArch64/masked-srem-fixed-length.ll
@@ -383,27 +383,28 @@ define <2 x i128> @srem_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
define <3 x i10> @srem_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
; NEON-LABEL: srem_v3i10:
; NEON: // %bb.0:
-; NEON-NEXT: fmov s0, w6
+; NEON-NEXT: movi v0.2d, #0000000000000000
; NEON-NEXT: fmov s1, w3
; NEON-NEXT: ldr w8, [sp]
; NEON-NEXT: fmov s2, w0
-; NEON-NEXT: mov v0.h[1], w7
; NEON-NEXT: mov v1.h[1], w4
+; NEON-NEXT: mov v0.h[0], w6
; NEON-NEXT: mov v2.h[1], w1
-; NEON-NEXT: mov v0.h[2], w8
; NEON-NEXT: mov v1.h[2], w5
+; NEON-NEXT: mov v0.h[1], w7
; NEON-NEXT: mov v2.h[2], w2
-; NEON-NEXT: shl v0.4h, v0.4h, #15
; NEON-NEXT: shl v1.4h, v1.4h, #6
+; NEON-NEXT: mov v0.h[2], w8
; NEON-NEXT: shl v2.4h, v2.4h, #6
-; NEON-NEXT: cmlt v0.4h, v0.4h, #0
; NEON-NEXT: sshr v1.4h, v1.4h, #6
; NEON-NEXT: sshr v2.4h, v2.4h, #6
-; NEON-NEXT: and v1.8b, v1.8b, v0.8b
-; NEON-NEXT: mvn v0.8b, v0.8b
+; NEON-NEXT: shl v0.4h, v0.4h, #15
; NEON-NEXT: smov w8, v2.h[0]
; NEON-NEXT: smov w11, v2.h[1]
; NEON-NEXT: smov w14, v2.h[2]
+; NEON-NEXT: cmlt v0.4h, v0.4h, #0
+; NEON-NEXT: and v1.8b, v1.8b, v0.8b
+; NEON-NEXT: mvn v0.8b, v0.8b
; NEON-NEXT: sub v0.4h, v1.4h, v0.4h
; NEON-NEXT: smov w9, v0.h[0]
; NEON-NEXT: smov w12, v0.h[1]
@@ -418,26 +419,27 @@ define <3 x i10> @srem_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
;
; SVE-LABEL: srem_v3i10:
; SVE: // %bb.0:
-; SVE-NEXT: fmov s0, w6
+; SVE-NEXT: movi v0.2d, #0000000000000000
; SVE-NEXT: fmov s1, w3
; SVE-NEXT: ldr w8, [sp]
; SVE-NEXT: fmov s2, w0
; SVE-NEXT: ptrue p0.s, vl4
-; SVE-NEXT: mov v0.h[1], w7
; SVE-NEXT: mov v1.h[1], w4
+; SVE-NEXT: mov v0.h[0], w6
; SVE-NEXT: mov v2.h[1], w1
-; SVE-NEXT: mov v0.h[2], w8
; SVE-NEXT: mov v1.h[2], w5
+; SVE-NEXT: mov v0.h[1], w7
; SVE-NEXT: mov v2.h[2], w2
-; SVE-NEXT: shl v0.4h, v0.4h, #15
; SVE-NEXT: shl v1.4h, v1.4h, #6
+; SVE-NEXT: mov v0.h[2], w8
; SVE-NEXT: shl v2.4h, v2.4h, #6
-; SVE-NEXT: cmlt v0.4h, v0.4h, #0
; SVE-NEXT: sshr v1.4h, v1.4h, #6
; SVE-NEXT: sshr v2.4h, v2.4h, #6
+; SVE-NEXT: shl v0.4h, v0.4h, #15
+; SVE-NEXT: sshll v3.4s, v2.4h, #0
+; SVE-NEXT: cmlt v0.4h, v0.4h, #0
; SVE-NEXT: and v1.8b, v1.8b, v0.8b
; SVE-NEXT: mvn v0.8b, v0.8b
-; SVE-NEXT: sshll v3.4s, v2.4h, #0
; SVE-NEXT: sub v0.4h, v1.4h, v0.4h
; SVE-NEXT: sshll v1.4s, v0.4h, #0
; SVE-NEXT: sdivr z1.s, p0/m, z1.s, z3.s
diff --git a/llvm/test/CodeGen/AArch64/masked-udiv-fixed-length.ll b/llvm/test/CodeGen/AArch64/masked-udiv-fixed-length.ll
index 65fb9d6c163d4..b28a1cf5b7de8 100644
--- a/llvm/test/CodeGen/AArch64/masked-udiv-fixed-length.ll
+++ b/llvm/test/CodeGen/AArch64/masked-udiv-fixed-length.ll
@@ -354,21 +354,22 @@ define <2 x i128> @udiv_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
define <3 x i10> @udiv_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
; NEON-LABEL: udiv_v3i10:
; NEON: // %bb.0:
-; NEON-NEXT: fmov s0, w6
+; NEON-NEXT: movi v0.2d, #0000000000000000
; NEON-NEXT: fmov s1, w3
; NEON-NEXT: ldr w8, [sp]
; NEON-NEXT: fmov s2, w0
-; NEON-NEXT: mov v0.h[1], w7
; NEON-NEXT: mov v1.h[1], w4
+; NEON-NEXT: mov v0.h[0], w6
; NEON-NEXT: mov v2.h[1], w1
-; NEON-NEXT: mov v0.h[2], w8
; NEON-NEXT: mov v1.h[2], w5
+; NEON-NEXT: mov v0.h[1], w7
; NEON-NEXT: mov v2.h[2], w2
-; NEON-NEXT: shl v0.4h, v0.4h, #15
; NEON-NEXT: bic v1.4h, #252, lsl #8
+; NEON-NEXT: mov v0.h[2], w8
; NEON-NEXT: bic v2.4h, #252, lsl #8
-; NEON-NEXT: cmlt v0.4h, v0.4h, #0
; NEON-NEXT: umov w9, v2.h[0]
+; NEON-NEXT: shl v0.4h, v0.4h, #15
+; NEON-NEXT: cmlt v0.4h, v0.4h, #0
; NEON-NEXT: and v1.8b, v1.8b, v0.8b
; NEON-NEXT: mvn v0.8b, v0.8b
; NEON-NEXT: sub v0.4h, v1.4h, v0.4h
@@ -387,20 +388,21 @@ define <3 x i10> @udiv_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
;
; SVE-LABEL: udiv_v3i10:
; SVE: // %bb.0:
-; SVE-NEXT: fmov s0, w6
+; SVE-NEXT: movi v0.2d, #0000000000000000
; SVE-NEXT: fmov s1, w3
; SVE-NEXT: ldr w8, [sp]
; SVE-NEXT: fmov s2, w0
; SVE-NEXT: ptrue p0.s, vl4
-; SVE-NEXT: mov v0.h[1], w7
; SVE-NEXT: mov v1.h[1], w4
+; SVE-NEXT: mov v0.h[0], w6
; SVE-NEXT: mov v2.h[1], w1
-; SVE-NEXT: mov v0.h[2], w8
; SVE-NEXT: mov v1.h[2], w5
+; SVE-NEXT: mov v0.h[1], w7
; SVE-NEXT: mov v2.h[2], w2
-; SVE-NEXT: shl v0.4h, v0.4h, #15
; SVE-NEXT: bic v1.4h, #252, lsl #8
+; SVE-NEXT: mov v0.h[2], w8
; SVE-NEXT: bic v2.4h, #252, lsl #8
+; SVE-NEXT: shl v0.4h, v0.4h, #15
; SVE-NEXT: cmlt v0.4h, v0.4h, #0
; SVE-NEXT: and v1.8b, v1.8b, v0.8b
; SVE-NEXT: mvn v0.8b, v0.8b
diff --git a/llvm/test/CodeGen/AArch64/masked-urem-fixed-length.ll b/llvm/test/CodeGen/AArch64/masked-urem-fixed-length.ll
index 1067c464a787f..254e4ed3047f6 100644
--- a/llvm/test/CodeGen/AArch64/masked-urem-fixed-length.ll
+++ b/llvm/test/CodeGen/AArch64/masked-urem-fixed-length.ll
@@ -383,23 +383,24 @@ define <2 x i128> @urem_v2i128(<2 x i128> %x, <2 x i128> %y, <2 x i1> %m) {
define <3 x i10> @urem_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
; NEON-LABEL: urem_v3i10:
; NEON: // %bb.0:
-; NEON-NEXT: fmov s0, w6
+; NEON-NEXT: movi v0.2d, #0000000000000000
; NEON-NEXT: fmov s1, w3
; NEON-NEXT: ldr w8, [sp]
; NEON-NEXT: fmov s2, w0
-; NEON-NEXT: mov v0.h[1], w7
; NEON-NEXT: mov v1.h[1], w4
+; NEON-NEXT: mov v0.h[0], w6
; NEON-NEXT: mov v2.h[1], w1
-; NEON-NEXT: mov v0.h[2], w8
; NEON-NEXT: mov v1.h[2], w5
+; NEON-NEXT: mov v0.h[1], w7
; NEON-NEXT: mov v2.h[2], w2
-; NEON-NEXT: shl v0.4h, v0.4h, #15
; NEON-NEXT: bic v1.4h, #252, lsl #8
+; NEON-NEXT: mov v0.h[2], w8
; NEON-NEXT: bic v2.4h, #252, lsl #8
-; NEON-NEXT: cmlt v0.4h, v0.4h, #0
; NEON-NEXT: umov w9, v2.h[0]
; NEON-NEXT: umov w12, v2.h[1]
; NEON-NEXT: umov w15, v2.h[2]
+; NEON-NEXT: shl v0.4h, v0.4h, #15
+; NEON-NEXT: cmlt v0.4h, v0.4h, #0
; NEON-NEXT: and v1.8b, v1.8b, v0.8b
; NEON-NEXT: mvn v0.8b, v0.8b
; NEON-NEXT: sub v0.4h, v1.4h, v0.4h
@@ -419,22 +420,23 @@ define <3 x i10> @urem_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
;
; SVE-LABEL: urem_v3i10:
; SVE: // %bb.0:
-; SVE-NEXT: fmov s0, w6
+; SVE-NEXT: movi v0.2d, #0000000000000000
; SVE-NEXT: fmov s1, w3
; SVE-NEXT: ldr w8, [sp]
; SVE-NEXT: fmov s2, w0
; SVE-NEXT: ptrue p0.s, vl4
-; SVE-NEXT: mov v0.h[1], w7
; SVE-NEXT: mov v1.h[1], w4
+; SVE-NEXT: mov v0.h[0], w6
; SVE-NEXT: mov v2.h[1], w1
-; SVE-NEXT: mov v0.h[2], w8
; SVE-NEXT: mov v1.h[2], w5
+; SVE-NEXT: mov v0.h[1], w7
; SVE-NEXT: mov v2.h[2], w2
-; SVE-NEXT: shl v0.4h, v0.4h, #15
; SVE-NEXT: bic v1.4h, #252, lsl #8
+; SVE-NEXT: mov v0.h[2], w8
; SVE-NEXT: bic v2.4h, #252, lsl #8
-; SVE-NEXT: cmlt v0.4h, v0.4h, #0
; SVE-NEXT: ushll v3.4s, v2.4h, #0
+; SVE-NEXT: shl v0.4h, v0.4h, #15
+; SVE-NEXT: cmlt v0.4h, v0.4h, #0
; SVE-NEXT: and v1.8b, v1.8b, v0.8b
; SVE-NEXT: mvn v0.8b, v0.8b
; SVE-NEXT: sub v0.4h, v1.4h, v0.4h
diff --git a/llvm/test/CodeGen/PowerPC/masked-sdiv.ll b/llvm/test/CodeGen/PowerPC/masked-sdiv.ll
index 0d824bc79fec2..51fc8ade4bcc1 100644
--- a/llvm/test/CodeGen/PowerPC/masked-sdiv.ll
+++ b/llvm/test/CodeGen/PowerPC/masked-sdiv.ll
@@ -342,34 +342,39 @@ define <3 x i10> @sdiv_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
; CHECK-NEXT: mtfprwz 0, 9
; CHECK-NEXT: mtfprwz 1, 10
; CHECK-NEXT: addis 9, 2, .LCPI7_0 at toc@ha
-; CHECK-NEXT: xxleqv 38, 38, 38
-; CHECK-NEXT: vspltisw 5, 11
; CHECK-NEXT: addi 9, 9, .LCPI7_0 at toc@l
-; CHECK-NEXT: vadduwm 5, 5, 5
+; CHECK-NEXT: mtvsrwz 38, 8
+; CHECK-NEXT: vspltisw 4, 11
+; CHECK-NEXT: vadduwm 4, 4, 4
+; CHECK-NEXT: lxvd2x 2, 0, 9
; CHECK-NEXT: xxmrghw 35, 1, 0
-; CHECK-NEXT: lxvd2x 0, 0, 9
-; CHECK-NEXT: mtfprwz 1, 7
-; CHECK-NEXT: xxswapd 34, 0
; CHECK-NEXT: mtfprwz 0, 6
-; CHECK-NEXT: xxmrghw 36, 1, 0
-; CHECK-NEXT: mtfprwz 0, 3
-; CHECK-NEXT: lbz 3, 96(1)
+; CHECK-NEXT: lbz 6, 96(1)
+; CHECK-NEXT: mtfprwz 1, 7
+; CHECK-NEXT: mtvsrwz 32, 6
+; CHECK-NEXT: addis 6, 2, .LCPI7_1 at toc@ha
+; CHECK-NEXT: addi 6, 6, .LCPI7_1 at toc@l
+; CHECK-NEXT: xxswapd 34, 2
+; CHECK-NEXT: xxmrghw 37, 1, 0
; CHECK-NEXT: mtfprwz 1, 4
-; CHECK-NEXT: mtvsrwz 33, 3
-; CHECK-NEXT: xxmrghw 32, 1, 0
-; CHECK-NEXT: vperm 3, 1, 3, 2
-; CHECK-NEXT: mtvsrwz 33, 8
-; CHECK-NEXT: vslw 3, 3, 6
-; CHECK-NEXT: vsraw 3, 3, 6
-; CHECK-NEXT: vperm 4, 1, 4, 2
-; CHECK-NEXT: mtvsrwz 33, 5
-; CHECK-NEXT: vslw 4, 4, 5
-; CHECK-NEXT: vsraw 4, 4, 5
-; CHECK-NEXT: vperm 0, 1, 0, 2
-; CHECK-NEXT: vspltisw 1, 1
-; CHECK-NEXT: xxsel 0, 33, 36, 35
-; CHECK-NEXT: vslw 3, 0, 5
-; CHECK-NEXT: vsraw 3, 3, 5
+; CHECK-NEXT: lxvd2x 0, 0, 6
+; CHECK-NEXT: vperm 5, 6, 5, 2
+; CHECK-NEXT: mtvsrwz 38, 5
+; CHECK-NEXT: vslw 5, 5, 4
+; CHECK-NEXT: vsraw 5, 5, 4
+; CHECK-NEXT: vperm 3, 0, 3, 2
+; CHECK-NEXT: xxswapd 32, 0
+; CHECK-NEXT: mtfprwz 0, 3
+; CHECK-NEXT: xxland 35, 35, 32
+; CHECK-NEXT: xxleqv 32, 32, 32
+; CHECK-NEXT: vslw 3, 3, 0
+; CHECK-NEXT: vsraw 3, 3, 0
+; CHECK-NEXT: xxmrghw 33, 1, 0
+; CHECK-NEXT: vperm 1, 6, 1, 2
+; CHECK-NEXT: vspltisw 6, 1
+; CHECK-NEXT: xxsel 0, 38, 37, 35
+; CHECK-NEXT: vslw 3, 1, 4
+; CHECK-NEXT: vsraw 3, 3, 4
; CHECK-NEXT: xxswapd 1, 0
; CHECK-NEXT: xxsldwi 3, 0, 0, 1
; CHECK-NEXT: mffprwz 3, 1
@@ -377,10 +382,10 @@ define <3 x i10> @sdiv_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
; CHECK-NEXT: xxsldwi 4, 35, 35, 1
; CHECK-NEXT: mffprwz 4, 2
; CHECK-NEXT: divw 3, 4, 3
-; CHECK-NEXT: mffprwz 4, 4
+; CHECK-NEXT: mffprwz 4, 3
; CHECK-NEXT: mtfprwz 1, 3
-; CHECK-NEXT: mffprwz 3, 3
-; CHECK-NEXT: divw 3, 4, 3
+; CHECK-NEXT: mffprwz 3, 4
+; CHECK-NEXT: divw 3, 3, 4
; CHECK-NEXT: mfvsrwz 4, 35
; CHECK-NEXT: mtfprwz 2, 3
; CHECK-NEXT: mffprwz 3, 0
diff --git a/llvm/test/CodeGen/PowerPC/masked-srem.ll b/llvm/test/CodeGen/PowerPC/masked-srem.ll
index be2e1ed7c12e6..4fd7c58a01c0a 100644
--- a/llvm/test/CodeGen/PowerPC/masked-srem.ll
+++ b/llvm/test/CodeGen/PowerPC/masked-srem.ll
@@ -400,34 +400,39 @@ define <3 x i10> @srem_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
; CHECK-NEXT: mtfprwz 0, 9
; CHECK-NEXT: mtfprwz 1, 10
; CHECK-NEXT: addis 9, 2, .LCPI7_0 at toc@ha
-; CHECK-NEXT: xxleqv 38, 38, 38
-; CHECK-NEXT: vspltisw 5, 11
; CHECK-NEXT: addi 9, 9, .LCPI7_0 at toc@l
-; CHECK-NEXT: vadduwm 5, 5, 5
+; CHECK-NEXT: mtvsrwz 38, 8
+; CHECK-NEXT: vspltisw 4, 11
+; CHECK-NEXT: vadduwm 4, 4, 4
+; CHECK-NEXT: lxvd2x 2, 0, 9
; CHECK-NEXT: xxmrghw 35, 1, 0
-; CHECK-NEXT: lxvd2x 0, 0, 9
-; CHECK-NEXT: mtfprwz 1, 7
-; CHECK-NEXT: xxswapd 34, 0
; CHECK-NEXT: mtfprwz 0, 6
-; CHECK-NEXT: xxmrghw 36, 1, 0
-; CHECK-NEXT: mtfprwz 0, 3
-; CHECK-NEXT: lbz 3, 96(1)
+; CHECK-NEXT: lbz 6, 96(1)
+; CHECK-NEXT: mtfprwz 1, 7
+; CHECK-NEXT: mtvsrwz 32, 6
+; CHECK-NEXT: addis 6, 2, .LCPI7_1 at toc@ha
+; CHECK-NEXT: addi 6, 6, .LCPI7_1 at toc@l
+; CHECK-NEXT: xxswapd 34, 2
+; CHECK-NEXT: xxmrghw 37, 1, 0
; CHECK-NEXT: mtfprwz 1, 4
-; CHECK-NEXT: mtvsrwz 33, 3
-; CHECK-NEXT: xxmrghw 32, 1, 0
-; CHECK-NEXT: vperm 3, 1, 3, 2
-; CHECK-NEXT: mtvsrwz 33, 8
-; CHECK-NEXT: vslw 3, 3, 6
-; CHECK-NEXT: vsraw 3, 3, 6
-; CHECK-NEXT: vperm 4, 1, 4, 2
-; CHECK-NEXT: mtvsrwz 33, 5
-; CHECK-NEXT: vslw 4, 4, 5
-; CHECK-NEXT: vsraw 4, 4, 5
-; CHECK-NEXT: vperm 0, 1, 0, 2
-; CHECK-NEXT: vspltisw 1, 1
-; CHECK-NEXT: xxsel 0, 33, 36, 35
-; CHECK-NEXT: vslw 3, 0, 5
-; CHECK-NEXT: vsraw 3, 3, 5
+; CHECK-NEXT: lxvd2x 0, 0, 6
+; CHECK-NEXT: vperm 5, 6, 5, 2
+; CHECK-NEXT: mtvsrwz 38, 5
+; CHECK-NEXT: vslw 5, 5, 4
+; CHECK-NEXT: vsraw 5, 5, 4
+; CHECK-NEXT: vperm 3, 0, 3, 2
+; CHECK-NEXT: xxswapd 32, 0
+; CHECK-NEXT: mtfprwz 0, 3
+; CHECK-NEXT: xxland 35, 35, 32
+; CHECK-NEXT: xxleqv 32, 32, 32
+; CHECK-NEXT: vslw 3, 3, 0
+; CHECK-NEXT: vsraw 3, 3, 0
+; CHECK-NEXT: xxmrghw 33, 1, 0
+; CHECK-NEXT: vperm 1, 6, 1, 2
+; CHECK-NEXT: vspltisw 6, 1
+; CHECK-NEXT: xxsel 0, 38, 37, 35
+; CHECK-NEXT: vslw 3, 1, 4
+; CHECK-NEXT: vsraw 3, 3, 4
; CHECK-NEXT: xxswapd 1, 0
; CHECK-NEXT: xxsldwi 3, 0, 0, 1
; CHECK-NEXT: mffprwz 3, 1
@@ -437,12 +442,12 @@ define <3 x i10> @srem_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
; CHECK-NEXT: divw 5, 4, 3
; CHECK-NEXT: mullw 3, 5, 3
; CHECK-NEXT: sub 3, 4, 3
-; CHECK-NEXT: mffprwz 4, 4
+; CHECK-NEXT: mffprwz 4, 3
; CHECK-NEXT: mtfprwz 1, 3
-; CHECK-NEXT: mffprwz 3, 3
-; CHECK-NEXT: divw 5, 4, 3
-; CHECK-NEXT: mullw 3, 5, 3
-; CHECK-NEXT: sub 3, 4, 3
+; CHECK-NEXT: mffprwz 3, 4
+; CHECK-NEXT: divw 5, 3, 4
+; CHECK-NEXT: mullw 4, 5, 4
+; CHECK-NEXT: sub 3, 3, 4
; CHECK-NEXT: mfvsrwz 4, 35
; CHECK-NEXT: mtfprwz 2, 3
; CHECK-NEXT: mffprwz 3, 0
diff --git a/llvm/test/CodeGen/PowerPC/masked-udiv.ll b/llvm/test/CodeGen/PowerPC/masked-udiv.ll
index c0d4fd8f4ddc2..282f015c51a19 100644
--- a/llvm/test/CodeGen/PowerPC/masked-udiv.ll
+++ b/llvm/test/CodeGen/PowerPC/masked-udiv.ll
@@ -340,40 +340,45 @@ define <3 x i10> @udiv_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
; CHECK-LABEL: udiv_v3i10:
; CHECK: # %bb.0:
; CHECK-NEXT: mtfprwz 0, 9
-; CHECK-NEXT: mtfprwz 1, 10
; CHECK-NEXT: addis 9, 2, .LCPI7_0 at toc@ha
+; CHECK-NEXT: mtfprwz 1, 10
; CHECK-NEXT: addi 9, 9, .LCPI7_0 at toc@l
-; CHECK-NEXT: mtvsrwz 33, 8
+; CHECK-NEXT: mtvsrwz 38, 8
+; CHECK-NEXT: vspltisw 0, -10
+; CHECK-NEXT: vsrw 0, 0, 0
+; CHECK-NEXT: lxvd2x 2, 0, 9
+; CHECK-NEXT: lbz 9, 96(1)
+; CHECK-NEXT: mtvsrwz 36, 9
+; CHECK-NEXT: addis 9, 2, .LCPI7_1 at toc@ha
+; CHECK-NEXT: addi 9, 9, .LCPI7_1 at toc@l
; CHECK-NEXT: xxmrghw 34, 1, 0
-; CHECK-NEXT: lxvd2x 0, 0, 9
; CHECK-NEXT: mtfprwz 1, 7
-; CHECK-NEXT: xxswapd 35, 0
+; CHECK-NEXT: lxvd2x 0, 0, 9
+; CHECK-NEXT: xxswapd 35, 2
+; CHECK-NEXT: vperm 2, 4, 2, 3
+; CHECK-NEXT: xxswapd 36, 0
; CHECK-NEXT: mtfprwz 0, 6
-; CHECK-NEXT: lbz 6, 96(1)
-; CHECK-NEXT: mtvsrwz 37, 6
-; CHECK-NEXT: xxmrghw 36, 1, 0
-; CHECK-NEXT: mtfprwz 0, 3
-; CHECK-NEXT: mtfprwz 1, 4
-; CHECK-NEXT: vperm 4, 1, 4, 3
-; CHECK-NEXT: mtvsrwz 33, 5
-; CHECK-NEXT: vperm 2, 5, 2, 3
-; CHECK-NEXT: vspltisw 5, -10
-; CHECK-NEXT: vsrw 5, 5, 5
-; CHECK-NEXT: xxmrghw 32, 1, 0
-; CHECK-NEXT: xxland 0, 36, 37
+; CHECK-NEXT: xxland 34, 34, 36
; CHECK-NEXT: xxleqv 36, 36, 36
; CHECK-NEXT: vslw 2, 2, 4
; CHECK-NEXT: vsraw 2, 2, 4
-; CHECK-NEXT: vperm 0, 1, 0, 3
-; CHECK-NEXT: vspltisw 1, 1
-; CHECK-NEXT: xxland 1, 32, 37
-; CHECK-NEXT: xxswapd 3, 1
-; CHECK-NEXT: xxsldwi 5, 1, 1, 1
-; CHECK-NEXT: mffprwz 4, 3
-; CHECK-NEXT: xxsel 0, 33, 0, 34
+; CHECK-NEXT: xxmrghw 37, 1, 0
+; CHECK-NEXT: mtfprwz 0, 3
+; CHECK-NEXT: mtfprwz 1, 4
+; CHECK-NEXT: vperm 5, 6, 5, 3
+; CHECK-NEXT: mtvsrwz 38, 5
+; CHECK-NEXT: xxmrghw 33, 1, 0
+; CHECK-NEXT: xxland 0, 37, 32
+; CHECK-NEXT: vperm 1, 6, 1, 3
+; CHECK-NEXT: vspltisw 6, 1
+; CHECK-NEXT: xxland 1, 33, 32
+; CHECK-NEXT: xxsel 0, 38, 0, 34
; CHECK-NEXT: xxswapd 2, 0
+; CHECK-NEXT: xxswapd 3, 1
; CHECK-NEXT: xxsldwi 4, 0, 0, 1
+; CHECK-NEXT: xxsldwi 5, 1, 1, 1
; CHECK-NEXT: mffprwz 3, 2
+; CHECK-NEXT: mffprwz 4, 3
; CHECK-NEXT: divwu 3, 4, 3
; CHECK-NEXT: mffprwz 4, 4
; CHECK-NEXT: mtfprwz 2, 3
diff --git a/llvm/test/CodeGen/PowerPC/masked-urem.ll b/llvm/test/CodeGen/PowerPC/masked-urem.ll
index c76d57d572a8c..64718cf795207 100644
--- a/llvm/test/CodeGen/PowerPC/masked-urem.ll
+++ b/llvm/test/CodeGen/PowerPC/masked-urem.ll
@@ -398,40 +398,45 @@ define <3 x i10> @urem_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
; CHECK-LABEL: urem_v3i10:
; CHECK: # %bb.0:
; CHECK-NEXT: mtfprwz 0, 9
-; CHECK-NEXT: mtfprwz 1, 10
; CHECK-NEXT: addis 9, 2, .LCPI7_0 at toc@ha
+; CHECK-NEXT: mtfprwz 1, 10
; CHECK-NEXT: addi 9, 9, .LCPI7_0 at toc@l
-; CHECK-NEXT: mtvsrwz 33, 8
+; CHECK-NEXT: mtvsrwz 38, 8
+; CHECK-NEXT: vspltisw 0, -10
+; CHECK-NEXT: vsrw 0, 0, 0
+; CHECK-NEXT: lxvd2x 2, 0, 9
+; CHECK-NEXT: lbz 9, 96(1)
+; CHECK-NEXT: mtvsrwz 36, 9
+; CHECK-NEXT: addis 9, 2, .LCPI7_1 at toc@ha
+; CHECK-NEXT: addi 9, 9, .LCPI7_1 at toc@l
; CHECK-NEXT: xxmrghw 35, 1, 0
-; CHECK-NEXT: lxvd2x 0, 0, 9
; CHECK-NEXT: mtfprwz 1, 7
-; CHECK-NEXT: xxswapd 34, 0
+; CHECK-NEXT: lxvd2x 0, 0, 9
+; CHECK-NEXT: xxswapd 34, 2
+; CHECK-NEXT: vperm 3, 4, 3, 2
+; CHECK-NEXT: xxswapd 36, 0
; CHECK-NEXT: mtfprwz 0, 6
-; CHECK-NEXT: lbz 6, 96(1)
-; CHECK-NEXT: mtvsrwz 37, 6
-; CHECK-NEXT: xxmrghw 36, 1, 0
-; CHECK-NEXT: mtfprwz 0, 3
-; CHECK-NEXT: mtfprwz 1, 4
-; CHECK-NEXT: vperm 4, 1, 4, 2
-; CHECK-NEXT: mtvsrwz 33, 5
-; CHECK-NEXT: vperm 3, 5, 3, 2
-; CHECK-NEXT: vspltisw 5, -10
-; CHECK-NEXT: vsrw 5, 5, 5
-; CHECK-NEXT: xxmrghw 32, 1, 0
-; CHECK-NEXT: xxland 0, 36, 37
+; CHECK-NEXT: xxland 35, 35, 36
; CHECK-NEXT: xxleqv 36, 36, 36
; CHECK-NEXT: vslw 3, 3, 4
; CHECK-NEXT: vsraw 3, 3, 4
-; CHECK-NEXT: vperm 0, 1, 0, 2
-; CHECK-NEXT: vspltisw 1, 1
-; CHECK-NEXT: xxland 1, 32, 37
-; CHECK-NEXT: xxswapd 3, 1
-; CHECK-NEXT: xxsldwi 5, 1, 1, 1
-; CHECK-NEXT: mffprwz 4, 3
-; CHECK-NEXT: xxsel 0, 33, 0, 35
+; CHECK-NEXT: xxmrghw 37, 1, 0
+; CHECK-NEXT: mtfprwz 0, 3
+; CHECK-NEXT: mtfprwz 1, 4
+; CHECK-NEXT: vperm 5, 6, 5, 2
+; CHECK-NEXT: mtvsrwz 38, 5
+; CHECK-NEXT: xxmrghw 33, 1, 0
+; CHECK-NEXT: xxland 0, 37, 32
+; CHECK-NEXT: vperm 1, 6, 1, 2
+; CHECK-NEXT: vspltisw 6, 1
+; CHECK-NEXT: xxland 1, 33, 32
+; CHECK-NEXT: xxsel 0, 38, 0, 35
; CHECK-NEXT: xxswapd 2, 0
+; CHECK-NEXT: xxswapd 3, 1
; CHECK-NEXT: xxsldwi 4, 0, 0, 1
+; CHECK-NEXT: xxsldwi 5, 1, 1, 1
; CHECK-NEXT: mffprwz 3, 2
+; CHECK-NEXT: mffprwz 4, 3
; CHECK-NEXT: divwu 5, 4, 3
; CHECK-NEXT: mullw 3, 5, 3
; CHECK-NEXT: sub 3, 4, 3
diff --git a/llvm/test/CodeGen/RISCV/rvv/masked-sdiv.ll b/llvm/test/CodeGen/RISCV/rvv/masked-sdiv.ll
index fc6e4afdbc31e..50b40659c6f37 100644
--- a/llvm/test/CodeGen/RISCV/rvv/masked-sdiv.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/masked-sdiv.ll
@@ -271,6 +271,8 @@ define <3 x i10> @sdiv_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
; CHECK-NEXT: ld a3, 8(a2)
; CHECK-NEXT: ld a2, 16(a2)
; CHECK-NEXT: vmv.v.x v9, a5
+; CHECK-NEXT: vmv.v.i v10, 7
+; CHECK-NEXT: vmand.mm v0, v0, v10
; CHECK-NEXT: vmv.v.i v10, 1
; CHECK-NEXT: vslide1down.vx v8, v8, a4
; CHECK-NEXT: vslide1down.vx v9, v9, a3
diff --git a/llvm/test/CodeGen/RISCV/rvv/masked-srem.ll b/llvm/test/CodeGen/RISCV/rvv/masked-srem.ll
index eb0c9e97b023a..bdcab24d275db 100644
--- a/llvm/test/CodeGen/RISCV/rvv/masked-srem.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/masked-srem.ll
@@ -271,6 +271,8 @@ define <3 x i10> @srem_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
; CHECK-NEXT: ld a3, 8(a2)
; CHECK-NEXT: ld a2, 16(a2)
; CHECK-NEXT: vmv.v.x v9, a5
+; CHECK-NEXT: vmv.v.i v10, 7
+; CHECK-NEXT: vmand.mm v0, v0, v10
; CHECK-NEXT: vmv.v.i v10, 1
; CHECK-NEXT: vslide1down.vx v8, v8, a4
; CHECK-NEXT: vslide1down.vx v9, v9, a3
diff --git a/llvm/test/CodeGen/RISCV/rvv/masked-udiv.ll b/llvm/test/CodeGen/RISCV/rvv/masked-udiv.ll
index c2f151d8fd47e..1d45e4153d830 100644
--- a/llvm/test/CodeGen/RISCV/rvv/masked-udiv.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/masked-udiv.ll
@@ -273,6 +273,8 @@ define <3 x i10> @udiv_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
; CHECK-NEXT: vmv.v.x v9, a5
; CHECK-NEXT: vslide1down.vx v8, v8, a4
; CHECK-NEXT: li a4, 1023
+; CHECK-NEXT: vmv.v.i v10, 7
+; CHECK-NEXT: vmand.mm v0, v0, v10
; CHECK-NEXT: vmv.v.i v10, 1
; CHECK-NEXT: vslide1down.vx v9, v9, a3
; CHECK-NEXT: vslide1down.vx v8, v8, a1
diff --git a/llvm/test/CodeGen/RISCV/rvv/masked-urem.ll b/llvm/test/CodeGen/RISCV/rvv/masked-urem.ll
index b0d2bdae583b0..ef20eddeedf95 100644
--- a/llvm/test/CodeGen/RISCV/rvv/masked-urem.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/masked-urem.ll
@@ -273,6 +273,8 @@ define <3 x i10> @urem_v3i10(<3 x i10> %x, <3 x i10> %y, <3 x i1> %m) {
; CHECK-NEXT: vmv.v.x v9, a5
; CHECK-NEXT: vslide1down.vx v8, v8, a4
; CHECK-NEXT: li a4, 1023
+; CHECK-NEXT: vmv.v.i v10, 7
+; CHECK-NEXT: vmand.mm v0, v0, v10
; CHECK-NEXT: vmv.v.i v10, 1
; CHECK-NEXT: vslide1down.vx v9, v9, a3
; CHECK-NEXT: vslide1down.vx v8, v8, a1
>From 2cde89ff3ade7161204567fe7df9738abd5c9709 Mon Sep 17 00:00:00 2001
From: Luke Lau <luke at igalia.com>
Date: Wed, 1 Apr 2026 18:20:46 +0800
Subject: [PATCH 7/7] Tweak wording to be more readable
---
llvm/docs/LangRef.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 3bbbf498dcd7f..c9b387cc85263 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -27950,7 +27950,7 @@ The first two arguments and the result have the same vector of integer type. The
Semantics:
""""""""""
-Unlike :ref:`sdiv <i_sdiv>`, disabled lanes produce poison, and both overflow and division by zero on disabled lanes is not undefined behavior. Overflow and division by zero on enabled lanes is still undefined behavior.
+Unlike :ref:`sdiv <i_sdiv>`, disabled lanes produce poison. Overflow and division by zero on disabled lanes are not undefined behavior. Overflow and division by zero on enabled lanes are still undefined behavior.
'``llvm.masked.urem.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -28004,7 +28004,7 @@ The first two arguments and the result have the same vector of integer type. The
Semantics:
""""""""""
-Unlike :ref:`srem <i_srem>`, disabled lanes produce poison, and both overflow and taking the remainder of a division by zero on disabled lanes is not undefined behavior. Overflow and taking the remainder of a division by zero on enabled lanes is still undefined behavior.
+Unlike :ref:`srem <i_srem>`, disabled lanes produce poison. Overflow and taking the remainder of a division by zero on disabled lanes are not undefined behavior. Overflow and taking the remainder of a division by zero on enabled lanes are still undefined behavior.
Memory Use Markers
------------------
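As a hedged illustration of the semantics spelled out in the LangRef hunks above (the example is mine, not from the patch): lane 1 below is disabled, so its zero divisor yields poison in that lane rather than undefined behavior.

  define <2 x i32> @masked_sdiv_example() {
    ; Lane 1 is disabled, so dividing by its zero divisor is not UB.
    %r = call <2 x i32> @llvm.masked.sdiv(<2 x i32> <i32 8, i32 9>,
                                          <2 x i32> <i32 2, i32 0>,
                                          <2 x i1> <i1 true, i1 false>)
    ; %r is <i32 4, i32 poison>
    ret <2 x i32> %r
  }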