[llvm] 3a8ea86 - [Intrinsics] define semantics for experimental fmax/fmin vector reductions

Sanjay Patel via llvm-commits llvm-commits at lists.llvm.org
Sat Sep 12 06:10:37 PDT 2020


Author: Sanjay Patel
Date: 2020-09-12T09:10:28-04:00
New Revision: 3a8ea8609b82b7e5401698b7c63df6680e1257a8

URL: https://github.com/llvm/llvm-project/commit/3a8ea8609b82b7e5401698b7c63df6680e1257a8
DIFF: https://github.com/llvm/llvm-project/commit/3a8ea8609b82b7e5401698b7c63df6680e1257a8.diff

LOG: [Intrinsics] define semantics for experimental fmax/fmin vector reductions

As discussed on llvm-dev:
http://lists.llvm.org/pipermail/llvm-dev/2020-April/140729.html

This is hopefully the final remaining showstopper before we can remove
the 'experimental' from the reduction intrinsics.

No behavior was specified for the FP min/max reductions, so we have a
mess of different interpretations.

There are a few potential options for the semantics of these max/min ops.
I think this is the simplest option given the current behavior/implementation:
make the reductions inherit from the existing llvm.maxnum/minnum intrinsics.
These correspond to libm fmax/fmin, and those are similar to the (now
deprecated?) IEEE-754 maxNum/minNum functions (NaNs are treated as missing
data). So the default expansion creates calls to libm functions.
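
As a rough scalar illustration (not part of this patch; the function and values
are hypothetical), the libm-based expansion behaves like the following C++,
where a NaN element is ignored unless every element is NaN:

  // Minimal sketch of maxnum-style reduction semantics via libm fmax:
  // fmax(x, NaN) == fmax(NaN, x) == x, so NaN acts as missing data.
  #include <cmath>
  #include <cstdio>

  int main() {
    double v[4] = {1.0, NAN, -3.0, 2.0};
    double r = v[0];
    for (int i = 1; i < 4; ++i)
      r = std::fmax(r, v[i]);
    std::printf("%f\n", r); // prints 2.000000; an all-NaN input would yield NaN
    return 0;
  }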

Another option would be to inherit from llvm.maximum/minimum (NaNs propagate),
but most targets just crash in codegen when given those nodes because no
default expansion was ever implemented AFAICT.
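
For contrast, a NaN-propagating reduction (the maximum/minimum flavor) would
look roughly like this sketch (hypothetical helper; it also ignores the
-0.0 < +0.0 ordering that llvm.maximum/minimum define):

  // Sketch of the alternative not taken here: any NaN element propagates to
  // the final result, unlike the fmax/maxnum behavior shown above.
  #include <cmath>

  double reduce_maximum(const double *v, int n) {
    double r = v[0];
    for (int i = 1; i < n; ++i)
      r = (std::isnan(r) || std::isnan(v[i])) ? NAN : (v[i] > r ? v[i] : r);
    return r;
  }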

We could also just assume 'nnan' semantics by default (we are already
assuming 'nsz' semantics in the maxnum/minnum intrinsics), but some targets
(AArch64, PowerPC) support the more defined behavior, so it makes sense to
allow the tighter spec. The fast-math flag 'nnan' can be used to loosen the
semantics.
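
For illustration (hypothetical helper, not code from this patch), 'nnan'
permits lowering the reduction to a plain ordered compare+select chain:

  // Sketch: with 'nnan', ogt-style compare+select is a legal lowering, but it
  // only matches the fmax-based reduction when no element is NaN.
  double reduce_fmax_nnan(const double *v, int n) {
    double r = v[0];
    for (int i = 1; i < n; ++i)
      r = (v[i] > r) ? v[i] : r;
    return r;
  }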

(Note that D67507 was proposed to update the LangRef to acknowledge the more
recent IEEE-754 2019 standard, but that patch seems to have stalled. If we do
update based on the new standard, the reduction intrinsics can seamlessly
inherit from whatever updates are made to the max/min intrinsics.)

x86 sees a regression here on 'nnan' tests because we have underlying,
longstanding bugs in FMF creation/propagation. Those need to be fixed apart
from this change (for example: https://llvm.org/PR35538). The expansion
sequence before this patch may not have been correct.

Differential Revision: https://reviews.llvm.org/D87391

Added: 
    

Modified: 
    llvm/docs/LangRef.rst
    llvm/include/llvm/CodeGen/BasicTTIImpl.h
    llvm/lib/CodeGen/ExpandReductions.cpp
    llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
    llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
    llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
    llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
    llvm/lib/Target/ARM/ARMTargetTransformInfo.h
    llvm/test/CodeGen/AArch64/vecreduce-fmax-legalization-nan.ll
    llvm/test/CodeGen/AArch64/vecreduce-fmax-legalization.ll
    llvm/test/CodeGen/Generic/expand-experimental-reductions.ll
    llvm/test/CodeGen/Thumb2/mve-vecreduce-fminmax.ll
    llvm/test/CodeGen/Thumb2/mve-vecreduce-loops.ll
    llvm/test/CodeGen/X86/vector-reduce-fmax-nnan.ll
    llvm/test/CodeGen/X86/vector-reduce-fmax.ll
    llvm/test/CodeGen/X86/vector-reduce-fmin-nnan.ll
    llvm/test/CodeGen/X86/vector-reduce-fmin.ll

Removed: 
    


################################################################################
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 781b2385de50..5e35b913bef4 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -15824,7 +15824,12 @@ The '``llvm.experimental.vector.reduce.fmax.*``' intrinsics do a floating-point
 ``MAX`` reduction of a vector, returning the result as a scalar. The return type
 matches the element-type of the vector input.
 
-If the intrinsic call has the ``nnan`` fast-math flag then the operation can
+This instruction has the same comparison semantics as the '``llvm.maxnum.*``'
+intrinsic. That is, the result will always be a number unless all elements of
+the vector are NaN. For a vector with maximum element magnitude 0.0 and
+containing both +0.0 and -0.0 elements, the sign of the result is unspecified.
+
+If the intrinsic call has the ``nnan`` fast-math flag, then the operation can
 assume that NaNs are not present in the input vector.
 
 Arguments:
@@ -15850,7 +15855,12 @@ The '``llvm.experimental.vector.reduce.fmin.*``' intrinsics do a floating-point
 ``MIN`` reduction of a vector, returning the result as a scalar. The return type
 matches the element-type of the vector input.
 
-If the intrinsic call has the ``nnan`` fast-math flag then the operation can
+This instruction has the same comparison semantics as the '``llvm.minnum.*``'
+intrinsic. That is, the result will always be a number unless all elements of
+the vector are NaN. For a vector with minimum element magnitude 0.0 and
+containing both +0.0 and -0.0 elements, the sign of the result is unspecified.
+
+If the intrinsic call has the ``nnan`` fast-math flag, then the operation can
 assume that NaNs are not present in the input vector.
 
 Arguments:

diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index 2b72dc3490d7..d5c0b83ea6f7 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -1349,13 +1349,9 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
       break;
     case Intrinsic::minnum:
       ISDs.push_back(ISD::FMINNUM);
-      if (FMF.noNaNs())
-        ISDs.push_back(ISD::FMINIMUM);
       break;
     case Intrinsic::maxnum:
       ISDs.push_back(ISD::FMAXNUM);
-      if (FMF.noNaNs())
-        ISDs.push_back(ISD::FMAXIMUM);
       break;
     case Intrinsic::copysign:
       ISDs.push_back(ISD::FCOPYSIGN);

diff --git a/llvm/lib/CodeGen/ExpandReductions.cpp b/llvm/lib/CodeGen/ExpandReductions.cpp
index 45f21c1085dd..dfaaafaf811f 100644
--- a/llvm/lib/CodeGen/ExpandReductions.cpp
+++ b/llvm/lib/CodeGen/ExpandReductions.cpp
@@ -143,12 +143,24 @@ bool expandReductions(Function &F, const TargetTransformInfo *TTI) {
     case Intrinsic::experimental_vector_reduce_smax:
     case Intrinsic::experimental_vector_reduce_smin:
     case Intrinsic::experimental_vector_reduce_umax:
-    case Intrinsic::experimental_vector_reduce_umin:
+    case Intrinsic::experimental_vector_reduce_umin: {
+      Value *Vec = II->getArgOperand(0);
+      if (!isPowerOf2_32(
+              cast<FixedVectorType>(Vec->getType())->getNumElements()))
+        continue;
+
+      Rdx = getShuffleReduction(Builder, Vec, getOpcode(ID), MRK);
+      break;
+    }
     case Intrinsic::experimental_vector_reduce_fmax:
     case Intrinsic::experimental_vector_reduce_fmin: {
+      // FIXME: We only expand 'fast' reductions here because the underlying
+      //        code in createMinMaxOp() assumes that comparisons use 'fast'
+      //        semantics.
       Value *Vec = II->getArgOperand(0);
       if (!isPowerOf2_32(
-              cast<FixedVectorType>(Vec->getType())->getNumElements()))
+              cast<FixedVectorType>(Vec->getType())->getNumElements()) ||
+          !FMF.isFast())
         continue;
 
       Rdx = getShuffleReduction(Builder, Vec, getOpcode(ID), MRK);

diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
index 764472e570c0..509ae2c6bdcb 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
@@ -2146,7 +2146,6 @@ SDValue DAGTypeLegalizer::SplitVecOp_VECREDUCE(SDNode *N, unsigned OpNo) {
   EVT LoOpVT, HiOpVT;
   std::tie(LoOpVT, HiOpVT) = DAG.GetSplitDestVTs(VecVT);
 
-  bool NoNaN = N->getFlags().hasNoNaNs();
   unsigned CombineOpc = 0;
   switch (N->getOpcode()) {
   case ISD::VECREDUCE_FADD: CombineOpc = ISD::FADD; break;
@@ -2160,12 +2159,8 @@ SDValue DAGTypeLegalizer::SplitVecOp_VECREDUCE(SDNode *N, unsigned OpNo) {
   case ISD::VECREDUCE_SMIN: CombineOpc = ISD::SMIN; break;
   case ISD::VECREDUCE_UMAX: CombineOpc = ISD::UMAX; break;
   case ISD::VECREDUCE_UMIN: CombineOpc = ISD::UMIN; break;
-  case ISD::VECREDUCE_FMAX:
-    CombineOpc = NoNaN ? ISD::FMAXNUM : ISD::FMAXIMUM;
-    break;
-  case ISD::VECREDUCE_FMIN:
-    CombineOpc = NoNaN ? ISD::FMINNUM : ISD::FMINIMUM;
-    break;
+  case ISD::VECREDUCE_FMAX: CombineOpc = ISD::FMAXNUM; break;
+  case ISD::VECREDUCE_FMIN: CombineOpc = ISD::FMINNUM; break;
   default:
     llvm_unreachable("Unexpected reduce ISD node");
   }
@@ -4771,6 +4766,7 @@ SDValue DAGTypeLegalizer::WidenVecOp_VECREDUCE(SDNode *N) {
   EVT OrigVT = N->getOperand(0).getValueType();
   EVT WideVT = Op.getValueType();
   EVT ElemVT = OrigVT.getVectorElementType();
+  SDNodeFlags Flags = N->getFlags();
 
   SDValue NeutralElem;
   switch (N->getOpcode()) {
@@ -4802,12 +4798,18 @@ SDValue DAGTypeLegalizer::WidenVecOp_VECREDUCE(SDNode *N) {
     NeutralElem = DAG.getConstantFP(1.0, dl, ElemVT);
     break;
   case ISD::VECREDUCE_FMAX:
+    // This has maxnum semantics, so NaN represents missing data. We must clear
+    // 'nnan' if it was set because the NaN would be a poison value.
     NeutralElem = DAG.getConstantFP(
-        -std::numeric_limits<double>::infinity(), dl, ElemVT);
+        std::numeric_limits<double>::quiet_NaN(), dl, ElemVT);
+    Flags.setNoNaNs(false);
     break;
   case ISD::VECREDUCE_FMIN:
+    // This has minnum semantics, so NaN represents missing data. We must clear
+    // 'nnan' if it was set because the NaN would be a poison value.
     NeutralElem = DAG.getConstantFP(
-        std::numeric_limits<double>::infinity(), dl, ElemVT);
+        std::numeric_limits<double>::quiet_NaN(), dl, ElemVT);
+    Flags.setNoNaNs(false);
     break;
   }
 
@@ -4818,7 +4820,7 @@ SDValue DAGTypeLegalizer::WidenVecOp_VECREDUCE(SDNode *N) {
     Op = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, WideVT, Op, NeutralElem,
                      DAG.getVectorIdxConstant(Idx, dl));
 
-  return DAG.getNode(N->getOpcode(), dl, N->getValueType(0), Op, N->getFlags());
+  return DAG.getNode(N->getOpcode(), dl, N->getValueType(0), Op, Flags);
 }
 
 SDValue DAGTypeLegalizer::WidenVecOp_VSELECT(SDNode *N) {

diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index a80ca04921f4..ea2344e4f551 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -7934,7 +7934,6 @@ bool TargetLowering::expandMULO(SDNode *Node, SDValue &Result,
 
 SDValue TargetLowering::expandVecReduce(SDNode *Node, SelectionDAG &DAG) const {
   SDLoc dl(Node);
-  bool NoNaN = Node->getFlags().hasNoNaNs();
   unsigned BaseOpcode = 0;
   switch (Node->getOpcode()) {
   default: llvm_unreachable("Expected VECREDUCE opcode");
@@ -7949,12 +7948,8 @@ SDValue TargetLowering::expandVecReduce(SDNode *Node, SelectionDAG &DAG) const {
   case ISD::VECREDUCE_SMIN: BaseOpcode = ISD::SMIN; break;
   case ISD::VECREDUCE_UMAX: BaseOpcode = ISD::UMAX; break;
   case ISD::VECREDUCE_UMIN: BaseOpcode = ISD::UMIN; break;
-  case ISD::VECREDUCE_FMAX:
-    BaseOpcode = NoNaN ? ISD::FMAXNUM : ISD::FMAXIMUM;
-    break;
-  case ISD::VECREDUCE_FMIN:
-    BaseOpcode = NoNaN ? ISD::FMINNUM : ISD::FMINIMUM;
-    break;
+  case ISD::VECREDUCE_FMAX: BaseOpcode = ISD::FMAXNUM; break;
+  case ISD::VECREDUCE_FMIN: BaseOpcode = ISD::FMINNUM; break;
   }
 
   SDValue Op = Node->getOperand(0);

diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index d4f324490430..6745b848f0ed 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -9529,14 +9529,12 @@ SDValue AArch64TargetLowering::LowerVECREDUCE(SDValue Op,
   case ISD::VECREDUCE_UMIN:
     return getReductionSDNode(AArch64ISD::UMINV, dl, Op, DAG);
   case ISD::VECREDUCE_FMAX: {
-    assert(Op->getFlags().hasNoNaNs() && "fmax vector reduction needs NoNaN flag");
     return DAG.getNode(
         ISD::INTRINSIC_WO_CHAIN, dl, Op.getValueType(),
         DAG.getConstant(Intrinsic::aarch64_neon_fmaxnmv, dl, MVT::i32),
         Op.getOperand(0));
   }
   case ISD::VECREDUCE_FMIN: {
-    assert(Op->getFlags().hasNoNaNs() && "fmin vector reduction needs NoNaN flag");
     return DAG.getNode(
         ISD::INTRINSIC_WO_CHAIN, dl, Op.getValueType(),
         DAG.getConstant(Intrinsic::aarch64_neon_fminnmv, dl, MVT::i32),

diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index 05b7f70f2335..3c3a246b90a1 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -223,11 +223,6 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
       // We don't have legalization support for ordered FP reductions.
       return !II->getFastMathFlags().allowReassoc();
 
-    case Intrinsic::experimental_vector_reduce_fmax:
-    case Intrinsic::experimental_vector_reduce_fmin:
-      // Lowering asserts that there are no NaNs.
-      return !II->getFastMathFlags().noNaNs();
-
     default:
       // Don't expand anything else, let legalization deal with it.
       return false;

diff --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.h b/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
index cc2019b47a07..508bb9e21d3a 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
@@ -201,10 +201,8 @@ class ARMTTIImpl : public BasicTTIImplBase<ARMTTIImpl> {
 
     case Intrinsic::experimental_vector_reduce_fmin:
     case Intrinsic::experimental_vector_reduce_fmax:
-      // Can't legalize reductions with soft floats, and NoNan will create
-      // fminimum which we do not know how to lower.
-      return TLI->useSoftFloat() || !TLI->getSubtarget()->hasFPRegs() ||
-             !II->getFastMathFlags().noNaNs();
+      // Can't legalize reductions with soft floats.
+      return TLI->useSoftFloat() || !TLI->getSubtarget()->hasFPRegs();
 
     default:
       // Don't expand anything else, let legalization deal with it.

diff --git a/llvm/test/CodeGen/AArch64/vecreduce-fmax-legalization-nan.ll b/llvm/test/CodeGen/AArch64/vecreduce-fmax-legalization-nan.ll
index 4d888317b343..514a43a5e171 100644
--- a/llvm/test/CodeGen/AArch64/vecreduce-fmax-legalization-nan.ll
+++ b/llvm/test/CodeGen/AArch64/vecreduce-fmax-legalization-nan.ll
@@ -54,19 +54,7 @@ define fp128 @test_v1f128(<1 x fp128> %a) nounwind {
 define fp128 @test_v2f128(<2 x fp128> %a) nounwind {
 ; CHECK-LABEL: test_v2f128:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    sub sp, sp, #48 // =48
-; CHECK-NEXT:    str x30, [sp, #32] // 8-byte Folded Spill
-; CHECK-NEXT:    stp q0, q1, [sp] // 32-byte Folded Spill
-; CHECK-NEXT:    bl __gttf2
-; CHECK-NEXT:    ldr q0, [sp, #16] // 16-byte Folded Reload
-; CHECK-NEXT:    cmp w0, #0 // =0
-; CHECK-NEXT:    b.le .LBB4_2
-; CHECK-NEXT:  // %bb.1:
-; CHECK-NEXT:    ldr q0, [sp] // 16-byte Folded Reload
-; CHECK-NEXT:  .LBB4_2:
-; CHECK-NEXT:    ldr x30, [sp, #32] // 8-byte Folded Reload
-; CHECK-NEXT:    add sp, sp, #48 // =48
-; CHECK-NEXT:    ret
+; CHECK-NEXT:    b fmaxl
   %b = call fp128 @llvm.experimental.vector.reduce.fmax.v2f128(<2 x fp128> %a)
   ret fp128 %b
 }
@@ -77,11 +65,7 @@ define float @test_v16f32(<16 x float> %a) nounwind {
 ; CHECK-NEXT:    fmaxnm v1.4s, v1.4s, v3.4s
 ; CHECK-NEXT:    fmaxnm v0.4s, v0.4s, v2.4s
 ; CHECK-NEXT:    fmaxnm v0.4s, v0.4s, v1.4s
-; CHECK-NEXT:    dup v1.2d, v0.d[1]
-; CHECK-NEXT:    fmaxnm v0.4s, v0.4s, v1.4s
-; CHECK-NEXT:    dup v1.4s, v0.s[1]
-; CHECK-NEXT:    fmaxnm v0.4s, v0.4s, v1.4s
-; CHECK-NEXT:    // kill: def $s0 killed $s0 killed $q0
+; CHECK-NEXT:    fmaxnmv s0, v0.4s
 ; CHECK-NEXT:    ret
   %b = call float @llvm.experimental.vector.reduce.fmax.v16f32(<16 x float> %a)
   ret float %b

diff --git a/llvm/test/CodeGen/AArch64/vecreduce-fmax-legalization.ll b/llvm/test/CodeGen/AArch64/vecreduce-fmax-legalization.ll
index 975ba2687792..7d6d424d64a9 100644
--- a/llvm/test/CodeGen/AArch64/vecreduce-fmax-legalization.ll
+++ b/llvm/test/CodeGen/AArch64/vecreduce-fmax-legalization.ll
@@ -47,7 +47,7 @@ define fp128 @test_v1f128(<1 x fp128> %a) nounwind {
 define float @test_v3f32(<3 x float> %a) nounwind {
 ; CHECK-LABEL: test_v3f32:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    mov w8, #-8388608
+; CHECK-NEXT:    mov w8, #2143289344
 ; CHECK-NEXT:    fmov s1, w8
 ; CHECK-NEXT:    mov v0.s[3], v1.s[0]
 ; CHECK-NEXT:    fmaxnmv s0, v0.4s

diff --git a/llvm/test/CodeGen/Generic/expand-experimental-reductions.ll b/llvm/test/CodeGen/Generic/expand-experimental-reductions.ll
index 11abf902eeb3..e0e3149e3511 100644
--- a/llvm/test/CodeGen/Generic/expand-experimental-reductions.ll
+++ b/llvm/test/CodeGen/Generic/expand-experimental-reductions.ll
@@ -93,8 +93,8 @@ define float @fadd_f32(<4 x float> %vec) {
 ; CHECK-NEXT:    [[RDX_SHUF1:%.*]] = shufflevector <4 x float> [[BIN_RDX]], <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
 ; CHECK-NEXT:    [[BIN_RDX2:%.*]] = fadd fast <4 x float> [[BIN_RDX]], [[RDX_SHUF1]]
 ; CHECK-NEXT:    [[TMP0:%.*]] = extractelement <4 x float> [[BIN_RDX2]], i32 0
-; CHECK-NEXT:    [[TMP1:%.*]] = fadd fast float 0.000000e+00, [[TMP0]]
-; CHECK-NEXT:    ret float [[TMP1]]
+; CHECK-NEXT:    [[BIN_RDX3:%.*]] = fadd fast float 0.000000e+00, [[TMP0]]
+; CHECK-NEXT:    ret float [[BIN_RDX3]]
 ;
 entry:
   %r = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 0.0, <4 x float> %vec)
@@ -109,8 +109,8 @@ define float @fadd_f32_accum(float %accum, <4 x float> %vec) {
 ; CHECK-NEXT:    [[RDX_SHUF1:%.*]] = shufflevector <4 x float> [[BIN_RDX]], <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
 ; CHECK-NEXT:    [[BIN_RDX2:%.*]] = fadd fast <4 x float> [[BIN_RDX]], [[RDX_SHUF1]]
 ; CHECK-NEXT:    [[TMP0:%.*]] = extractelement <4 x float> [[BIN_RDX2]], i32 0
-; CHECK-NEXT:    [[TMP1:%.*]] = fadd fast float %accum, [[TMP0]]
-; CHECK-NEXT:    ret float [[TMP1]]
+; CHECK-NEXT:    [[BIN_RDX3:%.*]] = fadd fast float [[ACCUM:%.*]], [[TMP0]]
+; CHECK-NEXT:    ret float [[BIN_RDX3]]
 ;
 entry:
   %r = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %accum, <4 x float> %vec)
@@ -161,8 +161,8 @@ define float @fmul_f32(<4 x float> %vec) {
 ; CHECK-NEXT:    [[RDX_SHUF1:%.*]] = shufflevector <4 x float> [[BIN_RDX]], <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
 ; CHECK-NEXT:    [[BIN_RDX2:%.*]] = fmul fast <4 x float> [[BIN_RDX]], [[RDX_SHUF1]]
 ; CHECK-NEXT:    [[TMP0:%.*]] = extractelement <4 x float> [[BIN_RDX2]], i32 0
-; CHECK-NEXT:    [[TMP1:%.*]] = fmul fast float 1.000000e+00, [[TMP0]]
-; CHECK-NEXT:    ret float [[TMP1]]
+; CHECK-NEXT:    [[BIN_RDX3:%.*]] = fmul fast float 1.000000e+00, [[TMP0]]
+; CHECK-NEXT:    ret float [[BIN_RDX3]]
 ;
 entry:
   %r = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float 1.0, <4 x float> %vec)
@@ -177,8 +177,8 @@ define float @fmul_f32_accum(float %accum, <4 x float> %vec) {
 ; CHECK-NEXT:    [[RDX_SHUF1:%.*]] = shufflevector <4 x float> [[BIN_RDX]], <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
 ; CHECK-NEXT:    [[BIN_RDX2:%.*]] = fmul fast <4 x float> [[BIN_RDX]], [[RDX_SHUF1]]
 ; CHECK-NEXT:    [[TMP0:%.*]] = extractelement <4 x float> [[BIN_RDX2]], i32 0
-; CHECK-NEXT:    [[TMP1:%.*]] = fmul fast float %accum, [[TMP0]]
-; CHECK-NEXT:    ret float [[TMP1]]
+; CHECK-NEXT:    [[BIN_RDX3:%.*]] = fmul fast float [[ACCUM:%.*]], [[TMP0]]
+; CHECK-NEXT:    ret float [[BIN_RDX3]]
 ;
 entry:
   %r = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float %accum, <4 x float> %vec)
@@ -277,40 +277,40 @@ entry:
   ret i64 %r
 }
 
+; FIXME: Expand using maxnum intrinsic?
+
 define double @fmax_f64(<2 x double> %vec) {
 ; CHECK-LABEL: @fmax_f64(
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    [[RDX_SHUF:%.*]] = shufflevector <2 x double> [[VEC:%.*]], <2 x double> undef, <2 x i32> <i32 1, i32 undef>
-; CHECK-NEXT:    [[RDX_MINMAX_CMP:%.*]] = fcmp fast ogt <2 x double> [[VEC]], [[RDX_SHUF]]
-; CHECK-NEXT:    [[RDX_MINMAX_SELECT:%.*]] = select fast <2 x i1> [[RDX_MINMAX_CMP]], <2 x double> [[VEC]], <2 x double> [[RDX_SHUF]]
-; CHECK-NEXT:    [[TMP0:%.*]] = extractelement <2 x double> [[RDX_MINMAX_SELECT]], i32 0
-; CHECK-NEXT:    ret double [[TMP0]]
+; CHECK-NEXT:    [[R:%.*]] = call double @llvm.experimental.vector.reduce.fmax.v2f64(<2 x double> [[VEC:%.*]])
+; CHECK-NEXT:    ret double [[R]]
 ;
 entry:
   %r = call double @llvm.experimental.vector.reduce.fmax.v2f64(<2 x double> %vec)
   ret double %r
 }
 
+; FIXME: Expand using minnum intrinsic?
+
 define double @fmin_f64(<2 x double> %vec) {
 ; CHECK-LABEL: @fmin_f64(
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    [[RDX_SHUF:%.*]] = shufflevector <2 x double> [[VEC:%.*]], <2 x double> undef, <2 x i32> <i32 1, i32 undef>
-; CHECK-NEXT:    [[RDX_MINMAX_CMP:%.*]] = fcmp fast olt <2 x double> [[VEC]], [[RDX_SHUF]]
-; CHECK-NEXT:    [[RDX_MINMAX_SELECT:%.*]] = select fast <2 x i1> [[RDX_MINMAX_CMP]], <2 x double> [[VEC]], <2 x double> [[RDX_SHUF]]
-; CHECK-NEXT:    [[TMP0:%.*]] = extractelement <2 x double> [[RDX_MINMAX_SELECT]], i32 0
-; CHECK-NEXT:    ret double [[TMP0]]
+; CHECK-NEXT:    [[R:%.*]] = call double @llvm.experimental.vector.reduce.fmin.v2f64(<2 x double> [[VEC:%.*]])
+; CHECK-NEXT:    ret double [[R]]
 ;
 entry:
   %r = call double @llvm.experimental.vector.reduce.fmin.v2f64(<2 x double> %vec)
   ret double %r
 }
 
+; FIXME: Why is this not expanded?
+
 ; Test when the vector size is not power of two.
 define i8 @test_v3i8(<3 x i8> %a) nounwind {
 ; CHECK-LABEL: @test_v3i8(
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    %b = call i8 @llvm.experimental.vector.reduce.and.v3i8(<3 x i8> %a)
-; CHECK-NEXT:    ret i8 %b
+; CHECK-NEXT:    [[B:%.*]] = call i8 @llvm.experimental.vector.reduce.and.v3i8(<3 x i8> [[A:%.*]])
+; CHECK-NEXT:    ret i8 [[B]]
 ;
 entry:
   %b = call i8 @llvm.experimental.vector.reduce.and.i8.v3i8(<3 x i8> %a)

diff --git a/llvm/test/CodeGen/Thumb2/mve-vecreduce-fminmax.ll b/llvm/test/CodeGen/Thumb2/mve-vecreduce-fminmax.ll
index 6936b7ea3ad1..a83fa6882cb9 100644
--- a/llvm/test/CodeGen/Thumb2/mve-vecreduce-fminmax.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-vecreduce-fminmax.ll
@@ -2,30 +2,11 @@
 ; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp,+fp64 -verify-machineinstrs %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-FP
 ; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve,+fullfp16,+fp64 -verify-machineinstrs %s -o - | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-NOFP
 
-; FIXME minnum nonan X, +Inf -> X   ?
 define arm_aapcs_vfpcc float @fmin_v2f32(<2 x float> %x) {
-; CHECK-FP-LABEL: fmin_v2f32:
-; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vldr s4, .LCPI0_0
-; CHECK-FP-NEXT:    vminnm.f32 s0, s0, s1
-; CHECK-FP-NEXT:    vminnm.f32 s0, s0, s4
-; CHECK-FP-NEXT:    bx lr
-; CHECK-FP-NEXT:    .p2align 2
-; CHECK-FP-NEXT:  @ %bb.1:
-; CHECK-FP-NEXT:  .LCPI0_0:
-; CHECK-FP-NEXT:    .long 0x7f800000 @ float +Inf
-;
-; CHECK-NOFP-LABEL: fmin_v2f32:
-; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vldr s4, .LCPI0_0
-; CHECK-NOFP-NEXT:    vminnm.f32 s0, s0, s1
-; CHECK-NOFP-NEXT:    vminnm.f32 s0, s0, s4
-; CHECK-NOFP-NEXT:    vminnm.f32 s0, s0, s4
-; CHECK-NOFP-NEXT:    bx lr
-; CHECK-NOFP-NEXT:    .p2align 2
-; CHECK-NOFP-NEXT:  @ %bb.1:
-; CHECK-NOFP-NEXT:  .LCPI0_0:
-; CHECK-NOFP-NEXT:    .long 0x7f800000 @ float +Inf
+; CHECK-LABEL: fmin_v2f32:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vminnm.f32 s0, s0, s1
+; CHECK-NEXT:    bx lr
 entry:
   %z = call fast float @llvm.experimental.vector.reduce.fmin.v2f32(<2 x float> %x)
   ret float %z
@@ -99,17 +80,8 @@ define arm_aapcs_vfpcc half @fmin_v4f16(<4 x half> %x) {
 ; CHECK-NOFP-NEXT:    vminnm.f16 s4, s0, s4
 ; CHECK-NOFP-NEXT:    vmovx.f16 s0, s1
 ; CHECK-NOFP-NEXT:    vminnm.f16 s4, s4, s1
-; CHECK-NOFP-NEXT:    vldr.16 s2, .LCPI3_0
 ; CHECK-NOFP-NEXT:    vminnm.f16 s0, s4, s0
-; CHECK-NOFP-NEXT:    vminnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vminnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vminnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vminnm.f16 s0, s0, s2
 ; CHECK-NOFP-NEXT:    bx lr
-; CHECK-NOFP-NEXT:    .p2align 1
-; CHECK-NOFP-NEXT:  @ %bb.1:
-; CHECK-NOFP-NEXT:  .LCPI3_0:
-; CHECK-NOFP-NEXT:    .short 0x7c00 @ half +Inf
 entry:
   %z = call fast half @llvm.experimental.vector.reduce.fmin.v4f16(<4 x half> %x)
   ret half %z
@@ -237,23 +209,11 @@ entry:
   ret double %z
 }
 
-; FIXME should not be vminnm
-; FIXME better reductions (no vmovs/vdups)
 define arm_aapcs_vfpcc float @fmin_v2f32_nofast(<2 x float> %x) {
-; CHECK-FP-LABEL: fmin_v2f32_nofast:
-; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r0
-; CHECK-FP-NEXT:    vminnm.f32 q0, q0, q1
-; CHECK-FP-NEXT:    @ kill: def $s0 killed $s0 killed $q0
-; CHECK-FP-NEXT:    bx lr
-;
-; CHECK-NOFP-LABEL: fmin_v2f32_nofast:
-; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vcmp.f32 s1, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s1
-; CHECK-NOFP-NEXT:    bx lr
+; CHECK-LABEL: fmin_v2f32_nofast:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vminnm.f32 s0, s0, s1
+; CHECK-NEXT:    bx lr
 entry:
   %z = call float @llvm.experimental.vector.reduce.fmin.v2f32(<2 x float> %x)
   ret float %z
@@ -262,28 +222,16 @@ entry:
 define arm_aapcs_vfpcc float @fmin_v4f32_nofast(<4 x float> %x) {
 ; CHECK-FP-LABEL: fmin_v4f32_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmov.f64 d2, d1
-; CHECK-FP-NEXT:    vmov.f32 s5, s3
-; CHECK-FP-NEXT:    vminnm.f32 q0, q0, q1
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r0
-; CHECK-FP-NEXT:    vminnm.f32 q0, q0, q1
-; CHECK-FP-NEXT:    @ kill: def $s0 killed $s0 killed $q0
+; CHECK-FP-NEXT:    vminnm.f32 s4, s2, s3
+; CHECK-FP-NEXT:    vminnm.f32 s0, s0, s1
+; CHECK-FP-NEXT:    vminnm.f32 s0, s0, s4
 ; CHECK-FP-NEXT:    bx lr
 ;
 ; CHECK-NOFP-LABEL: fmin_v4f32_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vcmp.f32 s3, s1
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vmov.f64 d2, d1
-; CHECK-NOFP-NEXT:    vmov.f32 s5, s3
-; CHECK-NOFP-NEXT:    vcmp.f32 s4, s0
-; CHECK-NOFP-NEXT:    vselgt.f32 s8, s1, s3
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s4
-; CHECK-NOFP-NEXT:    vcmp.f32 s8, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s8
+; CHECK-NOFP-NEXT:    vminnm.f32 s4, s0, s1
+; CHECK-NOFP-NEXT:    vminnm.f32 s4, s4, s2
+; CHECK-NOFP-NEXT:    vminnm.f32 s0, s4, s3
 ; CHECK-NOFP-NEXT:    bx lr
 entry:
   %z = call float @llvm.experimental.vector.reduce.fmin.v4f32(<4 x float> %x)
@@ -294,38 +242,20 @@ define arm_aapcs_vfpcc float @fmin_v8f32_nofast(<8 x float> %x) {
 ; CHECK-FP-LABEL: fmin_v8f32_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
 ; CHECK-FP-NEXT:    vminnm.f32 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.f64 d2, d1
-; CHECK-FP-NEXT:    vmov.f32 s5, s3
-; CHECK-FP-NEXT:    vminnm.f32 q0, q0, q1
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r0
-; CHECK-FP-NEXT:    vminnm.f32 q0, q0, q1
-; CHECK-FP-NEXT:    @ kill: def $s0 killed $s0 killed $q0
+; CHECK-FP-NEXT:    vminnm.f32 s4, s2, s3
+; CHECK-FP-NEXT:    vminnm.f32 s0, s0, s1
+; CHECK-FP-NEXT:    vminnm.f32 s0, s0, s4
 ; CHECK-FP-NEXT:    bx lr
 ;
 ; CHECK-NOFP-LABEL: fmin_v8f32_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vcmp.f32 s7, s3
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s5, s1
-; CHECK-NOFP-NEXT:    vselgt.f32 s8, s3, s7
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s6, s2
-; CHECK-NOFP-NEXT:    vselgt.f32 s10, s1, s5
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s4, s0
-; CHECK-NOFP-NEXT:    vselgt.f32 s12, s2, s6
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s8, s10
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s4
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s12, s0
-; CHECK-NOFP-NEXT:    vselgt.f32 s2, s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s12
-; CHECK-NOFP-NEXT:    vcmp.f32 s2, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s2
+; CHECK-NOFP-NEXT:    vminnm.f32 s10, s0, s4
+; CHECK-NOFP-NEXT:    vminnm.f32 s8, s1, s5
+; CHECK-NOFP-NEXT:    vminnm.f32 s8, s10, s8
+; CHECK-NOFP-NEXT:    vminnm.f32 s10, s2, s6
+; CHECK-NOFP-NEXT:    vminnm.f32 s8, s8, s10
+; CHECK-NOFP-NEXT:    vminnm.f32 s0, s3, s7
+; CHECK-NOFP-NEXT:    vminnm.f32 s0, s8, s0
 ; CHECK-NOFP-NEXT:    bx lr
 entry:
   %z = call float @llvm.experimental.vector.reduce.fmin.v8f32(<8 x float> %x)
@@ -335,30 +265,20 @@ entry:
 define arm_aapcs_vfpcc half @fmin_v4f16_nofast(<4 x half> %x) {
 ; CHECK-FP-LABEL: fmin_v4f16_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r0
-; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.u16 r0, q0[1]
-; CHECK-FP-NEXT:    vdup.16 q1, r0
-; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    @ kill: def $s0 killed $s0 killed $q0
+; CHECK-FP-NEXT:    vmovx.f16 s4, s1
+; CHECK-FP-NEXT:    vmovx.f16 s6, s0
+; CHECK-FP-NEXT:    vminnm.f16 s4, s1, s4
+; CHECK-FP-NEXT:    vminnm.f16 s0, s0, s6
+; CHECK-FP-NEXT:    vminnm.f16 s0, s0, s4
 ; CHECK-FP-NEXT:    bx lr
 ;
 ; CHECK-NOFP-LABEL: fmin_v4f16_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vmov r0, s1
-; CHECK-NOFP-NEXT:    vmovx.f16 s10, s0
-; CHECK-NOFP-NEXT:    vdup.32 q1, r0
-; CHECK-NOFP-NEXT:    vmovx.f16 s8, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s4, s0
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s8
+; CHECK-NOFP-NEXT:    vmovx.f16 s4, s0
+; CHECK-NOFP-NEXT:    vminnm.f16 s4, s0, s4
+; CHECK-NOFP-NEXT:    vmovx.f16 s0, s1
+; CHECK-NOFP-NEXT:    vminnm.f16 s4, s4, s1
+; CHECK-NOFP-NEXT:    vminnm.f16 s0, s4, s0
 ; CHECK-NOFP-NEXT:    bx lr
 entry:
   %z = call half @llvm.experimental.vector.reduce.fmin.v4f16(<4 x half> %x)
@@ -368,47 +288,26 @@ entry:
 define arm_aapcs_vfpcc half @fmin_v8f16_nofast(<8 x half> %x) {
 ; CHECK-FP-LABEL: fmin_v8f16_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmov.f64 d2, d1
-; CHECK-FP-NEXT:    vmov.f32 s5, s3
-; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r0
-; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.u16 r0, q0[1]
-; CHECK-FP-NEXT:    vdup.16 q1, r0
+; CHECK-FP-NEXT:    vrev32.16 q1, q0
 ; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    @ kill: def $s0 killed $s0 killed $q0
+; CHECK-FP-NEXT:    vminnm.f16 s4, s2, s3
+; CHECK-FP-NEXT:    vminnm.f16 s0, s0, s1
+; CHECK-FP-NEXT:    vminnm.f16 s0, s0, s4
 ; CHECK-FP-NEXT:    bx lr
 ;
 ; CHECK-NOFP-LABEL: fmin_v8f16_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vmovx.f16 s8, s3
-; CHECK-NOFP-NEXT:    vmovx.f16 s10, s1
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s10
-; CHECK-NOFP-NEXT:    vmov.f64 d2, d1
-; CHECK-NOFP-NEXT:    vmovx.f16 s12, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vmov.f32 s5, s3
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
-; CHECK-NOFP-NEXT:    vmovx.f16 s10, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s12
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s3, s1
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s4, s0
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s1, s3
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s10
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s8
+; CHECK-NOFP-NEXT:    vmovx.f16 s4, s0
+; CHECK-NOFP-NEXT:    vmovx.f16 s6, s1
+; CHECK-NOFP-NEXT:    vminnm.f16 s4, s0, s4
+; CHECK-NOFP-NEXT:    vmovx.f16 s0, s3
+; CHECK-NOFP-NEXT:    vminnm.f16 s4, s4, s1
+; CHECK-NOFP-NEXT:    vminnm.f16 s4, s4, s6
+; CHECK-NOFP-NEXT:    vmovx.f16 s6, s2
+; CHECK-NOFP-NEXT:    vminnm.f16 s4, s4, s2
+; CHECK-NOFP-NEXT:    vminnm.f16 s4, s4, s6
+; CHECK-NOFP-NEXT:    vminnm.f16 s4, s4, s3
+; CHECK-NOFP-NEXT:    vminnm.f16 s0, s4, s0
 ; CHECK-NOFP-NEXT:    bx lr
 entry:
   %z = call half @llvm.experimental.vector.reduce.fmin.v8f16(<8 x half> %x)
@@ -419,73 +318,38 @@ define arm_aapcs_vfpcc half @fmin_v16f16_nofast(<16 x half> %x) {
 ; CHECK-FP-LABEL: fmin_v16f16_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
 ; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.f64 d2, d1
-; CHECK-FP-NEXT:    vmov.f32 s5, s3
-; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r0
-; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.u16 r0, q0[1]
-; CHECK-FP-NEXT:    vdup.16 q1, r0
+; CHECK-FP-NEXT:    vrev32.16 q1, q0
 ; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    @ kill: def $s0 killed $s0 killed $q0
+; CHECK-FP-NEXT:    vminnm.f16 s4, s2, s3
+; CHECK-FP-NEXT:    vminnm.f16 s0, s0, s1
+; CHECK-FP-NEXT:    vminnm.f16 s0, s0, s4
 ; CHECK-FP-NEXT:    bx lr
 ;
 ; CHECK-NOFP-LABEL: fmin_v16f16_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vmovx.f16 s8, s7
-; CHECK-NOFP-NEXT:    vmovx.f16 s10, s3
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s10
-; CHECK-NOFP-NEXT:    vmovx.f16 s12, s1
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vmovx.f16 s14, s0
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
+; CHECK-NOFP-NEXT:    vmovx.f16 s8, s4
+; CHECK-NOFP-NEXT:    vmovx.f16 s10, s0
+; CHECK-NOFP-NEXT:    vminnm.f16 s8, s10, s8
+; CHECK-NOFP-NEXT:    vminnm.f16 s10, s0, s4
+; CHECK-NOFP-NEXT:    vminnm.f16 s8, s10, s8
+; CHECK-NOFP-NEXT:    vminnm.f16 s10, s1, s5
+; CHECK-NOFP-NEXT:    vminnm.f16 s8, s8, s10
 ; CHECK-NOFP-NEXT:    vmovx.f16 s10, s5
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s12
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
+; CHECK-NOFP-NEXT:    vmovx.f16 s12, s1
+; CHECK-NOFP-NEXT:    vmovx.f16 s4, s7
+; CHECK-NOFP-NEXT:    vminnm.f16 s10, s12, s10
 ; CHECK-NOFP-NEXT:    vmovx.f16 s12, s2
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
+; CHECK-NOFP-NEXT:    vminnm.f16 s8, s8, s10
+; CHECK-NOFP-NEXT:    vminnm.f16 s10, s2, s6
+; CHECK-NOFP-NEXT:    vminnm.f16 s8, s8, s10
 ; CHECK-NOFP-NEXT:    vmovx.f16 s10, s6
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s12
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
-; CHECK-NOFP-NEXT:    vmovx.f16 s12, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s12, s14
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s12, s14, s12
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s12
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s7, s3
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s5, s1
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s3, s7
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s12, s1, s5
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s12
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s6, s2
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s4, s0
-; CHECK-NOFP-NEXT:    vselgt.f16 s12, s2, s6
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s12, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s12
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s10
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s8
+; CHECK-NOFP-NEXT:    vminnm.f16 s10, s12, s10
+; CHECK-NOFP-NEXT:    vmovx.f16 s0, s3
+; CHECK-NOFP-NEXT:    vminnm.f16 s8, s8, s10
+; CHECK-NOFP-NEXT:    vminnm.f16 s10, s3, s7
+; CHECK-NOFP-NEXT:    vminnm.f16 s8, s8, s10
+; CHECK-NOFP-NEXT:    vminnm.f16 s0, s0, s4
+; CHECK-NOFP-NEXT:    vminnm.f16 s0, s8, s0
 ; CHECK-NOFP-NEXT:    bx lr
 entry:
   %z = call half @llvm.experimental.vector.reduce.fmin.v16f16(<16 x half> %x)
@@ -504,9 +368,7 @@ entry:
 define arm_aapcs_vfpcc double @fmin_v2f64_nofast(<2 x double> %x) {
 ; CHECK-LABEL: fmin_v2f64_nofast:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    vcmp.f64 d1, d0
-; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vselgt.f64 d0, d0, d1
+; CHECK-NEXT:    vminnm.f64 d0, d0, d1
 ; CHECK-NEXT:    bx lr
 entry:
   %z = call double @llvm.experimental.vector.reduce.fmin.v2f64(<2 x double> %x)
@@ -516,15 +378,9 @@ entry:
 define arm_aapcs_vfpcc double @fmin_v4f64_nofast(<4 x double> %x) {
 ; CHECK-LABEL: fmin_v4f64_nofast:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    vcmp.f64 d3, d1
-; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vcmp.f64 d2, d0
-; CHECK-NEXT:    vselgt.f64 d4, d1, d3
-; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vselgt.f64 d0, d0, d2
-; CHECK-NEXT:    vcmp.f64 d4, d0
-; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vselgt.f64 d0, d0, d4
+; CHECK-NEXT:    vminnm.f64 d4, d1, d3
+; CHECK-NEXT:    vminnm.f64 d0, d0, d2
+; CHECK-NEXT:    vminnm.f64 d0, d0, d4
 ; CHECK-NEXT:    bx lr
 entry:
   %z = call double @llvm.experimental.vector.reduce.fmin.v4f64(<4 x double> %x)
@@ -532,30 +388,11 @@ entry:
 }
 
 define arm_aapcs_vfpcc float @fmin_v2f32_acc(<2 x float> %x, float %y) {
-; CHECK-FP-LABEL: fmin_v2f32_acc:
-; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vldr s6, .LCPI18_0
-; CHECK-FP-NEXT:    vminnm.f32 s0, s0, s1
-; CHECK-FP-NEXT:    vminnm.f32 s0, s0, s6
-; CHECK-FP-NEXT:    vminnm.f32 s0, s4, s0
-; CHECK-FP-NEXT:    bx lr
-; CHECK-FP-NEXT:    .p2align 2
-; CHECK-FP-NEXT:  @ %bb.1:
-; CHECK-FP-NEXT:  .LCPI18_0:
-; CHECK-FP-NEXT:    .long 0x7f800000 @ float +Inf
-;
-; CHECK-NOFP-LABEL: fmin_v2f32_acc:
-; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vldr s6, .LCPI18_0
-; CHECK-NOFP-NEXT:    vminnm.f32 s0, s0, s1
-; CHECK-NOFP-NEXT:    vminnm.f32 s0, s0, s6
-; CHECK-NOFP-NEXT:    vminnm.f32 s0, s0, s6
-; CHECK-NOFP-NEXT:    vminnm.f32 s0, s4, s0
-; CHECK-NOFP-NEXT:    bx lr
-; CHECK-NOFP-NEXT:    .p2align 2
-; CHECK-NOFP-NEXT:  @ %bb.1:
-; CHECK-NOFP-NEXT:  .LCPI18_0:
-; CHECK-NOFP-NEXT:    .long 0x7f800000 @ float +Inf
+; CHECK-LABEL: fmin_v2f32_acc:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vminnm.f32 s0, s0, s1
+; CHECK-NEXT:    vminnm.f32 s0, s4, s0
+; CHECK-NEXT:    bx lr
 entry:
   %z = call fast float @llvm.experimental.vector.reduce.fmin.v2f32(<2 x float> %x)
   %c = fcmp fast olt float %y, %z
@@ -641,20 +478,11 @@ define arm_aapcs_vfpcc void @fmin_v4f16_acc(<4 x half> %x, half* %yy) {
 ; CHECK-NOFP-NEXT:    vminnm.f16 s4, s0, s4
 ; CHECK-NOFP-NEXT:    vmovx.f16 s0, s1
 ; CHECK-NOFP-NEXT:    vminnm.f16 s4, s4, s1
-; CHECK-NOFP-NEXT:    vldr.16 s2, .LCPI21_0
-; CHECK-NOFP-NEXT:    vminnm.f16 s0, s4, s0
-; CHECK-NOFP-NEXT:    vminnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vminnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vminnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vminnm.f16 s0, s0, s2
 ; CHECK-NOFP-NEXT:    vldr.16 s2, [r0]
+; CHECK-NOFP-NEXT:    vminnm.f16 s0, s4, s0
 ; CHECK-NOFP-NEXT:    vminnm.f16 s0, s2, s0
 ; CHECK-NOFP-NEXT:    vstr.16 s0, [r0]
 ; CHECK-NOFP-NEXT:    bx lr
-; CHECK-NOFP-NEXT:    .p2align 1
-; CHECK-NOFP-NEXT:  @ %bb.1:
-; CHECK-NOFP-NEXT:  .LCPI21_0:
-; CHECK-NOFP-NEXT:    .short 0x7c00 @ half +Inf
 entry:
   %y = load half, half* %yy
   %z = call fast half @llvm.experimental.vector.reduce.fmin.v4f16(<4 x half> %x)
@@ -665,34 +493,14 @@ entry:
 }
 
 define arm_aapcs_vfpcc void @fmin_v2f16_acc(<2 x half> %x, half* %yy) {
-; CHECK-FP-LABEL: fmin_v2f16_acc:
-; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmovx.f16 s4, s0
-; CHECK-FP-NEXT:    vminnm.f16 s0, s0, s4
-; CHECK-FP-NEXT:    vldr.16 s2, [r0]
-; CHECK-FP-NEXT:    vminnm.f16 s0, s2, s0
-; CHECK-FP-NEXT:    vstr.16 s0, [r0]
-; CHECK-FP-NEXT:    bx lr
-;
-; CHECK-NOFP-LABEL: fmin_v2f16_acc:
-; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vmovx.f16 s4, s0
-; CHECK-NOFP-NEXT:    vminnm.f16 s0, s0, s4
-; CHECK-NOFP-NEXT:    vldr.16 s2, .LCPI22_0
-; CHECK-NOFP-NEXT:    vminnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vminnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vminnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vminnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vminnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vminnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vldr.16 s2, [r0]
-; CHECK-NOFP-NEXT:    vminnm.f16 s0, s2, s0
-; CHECK-NOFP-NEXT:    vstr.16 s0, [r0]
-; CHECK-NOFP-NEXT:    bx lr
-; CHECK-NOFP-NEXT:    .p2align 1
-; CHECK-NOFP-NEXT:  @ %bb.1:
-; CHECK-NOFP-NEXT:  .LCPI22_0:
-; CHECK-NOFP-NEXT:    .short 0x7c00 @ half +Inf
+; CHECK-LABEL: fmin_v2f16_acc:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmovx.f16 s4, s0
+; CHECK-NEXT:    vminnm.f16 s0, s0, s4
+; CHECK-NEXT:    vldr.16 s2, [r0]
+; CHECK-NEXT:    vminnm.f16 s0, s2, s0
+; CHECK-NEXT:    vstr.16 s0, [r0]
+; CHECK-NEXT:    bx lr
 entry:
   %y = load half, half* %yy
   %z = call fast half @llvm.experimental.vector.reduce.fmin.v2f16(<2 x half> %x)
@@ -854,25 +662,13 @@ entry:
 }
 
 define arm_aapcs_vfpcc float @fmin_v2f32_acc_nofast(<2 x float> %x, float %y) {
-; CHECK-FP-LABEL: fmin_v2f32_acc_nofast:
-; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q2, r0
-; CHECK-FP-NEXT:    vminnm.f32 q0, q0, q2
-; CHECK-FP-NEXT:    vcmp.f32 s0, s4
-; CHECK-FP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-FP-NEXT:    vselgt.f32 s0, s4, s0
-; CHECK-FP-NEXT:    bx lr
-;
-; CHECK-NOFP-LABEL: fmin_v2f32_acc_nofast:
-; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vcmp.f32 s1, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s1
-; CHECK-NOFP-NEXT:    vcmp.f32 s0, s4
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s4, s0
-; CHECK-NOFP-NEXT:    bx lr
+; CHECK-LABEL: fmin_v2f32_acc_nofast:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vminnm.f32 s0, s0, s1
+; CHECK-NEXT:    vcmp.f32 s0, s4
+; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
+; CHECK-NEXT:    vselgt.f32 s0, s4, s0
+; CHECK-NEXT:    bx lr
 entry:
   %z = call float @llvm.experimental.vector.reduce.fmin.v2f32(<2 x float> %x)
   %c = fcmp olt float %y, %z
@@ -883,12 +679,9 @@ entry:
 define arm_aapcs_vfpcc float @fmin_v4f32_acc_nofast(<4 x float> %x, float %y) {
 ; CHECK-FP-LABEL: fmin_v4f32_acc_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmov.f64 d4, d1
-; CHECK-FP-NEXT:    vmov.f32 s9, s3
-; CHECK-FP-NEXT:    vminnm.f32 q0, q0, q2
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q2, r0
-; CHECK-FP-NEXT:    vminnm.f32 q0, q0, q2
+; CHECK-FP-NEXT:    vminnm.f32 s6, s2, s3
+; CHECK-FP-NEXT:    vminnm.f32 s0, s0, s1
+; CHECK-FP-NEXT:    vminnm.f32 s0, s0, s6
 ; CHECK-FP-NEXT:    vcmp.f32 s0, s4
 ; CHECK-FP-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-FP-NEXT:    vselgt.f32 s0, s4, s0
@@ -896,17 +689,9 @@ define arm_aapcs_vfpcc float @fmin_v4f32_acc_nofast(<4 x float> %x, float %y) {
 ;
 ; CHECK-NOFP-LABEL: fmin_v4f32_acc_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vcmp.f32 s3, s1
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vmov.f64 d4, d1
-; CHECK-NOFP-NEXT:    vmov.f32 s9, s3
-; CHECK-NOFP-NEXT:    vcmp.f32 s8, s0
-; CHECK-NOFP-NEXT:    vselgt.f32 s6, s1, s3
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s8
-; CHECK-NOFP-NEXT:    vcmp.f32 s6, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s6
+; CHECK-NOFP-NEXT:    vminnm.f32 s6, s0, s1
+; CHECK-NOFP-NEXT:    vminnm.f32 s6, s6, s2
+; CHECK-NOFP-NEXT:    vminnm.f32 s0, s6, s3
 ; CHECK-NOFP-NEXT:    vcmp.f32 s0, s4
 ; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-NOFP-NEXT:    vselgt.f32 s0, s4, s0
@@ -922,12 +707,9 @@ define arm_aapcs_vfpcc float @fmin_v8f32_acc_nofast(<8 x float> %x, float %y) {
 ; CHECK-FP-LABEL: fmin_v8f32_acc_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
 ; CHECK-FP-NEXT:    vminnm.f32 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.f64 d2, d1
-; CHECK-FP-NEXT:    vmov.f32 s5, s3
-; CHECK-FP-NEXT:    vminnm.f32 q0, q0, q1
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r0
-; CHECK-FP-NEXT:    vminnm.f32 q0, q0, q1
+; CHECK-FP-NEXT:    vminnm.f32 s4, s2, s3
+; CHECK-FP-NEXT:    vminnm.f32 s0, s0, s1
+; CHECK-FP-NEXT:    vminnm.f32 s0, s0, s4
 ; CHECK-FP-NEXT:    vcmp.f32 s0, s8
 ; CHECK-FP-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-FP-NEXT:    vselgt.f32 s0, s8, s0
@@ -935,27 +717,13 @@ define arm_aapcs_vfpcc float @fmin_v8f32_acc_nofast(<8 x float> %x, float %y) {
 ;
 ; CHECK-NOFP-LABEL: fmin_v8f32_acc_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vcmp.f32 s7, s3
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s5, s1
-; CHECK-NOFP-NEXT:    vselgt.f32 s10, s3, s7
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s6, s2
-; CHECK-NOFP-NEXT:    vselgt.f32 s12, s1, s5
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s4, s0
-; CHECK-NOFP-NEXT:    vselgt.f32 s14, s2, s6
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s10, s12
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s4
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s14, s0
-; CHECK-NOFP-NEXT:    vselgt.f32 s2, s12, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s14
-; CHECK-NOFP-NEXT:    vcmp.f32 s2, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s2
+; CHECK-NOFP-NEXT:    vminnm.f32 s12, s0, s4
+; CHECK-NOFP-NEXT:    vminnm.f32 s10, s1, s5
+; CHECK-NOFP-NEXT:    vminnm.f32 s10, s12, s10
+; CHECK-NOFP-NEXT:    vminnm.f32 s12, s2, s6
+; CHECK-NOFP-NEXT:    vminnm.f32 s10, s10, s12
+; CHECK-NOFP-NEXT:    vminnm.f32 s0, s3, s7
+; CHECK-NOFP-NEXT:    vminnm.f32 s0, s10, s0
 ; CHECK-NOFP-NEXT:    vcmp.f32 s0, s8
 ; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-NOFP-NEXT:    vselgt.f32 s0, s8, s0
@@ -970,35 +738,26 @@ entry:
 define arm_aapcs_vfpcc void @fmin_v4f16_acc_nofast(<4 x half> %x, half* %yy) {
 ; CHECK-FP-LABEL: fmin_v4f16_acc_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmov r1, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r1
-; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.u16 r1, q0[1]
-; CHECK-FP-NEXT:    vdup.16 q1, r1
-; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vldr.16 s4, [r0]
-; CHECK-FP-NEXT:    vcmp.f16 s0, s4
+; CHECK-FP-NEXT:    vmovx.f16 s4, s1
+; CHECK-FP-NEXT:    vmovx.f16 s6, s0
+; CHECK-FP-NEXT:    vminnm.f16 s0, s0, s6
+; CHECK-FP-NEXT:    vminnm.f16 s4, s1, s4
+; CHECK-FP-NEXT:    vldr.16 s2, [r0]
+; CHECK-FP-NEXT:    vminnm.f16 s0, s0, s4
+; CHECK-FP-NEXT:    vcmp.f16 s0, s2
 ; CHECK-FP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-FP-NEXT:    vselgt.f16 s0, s4, s0
+; CHECK-FP-NEXT:    vselgt.f16 s0, s2, s0
 ; CHECK-FP-NEXT:    vstr.16 s0, [r0]
 ; CHECK-FP-NEXT:    bx lr
 ;
 ; CHECK-NOFP-LABEL: fmin_v4f16_acc_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vmov r1, s1
-; CHECK-NOFP-NEXT:    vmovx.f16 s10, s0
-; CHECK-NOFP-NEXT:    vdup.32 q1, r1
-; CHECK-NOFP-NEXT:    vmovx.f16 s8, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s4, s0
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s4
+; CHECK-NOFP-NEXT:    vmovx.f16 s4, s0
+; CHECK-NOFP-NEXT:    vminnm.f16 s4, s0, s4
+; CHECK-NOFP-NEXT:    vmovx.f16 s0, s1
+; CHECK-NOFP-NEXT:    vminnm.f16 s4, s4, s1
 ; CHECK-NOFP-NEXT:    vldr.16 s2, [r0]
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s8
+; CHECK-NOFP-NEXT:    vminnm.f16 s0, s4, s0
 ; CHECK-NOFP-NEXT:    vcmp.f16 s0, s2
 ; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-NOFP-NEXT:    vselgt.f16 s0, s2, s0
@@ -1016,52 +775,32 @@ entry:
 define arm_aapcs_vfpcc void @fmin_v8f16_acc_nofast(<8 x half> %x, half* %yy) {
 ; CHECK-FP-LABEL: fmin_v8f16_acc_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmov.f64 d2, d1
-; CHECK-FP-NEXT:    vmov.f32 s5, s3
-; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov r1, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r1
-; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.u16 r1, q0[1]
-; CHECK-FP-NEXT:    vdup.16 q1, r1
+; CHECK-FP-NEXT:    vrev32.16 q1, q0
 ; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vldr.16 s4, [r0]
-; CHECK-FP-NEXT:    vcmp.f16 s0, s4
+; CHECK-FP-NEXT:    vminnm.f16 s4, s2, s3
+; CHECK-FP-NEXT:    vminnm.f16 s0, s0, s1
+; CHECK-FP-NEXT:    vldr.16 s2, [r0]
+; CHECK-FP-NEXT:    vminnm.f16 s0, s0, s4
+; CHECK-FP-NEXT:    vcmp.f16 s0, s2
 ; CHECK-FP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-FP-NEXT:    vselgt.f16 s0, s4, s0
+; CHECK-FP-NEXT:    vselgt.f16 s0, s2, s0
 ; CHECK-FP-NEXT:    vstr.16 s0, [r0]
 ; CHECK-FP-NEXT:    bx lr
 ;
 ; CHECK-NOFP-LABEL: fmin_v8f16_acc_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vmovx.f16 s8, s3
-; CHECK-NOFP-NEXT:    vmovx.f16 s10, s1
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s10
-; CHECK-NOFP-NEXT:    vmov.f64 d2, d1
-; CHECK-NOFP-NEXT:    vmovx.f16 s12, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vmov.f32 s5, s3
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
-; CHECK-NOFP-NEXT:    vmovx.f16 s10, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s12
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s3, s1
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s4, s0
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s1, s3
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s4
+; CHECK-NOFP-NEXT:    vmovx.f16 s4, s0
+; CHECK-NOFP-NEXT:    vmovx.f16 s6, s1
+; CHECK-NOFP-NEXT:    vminnm.f16 s4, s0, s4
+; CHECK-NOFP-NEXT:    vmovx.f16 s0, s3
+; CHECK-NOFP-NEXT:    vminnm.f16 s4, s4, s1
+; CHECK-NOFP-NEXT:    vminnm.f16 s4, s4, s6
+; CHECK-NOFP-NEXT:    vmovx.f16 s6, s2
+; CHECK-NOFP-NEXT:    vminnm.f16 s4, s4, s2
 ; CHECK-NOFP-NEXT:    vldr.16 s2, [r0]
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s10
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s8
+; CHECK-NOFP-NEXT:    vminnm.f16 s4, s4, s6
+; CHECK-NOFP-NEXT:    vminnm.f16 s4, s4, s3
+; CHECK-NOFP-NEXT:    vminnm.f16 s0, s4, s0
 ; CHECK-NOFP-NEXT:    vcmp.f16 s0, s2
 ; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-NOFP-NEXT:    vselgt.f16 s0, s2, s0
@@ -1080,78 +819,44 @@ define arm_aapcs_vfpcc void @fmin_v16f16_acc_nofast(<16 x half> %x, half* %yy) {
 ; CHECK-FP-LABEL: fmin_v16f16_acc_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
 ; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.f64 d2, d1
-; CHECK-FP-NEXT:    vmov.f32 s5, s3
-; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov r1, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r1
-; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.u16 r1, q0[1]
-; CHECK-FP-NEXT:    vdup.16 q1, r1
+; CHECK-FP-NEXT:    vrev32.16 q1, q0
 ; CHECK-FP-NEXT:    vminnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vldr.16 s4, [r0]
-; CHECK-FP-NEXT:    vcmp.f16 s0, s4
+; CHECK-FP-NEXT:    vminnm.f16 s4, s2, s3
+; CHECK-FP-NEXT:    vminnm.f16 s0, s0, s1
+; CHECK-FP-NEXT:    vldr.16 s2, [r0]
+; CHECK-FP-NEXT:    vminnm.f16 s0, s0, s4
+; CHECK-FP-NEXT:    vcmp.f16 s0, s2
 ; CHECK-FP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-FP-NEXT:    vselgt.f16 s0, s4, s0
+; CHECK-FP-NEXT:    vselgt.f16 s0, s2, s0
 ; CHECK-FP-NEXT:    vstr.16 s0, [r0]
 ; CHECK-FP-NEXT:    bx lr
 ;
 ; CHECK-NOFP-LABEL: fmin_v16f16_acc_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vmovx.f16 s8, s7
-; CHECK-NOFP-NEXT:    vmovx.f16 s10, s3
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s10
-; CHECK-NOFP-NEXT:    vmovx.f16 s12, s1
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vmovx.f16 s14, s0
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
+; CHECK-NOFP-NEXT:    vmovx.f16 s8, s4
+; CHECK-NOFP-NEXT:    vmovx.f16 s10, s0
+; CHECK-NOFP-NEXT:    vminnm.f16 s8, s10, s8
+; CHECK-NOFP-NEXT:    vminnm.f16 s10, s0, s4
+; CHECK-NOFP-NEXT:    vminnm.f16 s8, s10, s8
+; CHECK-NOFP-NEXT:    vminnm.f16 s10, s1, s5
+; CHECK-NOFP-NEXT:    vminnm.f16 s8, s8, s10
 ; CHECK-NOFP-NEXT:    vmovx.f16 s10, s5
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s12
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
+; CHECK-NOFP-NEXT:    vmovx.f16 s12, s1
+; CHECK-NOFP-NEXT:    vmovx.f16 s4, s7
+; CHECK-NOFP-NEXT:    vminnm.f16 s10, s12, s10
 ; CHECK-NOFP-NEXT:    vmovx.f16 s12, s2
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
+; CHECK-NOFP-NEXT:    vminnm.f16 s8, s8, s10
+; CHECK-NOFP-NEXT:    vminnm.f16 s10, s2, s6
+; CHECK-NOFP-NEXT:    vminnm.f16 s8, s8, s10
 ; CHECK-NOFP-NEXT:    vmovx.f16 s10, s6
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s12
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
-; CHECK-NOFP-NEXT:    vmovx.f16 s12, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s12, s14
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s12, s14, s12
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s12
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s7, s3
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s5, s1
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s3, s7
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s12, s1, s5
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s12
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s6, s2
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s4, s0
-; CHECK-NOFP-NEXT:    vselgt.f16 s12, s2, s6
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s4
+; CHECK-NOFP-NEXT:    vminnm.f16 s10, s12, s10
+; CHECK-NOFP-NEXT:    vmovx.f16 s0, s3
+; CHECK-NOFP-NEXT:    vminnm.f16 s8, s8, s10
+; CHECK-NOFP-NEXT:    vminnm.f16 s10, s3, s7
+; CHECK-NOFP-NEXT:    vminnm.f16 s8, s8, s10
+; CHECK-NOFP-NEXT:    vminnm.f16 s0, s0, s4
 ; CHECK-NOFP-NEXT:    vldr.16 s2, [r0]
-; CHECK-NOFP-NEXT:    vcmp.f16 s12, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s12
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s10
-; CHECK-NOFP-NEXT:    vcmp.f16 s8, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s8
+; CHECK-NOFP-NEXT:    vminnm.f16 s0, s8, s0
 ; CHECK-NOFP-NEXT:    vcmp.f16 s0, s2
 ; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-NOFP-NEXT:    vselgt.f16 s0, s2, s0
@@ -1183,9 +888,7 @@ entry:
 define arm_aapcs_vfpcc double @fmin_v2f64_acc_nofast(<2 x double> %x, double %y) {
 ; CHECK-LABEL: fmin_v2f64_acc_nofast:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    vcmp.f64 d1, d0
-; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vselgt.f64 d0, d0, d1
+; CHECK-NEXT:    vminnm.f64 d0, d0, d1
 ; CHECK-NEXT:    vcmp.f64 d0, d2
 ; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-NEXT:    vselgt.f64 d0, d2, d0
@@ -1200,15 +903,9 @@ entry:
 define arm_aapcs_vfpcc double @fmin_v4f64_acc_nofast(<4 x double> %x, double %y) {
 ; CHECK-LABEL: fmin_v4f64_acc_nofast:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    vcmp.f64 d3, d1
-; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vcmp.f64 d2, d0
-; CHECK-NEXT:    vselgt.f64 d5, d1, d3
-; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vselgt.f64 d0, d0, d2
-; CHECK-NEXT:    vcmp.f64 d5, d0
-; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vselgt.f64 d0, d0, d5
+; CHECK-NEXT:    vminnm.f64 d5, d1, d3
+; CHECK-NEXT:    vminnm.f64 d0, d0, d2
+; CHECK-NEXT:    vminnm.f64 d0, d0, d5
 ; CHECK-NEXT:    vcmp.f64 d0, d4
 ; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-NEXT:    vselgt.f64 d0, d4, d0
@@ -1221,28 +918,10 @@ entry:
 }
 
 define arm_aapcs_vfpcc float @fmax_v2f32(<2 x float> %x) {
-; CHECK-FP-LABEL: fmax_v2f32:
-; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vldr s4, .LCPI37_0
-; CHECK-FP-NEXT:    vmaxnm.f32 s0, s0, s1
-; CHECK-FP-NEXT:    vmaxnm.f32 s0, s0, s4
-; CHECK-FP-NEXT:    bx lr
-; CHECK-FP-NEXT:    .p2align 2
-; CHECK-FP-NEXT:  @ %bb.1:
-; CHECK-FP-NEXT:  .LCPI37_0:
-; CHECK-FP-NEXT:    .long 0xff800000 @ float -Inf
-;
-; CHECK-NOFP-LABEL: fmax_v2f32:
-; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vldr s4, .LCPI37_0
-; CHECK-NOFP-NEXT:    vmaxnm.f32 s0, s0, s1
-; CHECK-NOFP-NEXT:    vmaxnm.f32 s0, s0, s4
-; CHECK-NOFP-NEXT:    vmaxnm.f32 s0, s0, s4
-; CHECK-NOFP-NEXT:    bx lr
-; CHECK-NOFP-NEXT:    .p2align 2
-; CHECK-NOFP-NEXT:  @ %bb.1:
-; CHECK-NOFP-NEXT:  .LCPI37_0:
-; CHECK-NOFP-NEXT:    .long 0xff800000 @ float -Inf
+; CHECK-LABEL: fmax_v2f32:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmaxnm.f32 s0, s0, s1
+; CHECK-NEXT:    bx lr
 entry:
   %z = call fast float @llvm.experimental.vector.reduce.fmax.v2f32(<2 x float> %x)
   ret float %z
@@ -1315,17 +994,8 @@ define arm_aapcs_vfpcc half @fmax_v4f16(<4 x half> %x) {
 ; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s0, s4
 ; CHECK-NOFP-NEXT:    vmovx.f16 s0, s1
 ; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s4, s1
-; CHECK-NOFP-NEXT:    vldr.16 s2, .LCPI40_0
 ; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s4, s0
-; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s0, s2
 ; CHECK-NOFP-NEXT:    bx lr
-; CHECK-NOFP-NEXT:    .p2align 1
-; CHECK-NOFP-NEXT:  @ %bb.1:
-; CHECK-NOFP-NEXT:  .LCPI40_0:
-; CHECK-NOFP-NEXT:    .short 0xfc00 @ half -Inf
 entry:
   %z = call fast half @llvm.experimental.vector.reduce.fmax.v4f16(<4 x half> %x)
   ret half %z
@@ -1454,20 +1124,10 @@ entry:
 }
 
 define arm_aapcs_vfpcc float @fmax_v2f32_nofast(<2 x float> %x) {
-; CHECK-FP-LABEL: fmax_v2f32_nofast:
-; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r0
-; CHECK-FP-NEXT:    vmaxnm.f32 q0, q0, q1
-; CHECK-FP-NEXT:    @ kill: def $s0 killed $s0 killed $q0
-; CHECK-FP-NEXT:    bx lr
-;
-; CHECK-NOFP-LABEL: fmax_v2f32_nofast:
-; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vcmp.f32 s0, s1
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s1
-; CHECK-NOFP-NEXT:    bx lr
+; CHECK-LABEL: fmax_v2f32_nofast:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmaxnm.f32 s0, s0, s1
+; CHECK-NEXT:    bx lr
 entry:
   %z = call float @llvm.experimental.vector.reduce.fmax.v2f32(<2 x float> %x)
   ret float %z
@@ -1476,28 +1136,16 @@ entry:
 define arm_aapcs_vfpcc float @fmax_v4f32_nofast(<4 x float> %x) {
 ; CHECK-FP-LABEL: fmax_v4f32_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmov.f64 d2, d1
-; CHECK-FP-NEXT:    vmov.f32 s5, s3
-; CHECK-FP-NEXT:    vmaxnm.f32 q0, q0, q1
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r0
-; CHECK-FP-NEXT:    vmaxnm.f32 q0, q0, q1
-; CHECK-FP-NEXT:    @ kill: def $s0 killed $s0 killed $q0
+; CHECK-FP-NEXT:    vmaxnm.f32 s4, s2, s3
+; CHECK-FP-NEXT:    vmaxnm.f32 s0, s0, s1
+; CHECK-FP-NEXT:    vmaxnm.f32 s0, s0, s4
 ; CHECK-FP-NEXT:    bx lr
 ;
 ; CHECK-NOFP-LABEL: fmax_v4f32_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vcmp.f32 s1, s3
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vmov.f64 d2, d1
-; CHECK-NOFP-NEXT:    vmov.f32 s5, s3
-; CHECK-NOFP-NEXT:    vcmp.f32 s0, s4
-; CHECK-NOFP-NEXT:    vselgt.f32 s8, s1, s3
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s4
-; CHECK-NOFP-NEXT:    vcmp.f32 s0, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s8
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s4, s0, s1
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s4, s4, s2
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s0, s4, s3
 ; CHECK-NOFP-NEXT:    bx lr
 entry:
   %z = call float @llvm.experimental.vector.reduce.fmax.v4f32(<4 x float> %x)
@@ -1508,38 +1156,20 @@ define arm_aapcs_vfpcc float @fmax_v8f32_nofast(<8 x float> %x) {
 ; CHECK-FP-LABEL: fmax_v8f32_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
 ; CHECK-FP-NEXT:    vmaxnm.f32 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.f64 d2, d1
-; CHECK-FP-NEXT:    vmov.f32 s5, s3
-; CHECK-FP-NEXT:    vmaxnm.f32 q0, q0, q1
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r0
-; CHECK-FP-NEXT:    vmaxnm.f32 q0, q0, q1
-; CHECK-FP-NEXT:    @ kill: def $s0 killed $s0 killed $q0
+; CHECK-FP-NEXT:    vmaxnm.f32 s4, s2, s3
+; CHECK-FP-NEXT:    vmaxnm.f32 s0, s0, s1
+; CHECK-FP-NEXT:    vmaxnm.f32 s0, s0, s4
 ; CHECK-FP-NEXT:    bx lr
 ;
 ; CHECK-NOFP-LABEL: fmax_v8f32_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vcmp.f32 s3, s7
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s1, s5
-; CHECK-NOFP-NEXT:    vselgt.f32 s8, s3, s7
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s2, s6
-; CHECK-NOFP-NEXT:    vselgt.f32 s10, s1, s5
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s0, s4
-; CHECK-NOFP-NEXT:    vselgt.f32 s12, s2, s6
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s10, s8
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s4
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s0, s12
-; CHECK-NOFP-NEXT:    vselgt.f32 s2, s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s12
-; CHECK-NOFP-NEXT:    vcmp.f32 s0, s2
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s2
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s10, s0, s4
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s8, s1, s5
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s8, s10, s8
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s10, s2, s6
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s8, s8, s10
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s0, s3, s7
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s0, s8, s0
 ; CHECK-NOFP-NEXT:    bx lr
 entry:
   %z = call float @llvm.experimental.vector.reduce.fmax.v8f32(<8 x float> %x)
@@ -1549,30 +1179,20 @@ entry:
 define arm_aapcs_vfpcc half @fmax_v4f16_nofast(<4 x half> %x) {
 ; CHECK-FP-LABEL: fmax_v4f16_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r0
-; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.u16 r0, q0[1]
-; CHECK-FP-NEXT:    vdup.16 q1, r0
-; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    @ kill: def $s0 killed $s0 killed $q0
+; CHECK-FP-NEXT:    vmovx.f16 s4, s1
+; CHECK-FP-NEXT:    vmovx.f16 s6, s0
+; CHECK-FP-NEXT:    vmaxnm.f16 s4, s1, s4
+; CHECK-FP-NEXT:    vmaxnm.f16 s0, s0, s6
+; CHECK-FP-NEXT:    vmaxnm.f16 s0, s0, s4
 ; CHECK-FP-NEXT:    bx lr
 ;
 ; CHECK-NOFP-LABEL: fmax_v4f16_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vmov r0, s1
-; CHECK-NOFP-NEXT:    vmovx.f16 s10, s0
-; CHECK-NOFP-NEXT:    vdup.32 q1, r0
-; CHECK-NOFP-NEXT:    vmovx.f16 s8, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s4
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s8
+; CHECK-NOFP-NEXT:    vmovx.f16 s4, s0
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s0, s4
+; CHECK-NOFP-NEXT:    vmovx.f16 s0, s1
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s4, s1
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s4, s0
 ; CHECK-NOFP-NEXT:    bx lr
 entry:
   %z = call half @llvm.experimental.vector.reduce.fmax.v4f16(<4 x half> %x)
@@ -1582,47 +1202,26 @@ entry:
 define arm_aapcs_vfpcc half @fmax_v8f16_nofast(<8 x half> %x) {
 ; CHECK-FP-LABEL: fmax_v8f16_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmov.f64 d2, d1
-; CHECK-FP-NEXT:    vmov.f32 s5, s3
-; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r0
-; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.u16 r0, q0[1]
-; CHECK-FP-NEXT:    vdup.16 q1, r0
+; CHECK-FP-NEXT:    vrev32.16 q1, q0
 ; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    @ kill: def $s0 killed $s0 killed $q0
+; CHECK-FP-NEXT:    vmaxnm.f16 s4, s2, s3
+; CHECK-FP-NEXT:    vmaxnm.f16 s0, s0, s1
+; CHECK-FP-NEXT:    vmaxnm.f16 s0, s0, s4
 ; CHECK-FP-NEXT:    bx lr
 ;
 ; CHECK-NOFP-LABEL: fmax_v8f16_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vmovx.f16 s8, s3
-; CHECK-NOFP-NEXT:    vmovx.f16 s10, s1
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s8
-; CHECK-NOFP-NEXT:    vmov.f64 d2, d1
-; CHECK-NOFP-NEXT:    vmovx.f16 s12, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vmov.f32 s5, s3
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
-; CHECK-NOFP-NEXT:    vmovx.f16 s10, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s12, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s1, s3
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s4
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s1, s3
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s10
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s8
+; CHECK-NOFP-NEXT:    vmovx.f16 s4, s0
+; CHECK-NOFP-NEXT:    vmovx.f16 s6, s1
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s0, s4
+; CHECK-NOFP-NEXT:    vmovx.f16 s0, s3
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s4, s1
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s4, s6
+; CHECK-NOFP-NEXT:    vmovx.f16 s6, s2
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s4, s2
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s4, s6
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s4, s3
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s4, s0
 ; CHECK-NOFP-NEXT:    bx lr
 entry:
   %z = call half @llvm.experimental.vector.reduce.fmax.v8f16(<8 x half> %x)
@@ -1633,73 +1232,38 @@ define arm_aapcs_vfpcc half @fmax_v16f16_nofast(<16 x half> %x) {
 ; CHECK-FP-LABEL: fmax_v16f16_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
 ; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.f64 d2, d1
-; CHECK-FP-NEXT:    vmov.f32 s5, s3
-; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r0
-; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.u16 r0, q0[1]
-; CHECK-FP-NEXT:    vdup.16 q1, r0
+; CHECK-FP-NEXT:    vrev32.16 q1, q0
 ; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    @ kill: def $s0 killed $s0 killed $q0
+; CHECK-FP-NEXT:    vmaxnm.f16 s4, s2, s3
+; CHECK-FP-NEXT:    vmaxnm.f16 s0, s0, s1
+; CHECK-FP-NEXT:    vmaxnm.f16 s0, s0, s4
 ; CHECK-FP-NEXT:    bx lr
 ;
 ; CHECK-NOFP-LABEL: fmax_v16f16_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vmovx.f16 s8, s7
-; CHECK-NOFP-NEXT:    vmovx.f16 s10, s3
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s8
-; CHECK-NOFP-NEXT:    vmovx.f16 s12, s1
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vmovx.f16 s14, s0
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
+; CHECK-NOFP-NEXT:    vmovx.f16 s8, s4
+; CHECK-NOFP-NEXT:    vmovx.f16 s10, s0
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s8, s10, s8
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s10, s0, s4
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s8, s10, s8
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s10, s1, s5
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s8, s8, s10
 ; CHECK-NOFP-NEXT:    vmovx.f16 s10, s5
-; CHECK-NOFP-NEXT:    vcmp.f16 s12, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
+; CHECK-NOFP-NEXT:    vmovx.f16 s12, s1
+; CHECK-NOFP-NEXT:    vmovx.f16 s4, s7
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s10, s12, s10
 ; CHECK-NOFP-NEXT:    vmovx.f16 s12, s2
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s8, s8, s10
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s10, s2, s6
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s8, s8, s10
 ; CHECK-NOFP-NEXT:    vmovx.f16 s10, s6
-; CHECK-NOFP-NEXT:    vcmp.f16 s12, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
-; CHECK-NOFP-NEXT:    vmovx.f16 s12, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s14, s12
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s12, s14, s12
-; CHECK-NOFP-NEXT:    vcmp.f16 s12, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s3, s7
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s1, s5
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s3, s7
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s12, s1, s5
-; CHECK-NOFP-NEXT:    vcmp.f16 s12, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s2, s6
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s4
-; CHECK-NOFP-NEXT:    vselgt.f16 s12, s2, s6
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s12
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s12
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s10
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s8
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s10, s12, s10
+; CHECK-NOFP-NEXT:    vmovx.f16 s0, s3
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s8, s8, s10
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s10, s3, s7
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s8, s8, s10
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s0, s4
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s8, s0
 ; CHECK-NOFP-NEXT:    bx lr
 entry:
   %z = call half @llvm.experimental.vector.reduce.fmax.v16f16(<16 x half> %x)
@@ -1718,9 +1282,7 @@ entry:
 define arm_aapcs_vfpcc double @fmax_v2f64_nofast(<2 x double> %x) {
 ; CHECK-LABEL: fmax_v2f64_nofast:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    vcmp.f64 d0, d1
-; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vselgt.f64 d0, d0, d1
+; CHECK-NEXT:    vmaxnm.f64 d0, d0, d1
 ; CHECK-NEXT:    bx lr
 entry:
   %z = call double @llvm.experimental.vector.reduce.fmax.v2f64(<2 x double> %x)
@@ -1730,15 +1292,9 @@ entry:
 define arm_aapcs_vfpcc double @fmax_v4f64_nofast(<4 x double> %x) {
 ; CHECK-LABEL: fmax_v4f64_nofast:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    vcmp.f64 d1, d3
-; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vcmp.f64 d0, d2
-; CHECK-NEXT:    vselgt.f64 d4, d1, d3
-; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vselgt.f64 d0, d0, d2
-; CHECK-NEXT:    vcmp.f64 d0, d4
-; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vselgt.f64 d0, d0, d4
+; CHECK-NEXT:    vmaxnm.f64 d4, d1, d3
+; CHECK-NEXT:    vmaxnm.f64 d0, d0, d2
+; CHECK-NEXT:    vmaxnm.f64 d0, d0, d4
 ; CHECK-NEXT:    bx lr
 entry:
   %z = call double @llvm.experimental.vector.reduce.fmax.v4f64(<4 x double> %x)
@@ -1746,30 +1302,11 @@ entry:
 }
 
 define arm_aapcs_vfpcc float @fmax_v2f32_acc(<2 x float> %x, float %y) {
-; CHECK-FP-LABEL: fmax_v2f32_acc:
-; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vldr s6, .LCPI55_0
-; CHECK-FP-NEXT:    vmaxnm.f32 s0, s0, s1
-; CHECK-FP-NEXT:    vmaxnm.f32 s0, s0, s6
-; CHECK-FP-NEXT:    vmaxnm.f32 s0, s4, s0
-; CHECK-FP-NEXT:    bx lr
-; CHECK-FP-NEXT:    .p2align 2
-; CHECK-FP-NEXT:  @ %bb.1:
-; CHECK-FP-NEXT:  .LCPI55_0:
-; CHECK-FP-NEXT:    .long 0xff800000 @ float -Inf
-;
-; CHECK-NOFP-LABEL: fmax_v2f32_acc:
-; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vldr s6, .LCPI55_0
-; CHECK-NOFP-NEXT:    vmaxnm.f32 s0, s0, s1
-; CHECK-NOFP-NEXT:    vmaxnm.f32 s0, s0, s6
-; CHECK-NOFP-NEXT:    vmaxnm.f32 s0, s0, s6
-; CHECK-NOFP-NEXT:    vmaxnm.f32 s0, s4, s0
-; CHECK-NOFP-NEXT:    bx lr
-; CHECK-NOFP-NEXT:    .p2align 2
-; CHECK-NOFP-NEXT:  @ %bb.1:
-; CHECK-NOFP-NEXT:  .LCPI55_0:
-; CHECK-NOFP-NEXT:    .long 0xff800000 @ float -Inf
+; CHECK-LABEL: fmax_v2f32_acc:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmaxnm.f32 s0, s0, s1
+; CHECK-NEXT:    vmaxnm.f32 s0, s4, s0
+; CHECK-NEXT:    bx lr
 entry:
   %z = call fast float @llvm.experimental.vector.reduce.fmax.v2f32(<2 x float> %x)
   %c = fcmp fast ogt float %y, %z
@@ -1837,34 +1374,14 @@ entry:
 }
 
 define arm_aapcs_vfpcc void @fmax_v2f16_acc(<2 x half> %x, half* %yy) {
-; CHECK-FP-LABEL: fmax_v2f16_acc:
-; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmovx.f16 s4, s0
-; CHECK-FP-NEXT:    vmaxnm.f16 s0, s0, s4
-; CHECK-FP-NEXT:    vldr.16 s2, [r0]
-; CHECK-FP-NEXT:    vmaxnm.f16 s0, s2, s0
-; CHECK-FP-NEXT:    vstr.16 s0, [r0]
-; CHECK-FP-NEXT:    bx lr
-;
-; CHECK-NOFP-LABEL: fmax_v2f16_acc:
-; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vmovx.f16 s4, s0
-; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s0, s4
-; CHECK-NOFP-NEXT:    vldr.16 s2, .LCPI58_0
-; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vldr.16 s2, [r0]
-; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s2, s0
-; CHECK-NOFP-NEXT:    vstr.16 s0, [r0]
-; CHECK-NOFP-NEXT:    bx lr
-; CHECK-NOFP-NEXT:    .p2align 1
-; CHECK-NOFP-NEXT:  @ %bb.1:
-; CHECK-NOFP-NEXT:  .LCPI58_0:
-; CHECK-NOFP-NEXT:    .short 0xfc00 @ half -Inf
+; CHECK-LABEL: fmax_v2f16_acc:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmovx.f16 s4, s0
+; CHECK-NEXT:    vmaxnm.f16 s0, s0, s4
+; CHECK-NEXT:    vldr.16 s2, [r0]
+; CHECK-NEXT:    vmaxnm.f16 s0, s2, s0
+; CHECK-NEXT:    vstr.16 s0, [r0]
+; CHECK-NEXT:    bx lr
 entry:
   %y = load half, half* %yy
   %z = call fast half @llvm.experimental.vector.reduce.fmax.v2f16(<2 x half> %x)
@@ -1893,20 +1410,11 @@ define arm_aapcs_vfpcc void @fmax_v4f16_acc(<4 x half> %x, half* %yy) {
 ; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s0, s4
 ; CHECK-NOFP-NEXT:    vmovx.f16 s0, s1
 ; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s4, s1
-; CHECK-NOFP-NEXT:    vldr.16 s2, .LCPI59_0
-; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s4, s0
-; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s0, s2
-; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s0, s2
 ; CHECK-NOFP-NEXT:    vldr.16 s2, [r0]
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s4, s0
 ; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s2, s0
 ; CHECK-NOFP-NEXT:    vstr.16 s0, [r0]
 ; CHECK-NOFP-NEXT:    bx lr
-; CHECK-NOFP-NEXT:    .p2align 1
-; CHECK-NOFP-NEXT:  @ %bb.1:
-; CHECK-NOFP-NEXT:  .LCPI59_0:
-; CHECK-NOFP-NEXT:    .short 0xfc00 @ half -Inf
 entry:
   %y = load half, half* %yy
   %z = call fast half @llvm.experimental.vector.reduce.fmax.v4f16(<4 x half> %x)
@@ -2068,25 +1576,13 @@ entry:
 }
 
 define arm_aapcs_vfpcc float @fmax_v2f32_acc_nofast(<2 x float> %x, float %y) {
-; CHECK-FP-LABEL: fmax_v2f32_acc_nofast:
-; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q2, r0
-; CHECK-FP-NEXT:    vmaxnm.f32 q0, q0, q2
-; CHECK-FP-NEXT:    vcmp.f32 s4, s0
-; CHECK-FP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-FP-NEXT:    vselgt.f32 s0, s4, s0
-; CHECK-FP-NEXT:    bx lr
-;
-; CHECK-NOFP-LABEL: fmax_v2f32_acc_nofast:
-; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vcmp.f32 s0, s1
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s1
-; CHECK-NOFP-NEXT:    vcmp.f32 s4, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s4, s0
-; CHECK-NOFP-NEXT:    bx lr
+; CHECK-LABEL: fmax_v2f32_acc_nofast:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmaxnm.f32 s0, s0, s1
+; CHECK-NEXT:    vcmp.f32 s4, s0
+; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
+; CHECK-NEXT:    vselgt.f32 s0, s4, s0
+; CHECK-NEXT:    bx lr
 entry:
   %z = call float @llvm.experimental.vector.reduce.fmax.v2f32(<2 x float> %x)
   %c = fcmp ogt float %y, %z
@@ -2097,12 +1593,9 @@ entry:
 define arm_aapcs_vfpcc float @fmax_v4f32_acc_nofast(<4 x float> %x, float %y) {
 ; CHECK-FP-LABEL: fmax_v4f32_acc_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmov.f64 d4, d1
-; CHECK-FP-NEXT:    vmov.f32 s9, s3
-; CHECK-FP-NEXT:    vmaxnm.f32 q0, q0, q2
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q2, r0
-; CHECK-FP-NEXT:    vmaxnm.f32 q0, q0, q2
+; CHECK-FP-NEXT:    vmaxnm.f32 s6, s2, s3
+; CHECK-FP-NEXT:    vmaxnm.f32 s0, s0, s1
+; CHECK-FP-NEXT:    vmaxnm.f32 s0, s0, s6
 ; CHECK-FP-NEXT:    vcmp.f32 s4, s0
 ; CHECK-FP-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-FP-NEXT:    vselgt.f32 s0, s4, s0
@@ -2110,17 +1603,9 @@ define arm_aapcs_vfpcc float @fmax_v4f32_acc_nofast(<4 x float> %x, float %y) {
 ;
 ; CHECK-NOFP-LABEL: fmax_v4f32_acc_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vcmp.f32 s1, s3
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vmov.f64 d4, d1
-; CHECK-NOFP-NEXT:    vmov.f32 s9, s3
-; CHECK-NOFP-NEXT:    vcmp.f32 s0, s8
-; CHECK-NOFP-NEXT:    vselgt.f32 s6, s1, s3
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s8
-; CHECK-NOFP-NEXT:    vcmp.f32 s0, s6
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s6
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s6, s0, s1
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s6, s6, s2
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s0, s6, s3
 ; CHECK-NOFP-NEXT:    vcmp.f32 s4, s0
 ; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-NOFP-NEXT:    vselgt.f32 s0, s4, s0
@@ -2136,12 +1621,9 @@ define arm_aapcs_vfpcc float @fmax_v8f32_acc_nofast(<8 x float> %x, float %y) {
 ; CHECK-FP-LABEL: fmax_v8f32_acc_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
 ; CHECK-FP-NEXT:    vmaxnm.f32 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.f64 d2, d1
-; CHECK-FP-NEXT:    vmov.f32 s5, s3
-; CHECK-FP-NEXT:    vmaxnm.f32 q0, q0, q1
-; CHECK-FP-NEXT:    vmov r0, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r0
-; CHECK-FP-NEXT:    vmaxnm.f32 q0, q0, q1
+; CHECK-FP-NEXT:    vmaxnm.f32 s4, s2, s3
+; CHECK-FP-NEXT:    vmaxnm.f32 s0, s0, s1
+; CHECK-FP-NEXT:    vmaxnm.f32 s0, s0, s4
 ; CHECK-FP-NEXT:    vcmp.f32 s8, s0
 ; CHECK-FP-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-FP-NEXT:    vselgt.f32 s0, s8, s0
@@ -2149,27 +1631,13 @@ define arm_aapcs_vfpcc float @fmax_v8f32_acc_nofast(<8 x float> %x, float %y) {
 ;
 ; CHECK-NOFP-LABEL: fmax_v8f32_acc_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vcmp.f32 s3, s7
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s1, s5
-; CHECK-NOFP-NEXT:    vselgt.f32 s10, s3, s7
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s2, s6
-; CHECK-NOFP-NEXT:    vselgt.f32 s12, s1, s5
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s0, s4
-; CHECK-NOFP-NEXT:    vselgt.f32 s14, s2, s6
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s12, s10
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s4
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f32 s0, s14
-; CHECK-NOFP-NEXT:    vselgt.f32 s2, s12, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s14
-; CHECK-NOFP-NEXT:    vcmp.f32 s0, s2
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f32 s0, s0, s2
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s12, s0, s4
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s10, s1, s5
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s10, s12, s10
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s12, s2, s6
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s10, s10, s12
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s0, s3, s7
+; CHECK-NOFP-NEXT:    vmaxnm.f32 s0, s10, s0
 ; CHECK-NOFP-NEXT:    vcmp.f32 s8, s0
 ; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-NOFP-NEXT:    vselgt.f32 s0, s8, s0
@@ -2184,35 +1652,26 @@ entry:
 define arm_aapcs_vfpcc void @fmax_v4f16_acc_nofast(<4 x half> %x, half* %yy) {
 ; CHECK-FP-LABEL: fmax_v4f16_acc_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmov r1, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r1
-; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.u16 r1, q0[1]
-; CHECK-FP-NEXT:    vdup.16 q1, r1
-; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vldr.16 s4, [r0]
-; CHECK-FP-NEXT:    vcmp.f16 s4, s0
+; CHECK-FP-NEXT:    vmovx.f16 s4, s1
+; CHECK-FP-NEXT:    vmovx.f16 s6, s0
+; CHECK-FP-NEXT:    vmaxnm.f16 s0, s0, s6
+; CHECK-FP-NEXT:    vmaxnm.f16 s4, s1, s4
+; CHECK-FP-NEXT:    vldr.16 s2, [r0]
+; CHECK-FP-NEXT:    vmaxnm.f16 s0, s0, s4
+; CHECK-FP-NEXT:    vcmp.f16 s2, s0
 ; CHECK-FP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-FP-NEXT:    vselgt.f16 s0, s4, s0
+; CHECK-FP-NEXT:    vselgt.f16 s0, s2, s0
 ; CHECK-FP-NEXT:    vstr.16 s0, [r0]
 ; CHECK-FP-NEXT:    bx lr
 ;
 ; CHECK-NOFP-LABEL: fmax_v4f16_acc_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vmov r1, s1
-; CHECK-NOFP-NEXT:    vmovx.f16 s10, s0
-; CHECK-NOFP-NEXT:    vdup.32 q1, r1
-; CHECK-NOFP-NEXT:    vmovx.f16 s8, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s4
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s4
+; CHECK-NOFP-NEXT:    vmovx.f16 s4, s0
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s0, s4
+; CHECK-NOFP-NEXT:    vmovx.f16 s0, s1
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s4, s1
 ; CHECK-NOFP-NEXT:    vldr.16 s2, [r0]
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s8
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s4, s0
 ; CHECK-NOFP-NEXT:    vcmp.f16 s2, s0
 ; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-NOFP-NEXT:    vselgt.f16 s0, s2, s0
@@ -2230,52 +1689,32 @@ entry:
 define arm_aapcs_vfpcc void @fmax_v8f16_acc_nofast(<8 x half> %x, half* %yy) {
 ; CHECK-FP-LABEL: fmax_v8f16_acc_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
-; CHECK-FP-NEXT:    vmov.f64 d2, d1
-; CHECK-FP-NEXT:    vmov.f32 s5, s3
-; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov r1, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r1
-; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.u16 r1, q0[1]
-; CHECK-FP-NEXT:    vdup.16 q1, r1
+; CHECK-FP-NEXT:    vrev32.16 q1, q0
 ; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vldr.16 s4, [r0]
-; CHECK-FP-NEXT:    vcmp.f16 s4, s0
+; CHECK-FP-NEXT:    vmaxnm.f16 s4, s2, s3
+; CHECK-FP-NEXT:    vmaxnm.f16 s0, s0, s1
+; CHECK-FP-NEXT:    vldr.16 s2, [r0]
+; CHECK-FP-NEXT:    vmaxnm.f16 s0, s0, s4
+; CHECK-FP-NEXT:    vcmp.f16 s2, s0
 ; CHECK-FP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-FP-NEXT:    vselgt.f16 s0, s4, s0
+; CHECK-FP-NEXT:    vselgt.f16 s0, s2, s0
 ; CHECK-FP-NEXT:    vstr.16 s0, [r0]
 ; CHECK-FP-NEXT:    bx lr
 ;
 ; CHECK-NOFP-LABEL: fmax_v8f16_acc_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vmovx.f16 s8, s3
-; CHECK-NOFP-NEXT:    vmovx.f16 s10, s1
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s8
-; CHECK-NOFP-NEXT:    vmov.f64 d2, d1
-; CHECK-NOFP-NEXT:    vmovx.f16 s12, s0
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vmov.f32 s5, s3
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
-; CHECK-NOFP-NEXT:    vmovx.f16 s10, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s12, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s1, s3
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s4
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s1, s3
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s4
+; CHECK-NOFP-NEXT:    vmovx.f16 s4, s0
+; CHECK-NOFP-NEXT:    vmovx.f16 s6, s1
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s0, s4
+; CHECK-NOFP-NEXT:    vmovx.f16 s0, s3
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s4, s1
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s4, s6
+; CHECK-NOFP-NEXT:    vmovx.f16 s6, s2
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s4, s2
 ; CHECK-NOFP-NEXT:    vldr.16 s2, [r0]
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s10
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s8
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s4, s6
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s4, s4, s3
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s4, s0
 ; CHECK-NOFP-NEXT:    vcmp.f16 s2, s0
 ; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-NOFP-NEXT:    vselgt.f16 s0, s2, s0
@@ -2294,78 +1733,44 @@ define arm_aapcs_vfpcc void @fmax_v16f16_acc_nofast(<16 x half> %x, half* %yy) {
 ; CHECK-FP-LABEL: fmax_v16f16_acc_nofast:
 ; CHECK-FP:       @ %bb.0: @ %entry
 ; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.f64 d2, d1
-; CHECK-FP-NEXT:    vmov.f32 s5, s3
-; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov r1, s1
-; CHECK-FP-NEXT:    vdup.32 q1, r1
-; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vmov.u16 r1, q0[1]
-; CHECK-FP-NEXT:    vdup.16 q1, r1
+; CHECK-FP-NEXT:    vrev32.16 q1, q0
 ; CHECK-FP-NEXT:    vmaxnm.f16 q0, q0, q1
-; CHECK-FP-NEXT:    vldr.16 s4, [r0]
-; CHECK-FP-NEXT:    vcmp.f16 s4, s0
+; CHECK-FP-NEXT:    vmaxnm.f16 s4, s2, s3
+; CHECK-FP-NEXT:    vmaxnm.f16 s0, s0, s1
+; CHECK-FP-NEXT:    vldr.16 s2, [r0]
+; CHECK-FP-NEXT:    vmaxnm.f16 s0, s0, s4
+; CHECK-FP-NEXT:    vcmp.f16 s2, s0
 ; CHECK-FP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-FP-NEXT:    vselgt.f16 s0, s4, s0
+; CHECK-FP-NEXT:    vselgt.f16 s0, s2, s0
 ; CHECK-FP-NEXT:    vstr.16 s0, [r0]
 ; CHECK-FP-NEXT:    bx lr
 ;
 ; CHECK-NOFP-LABEL: fmax_v16f16_acc_nofast:
 ; CHECK-NOFP:       @ %bb.0: @ %entry
-; CHECK-NOFP-NEXT:    vmovx.f16 s8, s7
-; CHECK-NOFP-NEXT:    vmovx.f16 s10, s3
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s8
-; CHECK-NOFP-NEXT:    vmovx.f16 s12, s1
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vmovx.f16 s14, s0
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
+; CHECK-NOFP-NEXT:    vmovx.f16 s8, s4
+; CHECK-NOFP-NEXT:    vmovx.f16 s10, s0
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s8, s10, s8
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s10, s0, s4
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s8, s10, s8
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s10, s1, s5
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s8, s8, s10
 ; CHECK-NOFP-NEXT:    vmovx.f16 s10, s5
-; CHECK-NOFP-NEXT:    vcmp.f16 s12, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
+; CHECK-NOFP-NEXT:    vmovx.f16 s12, s1
+; CHECK-NOFP-NEXT:    vmovx.f16 s4, s7
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s10, s12, s10
 ; CHECK-NOFP-NEXT:    vmovx.f16 s12, s2
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s8, s8, s10
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s10, s2, s6
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s8, s8, s10
 ; CHECK-NOFP-NEXT:    vmovx.f16 s10, s6
-; CHECK-NOFP-NEXT:    vcmp.f16 s12, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
-; CHECK-NOFP-NEXT:    vmovx.f16 s12, s4
-; CHECK-NOFP-NEXT:    vcmp.f16 s14, s12
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s12, s14, s12
-; CHECK-NOFP-NEXT:    vcmp.f16 s12, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
-; CHECK-NOFP-NEXT:    vcmp.f16 s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s3, s7
-; CHECK-NOFP-NEXT:    vselgt.f16 s8, s10, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s1, s5
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s3, s7
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s12, s1, s5
-; CHECK-NOFP-NEXT:    vcmp.f16 s12, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s2, s6
-; CHECK-NOFP-NEXT:    vselgt.f16 s10, s12, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s4
-; CHECK-NOFP-NEXT:    vselgt.f16 s12, s2, s6
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s4
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s10, s12, s10
+; CHECK-NOFP-NEXT:    vmovx.f16 s0, s3
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s8, s8, s10
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s10, s3, s7
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s8, s8, s10
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s0, s4
 ; CHECK-NOFP-NEXT:    vldr.16 s2, [r0]
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s12
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s12
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s10
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s10
-; CHECK-NOFP-NEXT:    vcmp.f16 s0, s8
-; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NOFP-NEXT:    vselgt.f16 s0, s0, s8
+; CHECK-NOFP-NEXT:    vmaxnm.f16 s0, s8, s0
 ; CHECK-NOFP-NEXT:    vcmp.f16 s2, s0
 ; CHECK-NOFP-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-NOFP-NEXT:    vselgt.f16 s0, s2, s0
@@ -2397,9 +1802,7 @@ entry:
 define arm_aapcs_vfpcc double @fmax_v2f64_acc_nofast(<2 x double> %x, double %y) {
 ; CHECK-LABEL: fmax_v2f64_acc_nofast:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    vcmp.f64 d0, d1
-; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vselgt.f64 d0, d0, d1
+; CHECK-NEXT:    vmaxnm.f64 d0, d0, d1
 ; CHECK-NEXT:    vcmp.f64 d2, d0
 ; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-NEXT:    vselgt.f64 d0, d2, d0
@@ -2414,15 +1817,9 @@ entry:
 define arm_aapcs_vfpcc double @fmax_v4f64_acc_nofast(<4 x double> %x, double %y) {
 ; CHECK-LABEL: fmax_v4f64_acc_nofast:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    vcmp.f64 d1, d3
-; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vcmp.f64 d0, d2
-; CHECK-NEXT:    vselgt.f64 d5, d1, d3
-; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vselgt.f64 d0, d0, d2
-; CHECK-NEXT:    vcmp.f64 d0, d5
-; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vselgt.f64 d0, d0, d5
+; CHECK-NEXT:    vmaxnm.f64 d5, d1, d3
+; CHECK-NEXT:    vmaxnm.f64 d0, d0, d2
+; CHECK-NEXT:    vmaxnm.f64 d0, d0, d5
 ; CHECK-NEXT:    vcmp.f64 d4, d0
 ; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
 ; CHECK-NEXT:    vselgt.f64 d0, d4, d0

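(For orientation only, not part of the diff: the MVE checks above replace vcmp/vselgt chains with vminnm/vmaxnm chains, which is how llvm.minnum/llvm.maxnum lower on these targets. Below is a minimal LLVM IR sketch of the pattern those 'nofast' tests exercise: the strict reduction call exactly as the tests write it, next to a hand-scalarized form written with llvm.maxnum, where a NaN element behaves as missing data. The function names here are illustrative, not taken from the patch.)

declare float @llvm.experimental.vector.reduce.fmax.v4f32(<4 x float>)
declare float @llvm.maxnum.f32(float, float)

; Strict (no fast-math flags) reduction of a <4 x float> vector, as in the tests.
define float @reduce_fmax_strict(<4 x float> %v) {
  %r = call float @llvm.experimental.vector.reduce.fmax.v4f32(<4 x float> %v)
  ret float %r
}

; Hand-scalarized sketch under maxnum (libm fmax) semantics:
; when one operand of a step is NaN, the other operand is returned.
define float @reduce_fmax_by_hand(<4 x float> %v) {
  %e0 = extractelement <4 x float> %v, i32 0
  %e1 = extractelement <4 x float> %v, i32 1
  %e2 = extractelement <4 x float> %v, i32 2
  %e3 = extractelement <4 x float> %v, i32 3
  %m0 = call float @llvm.maxnum.f32(float %e0, float %e1)
  %m1 = call float @llvm.maxnum.f32(float %m0, float %e2)
  %m2 = call float @llvm.maxnum.f32(float %m1, float %e3)
  ret float %m2
}
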
diff  --git a/llvm/test/CodeGen/Thumb2/mve-vecreduce-loops.ll b/llvm/test/CodeGen/Thumb2/mve-vecreduce-loops.ll
index 64a76f38920a..382c32dbe2bf 100644
--- a/llvm/test/CodeGen/Thumb2/mve-vecreduce-loops.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-vecreduce-loops.ll
@@ -1512,13 +1512,10 @@ define float @fmin_f32(float* nocapture readonly %x, i32 %n) {
 ; CHECK-NEXT:    vpsel q0, q0, q1
 ; CHECK-NEXT:    le lr, .LBB15_5
 ; CHECK-NEXT:  @ %bb.6: @ %middle.block
-; CHECK-NEXT:    vmov.f32 s4, s2
+; CHECK-NEXT:    vminnm.f32 s4, s2, s3
+; CHECK-NEXT:    vminnm.f32 s0, s0, s1
+; CHECK-NEXT:    vminnm.f32 s0, s0, s4
 ; CHECK-NEXT:    cmp r2, r1
-; CHECK-NEXT:    vmov.f32 s5, s3
-; CHECK-NEXT:    vminnm.f32 q0, q0, q1
-; CHECK-NEXT:    vmov r3, s1
-; CHECK-NEXT:    vdup.32 q1, r3
-; CHECK-NEXT:    vminnm.f32 q0, q0, q1
 ; CHECK-NEXT:    beq .LBB15_9
 ; CHECK-NEXT:  .LBB15_7: @ %for.body.preheader1
 ; CHECK-NEXT:    sub.w lr, r1, r2
@@ -1526,10 +1523,10 @@ define float @fmin_f32(float* nocapture readonly %x, i32 %n) {
 ; CHECK-NEXT:    dls lr, lr
 ; CHECK-NEXT:  .LBB15_8: @ %for.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    vldmia r0!, {s4}
-; CHECK-NEXT:    vcmp.f32 s0, s4
+; CHECK-NEXT:    vldmia r0!, {s2}
+; CHECK-NEXT:    vcmp.f32 s0, s2
 ; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vselge.f32 s0, s4, s0
+; CHECK-NEXT:    vselge.f32 s0, s2, s0
 ; CHECK-NEXT:    le lr, .LBB15_8
 ; CHECK-NEXT:  .LBB15_9: @ %for.cond.cleanup
 ; CHECK-NEXT:    vmov r0, s0
@@ -1620,13 +1617,10 @@ define float @fmax_f32(float* nocapture readonly %x, i32 %n) {
 ; CHECK-NEXT:    vpsel q0, q0, q1
 ; CHECK-NEXT:    le lr, .LBB16_5
 ; CHECK-NEXT:  @ %bb.6: @ %middle.block
-; CHECK-NEXT:    vmov.f32 s4, s2
+; CHECK-NEXT:    vmaxnm.f32 s4, s2, s3
+; CHECK-NEXT:    vmaxnm.f32 s0, s0, s1
+; CHECK-NEXT:    vmaxnm.f32 s0, s0, s4
 ; CHECK-NEXT:    cmp r2, r1
-; CHECK-NEXT:    vmov.f32 s5, s3
-; CHECK-NEXT:    vmaxnm.f32 q0, q0, q1
-; CHECK-NEXT:    vmov r3, s1
-; CHECK-NEXT:    vdup.32 q1, r3
-; CHECK-NEXT:    vmaxnm.f32 q0, q0, q1
 ; CHECK-NEXT:    beq .LBB16_9
 ; CHECK-NEXT:  .LBB16_7: @ %for.body.preheader1
 ; CHECK-NEXT:    sub.w lr, r1, r2
@@ -1634,10 +1628,10 @@ define float @fmax_f32(float* nocapture readonly %x, i32 %n) {
 ; CHECK-NEXT:    dls lr, lr
 ; CHECK-NEXT:  .LBB16_8: @ %for.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    vldmia r0!, {s4}
-; CHECK-NEXT:    vcmp.f32 s4, s0
+; CHECK-NEXT:    vldmia r0!, {s2}
+; CHECK-NEXT:    vcmp.f32 s2, s0
 ; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
-; CHECK-NEXT:    vselge.f32 s0, s4, s0
+; CHECK-NEXT:    vselge.f32 s0, s2, s0
 ; CHECK-NEXT:    le lr, .LBB16_8
 ; CHECK-NEXT:  .LBB16_9: @ %for.cond.cleanup
 ; CHECK-NEXT:    vmov r0, s0

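(Also for reference, not part of the diff: the X86 files below check the same reduction under fast-math flags, and the SSE/AVX output differs mainly in whether NaN-aware cmpunord/blend sequences are emitted. A minimal sketch of the flag variants as they appear across these tests; only the flags on the call differ, the intrinsic and types are unchanged, and the function names are illustrative.)

declare float @llvm.experimental.vector.reduce.fmax.v8f32(<8 x float>)

; Strict call: no fast-math flags, so the lowering must handle NaN inputs.
define float @reduce_strict(<8 x float> %v) {
  %r = call float @llvm.experimental.vector.reduce.fmax.v8f32(<8 x float> %v)
  ret float %r
}

; 'nnan' call: the caller asserts the inputs are not NaN, so the
; backend may omit the NaN handling.
define float @reduce_nnan(<8 x float> %v) {
  %r = call nnan float @llvm.experimental.vector.reduce.fmax.v8f32(<8 x float> %v)
  ret float %r
}

; 'fast' call: implies nnan along with the other fast-math flags.
define float @reduce_fast(<8 x float> %v) {
  %r = call fast float @llvm.experimental.vector.reduce.fmax.v8f32(<8 x float> %v)
  ret float %r
}
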
diff  --git a/llvm/test/CodeGen/X86/vector-reduce-fmax-nnan.ll b/llvm/test/CodeGen/X86/vector-reduce-fmax-nnan.ll
index e2025be01134..d304a925d24a 100644
--- a/llvm/test/CodeGen/X86/vector-reduce-fmax-nnan.ll
+++ b/llvm/test/CodeGen/X86/vector-reduce-fmax-nnan.ll
@@ -13,27 +13,46 @@
 define float @test_v2f32(<2 x float> %a0) {
 ; SSE2-LABEL: test_v2f32:
 ; SSE2:       # %bb.0:
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    shufps {{.*#+}} xmm2 = xmm2[1,1],xmm0[1,1]
 ; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
-; SSE2-NEXT:    maxss %xmm1, %xmm0
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andps %xmm2, %xmm3
+; SSE2-NEXT:    maxss %xmm0, %xmm2
+; SSE2-NEXT:    andnps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
 ; SSE2-NEXT:    retq
 ;
 ; SSE41-LABEL: test_v2f32:
 ; SSE41:       # %bb.0:
-; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; SSE41-NEXT:    maxss %xmm1, %xmm0
+; SSE41-NEXT:    movshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
+; SSE41-NEXT:    movaps %xmm0, %xmm1
+; SSE41-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm3
+; SSE41-NEXT:    andps %xmm2, %xmm3
+; SSE41-NEXT:    maxss %xmm0, %xmm2
+; SSE41-NEXT:    andnps %xmm2, %xmm1
+; SSE41-NEXT:    orps %xmm3, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v2f32:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX-NEXT:    vmaxss %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vmaxss %xmm0, %xmm1, %xmm2
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm0
+; AVX-NEXT:    vblendvps %xmm0, %xmm1, %xmm2, %xmm0
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v2f32:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX512-NEXT:    vmaxss %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vmaxss %xmm0, %xmm2, %xmm1
+; AVX512-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmovss %xmm2, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vmovaps %xmm1, %xmm0
 ; AVX512-NEXT:    retq
   %1 = call nnan float @llvm.experimental.vector.reduce.fmax.v2f32(<2 x float> %a0)
   ret float %1
@@ -43,35 +62,45 @@ define float @test_v4f32(<4 x float> %a0) {
 ; SSE2-LABEL: test_v4f32:
 ; SSE2:       # %bb.0:
 ; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE2-NEXT:    maxps %xmm1, %xmm0
-; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
+; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[3,3],xmm0[3,3]
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm0[1]
+; SSE2-NEXT:    movaps %xmm0, %xmm3
+; SSE2-NEXT:    shufps {{.*#+}} xmm3 = xmm3[1,1],xmm0[1,1]
+; SSE2-NEXT:    maxss %xmm3, %xmm0
+; SSE2-NEXT:    maxss %xmm2, %xmm0
 ; SSE2-NEXT:    maxss %xmm1, %xmm0
 ; SSE2-NEXT:    retq
 ;
 ; SSE41-LABEL: test_v4f32:
 ; SSE41:       # %bb.0:
 ; SSE41-NEXT:    movaps %xmm0, %xmm1
-; SSE41-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE41-NEXT:    maxps %xmm1, %xmm0
-; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; SSE41-NEXT:    shufps {{.*#+}} xmm1 = xmm1[3,3],xmm0[3,3]
+; SSE41-NEXT:    movaps %xmm0, %xmm2
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm0[1]
+; SSE41-NEXT:    movshdup {{.*#+}} xmm3 = xmm0[1,1,3,3]
+; SSE41-NEXT:    maxss %xmm3, %xmm0
+; SSE41-NEXT:    maxss %xmm2, %xmm0
 ; SSE41-NEXT:    maxss %xmm1, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v4f32:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; AVX-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[3,3,3,3]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm0[1,1,3,3]
+; AVX-NEXT:    vmaxss %xmm3, %xmm0, %xmm0
+; AVX-NEXT:    vmaxss %xmm2, %xmm0, %xmm0
 ; AVX-NEXT:    vmaxss %xmm1, %xmm0, %xmm0
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v4f32:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[3,3,3,3]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vmaxss %xmm3, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm2, %xmm0, %xmm0
 ; AVX512-NEXT:    vmaxss %xmm1, %xmm0, %xmm0
 ; AVX512-NEXT:    retq
   %1 = call nnan float @llvm.experimental.vector.reduce.fmax.v4f32(<4 x float> %a0)
@@ -82,43 +111,67 @@ define float @test_v8f32(<8 x float> %a0) {
 ; SSE2-LABEL: test_v8f32:
 ; SSE2:       # %bb.0:
 ; SSE2-NEXT:    maxps %xmm1, %xmm0
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    shufps {{.*#+}} xmm2 = xmm2[1,1],xmm0[1,1]
 ; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE2-NEXT:    maxps %xmm1, %xmm0
-; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
-; SSE2-NEXT:    maxss %xmm1, %xmm0
+; SSE2-NEXT:    maxss %xmm2, %xmm1
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm0[1]
+; SSE2-NEXT:    maxss %xmm2, %xmm1
+; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; SSE2-NEXT:    maxss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
 ; SSE2-NEXT:    retq
 ;
 ; SSE41-LABEL: test_v8f32:
 ; SSE41:       # %bb.0:
 ; SSE41-NEXT:    maxps %xmm1, %xmm0
+; SSE41-NEXT:    movshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
 ; SSE41-NEXT:    movaps %xmm0, %xmm1
-; SSE41-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE41-NEXT:    maxps %xmm1, %xmm0
-; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; SSE41-NEXT:    maxss %xmm1, %xmm0
+; SSE41-NEXT:    maxss %xmm2, %xmm1
+; SSE41-NEXT:    movaps %xmm0, %xmm2
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm0[1]
+; SSE41-NEXT:    maxss %xmm2, %xmm1
+; SSE41-NEXT:    shufps {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; SSE41-NEXT:    maxss %xmm0, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v8f32:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; AVX-NEXT:    vpermilps {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm1[1,0]
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm4 = xmm1[1,1,3,3]
+; AVX-NEXT:    vpermilps {{.*#+}} xmm5 = xmm0[3,3,3,3]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm6 = xmm0[1,0]
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm7 = xmm0[1,1,3,3]
+; AVX-NEXT:    vmaxss %xmm7, %xmm0, %xmm0
+; AVX-NEXT:    vmaxss %xmm6, %xmm0, %xmm0
+; AVX-NEXT:    vmaxss %xmm5, %xmm0, %xmm0
 ; AVX-NEXT:    vmaxss %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vmaxss %xmm4, %xmm0, %xmm0
+; AVX-NEXT:    vmaxss %xmm3, %xmm0, %xmm0
+; AVX-NEXT:    vmaxss %xmm2, %xmm0, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v8f32:
 ; AVX512:       # %bb.0:
 ; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vpermilps {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm1[1,0]
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm4 = xmm1[1,1,3,3]
+; AVX512-NEXT:    vpermilps {{.*#+}} xmm5 = xmm0[3,3,3,3]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm6 = xmm0[1,0]
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm7 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vmaxss %xmm7, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm6, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm5, %xmm0, %xmm0
 ; AVX512-NEXT:    vmaxss %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm4, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm3, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm2, %xmm0, %xmm0
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
   %1 = call nnan float @llvm.experimental.vector.reduce.fmax.v8f32(<8 x float> %a0)
@@ -131,12 +184,16 @@ define float @test_v16f32(<16 x float> %a0) {
 ; SSE2-NEXT:    maxps %xmm3, %xmm1
 ; SSE2-NEXT:    maxps %xmm2, %xmm0
 ; SSE2-NEXT:    maxps %xmm1, %xmm0
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    shufps {{.*#+}} xmm2 = xmm2[1,1],xmm0[1,1]
 ; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE2-NEXT:    maxps %xmm1, %xmm0
-; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
-; SSE2-NEXT:    maxss %xmm1, %xmm0
+; SSE2-NEXT:    maxss %xmm2, %xmm1
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm0[1]
+; SSE2-NEXT:    maxss %xmm2, %xmm1
+; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; SSE2-NEXT:    maxss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
 ; SSE2-NEXT:    retq
 ;
 ; SSE41-LABEL: test_v16f32:
@@ -144,35 +201,69 @@ define float @test_v16f32(<16 x float> %a0) {
 ; SSE41-NEXT:    maxps %xmm3, %xmm1
 ; SSE41-NEXT:    maxps %xmm2, %xmm0
 ; SSE41-NEXT:    maxps %xmm1, %xmm0
+; SSE41-NEXT:    movshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
 ; SSE41-NEXT:    movaps %xmm0, %xmm1
-; SSE41-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE41-NEXT:    maxps %xmm1, %xmm0
-; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; SSE41-NEXT:    maxss %xmm1, %xmm0
+; SSE41-NEXT:    maxss %xmm2, %xmm1
+; SSE41-NEXT:    movaps %xmm0, %xmm2
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm0[1]
+; SSE41-NEXT:    maxss %xmm2, %xmm1
+; SSE41-NEXT:    shufps {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; SSE41-NEXT:    maxss %xmm0, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v16f32:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vmaxps %ymm1, %ymm0, %ymm0
-; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
 ; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX-NEXT:    vmaxss %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vmaxss %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX-NEXT:    vmaxss %xmm2, %xmm1, %xmm1
+; AVX-NEXT:    vpermilps {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; AVX-NEXT:    vmaxss %xmm2, %xmm1, %xmm1
+; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm0
+; AVX-NEXT:    vmaxss %xmm0, %xmm1, %xmm1
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
+; AVX-NEXT:    vmaxss %xmm2, %xmm1, %xmm1
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX-NEXT:    vmaxss %xmm2, %xmm1, %xmm1
+; AVX-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; AVX-NEXT:    vmaxss %xmm0, %xmm1, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v16f32:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vextractf64x4 $1, %zmm0, %ymm1
-; AVX512-NEXT:    vmaxps %zmm1, %zmm0, %zmm0
-; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vextractf32x4 $3, %zmm0, %xmm1
+; AVX512-NEXT:    vpermilps {{.*#+}} xmm8 = xmm1[3,3,3,3]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm9 = xmm1[1,0]
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm10 = xmm1[1,1,3,3]
+; AVX512-NEXT:    vextractf32x4 $2, %zmm0, %xmm5
+; AVX512-NEXT:    vpermilps {{.*#+}} xmm11 = xmm5[3,3,3,3]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm12 = xmm5[1,0]
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm13 = xmm5[1,1,3,3]
+; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm3
+; AVX512-NEXT:    vpermilps {{.*#+}} xmm14 = xmm3[3,3,3,3]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm15 = xmm3[1,0]
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm7 = xmm3[1,1,3,3]
+; AVX512-NEXT:    vpermilps {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm4 = xmm0[1,0]
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm6 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vmaxss %xmm6, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm4, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm2, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm3, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm7, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm15, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm14, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm5, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm13, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm12, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm11, %xmm0, %xmm0
 ; AVX512-NEXT:    vmaxss %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm10, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm9, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxss %xmm8, %xmm0, %xmm0
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
   %1 = call nnan float @llvm.experimental.vector.reduce.fmax.v16f32(<16 x float> %a0)
@@ -206,6 +297,76 @@ define double @test_v2f64(<2 x double> %a0) {
   ret double %1
 }
 
+define double @test_v3f64(<3 x double> %a0) {
+; SSE2-LABEL: test_v3f64:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    unpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; SSE2-NEXT:    shufpd {{.*#+}} xmm2 = xmm2[0],mem[1]
+; SSE2-NEXT:    movapd %xmm2, %xmm1
+; SSE2-NEXT:    maxpd %xmm0, %xmm1
+; SSE2-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE2-NEXT:    andpd %xmm0, %xmm2
+; SSE2-NEXT:    andnpd %xmm1, %xmm0
+; SSE2-NEXT:    orpd %xmm2, %xmm0
+; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; SSE2-NEXT:    movapd %xmm0, %xmm1
+; SSE2-NEXT:    cmpunordsd %xmm0, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm3
+; SSE2-NEXT:    andpd %xmm2, %xmm3
+; SSE2-NEXT:    maxsd %xmm0, %xmm2
+; SSE2-NEXT:    andnpd %xmm2, %xmm1
+; SSE2-NEXT:    orpd %xmm3, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE41-LABEL: test_v3f64:
+; SSE41:       # %bb.0:
+; SSE41-NEXT:    unpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; SSE41-NEXT:    blendpd {{.*#+}} xmm2 = xmm2[0],mem[1]
+; SSE41-NEXT:    movapd %xmm2, %xmm1
+; SSE41-NEXT:    maxpd %xmm0, %xmm1
+; SSE41-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm2, %xmm1
+; SSE41-NEXT:    movapd %xmm1, %xmm2
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
+; SSE41-NEXT:    movapd %xmm1, %xmm0
+; SSE41-NEXT:    cmpunordsd %xmm1, %xmm0
+; SSE41-NEXT:    movapd %xmm0, %xmm3
+; SSE41-NEXT:    andpd %xmm2, %xmm3
+; SSE41-NEXT:    maxsd %xmm1, %xmm2
+; SSE41-NEXT:    andnpd %xmm2, %xmm0
+; SSE41-NEXT:    orpd %xmm3, %xmm0
+; SSE41-NEXT:    retq
+;
+; AVX-LABEL: test_v3f64:
+; AVX:       # %bb.0:
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
+; AVX-NEXT:    vmaxsd %xmm0, %xmm1, %xmm2
+; AVX-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm3
+; AVX-NEXT:    vblendvpd %xmm3, %xmm1, %xmm2, %xmm1
+; AVX-NEXT:    vcmpunordsd %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm0
+; AVX-NEXT:    vmaxsd %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm0
+; AVX-NEXT:    vzeroupper
+; AVX-NEXT:    retq
+;
+; AVX512-LABEL: test_v3f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
+; AVX512-NEXT:    vmaxsd %xmm0, %xmm1, %xmm2
+; AVX512-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmovsd %xmm1, %xmm2, %xmm2 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm2, %xmm2, %k1
+; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
+; AVX512-NEXT:    vmaxsd %xmm2, %xmm1, %xmm0
+; AVX512-NEXT:    vmovsd %xmm1, %xmm0, %xmm0 {%k1}
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %1 = call nnan double @llvm.experimental.vector.reduce.fmax.v3f64(<3 x double> %a0)
+  ret double %1
+}
+
 define double @test_v4f64(<4 x double> %a0) {
 ; SSE-LABEL: test_v4f64:
 ; SSE:       # %bb.0:
@@ -218,18 +379,22 @@ define double @test_v4f64(<4 x double> %a0) {
 ; AVX-LABEL: test_v4f64:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vmaxpd %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm0[1,0]
+; AVX-NEXT:    vmaxsd %xmm3, %xmm0, %xmm0
 ; AVX-NEXT:    vmaxsd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vmaxsd %xmm2, %xmm0, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v4f64:
 ; AVX512:       # %bb.0:
 ; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vmaxpd %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm0[1,0]
+; AVX512-NEXT:    vmaxsd %xmm3, %xmm0, %xmm0
 ; AVX512-NEXT:    vmaxsd %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxsd %xmm2, %xmm0, %xmm0
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
   %1 = call nnan double @llvm.experimental.vector.reduce.fmax.v4f64(<4 x double> %a0)
@@ -250,21 +415,31 @@ define double @test_v8f64(<8 x double> %a0) {
 ; AVX-LABEL: test_v8f64:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vmaxpd %ymm1, %ymm0, %ymm0
-; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vmaxpd %xmm1, %xmm0, %xmm0
 ; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vmaxsd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vmaxsd %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm0
+; AVX-NEXT:    vmaxsd %xmm0, %xmm1, %xmm1
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
+; AVX-NEXT:    vmaxsd %xmm0, %xmm1, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v8f64:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vextractf64x4 $1, %zmm0, %ymm1
-; AVX512-NEXT:    vmaxpd %zmm1, %zmm0, %zmm0
-; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vmaxpd %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
+; AVX512-NEXT:    vextractf32x4 $3, %zmm0, %xmm1
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
+; AVX512-NEXT:    vextractf32x4 $2, %zmm0, %xmm3
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm4 = xmm3[1,0]
+; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm5
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm6 = xmm5[1,0]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm7 = xmm0[1,0]
+; AVX512-NEXT:    vmaxsd %xmm7, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxsd %xmm5, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxsd %xmm6, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxsd %xmm3, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxsd %xmm4, %xmm0, %xmm0
 ; AVX512-NEXT:    vmaxsd %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxsd %xmm2, %xmm0, %xmm0
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
   %1 = call nnan double @llvm.experimental.vector.reduce.fmax.v8f64(<8 x double> %a0)
@@ -274,12 +449,12 @@ define double @test_v8f64(<8 x double> %a0) {
 define double @test_v16f64(<16 x double> %a0) {
 ; SSE-LABEL: test_v16f64:
 ; SSE:       # %bb.0:
-; SSE-NEXT:    maxpd %xmm6, %xmm2
-; SSE-NEXT:    maxpd %xmm4, %xmm0
-; SSE-NEXT:    maxpd %xmm2, %xmm0
 ; SSE-NEXT:    maxpd %xmm7, %xmm3
 ; SSE-NEXT:    maxpd %xmm5, %xmm1
 ; SSE-NEXT:    maxpd %xmm3, %xmm1
+; SSE-NEXT:    maxpd %xmm6, %xmm2
+; SSE-NEXT:    maxpd %xmm4, %xmm0
+; SSE-NEXT:    maxpd %xmm2, %xmm0
 ; SSE-NEXT:    maxpd %xmm1, %xmm0
 ; SSE-NEXT:    movapd %xmm0, %xmm1
 ; SSE-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
@@ -291,22 +466,32 @@ define double @test_v16f64(<16 x double> %a0) {
 ; AVX-NEXT:    vmaxpd %ymm3, %ymm1, %ymm1
 ; AVX-NEXT:    vmaxpd %ymm2, %ymm0, %ymm0
 ; AVX-NEXT:    vmaxpd %ymm1, %ymm0, %ymm0
-; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vmaxpd %xmm1, %xmm0, %xmm0
 ; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vmaxsd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vmaxsd %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm0
+; AVX-NEXT:    vmaxsd %xmm0, %xmm1, %xmm1
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
+; AVX-NEXT:    vmaxsd %xmm0, %xmm1, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v16f64:
 ; AVX512:       # %bb.0:
 ; AVX512-NEXT:    vmaxpd %zmm1, %zmm0, %zmm0
-; AVX512-NEXT:    vextractf64x4 $1, %zmm0, %ymm1
-; AVX512-NEXT:    vmaxpd %zmm1, %zmm0, %zmm0
-; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vmaxpd %xmm1, %xmm0, %xmm0
 ; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vmaxsd %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxsd %xmm1, %xmm0, %xmm1
+; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm2
+; AVX512-NEXT:    vmaxsd %xmm2, %xmm1, %xmm1
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm2[1,0]
+; AVX512-NEXT:    vmaxsd %xmm2, %xmm1, %xmm1
+; AVX512-NEXT:    vextractf32x4 $2, %zmm0, %xmm2
+; AVX512-NEXT:    vmaxsd %xmm2, %xmm1, %xmm1
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm2[1,0]
+; AVX512-NEXT:    vmaxsd %xmm2, %xmm1, %xmm1
+; AVX512-NEXT:    vextractf32x4 $3, %zmm0, %xmm0
+; AVX512-NEXT:    vmaxsd %xmm0, %xmm1, %xmm1
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
+; AVX512-NEXT:    vmaxsd %xmm0, %xmm1, %xmm0
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
   %1 = call nnan double @llvm.experimental.vector.reduce.fmax.v16f64(<16 x double> %a0)
@@ -319,6 +504,7 @@ declare float @llvm.experimental.vector.reduce.fmax.v8f32(<8 x float>)
 declare float @llvm.experimental.vector.reduce.fmax.v16f32(<16 x float>)
 
 declare double @llvm.experimental.vector.reduce.fmax.v2f64(<2 x double>)
+declare double @llvm.experimental.vector.reduce.fmax.v3f64(<3 x double>)
 declare double @llvm.experimental.vector.reduce.fmax.v4f64(<4 x double>)
 declare double @llvm.experimental.vector.reduce.fmax.v8f64(<8 x double>)
 declare double @llvm.experimental.vector.reduce.fmax.v16f64(<16 x double>)

diff  --git a/llvm/test/CodeGen/X86/vector-reduce-fmax.ll b/llvm/test/CodeGen/X86/vector-reduce-fmax.ll
index d3b17d25ef09..c5e025be5423 100644
--- a/llvm/test/CodeGen/X86/vector-reduce-fmax.ll
+++ b/llvm/test/CodeGen/X86/vector-reduce-fmax.ll
@@ -10,69 +10,225 @@
 ; vXf32
 ;
 
+define float @test_v1f32(<1 x float> %a0) {
+; ALL-LABEL: test_v1f32:
+; ALL:       # %bb.0:
+; ALL-NEXT:    retq
+  %1 = call float @llvm.experimental.vector.reduce.fmax.v1f32(<1 x float> %a0)
+  ret float %1
+}
+
 define float @test_v2f32(<2 x float> %a0) {
 ; SSE2-LABEL: test_v2f32:
 ; SSE2:       # %bb.0:
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    shufps {{.*#+}} xmm2 = xmm2[1,1],xmm0[1,1]
 ; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
-; SSE2-NEXT:    maxss %xmm1, %xmm0
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andps %xmm2, %xmm3
+; SSE2-NEXT:    maxss %xmm0, %xmm2
+; SSE2-NEXT:    andnps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
 ; SSE2-NEXT:    retq
 ;
 ; SSE41-LABEL: test_v2f32:
 ; SSE41:       # %bb.0:
-; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; SSE41-NEXT:    maxss %xmm1, %xmm0
+; SSE41-NEXT:    movshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
+; SSE41-NEXT:    movaps %xmm0, %xmm1
+; SSE41-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm3
+; SSE41-NEXT:    andps %xmm2, %xmm3
+; SSE41-NEXT:    maxss %xmm0, %xmm2
+; SSE41-NEXT:    andnps %xmm2, %xmm1
+; SSE41-NEXT:    orps %xmm3, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v2f32:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX-NEXT:    vmaxss %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vmaxss %xmm0, %xmm1, %xmm2
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm0
+; AVX-NEXT:    vblendvps %xmm0, %xmm1, %xmm2, %xmm0
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v2f32:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX512-NEXT:    vmaxss %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vmaxss %xmm0, %xmm2, %xmm1
+; AVX512-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmovss %xmm2, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vmovaps %xmm1, %xmm0
 ; AVX512-NEXT:    retq
   %1 = call float @llvm.experimental.vector.reduce.fmax.v2f32(<2 x float> %a0)
   ret float %1
 }
 
+define float @test_v3f32(<3 x float> %a0) {
+; SSE2-LABEL: test_v3f32:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    shufps {{.*#+}} xmm2 = xmm2[1,1],xmm0[1,1]
+; SSE2-NEXT:    movaps %xmm0, %xmm1
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andps %xmm2, %xmm3
+; SSE2-NEXT:    maxss %xmm0, %xmm2
+; SSE2-NEXT:    andnps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movhlps {{.*#+}} xmm0 = xmm0[1,1]
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    maxss %xmm1, %xmm2
+; SSE2-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andnps %xmm2, %xmm3
+; SSE2-NEXT:    andps %xmm0, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE41-LABEL: test_v3f32:
+; SSE41:       # %bb.0:
+; SSE41-NEXT:    movshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
+; SSE41-NEXT:    movaps %xmm0, %xmm1
+; SSE41-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm3
+; SSE41-NEXT:    andps %xmm2, %xmm3
+; SSE41-NEXT:    maxss %xmm0, %xmm2
+; SSE41-NEXT:    andnps %xmm2, %xmm1
+; SSE41-NEXT:    orps %xmm3, %xmm1
+; SSE41-NEXT:    movhlps {{.*#+}} xmm0 = xmm0[1,1]
+; SSE41-NEXT:    movaps %xmm0, %xmm2
+; SSE41-NEXT:    maxss %xmm1, %xmm2
+; SSE41-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm3
+; SSE41-NEXT:    andnps %xmm2, %xmm3
+; SSE41-NEXT:    andps %xmm0, %xmm1
+; SSE41-NEXT:    orps %xmm3, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm0
+; SSE41-NEXT:    retq
+;
+; AVX-LABEL: test_v3f32:
+; AVX:       # %bb.0:
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; AVX-NEXT:    vmaxss %xmm0, %xmm1, %xmm2
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm3
+; AVX-NEXT:    vblendvps %xmm3, %xmm1, %xmm2, %xmm1
+; AVX-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
+; AVX-NEXT:    vmaxss %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm0
+; AVX-NEXT:    retq
+;
+; AVX512-LABEL: test_v3f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vmaxss %xmm0, %xmm1, %xmm2
+; AVX512-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmovss %xmm1, %xmm2, %xmm2 {%k1}
+; AVX512-NEXT:    vcmpunordss %xmm2, %xmm2, %k1
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
+; AVX512-NEXT:    vmaxss %xmm2, %xmm1, %xmm0
+; AVX512-NEXT:    vmovss %xmm1, %xmm0, %xmm0 {%k1}
+; AVX512-NEXT:    retq
+  %1 = call float @llvm.experimental.vector.reduce.fmax.v3f32(<3 x float> %a0)
+  ret float %1
+}
+
 define float @test_v4f32(<4 x float> %a0) {
 ; SSE2-LABEL: test_v4f32:
 ; SSE2:       # %bb.0:
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    movaps %xmm0, %xmm3
+; SSE2-NEXT:    shufps {{.*#+}} xmm3 = xmm3[1,1],xmm0[1,1]
 ; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE2-NEXT:    maxps %xmm1, %xmm0
-; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
-; SSE2-NEXT:    maxss %xmm1, %xmm0
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm4
+; SSE2-NEXT:    andps %xmm3, %xmm4
+; SSE2-NEXT:    maxss %xmm0, %xmm3
+; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; SSE2-NEXT:    movhlps {{.*#+}} xmm2 = xmm2[1,1]
+; SSE2-NEXT:    andnps %xmm3, %xmm1
+; SSE2-NEXT:    orps %xmm4, %xmm1
+; SSE2-NEXT:    movaps %xmm2, %xmm3
+; SSE2-NEXT:    maxss %xmm1, %xmm3
+; SSE2-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm4
+; SSE2-NEXT:    andnps %xmm3, %xmm4
+; SSE2-NEXT:    andps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm4, %xmm1
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    maxss %xmm1, %xmm2
+; SSE2-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andnps %xmm2, %xmm3
+; SSE2-NEXT:    andps %xmm0, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
 ; SSE2-NEXT:    retq
 ;
 ; SSE41-LABEL: test_v4f32:
 ; SSE41:       # %bb.0:
+; SSE41-NEXT:    movaps %xmm0, %xmm2
+; SSE41-NEXT:    movshdup {{.*#+}} xmm3 = xmm0[1,1,3,3]
 ; SSE41-NEXT:    movaps %xmm0, %xmm1
-; SSE41-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE41-NEXT:    maxps %xmm1, %xmm0
-; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; SSE41-NEXT:    maxss %xmm1, %xmm0
+; SSE41-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm4
+; SSE41-NEXT:    andps %xmm3, %xmm4
+; SSE41-NEXT:    maxss %xmm0, %xmm3
+; SSE41-NEXT:    shufps {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; SSE41-NEXT:    movhlps {{.*#+}} xmm2 = xmm2[1,1]
+; SSE41-NEXT:    andnps %xmm3, %xmm1
+; SSE41-NEXT:    orps %xmm4, %xmm1
+; SSE41-NEXT:    movaps %xmm2, %xmm3
+; SSE41-NEXT:    maxss %xmm1, %xmm3
+; SSE41-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm4
+; SSE41-NEXT:    andnps %xmm3, %xmm4
+; SSE41-NEXT:    andps %xmm2, %xmm1
+; SSE41-NEXT:    orps %xmm4, %xmm1
+; SSE41-NEXT:    movaps %xmm0, %xmm2
+; SSE41-NEXT:    maxss %xmm1, %xmm2
+; SSE41-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm3
+; SSE41-NEXT:    andnps %xmm2, %xmm3
+; SSE41-NEXT:    andps %xmm0, %xmm1
+; SSE41-NEXT:    orps %xmm3, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v4f32:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX-NEXT:    vmaxss %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[3,3,3,3]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm0[1,1,3,3]
+; AVX-NEXT:    vmaxss %xmm0, %xmm3, %xmm4
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm0
+; AVX-NEXT:    vblendvps %xmm0, %xmm3, %xmm4, %xmm0
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm3
+; AVX-NEXT:    vmaxss %xmm0, %xmm2, %xmm0
+; AVX-NEXT:    vblendvps %xmm3, %xmm2, %xmm0, %xmm0
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
+; AVX-NEXT:    vmaxss %xmm0, %xmm1, %xmm0
+; AVX-NEXT:    vblendvps %xmm2, %xmm1, %xmm0, %xmm0
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v4f32:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX512-NEXT:    vmaxss %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[3,3,3,3]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vmaxss %xmm0, %xmm3, %xmm4
+; AVX512-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmovss %xmm3, %xmm4, %xmm4 {%k1}
+; AVX512-NEXT:    vcmpunordss %xmm4, %xmm4, %k1
+; AVX512-NEXT:    vmaxss %xmm4, %xmm2, %xmm0
+; AVX512-NEXT:    vmovss %xmm2, %xmm0, %xmm0 {%k1}
+; AVX512-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmaxss %xmm0, %xmm1, %xmm0
+; AVX512-NEXT:    vmovss %xmm1, %xmm0, %xmm0 {%k1}
 ; AVX512-NEXT:    retq
   %1 = call float @llvm.experimental.vector.reduce.fmax.v4f32(<4 x float> %a0)
   ret float %1
@@ -81,46 +237,170 @@ define float @test_v4f32(<4 x float> %a0) {
 define float @test_v8f32(<8 x float> %a0) {
 ; SSE2-LABEL: test_v8f32:
 ; SSE2:       # %bb.0:
-; SSE2-NEXT:    maxps %xmm1, %xmm0
-; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE2-NEXT:    maxps %xmm1, %xmm0
+; SSE2-NEXT:    movaps %xmm1, %xmm2
+; SSE2-NEXT:    maxps %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordps %xmm0, %xmm0
+; SSE2-NEXT:    andps %xmm0, %xmm1
+; SSE2-NEXT:    andnps %xmm2, %xmm0
+; SSE2-NEXT:    orps %xmm1, %xmm0
+; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[1,1,1,1]
 ; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
-; SSE2-NEXT:    maxss %xmm1, %xmm0
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andps %xmm2, %xmm3
+; SSE2-NEXT:    maxss %xmm0, %xmm2
+; SSE2-NEXT:    andnps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; SSE2-NEXT:    movdqa %xmm2, %xmm3
+; SSE2-NEXT:    maxss %xmm1, %xmm3
+; SSE2-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm4
+; SSE2-NEXT:    andnps %xmm3, %xmm4
+; SSE2-NEXT:    andps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm4, %xmm1
+; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; SSE2-NEXT:    movdqa %xmm0, %xmm2
+; SSE2-NEXT:    maxss %xmm1, %xmm2
+; SSE2-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andnps %xmm2, %xmm3
+; SSE2-NEXT:    andps %xmm0, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
 ; SSE2-NEXT:    retq
 ;
 ; SSE41-LABEL: test_v8f32:
 ; SSE41:       # %bb.0:
-; SSE41-NEXT:    maxps %xmm1, %xmm0
-; SSE41-NEXT:    movaps %xmm0, %xmm1
-; SSE41-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE41-NEXT:    maxps %xmm1, %xmm0
-; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; SSE41-NEXT:    maxss %xmm1, %xmm0
+; SSE41-NEXT:    movaps %xmm1, %xmm2
+; SSE41-NEXT:    maxps %xmm0, %xmm2
+; SSE41-NEXT:    cmpunordps %xmm0, %xmm0
+; SSE41-NEXT:    blendvps %xmm0, %xmm1, %xmm2
+; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm2[1,1,3,3]
+; SSE41-NEXT:    movaps %xmm2, %xmm0
+; SSE41-NEXT:    cmpunordss %xmm2, %xmm0
+; SSE41-NEXT:    movaps %xmm0, %xmm3
+; SSE41-NEXT:    andps %xmm1, %xmm3
+; SSE41-NEXT:    maxss %xmm2, %xmm1
+; SSE41-NEXT:    andnps %xmm1, %xmm0
+; SSE41-NEXT:    orps %xmm3, %xmm0
+; SSE41-NEXT:    movaps %xmm2, %xmm1
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm2[1]
+; SSE41-NEXT:    movaps %xmm1, %xmm3
+; SSE41-NEXT:    maxss %xmm0, %xmm3
+; SSE41-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE41-NEXT:    movaps %xmm0, %xmm4
+; SSE41-NEXT:    andnps %xmm3, %xmm4
+; SSE41-NEXT:    andps %xmm1, %xmm0
+; SSE41-NEXT:    orps %xmm4, %xmm0
+; SSE41-NEXT:    shufps {{.*#+}} xmm2 = xmm2[3,3,3,3]
+; SSE41-NEXT:    movaps %xmm2, %xmm1
+; SSE41-NEXT:    maxss %xmm0, %xmm1
+; SSE41-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE41-NEXT:    movaps %xmm0, %xmm3
+; SSE41-NEXT:    andnps %xmm1, %xmm3
+; SSE41-NEXT:    andps %xmm2, %xmm0
+; SSE41-NEXT:    orps %xmm3, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v8f32:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX-NEXT:    vmaxss %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vpermilps {{.*#+}} xmm8 = xmm1[3,3,3,3]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm1[1,0]
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm4 = xmm1[1,1,3,3]
+; AVX-NEXT:    vpermilps {{.*#+}} xmm5 = xmm0[3,3,3,3]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm6 = xmm0[1,0]
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm7 = xmm0[1,1,3,3]
+; AVX-NEXT:    vmaxss %xmm0, %xmm7, %xmm2
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm0
+; AVX-NEXT:    vblendvps %xmm0, %xmm7, %xmm2, %xmm0
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
+; AVX-NEXT:    vmaxss %xmm0, %xmm6, %xmm0
+; AVX-NEXT:    vblendvps %xmm2, %xmm6, %xmm0, %xmm0
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
+; AVX-NEXT:    vmaxss %xmm0, %xmm5, %xmm0
+; AVX-NEXT:    vblendvps %xmm2, %xmm5, %xmm0, %xmm0
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
+; AVX-NEXT:    vmaxss %xmm0, %xmm1, %xmm0
+; AVX-NEXT:    vblendvps %xmm2, %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm1
+; AVX-NEXT:    vmaxss %xmm0, %xmm4, %xmm0
+; AVX-NEXT:    vblendvps %xmm1, %xmm4, %xmm0, %xmm0
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm1
+; AVX-NEXT:    vmaxss %xmm0, %xmm3, %xmm0
+; AVX-NEXT:    vblendvps %xmm1, %xmm3, %xmm0, %xmm0
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm1
+; AVX-NEXT:    vmaxss %xmm0, %xmm8, %xmm0
+; AVX-NEXT:    vblendvps %xmm1, %xmm8, %xmm0, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
-; AVX512-LABEL: test_v8f32:
-; AVX512:       # %bb.0:
-; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX512-NEXT:    vmaxss %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vzeroupper
-; AVX512-NEXT:    retq
+; AVX512BW-LABEL: test_v8f32:
+; AVX512BW:       # %bb.0:
+; AVX512BW-NEXT:    vextractf128 $1, %ymm0, %xmm3
+; AVX512BW-NEXT:    vpermilps {{.*#+}} xmm8 = xmm3[3,3,3,3]
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm3[1,0]
+; AVX512BW-NEXT:    vmovshdup {{.*#+}} xmm4 = xmm3[1,1,3,3]
+; AVX512BW-NEXT:    vpermilps {{.*#+}} xmm5 = xmm0[3,3,3,3]
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm6 = xmm0[1,0]
+; AVX512BW-NEXT:    vmovshdup {{.*#+}} xmm7 = xmm0[1,1,3,3]
+; AVX512BW-NEXT:    vmaxss %xmm0, %xmm7, %xmm1
+; AVX512BW-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vmovss %xmm7, %xmm1, %xmm1 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm1, %xmm1, %k1
+; AVX512BW-NEXT:    vmaxss %xmm1, %xmm6, %xmm0
+; AVX512BW-NEXT:    vmovss %xmm6, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vmaxss %xmm0, %xmm5, %xmm0
+; AVX512BW-NEXT:    vmovss %xmm5, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vmaxss %xmm0, %xmm3, %xmm0
+; AVX512BW-NEXT:    vmovss %xmm3, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vmaxss %xmm0, %xmm4, %xmm0
+; AVX512BW-NEXT:    vmovss %xmm4, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vmaxss %xmm0, %xmm2, %xmm0
+; AVX512BW-NEXT:    vmovss %xmm2, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vmaxss %xmm0, %xmm8, %xmm0
+; AVX512BW-NEXT:    vmovss %xmm8, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vzeroupper
+; AVX512BW-NEXT:    retq
+;
+; AVX512VL-LABEL: test_v8f32:
+; AVX512VL:       # %bb.0:
+; AVX512VL-NEXT:    vextractf128 $1, %ymm0, %xmm1
+; AVX512VL-NEXT:    vpermilps {{.*#+}} xmm8 = xmm1[3,3,3,3]
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm1[1,0]
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} xmm4 = xmm1[1,1,3,3]
+; AVX512VL-NEXT:    vpermilps {{.*#+}} xmm5 = xmm0[3,3,3,3]
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm6 = xmm0[1,0]
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} xmm7 = xmm0[1,1,3,3]
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm7, %xmm2
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmovss %xmm7, %xmm2, %xmm2 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm2, %xmm2, %k1
+; AVX512VL-NEXT:    vmaxss %xmm2, %xmm6, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm6, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm5, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm5, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm1, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm1, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm4, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm4, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm3, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm3, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm8, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm8, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vzeroupper
+; AVX512VL-NEXT:    retq
   %1 = call float @llvm.experimental.vector.reduce.fmax.v8f32(<8 x float> %a0)
   ret float %1
 }
@@ -128,53 +408,259 @@ define float @test_v8f32(<8 x float> %a0) {
 define float @test_v16f32(<16 x float> %a0) {
 ; SSE2-LABEL: test_v16f32:
 ; SSE2:       # %bb.0:
-; SSE2-NEXT:    maxps %xmm3, %xmm1
-; SSE2-NEXT:    maxps %xmm2, %xmm0
-; SSE2-NEXT:    maxps %xmm1, %xmm0
-; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE2-NEXT:    maxps %xmm1, %xmm0
+; SSE2-NEXT:    movaps %xmm2, %xmm4
+; SSE2-NEXT:    maxps %xmm0, %xmm4
+; SSE2-NEXT:    cmpunordps %xmm0, %xmm0
+; SSE2-NEXT:    andps %xmm0, %xmm2
+; SSE2-NEXT:    andnps %xmm4, %xmm0
+; SSE2-NEXT:    orps %xmm2, %xmm0
+; SSE2-NEXT:    movaps %xmm3, %xmm2
+; SSE2-NEXT:    maxps %xmm1, %xmm2
+; SSE2-NEXT:    cmpunordps %xmm1, %xmm1
+; SSE2-NEXT:    andps %xmm1, %xmm3
+; SSE2-NEXT:    andnps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm2
+; SSE2-NEXT:    maxps %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordps %xmm0, %xmm0
+; SSE2-NEXT:    andps %xmm0, %xmm1
+; SSE2-NEXT:    andnps %xmm2, %xmm0
+; SSE2-NEXT:    orps %xmm1, %xmm0
+; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[1,1,1,1]
 ; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
-; SSE2-NEXT:    maxss %xmm1, %xmm0
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andps %xmm2, %xmm3
+; SSE2-NEXT:    maxss %xmm0, %xmm2
+; SSE2-NEXT:    andnps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; SSE2-NEXT:    movdqa %xmm2, %xmm3
+; SSE2-NEXT:    maxss %xmm1, %xmm3
+; SSE2-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm4
+; SSE2-NEXT:    andnps %xmm3, %xmm4
+; SSE2-NEXT:    andps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm4, %xmm1
+; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; SSE2-NEXT:    movdqa %xmm0, %xmm2
+; SSE2-NEXT:    maxss %xmm1, %xmm2
+; SSE2-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andnps %xmm2, %xmm3
+; SSE2-NEXT:    andps %xmm0, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
 ; SSE2-NEXT:    retq
 ;
 ; SSE41-LABEL: test_v16f32:
 ; SSE41:       # %bb.0:
-; SSE41-NEXT:    maxps %xmm3, %xmm1
-; SSE41-NEXT:    maxps %xmm2, %xmm0
-; SSE41-NEXT:    maxps %xmm1, %xmm0
-; SSE41-NEXT:    movaps %xmm0, %xmm1
-; SSE41-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE41-NEXT:    maxps %xmm1, %xmm0
-; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; SSE41-NEXT:    maxss %xmm1, %xmm0
+; SSE41-NEXT:    movaps %xmm2, %xmm4
+; SSE41-NEXT:    maxps %xmm0, %xmm4
+; SSE41-NEXT:    cmpunordps %xmm0, %xmm0
+; SSE41-NEXT:    blendvps %xmm0, %xmm2, %xmm4
+; SSE41-NEXT:    movaps %xmm3, %xmm2
+; SSE41-NEXT:    maxps %xmm1, %xmm2
+; SSE41-NEXT:    cmpunordps %xmm1, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm0
+; SSE41-NEXT:    blendvps %xmm0, %xmm3, %xmm2
+; SSE41-NEXT:    movaps %xmm2, %xmm1
+; SSE41-NEXT:    maxps %xmm4, %xmm1
+; SSE41-NEXT:    cmpunordps %xmm4, %xmm4
+; SSE41-NEXT:    movaps %xmm4, %xmm0
+; SSE41-NEXT:    blendvps %xmm0, %xmm2, %xmm1
+; SSE41-NEXT:    movshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
+; SSE41-NEXT:    movaps %xmm1, %xmm0
+; SSE41-NEXT:    cmpunordss %xmm1, %xmm0
+; SSE41-NEXT:    movaps %xmm0, %xmm3
+; SSE41-NEXT:    andps %xmm2, %xmm3
+; SSE41-NEXT:    maxss %xmm1, %xmm2
+; SSE41-NEXT:    andnps %xmm2, %xmm0
+; SSE41-NEXT:    orps %xmm3, %xmm0
+; SSE41-NEXT:    movaps %xmm1, %xmm2
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
+; SSE41-NEXT:    movaps %xmm2, %xmm3
+; SSE41-NEXT:    maxss %xmm0, %xmm3
+; SSE41-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE41-NEXT:    movaps %xmm0, %xmm4
+; SSE41-NEXT:    andnps %xmm3, %xmm4
+; SSE41-NEXT:    andps %xmm2, %xmm0
+; SSE41-NEXT:    orps %xmm4, %xmm0
+; SSE41-NEXT:    shufps {{.*#+}} xmm1 = xmm1[3,3,3,3]
+; SSE41-NEXT:    movaps %xmm1, %xmm2
+; SSE41-NEXT:    maxss %xmm0, %xmm2
+; SSE41-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE41-NEXT:    movaps %xmm0, %xmm3
+; SSE41-NEXT:    andnps %xmm2, %xmm3
+; SSE41-NEXT:    andps %xmm1, %xmm0
+; SSE41-NEXT:    orps %xmm3, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v16f32:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vmaxps %ymm1, %ymm0, %ymm0
-; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vmaxps %ymm0, %ymm1, %ymm2
+; AVX-NEXT:    vcmpunordps %ymm0, %ymm0, %ymm0
+; AVX-NEXT:    vblendvps %ymm0, %ymm1, %ymm2, %ymm0
 ; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX-NEXT:    vmaxss %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vmaxss %xmm0, %xmm1, %xmm2
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm3
+; AVX-NEXT:    vblendvps %xmm3, %xmm1, %xmm2, %xmm1
+; AVX-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm0[1,0]
+; AVX-NEXT:    vmaxss %xmm1, %xmm3, %xmm1
+; AVX-NEXT:    vblendvps %xmm2, %xmm3, %xmm1, %xmm1
+; AVX-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vpermilps {{.*#+}} xmm3 = xmm0[3,3,3,3]
+; AVX-NEXT:    vmaxss %xmm1, %xmm3, %xmm1
+; AVX-NEXT:    vblendvps %xmm2, %xmm3, %xmm1, %xmm1
+; AVX-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm0
+; AVX-NEXT:    vmaxss %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm1
+; AVX-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm0[1,1,3,3]
+; AVX-NEXT:    vmaxss %xmm1, %xmm3, %xmm1
+; AVX-NEXT:    vblendvps %xmm2, %xmm3, %xmm1, %xmm1
+; AVX-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm0[1,0]
+; AVX-NEXT:    vmaxss %xmm1, %xmm3, %xmm1
+; AVX-NEXT:    vblendvps %xmm2, %xmm3, %xmm1, %xmm1
+; AVX-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; AVX-NEXT:    vmaxss %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
-; AVX512-LABEL: test_v16f32:
-; AVX512:       # %bb.0:
-; AVX512-NEXT:    vextractf64x4 $1, %zmm0, %ymm1
-; AVX512-NEXT:    vmaxps %zmm1, %zmm0, %zmm0
-; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vmaxps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX512-NEXT:    vmaxss %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vzeroupper
-; AVX512-NEXT:    retq
+; AVX512BW-LABEL: test_v16f32:
+; AVX512BW:       # %bb.0:
+; AVX512BW-NEXT:    vextractf128 $1, %ymm0, %xmm1
+; AVX512BW-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
+; AVX512BW-NEXT:    vmaxss %xmm0, %xmm2, %xmm3
+; AVX512BW-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vmovss %xmm2, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vmaxss %xmm3, %xmm2, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm2, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vpermilps {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vmaxss %xmm3, %xmm2, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm2, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vmaxss %xmm3, %xmm1, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm1, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vmaxss %xmm3, %xmm2, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm2, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vmaxss %xmm3, %xmm2, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm2, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vextractf32x4 $2, %zmm0, %xmm2
+; AVX512BW-NEXT:    vpermilps {{.*#+}} xmm1 = xmm1[3,3,3,3]
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vmaxss %xmm3, %xmm1, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm1, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm2[1,1,3,3]
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vmaxss %xmm3, %xmm2, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm2, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vmaxss %xmm3, %xmm1, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm1, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm2[1,0]
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vmaxss %xmm3, %xmm1, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm1, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vextractf32x4 $3, %zmm0, %xmm0
+; AVX512BW-NEXT:    vpermilps {{.*#+}} xmm1 = xmm2[3,3,3,3]
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vmaxss %xmm3, %xmm1, %xmm2
+; AVX512BW-NEXT:    vmovss %xmm1, %xmm2, %xmm2 {%k1}
+; AVX512BW-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; AVX512BW-NEXT:    vcmpunordss %xmm2, %xmm2, %k1
+; AVX512BW-NEXT:    vmaxss %xmm2, %xmm0, %xmm2
+; AVX512BW-NEXT:    vmovss %xmm0, %xmm2, %xmm2 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm2, %xmm2, %k1
+; AVX512BW-NEXT:    vmaxss %xmm2, %xmm1, %xmm2
+; AVX512BW-NEXT:    vmovss %xmm1, %xmm2, %xmm2 {%k1}
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
+; AVX512BW-NEXT:    vcmpunordss %xmm2, %xmm2, %k1
+; AVX512BW-NEXT:    vmaxss %xmm2, %xmm1, %xmm2
+; AVX512BW-NEXT:    vmovss %xmm1, %xmm2, %xmm2 {%k1}
+; AVX512BW-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[3,3,3,3]
+; AVX512BW-NEXT:    vcmpunordss %xmm2, %xmm2, %k1
+; AVX512BW-NEXT:    vmaxss %xmm2, %xmm1, %xmm0
+; AVX512BW-NEXT:    vmovss %xmm1, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vzeroupper
+; AVX512BW-NEXT:    retq
+;
+; AVX512VL-LABEL: test_v16f32:
+; AVX512VL:       # %bb.0:
+; AVX512VL-NEXT:    vextractf32x4 $3, %zmm0, %xmm3
+; AVX512VL-NEXT:    vpermilps {{.*#+}} xmm8 = xmm3[3,3,3,3]
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm9 = xmm3[1,0]
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} xmm10 = xmm3[1,1,3,3]
+; AVX512VL-NEXT:    vextractf32x4 $2, %zmm0, %xmm6
+; AVX512VL-NEXT:    vpermilps {{.*#+}} xmm11 = xmm6[3,3,3,3]
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm12 = xmm6[1,0]
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} xmm13 = xmm6[1,1,3,3]
+; AVX512VL-NEXT:    vextractf128 $1, %ymm0, %xmm2
+; AVX512VL-NEXT:    vpermilps {{.*#+}} xmm14 = xmm2[3,3,3,3]
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm15 = xmm2[1,0]
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} xmm16 = xmm2[1,1,3,3]
+; AVX512VL-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[3,3,3,3]
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm4 = xmm0[1,0]
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} xmm5 = xmm0[1,1,3,3]
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm5, %xmm7
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmovss %xmm5, %xmm7, %xmm7 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm7, %xmm7, %k1
+; AVX512VL-NEXT:    vmaxss %xmm7, %xmm4, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm4, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm1, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm1, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm2, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm2, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm16, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm16, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm15, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm15, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm14, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm14, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm6, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm6, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm13, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm13, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm12, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm12, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm11, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm11, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm3, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm3, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm10, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm10, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm9, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm9, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxss %xmm0, %xmm8, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm8, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vzeroupper
+; AVX512VL-NEXT:    retq
   %1 = call float @llvm.experimental.vector.reduce.fmax.v16f32(<16 x float> %a0)
   ret float %1
 }
@@ -186,50 +672,106 @@ define float @test_v16f32(<16 x float> %a0) {
 define double @test_v2f64(<2 x double> %a0) {
 ; SSE-LABEL: test_v2f64:
 ; SSE:       # %bb.0:
+; SSE-NEXT:    movapd %xmm0, %xmm2
+; SSE-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm0[1]
 ; SSE-NEXT:    movapd %xmm0, %xmm1
-; SSE-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE-NEXT:    maxsd %xmm1, %xmm0
+; SSE-NEXT:    cmpunordsd %xmm0, %xmm1
+; SSE-NEXT:    movapd %xmm1, %xmm3
+; SSE-NEXT:    andpd %xmm2, %xmm3
+; SSE-NEXT:    maxsd %xmm0, %xmm2
+; SSE-NEXT:    andnpd %xmm2, %xmm1
+; SSE-NEXT:    orpd %xmm3, %xmm1
+; SSE-NEXT:    movapd %xmm1, %xmm0
 ; SSE-NEXT:    retq
 ;
 ; AVX-LABEL: test_v2f64:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vmaxsd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vmaxsd %xmm0, %xmm1, %xmm2
+; AVX-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm0
+; AVX-NEXT:    vblendvpd %xmm0, %xmm1, %xmm2, %xmm0
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v2f64:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vmaxsd %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX512-NEXT:    vmaxsd %xmm0, %xmm2, %xmm1
+; AVX512-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmovsd %xmm2, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vmovapd %xmm1, %xmm0
 ; AVX512-NEXT:    retq
   %1 = call double @llvm.experimental.vector.reduce.fmax.v2f64(<2 x double> %a0)
   ret double %1
 }
 
 define double @test_v4f64(<4 x double> %a0) {
-; SSE-LABEL: test_v4f64:
-; SSE:       # %bb.0:
-; SSE-NEXT:    maxpd %xmm1, %xmm0
-; SSE-NEXT:    movapd %xmm0, %xmm1
-; SSE-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE-NEXT:    maxsd %xmm1, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_v4f64:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movapd %xmm1, %xmm2
+; SSE2-NEXT:    maxpd %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE2-NEXT:    andpd %xmm0, %xmm1
+; SSE2-NEXT:    andnpd %xmm2, %xmm0
+; SSE2-NEXT:    orpd %xmm1, %xmm0
+; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; SSE2-NEXT:    movapd %xmm0, %xmm1
+; SSE2-NEXT:    cmpunordsd %xmm0, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm3
+; SSE2-NEXT:    andpd %xmm2, %xmm3
+; SSE2-NEXT:    maxsd %xmm0, %xmm2
+; SSE2-NEXT:    andnpd %xmm2, %xmm1
+; SSE2-NEXT:    orpd %xmm3, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE41-LABEL: test_v4f64:
+; SSE41:       # %bb.0:
+; SSE41-NEXT:    movapd %xmm1, %xmm2
+; SSE41-NEXT:    maxpd %xmm0, %xmm2
+; SSE41-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm1, %xmm2
+; SSE41-NEXT:    movapd %xmm2, %xmm1
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm2[1]
+; SSE41-NEXT:    movapd %xmm2, %xmm0
+; SSE41-NEXT:    cmpunordsd %xmm2, %xmm0
+; SSE41-NEXT:    movapd %xmm0, %xmm3
+; SSE41-NEXT:    andpd %xmm1, %xmm3
+; SSE41-NEXT:    maxsd %xmm2, %xmm1
+; SSE41-NEXT:    andnpd %xmm1, %xmm0
+; SSE41-NEXT:    orpd %xmm3, %xmm0
+; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v4f64:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vmaxpd %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vmaxsd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm0[1,0]
+; AVX-NEXT:    vmaxsd %xmm0, %xmm3, %xmm4
+; AVX-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm0
+; AVX-NEXT:    vblendvpd %xmm0, %xmm3, %xmm4, %xmm0
+; AVX-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm3
+; AVX-NEXT:    vmaxsd %xmm0, %xmm1, %xmm0
+; AVX-NEXT:    vblendvpd %xmm3, %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm1
+; AVX-NEXT:    vmaxsd %xmm0, %xmm2, %xmm0
+; AVX-NEXT:    vblendvpd %xmm1, %xmm2, %xmm0, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v4f64:
 ; AVX512:       # %bb.0:
 ; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vmaxpd %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vmaxsd %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm0[1,0]
+; AVX512-NEXT:    vmaxsd %xmm0, %xmm3, %xmm4
+; AVX512-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmovsd %xmm3, %xmm4, %xmm4 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm4, %xmm4, %k1
+; AVX512-NEXT:    vmaxsd %xmm4, %xmm1, %xmm0
+; AVX512-NEXT:    vmovsd %xmm1, %xmm0, %xmm0 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmaxsd %xmm0, %xmm2, %xmm0
+; AVX512-NEXT:    vmovsd %xmm2, %xmm0, %xmm0 {%k1}
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
   %1 = call double @llvm.experimental.vector.reduce.fmax.v4f64(<4 x double> %a0)
@@ -237,83 +779,325 @@ define double @test_v4f64(<4 x double> %a0) {
 }
 
 define double @test_v8f64(<8 x double> %a0) {
-; SSE-LABEL: test_v8f64:
-; SSE:       # %bb.0:
-; SSE-NEXT:    maxpd %xmm3, %xmm1
-; SSE-NEXT:    maxpd %xmm2, %xmm0
-; SSE-NEXT:    maxpd %xmm1, %xmm0
-; SSE-NEXT:    movapd %xmm0, %xmm1
-; SSE-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE-NEXT:    maxsd %xmm1, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_v8f64:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movapd %xmm2, %xmm4
+; SSE2-NEXT:    maxpd %xmm0, %xmm4
+; SSE2-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE2-NEXT:    andpd %xmm0, %xmm2
+; SSE2-NEXT:    andnpd %xmm4, %xmm0
+; SSE2-NEXT:    orpd %xmm2, %xmm0
+; SSE2-NEXT:    movapd %xmm3, %xmm2
+; SSE2-NEXT:    maxpd %xmm1, %xmm2
+; SSE2-NEXT:    cmpunordpd %xmm1, %xmm1
+; SSE2-NEXT:    andpd %xmm1, %xmm3
+; SSE2-NEXT:    andnpd %xmm2, %xmm1
+; SSE2-NEXT:    orpd %xmm3, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm2
+; SSE2-NEXT:    maxpd %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE2-NEXT:    andpd %xmm0, %xmm1
+; SSE2-NEXT:    andnpd %xmm2, %xmm0
+; SSE2-NEXT:    orpd %xmm1, %xmm0
+; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; SSE2-NEXT:    movapd %xmm0, %xmm1
+; SSE2-NEXT:    cmpunordsd %xmm0, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm3
+; SSE2-NEXT:    andpd %xmm2, %xmm3
+; SSE2-NEXT:    maxsd %xmm0, %xmm2
+; SSE2-NEXT:    andnpd %xmm2, %xmm1
+; SSE2-NEXT:    orpd %xmm3, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE41-LABEL: test_v8f64:
+; SSE41:       # %bb.0:
+; SSE41-NEXT:    movapd %xmm2, %xmm4
+; SSE41-NEXT:    maxpd %xmm0, %xmm4
+; SSE41-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm2, %xmm4
+; SSE41-NEXT:    movapd %xmm3, %xmm2
+; SSE41-NEXT:    maxpd %xmm1, %xmm2
+; SSE41-NEXT:    cmpunordpd %xmm1, %xmm1
+; SSE41-NEXT:    movapd %xmm1, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm3, %xmm2
+; SSE41-NEXT:    movapd %xmm2, %xmm1
+; SSE41-NEXT:    maxpd %xmm4, %xmm1
+; SSE41-NEXT:    cmpunordpd %xmm4, %xmm4
+; SSE41-NEXT:    movapd %xmm4, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm2, %xmm1
+; SSE41-NEXT:    movapd %xmm1, %xmm2
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
+; SSE41-NEXT:    movapd %xmm1, %xmm0
+; SSE41-NEXT:    cmpunordsd %xmm1, %xmm0
+; SSE41-NEXT:    movapd %xmm0, %xmm3
+; SSE41-NEXT:    andpd %xmm2, %xmm3
+; SSE41-NEXT:    maxsd %xmm1, %xmm2
+; SSE41-NEXT:    andnpd %xmm2, %xmm0
+; SSE41-NEXT:    orpd %xmm3, %xmm0
+; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v8f64:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vmaxpd %ymm1, %ymm0, %ymm0
-; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vmaxpd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vmaxpd %ymm0, %ymm1, %ymm2
+; AVX-NEXT:    vcmpunordpd %ymm0, %ymm0, %ymm0
+; AVX-NEXT:    vblendvpd %ymm0, %ymm1, %ymm2, %ymm0
 ; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vmaxsd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vmaxsd %xmm0, %xmm1, %xmm2
+; AVX-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm3
+; AVX-NEXT:    vblendvpd %xmm3, %xmm1, %xmm2, %xmm1
+; AVX-NEXT:    vcmpunordsd %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm0
+; AVX-NEXT:    vmaxsd %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm1
+; AVX-NEXT:    vcmpunordsd %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
+; AVX-NEXT:    vmaxsd %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
-; AVX512-LABEL: test_v8f64:
-; AVX512:       # %bb.0:
-; AVX512-NEXT:    vextractf64x4 $1, %zmm0, %ymm1
-; AVX512-NEXT:    vmaxpd %zmm1, %zmm0, %zmm0
-; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vmaxpd %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vmaxsd %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vzeroupper
-; AVX512-NEXT:    retq
+; AVX512BW-LABEL: test_v8f64:
+; AVX512BW:       # %bb.0:
+; AVX512BW-NEXT:    vextractf32x4 $3, %zmm0, %xmm2
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm8 = xmm2[1,0]
+; AVX512BW-NEXT:    vextractf32x4 $2, %zmm0, %xmm3
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm4 = xmm3[1,0]
+; AVX512BW-NEXT:    vextractf128 $1, %ymm0, %xmm5
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm6 = xmm5[1,0]
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm7 = xmm0[1,0]
+; AVX512BW-NEXT:    vmaxsd %xmm0, %xmm7, %xmm1
+; AVX512BW-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vmovsd %xmm7, %xmm1, %xmm1 {%k1}
+; AVX512BW-NEXT:    vcmpunordsd %xmm1, %xmm1, %k1
+; AVX512BW-NEXT:    vmaxsd %xmm1, %xmm5, %xmm0
+; AVX512BW-NEXT:    vmovsd %xmm5, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vmaxsd %xmm0, %xmm6, %xmm0
+; AVX512BW-NEXT:    vmovsd %xmm6, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vmaxsd %xmm0, %xmm3, %xmm0
+; AVX512BW-NEXT:    vmovsd %xmm3, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vmaxsd %xmm0, %xmm4, %xmm0
+; AVX512BW-NEXT:    vmovsd %xmm4, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vmaxsd %xmm0, %xmm2, %xmm0
+; AVX512BW-NEXT:    vmovsd %xmm2, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vmaxsd %xmm0, %xmm8, %xmm0
+; AVX512BW-NEXT:    vmovsd %xmm8, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vzeroupper
+; AVX512BW-NEXT:    retq
+;
+; AVX512VL-LABEL: test_v8f64:
+; AVX512VL:       # %bb.0:
+; AVX512VL-NEXT:    vextractf32x4 $3, %zmm0, %xmm1
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm8 = xmm1[1,0]
+; AVX512VL-NEXT:    vextractf32x4 $2, %zmm0, %xmm3
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm4 = xmm3[1,0]
+; AVX512VL-NEXT:    vextractf128 $1, %ymm0, %xmm5
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm6 = xmm5[1,0]
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm7 = xmm0[1,0]
+; AVX512VL-NEXT:    vmaxsd %xmm0, %xmm7, %xmm2
+; AVX512VL-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmovsd %xmm7, %xmm2, %xmm2 {%k1}
+; AVX512VL-NEXT:    vcmpunordsd %xmm2, %xmm2, %k1
+; AVX512VL-NEXT:    vmaxsd %xmm2, %xmm5, %xmm0
+; AVX512VL-NEXT:    vmovsd %xmm5, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxsd %xmm0, %xmm6, %xmm0
+; AVX512VL-NEXT:    vmovsd %xmm6, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxsd %xmm0, %xmm3, %xmm0
+; AVX512VL-NEXT:    vmovsd %xmm3, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxsd %xmm0, %xmm4, %xmm0
+; AVX512VL-NEXT:    vmovsd %xmm4, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxsd %xmm0, %xmm1, %xmm0
+; AVX512VL-NEXT:    vmovsd %xmm1, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmaxsd %xmm0, %xmm8, %xmm0
+; AVX512VL-NEXT:    vmovsd %xmm8, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vzeroupper
+; AVX512VL-NEXT:    retq
   %1 = call double @llvm.experimental.vector.reduce.fmax.v8f64(<8 x double> %a0)
   ret double %1
 }
 
 define double @test_v16f64(<16 x double> %a0) {
-; SSE-LABEL: test_v16f64:
-; SSE:       # %bb.0:
-; SSE-NEXT:    maxpd %xmm6, %xmm2
-; SSE-NEXT:    maxpd %xmm4, %xmm0
-; SSE-NEXT:    maxpd %xmm2, %xmm0
-; SSE-NEXT:    maxpd %xmm7, %xmm3
-; SSE-NEXT:    maxpd %xmm5, %xmm1
-; SSE-NEXT:    maxpd %xmm3, %xmm1
-; SSE-NEXT:    maxpd %xmm1, %xmm0
-; SSE-NEXT:    movapd %xmm0, %xmm1
-; SSE-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE-NEXT:    maxsd %xmm1, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_v16f64:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movapd %xmm4, %xmm8
+; SSE2-NEXT:    maxpd %xmm0, %xmm8
+; SSE2-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE2-NEXT:    andpd %xmm0, %xmm4
+; SSE2-NEXT:    andnpd %xmm8, %xmm0
+; SSE2-NEXT:    orpd %xmm4, %xmm0
+; SSE2-NEXT:    movapd %xmm6, %xmm4
+; SSE2-NEXT:    maxpd %xmm2, %xmm4
+; SSE2-NEXT:    cmpunordpd %xmm2, %xmm2
+; SSE2-NEXT:    andpd %xmm2, %xmm6
+; SSE2-NEXT:    andnpd %xmm4, %xmm2
+; SSE2-NEXT:    orpd %xmm6, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm4
+; SSE2-NEXT:    maxpd %xmm0, %xmm4
+; SSE2-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE2-NEXT:    andpd %xmm0, %xmm2
+; SSE2-NEXT:    andnpd %xmm4, %xmm0
+; SSE2-NEXT:    orpd %xmm2, %xmm0
+; SSE2-NEXT:    movapd %xmm5, %xmm2
+; SSE2-NEXT:    maxpd %xmm1, %xmm2
+; SSE2-NEXT:    cmpunordpd %xmm1, %xmm1
+; SSE2-NEXT:    andpd %xmm1, %xmm5
+; SSE2-NEXT:    andnpd %xmm2, %xmm1
+; SSE2-NEXT:    orpd %xmm5, %xmm1
+; SSE2-NEXT:    movapd %xmm7, %xmm2
+; SSE2-NEXT:    maxpd %xmm3, %xmm2
+; SSE2-NEXT:    cmpunordpd %xmm3, %xmm3
+; SSE2-NEXT:    andpd %xmm3, %xmm7
+; SSE2-NEXT:    andnpd %xmm2, %xmm3
+; SSE2-NEXT:    orpd %xmm7, %xmm3
+; SSE2-NEXT:    movapd %xmm3, %xmm2
+; SSE2-NEXT:    maxpd %xmm1, %xmm2
+; SSE2-NEXT:    cmpunordpd %xmm1, %xmm1
+; SSE2-NEXT:    andpd %xmm1, %xmm3
+; SSE2-NEXT:    andnpd %xmm2, %xmm1
+; SSE2-NEXT:    orpd %xmm3, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm2
+; SSE2-NEXT:    maxpd %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE2-NEXT:    andpd %xmm0, %xmm1
+; SSE2-NEXT:    andnpd %xmm2, %xmm0
+; SSE2-NEXT:    orpd %xmm1, %xmm0
+; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; SSE2-NEXT:    movapd %xmm0, %xmm1
+; SSE2-NEXT:    cmpunordsd %xmm0, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm3
+; SSE2-NEXT:    andpd %xmm2, %xmm3
+; SSE2-NEXT:    maxsd %xmm0, %xmm2
+; SSE2-NEXT:    andnpd %xmm2, %xmm1
+; SSE2-NEXT:    orpd %xmm3, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE41-LABEL: test_v16f64:
+; SSE41:       # %bb.0:
+; SSE41-NEXT:    movapd %xmm3, %xmm8
+; SSE41-NEXT:    movapd %xmm4, %xmm3
+; SSE41-NEXT:    maxpd %xmm0, %xmm3
+; SSE41-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm4, %xmm3
+; SSE41-NEXT:    movapd %xmm6, %xmm4
+; SSE41-NEXT:    maxpd %xmm2, %xmm4
+; SSE41-NEXT:    cmpunordpd %xmm2, %xmm2
+; SSE41-NEXT:    movapd %xmm2, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm6, %xmm4
+; SSE41-NEXT:    movapd %xmm4, %xmm2
+; SSE41-NEXT:    maxpd %xmm3, %xmm2
+; SSE41-NEXT:    cmpunordpd %xmm3, %xmm3
+; SSE41-NEXT:    movapd %xmm3, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm4, %xmm2
+; SSE41-NEXT:    movapd %xmm5, %xmm3
+; SSE41-NEXT:    maxpd %xmm1, %xmm3
+; SSE41-NEXT:    cmpunordpd %xmm1, %xmm1
+; SSE41-NEXT:    movapd %xmm1, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm5, %xmm3
+; SSE41-NEXT:    movapd %xmm7, %xmm1
+; SSE41-NEXT:    maxpd %xmm8, %xmm1
+; SSE41-NEXT:    cmpunordpd %xmm8, %xmm8
+; SSE41-NEXT:    movapd %xmm8, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm7, %xmm1
+; SSE41-NEXT:    movapd %xmm1, %xmm4
+; SSE41-NEXT:    maxpd %xmm3, %xmm4
+; SSE41-NEXT:    cmpunordpd %xmm3, %xmm3
+; SSE41-NEXT:    movapd %xmm3, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm1, %xmm4
+; SSE41-NEXT:    movapd %xmm4, %xmm1
+; SSE41-NEXT:    maxpd %xmm2, %xmm1
+; SSE41-NEXT:    cmpunordpd %xmm2, %xmm2
+; SSE41-NEXT:    movapd %xmm2, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm4, %xmm1
+; SSE41-NEXT:    movapd %xmm1, %xmm2
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
+; SSE41-NEXT:    movapd %xmm1, %xmm0
+; SSE41-NEXT:    cmpunordsd %xmm1, %xmm0
+; SSE41-NEXT:    movapd %xmm0, %xmm3
+; SSE41-NEXT:    andpd %xmm2, %xmm3
+; SSE41-NEXT:    maxsd %xmm1, %xmm2
+; SSE41-NEXT:    andnpd %xmm2, %xmm0
+; SSE41-NEXT:    orpd %xmm3, %xmm0
+; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v16f64:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vmaxpd %ymm3, %ymm1, %ymm1
-; AVX-NEXT:    vmaxpd %ymm2, %ymm0, %ymm0
-; AVX-NEXT:    vmaxpd %ymm1, %ymm0, %ymm0
-; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vmaxpd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vmaxpd %ymm0, %ymm2, %ymm4
+; AVX-NEXT:    vcmpunordpd %ymm0, %ymm0, %ymm0
+; AVX-NEXT:    vblendvpd %ymm0, %ymm2, %ymm4, %ymm0
+; AVX-NEXT:    vmaxpd %ymm1, %ymm3, %ymm2
+; AVX-NEXT:    vcmpunordpd %ymm1, %ymm1, %ymm1
+; AVX-NEXT:    vblendvpd %ymm1, %ymm3, %ymm2, %ymm1
+; AVX-NEXT:    vmaxpd %ymm0, %ymm1, %ymm2
+; AVX-NEXT:    vcmpunordpd %ymm0, %ymm0, %ymm0
+; AVX-NEXT:    vblendvpd %ymm0, %ymm1, %ymm2, %ymm0
 ; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vmaxsd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vmaxsd %xmm0, %xmm1, %xmm2
+; AVX-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm3
+; AVX-NEXT:    vblendvpd %xmm3, %xmm1, %xmm2, %xmm1
+; AVX-NEXT:    vcmpunordsd %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm0
+; AVX-NEXT:    vmaxsd %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm1
+; AVX-NEXT:    vcmpunordsd %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
+; AVX-NEXT:    vmaxsd %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v16f64:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vmaxpd %zmm1, %zmm0, %zmm0
-; AVX512-NEXT:    vextractf64x4 $1, %zmm0, %ymm1
-; AVX512-NEXT:    vmaxpd %zmm1, %zmm0, %zmm0
-; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vmaxpd %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vmaxsd %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vmaxpd %zmm0, %zmm1, %zmm2
+; AVX512-NEXT:    vcmpunordpd %zmm0, %zmm0, %k1
+; AVX512-NEXT:    vmovapd %zmm1, %zmm2 {%k1}
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm2[1,0]
+; AVX512-NEXT:    vmaxsd %xmm2, %xmm0, %xmm1
+; AVX512-NEXT:    vcmpunordsd %xmm2, %xmm2, %k1
+; AVX512-NEXT:    vmovsd %xmm0, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm1, %xmm1, %k1
+; AVX512-NEXT:    vextractf128 $1, %ymm2, %xmm0
+; AVX512-NEXT:    vmaxsd %xmm1, %xmm0, %xmm1
+; AVX512-NEXT:    vmovsd %xmm0, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm1, %xmm1, %k1
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
+; AVX512-NEXT:    vmaxsd %xmm1, %xmm0, %xmm1
+; AVX512-NEXT:    vmovsd %xmm0, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm1, %xmm1, %k1
+; AVX512-NEXT:    vextractf32x4 $2, %zmm2, %xmm0
+; AVX512-NEXT:    vmaxsd %xmm1, %xmm0, %xmm1
+; AVX512-NEXT:    vmovsd %xmm0, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm1, %xmm1, %k1
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
+; AVX512-NEXT:    vmaxsd %xmm1, %xmm0, %xmm1
+; AVX512-NEXT:    vmovsd %xmm0, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm1, %xmm1, %k1
+; AVX512-NEXT:    vextractf32x4 $3, %zmm2, %xmm0
+; AVX512-NEXT:    vmaxsd %xmm1, %xmm0, %xmm1
+; AVX512-NEXT:    vmovsd %xmm0, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm1, %xmm1, %k1
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX512-NEXT:    vmaxsd %xmm1, %xmm2, %xmm0
+; AVX512-NEXT:    vmovsd %xmm2, %xmm0, %xmm0 {%k1}
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
   %1 = call double @llvm.experimental.vector.reduce.fmax.v16f64(<16 x double> %a0)
   ret double %1
 }
 
+declare float @llvm.experimental.vector.reduce.fmax.v1f32(<1 x float>)
 declare float @llvm.experimental.vector.reduce.fmax.v2f32(<2 x float>)
+declare float @llvm.experimental.vector.reduce.fmax.v3f32(<3 x float>)
 declare float @llvm.experimental.vector.reduce.fmax.v4f32(<4 x float>)
 declare float @llvm.experimental.vector.reduce.fmax.v8f32(<8 x float>)
 declare float @llvm.experimental.vector.reduce.fmax.v16f32(<16 x float>)

diff  --git a/llvm/test/CodeGen/X86/vector-reduce-fmin-nnan.ll b/llvm/test/CodeGen/X86/vector-reduce-fmin-nnan.ll
index f25852f0c6a8..28e812748aba 100644
--- a/llvm/test/CodeGen/X86/vector-reduce-fmin-nnan.ll
+++ b/llvm/test/CodeGen/X86/vector-reduce-fmin-nnan.ll
@@ -10,68 +10,176 @@
 ; vXf32
 ;
 
+define float @test_v1f32(<1 x float> %a0) {
+; ALL-LABEL: test_v1f32:
+; ALL:       # %bb.0:
+; ALL-NEXT:    retq
+  %1 = call nnan float @llvm.experimental.vector.reduce.fmin.v1f32(<1 x float> %a0)
+  ret float %1
+}
+
 define float @test_v2f32(<2 x float> %a0) {
 ; SSE2-LABEL: test_v2f32:
 ; SSE2:       # %bb.0:
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    shufps {{.*#+}} xmm2 = xmm2[1,1],xmm0[1,1]
 ; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
-; SSE2-NEXT:    minss %xmm1, %xmm0
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andps %xmm2, %xmm3
+; SSE2-NEXT:    minss %xmm0, %xmm2
+; SSE2-NEXT:    andnps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
 ; SSE2-NEXT:    retq
 ;
 ; SSE41-LABEL: test_v2f32:
 ; SSE41:       # %bb.0:
-; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; SSE41-NEXT:    minss %xmm1, %xmm0
+; SSE41-NEXT:    movshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
+; SSE41-NEXT:    movaps %xmm0, %xmm1
+; SSE41-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm3
+; SSE41-NEXT:    andps %xmm2, %xmm3
+; SSE41-NEXT:    minss %xmm0, %xmm2
+; SSE41-NEXT:    andnps %xmm2, %xmm1
+; SSE41-NEXT:    orps %xmm3, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v2f32:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX-NEXT:    vminss %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vminss %xmm0, %xmm1, %xmm2
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm0
+; AVX-NEXT:    vblendvps %xmm0, %xmm1, %xmm2, %xmm0
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v2f32:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX512-NEXT:    vminss %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vminss %xmm0, %xmm2, %xmm1
+; AVX512-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmovss %xmm2, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vmovaps %xmm1, %xmm0
 ; AVX512-NEXT:    retq
   %1 = call nnan float @llvm.experimental.vector.reduce.fmin.v2f32(<2 x float> %a0)
   ret float %1
 }
 
+define float @test_v3f32(<3 x float> %a0) {
+; SSE2-LABEL: test_v3f32:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    shufps {{.*#+}} xmm2 = xmm2[1,1],xmm0[1,1]
+; SSE2-NEXT:    movaps %xmm0, %xmm1
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andps %xmm2, %xmm3
+; SSE2-NEXT:    minss %xmm0, %xmm2
+; SSE2-NEXT:    andnps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movhlps {{.*#+}} xmm0 = xmm0[1,1]
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    minss %xmm1, %xmm2
+; SSE2-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andnps %xmm2, %xmm3
+; SSE2-NEXT:    andps %xmm0, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE41-LABEL: test_v3f32:
+; SSE41:       # %bb.0:
+; SSE41-NEXT:    movshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
+; SSE41-NEXT:    movaps %xmm0, %xmm1
+; SSE41-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm3
+; SSE41-NEXT:    andps %xmm2, %xmm3
+; SSE41-NEXT:    minss %xmm0, %xmm2
+; SSE41-NEXT:    andnps %xmm2, %xmm1
+; SSE41-NEXT:    orps %xmm3, %xmm1
+; SSE41-NEXT:    movhlps {{.*#+}} xmm0 = xmm0[1,1]
+; SSE41-NEXT:    movaps %xmm0, %xmm2
+; SSE41-NEXT:    minss %xmm1, %xmm2
+; SSE41-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm3
+; SSE41-NEXT:    andnps %xmm2, %xmm3
+; SSE41-NEXT:    andps %xmm0, %xmm1
+; SSE41-NEXT:    orps %xmm3, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm0
+; SSE41-NEXT:    retq
+;
+; AVX-LABEL: test_v3f32:
+; AVX:       # %bb.0:
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; AVX-NEXT:    vminss %xmm0, %xmm1, %xmm2
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm3
+; AVX-NEXT:    vblendvps %xmm3, %xmm1, %xmm2, %xmm1
+; AVX-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
+; AVX-NEXT:    vminss %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm0
+; AVX-NEXT:    retq
+;
+; AVX512-LABEL: test_v3f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vminss %xmm0, %xmm1, %xmm2
+; AVX512-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmovss %xmm1, %xmm2, %xmm2 {%k1}
+; AVX512-NEXT:    vcmpunordss %xmm2, %xmm2, %k1
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
+; AVX512-NEXT:    vminss %xmm2, %xmm1, %xmm0
+; AVX512-NEXT:    vmovss %xmm1, %xmm0, %xmm0 {%k1}
+; AVX512-NEXT:    retq
+  %1 = call nnan float @llvm.experimental.vector.reduce.fmin.v3f32(<3 x float> %a0)
+  ret float %1
+}
+
 define float @test_v4f32(<4 x float> %a0) {
 ; SSE2-LABEL: test_v4f32:
 ; SSE2:       # %bb.0:
 ; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE2-NEXT:    minps %xmm1, %xmm0
-; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
+; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[3,3],xmm0[3,3]
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm0[1]
+; SSE2-NEXT:    movaps %xmm0, %xmm3
+; SSE2-NEXT:    shufps {{.*#+}} xmm3 = xmm3[1,1],xmm0[1,1]
+; SSE2-NEXT:    minss %xmm3, %xmm0
+; SSE2-NEXT:    minss %xmm2, %xmm0
 ; SSE2-NEXT:    minss %xmm1, %xmm0
 ; SSE2-NEXT:    retq
 ;
 ; SSE41-LABEL: test_v4f32:
 ; SSE41:       # %bb.0:
 ; SSE41-NEXT:    movaps %xmm0, %xmm1
-; SSE41-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE41-NEXT:    minps %xmm1, %xmm0
-; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; SSE41-NEXT:    shufps {{.*#+}} xmm1 = xmm1[3,3],xmm0[3,3]
+; SSE41-NEXT:    movaps %xmm0, %xmm2
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm0[1]
+; SSE41-NEXT:    movshdup {{.*#+}} xmm3 = xmm0[1,1,3,3]
+; SSE41-NEXT:    minss %xmm3, %xmm0
+; SSE41-NEXT:    minss %xmm2, %xmm0
 ; SSE41-NEXT:    minss %xmm1, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v4f32:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; AVX-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[3,3,3,3]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm0[1,1,3,3]
+; AVX-NEXT:    vminss %xmm3, %xmm0, %xmm0
+; AVX-NEXT:    vminss %xmm2, %xmm0, %xmm0
 ; AVX-NEXT:    vminss %xmm1, %xmm0, %xmm0
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v4f32:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[3,3,3,3]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vminss %xmm3, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm2, %xmm0, %xmm0
 ; AVX512-NEXT:    vminss %xmm1, %xmm0, %xmm0
 ; AVX512-NEXT:    retq
   %1 = call nnan float @llvm.experimental.vector.reduce.fmin.v4f32(<4 x float> %a0)
@@ -82,43 +190,67 @@ define float @test_v8f32(<8 x float> %a0) {
 ; SSE2-LABEL: test_v8f32:
 ; SSE2:       # %bb.0:
 ; SSE2-NEXT:    minps %xmm1, %xmm0
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    shufps {{.*#+}} xmm2 = xmm2[1,1],xmm0[1,1]
 ; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE2-NEXT:    minps %xmm1, %xmm0
-; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
-; SSE2-NEXT:    minss %xmm1, %xmm0
+; SSE2-NEXT:    minss %xmm2, %xmm1
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm0[1]
+; SSE2-NEXT:    minss %xmm2, %xmm1
+; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; SSE2-NEXT:    minss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
 ; SSE2-NEXT:    retq
 ;
 ; SSE41-LABEL: test_v8f32:
 ; SSE41:       # %bb.0:
 ; SSE41-NEXT:    minps %xmm1, %xmm0
+; SSE41-NEXT:    movshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
 ; SSE41-NEXT:    movaps %xmm0, %xmm1
-; SSE41-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE41-NEXT:    minps %xmm1, %xmm0
-; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; SSE41-NEXT:    minss %xmm1, %xmm0
+; SSE41-NEXT:    minss %xmm2, %xmm1
+; SSE41-NEXT:    movaps %xmm0, %xmm2
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm0[1]
+; SSE41-NEXT:    minss %xmm2, %xmm1
+; SSE41-NEXT:    shufps {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; SSE41-NEXT:    minss %xmm0, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v8f32:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; AVX-NEXT:    vpermilps {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm1[1,0]
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm4 = xmm1[1,1,3,3]
+; AVX-NEXT:    vpermilps {{.*#+}} xmm5 = xmm0[3,3,3,3]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm6 = xmm0[1,0]
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm7 = xmm0[1,1,3,3]
+; AVX-NEXT:    vminss %xmm7, %xmm0, %xmm0
+; AVX-NEXT:    vminss %xmm6, %xmm0, %xmm0
+; AVX-NEXT:    vminss %xmm5, %xmm0, %xmm0
 ; AVX-NEXT:    vminss %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vminss %xmm4, %xmm0, %xmm0
+; AVX-NEXT:    vminss %xmm3, %xmm0, %xmm0
+; AVX-NEXT:    vminss %xmm2, %xmm0, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v8f32:
 ; AVX512:       # %bb.0:
 ; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vpermilps {{.*#+}} xmm2 = xmm1[3,3,3,3]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm1[1,0]
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm4 = xmm1[1,1,3,3]
+; AVX512-NEXT:    vpermilps {{.*#+}} xmm5 = xmm0[3,3,3,3]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm6 = xmm0[1,0]
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm7 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vminss %xmm7, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm6, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm5, %xmm0, %xmm0
 ; AVX512-NEXT:    vminss %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm4, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm3, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm2, %xmm0, %xmm0
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
   %1 = call nnan float @llvm.experimental.vector.reduce.fmin.v8f32(<8 x float> %a0)
@@ -131,12 +263,16 @@ define float @test_v16f32(<16 x float> %a0) {
 ; SSE2-NEXT:    minps %xmm3, %xmm1
 ; SSE2-NEXT:    minps %xmm2, %xmm0
 ; SSE2-NEXT:    minps %xmm1, %xmm0
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    shufps {{.*#+}} xmm2 = xmm2[1,1],xmm0[1,1]
 ; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE2-NEXT:    minps %xmm1, %xmm0
-; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
-; SSE2-NEXT:    minss %xmm1, %xmm0
+; SSE2-NEXT:    minss %xmm2, %xmm1
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm0[1]
+; SSE2-NEXT:    minss %xmm2, %xmm1
+; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; SSE2-NEXT:    minss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
 ; SSE2-NEXT:    retq
 ;
 ; SSE41-LABEL: test_v16f32:
@@ -144,35 +280,69 @@ define float @test_v16f32(<16 x float> %a0) {
 ; SSE41-NEXT:    minps %xmm3, %xmm1
 ; SSE41-NEXT:    minps %xmm2, %xmm0
 ; SSE41-NEXT:    minps %xmm1, %xmm0
+; SSE41-NEXT:    movshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
 ; SSE41-NEXT:    movaps %xmm0, %xmm1
-; SSE41-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE41-NEXT:    minps %xmm1, %xmm0
-; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; SSE41-NEXT:    minss %xmm1, %xmm0
+; SSE41-NEXT:    minss %xmm2, %xmm1
+; SSE41-NEXT:    movaps %xmm0, %xmm2
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm0[1]
+; SSE41-NEXT:    minss %xmm2, %xmm1
+; SSE41-NEXT:    shufps {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; SSE41-NEXT:    minss %xmm0, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v16f32:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vminps %ymm1, %ymm0, %ymm0
-; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vminps %xmm1, %xmm0, %xmm0
 ; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX-NEXT:    vminss %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vminss %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX-NEXT:    vminss %xmm2, %xmm1, %xmm1
+; AVX-NEXT:    vpermilps {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; AVX-NEXT:    vminss %xmm2, %xmm1, %xmm1
+; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm0
+; AVX-NEXT:    vminss %xmm0, %xmm1, %xmm1
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
+; AVX-NEXT:    vminss %xmm2, %xmm1, %xmm1
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX-NEXT:    vminss %xmm2, %xmm1, %xmm1
+; AVX-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; AVX-NEXT:    vminss %xmm0, %xmm1, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v16f32:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vextractf64x4 $1, %zmm0, %ymm1
-; AVX512-NEXT:    vminps %zmm1, %zmm0, %zmm0
-; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vextractf32x4 $3, %zmm0, %xmm1
+; AVX512-NEXT:    vpermilps {{.*#+}} xmm8 = xmm1[3,3,3,3]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm9 = xmm1[1,0]
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm10 = xmm1[1,1,3,3]
+; AVX512-NEXT:    vextractf32x4 $2, %zmm0, %xmm5
+; AVX512-NEXT:    vpermilps {{.*#+}} xmm11 = xmm5[3,3,3,3]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm12 = xmm5[1,0]
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm13 = xmm5[1,1,3,3]
+; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm3
+; AVX512-NEXT:    vpermilps {{.*#+}} xmm14 = xmm3[3,3,3,3]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm15 = xmm3[1,0]
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm7 = xmm3[1,1,3,3]
+; AVX512-NEXT:    vpermilps {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm4 = xmm0[1,0]
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm6 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vminss %xmm6, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm4, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm2, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm3, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm7, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm15, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm14, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm5, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm13, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm12, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm11, %xmm0, %xmm0
 ; AVX512-NEXT:    vminss %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm10, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm9, %xmm0, %xmm0
+; AVX512-NEXT:    vminss %xmm8, %xmm0, %xmm0
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
   %1 = call nnan float @llvm.experimental.vector.reduce.fmin.v16f32(<16 x float> %a0)
@@ -218,18 +388,22 @@ define double @test_v4f64(<4 x double> %a0) {
 ; AVX-LABEL: test_v4f64:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vminpd %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm0[1,0]
+; AVX-NEXT:    vminsd %xmm3, %xmm0, %xmm0
 ; AVX-NEXT:    vminsd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vminsd %xmm2, %xmm0, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v4f64:
 ; AVX512:       # %bb.0:
 ; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vminpd %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm0[1,0]
+; AVX512-NEXT:    vminsd %xmm3, %xmm0, %xmm0
 ; AVX512-NEXT:    vminsd %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vminsd %xmm2, %xmm0, %xmm0
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
   %1 = call nnan double @llvm.experimental.vector.reduce.fmin.v4f64(<4 x double> %a0)
@@ -250,21 +424,31 @@ define double @test_v8f64(<8 x double> %a0) {
 ; AVX-LABEL: test_v8f64:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vminpd %ymm1, %ymm0, %ymm0
-; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vminpd %xmm1, %xmm0, %xmm0
 ; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vminsd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vminsd %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm0
+; AVX-NEXT:    vminsd %xmm0, %xmm1, %xmm1
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
+; AVX-NEXT:    vminsd %xmm0, %xmm1, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v8f64:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vextractf64x4 $1, %zmm0, %ymm1
-; AVX512-NEXT:    vminpd %zmm1, %zmm0, %zmm0
-; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vminpd %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
+; AVX512-NEXT:    vextractf32x4 $3, %zmm0, %xmm1
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
+; AVX512-NEXT:    vextractf32x4 $2, %zmm0, %xmm3
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm4 = xmm3[1,0]
+; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm5
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm6 = xmm5[1,0]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm7 = xmm0[1,0]
+; AVX512-NEXT:    vminsd %xmm7, %xmm0, %xmm0
+; AVX512-NEXT:    vminsd %xmm5, %xmm0, %xmm0
+; AVX512-NEXT:    vminsd %xmm6, %xmm0, %xmm0
+; AVX512-NEXT:    vminsd %xmm3, %xmm0, %xmm0
+; AVX512-NEXT:    vminsd %xmm4, %xmm0, %xmm0
 ; AVX512-NEXT:    vminsd %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vminsd %xmm2, %xmm0, %xmm0
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
   %1 = call nnan double @llvm.experimental.vector.reduce.fmin.v8f64(<8 x double> %a0)
@@ -274,12 +458,12 @@ define double @test_v8f64(<8 x double> %a0) {
 define double @test_v16f64(<16 x double> %a0) {
 ; SSE-LABEL: test_v16f64:
 ; SSE:       # %bb.0:
-; SSE-NEXT:    minpd %xmm6, %xmm2
-; SSE-NEXT:    minpd %xmm4, %xmm0
-; SSE-NEXT:    minpd %xmm2, %xmm0
 ; SSE-NEXT:    minpd %xmm7, %xmm3
 ; SSE-NEXT:    minpd %xmm5, %xmm1
 ; SSE-NEXT:    minpd %xmm3, %xmm1
+; SSE-NEXT:    minpd %xmm6, %xmm2
+; SSE-NEXT:    minpd %xmm4, %xmm0
+; SSE-NEXT:    minpd %xmm2, %xmm0
 ; SSE-NEXT:    minpd %xmm1, %xmm0
 ; SSE-NEXT:    movapd %xmm0, %xmm1
 ; SSE-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
@@ -291,29 +475,41 @@ define double @test_v16f64(<16 x double> %a0) {
 ; AVX-NEXT:    vminpd %ymm3, %ymm1, %ymm1
 ; AVX-NEXT:    vminpd %ymm2, %ymm0, %ymm0
 ; AVX-NEXT:    vminpd %ymm1, %ymm0, %ymm0
-; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vminpd %xmm1, %xmm0, %xmm0
 ; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vminsd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vminsd %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm0
+; AVX-NEXT:    vminsd %xmm0, %xmm1, %xmm1
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
+; AVX-NEXT:    vminsd %xmm0, %xmm1, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v16f64:
 ; AVX512:       # %bb.0:
 ; AVX512-NEXT:    vminpd %zmm1, %zmm0, %zmm0
-; AVX512-NEXT:    vextractf64x4 $1, %zmm0, %ymm1
-; AVX512-NEXT:    vminpd %zmm1, %zmm0, %zmm0
-; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vminpd %xmm1, %xmm0, %xmm0
 ; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vminsd %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vminsd %xmm1, %xmm0, %xmm1
+; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm2
+; AVX512-NEXT:    vminsd %xmm2, %xmm1, %xmm1
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm2[1,0]
+; AVX512-NEXT:    vminsd %xmm2, %xmm1, %xmm1
+; AVX512-NEXT:    vextractf32x4 $2, %zmm0, %xmm2
+; AVX512-NEXT:    vminsd %xmm2, %xmm1, %xmm1
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm2[1,0]
+; AVX512-NEXT:    vminsd %xmm2, %xmm1, %xmm1
+; AVX512-NEXT:    vextractf32x4 $3, %zmm0, %xmm0
+; AVX512-NEXT:    vminsd %xmm0, %xmm1, %xmm1
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
+; AVX512-NEXT:    vminsd %xmm0, %xmm1, %xmm0
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
   %1 = call nnan double @llvm.experimental.vector.reduce.fmin.v16f64(<16 x double> %a0)
   ret double %1
 }
 
+declare float @llvm.experimental.vector.reduce.fmin.v1f32(<1 x float>)
 declare float @llvm.experimental.vector.reduce.fmin.v2f32(<2 x float>)
+declare float @llvm.experimental.vector.reduce.fmin.v3f32(<3 x float>)
 declare float @llvm.experimental.vector.reduce.fmin.v4f32(<4 x float>)
 declare float @llvm.experimental.vector.reduce.fmin.v8f32(<8 x float>)
 declare float @llvm.experimental.vector.reduce.fmin.v16f32(<16 x float>)

diff  --git a/llvm/test/CodeGen/X86/vector-reduce-fmin.ll b/llvm/test/CodeGen/X86/vector-reduce-fmin.ll
index d6c681f50752..1d7436eaa8a4 100644
--- a/llvm/test/CodeGen/X86/vector-reduce-fmin.ll
+++ b/llvm/test/CodeGen/X86/vector-reduce-fmin.ll
@@ -13,27 +13,46 @@
 define float @test_v2f32(<2 x float> %a0) {
 ; SSE2-LABEL: test_v2f32:
 ; SSE2:       # %bb.0:
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    shufps {{.*#+}} xmm2 = xmm2[1,1],xmm0[1,1]
 ; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
-; SSE2-NEXT:    minss %xmm1, %xmm0
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andps %xmm2, %xmm3
+; SSE2-NEXT:    minss %xmm0, %xmm2
+; SSE2-NEXT:    andnps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
 ; SSE2-NEXT:    retq
 ;
 ; SSE41-LABEL: test_v2f32:
 ; SSE41:       # %bb.0:
-; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; SSE41-NEXT:    minss %xmm1, %xmm0
+; SSE41-NEXT:    movshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
+; SSE41-NEXT:    movaps %xmm0, %xmm1
+; SSE41-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm3
+; SSE41-NEXT:    andps %xmm2, %xmm3
+; SSE41-NEXT:    minss %xmm0, %xmm2
+; SSE41-NEXT:    andnps %xmm2, %xmm1
+; SSE41-NEXT:    orps %xmm3, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v2f32:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX-NEXT:    vminss %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vminss %xmm0, %xmm1, %xmm2
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm0
+; AVX-NEXT:    vblendvps %xmm0, %xmm1, %xmm2, %xmm0
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v2f32:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX512-NEXT:    vminss %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vminss %xmm0, %xmm2, %xmm1
+; AVX512-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmovss %xmm2, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vmovaps %xmm1, %xmm0
 ; AVX512-NEXT:    retq
   %1 = call float @llvm.experimental.vector.reduce.fmin.v2f32(<2 x float> %a0)
   ret float %1
@@ -42,37 +61,95 @@ define float @test_v2f32(<2 x float> %a0) {
 define float @test_v4f32(<4 x float> %a0) {
 ; SSE2-LABEL: test_v4f32:
 ; SSE2:       # %bb.0:
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    movaps %xmm0, %xmm3
+; SSE2-NEXT:    shufps {{.*#+}} xmm3 = xmm3[1,1],xmm0[1,1]
 ; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE2-NEXT:    minps %xmm1, %xmm0
-; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
-; SSE2-NEXT:    minss %xmm1, %xmm0
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm4
+; SSE2-NEXT:    andps %xmm3, %xmm4
+; SSE2-NEXT:    minss %xmm0, %xmm3
+; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; SSE2-NEXT:    movhlps {{.*#+}} xmm2 = xmm2[1,1]
+; SSE2-NEXT:    andnps %xmm3, %xmm1
+; SSE2-NEXT:    orps %xmm4, %xmm1
+; SSE2-NEXT:    movaps %xmm2, %xmm3
+; SSE2-NEXT:    minss %xmm1, %xmm3
+; SSE2-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm4
+; SSE2-NEXT:    andnps %xmm3, %xmm4
+; SSE2-NEXT:    andps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm4, %xmm1
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    minss %xmm1, %xmm2
+; SSE2-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andnps %xmm2, %xmm3
+; SSE2-NEXT:    andps %xmm0, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
 ; SSE2-NEXT:    retq
 ;
 ; SSE41-LABEL: test_v4f32:
 ; SSE41:       # %bb.0:
+; SSE41-NEXT:    movaps %xmm0, %xmm2
+; SSE41-NEXT:    movshdup {{.*#+}} xmm3 = xmm0[1,1,3,3]
 ; SSE41-NEXT:    movaps %xmm0, %xmm1
-; SSE41-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE41-NEXT:    minps %xmm1, %xmm0
-; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; SSE41-NEXT:    minss %xmm1, %xmm0
+; SSE41-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm4
+; SSE41-NEXT:    andps %xmm3, %xmm4
+; SSE41-NEXT:    minss %xmm0, %xmm3
+; SSE41-NEXT:    shufps {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; SSE41-NEXT:    movhlps {{.*#+}} xmm2 = xmm2[1,1]
+; SSE41-NEXT:    andnps %xmm3, %xmm1
+; SSE41-NEXT:    orps %xmm4, %xmm1
+; SSE41-NEXT:    movaps %xmm2, %xmm3
+; SSE41-NEXT:    minss %xmm1, %xmm3
+; SSE41-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm4
+; SSE41-NEXT:    andnps %xmm3, %xmm4
+; SSE41-NEXT:    andps %xmm2, %xmm1
+; SSE41-NEXT:    orps %xmm4, %xmm1
+; SSE41-NEXT:    movaps %xmm0, %xmm2
+; SSE41-NEXT:    minss %xmm1, %xmm2
+; SSE41-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm3
+; SSE41-NEXT:    andnps %xmm2, %xmm3
+; SSE41-NEXT:    andps %xmm0, %xmm1
+; SSE41-NEXT:    orps %xmm3, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v4f32:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX-NEXT:    vminss %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[3,3,3,3]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm0[1,1,3,3]
+; AVX-NEXT:    vminss %xmm0, %xmm3, %xmm4
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm0
+; AVX-NEXT:    vblendvps %xmm0, %xmm3, %xmm4, %xmm0
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm3
+; AVX-NEXT:    vminss %xmm0, %xmm2, %xmm0
+; AVX-NEXT:    vblendvps %xmm3, %xmm2, %xmm0, %xmm0
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
+; AVX-NEXT:    vminss %xmm0, %xmm1, %xmm0
+; AVX-NEXT:    vblendvps %xmm2, %xmm1, %xmm0, %xmm0
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v4f32:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX512-NEXT:    vminss %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[3,3,3,3]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX512-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm0[1,1,3,3]
+; AVX512-NEXT:    vminss %xmm0, %xmm3, %xmm4
+; AVX512-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmovss %xmm3, %xmm4, %xmm4 {%k1}
+; AVX512-NEXT:    vcmpunordss %xmm4, %xmm4, %k1
+; AVX512-NEXT:    vminss %xmm4, %xmm2, %xmm0
+; AVX512-NEXT:    vmovss %xmm2, %xmm0, %xmm0 {%k1}
+; AVX512-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vminss %xmm0, %xmm1, %xmm0
+; AVX512-NEXT:    vmovss %xmm1, %xmm0, %xmm0 {%k1}
 ; AVX512-NEXT:    retq
   %1 = call float @llvm.experimental.vector.reduce.fmin.v4f32(<4 x float> %a0)
   ret float %1
@@ -81,46 +158,170 @@ define float @test_v4f32(<4 x float> %a0) {
 define float @test_v8f32(<8 x float> %a0) {
 ; SSE2-LABEL: test_v8f32:
 ; SSE2:       # %bb.0:
-; SSE2-NEXT:    minps %xmm1, %xmm0
-; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE2-NEXT:    minps %xmm1, %xmm0
+; SSE2-NEXT:    movaps %xmm1, %xmm2
+; SSE2-NEXT:    minps %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordps %xmm0, %xmm0
+; SSE2-NEXT:    andps %xmm0, %xmm1
+; SSE2-NEXT:    andnps %xmm2, %xmm0
+; SSE2-NEXT:    orps %xmm1, %xmm0
+; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[1,1,1,1]
 ; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
-; SSE2-NEXT:    minss %xmm1, %xmm0
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andps %xmm2, %xmm3
+; SSE2-NEXT:    minss %xmm0, %xmm2
+; SSE2-NEXT:    andnps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; SSE2-NEXT:    movdqa %xmm2, %xmm3
+; SSE2-NEXT:    minss %xmm1, %xmm3
+; SSE2-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm4
+; SSE2-NEXT:    andnps %xmm3, %xmm4
+; SSE2-NEXT:    andps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm4, %xmm1
+; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; SSE2-NEXT:    movdqa %xmm0, %xmm2
+; SSE2-NEXT:    minss %xmm1, %xmm2
+; SSE2-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andnps %xmm2, %xmm3
+; SSE2-NEXT:    andps %xmm0, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
 ; SSE2-NEXT:    retq
 ;
 ; SSE41-LABEL: test_v8f32:
 ; SSE41:       # %bb.0:
-; SSE41-NEXT:    minps %xmm1, %xmm0
-; SSE41-NEXT:    movaps %xmm0, %xmm1
-; SSE41-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE41-NEXT:    minps %xmm1, %xmm0
-; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; SSE41-NEXT:    minss %xmm1, %xmm0
+; SSE41-NEXT:    movaps %xmm1, %xmm2
+; SSE41-NEXT:    minps %xmm0, %xmm2
+; SSE41-NEXT:    cmpunordps %xmm0, %xmm0
+; SSE41-NEXT:    blendvps %xmm0, %xmm1, %xmm2
+; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm2[1,1,3,3]
+; SSE41-NEXT:    movaps %xmm2, %xmm0
+; SSE41-NEXT:    cmpunordss %xmm2, %xmm0
+; SSE41-NEXT:    movaps %xmm0, %xmm3
+; SSE41-NEXT:    andps %xmm1, %xmm3
+; SSE41-NEXT:    minss %xmm2, %xmm1
+; SSE41-NEXT:    andnps %xmm1, %xmm0
+; SSE41-NEXT:    orps %xmm3, %xmm0
+; SSE41-NEXT:    movaps %xmm2, %xmm1
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm2[1]
+; SSE41-NEXT:    movaps %xmm1, %xmm3
+; SSE41-NEXT:    minss %xmm0, %xmm3
+; SSE41-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE41-NEXT:    movaps %xmm0, %xmm4
+; SSE41-NEXT:    andnps %xmm3, %xmm4
+; SSE41-NEXT:    andps %xmm1, %xmm0
+; SSE41-NEXT:    orps %xmm4, %xmm0
+; SSE41-NEXT:    shufps {{.*#+}} xmm2 = xmm2[3,3,3,3]
+; SSE41-NEXT:    movaps %xmm2, %xmm1
+; SSE41-NEXT:    minss %xmm0, %xmm1
+; SSE41-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE41-NEXT:    movaps %xmm0, %xmm3
+; SSE41-NEXT:    andnps %xmm1, %xmm3
+; SSE41-NEXT:    andps %xmm2, %xmm0
+; SSE41-NEXT:    orps %xmm3, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v8f32:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX-NEXT:    vminss %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vpermilps {{.*#+}} xmm8 = xmm1[3,3,3,3]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm1[1,0]
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm4 = xmm1[1,1,3,3]
+; AVX-NEXT:    vpermilps {{.*#+}} xmm5 = xmm0[3,3,3,3]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm6 = xmm0[1,0]
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm7 = xmm0[1,1,3,3]
+; AVX-NEXT:    vminss %xmm0, %xmm7, %xmm2
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm0
+; AVX-NEXT:    vblendvps %xmm0, %xmm7, %xmm2, %xmm0
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
+; AVX-NEXT:    vminss %xmm0, %xmm6, %xmm0
+; AVX-NEXT:    vblendvps %xmm2, %xmm6, %xmm0, %xmm0
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
+; AVX-NEXT:    vminss %xmm0, %xmm5, %xmm0
+; AVX-NEXT:    vblendvps %xmm2, %xmm5, %xmm0, %xmm0
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
+; AVX-NEXT:    vminss %xmm0, %xmm1, %xmm0
+; AVX-NEXT:    vblendvps %xmm2, %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm1
+; AVX-NEXT:    vminss %xmm0, %xmm4, %xmm0
+; AVX-NEXT:    vblendvps %xmm1, %xmm4, %xmm0, %xmm0
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm1
+; AVX-NEXT:    vminss %xmm0, %xmm3, %xmm0
+; AVX-NEXT:    vblendvps %xmm1, %xmm3, %xmm0, %xmm0
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm1
+; AVX-NEXT:    vminss %xmm0, %xmm8, %xmm0
+; AVX-NEXT:    vblendvps %xmm1, %xmm8, %xmm0, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
-; AVX512-LABEL: test_v8f32:
-; AVX512:       # %bb.0:
-; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX512-NEXT:    vminss %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vzeroupper
-; AVX512-NEXT:    retq
+; AVX512BW-LABEL: test_v8f32:
+; AVX512BW:       # %bb.0:
+; AVX512BW-NEXT:    vextractf128 $1, %ymm0, %xmm3
+; AVX512BW-NEXT:    vpermilps {{.*#+}} xmm8 = xmm3[3,3,3,3]
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm3[1,0]
+; AVX512BW-NEXT:    vmovshdup {{.*#+}} xmm4 = xmm3[1,1,3,3]
+; AVX512BW-NEXT:    vpermilps {{.*#+}} xmm5 = xmm0[3,3,3,3]
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm6 = xmm0[1,0]
+; AVX512BW-NEXT:    vmovshdup {{.*#+}} xmm7 = xmm0[1,1,3,3]
+; AVX512BW-NEXT:    vminss %xmm0, %xmm7, %xmm1
+; AVX512BW-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vmovss %xmm7, %xmm1, %xmm1 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm1, %xmm1, %k1
+; AVX512BW-NEXT:    vminss %xmm1, %xmm6, %xmm0
+; AVX512BW-NEXT:    vmovss %xmm6, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vminss %xmm0, %xmm5, %xmm0
+; AVX512BW-NEXT:    vmovss %xmm5, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vminss %xmm0, %xmm3, %xmm0
+; AVX512BW-NEXT:    vmovss %xmm3, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vminss %xmm0, %xmm4, %xmm0
+; AVX512BW-NEXT:    vmovss %xmm4, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vminss %xmm0, %xmm2, %xmm0
+; AVX512BW-NEXT:    vmovss %xmm2, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vminss %xmm0, %xmm8, %xmm0
+; AVX512BW-NEXT:    vmovss %xmm8, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vzeroupper
+; AVX512BW-NEXT:    retq
+;
+; AVX512VL-LABEL: test_v8f32:
+; AVX512VL:       # %bb.0:
+; AVX512VL-NEXT:    vextractf128 $1, %ymm0, %xmm1
+; AVX512VL-NEXT:    vpermilps {{.*#+}} xmm8 = xmm1[3,3,3,3]
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm1[1,0]
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} xmm4 = xmm1[1,1,3,3]
+; AVX512VL-NEXT:    vpermilps {{.*#+}} xmm5 = xmm0[3,3,3,3]
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm6 = xmm0[1,0]
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} xmm7 = xmm0[1,1,3,3]
+; AVX512VL-NEXT:    vminss %xmm0, %xmm7, %xmm2
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmovss %xmm7, %xmm2, %xmm2 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm2, %xmm2, %k1
+; AVX512VL-NEXT:    vminss %xmm2, %xmm6, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm6, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm5, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm5, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm1, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm1, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm4, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm4, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm3, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm3, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm8, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm8, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vzeroupper
+; AVX512VL-NEXT:    retq
   %1 = call float @llvm.experimental.vector.reduce.fmin.v8f32(<8 x float> %a0)
   ret float %1
 }
@@ -128,53 +329,259 @@ define float @test_v8f32(<8 x float> %a0) {
 define float @test_v16f32(<16 x float> %a0) {
 ; SSE2-LABEL: test_v16f32:
 ; SSE2:       # %bb.0:
-; SSE2-NEXT:    minps %xmm3, %xmm1
-; SSE2-NEXT:    minps %xmm2, %xmm0
-; SSE2-NEXT:    minps %xmm1, %xmm0
-; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE2-NEXT:    minps %xmm1, %xmm0
+; SSE2-NEXT:    movaps %xmm2, %xmm4
+; SSE2-NEXT:    minps %xmm0, %xmm4
+; SSE2-NEXT:    cmpunordps %xmm0, %xmm0
+; SSE2-NEXT:    andps %xmm0, %xmm2
+; SSE2-NEXT:    andnps %xmm4, %xmm0
+; SSE2-NEXT:    orps %xmm2, %xmm0
+; SSE2-NEXT:    movaps %xmm3, %xmm2
+; SSE2-NEXT:    minps %xmm1, %xmm2
+; SSE2-NEXT:    cmpunordps %xmm1, %xmm1
+; SSE2-NEXT:    andps %xmm1, %xmm3
+; SSE2-NEXT:    andnps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm2
+; SSE2-NEXT:    minps %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordps %xmm0, %xmm0
+; SSE2-NEXT:    andps %xmm0, %xmm1
+; SSE2-NEXT:    andnps %xmm2, %xmm0
+; SSE2-NEXT:    orps %xmm1, %xmm0
+; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[1,1,1,1]
 ; SSE2-NEXT:    movaps %xmm0, %xmm1
-; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
-; SSE2-NEXT:    minss %xmm1, %xmm0
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andps %xmm2, %xmm3
+; SSE2-NEXT:    minss %xmm0, %xmm2
+; SSE2-NEXT:    andnps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; SSE2-NEXT:    movdqa %xmm2, %xmm3
+; SSE2-NEXT:    minss %xmm1, %xmm3
+; SSE2-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm4
+; SSE2-NEXT:    andnps %xmm3, %xmm4
+; SSE2-NEXT:    andps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm4, %xmm1
+; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; SSE2-NEXT:    movdqa %xmm0, %xmm2
+; SSE2-NEXT:    minss %xmm1, %xmm2
+; SSE2-NEXT:    cmpunordss %xmm1, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andnps %xmm2, %xmm3
+; SSE2-NEXT:    andps %xmm0, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
 ; SSE2-NEXT:    retq
 ;
 ; SSE41-LABEL: test_v16f32:
 ; SSE41:       # %bb.0:
-; SSE41-NEXT:    minps %xmm3, %xmm1
-; SSE41-NEXT:    minps %xmm2, %xmm0
-; SSE41-NEXT:    minps %xmm1, %xmm0
-; SSE41-NEXT:    movaps %xmm0, %xmm1
-; SSE41-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE41-NEXT:    minps %xmm1, %xmm0
-; SSE41-NEXT:    movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; SSE41-NEXT:    minss %xmm1, %xmm0
+; SSE41-NEXT:    movaps %xmm2, %xmm4
+; SSE41-NEXT:    minps %xmm0, %xmm4
+; SSE41-NEXT:    cmpunordps %xmm0, %xmm0
+; SSE41-NEXT:    blendvps %xmm0, %xmm2, %xmm4
+; SSE41-NEXT:    movaps %xmm3, %xmm2
+; SSE41-NEXT:    minps %xmm1, %xmm2
+; SSE41-NEXT:    cmpunordps %xmm1, %xmm1
+; SSE41-NEXT:    movaps %xmm1, %xmm0
+; SSE41-NEXT:    blendvps %xmm0, %xmm3, %xmm2
+; SSE41-NEXT:    movaps %xmm2, %xmm1
+; SSE41-NEXT:    minps %xmm4, %xmm1
+; SSE41-NEXT:    cmpunordps %xmm4, %xmm4
+; SSE41-NEXT:    movaps %xmm4, %xmm0
+; SSE41-NEXT:    blendvps %xmm0, %xmm2, %xmm1
+; SSE41-NEXT:    movshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
+; SSE41-NEXT:    movaps %xmm1, %xmm0
+; SSE41-NEXT:    cmpunordss %xmm1, %xmm0
+; SSE41-NEXT:    movaps %xmm0, %xmm3
+; SSE41-NEXT:    andps %xmm2, %xmm3
+; SSE41-NEXT:    minss %xmm1, %xmm2
+; SSE41-NEXT:    andnps %xmm2, %xmm0
+; SSE41-NEXT:    orps %xmm3, %xmm0
+; SSE41-NEXT:    movaps %xmm1, %xmm2
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
+; SSE41-NEXT:    movaps %xmm2, %xmm3
+; SSE41-NEXT:    minss %xmm0, %xmm3
+; SSE41-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE41-NEXT:    movaps %xmm0, %xmm4
+; SSE41-NEXT:    andnps %xmm3, %xmm4
+; SSE41-NEXT:    andps %xmm2, %xmm0
+; SSE41-NEXT:    orps %xmm4, %xmm0
+; SSE41-NEXT:    shufps {{.*#+}} xmm1 = xmm1[3,3,3,3]
+; SSE41-NEXT:    movaps %xmm1, %xmm2
+; SSE41-NEXT:    minss %xmm0, %xmm2
+; SSE41-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE41-NEXT:    movaps %xmm0, %xmm3
+; SSE41-NEXT:    andnps %xmm2, %xmm3
+; SSE41-NEXT:    andps %xmm1, %xmm0
+; SSE41-NEXT:    orps %xmm3, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v16f32:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vminps %ymm1, %ymm0, %ymm0
-; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vminps %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vminps %ymm0, %ymm1, %ymm2
+; AVX-NEXT:    vcmpunordps %ymm0, %ymm0, %ymm0
+; AVX-NEXT:    vblendvps %ymm0, %ymm1, %ymm2, %ymm0
 ; AVX-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX-NEXT:    vminss %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vminss %xmm0, %xmm1, %xmm2
+; AVX-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm3
+; AVX-NEXT:    vblendvps %xmm3, %xmm1, %xmm2, %xmm1
+; AVX-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm0[1,0]
+; AVX-NEXT:    vminss %xmm1, %xmm3, %xmm1
+; AVX-NEXT:    vblendvps %xmm2, %xmm3, %xmm1, %xmm1
+; AVX-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vpermilps {{.*#+}} xmm3 = xmm0[3,3,3,3]
+; AVX-NEXT:    vminss %xmm1, %xmm3, %xmm1
+; AVX-NEXT:    vblendvps %xmm2, %xmm3, %xmm1, %xmm1
+; AVX-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm0
+; AVX-NEXT:    vminss %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm1
+; AVX-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm0[1,1,3,3]
+; AVX-NEXT:    vminss %xmm1, %xmm3, %xmm1
+; AVX-NEXT:    vblendvps %xmm2, %xmm3, %xmm1, %xmm1
+; AVX-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm0[1,0]
+; AVX-NEXT:    vminss %xmm1, %xmm3, %xmm1
+; AVX-NEXT:    vblendvps %xmm2, %xmm3, %xmm1, %xmm1
+; AVX-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vpermilps {{.*#+}} xmm0 = xmm0[3,3,3,3]
+; AVX-NEXT:    vminss %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
-; AVX512-LABEL: test_v16f32:
-; AVX512:       # %bb.0:
-; AVX512-NEXT:    vextractf64x4 $1, %zmm0, %ymm1
-; AVX512-NEXT:    vminps %zmm1, %zmm0, %zmm0
-; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vminps %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
-; AVX512-NEXT:    vminss %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vzeroupper
-; AVX512-NEXT:    retq
+; AVX512BW-LABEL: test_v16f32:
+; AVX512BW:       # %bb.0:
+; AVX512BW-NEXT:    vextractf128 $1, %ymm0, %xmm1
+; AVX512BW-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
+; AVX512BW-NEXT:    vminss %xmm0, %xmm2, %xmm3
+; AVX512BW-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vmovss %xmm2, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vminss %xmm3, %xmm2, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm2, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vpermilps {{.*#+}} xmm2 = xmm0[3,3,3,3]
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vminss %xmm3, %xmm2, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm2, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vminss %xmm3, %xmm1, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm1, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vminss %xmm3, %xmm2, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm2, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vminss %xmm3, %xmm2, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm2, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vextractf32x4 $2, %zmm0, %xmm2
+; AVX512BW-NEXT:    vpermilps {{.*#+}} xmm1 = xmm1[3,3,3,3]
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vminss %xmm3, %xmm1, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm1, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm2[1,1,3,3]
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vminss %xmm3, %xmm2, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm2, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vminss %xmm3, %xmm1, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm1, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm2[1,0]
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vminss %xmm3, %xmm1, %xmm3
+; AVX512BW-NEXT:    vmovss %xmm1, %xmm3, %xmm3 {%k1}
+; AVX512BW-NEXT:    vextractf32x4 $3, %zmm0, %xmm0
+; AVX512BW-NEXT:    vpermilps {{.*#+}} xmm1 = xmm2[3,3,3,3]
+; AVX512BW-NEXT:    vcmpunordss %xmm3, %xmm3, %k1
+; AVX512BW-NEXT:    vminss %xmm3, %xmm1, %xmm2
+; AVX512BW-NEXT:    vmovss %xmm1, %xmm2, %xmm2 {%k1}
+; AVX512BW-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; AVX512BW-NEXT:    vcmpunordss %xmm2, %xmm2, %k1
+; AVX512BW-NEXT:    vminss %xmm2, %xmm0, %xmm2
+; AVX512BW-NEXT:    vmovss %xmm0, %xmm2, %xmm2 {%k1}
+; AVX512BW-NEXT:    vcmpunordss %xmm2, %xmm2, %k1
+; AVX512BW-NEXT:    vminss %xmm2, %xmm1, %xmm2
+; AVX512BW-NEXT:    vmovss %xmm1, %xmm2, %xmm2 {%k1}
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
+; AVX512BW-NEXT:    vcmpunordss %xmm2, %xmm2, %k1
+; AVX512BW-NEXT:    vminss %xmm2, %xmm1, %xmm2
+; AVX512BW-NEXT:    vmovss %xmm1, %xmm2, %xmm2 {%k1}
+; AVX512BW-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[3,3,3,3]
+; AVX512BW-NEXT:    vcmpunordss %xmm2, %xmm2, %k1
+; AVX512BW-NEXT:    vminss %xmm2, %xmm1, %xmm0
+; AVX512BW-NEXT:    vmovss %xmm1, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vzeroupper
+; AVX512BW-NEXT:    retq
+;
+; AVX512VL-LABEL: test_v16f32:
+; AVX512VL:       # %bb.0:
+; AVX512VL-NEXT:    vextractf32x4 $3, %zmm0, %xmm3
+; AVX512VL-NEXT:    vpermilps {{.*#+}} xmm8 = xmm3[3,3,3,3]
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm9 = xmm3[1,0]
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} xmm10 = xmm3[1,1,3,3]
+; AVX512VL-NEXT:    vextractf32x4 $2, %zmm0, %xmm6
+; AVX512VL-NEXT:    vpermilps {{.*#+}} xmm11 = xmm6[3,3,3,3]
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm12 = xmm6[1,0]
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} xmm13 = xmm6[1,1,3,3]
+; AVX512VL-NEXT:    vextractf128 $1, %ymm0, %xmm2
+; AVX512VL-NEXT:    vpermilps {{.*#+}} xmm14 = xmm2[3,3,3,3]
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm15 = xmm2[1,0]
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} xmm16 = xmm2[1,1,3,3]
+; AVX512VL-NEXT:    vpermilps {{.*#+}} xmm1 = xmm0[3,3,3,3]
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm4 = xmm0[1,0]
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} xmm5 = xmm0[1,1,3,3]
+; AVX512VL-NEXT:    vminss %xmm0, %xmm5, %xmm7
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmovss %xmm5, %xmm7, %xmm7 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm7, %xmm7, %k1
+; AVX512VL-NEXT:    vminss %xmm7, %xmm4, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm4, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm1, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm1, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm2, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm2, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm16, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm16, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm15, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm15, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm14, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm14, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm6, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm6, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm13, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm13, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm12, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm12, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm11, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm11, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm3, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm3, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm10, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm10, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm9, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm9, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminss %xmm0, %xmm8, %xmm0
+; AVX512VL-NEXT:    vmovss %xmm8, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vzeroupper
+; AVX512VL-NEXT:    retq
   %1 = call float @llvm.experimental.vector.reduce.fmin.v16f32(<16 x float> %a0)
   ret float %1
 }
@@ -186,50 +593,176 @@ define float @test_v16f32(<16 x float> %a0) {
 define double @test_v2f64(<2 x double> %a0) {
 ; SSE-LABEL: test_v2f64:
 ; SSE:       # %bb.0:
+; SSE-NEXT:    movapd %xmm0, %xmm2
+; SSE-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm0[1]
 ; SSE-NEXT:    movapd %xmm0, %xmm1
-; SSE-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE-NEXT:    minsd %xmm1, %xmm0
+; SSE-NEXT:    cmpunordsd %xmm0, %xmm1
+; SSE-NEXT:    movapd %xmm1, %xmm3
+; SSE-NEXT:    andpd %xmm2, %xmm3
+; SSE-NEXT:    minsd %xmm0, %xmm2
+; SSE-NEXT:    andnpd %xmm2, %xmm1
+; SSE-NEXT:    orpd %xmm3, %xmm1
+; SSE-NEXT:    movapd %xmm1, %xmm0
 ; SSE-NEXT:    retq
 ;
 ; AVX-LABEL: test_v2f64:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vminsd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vminsd %xmm0, %xmm1, %xmm2
+; AVX-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm0
+; AVX-NEXT:    vblendvpd %xmm0, %xmm1, %xmm2, %xmm0
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v2f64:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vminsd %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX512-NEXT:    vminsd %xmm0, %xmm2, %xmm1
+; AVX512-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmovsd %xmm2, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vmovapd %xmm1, %xmm0
 ; AVX512-NEXT:    retq
   %1 = call double @llvm.experimental.vector.reduce.fmin.v2f64(<2 x double> %a0)
   ret double %1
 }
 
+define double @test_v3f64(<3 x double> %a0) {
+; SSE2-LABEL: test_v3f64:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    unpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; SSE2-NEXT:    shufpd {{.*#+}} xmm2 = xmm2[0],mem[1]
+; SSE2-NEXT:    movapd %xmm2, %xmm1
+; SSE2-NEXT:    minpd %xmm0, %xmm1
+; SSE2-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE2-NEXT:    andpd %xmm0, %xmm2
+; SSE2-NEXT:    andnpd %xmm1, %xmm0
+; SSE2-NEXT:    orpd %xmm2, %xmm0
+; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; SSE2-NEXT:    movapd %xmm0, %xmm1
+; SSE2-NEXT:    cmpunordsd %xmm0, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm3
+; SSE2-NEXT:    andpd %xmm2, %xmm3
+; SSE2-NEXT:    minsd %xmm0, %xmm2
+; SSE2-NEXT:    andnpd %xmm2, %xmm1
+; SSE2-NEXT:    orpd %xmm3, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE41-LABEL: test_v3f64:
+; SSE41:       # %bb.0:
+; SSE41-NEXT:    unpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; SSE41-NEXT:    blendpd {{.*#+}} xmm2 = xmm2[0],mem[1]
+; SSE41-NEXT:    movapd %xmm2, %xmm1
+; SSE41-NEXT:    minpd %xmm0, %xmm1
+; SSE41-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm2, %xmm1
+; SSE41-NEXT:    movapd %xmm1, %xmm2
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
+; SSE41-NEXT:    movapd %xmm1, %xmm0
+; SSE41-NEXT:    cmpunordsd %xmm1, %xmm0
+; SSE41-NEXT:    movapd %xmm0, %xmm3
+; SSE41-NEXT:    andpd %xmm2, %xmm3
+; SSE41-NEXT:    minsd %xmm1, %xmm2
+; SSE41-NEXT:    andnpd %xmm2, %xmm0
+; SSE41-NEXT:    orpd %xmm3, %xmm0
+; SSE41-NEXT:    retq
+;
+; AVX-LABEL: test_v3f64:
+; AVX:       # %bb.0:
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
+; AVX-NEXT:    vminsd %xmm0, %xmm1, %xmm2
+; AVX-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm3
+; AVX-NEXT:    vblendvpd %xmm3, %xmm1, %xmm2, %xmm1
+; AVX-NEXT:    vcmpunordsd %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm0
+; AVX-NEXT:    vminsd %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm0
+; AVX-NEXT:    vzeroupper
+; AVX-NEXT:    retq
+;
+; AVX512-LABEL: test_v3f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
+; AVX512-NEXT:    vminsd %xmm0, %xmm1, %xmm2
+; AVX512-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmovsd %xmm1, %xmm2, %xmm2 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm2, %xmm2, %k1
+; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
+; AVX512-NEXT:    vminsd %xmm2, %xmm1, %xmm0
+; AVX512-NEXT:    vmovsd %xmm1, %xmm0, %xmm0 {%k1}
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %1 = call double @llvm.experimental.vector.reduce.fmin.v3f64(<3 x double> %a0)
+  ret double %1
+}
+
 define double @test_v4f64(<4 x double> %a0) {
-; SSE-LABEL: test_v4f64:
-; SSE:       # %bb.0:
-; SSE-NEXT:    minpd %xmm1, %xmm0
-; SSE-NEXT:    movapd %xmm0, %xmm1
-; SSE-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE-NEXT:    minsd %xmm1, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_v4f64:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movapd %xmm1, %xmm2
+; SSE2-NEXT:    minpd %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE2-NEXT:    andpd %xmm0, %xmm1
+; SSE2-NEXT:    andnpd %xmm2, %xmm0
+; SSE2-NEXT:    orpd %xmm1, %xmm0
+; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; SSE2-NEXT:    movapd %xmm0, %xmm1
+; SSE2-NEXT:    cmpunordsd %xmm0, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm3
+; SSE2-NEXT:    andpd %xmm2, %xmm3
+; SSE2-NEXT:    minsd %xmm0, %xmm2
+; SSE2-NEXT:    andnpd %xmm2, %xmm1
+; SSE2-NEXT:    orpd %xmm3, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE41-LABEL: test_v4f64:
+; SSE41:       # %bb.0:
+; SSE41-NEXT:    movapd %xmm1, %xmm2
+; SSE41-NEXT:    minpd %xmm0, %xmm2
+; SSE41-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm1, %xmm2
+; SSE41-NEXT:    movapd %xmm2, %xmm1
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm2[1]
+; SSE41-NEXT:    movapd %xmm2, %xmm0
+; SSE41-NEXT:    cmpunordsd %xmm2, %xmm0
+; SSE41-NEXT:    movapd %xmm0, %xmm3
+; SSE41-NEXT:    andpd %xmm1, %xmm3
+; SSE41-NEXT:    minsd %xmm2, %xmm1
+; SSE41-NEXT:    andnpd %xmm1, %xmm0
+; SSE41-NEXT:    orpd %xmm3, %xmm0
+; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v4f64:
 ; AVX:       # %bb.0:
 ; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vminpd %xmm1, %xmm0, %xmm0
-; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vminsd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm0[1,0]
+; AVX-NEXT:    vminsd %xmm0, %xmm3, %xmm4
+; AVX-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm0
+; AVX-NEXT:    vblendvpd %xmm0, %xmm3, %xmm4, %xmm0
+; AVX-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm3
+; AVX-NEXT:    vminsd %xmm0, %xmm1, %xmm0
+; AVX-NEXT:    vblendvpd %xmm3, %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm1
+; AVX-NEXT:    vminsd %xmm0, %xmm2, %xmm0
+; AVX-NEXT:    vblendvpd %xmm1, %xmm2, %xmm0, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v4f64:
 ; AVX512:       # %bb.0:
 ; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vminpd %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vminsd %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm1[1,0]
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm3 = xmm0[1,0]
+; AVX512-NEXT:    vminsd %xmm0, %xmm3, %xmm4
+; AVX512-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmovsd %xmm3, %xmm4, %xmm4 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm4, %xmm4, %k1
+; AVX512-NEXT:    vminsd %xmm4, %xmm1, %xmm0
+; AVX512-NEXT:    vmovsd %xmm1, %xmm0, %xmm0 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vminsd %xmm0, %xmm2, %xmm0
+; AVX512-NEXT:    vmovsd %xmm2, %xmm0, %xmm0 {%k1}
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
   %1 = call double @llvm.experimental.vector.reduce.fmin.v4f64(<4 x double> %a0)
@@ -237,76 +770,316 @@ define double @test_v4f64(<4 x double> %a0) {
 }
 
 define double @test_v8f64(<8 x double> %a0) {
-; SSE-LABEL: test_v8f64:
-; SSE:       # %bb.0:
-; SSE-NEXT:    minpd %xmm3, %xmm1
-; SSE-NEXT:    minpd %xmm2, %xmm0
-; SSE-NEXT:    minpd %xmm1, %xmm0
-; SSE-NEXT:    movapd %xmm0, %xmm1
-; SSE-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE-NEXT:    minsd %xmm1, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_v8f64:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movapd %xmm2, %xmm4
+; SSE2-NEXT:    minpd %xmm0, %xmm4
+; SSE2-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE2-NEXT:    andpd %xmm0, %xmm2
+; SSE2-NEXT:    andnpd %xmm4, %xmm0
+; SSE2-NEXT:    orpd %xmm2, %xmm0
+; SSE2-NEXT:    movapd %xmm3, %xmm2
+; SSE2-NEXT:    minpd %xmm1, %xmm2
+; SSE2-NEXT:    cmpunordpd %xmm1, %xmm1
+; SSE2-NEXT:    andpd %xmm1, %xmm3
+; SSE2-NEXT:    andnpd %xmm2, %xmm1
+; SSE2-NEXT:    orpd %xmm3, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm2
+; SSE2-NEXT:    minpd %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE2-NEXT:    andpd %xmm0, %xmm1
+; SSE2-NEXT:    andnpd %xmm2, %xmm0
+; SSE2-NEXT:    orpd %xmm1, %xmm0
+; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; SSE2-NEXT:    movapd %xmm0, %xmm1
+; SSE2-NEXT:    cmpunordsd %xmm0, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm3
+; SSE2-NEXT:    andpd %xmm2, %xmm3
+; SSE2-NEXT:    minsd %xmm0, %xmm2
+; SSE2-NEXT:    andnpd %xmm2, %xmm1
+; SSE2-NEXT:    orpd %xmm3, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE41-LABEL: test_v8f64:
+; SSE41:       # %bb.0:
+; SSE41-NEXT:    movapd %xmm2, %xmm4
+; SSE41-NEXT:    minpd %xmm0, %xmm4
+; SSE41-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm2, %xmm4
+; SSE41-NEXT:    movapd %xmm3, %xmm2
+; SSE41-NEXT:    minpd %xmm1, %xmm2
+; SSE41-NEXT:    cmpunordpd %xmm1, %xmm1
+; SSE41-NEXT:    movapd %xmm1, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm3, %xmm2
+; SSE41-NEXT:    movapd %xmm2, %xmm1
+; SSE41-NEXT:    minpd %xmm4, %xmm1
+; SSE41-NEXT:    cmpunordpd %xmm4, %xmm4
+; SSE41-NEXT:    movapd %xmm4, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm2, %xmm1
+; SSE41-NEXT:    movapd %xmm1, %xmm2
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
+; SSE41-NEXT:    movapd %xmm1, %xmm0
+; SSE41-NEXT:    cmpunordsd %xmm1, %xmm0
+; SSE41-NEXT:    movapd %xmm0, %xmm3
+; SSE41-NEXT:    andpd %xmm2, %xmm3
+; SSE41-NEXT:    minsd %xmm1, %xmm2
+; SSE41-NEXT:    andnpd %xmm2, %xmm0
+; SSE41-NEXT:    orpd %xmm3, %xmm0
+; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v8f64:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vminpd %ymm1, %ymm0, %ymm0
-; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vminpd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vminpd %ymm0, %ymm1, %ymm2
+; AVX-NEXT:    vcmpunordpd %ymm0, %ymm0, %ymm0
+; AVX-NEXT:    vblendvpd %ymm0, %ymm1, %ymm2, %ymm0
 ; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vminsd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vminsd %xmm0, %xmm1, %xmm2
+; AVX-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm3
+; AVX-NEXT:    vblendvpd %xmm3, %xmm1, %xmm2, %xmm1
+; AVX-NEXT:    vcmpunordsd %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm0
+; AVX-NEXT:    vminsd %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm1
+; AVX-NEXT:    vcmpunordsd %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
+; AVX-NEXT:    vminsd %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
-; AVX512-LABEL: test_v8f64:
-; AVX512:       # %bb.0:
-; AVX512-NEXT:    vextractf64x4 $1, %zmm0, %ymm1
-; AVX512-NEXT:    vminpd %zmm1, %zmm0, %zmm0
-; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vminpd %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vminsd %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vzeroupper
-; AVX512-NEXT:    retq
+; AVX512BW-LABEL: test_v8f64:
+; AVX512BW:       # %bb.0:
+; AVX512BW-NEXT:    vextractf32x4 $3, %zmm0, %xmm2
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm8 = xmm2[1,0]
+; AVX512BW-NEXT:    vextractf32x4 $2, %zmm0, %xmm3
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm4 = xmm3[1,0]
+; AVX512BW-NEXT:    vextractf128 $1, %ymm0, %xmm5
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm6 = xmm5[1,0]
+; AVX512BW-NEXT:    vpermilpd {{.*#+}} xmm7 = xmm0[1,0]
+; AVX512BW-NEXT:    vminsd %xmm0, %xmm7, %xmm1
+; AVX512BW-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vmovsd %xmm7, %xmm1, %xmm1 {%k1}
+; AVX512BW-NEXT:    vcmpunordsd %xmm1, %xmm1, %k1
+; AVX512BW-NEXT:    vminsd %xmm1, %xmm5, %xmm0
+; AVX512BW-NEXT:    vmovsd %xmm5, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vminsd %xmm0, %xmm6, %xmm0
+; AVX512BW-NEXT:    vmovsd %xmm6, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vminsd %xmm0, %xmm3, %xmm0
+; AVX512BW-NEXT:    vmovsd %xmm3, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vminsd %xmm0, %xmm4, %xmm0
+; AVX512BW-NEXT:    vmovsd %xmm4, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vminsd %xmm0, %xmm2, %xmm0
+; AVX512BW-NEXT:    vmovsd %xmm2, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512BW-NEXT:    vminsd %xmm0, %xmm8, %xmm0
+; AVX512BW-NEXT:    vmovsd %xmm8, %xmm0, %xmm0 {%k1}
+; AVX512BW-NEXT:    vzeroupper
+; AVX512BW-NEXT:    retq
+;
+; AVX512VL-LABEL: test_v8f64:
+; AVX512VL:       # %bb.0:
+; AVX512VL-NEXT:    vextractf32x4 $3, %zmm0, %xmm1
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm8 = xmm1[1,0]
+; AVX512VL-NEXT:    vextractf32x4 $2, %zmm0, %xmm3
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm4 = xmm3[1,0]
+; AVX512VL-NEXT:    vextractf128 $1, %ymm0, %xmm5
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm6 = xmm5[1,0]
+; AVX512VL-NEXT:    vpermilpd {{.*#+}} xmm7 = xmm0[1,0]
+; AVX512VL-NEXT:    vminsd %xmm0, %xmm7, %xmm2
+; AVX512VL-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vmovsd %xmm7, %xmm2, %xmm2 {%k1}
+; AVX512VL-NEXT:    vcmpunordsd %xmm2, %xmm2, %k1
+; AVX512VL-NEXT:    vminsd %xmm2, %xmm5, %xmm0
+; AVX512VL-NEXT:    vmovsd %xmm5, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminsd %xmm0, %xmm6, %xmm0
+; AVX512VL-NEXT:    vmovsd %xmm6, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminsd %xmm0, %xmm3, %xmm0
+; AVX512VL-NEXT:    vmovsd %xmm3, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminsd %xmm0, %xmm4, %xmm0
+; AVX512VL-NEXT:    vmovsd %xmm4, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminsd %xmm0, %xmm1, %xmm0
+; AVX512VL-NEXT:    vmovsd %xmm1, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vcmpunordsd %xmm0, %xmm0, %k1
+; AVX512VL-NEXT:    vminsd %xmm0, %xmm8, %xmm0
+; AVX512VL-NEXT:    vmovsd %xmm8, %xmm0, %xmm0 {%k1}
+; AVX512VL-NEXT:    vzeroupper
+; AVX512VL-NEXT:    retq
   %1 = call double @llvm.experimental.vector.reduce.fmin.v8f64(<8 x double> %a0)
   ret double %1
 }
 
 define double @test_v16f64(<16 x double> %a0) {
-; SSE-LABEL: test_v16f64:
-; SSE:       # %bb.0:
-; SSE-NEXT:    minpd %xmm6, %xmm2
-; SSE-NEXT:    minpd %xmm4, %xmm0
-; SSE-NEXT:    minpd %xmm2, %xmm0
-; SSE-NEXT:    minpd %xmm7, %xmm3
-; SSE-NEXT:    minpd %xmm5, %xmm1
-; SSE-NEXT:    minpd %xmm3, %xmm1
-; SSE-NEXT:    minpd %xmm1, %xmm0
-; SSE-NEXT:    movapd %xmm0, %xmm1
-; SSE-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
-; SSE-NEXT:    minsd %xmm1, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_v16f64:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movapd %xmm4, %xmm8
+; SSE2-NEXT:    minpd %xmm0, %xmm8
+; SSE2-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE2-NEXT:    andpd %xmm0, %xmm4
+; SSE2-NEXT:    andnpd %xmm8, %xmm0
+; SSE2-NEXT:    orpd %xmm4, %xmm0
+; SSE2-NEXT:    movapd %xmm6, %xmm4
+; SSE2-NEXT:    minpd %xmm2, %xmm4
+; SSE2-NEXT:    cmpunordpd %xmm2, %xmm2
+; SSE2-NEXT:    andpd %xmm2, %xmm6
+; SSE2-NEXT:    andnpd %xmm4, %xmm2
+; SSE2-NEXT:    orpd %xmm6, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm4
+; SSE2-NEXT:    minpd %xmm0, %xmm4
+; SSE2-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE2-NEXT:    andpd %xmm0, %xmm2
+; SSE2-NEXT:    andnpd %xmm4, %xmm0
+; SSE2-NEXT:    orpd %xmm2, %xmm0
+; SSE2-NEXT:    movapd %xmm5, %xmm2
+; SSE2-NEXT:    minpd %xmm1, %xmm2
+; SSE2-NEXT:    cmpunordpd %xmm1, %xmm1
+; SSE2-NEXT:    andpd %xmm1, %xmm5
+; SSE2-NEXT:    andnpd %xmm2, %xmm1
+; SSE2-NEXT:    orpd %xmm5, %xmm1
+; SSE2-NEXT:    movapd %xmm7, %xmm2
+; SSE2-NEXT:    minpd %xmm3, %xmm2
+; SSE2-NEXT:    cmpunordpd %xmm3, %xmm3
+; SSE2-NEXT:    andpd %xmm3, %xmm7
+; SSE2-NEXT:    andnpd %xmm2, %xmm3
+; SSE2-NEXT:    orpd %xmm7, %xmm3
+; SSE2-NEXT:    movapd %xmm3, %xmm2
+; SSE2-NEXT:    minpd %xmm1, %xmm2
+; SSE2-NEXT:    cmpunordpd %xmm1, %xmm1
+; SSE2-NEXT:    andpd %xmm1, %xmm3
+; SSE2-NEXT:    andnpd %xmm2, %xmm1
+; SSE2-NEXT:    orpd %xmm3, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm2
+; SSE2-NEXT:    minpd %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE2-NEXT:    andpd %xmm0, %xmm1
+; SSE2-NEXT:    andnpd %xmm2, %xmm0
+; SSE2-NEXT:    orpd %xmm1, %xmm0
+; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
+; SSE2-NEXT:    movapd %xmm0, %xmm1
+; SSE2-NEXT:    cmpunordsd %xmm0, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm3
+; SSE2-NEXT:    andpd %xmm2, %xmm3
+; SSE2-NEXT:    minsd %xmm0, %xmm2
+; SSE2-NEXT:    andnpd %xmm2, %xmm1
+; SSE2-NEXT:    orpd %xmm3, %xmm1
+; SSE2-NEXT:    movapd %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE41-LABEL: test_v16f64:
+; SSE41:       # %bb.0:
+; SSE41-NEXT:    movapd %xmm3, %xmm8
+; SSE41-NEXT:    movapd %xmm4, %xmm3
+; SSE41-NEXT:    minpd %xmm0, %xmm3
+; SSE41-NEXT:    cmpunordpd %xmm0, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm4, %xmm3
+; SSE41-NEXT:    movapd %xmm6, %xmm4
+; SSE41-NEXT:    minpd %xmm2, %xmm4
+; SSE41-NEXT:    cmpunordpd %xmm2, %xmm2
+; SSE41-NEXT:    movapd %xmm2, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm6, %xmm4
+; SSE41-NEXT:    movapd %xmm4, %xmm2
+; SSE41-NEXT:    minpd %xmm3, %xmm2
+; SSE41-NEXT:    cmpunordpd %xmm3, %xmm3
+; SSE41-NEXT:    movapd %xmm3, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm4, %xmm2
+; SSE41-NEXT:    movapd %xmm5, %xmm3
+; SSE41-NEXT:    minpd %xmm1, %xmm3
+; SSE41-NEXT:    cmpunordpd %xmm1, %xmm1
+; SSE41-NEXT:    movapd %xmm1, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm5, %xmm3
+; SSE41-NEXT:    movapd %xmm7, %xmm1
+; SSE41-NEXT:    minpd %xmm8, %xmm1
+; SSE41-NEXT:    cmpunordpd %xmm8, %xmm8
+; SSE41-NEXT:    movapd %xmm8, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm7, %xmm1
+; SSE41-NEXT:    movapd %xmm1, %xmm4
+; SSE41-NEXT:    minpd %xmm3, %xmm4
+; SSE41-NEXT:    cmpunordpd %xmm3, %xmm3
+; SSE41-NEXT:    movapd %xmm3, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm1, %xmm4
+; SSE41-NEXT:    movapd %xmm4, %xmm1
+; SSE41-NEXT:    minpd %xmm2, %xmm1
+; SSE41-NEXT:    cmpunordpd %xmm2, %xmm2
+; SSE41-NEXT:    movapd %xmm2, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm4, %xmm1
+; SSE41-NEXT:    movapd %xmm1, %xmm2
+; SSE41-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
+; SSE41-NEXT:    movapd %xmm1, %xmm0
+; SSE41-NEXT:    cmpunordsd %xmm1, %xmm0
+; SSE41-NEXT:    movapd %xmm0, %xmm3
+; SSE41-NEXT:    andpd %xmm2, %xmm3
+; SSE41-NEXT:    minsd %xmm1, %xmm2
+; SSE41-NEXT:    andnpd %xmm2, %xmm0
+; SSE41-NEXT:    orpd %xmm3, %xmm0
+; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: test_v16f64:
 ; AVX:       # %bb.0:
-; AVX-NEXT:    vminpd %ymm3, %ymm1, %ymm1
-; AVX-NEXT:    vminpd %ymm2, %ymm0, %ymm0
-; AVX-NEXT:    vminpd %ymm1, %ymm0, %ymm0
-; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX-NEXT:    vminpd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vminpd %ymm0, %ymm2, %ymm4
+; AVX-NEXT:    vcmpunordpd %ymm0, %ymm0, %ymm0
+; AVX-NEXT:    vblendvpd %ymm0, %ymm2, %ymm4, %ymm0
+; AVX-NEXT:    vminpd %ymm1, %ymm3, %ymm2
+; AVX-NEXT:    vcmpunordpd %ymm1, %ymm1, %ymm1
+; AVX-NEXT:    vblendvpd %ymm1, %ymm3, %ymm2, %ymm1
+; AVX-NEXT:    vminpd %ymm0, %ymm1, %ymm2
+; AVX-NEXT:    vcmpunordpd %ymm0, %ymm0, %ymm0
+; AVX-NEXT:    vblendvpd %ymm0, %ymm1, %ymm2, %ymm0
 ; AVX-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX-NEXT:    vminsd %xmm1, %xmm0, %xmm0
+; AVX-NEXT:    vminsd %xmm0, %xmm1, %xmm2
+; AVX-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm3
+; AVX-NEXT:    vblendvpd %xmm3, %xmm1, %xmm2, %xmm1
+; AVX-NEXT:    vcmpunordsd %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vextractf128 $1, %ymm0, %xmm0
+; AVX-NEXT:    vminsd %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm1
+; AVX-NEXT:    vcmpunordsd %xmm1, %xmm1, %xmm2
+; AVX-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
+; AVX-NEXT:    vminsd %xmm1, %xmm0, %xmm1
+; AVX-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm0
 ; AVX-NEXT:    vzeroupper
 ; AVX-NEXT:    retq
 ;
 ; AVX512-LABEL: test_v16f64:
 ; AVX512:       # %bb.0:
-; AVX512-NEXT:    vminpd %zmm1, %zmm0, %zmm0
-; AVX512-NEXT:    vextractf64x4 $1, %zmm0, %ymm1
-; AVX512-NEXT:    vminpd %zmm1, %zmm0, %zmm0
-; AVX512-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; AVX512-NEXT:    vminpd %xmm1, %xmm0, %xmm0
-; AVX512-NEXT:    vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
-; AVX512-NEXT:    vminsd %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    vminpd %zmm0, %zmm1, %zmm2
+; AVX512-NEXT:    vcmpunordpd %zmm0, %zmm0, %k1
+; AVX512-NEXT:    vmovapd %zmm1, %zmm2 {%k1}
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm2[1,0]
+; AVX512-NEXT:    vminsd %xmm2, %xmm0, %xmm1
+; AVX512-NEXT:    vcmpunordsd %xmm2, %xmm2, %k1
+; AVX512-NEXT:    vmovsd %xmm0, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm1, %xmm1, %k1
+; AVX512-NEXT:    vextractf128 $1, %ymm2, %xmm0
+; AVX512-NEXT:    vminsd %xmm1, %xmm0, %xmm1
+; AVX512-NEXT:    vmovsd %xmm0, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm1, %xmm1, %k1
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
+; AVX512-NEXT:    vminsd %xmm1, %xmm0, %xmm1
+; AVX512-NEXT:    vmovsd %xmm0, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm1, %xmm1, %k1
+; AVX512-NEXT:    vextractf32x4 $2, %zmm2, %xmm0
+; AVX512-NEXT:    vminsd %xmm1, %xmm0, %xmm1
+; AVX512-NEXT:    vmovsd %xmm0, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm1, %xmm1, %k1
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
+; AVX512-NEXT:    vminsd %xmm1, %xmm0, %xmm1
+; AVX512-NEXT:    vmovsd %xmm0, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm1, %xmm1, %k1
+; AVX512-NEXT:    vextractf32x4 $3, %zmm2, %xmm0
+; AVX512-NEXT:    vminsd %xmm1, %xmm0, %xmm1
+; AVX512-NEXT:    vmovsd %xmm0, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vcmpunordsd %xmm1, %xmm1, %k1
+; AVX512-NEXT:    vpermilpd {{.*#+}} xmm2 = xmm0[1,0]
+; AVX512-NEXT:    vminsd %xmm1, %xmm2, %xmm0
+; AVX512-NEXT:    vmovsd %xmm2, %xmm0, %xmm0 {%k1}
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
   %1 = call double @llvm.experimental.vector.reduce.fmin.v16f64(<16 x double> %a0)
@@ -319,6 +1092,7 @@ declare float @llvm.experimental.vector.reduce.fmin.v8f32(<8 x float>)
 declare float @llvm.experimental.vector.reduce.fmin.v16f32(<16 x float>)
 
 declare double @llvm.experimental.vector.reduce.fmin.v2f64(<2 x double>)
+declare double @llvm.experimental.vector.reduce.fmin.v3f64(<3 x double>)
 declare double @llvm.experimental.vector.reduce.fmin.v4f64(<4 x double>)
 declare double @llvm.experimental.vector.reduce.fmin.v8f64(<8 x double>)
 declare double @llvm.experimental.vector.reduce.fmin.v16f64(<16 x double>)
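
As a quick orientation for the checks above: a minimal, illustrative IR snippet showing how the scalar fmin reduction is requested (the wrapper function name is hypothetical; the intrinsic signature is copied from the declarations in this test file). The vcmpunord/blendv and masked vmovsd sequences in the CHECK lines are the NaN-aware default expansion now emitted for this call:

  ; Illustrative only: reduce a <4 x double> vector to a scalar minimum.
  define double @reduce_fmin_example(<4 x double> %v) {
    %r = call double @llvm.experimental.vector.reduce.fmin.v4f64(<4 x double> %v)
    ret double %r
  }

  ; Same declaration as used by this test file.
  declare double @llvm.experimental.vector.reduce.fmin.v4f64(<4 x double>)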