[llvm] [IR] Add intrinsics to represent complex multiply and divide operations (PR #68742)

Joshua Cranmer via llvm-commits llvm-commits at lists.llvm.org
Tue Oct 10 13:39:06 PDT 2023


https://github.com/jcranmer-intel created https://github.com/llvm/llvm-project/pull/68742

This patch represents the first in a series of patches to bring a more standardized version of complex values into LLVM. Complex multiply and divide operations are added as intrinsics, and their precise behavior (with regard to potential range overflow) is controlled via attributes and fast-math flags.

With the three commits added here, the intrinsics are specified in LLVM IR, methods to construct them are added to IRBuilder, and CodeGen support is implemented, either expanding them into libcalls (to __mulsc3/__divsc3 and friends) or branchy code, or using existing complex multiply instructions. CodeGen is only verified correct for the x86 platform, though. Later commits are not included in this PR, but are available for viewing at https://github.com/jcranmer-intel/llvm-project/tree/complex-patches; they add support for pattern-matching complex multiply intrinsics in InstCombine and use these intrinsics in the clang frontend.

These changes were previously present on Phabricator at https://reviews.llvm.org/D119284, https://reviews.llvm.org/D119286, and https://reviews.llvm.org/D119287.
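For a feel for the surface syntax, here is a minimal sketch of a call to the multiply intrinsic in IR (the "complex-range" call-site attribute and the fast-math flags are the knobs described in the LangRef changes below; the function name and attribute-group spelling are just illustrative, using the usual call-site attribute syntax):

  define <2 x float> @mul_limited(<2 x float> %a, <2 x float> %b) {
    ; nnan/ninf plus "complex-range"="limited" permit the naive expansion
    %r = call nnan ninf <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float> %a, <2 x float> %b) #0
    ret <2 x float> %r
  }

  declare <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float>, <2 x float>)

  attributes #0 = { "complex-range"="limited" }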

>From 7068052818e2bd56517e8f75917fb3b0ac02e7bd Mon Sep 17 00:00:00 2001
From: Joshua Cranmer <joshua.cranmer at intel.com>
Date: Tue, 10 Oct 2023 12:53:54 -0700
Subject: [PATCH 1/3] [IR] Add intrinsics to represent complex multiply and
 divide instructions.

This patch represents the first in a series of patches to bring a more
standardized version of complex values into LLVM. Complex multiply and divide
operations are added as intrinsics, and their precise behavior is controlled
via attributes and fast-math flags.
---
 llvm/docs/LangRef.rst                    | 171 +++++++++++++++++++++++
 llvm/include/llvm/IR/Intrinsics.td       |  10 ++
 llvm/lib/IR/Verifier.cpp                 |  12 ++
 llvm/test/Verifier/complex-intrinsics.ll |  39 ++++++
 4 files changed, 232 insertions(+)
 create mode 100644 llvm/test/Verifier/complex-intrinsics.ll

diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 1883e9f6290b151..3d6323cee63b193 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -18448,6 +18448,177 @@ will be on any later loop iteration.
 This intrinsic will only return 0 if the input count is also 0. A non-zero input
 count will produce a non-zero result.
 
+Complex Intrinsics
+------------------
+
+Complex numbers are currently represented, for intrinsic purposes, as vectors of
+floating-point numbers. A scalar complex type is represented using the type
+``<2 x floatty>``, with index ``0`` corresponding to the real part of the number
+and index ``1`` corresponding to the imaginary part of the number. A vector
+complex type is represented by an even-length vector of floating-point numbers,
+with even indices (``0``, ``2``, etc.) holding the real parts and the following
+odd indices (``1``, ``3``, etc.) holding the corresponding imaginary parts.
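+
+For illustration, the scalar complex value ``1.0 + 2.0i`` and a two-element
+complex vector holding ``1.0 + 2.0i`` and ``3.0 + 4.0i`` would be represented
+as the following constants, respectively:
+
+.. code-block:: llvm
+
+      <2 x float> <float 1.0, float 2.0>
+      <4 x float> <float 1.0, float 2.0, float 3.0, float 4.0>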
+
+The precise semantics of these intrinsics depend on the value of the
+``complex-range`` attribute provided as a call-site attribute. This attribute
+takes one of three possible values:
+
+``"full"``
+  The semantics follow the full expansion as given in Annex G of the C
+  specification. In general, this means the intrinsic needs to be expanded into
+  a call to the appropriate routine in compiler-rt (e.g., ``__mulsc3``).
+
+``"no-nan"``
+  Complex infinities are permitted to be represented as NaNs instead, as if the
+  code for the appropriate routine were compiled in a manner that allowed
+  ``isnan(x)`` or ``isinf(x)`` to be optimized to false.
+
+``"limited"``
+  The semantics are equivalent to the naive arithmetic expansion (the specific
+  expansion is detailed for each intrinsic below).
+
+When this attribute is not present, it is presumed to be ``"full"`` if no
+fast-math flags are set, and ``"no-nan"`` if ``nnan`` or ``ninf`` flags are
+present.
+
+Fast-math flags are additionally relevant for these intrinsics, particularly
+for the ``complex-range=limited`` variants: those are likely to be expanded
+during code generation, and the fast-math flags will propagate to the expanded
+IR in such circumstances.
+
+Intrinsics for complex addition and subtraction are not provided, as these are
+equivalent to ``fadd`` and ``fsub`` instructions, respectively.
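+
+As an illustrative example (using the standard call-site attribute syntax), a
+single-precision complex multiply with the ``"limited"`` range semantics might
+be written as:
+
+.. code-block:: llvm
+
+      %res = call nnan <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float> %a, <2 x float> %b) #0
+
+      attributes #0 = { "complex-range"="limited" }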
+
+'``llvm.experimental.complex.fmul.*``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+      declare <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float> <op1>, <2 x float> <op2>)
+      declare <2 x double> @llvm.experimental.complex.fmul.v2f64(<2 x double> <op1>, <2 x double> <op2>)
+      declare <4 x float> @llvm.experimental.complex.fmul.v4f32(<4 x float> <op1>, <4 x float> <op2>)
+
+Overview:
+"""""""""
+
+The '``llvm.experimental.complex.fmul``' intrinsic returns the product of its
+two operands.
+
+Arguments:
+""""""""""
+
+Each argument to the '``llvm.experimental.complex.fmul``' intrinsic must be a
+:ref:`vector <t_vector>` of :ref:`floating-point <t_floating>` type whose
+length is divisible by 2.
+
+Semantics:
+""""""""""
+
+The value produced is the complex product of the two inputs.
+
+If the value of the ``complex-range`` attribute is ``no-nan`` or ``limited``, or
+if the ``ninf`` or ``nnan`` fast-math flags are provided, the output may be
+equivalent to the following code:
+
+.. code-block:: llvm
+
+      define <2 x float> @limited_complex_mul(<2 x float> %op1, <2 x float> %op2) {
+        %x = extractelement <2 x float> %op1, i32 0 ; real of %op1
+        %y = extractelement <2 x float> %op1, i32 1 ; imag of %op1
+        %u = extractelement <2 x float> %op2, i32 0 ; real of %op2
+        %v = extractelement <2 x float> %op2, i32 1 ; imag of %op2
+        %xu = fmul float %x, %u
+        %yv = fmul float %y, %v
+        %yu = fmul float %y, %u
+        %xv = fmul float %x, %v
+        %out_real = fsub float %xu, %yv
+        %out_imag = fadd float %yu, %xv
+        %ret.0 = insertelement <2 x float> undef, float %out_real, i32 0
+        %ret.1 = insertelement <2 x float> %ret.0, float %out_imag, i32 1
+        ret <2 x float> %ret.1
+      }
+
+When the ``complex-range`` attribute is set to ``full``, or is missing and no
+relevant fast-math flags are present, the above code is insufficient to handle
+the result. Instead, additional code must check for infinities when either the
+real or imaginary component of the result is a NaN value.
+
+
+'``llvm.experimental.complex.fdiv.*``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+      declare <2 x float> @llvm.experimental.complex.fdiv.v2f32(<2 x float> <op1>, <2 x float> <op2>)
+      declare <2 x double> @llvm.experimental.complex.fdiv.v2f64(<2 x double> <op1>, <2 x double> <op2>)
+      declare <4 x float> @llvm.experimental.complex.fdiv.v4f32(<4 x float> <op1>, <4 x float> <op2>)
+
+Overview:
+"""""""""
+
+The '``llvm.experimental.complex.fdiv``' intrinsic returns the quotient of its
+two operands.
+
+Arguments:
+""""""""""
+
+Each argument to the '``llvm.experimental.complex.fdiv``' intrinsic must be a
+:ref:`vector <t_vector>` of :ref:`floating-point <t_floating>` type whose
+length is divisible by 2.
+
+Semantics:
+""""""""""
+
+The value produced is the complex quotient of the two inputs.
+
+If the ``complex-range`` attribute is set to ``limited``, the output will be
+equivalent to the following code:
+
+.. code-block:: llvm
+
+      define <2 x float> @limited_complex_div(<2 x float> %op1, <2 x float> %op2) {
+        %x = extractelement <2 x float> %op1, i32 0 ; real of %op1
+        %y = extractelement <2 x float> %op1, i32 1 ; imag of %op1
+        %u = extractelement <2 x float> %op2, i32 0 ; real of %op2
+        %v = extractelement <2 x float> %op2, i32 1 ; imag of %op2
+        %xu = fmul float %x, %u
+        %yv = fmul float %y, %v
+        %yu = fmul float %y, %u
+        %xv = fmul float %x, %v
+        %uu = fmul float %u, %u
+        %vv = fmul float %v, %v
+        %unscaled_real = fadd float %xu, %yv
+        %unscaled_imag = fsub float %yu, %xv
+        %scale = fadd float %uu, %vv
+        %out_real = fdiv float %unscaled_real, %scale
+        %out_imag = fdiv float %unscaled_imag, %scale
+        %ret.0 = insertelement <2 x float> undef, float %out_real, i32 0
+        %ret.1 = insertelement <2 x float> %ret.0, float %out_imag, i32 1
+        ret <2 x float> %ret.1
+      }
+
+If the ``complex-range`` attribute is set to ``no-nan`` (or the ``nnan`` or
+``ninf`` flags are specified), an additional range reduction step is necessary.
+
+If the ``complex-range`` attribute is set to ``full``, or is missing entirely,
+then an additional check is necessary after the computation to recover
+infinities that would otherwise be represented as NaN values.
+
+Note that when ``complex-range`` is set to ``limited``, and the code is expanded
+to the IR provided above, the fast-math flags are duplicated onto the expanded
+code. In particular, the ``arcp`` fast-math flag may also be useful, as it
+permits the divisions to be replaced with multiplication by a reciprocal
+instead.
+
 Matrix Intrinsics
 -----------------
 
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index ab15b1f1e0ee888..35e3c281861dfd8 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -2350,6 +2350,16 @@ let IntrProperties = [IntrNoMem, IntrSpeculatable] in {
                                          [llvm_anyvector_ty]>;
 }
 
+//===----- Complex math intrinsics ----------------------------------------===//
+
+def int_experimental_complex_fmul: DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+                                            [LLVMMatchType<0>,LLVMMatchType<0>],
+                                            [IntrNoMem]>;
+
+def int_experimental_complex_fdiv: DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+                                            [LLVMMatchType<0>,LLVMMatchType<0>],
+                                            [IntrNoMem]>;
+
 //===----- Matrix intrinsics ---------------------------------------------===//
 
 def int_matrix_transpose
diff --git a/llvm/lib/IR/Verifier.cpp b/llvm/lib/IR/Verifier.cpp
index 188e4a4a658f330..c453c944d37e660 100644
--- a/llvm/lib/IR/Verifier.cpp
+++ b/llvm/lib/IR/Verifier.cpp
@@ -5966,6 +5966,18 @@ void Verifier::visitIntrinsicCall(Intrinsic::ID ID, CallBase &Call) {
             &Call);
     break;
   }
+  case Intrinsic::experimental_complex_fdiv:
+  case Intrinsic::experimental_complex_fmul: {
+    // Check that the vector type is a pair of floating-point types.
+    Type *ArgTy = Call.getArgOperand(0)->getType();
+    FixedVectorType *VectorTy = dyn_cast<FixedVectorType>(ArgTy);
+    Check(VectorTy && VectorTy->getNumElements() % 2 == 0 &&
+            VectorTy->getElementType()->isFloatingPointTy(),
+          "complex intrinsic must use an even-length vector of floating-point "
+          "types",
+          &Call);
+    break;
+  }
   };
 
   // Verify that there aren't any unmediated control transfers between funclets.
diff --git a/llvm/test/Verifier/complex-intrinsics.ll b/llvm/test/Verifier/complex-intrinsics.ll
new file mode 100644
index 000000000000000..21d46b39ef80c17
--- /dev/null
+++ b/llvm/test/Verifier/complex-intrinsics.ll
@@ -0,0 +1,39 @@
+; RUN: opt -passes=verify -S < %s 2>&1 | FileCheck --check-prefix=CHECK1 %s
+; RUN: opt -passes=verify -S < %s 2>&1 | FileCheck --check-prefix=CHECK2 %s
+; RUN: sed -e s/.T3:// %s | not opt -passes=verify -disable-output 2>&1 | FileCheck --check-prefix=CHECK3 %s
+; RUN: sed -e s/.T4:// %s | not opt -passes=verify -disable-output 2>&1 | FileCheck --check-prefix=CHECK4 %s
+
+; Check that a double-valued complex fmul is accepted, and attributes are
+; correct.
+; CHECK1: declare <2 x double> @llvm.experimental.complex.fmul.v2f64(<2 x double>, <2 x double>) #[[ATTR:[0-9]+]]
+; CHECK1:  attributes #[[ATTR]] = { nocallback nofree nosync nounwind willreturn memory(none) }
+declare <2 x double> @llvm.experimental.complex.fmul.v2f64(<2 x double>, <2 x double>)
+define <2 x double> @t1(<2 x double> %a, <2 x double> %b) {
+  %res = call <2 x double> @llvm.experimental.complex.fmul.v2f64(<2 x double> %a, <2 x double> %b)
+  ret <2 x double> %res
+}
+
+; Test that vector complex values are supported.
+; CHECK2: declare <4 x double> @llvm.experimental.complex.fmul.v4f64(<4 x double>, <4 x double>) #[[ATTR:[0-9]+]]
+; CHECK2:  attributes #[[ATTR]] = { nocallback nofree nosync nounwind willreturn memory(none) }
+declare <4 x double> @llvm.experimental.complex.fmul.v4f64(<4 x double>, <4 x double>)
+define <4 x double> @t2(<4 x double> %a, <4 x double> %b) {
+  %res = call <4 x double> @llvm.experimental.complex.fmul.v4f64(<4 x double> %a, <4 x double> %b)
+  ret <4 x double> %res
+}
+
+; Test that odd-length vectors are not supported.
+; CHECK3: complex intrinsic must use an even-length vector of floating-point types
+;T3: declare <3 x double> @llvm.experimental.complex.fmul.v3f64(<3 x double>, <3 x double>)
+;T3: define <3 x double> @t3(<3 x double> %a, <3 x double> %b) {
+;T3:   %res = call <3 x double> @llvm.experimental.complex.fmul.v3f64(<3 x double> %a, <3 x double> %b)
+;T3:   ret <3 x double> %res
+;T3: }
+
+; Test that non-floating point complex types are not supported.
+; CHECK4: complex intrinsic must use an even-length vector of floating-point types
+;T4: declare <2 x i64> @llvm.experimental.complex.fmul.v2i64(<2 x i64>, <2 x i64>)
+;T4: define <2 x i64> @t4(<2 x i64> %a, <2 x i64> %b) {
+;T4:   %res = call <2 x i64> @llvm.experimental.complex.fmul.v2i64(<2 x i64> %a, <2 x i64> %b)
+;T4:   ret <2 x i64> %res
+;T4: }

>From beee502a18c01e5f6bb4845f41d4d6e00f845ef5 Mon Sep 17 00:00:00 2001
From: Joshua Cranmer <joshua.cranmer at intel.com>
Date: Tue, 10 Oct 2023 12:54:48 -0700
Subject: [PATCH 2/3] [IRBuilder] Add methods to construct complex intrinsics
 to IRBuilder.

---
 llvm/include/llvm/IR/IRBuilder.h | 37 ++++++++++++++++++++++++++++++++
 llvm/lib/IR/IRBuilder.cpp        | 32 +++++++++++++++++++++++++++
 2 files changed, 69 insertions(+)
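
As a rough sketch of what the new helpers produce (assuming a builder with the
nnan flag set; value names are illustrative), CreateComplexValue followed by
CreateComplexMul yields IR along these lines:

  %z.0 = insertelement <2 x float> poison, float %re, i64 0
  %z = insertelement <2 x float> %z.0, float %im, i64 1
  %r = call nnan <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float> %z, <2 x float> %w) #0
  ; #0 carries "complex-range"="limited" here (or "full" when no flags are set)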

diff --git a/llvm/include/llvm/IR/IRBuilder.h b/llvm/include/llvm/IR/IRBuilder.h
index c9f243fdb12e404..dacdfec0d5da756 100644
--- a/llvm/include/llvm/IR/IRBuilder.h
+++ b/llvm/include/llvm/IR/IRBuilder.h
@@ -1762,6 +1762,43 @@ class IRBuilderBase {
   Value *CreateNAryOp(unsigned Opc, ArrayRef<Value *> Ops,
                       const Twine &Name = "", MDNode *FPMathTag = nullptr);
 
+  /// Construct a complex value out of a pair of real and imaginary values.
+  /// The resulting value will be a vector, with lane 0 being the real value and
+  /// lane 1 being the imaginary value.
+  /// Either the \p Real or \p Imag parameter may be null, if the input is a
+  /// pure real or pure imaginary number.
+  Value *CreateComplexValue(Value *Real, Value *Imag, const Twine &Name = "") {
+    Type *ScalarTy = (Real ? Real : Imag)->getType();
+    assert(ScalarTy->isFloatingPointTy() &&
+           "Only floating-point types may be complex values.");
+    Type *ComplexTy = FixedVectorType::get(ScalarTy, 2);
+    Value *Base = PoisonValue::get(ComplexTy);
+    if (Real)
+      Base = CreateInsertElement(Base, Real, uint64_t(0), Name);
+    if (Imag)
+      Base = CreateInsertElement(Base, Imag, uint64_t(1), Name);
+    return Base;
+  }
+
+  /// Construct a complex multiply operation, setting fast-math flags and the
+  /// complex-range attribute as appropriate.
+  Value *CreateComplexMul(Value *L, Value *R, bool CxLimitedRange,
+                          const Twine &Name = "");
+
+  /// Construct a complex divide operation, setting fast-math flags and the
+  /// complex-range attribute as appropriate.
+  /// The complex-range attribute is set from the \p IgnoreNaNs and
+  /// \p DisableScaling as follows:
+  ///
+  /// \p IgnoreNaNs | \p DisableScaling | complex-range value
+  /// ------------- | ----------------- | -------------------
+  /// false         | false             | full
+  /// false         | true              | (illegal combination)
+  /// true          | false             | no-nan
+  /// true          | true              | limited
+  Value *CreateComplexDiv(Value *L, Value *R, bool IgnoreNaNs,
+                          bool DisableScaling = false, const Twine &Name = "");
+
   //===--------------------------------------------------------------------===//
   // Instruction creation methods: Memory Instructions
   //===--------------------------------------------------------------------===//
diff --git a/llvm/lib/IR/IRBuilder.cpp b/llvm/lib/IR/IRBuilder.cpp
index b321d8b325fe0be..9ede65d95cb9865 100644
--- a/llvm/lib/IR/IRBuilder.cpp
+++ b/llvm/lib/IR/IRBuilder.cpp
@@ -1116,6 +1116,38 @@ CallInst *IRBuilderBase::CreateConstrainedFPCall(
   return C;
 }
 
+Value *IRBuilderBase::CreateComplexMul(Value *L, Value *R, bool CxLimitedRange,
+                                       const Twine &Name) {
+  CallInst *Result = CreateBinaryIntrinsic(Intrinsic::experimental_complex_fmul,
+                                           L, R, nullptr, Name);
+  Result->setFastMathFlags(FMF);
+  AttributeList Attrs = Result->getAttributes();
+  StringRef Range =
+      (CxLimitedRange || FMF.noNaNs() || FMF.noInfs()) ? "limited" : "full";
+  Attrs = Attrs.addFnAttribute(getContext(), "complex-range", Range);
+  Result->setAttributes(Attrs);
+  return Result;
+}
+
+Value *IRBuilderBase::CreateComplexDiv(Value *L, Value *R, bool IgnoreNaNs,
+                                       bool DisableScaling, const Twine &Name) {
+  CallInst *Result = CreateBinaryIntrinsic(Intrinsic::experimental_complex_fdiv,
+                                           L, R, nullptr, Name);
+  Result->setFastMathFlags(FMF);
+  AttributeList Attrs = Result->getAttributes();
+  StringRef Range = "full";
+  if (DisableScaling) {
+    assert(IgnoreNaNs &&
+           "complex division DisableScaling should imply IgnoreNaNs");
+    Range = "limited";
+  } else if (IgnoreNaNs || FMF.noNaNs() || FMF.noInfs()) {
+    Range = "no-nan";
+  }
+  Attrs = Attrs.addFnAttribute(getContext(), "complex-range", Range);
+  Result->setAttributes(Attrs);
+  return Result;
+}
+
 Value *IRBuilderBase::CreateSelect(Value *C, Value *True, Value *False,
                                    const Twine &Name, Instruction *MDFrom) {
   if (auto *V = Folder.FoldSelect(C, True, False))

>From bcd640cbdccbe58c1c301502ab8b93fdbd6cb04b Mon Sep 17 00:00:00 2001
From: Joshua Cranmer <joshua.cranmer at intel.com>
Date: Tue, 10 Oct 2023 12:55:01 -0700
Subject: [PATCH 3/3] [CodeGen] Expand complex multiply and divide intrinsics
 for codegen.

For architectures without complex multiply or divide instructions (most of
them), a pass is needed to expand these intrinsics before codegen.

The tricky part is that where the intrinsics need to be expanded into a call to
a compiler-rt helper function (e.g., __mulsc3), the ABI of the complex
floating-point type needs to be retrieved from the target. However, this target
information has only been validated for x86 so far.

This also adds support for lowering the complex multiply intrinsic directly to
instructions for the x86 backend.
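
To make the expansion concrete, here is a rough before/after sketch for a
single-precision multiply on x86-64 (where, per the ABI hook below, the complex
float return uses the vector form); the "full"-range case becomes a libcall,
while the "limited" case becomes inline arithmetic:

  ; before
  %r = call <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float> %a, <2 x float> %b)

  ; after, with "complex-range"="full": call the compiler-rt helper
  %a.re = extractelement <2 x float> %a, i64 0
  %a.im = extractelement <2 x float> %a, i64 1
  %b.re = extractelement <2 x float> %b, i64 0
  %b.im = extractelement <2 x float> %b, i64 1
  %r = call <2 x float> @__mulsc3(float %a.re, float %a.im, float %b.re, float %b.im)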
---
 llvm/include/llvm/CodeGen/ExpandComplex.h     |  22 +
 llvm/include/llvm/CodeGen/ISDOpcodes.h        |   3 +
 llvm/include/llvm/CodeGen/Passes.h            |   6 +
 llvm/include/llvm/CodeGen/TargetLowering.h    |  19 +
 llvm/include/llvm/InitializePasses.h          |   1 +
 .../include/llvm/Target/TargetSelectionDAG.td |   1 +
 llvm/lib/CodeGen/CMakeLists.txt               |   1 +
 llvm/lib/CodeGen/ExpandComplex.cpp            | 294 ++++++++++
 .../SelectionDAG/LegalizeVectorTypes.cpp      |   2 +
 .../SelectionDAG/SelectionDAGBuilder.cpp      |   6 +
 .../SelectionDAG/SelectionDAGDumper.cpp       |   1 +
 llvm/lib/CodeGen/TargetPassConfig.cpp         |   4 +
 llvm/lib/Target/X86/X86ISelLowering.cpp       | 135 +++++
 llvm/lib/Target/X86/X86ISelLowering.h         |   4 +
 llvm/test/CodeGen/AArch64/O0-pipeline.ll      |   1 +
 llvm/test/CodeGen/AArch64/O3-pipeline.ll      |   1 +
 llvm/test/CodeGen/AMDGPU/llc-pipeline.ll      |   5 +
 llvm/test/CodeGen/ARM/O3-pipeline.ll          |   1 +
 llvm/test/CodeGen/LoongArch/O0-pipeline.ll    |   1 +
 llvm/test/CodeGen/LoongArch/opt-pipeline.ll   |   1 +
 llvm/test/CodeGen/PowerPC/O0-pipeline.ll      |   1 +
 llvm/test/CodeGen/PowerPC/O3-pipeline.ll      |   1 +
 llvm/test/CodeGen/RISCV/O0-pipeline.ll        |   1 +
 llvm/test/CodeGen/RISCV/O3-pipeline.ll        |   1 +
 llvm/test/CodeGen/X86/O0-pipeline.ll          |   1 +
 llvm/test/CodeGen/X86/complex-32bit.ll        | 173 ++++++
 llvm/test/CodeGen/X86/complex-64bit.ll        | 103 ++++
 llvm/test/CodeGen/X86/complex-divide.ll       |  92 +++
 llvm/test/CodeGen/X86/complex-multiply.ll     | 525 ++++++++++++++++++
 llvm/test/CodeGen/X86/complex-win32.ll        |  59 ++
 llvm/test/CodeGen/X86/complex-win64.ll        |  44 ++
 .../test/CodeGen/X86/fp16-complex-multiply.ll | 231 ++++++++
 llvm/test/CodeGen/X86/opt-pipeline.ll         |   1 +
 33 files changed, 1742 insertions(+)
 create mode 100644 llvm/include/llvm/CodeGen/ExpandComplex.h
 create mode 100644 llvm/lib/CodeGen/ExpandComplex.cpp
 create mode 100644 llvm/test/CodeGen/X86/complex-32bit.ll
 create mode 100644 llvm/test/CodeGen/X86/complex-64bit.ll
 create mode 100644 llvm/test/CodeGen/X86/complex-divide.ll
 create mode 100644 llvm/test/CodeGen/X86/complex-multiply.ll
 create mode 100644 llvm/test/CodeGen/X86/complex-win32.ll
 create mode 100644 llvm/test/CodeGen/X86/complex-win64.ll
 create mode 100644 llvm/test/CodeGen/X86/fp16-complex-multiply.ll

diff --git a/llvm/include/llvm/CodeGen/ExpandComplex.h b/llvm/include/llvm/CodeGen/ExpandComplex.h
new file mode 100644
index 000000000000000..0186fa75ee395ab
--- /dev/null
+++ b/llvm/include/llvm/CodeGen/ExpandComplex.h
@@ -0,0 +1,22 @@
+//===---- ExpandComplex.h - Expand experimental complex intrinsics --------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_CODEGEN_EXPANDCOMPLEX_H
+#define LLVM_CODEGEN_EXPANDCOMPLEX_H
+
+#include "llvm/IR/PassManager.h"
+
+namespace llvm {
+
+class ExpandComplexPass : public PassInfoMixin<ExpandComplexPass> {
+public:
+  PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
+};
+} // end namespace llvm
+
+#endif // LLVM_CODEGEN_EXPANDCOMPLEX_H
diff --git a/llvm/include/llvm/CodeGen/ISDOpcodes.h b/llvm/include/llvm/CodeGen/ISDOpcodes.h
index 67779a23a191313..4f72e6bd979d77e 100644
--- a/llvm/include/llvm/CodeGen/ISDOpcodes.h
+++ b/llvm/include/llvm/CodeGen/ISDOpcodes.h
@@ -1371,6 +1371,9 @@ enum NodeType {
   // Outputs: [rv], output chain, glue
   PATCHPOINT,
 
+  /// COMPLEX_MUL - Do a naive complex multiplication.
+  COMPLEX_MUL,
+
 // Vector Predication
 #define BEGIN_REGISTER_VP_SDNODE(VPSDID, ...) VPSDID,
 #include "llvm/IR/VPIntrinsics.def"
diff --git a/llvm/include/llvm/CodeGen/Passes.h b/llvm/include/llvm/CodeGen/Passes.h
index befa8a6eb9a27ce..353c053ee5d626b 100644
--- a/llvm/include/llvm/CodeGen/Passes.h
+++ b/llvm/include/llvm/CodeGen/Passes.h
@@ -506,6 +506,12 @@ namespace llvm {
   /// printing assembly.
   ModulePass *createMachineOutlinerPass(bool RunOnAllFunctions = true);
 
+  /// This pass expands the experimental complex intrinsics into regular
+  /// floating-point arithmetic or calls to __mulsc3 (or similar) functions.
+  FunctionPass *createExpandComplexPass();
+
   /// This pass expands the reduction intrinsics into sequences of shuffles.
   FunctionPass *createExpandReductionsPass();
 
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 187e000d0272d2e..19e28999d18ec00 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -676,6 +676,24 @@ class TargetLoweringBase {
     return false;
   }
 
+  /// Enum that specifies how a C complex type is lowered (in LLVM type terms).
+  enum class ComplexABI {
+    Memory,  ///< Indicates that a pointer to the struct is passed.
+    Vector,  ///< Indicates that T _Complex can be passed as <2 x T>.
+    Struct,  ///< Indicates that T _Complex can be passed as {T, T}.
+    Integer, ///< Indicates that an integer of the same size is passed.
+  };
+
+  /// Returns how a C complex type is lowered when used as the return value.
+  virtual ComplexABI getComplexReturnABI(Type *ScalarFloatTy) const {
+    return ComplexABI::Struct;
+  }
+
+  /// Returns true if the target can match the @llvm.experimental.complex.fmul
+  /// intrinsic with the given type. Such an intrinsic is assumed to only be
+  /// matched when "complex-range" is "limited" or "no-nan".
+  virtual bool CustomLowerComplexMultiply(Type *FloatTy) const { return false; }
+
   /// Return if the target supports combining a
   /// chain like:
   /// \code
@@ -2783,6 +2801,7 @@ class TargetLoweringBase {
     case ISD::AVGCEILU:
     case ISD::ABDS:
     case ISD::ABDU:
+    case ISD::COMPLEX_MUL:
       return true;
     default: return false;
     }
diff --git a/llvm/include/llvm/InitializePasses.h b/llvm/include/llvm/InitializePasses.h
index db653fff71ba95a..f1855763937037a 100644
--- a/llvm/include/llvm/InitializePasses.h
+++ b/llvm/include/llvm/InitializePasses.h
@@ -111,6 +111,7 @@ void initializeEdgeBundlesPass(PassRegistry&);
 void initializeEHContGuardCatchretPass(PassRegistry &);
 void initializeExpandLargeFpConvertLegacyPassPass(PassRegistry&);
 void initializeExpandLargeDivRemLegacyPassPass(PassRegistry&);
+void initializeExpandComplexPass(PassRegistry &);
 void initializeExpandMemCmpPassPass(PassRegistry&);
 void initializeExpandPostRAPass(PassRegistry&);
 void initializeExpandReductionsPass(PassRegistry&);
diff --git a/llvm/include/llvm/Target/TargetSelectionDAG.td b/llvm/include/llvm/Target/TargetSelectionDAG.td
index fa5761c3a199a56..11515063cdbc4e7 100644
--- a/llvm/include/llvm/Target/TargetSelectionDAG.td
+++ b/llvm/include/llvm/Target/TargetSelectionDAG.td
@@ -770,6 +770,7 @@ def assertsext : SDNode<"ISD::AssertSext", SDT_assert>;
 def assertzext : SDNode<"ISD::AssertZext", SDT_assert>;
 def assertalign : SDNode<"ISD::AssertAlign", SDT_assert>;
 
+def COMPLEX_MUL : SDNode<"ISD::COMPLEX_MUL", SDTFPBinOp, [SDNPCommutative]>;
 //===----------------------------------------------------------------------===//
 // Selection DAG Condition Codes
 
diff --git a/llvm/lib/CodeGen/CMakeLists.txt b/llvm/lib/CodeGen/CMakeLists.txt
index 389c70d04f17ba3..df214361abe9588 100644
--- a/llvm/lib/CodeGen/CMakeLists.txt
+++ b/llvm/lib/CodeGen/CMakeLists.txt
@@ -68,6 +68,7 @@ add_llvm_component_library(LLVMCodeGen
   EdgeBundles.cpp
   EHContGuardCatchret.cpp
   ExecutionDomainFix.cpp
+  ExpandComplex.cpp
   ExpandLargeDivRem.cpp
   ExpandLargeFpConvert.cpp
   ExpandMemCmp.cpp
diff --git a/llvm/lib/CodeGen/ExpandComplex.cpp b/llvm/lib/CodeGen/ExpandComplex.cpp
new file mode 100644
index 000000000000000..253de368cc7d4cf
--- /dev/null
+++ b/llvm/lib/CodeGen/ExpandComplex.cpp
@@ -0,0 +1,294 @@
+//===-- ExpandComplex.cpp - Expand experimental complex intrinsics --------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This pass implements IR expansion for complex intrinsics, allowing targets
+// to enable the intrinsics until just before codegen.
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/CodeGen/ExpandComplex.h"
+#include "llvm/CodeGen/Passes.h"
+#include "llvm/CodeGen/TargetLowering.h"
+#include "llvm/CodeGen/TargetPassConfig.h"
+#include "llvm/CodeGen/TargetSubtargetInfo.h"
+#include "llvm/IR/Function.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/InstIterator.h"
+#include "llvm/IR/IntrinsicInst.h"
+#include "llvm/IR/Intrinsics.h"
+#include "llvm/IR/Module.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Pass.h"
+#include "llvm/Target/TargetMachine.h"
+
+using namespace llvm;
+using namespace llvm::PatternMatch;
+
+namespace {
+
+bool expandComplexInstruction(IntrinsicInst *CI, const TargetLowering *TLI,
+                              const DataLayout &DL) {
+  Intrinsic::ID Opcode = CI->getIntrinsicID();
+  assert((Opcode == Intrinsic::experimental_complex_fmul ||
+          Opcode == Intrinsic::experimental_complex_fdiv) &&
+         "Expected a complex instruction");
+
+  // Break the input values up into real and imaginary pieces.
+  Type *ComplexVectorTy = CI->getArgOperand(0)->getType();
+  Type *FloatTy = ComplexVectorTy->getScalarType();
+  IRBuilder<> Builder(CI);
+  Builder.setFastMathFlags(CI->getFastMathFlags());
+  Value *LhsR = Builder.CreateExtractElement(CI->getArgOperand(0), uint64_t(0));
+  Value *LhsI = Builder.CreateExtractElement(CI->getArgOperand(0), uint64_t(1));
+  Value *RhsR = nullptr, *RhsI = nullptr;
+  RhsR = Builder.CreateExtractElement(CI->getArgOperand(1), uint64_t(0));
+  RhsI = Builder.CreateExtractElement(CI->getArgOperand(1), uint64_t(1));
+
+  // The expansion has three pieces: the naive arithmetic, a possible prescaling
+  // (not relevant for multiplication), and a step to convert NaN output values
+  // to infinity values in certain situations (see Annex G of the C
+  // specification for more details). The "complex-range" attribute determines
+  // how many we need: "limited" has just the first one, "no-nan" the first two,
+  // and "full" for all three.
+
+  // Get the "complex-range" attribute, setting a default based on the presence
+  // of fast-math flags.
+  StringRef Range = CI->getFnAttr("complex-range").getValueAsString();
+  if (Range.empty()) {
+    Range = CI->getFastMathFlags().noNaNs() || CI->getFastMathFlags().noInfs()
+                ? "no-nan"
+                : "full";
+  }
+
+  // We can expand to naive arithmetic code if we only need the first piece. For
+  // multiplication, we can also accept "no-nan", since there is no semantic
+  // difference between "limited" and "no-nan" in that case.
+  bool CanExpand =
+      Range == "limited" ||
+      (Range == "no-nan" && Opcode == Intrinsic::experimental_complex_fmul);
+
+  Value *OutReal, *OutImag;
+  if (!CanExpand) {
+    // Do a call directly to the compiler-rt library here.
+    const char *Name = nullptr;
+    if (Opcode == Intrinsic::experimental_complex_fmul) {
+      if (FloatTy->isHalfTy())
+        Name = "__mulhc3";
+      else if (FloatTy->isFloatTy())
+        Name = "__mulsc3";
+      else if (FloatTy->isDoubleTy())
+        Name = "__muldc3";
+      else if (FloatTy->isX86_FP80Ty())
+        Name = "__mulxc3";
+      else if (FloatTy->isFP128Ty() || FloatTy->isPPC_FP128Ty())
+        Name = "__multc3";
+    } else if (Opcode == Intrinsic::experimental_complex_fdiv) {
+      if (FloatTy->isHalfTy())
+        Name = "__divhc3";
+      else if (FloatTy->isFloatTy())
+        Name = "__divsc3";
+      else if (FloatTy->isDoubleTy())
+        Name = "__divdc3";
+      else if (FloatTy->isX86_FP80Ty())
+        Name = "__divxc3";
+      else if (FloatTy->isFP128Ty() || FloatTy->isPPC_FP128Ty())
+        Name = "__divtc3";
+    }
+
+    if (!Name)
+      report_fatal_error("Cannot find libcall for intrinsic");
+
+    // The function we are to call is T complex __name(T, T, T, T) in C terms.
+    // Use TLI to figure out the appropriate actual ABI for this function.
+    StructType *ComplexStructTy = StructType::get(FloatTy, FloatTy);
+    switch (TLI->getComplexReturnABI(FloatTy)) {
+    case TargetLowering::ComplexABI::Vector: {
+      // When the result is a vector type directly, we can replace the intrinsic
+      // with the call to the underlying function without any other munging.
+      FunctionCallee Func = CI->getModule()->getOrInsertFunction(
+          Name, ComplexVectorTy, FloatTy, FloatTy, FloatTy, FloatTy);
+      Value *NewResult = Builder.CreateCall(Func, {LhsR, LhsI, RhsR, RhsI});
+      CI->replaceAllUsesWith(NewResult);
+      CI->eraseFromParent();
+      return true;
+    }
+    case TargetLowering::ComplexABI::Integer: {
+      // This ABI form packs the type as a small struct in an integer register.
+      // All we need to do is move the integer to a vector register, without any
+      // other munging.
+      uint64_t Width =
+        ComplexVectorTy->getPrimitiveSizeInBits().getFixedValue();
+      Type *IntegerTy = Builder.getIntNTy(Width);
+      FunctionCallee Func = CI->getModule()->getOrInsertFunction(
+          Name, IntegerTy, FloatTy, FloatTy, FloatTy, FloatTy);
+      Value *NewResult = Builder.CreateBitCast(
+          Builder.CreateCall(Func, {LhsR, LhsI, RhsR, RhsI}), ComplexVectorTy);
+      CI->replaceAllUsesWith(NewResult);
+      CI->eraseFromParent();
+      return true;
+    }
+    case TargetLowering::ComplexABI::Memory: {
+      // Allocate a struct for the return type in the entry block. Stack slot
+      // coloring should remove duplicate allocations.
+      unsigned AllocaAS = DL.getAllocaAddrSpace();
+      Value *Alloca;
+      {
+        IRBuilderBase::InsertPointGuard Guard(Builder);
+        BasicBlock *EntryBB = &CI->getParent()->getParent()->getEntryBlock();
+        Builder.SetInsertPoint(EntryBB, EntryBB->begin());
+        Alloca = Builder.CreateAlloca(ComplexStructTy, AllocaAS);
+      }
+
+      AttributeList Attrs;
+      Attrs = Attrs.addParamAttribute(
+          CI->getContext(), 0,
+          Attribute::getWithStructRetType(CI->getContext(), ComplexStructTy));
+      FunctionCallee Func = CI->getModule()->getOrInsertFunction(
+          Name, std::move(Attrs), Type::getVoidTy(CI->getContext()),
+          PointerType::get(ComplexStructTy, AllocaAS), FloatTy, FloatTy,
+          FloatTy, FloatTy);
+
+      Builder.CreateCall(Func, {Alloca, LhsR, LhsI, RhsR, RhsI});
+      OutReal = Builder.CreateLoad(
+          FloatTy, Builder.CreateStructGEP(ComplexStructTy, Alloca, 0));
+      OutImag = Builder.CreateLoad(
+          FloatTy, Builder.CreateStructGEP(ComplexStructTy, Alloca, 1));
+      break;
+    }
+    case TargetLowering::ComplexABI::Struct: {
+      FunctionCallee Func = CI->getModule()->getOrInsertFunction(
+          Name, ComplexStructTy, FloatTy, FloatTy, FloatTy, FloatTy);
+      Value *ComplexStructRes =
+          Builder.CreateCall(Func, {LhsR, LhsI, RhsR, RhsI});
+      OutReal = Builder.CreateExtractValue(ComplexStructRes, 0);
+      OutImag = Builder.CreateExtractValue(ComplexStructRes, 1);
+      break;
+    }
+    }
+  } else {
+    switch (Opcode) {
+    case Intrinsic::experimental_complex_fmul: {
+      // If the target can lower this complex multiply directly, leave the
+      // intrinsic in place for instruction selection instead of expanding it.
+      if (TLI->CustomLowerComplexMultiply(ComplexVectorTy)) {
+        return false;
+      }
+
+      OutReal = Builder.CreateFSub(Builder.CreateFMul(LhsR, RhsR),
+                                   Builder.CreateFMul(LhsI, RhsI));
+      OutImag = Builder.CreateFAdd(Builder.CreateFMul(LhsI, RhsR),
+                                   Builder.CreateFMul(LhsR, RhsI));
+      break;
+    }
+    case Intrinsic::experimental_complex_fdiv: {
+      Value *Scale = Builder.CreateFAdd(Builder.CreateFMul(RhsR, RhsR),
+                                        Builder.CreateFMul(RhsI, RhsI));
+      OutReal =
+          Builder.CreateFDiv(Builder.CreateFAdd(Builder.CreateFMul(LhsR, RhsR),
+                                                Builder.CreateFMul(LhsI, RhsI)),
+                             Scale);
+      OutImag =
+          Builder.CreateFDiv(Builder.CreateFSub(Builder.CreateFMul(LhsI, RhsR),
+                                                Builder.CreateFMul(LhsR, RhsI)),
+                             Scale);
+      break;
+    }
+    }
+  }
+
+  // Replace all of the uses of the intrinsic with OutReal/OutImag. We avoid
+  // creating the vector unless we have to.
+  bool HasVectorUse = false;
+  for (User *U : CI->users()) {
+    uint64_t Index;
+    if (match(U, m_ExtractElt(m_Value(), m_ConstantInt(Index)))) {
+      assert((Index == 0 || Index == 1) && "Extract element too small");
+      U->replaceAllUsesWith(Index == 0 ? OutReal : OutImag);
+    } else {
+      HasVectorUse = true;
+    }
+  }
+
+  if (HasVectorUse) {
+    Value *OutComplex = Builder.CreateInsertElement(
+        Builder.CreateInsertElement(UndefValue::get(ComplexVectorTy), OutReal,
+                                    uint64_t(0)),
+        OutImag, uint64_t(1));
+    CI->replaceAllUsesWith(OutComplex);
+  } else {
+    CI->replaceAllUsesWith(UndefValue::get(CI->getType()));
+  }
+
+  CI->eraseFromParent();
+  return true;
+}
+
+bool expandComplexIntrinsics(Function &F, const TargetLowering *TLI) {
+  bool Changed = false;
+  SmallVector<IntrinsicInst *, 4> Worklist;
+  for (auto &I : instructions(F)) {
+    if (auto *II = dyn_cast<IntrinsicInst>(&I)) {
+      switch (II->getIntrinsicID()) {
+      default:
+        break;
+      case Intrinsic::experimental_complex_fmul:
+      case Intrinsic::experimental_complex_fdiv:
+        Worklist.push_back(II);
+        break;
+      }
+    }
+  }
+
+  const DataLayout &DL = F.getParent()->getDataLayout();
+  for (auto *II : Worklist) {
+    Changed |= expandComplexInstruction(II, TLI, DL);
+  }
+  return Changed;
+}
+
+class ExpandComplex : public FunctionPass {
+public:
+  static char ID;
+  ExpandComplex() : FunctionPass(ID) {
+    initializeExpandComplexPass(*PassRegistry::getPassRegistry());
+  }
+
+  bool runOnFunction(Function &F) override {
+    const TargetMachine *TM =
+        &getAnalysis<TargetPassConfig>().getTM<TargetMachine>();
+    const TargetSubtargetInfo *SubtargetInfo = TM->getSubtargetImpl(F);
+    const TargetLowering *TLI = SubtargetInfo->getTargetLowering();
+    return expandComplexIntrinsics(F, TLI);
+  }
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<TargetPassConfig>();
+    AU.setPreservesCFG();
+  }
+};
+} // namespace
+
+char ExpandComplex::ID;
+INITIALIZE_PASS_BEGIN(ExpandComplex, "expand-complex",
+                      "Expand complex intrinsics", false, false)
+INITIALIZE_PASS_DEPENDENCY(TargetPassConfig)
+INITIALIZE_PASS_END(ExpandComplex, "expand-complex",
+                    "Expand complex intrinsics", false, false)
+
+FunctionPass *llvm::createExpandComplexPass() { return new ExpandComplex(); }
+
+PreservedAnalyses ExpandComplexPass::run(Function &F,
+                                         FunctionAnalysisManager &AM) {
+  // FIXME: The new pass manager path does not perform the expansion yet; it is
+  // handled by the legacy pass added in TargetPassConfig::addIRPasses.
+  PreservedAnalyses PA;
+  PA.preserveSet<CFGAnalyses>();
+  return PA;
+}
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
index 1bb6fbbf064b931..d25b71c4b351eea 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
@@ -1164,6 +1164,7 @@ void DAGTypeLegalizer::SplitVectorResult(SDNode *N, unsigned ResNo) {
   case ISD::ROTL:
   case ISD::ROTR:
   case ISD::VP_FCOPYSIGN:
+  case ISD::COMPLEX_MUL:
     SplitVecRes_BinOp(N, Lo, Hi);
     break;
   case ISD::FMA: case ISD::VP_FMA:
@@ -4088,6 +4089,7 @@ void DAGTypeLegalizer::WidenVectorResult(SDNode *N, unsigned ResNo) {
   case ISD::AVGFLOORU:
   case ISD::AVGCEILS:
   case ISD::AVGCEILU:
+  case ISD::COMPLEX_MUL:
   // Vector-predicated binary op widening. Note that -- unlike the
   // unpredicated versions -- we don't have to worry about trapping on
   // operations like UDIV, FADD, etc., as we pass on the original vector
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index c5fd56795a5201a..10d80bf2be2cbad 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -7537,6 +7537,12 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
   case Intrinsic::experimental_vector_deinterleave2:
     visitVectorDeinterleave(I);
     return;
+  case Intrinsic::experimental_complex_fmul:
+    EVT ResultVT = TLI.getValueType(DAG.getDataLayout(), I.getType());
+    setValue(&I, DAG.getNode(ISD::COMPLEX_MUL, sdl, ResultVT,
+                             getValue(I.getOperand(0)),
+                             getValue(I.getOperand(1)), Flags));
+    return;
   }
 }
 
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
index a92111ca23656eb..75007a03573e603 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
@@ -513,6 +513,7 @@ std::string SDNode::getOperationName(const SelectionDAG *G) const {
     return "stackmap";
   case ISD::PATCHPOINT:
     return "patchpoint";
+  case ISD::COMPLEX_MUL:                return "complex_mul";
 
     // Vector Predication
 #define BEGIN_REGISTER_VP_SDNODE(SDID, LEGALARG, NAME, ...)                    \
diff --git a/llvm/lib/CodeGen/TargetPassConfig.cpp b/llvm/lib/CodeGen/TargetPassConfig.cpp
index e6ecbc9b03f7149..d376cad93dcb4fb 100644
--- a/llvm/lib/CodeGen/TargetPassConfig.cpp
+++ b/llvm/lib/CodeGen/TargetPassConfig.cpp
@@ -919,6 +919,10 @@ void TargetPassConfig::addIRPasses() {
   // Convert conditional moves to conditional jumps when profitable.
   if (getOptLevel() != CodeGenOptLevel::None && !DisableSelectOptimize)
     addPass(createSelectOptimizePass());
+
+  // Expand the experimental complex intrinsics into either libcalls (__mulsc3
+  // and friends) or plain floating-point arithmetic, unless the target can
+  // lower them directly.
+  addPass(createExpandComplexPass());
 }
 
 /// Turn exception handling constructs into something the code generators can
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index f2716b08c4d0312..1ebbc9fca34537b 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -2324,6 +2324,22 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
     }
   }
 
+  if (Subtarget.hasFP16()) {
+    for (auto VT : {MVT::v2f16, MVT::v4f16, MVT::v8f16, MVT::v16f16}) {
+      if (Subtarget.hasVLX())
+        setOperationAction(ISD::COMPLEX_MUL, VT, Custom);
+      setOperationAction(ISD::COMPLEX_MUL, MVT::v32f16, Custom);
+    }
+  }
+  if (Subtarget.hasAnyFMA() || (Subtarget.hasAVX512() && Subtarget.hasVLX())) {
+    for (auto VT : {MVT::v2f32, MVT::v4f32, MVT::v8f32, MVT::v2f64, MVT::v4f64})
+      setOperationAction(ISD::COMPLEX_MUL, VT, Custom);
+  }
+  if (Subtarget.hasAVX512()) {
+    setOperationAction(ISD::COMPLEX_MUL, MVT::v8f64, Custom);
+    setOperationAction(ISD::COMPLEX_MUL, MVT::v16f32, Custom);
+  }
+
   if (Subtarget.hasAMXTILE()) {
     addRegisterClass(MVT::x86amx, &X86::TILERegClass);
   }
@@ -2531,6 +2547,46 @@ X86TargetLowering::getPreferredVectorAction(MVT VT) const {
   return TargetLoweringBase::getPreferredVectorAction(VT);
 }
 
+TargetLoweringBase::ComplexABI
+X86TargetLowering::getComplexReturnABI(Type *ScalarFloatTy) const {
+  // Windows ABIs don't have dedicated _Complex rules, so they work as regular
+  // structs. These return as integers if the size is 8 bytes or fewer, or
+  // structs via memory if larger. (The size threshold is the same for both
+  // 32 and 64-bit ABIs).
+  if (Subtarget.isOSWindows()) {
+    unsigned FloatSize =
+      ScalarFloatTy->getPrimitiveSizeInBits().getFixedValue();
+    if (FloatSize <= 32) {
+      return ComplexABI::Integer;
+    } else {
+      return ComplexABI::Memory;
+    }
+  }
+  if (Subtarget.is32Bit()) {
+    if (ScalarFloatTy->isFloatTy()) {
+      return ComplexABI::Integer;
+    } else if (ScalarFloatTy->isHalfTy()) {
+      return ComplexABI::Vector;
+    } else {
+      return ComplexABI::Memory;
+    }
+  } else {
+    // The x86-64 ABI specifies that (save for x86-fp80), this is handled as a
+    // regular C struct. This means that float and smaller get packed into a
+    // single vector in xmm0; double and x86-fp80 (by special case) return two
+    // values; and types larger than x86-fp80 (i.e., fp128) return via memory.
+    unsigned FloatSize =
+      ScalarFloatTy->getPrimitiveSizeInBits().getFixedValue();
+    if (FloatSize <= 32) {
+      return ComplexABI::Vector;
+    } else if (FloatSize <= 80) {
+      return ComplexABI::Struct;
+    } else {
+      return ComplexABI::Memory;
+    }
+  }
+}
+
 FastISel *
 X86TargetLowering::createFastISel(FunctionLoweringInfo &funcInfo,
                                   const TargetLibraryInfo *libInfo) const {
@@ -31750,6 +31806,68 @@ bool X86TargetLowering::isInlineAsmTargetBranch(
   return Inst.equals_insensitive("call") || Inst.equals_insensitive("jmp");
 }
 
+bool X86TargetLowering::CustomLowerComplexMultiply(Type *FloatTy) const {
+  auto VecTy = cast<FixedVectorType>(FloatTy);
+  unsigned VecSize = VecTy->getNumElements() * VecTy->getScalarSizeInBits();
+  Type *ElementTy = VecTy->getElementType();
+  if (ElementTy->isHalfTy()) {
+    // All half types need avx512fp16 enabled.
+    if (VecSize == 512)
+      // For 512-bit vector types, just avx512fp16 is needed.
+      return Subtarget.hasFP16();
+    else
+      // 128-bit and 256-bit vector types are legal, and other vector types can
+      // be widened or split. AVX512VL must also be enabled.
+      return Subtarget.hasFP16() && Subtarget.hasVLX();
+  }
+  if (ElementTy->isFloatTy() || ElementTy->isDoubleTy()) {
+    if (VecSize == 512)
+      // 512-bit vector types are legal or can be split.
+      return Subtarget.hasAVX512() || Subtarget.hasAnyFMA();
+    // 128-bit and 256-bit vector types are legal, and other types can be
+    // widened or split.
+    return Subtarget.hasAnyFMA() ||
+           (Subtarget.hasAVX512() && Subtarget.hasVLX());
+  }
+  return false;
+}
+
+static SDValue LowerComplexMUL(SDValue Op, SelectionDAG &DAG,
+                               const X86Subtarget &Subtarget) {
+  MVT VT = Op.getSimpleValueType();
+  MVT ElementTy = VT.getScalarType();
+  SDLoc DL(Op);
+  // Custom handling for half type since we have corresponding complex half
+  // multiply instructions.
+  // FIXME: We use vfmulcph for scalar complex multiply here; use vfmulcsh
+  // instead.
+  if (ElementTy == MVT::f16) {
+    // Transform llvm.experimental.complex.fmul.vxf16 to vfmulcph instruction.
+    MVT BitCastTy = MVT::getVectorVT(MVT::f32, VT.getVectorNumElements() / 2);
+    SDValue LHS = DAG.getNode(ISD::BITCAST, DL, BitCastTy, Op.getOperand(0));
+    SDValue RHS = DAG.getNode(ISD::BITCAST, DL, BitCastTy, Op.getOperand(1));
+    return DAG.getNode(ISD::BITCAST, DL, VT,
+                       DAG.getNode(X86ISD::VFMULC, DL, BitCastTy, LHS, RHS));
+  }
+  assert((ElementTy == MVT::f32 || ElementTy == MVT::f64) &&
+         "Unexpected element type");
+  // llvm.experimental.complex.fmul.vxf{32,64} are transformed to SHUFFLE and
+  // FMA instructions.
+  SDValue LHS = Op.getOperand(0);
+  SDValue RHS = Op.getOperand(1);
+  unsigned Imm = ElementTy == MVT::SimpleValueType::f32 ? 0xb1 : 0x55;
+  SDValue V1, V2, V3, V4;
+  // Swap vector elements in pairs. E.g.: [1,2,3,4] ---> [2,1,4,3]
+  V1 = DAG.getNode(X86ISD::VPERMILPI, DL, VT, LHS,
+                   DAG.getTargetConstant(Imm, DL, MVT::i8));
+  // Duplicate the odd-index elements, which are the imaginary parts.
+  V2 = DAG.getNode(X86ISD::MOVSHDUP, DL, VT, RHS);
+  V3 = DAG.getNode(ISD::FMUL, DL, VT, V1, V2);
+  // Duplicate the even-index elements, which are the real parts.
+  V4 = DAG.getNode(X86ISD::MOVSLDUP, DL, VT, RHS);
+  return DAG.getNode(X86ISD::FMADDSUB, DL, VT, LHS, V4, V3);
+}
+
 /// Provide custom lowering hooks for some operations.
 SDValue X86TargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
   switch (Op.getOpcode()) {
@@ -31904,6 +32022,7 @@ SDValue X86TargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
   case ISD::ADDRSPACECAST:      return LowerADDRSPACECAST(Op, DAG);
   case X86ISD::CVTPS2PH:        return LowerCVTPS2PH(Op, DAG);
   case ISD::PREFETCH:           return LowerPREFETCH(Op, Subtarget, DAG);
+  case ISD::COMPLEX_MUL:        return LowerComplexMUL(Op, DAG, Subtarget);
   }
 }
 
@@ -33005,6 +33124,22 @@ void X86TargetLowering::ReplaceNodeResults(SDNode *N,
     // to move the scalar in two i32 pieces.
     Results.push_back(LowerBITREVERSE(SDValue(N, 0), Subtarget, DAG));
     return;
+  case ISD::COMPLEX_MUL:
+    // Widen vector types smaller than 128 bits to 128 bits.
+    MVT VT = N->getSimpleValueType(0);
+    // FIXME: (COMPLEX_MUL v2f16, v2f16) should be lowered to VFMULCSH but we
+    // mix the v2f16 and v4f16 here.
+    assert((VT == MVT::v2f32 || VT == MVT::v2f16 ||
+           VT == MVT::v4f16) && "Unexpected Value type of COMPLEX_MUL!");
+    MVT WideVT =
+        VT.getVectorElementType() == MVT::f16 ? MVT::v8f16 : MVT::v4f32;
+    SmallVector<SDValue, 4> Ops(VT == MVT::v2f16 ? 4 : 2, DAG.getUNDEF(VT));
+    Ops[0] = N->getOperand(0);
+    SDValue LHS = DAG.getNode(ISD::CONCAT_VECTORS, dl, WideVT, Ops);
+    Ops[0] = N->getOperand(1);
+    SDValue RHS = DAG.getNode(ISD::CONCAT_VECTORS, dl, WideVT, Ops);
+    Results.push_back(DAG.getNode(N->getOpcode(), dl, WideVT, LHS, RHS));
+    return;
   }
   case ISD::EXTRACT_VECTOR_ELT: {
     // f16 = extract vXf16 %vec, i64 %idx
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index 8046f42736951cd..83545ffcb5a06a6 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1017,6 +1017,8 @@ namespace llvm {
 
     bool isMemoryAccessFast(EVT VT, Align Alignment) const;
 
+    ComplexABI getComplexReturnABI(Type *ScalarFloatTy) const override;
+
     /// Returns true if the target allows unaligned memory accesses of the
     /// specified type. Returns whether it is "fast" in the last argument.
     bool allowsMisalignedMemoryAccesses(EVT VT, unsigned AS, Align Alignment,
@@ -1549,6 +1551,8 @@ namespace llvm {
     bool isInlineAsmTargetBranch(const SmallVectorImpl<StringRef> &AsmStrs,
                                  unsigned OpNo) const override;
 
+    bool CustomLowerComplexMultiply(Type *FloatTy) const override;
+
     /// Lower interleaved load(s) into target specific
     /// instructions/intrinsics.
     bool lowerInterleavedLoad(LoadInst *LI,
diff --git a/llvm/test/CodeGen/AArch64/O0-pipeline.ll b/llvm/test/CodeGen/AArch64/O0-pipeline.ll
index 4f87bb2a3ee811e..29c56ced202b940 100644
--- a/llvm/test/CodeGen/AArch64/O0-pipeline.ll
+++ b/llvm/test/CodeGen/AArch64/O0-pipeline.ll
@@ -26,6 +26,7 @@
 ; CHECK-NEXT:       Expand vector predication intrinsics
 ; CHECK-NEXT:       Scalarize Masked Memory Intrinsics
 ; CHECK-NEXT:       Expand reduction intrinsics
+; CHECK-NEXT:       Expand complex intrinsics
 ; CHECK-NEXT:     AArch64 Globals Tagging
 ; CHECK-NEXT:     FunctionPass Manager
 ; CHECK-NEXT:       AArch64 Stack Tagging
diff --git a/llvm/test/CodeGen/AArch64/O3-pipeline.ll b/llvm/test/CodeGen/AArch64/O3-pipeline.ll
index f5c1c3c291cb585..7583ee5093facae 100644
--- a/llvm/test/CodeGen/AArch64/O3-pipeline.ll
+++ b/llvm/test/CodeGen/AArch64/O3-pipeline.ll
@@ -66,6 +66,7 @@
 ; CHECK-NEXT:       Expand reduction intrinsics
 ; CHECK-NEXT:       Natural Loop Information
 ; CHECK-NEXT:       TLS Variable Hoist
+; CHECK-NEXT:       Expand complex intrinsics
 ; CHECK-NEXT:       Lazy Branch Probability Analysis
 ; CHECK-NEXT:       Lazy Block Frequency Analysis
 ; CHECK-NEXT:       Optimization Remark Emitter
diff --git a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
index b939c8d2e339de4..c30285a7503dcce 100644
--- a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
+++ b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
@@ -46,6 +46,7 @@
 ; GCN-O0-NEXT:      Expand vector predication intrinsics
 ; GCN-O0-NEXT:      Scalarize Masked Memory Intrinsics
 ; GCN-O0-NEXT:      Expand reduction intrinsics
+; GCN-O0-NEXT:      Expand complex intrinsics
 ; GCN-O0-NEXT:    CallGraph Construction
 ; GCN-O0-NEXT:    Call Graph SCC Pass Manager
 ; GCN-O0-NEXT:      AMDGPU Annotate Kernel Features
@@ -225,6 +226,7 @@
 ; GCN-O1-NEXT:      Expand reduction intrinsics
 ; GCN-O1-NEXT:      Natural Loop Information
 ; GCN-O1-NEXT:      TLS Variable Hoist
+; GCN-O1-NEXT:      Expand complex intrinsics
 ; GCN-O1-NEXT:    CallGraph Construction
 ; GCN-O1-NEXT:    Call Graph SCC Pass Manager
 ; GCN-O1-NEXT:      AMDGPU Annotate Kernel Features
@@ -504,6 +506,7 @@
 ; GCN-O1-OPTS-NEXT:      Expand reduction intrinsics
 ; GCN-O1-OPTS-NEXT:      Natural Loop Information
 ; GCN-O1-OPTS-NEXT:      TLS Variable Hoist
+; GCN-O1-OPTS-NEXT:      Expand complex intrinsics
 ; GCN-O1-OPTS-NEXT:      Early CSE
 ; GCN-O1-OPTS-NEXT:    CallGraph Construction
 ; GCN-O1-OPTS-NEXT:    Call Graph SCC Pass Manager
@@ -808,6 +811,7 @@
 ; GCN-O2-NEXT:      Expand reduction intrinsics
 ; GCN-O2-NEXT:      Natural Loop Information
 ; GCN-O2-NEXT:      TLS Variable Hoist
+; GCN-O2-NEXT:      Expand complex intrinsics
 ; GCN-O2-NEXT:      Early CSE
 ; GCN-O2-NEXT:    CallGraph Construction
 ; GCN-O2-NEXT:    Call Graph SCC Pass Manager
@@ -1120,6 +1124,7 @@
 ; GCN-O3-NEXT:      Expand reduction intrinsics
 ; GCN-O3-NEXT:      Natural Loop Information
 ; GCN-O3-NEXT:      TLS Variable Hoist
+; GCN-O3-NEXT:      Expand complex intrinsics
 ; GCN-O3-NEXT:      Basic Alias Analysis (stateless AA impl)
 ; GCN-O3-NEXT:      Function Alias Analysis Results
 ; GCN-O3-NEXT:      Memory Dependence Analysis
diff --git a/llvm/test/CodeGen/ARM/O3-pipeline.ll b/llvm/test/CodeGen/ARM/O3-pipeline.ll
index 5e565970fc3a868..202b201c09a716a 100644
--- a/llvm/test/CodeGen/ARM/O3-pipeline.ll
+++ b/llvm/test/CodeGen/ARM/O3-pipeline.ll
@@ -44,6 +44,7 @@
 ; CHECK-NEXT:      Expand reduction intrinsics
 ; CHECK-NEXT:      Natural Loop Information
 ; CHECK-NEXT:      TLS Variable Hoist
+; CHECK-NEXT:      Expand complex intrinsics
 ; CHECK-NEXT:      Scalar Evolution Analysis
 ; CHECK-NEXT:      Basic Alias Analysis (stateless AA impl)
 ; CHECK-NEXT:      Function Alias Analysis Results
diff --git a/llvm/test/CodeGen/LoongArch/O0-pipeline.ll b/llvm/test/CodeGen/LoongArch/O0-pipeline.ll
index 327e461eb69a98c..ac1a5f28f78c6f0 100644
--- a/llvm/test/CodeGen/LoongArch/O0-pipeline.ll
+++ b/llvm/test/CodeGen/LoongArch/O0-pipeline.ll
@@ -30,6 +30,7 @@
 ; CHECK-NEXT:       Expand vector predication intrinsics
 ; CHECK-NEXT:       Scalarize Masked Memory Intrinsics
 ; CHECK-NEXT:       Expand reduction intrinsics
+; CHECK-NEXT:       Expand complex intrinsics
 ; CHECK-NEXT:       Exception handling preparation
 ; CHECK-NEXT:       Prepare callbr
 ; CHECK-NEXT:       Safe Stack instrumentation pass
diff --git a/llvm/test/CodeGen/LoongArch/opt-pipeline.ll b/llvm/test/CodeGen/LoongArch/opt-pipeline.ll
index 8b1d635b605b32a..8202fcb312832d4 100644
--- a/llvm/test/CodeGen/LoongArch/opt-pipeline.ll
+++ b/llvm/test/CodeGen/LoongArch/opt-pipeline.ll
@@ -67,6 +67,7 @@
 ; CHECK-NEXT:       Expand reduction intrinsics
 ; CHECK-NEXT:       Natural Loop Information
 ; CHECK-NEXT:       TLS Variable Hoist
+; CHECK-NEXT:       Expand complex intrinsics
 ; CHECK-NEXT:       CodeGen Prepare
 ; CHECK-NEXT:       Dominator Tree Construction
 ; CHECK-NEXT:       Exception handling preparation
diff --git a/llvm/test/CodeGen/PowerPC/O0-pipeline.ll b/llvm/test/CodeGen/PowerPC/O0-pipeline.ll
index 56ed3ffe9864281..ea5531a084ae2ad 100644
--- a/llvm/test/CodeGen/PowerPC/O0-pipeline.ll
+++ b/llvm/test/CodeGen/PowerPC/O0-pipeline.ll
@@ -29,6 +29,7 @@
 ; CHECK-NEXT:       Expand vector predication intrinsics
 ; CHECK-NEXT:       Scalarize Masked Memory Intrinsics
 ; CHECK-NEXT:       Expand reduction intrinsics
+; CHECK-NEXT:       Expand complex intrinsics
 ; CHECK-NEXT:       Exception handling preparation
 ; CHECK-NEXT:       Prepare callbr
 ; CHECK-NEXT:       Safe Stack instrumentation pass
diff --git a/llvm/test/CodeGen/PowerPC/O3-pipeline.ll b/llvm/test/CodeGen/PowerPC/O3-pipeline.ll
index 6ce4416211cc4d1..58665ffa3a60d3f 100644
--- a/llvm/test/CodeGen/PowerPC/O3-pipeline.ll
+++ b/llvm/test/CodeGen/PowerPC/O3-pipeline.ll
@@ -68,6 +68,7 @@
 ; CHECK-NEXT:       Expand reduction intrinsics
 ; CHECK-NEXT:       Natural Loop Information
 ; CHECK-NEXT:       TLS Variable Hoist
+; CHECK-NEXT:       Expand complex intrinsics
 ; CHECK-NEXT:       CodeGen Prepare
 ; CHECK-NEXT:       Dominator Tree Construction
 ; CHECK-NEXT:       Exception handling preparation
diff --git a/llvm/test/CodeGen/RISCV/O0-pipeline.ll b/llvm/test/CodeGen/RISCV/O0-pipeline.ll
index 01c7613201854a6..a86ce909a2eec5c 100644
--- a/llvm/test/CodeGen/RISCV/O0-pipeline.ll
+++ b/llvm/test/CodeGen/RISCV/O0-pipeline.ll
@@ -30,6 +30,7 @@
 ; CHECK-NEXT:       Expand vector predication intrinsics
 ; CHECK-NEXT:       Scalarize Masked Memory Intrinsics
 ; CHECK-NEXT:       Expand reduction intrinsics
+; CHECK-NEXT:       Expand complex intrinsics
 ; CHECK-NEXT:       Exception handling preparation
 ; CHECK-NEXT:       Prepare callbr
 ; CHECK-NEXT:       Safe Stack instrumentation pass
diff --git a/llvm/test/CodeGen/RISCV/O3-pipeline.ll b/llvm/test/CodeGen/RISCV/O3-pipeline.ll
index 277951782ce5ccb..776b95510e9ae41 100644
--- a/llvm/test/CodeGen/RISCV/O3-pipeline.ll
+++ b/llvm/test/CodeGen/RISCV/O3-pipeline.ll
@@ -62,6 +62,7 @@
 ; CHECK-NEXT:       Expand reduction intrinsics
 ; CHECK-NEXT:       Natural Loop Information
 ; CHECK-NEXT:       TLS Variable Hoist
+; CHECK-NEXT:       Expand complex intrinsics
 ; CHECK-NEXT:       CodeGen Prepare
 ; CHECK-NEXT:       Dominator Tree Construction
 ; CHECK-NEXT:       Exception handling preparation
diff --git a/llvm/test/CodeGen/X86/O0-pipeline.ll b/llvm/test/CodeGen/X86/O0-pipeline.ll
index 402645ed1e2e5d6..b7881f818a0d824 100644
--- a/llvm/test/CodeGen/X86/O0-pipeline.ll
+++ b/llvm/test/CodeGen/X86/O0-pipeline.ll
@@ -30,6 +30,7 @@
 ; CHECK-NEXT:       Expand vector predication intrinsics
 ; CHECK-NEXT:       Scalarize Masked Memory Intrinsics
 ; CHECK-NEXT:       Expand reduction intrinsics
+; CHECK-NEXT:       Expand complex intrinsics
 ; CHECK-NEXT:       Expand indirectbr instructions
 ; CHECK-NEXT:       Exception handling preparation
 ; CHECK-NEXT:       Prepare callbr
diff --git a/llvm/test/CodeGen/X86/complex-32bit.ll b/llvm/test/CodeGen/X86/complex-32bit.ll
new file mode 100644
index 000000000000000..75b9a91f201cabb
--- /dev/null
+++ b/llvm/test/CodeGen/X86/complex-32bit.ll
@@ -0,0 +1,173 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=i386-linux-gnu -mattr=+sse2 | FileCheck %s
+
+; Check that we handle the ABI of the complex functions correctly for 32-bit.
+
+declare <2 x half> @llvm.experimental.complex.fmul.v2f16(<2 x half>, <2 x half>)
+declare <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float>, <2 x float>)
+declare <2 x double> @llvm.experimental.complex.fmul.v2f64(<2 x double>, <2 x double>)
+declare <2 x x86_fp80> @llvm.experimental.complex.fmul.v2f80(<2 x x86_fp80>, <2 x x86_fp80>)
+declare <2 x fp128> @llvm.experimental.complex.fmul.v2f128(<2 x fp128>, <2 x fp128>)
+
+define <2 x half> @intrinsic_f16(<2 x half> %z, <2 x half> %w) {
+; CHECK-LABEL: intrinsic_f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    subl $28, %esp
+; CHECK-NEXT:    .cfi_def_cfa_offset 32
+; CHECK-NEXT:    movdqa %xmm0, %xmm2
+; CHECK-NEXT:    psrld $16, %xmm2
+; CHECK-NEXT:    movdqa %xmm1, %xmm3
+; CHECK-NEXT:    psrld $16, %xmm3
+; CHECK-NEXT:    pextrw $0, %xmm1, %eax
+; CHECK-NEXT:    movw %ax, {{[0-9]+}}(%esp)
+; CHECK-NEXT:    pextrw $0, %xmm0, %eax
+; CHECK-NEXT:    movw %ax, (%esp)
+; CHECK-NEXT:    pextrw $0, %xmm3, %eax
+; CHECK-NEXT:    movw %ax, {{[0-9]+}}(%esp)
+; CHECK-NEXT:    pextrw $0, %xmm2, %eax
+; CHECK-NEXT:    movw %ax, {{[0-9]+}}(%esp)
+; CHECK-NEXT:    calll __mulhc3 at PLT
+; CHECK-NEXT:    addl $28, %esp
+; CHECK-NEXT:    .cfi_def_cfa_offset 4
+; CHECK-NEXT:    retl
+  %mul = call <2 x half> @llvm.experimental.complex.fmul.v2f16(<2 x half> %z, <2 x half> %w)
+  ret <2 x half> %mul
+}
+
+define <2 x float> @intrinsic_f32(<2 x float> %z, <2 x float> %w) {
+; CHECK-LABEL: intrinsic_f32:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    subl $28, %esp
+; CHECK-NEXT:    .cfi_def_cfa_offset 32
+; CHECK-NEXT:    movaps %xmm0, %xmm2
+; CHECK-NEXT:    shufps {{.*#+}} xmm2 = xmm2[1,1],xmm0[1,1]
+; CHECK-NEXT:    movaps %xmm1, %xmm3
+; CHECK-NEXT:    shufps {{.*#+}} xmm3 = xmm3[1,1],xmm1[1,1]
+; CHECK-NEXT:    movss %xmm1, {{[0-9]+}}(%esp)
+; CHECK-NEXT:    movss %xmm0, (%esp)
+; CHECK-NEXT:    movss %xmm3, {{[0-9]+}}(%esp)
+; CHECK-NEXT:    movss %xmm2, {{[0-9]+}}(%esp)
+; CHECK-NEXT:    calll __mulsc3 at PLT
+; CHECK-NEXT:    movd %edx, %xmm1
+; CHECK-NEXT:    movd %eax, %xmm0
+; CHECK-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
+; CHECK-NEXT:    addl $28, %esp
+; CHECK-NEXT:    .cfi_def_cfa_offset 4
+; CHECK-NEXT:    retl
+  %mul = call <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float> %z, <2 x float> %w)
+  ret <2 x float> %mul
+}
+
+
+define <2 x double> @intrinsic_f64(<2 x double> %z, <2 x double> %w) {
+; CHECK-LABEL: intrinsic_f64:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    subl $60, %esp
+; CHECK-NEXT:    .cfi_def_cfa_offset 64
+; CHECK-NEXT:    movhps %xmm1, {{[0-9]+}}(%esp)
+; CHECK-NEXT:    movlps %xmm1, {{[0-9]+}}(%esp)
+; CHECK-NEXT:    movhps %xmm0, {{[0-9]+}}(%esp)
+; CHECK-NEXT:    movlps %xmm0, {{[0-9]+}}(%esp)
+; CHECK-NEXT:    leal {{[0-9]+}}(%esp), %eax
+; CHECK-NEXT:    movl %eax, (%esp)
+; CHECK-NEXT:    calll __muldc3 at PLT
+; CHECK-NEXT:    subl $4, %esp
+; CHECK-NEXT:    movups {{[0-9]+}}(%esp), %xmm0
+; CHECK-NEXT:    addl $60, %esp
+; CHECK-NEXT:    .cfi_def_cfa_offset 4
+; CHECK-NEXT:    retl
+  %mul = call <2 x double> @llvm.experimental.complex.fmul.v2f64(<2 x double> %z, <2 x double> %w)
+  ret <2 x double> %mul
+}
+
+define <2 x x86_fp80> @intrinsic_f80(<2 x x86_fp80> %z, <2 x x86_fp80> %w) {
+; CHECK-LABEL: intrinsic_f80:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    subl $92, %esp
+; CHECK-NEXT:    .cfi_def_cfa_offset 96
+; CHECK-NEXT:    fldt {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fldt {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fldt {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fldt {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fstpt {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fstpt {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fstpt {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fstpt {{[0-9]+}}(%esp)
+; CHECK-NEXT:    leal {{[0-9]+}}(%esp), %eax
+; CHECK-NEXT:    movl %eax, (%esp)
+; CHECK-NEXT:    calll __mulxc3 at PLT
+; CHECK-NEXT:    subl $4, %esp
+; CHECK-NEXT:    fldt {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fldt {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fxch %st(1)
+; CHECK-NEXT:    addl $92, %esp
+; CHECK-NEXT:    .cfi_def_cfa_offset 4
+; CHECK-NEXT:    retl
+  %mul = call <2 x x86_fp80> @llvm.experimental.complex.fmul.v2f80(<2 x x86_fp80> %z, <2 x x86_fp80> %w)
+  ret <2 x x86_fp80> %mul
+}
+
+define <2 x fp128> @intrinsic_f128(<2 x fp128> %z, <2 x fp128> %w) {
+; CHECK-LABEL: intrinsic_f128:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    pushl %esi
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    subl $40, %esp
+; CHECK-NEXT:    .cfi_def_cfa_offset 48
+; CHECK-NEXT:    .cfi_offset %esi, -8
+; CHECK-NEXT:    movl {{[0-9]+}}(%esp), %esi
+; CHECK-NEXT:    subl $12, %esp
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 12
+; CHECK-NEXT:    leal {{[0-9]+}}(%esp), %eax
+; CHECK-NEXT:    pushl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 4
+; CHECK-NEXT:    pushl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 4
+; CHECK-NEXT:    pushl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 4
+; CHECK-NEXT:    pushl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 4
+; CHECK-NEXT:    pushl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 4
+; CHECK-NEXT:    pushl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 4
+; CHECK-NEXT:    pushl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 4
+; CHECK-NEXT:    pushl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 4
+; CHECK-NEXT:    pushl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 4
+; CHECK-NEXT:    pushl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 4
+; CHECK-NEXT:    pushl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 4
+; CHECK-NEXT:    pushl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 4
+; CHECK-NEXT:    pushl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 4
+; CHECK-NEXT:    pushl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 4
+; CHECK-NEXT:    pushl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 4
+; CHECK-NEXT:    pushl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 4
+; CHECK-NEXT:    pushl %eax
+; CHECK-NEXT:    .cfi_adjust_cfa_offset 4
+; CHECK-NEXT:    calll __multc3 at PLT
+; CHECK-NEXT:    .cfi_adjust_cfa_offset -4
+; CHECK-NEXT:    addl $76, %esp
+; CHECK-NEXT:    .cfi_adjust_cfa_offset -76
+; CHECK-NEXT:    movaps (%esp), %xmm0
+; CHECK-NEXT:    movaps {{[0-9]+}}(%esp), %xmm1
+; CHECK-NEXT:    movaps %xmm1, 16(%esi)
+; CHECK-NEXT:    movaps %xmm0, (%esi)
+; CHECK-NEXT:    movl %esi, %eax
+; CHECK-NEXT:    addl $40, %esp
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    popl %esi
+; CHECK-NEXT:    .cfi_def_cfa_offset 4
+; CHECK-NEXT:    retl $4
+  %mul = call <2 x fp128> @llvm.experimental.complex.fmul.v2f128(<2 x fp128> %z, <2 x fp128> %w)
+  ret <2 x fp128> %mul
+}
+
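For reference, the __mulXc3 libcalls exercised above take the real and imaginary parts of both operands as separate scalar arguments and return the product as a complex value; the tests are only concerned with how those arguments and the return value are passed for each floating-point type. A minimal C sketch of what the float variant computes for finite inputs is below. The helper name is made up for illustration, and the real compiler-rt/libgcc routine additionally recovers results for non-finite inputs, which this sketch omits.

#include <complex.h>

/* Finite-inputs-only sketch of the arithmetic behind a __mulsc3-style
   helper: multiply (a + b*i) by (c + d*i).  The name is illustrative. */
float _Complex mulsc3_finite_sketch(float a, float b, float c, float d) {
  return CMPLXF(a * c - b * d, a * d + b * c);
}
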
diff --git a/llvm/test/CodeGen/X86/complex-64bit.ll b/llvm/test/CodeGen/X86/complex-64bit.ll
new file mode 100644
index 000000000000000..dc855817d8aaaa8
--- /dev/null
+++ b/llvm/test/CodeGen/X86/complex-64bit.ll
@@ -0,0 +1,103 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s
+
+; Check that we handle the ABI of the complex functions correctly for 64-bit.
+
+declare <2 x half> @llvm.experimental.complex.fmul.v2f16(<2 x half>, <2 x half>)
+declare <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float>, <2 x float>)
+declare <2 x double> @llvm.experimental.complex.fmul.v2f64(<2 x double>, <2 x double>)
+declare <2 x x86_fp80> @llvm.experimental.complex.fmul.v2f80(<2 x x86_fp80>, <2 x x86_fp80>)
+declare <2 x fp128> @llvm.experimental.complex.fmul.v2f128(<2 x fp128>, <2 x fp128>)
+
+define <2 x half> @intrinsic_f16(<2 x half> %z, <2 x half> %w) {
+; CHECK-LABEL: intrinsic_f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    pushq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    movdqa %xmm1, %xmm2
+; CHECK-NEXT:    movdqa %xmm0, %xmm1
+; CHECK-NEXT:    psrld $16, %xmm1
+; CHECK-NEXT:    movdqa %xmm2, %xmm3
+; CHECK-NEXT:    psrld $16, %xmm3
+; CHECK-NEXT:    callq __mulhc3 at PLT
+; CHECK-NEXT:    popq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    retq
+  %mul = call <2 x half> @llvm.experimental.complex.fmul.v2f16(<2 x half> %z, <2 x half> %w)
+  ret <2 x half> %mul
+}
+
+define <2 x float> @intrinsic_f32(<2 x float> %z, <2 x float> %w) {
+; CHECK-LABEL: intrinsic_f32:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    pushq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    movaps %xmm1, %xmm2
+; CHECK-NEXT:    movaps %xmm0, %xmm1
+; CHECK-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
+; CHECK-NEXT:    movaps %xmm2, %xmm3
+; CHECK-NEXT:    shufps {{.*#+}} xmm3 = xmm3[1,1],xmm2[1,1]
+; CHECK-NEXT:    callq __mulsc3 at PLT
+; CHECK-NEXT:    popq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    retq
+  %mul = call <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float> %z, <2 x float> %w)
+  ret <2 x float> %mul
+}
+
+define <2 x double> @intrinsic_f64(<2 x double> %z, <2 x double> %w) {
+; CHECK-LABEL: intrinsic_f64:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    pushq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    movaps %xmm1, %xmm2
+; CHECK-NEXT:    movaps %xmm0, %xmm1
+; CHECK-NEXT:    unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm0[1]
+; CHECK-NEXT:    movaps %xmm2, %xmm3
+; CHECK-NEXT:    unpckhpd {{.*#+}} xmm3 = xmm3[1],xmm2[1]
+; CHECK-NEXT:    callq __muldc3 at PLT
+; CHECK-NEXT:    movlhps {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; CHECK-NEXT:    popq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    retq
+  %mul = call <2 x double> @llvm.experimental.complex.fmul.v2f64(<2 x double> %z, <2 x double> %w)
+  ret <2 x double> %mul
+}
+
+define <2 x x86_fp80> @intrinsic_f80(<2 x x86_fp80> %z, <2 x x86_fp80> %w) {
+; CHECK-LABEL: intrinsic_f80:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    subq $72, %rsp
+; CHECK-NEXT:    .cfi_def_cfa_offset 80
+; CHECK-NEXT:    fldt {{[0-9]+}}(%rsp)
+; CHECK-NEXT:    fldt {{[0-9]+}}(%rsp)
+; CHECK-NEXT:    fldt {{[0-9]+}}(%rsp)
+; CHECK-NEXT:    fldt {{[0-9]+}}(%rsp)
+; CHECK-NEXT:    fstpt {{[0-9]+}}(%rsp)
+; CHECK-NEXT:    fstpt {{[0-9]+}}(%rsp)
+; CHECK-NEXT:    fstpt {{[0-9]+}}(%rsp)
+; CHECK-NEXT:    fstpt (%rsp)
+; CHECK-NEXT:    callq __mulxc3 at PLT
+; CHECK-NEXT:    addq $72, %rsp
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    retq
+  %mul = call <2 x x86_fp80> @llvm.experimental.complex.fmul.v2f80(<2 x x86_fp80> %z, <2 x x86_fp80> %w)
+  ret <2 x x86_fp80> %mul
+}
+
+define <2 x fp128> @intrinsic_f128(<2 x fp128> %z, <2 x fp128> %w) {
+; CHECK-LABEL: intrinsic_f128:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    subq $40, %rsp
+; CHECK-NEXT:    .cfi_def_cfa_offset 48
+; CHECK-NEXT:    movq %rsp, %rdi
+; CHECK-NEXT:    callq __multc3 at PLT
+; CHECK-NEXT:    movaps (%rsp), %xmm0
+; CHECK-NEXT:    movaps {{[0-9]+}}(%rsp), %xmm1
+; CHECK-NEXT:    addq $40, %rsp
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    retq
+  %mul = call <2 x fp128> @llvm.experimental.complex.fmul.v2f128(<2 x fp128> %z, <2 x fp128> %w)
+  ret <2 x fp128> %mul
+}
+
diff --git a/llvm/test/CodeGen/X86/complex-divide.ll b/llvm/test/CodeGen/X86/complex-divide.ll
new file mode 100644
index 000000000000000..b5030bac82d2505
--- /dev/null
+++ b/llvm/test/CodeGen/X86/complex-divide.ll
@@ -0,0 +1,92 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s
+
+; Check the expansion of the complex divide intrinsic. This only tests the
+; expansion for 32-bit floats, as the generated IR is identical for the other
+; types save for the ABI of calling __divsc3, which is tested (indirectly)
+; for each type individually in complex-{32,64}bit.ll.
+
+declare <2 x float> @llvm.experimental.complex.fdiv.v2f32(<2 x float>, <2 x float>)
+
+; Generate a call to __divsc3
+define <2 x float> @intrinsic_slow_f32(<2 x float> %z, <2 x float> %w) {
+; CHECK-LABEL: intrinsic_slow_f32:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    pushq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    movaps %xmm1, %xmm2
+; CHECK-NEXT:    movaps %xmm0, %xmm1
+; CHECK-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
+; CHECK-NEXT:    movaps %xmm2, %xmm3
+; CHECK-NEXT:    shufps {{.*#+}} xmm3 = xmm3[1,1],xmm2[1,1]
+; CHECK-NEXT:    callq __divsc3 at PLT
+; CHECK-NEXT:    popq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    retq
+  %div = call <2 x float> @llvm.experimental.complex.fdiv.v2f32(<2 x float> %z, <2 x float> %w)
+  ret <2 x float> %div
+}
+
+; Do not do an expansion (the fast flag alone is not sufficient to imply
+; complex-range=limited).
+define <2 x float> @intrinsic_implied_not_limited_f32(<2 x float> %z, <2 x float> %w) #1 {
+; CHECK-LABEL: intrinsic_implied_not_limited_f32:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    pushq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    vmovaps %xmm1, %xmm2
+; CHECK-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; CHECK-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm2[1,1,3,3]
+; CHECK-NEXT:    callq __divsc3 at PLT
+; CHECK-NEXT:    popq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    retq
+  %div = call fast <2 x float> @llvm.experimental.complex.fdiv.v2f32(<2 x float> %z, <2 x float> %w)
+  ret <2 x float> %div
+}
+
+; Do an expansion (because of complex-range=limited)
+define <2 x float> @intrinsic_limited_f32(<2 x float> %z, <2 x float> %w) #1 {
+; CHECK-LABEL: intrinsic_limited_f32:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
+; CHECK-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm1[1,1,3,3]
+; CHECK-COUNT-2: vmulss
+; CHECK-NEXT:    vaddss {{.*}} %xmm4
+; CHECK-COUNT-2: vmulss
+; CHECK-NEXT:    vaddss {{.*}} %xmm5
+; CHECK-NEXT:    vdivss %xmm4, %xmm5, %xmm5
+; CHECK-COUNT-2: vmulss
+; CHECK-NEXT:    vsubss %xmm0, %xmm1, %xmm0
+; CHECK-NEXT:    vdivss %xmm4, %xmm0, %xmm0
+; CHECK-NEXT:    vinsertps {{.*#+}} xmm0 = xmm5[0],xmm0[0],xmm5[2,3]
+; CHECK-NEXT:    retq
+  %div = call <2 x float> @llvm.experimental.complex.fdiv.v2f32(<2 x float> %z, <2 x float> %w) #0
+  ret <2 x float> %div
+}
+
+; Do an expansion and use FMA (because of fast-math flags).
+define <2 x float> @intrinsic_fast_f32(<2 x float> %z, <2 x float> %w) #1 {
+; CHECK-LABEL: intrinsic_fast_f32:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
+; CHECK-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm1[1,1,3,3]
+; CHECK-NEXT:    vmulss %xmm3, %xmm3, %xmm4
+; CHECK-NEXT:    vfmadd231ss {{.*#+}} xmm4 = (xmm1 * xmm1) + xmm4
+; CHECK-NEXT:    vmulss %xmm3, %xmm2, %xmm5
+; CHECK-NEXT:    vfmadd231ss {{.*#+}} xmm5 = (xmm0 * xmm1) + xmm5
+; CHECK-NEXT:    vmovss {{.*#+}} xmm6 = mem[0],zero,zero,zero
+; CHECK-NEXT:    vdivss %xmm4, %xmm6, %xmm4
+; CHECK-NEXT:    vmulss %xmm4, %xmm5, %xmm5
+; CHECK-NEXT:    vmulss %xmm3, %xmm0, %xmm0
+; CHECK-NEXT:    vfmsub231ss {{.*#+}} xmm0 = (xmm2 * xmm1) - xmm0
+; CHECK-NEXT:    vmulss %xmm4, %xmm0, %xmm0
+; CHECK-NEXT:    vinsertps {{.*#+}} xmm0 = xmm5[0],xmm0[0],xmm5[2,3]
+; CHECK-NEXT:    retq
+  %div = call fast <2 x float> @llvm.experimental.complex.fdiv.v2f32(<2 x float> %z, <2 x float> %w) #0
+  ret <2 x float> %div
+}
+
+attributes #0 = { "complex-range"="limited" }
+attributes #1 = { "target-features"="+fma" }
+attributes #2 = { "complex-range"="no-nan" }
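For clarity, the arithmetic that the limited-range expansion of the divide intrinsic performs, and that the checks in intrinsic_limited_f32 and intrinsic_fast_f32 above correspond to, is roughly the following C sketch. The function name is made up for illustration, and no overflow/underflow scaling is attempted in this mode.

#include <complex.h>

/* Limited-range complex division sketch: (a + b*i) / (c + d*i) using the
   straightforward formula with no range scaling.  Name is illustrative. */
float _Complex cdiv_limited_sketch(float a, float b, float c, float d) {
  float denom = c * c + d * d;
  return CMPLXF((a * c + b * d) / denom, (b * c - a * d) / denom);
}

With fast-math flags on top of the limited range, the multiplies and adds above may additionally be contracted into FMAs, which is what the intrinsic_fast_f32 checks show.
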
diff --git a/llvm/test/CodeGen/X86/complex-multiply.ll b/llvm/test/CodeGen/X86/complex-multiply.ll
new file mode 100644
index 000000000000000..5af2d1ed91d65ad
--- /dev/null
+++ b/llvm/test/CodeGen/X86/complex-multiply.ll
@@ -0,0 +1,525 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+fma | FileCheck %s --check-prefixes=ALL,FMA
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f,+avx512vl | FileCheck %s --check-prefixes=ALL,AVX512VL
+
+
+; Check the expansion of the complex multiply intrinsic. This only tests the
+; expansion for 32-bit and 64-bit floats, as the generated IR is identical for
+; the other types save for the ABI of calling __mulsc3, which is tested for
+; each type individually in complex-{32,64}bit.ll.
+
+declare <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float>, <2 x float>)
+declare <4 x float> @llvm.experimental.complex.fmul.v4f32(<4 x float>, <4 x float>)
+declare <8 x float> @llvm.experimental.complex.fmul.v8f32(<8 x float>, <8 x float>)
+declare <16 x float> @llvm.experimental.complex.fmul.v16f32(<16 x float>, <16 x float>)
+declare <2 x double> @llvm.experimental.complex.fmul.v2f64(<2 x double>, <2 x double>)
+declare <4 x double> @llvm.experimental.complex.fmul.v4f64(<4 x double>, <4 x double>)
+declare <8 x double> @llvm.experimental.complex.fmul.v8f64(<8 x double>, <8 x double>)
+declare <6 x float> @llvm.experimental.complex.fmul.v6f32(<6 x float>, <6 x float>)
+declare <6 x double> @llvm.experimental.complex.fmul.v6f64(<6 x double>, <6 x double>)
+declare <32 x float> @llvm.experimental.complex.fmul.v32f32(<32 x float>, <32 x float>)
+
+; Generate a call to __mulsc3
+define <2 x float> @intrinsic_slow_v2f32(<2 x float> %z, <2 x float> %w) {
+; ALL-LABEL: intrinsic_slow_v2f32:
+; ALL:       # %bb.0:
+; ALL-NEXT:    pushq %rax
+; ALL-NEXT:    .cfi_def_cfa_offset 16
+; ALL-NEXT:    vmovaps %xmm1, %xmm2
+; ALL-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; ALL-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm2[1,1,3,3]
+; ALL-NEXT:    callq __mulsc3 at PLT
+; ALL-NEXT:    popq %rax
+; ALL-NEXT:    .cfi_def_cfa_offset 8
+; ALL-NEXT:    retq
+  %mul = call <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float> %z, <2 x float> %w)
+  ret <2 x float> %mul
+}
+
+; Do an expansion (because the nnan/ninf fast-math flags imply the limited range).
+define <2 x float> @intrinsic_implied_limited_v2f32(<2 x float> %z, <2 x float> %w)  {
+; ALL-LABEL: intrinsic_implied_limited_v2f32:
+; ALL:       # %bb.0:
+; ALL-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
+; ALL-NEXT:    vshufps {{.*#+}} xmm3 = xmm0[1,0,3,2]
+; ALL-NEXT:    vmulps %xmm2, %xmm3, %xmm2
+; ALL-NEXT:    vmovsldup {{.*#+}} xmm1 = xmm1[0,0,2,2]
+; ALL-NEXT:    vfmaddsub213ps {{.*#+}} xmm0 = (xmm1 * xmm0) +/- xmm2
+; ALL-NEXT:    retq
+  %mul = call nnan ninf <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float> %z, <2 x float> %w)
+  ret <2 x float> %mul
+}
+
+; Do an expansion (because of complex-range=limited).
+define <2 x float> @intrinsic_limited_v2f32(<2 x float> %z, <2 x float> %w) {
+; ALL-LABEL: intrinsic_limited_v2f32:
+; ALL:       # %bb.0:
+; ALL-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
+; ALL-NEXT:    vshufps {{.*#+}} xmm3 = xmm0[1,0,3,2]
+; ALL-NEXT:    vmulps %xmm2, %xmm3, %xmm2
+; ALL-NEXT:    vmovsldup {{.*#+}} xmm1 = xmm1[0,0,2,2]
+; ALL-NEXT:    vfmaddsub213ps {{.*#+}} xmm0 = (xmm1 * xmm0) +/- xmm2
+; ALL-NEXT:    retq
+  %mul = call <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float> %z, <2 x float> %w) #0
+  ret <2 x float> %mul
+}
+
+; Do an expansion and use FMA (because of fast-math flags).
+define <2 x float> @intrinsic_fast_v2f32(<2 x float> %z, <2 x float> %w) {
+; ALL-LABEL: intrinsic_fast_v2f32:
+; ALL:       # %bb.0:
+; ALL-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
+; ALL-NEXT:    vshufps {{.*#+}} xmm3 = xmm0[1,0,3,2]
+; ALL-NEXT:    vmulps %xmm2, %xmm3, %xmm2
+; ALL-NEXT:    vmovsldup {{.*#+}} xmm1 = xmm1[0,0,2,2]
+; ALL-NEXT:    vfmaddsub213ps {{.*#+}} xmm0 = (xmm1 * xmm0) +/- xmm2
+; ALL-NEXT:    retq
+  %mul = call fast <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float> %z, <2 x float> %w)
+  ret <2 x float> %mul
+}
+
+define <4 x float> @intrinsic_slow_v4f32(<4 x float> %z, <4 x float> %w) {
+; ALL-LABEL: intrinsic_slow_v4f32:
+; ALL:       # %bb.0:
+; ALL-NEXT:    pushq %rax
+; ALL-NEXT:    .cfi_def_cfa_offset 16
+; ALL-NEXT:    vmovaps %xmm1, %xmm2
+; ALL-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; ALL-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm2[1,1,3,3]
+; ALL-NEXT:    callq __mulsc3 at PLT
+; ALL-NEXT:    popq %rax
+; ALL-NEXT:    .cfi_def_cfa_offset 8
+; ALL-NEXT:    retq
+  %mul = call <4 x float> @llvm.experimental.complex.fmul.v4f32(<4 x float> %z, <4 x float> %w)
+  ret <4 x float> %mul
+}
+
+define <4 x float> @intrinsic_fast_v4f32(<4 x float> %z, <4 x float> %w) {
+; ALL-LABEL: intrinsic_fast_v4f32:
+; ALL:       # %bb.0:
+; ALL-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
+; ALL-NEXT:    vshufps {{.*#+}} xmm3 = xmm0[1,0,3,2]
+; ALL-NEXT:    vmulps %xmm2, %xmm3, %xmm2
+; ALL-NEXT:    vmovsldup {{.*#+}} xmm1 = xmm1[0,0,2,2]
+; ALL-NEXT:    vfmaddsub213ps {{.*#+}} xmm0 = (xmm1 * xmm0) +/- xmm2
+; ALL-NEXT:    retq
+  %mul = call fast <4 x float> @llvm.experimental.complex.fmul.v4f32(<4 x float> %z, <4 x float> %w)
+  ret <4 x float> %mul
+}
+
+define <4 x float> @intrinsic_limited_v4f32(<4 x float> %z, <4 x float> %w) {
+; ALL-LABEL: intrinsic_limited_v4f32:
+; ALL:       # %bb.0:
+; ALL-NEXT:    vmovshdup {{.*#+}} xmm2 = xmm1[1,1,3,3]
+; ALL-NEXT:    vshufps {{.*#+}} xmm3 = xmm0[1,0,3,2]
+; ALL-NEXT:    vmulps %xmm2, %xmm3, %xmm2
+; ALL-NEXT:    vmovsldup {{.*#+}} xmm1 = xmm1[0,0,2,2]
+; ALL-NEXT:    vfmaddsub213ps {{.*#+}} xmm0 = (xmm1 * xmm0) +/- xmm2
+; ALL-NEXT:    retq
+  %mul = call <4 x float> @llvm.experimental.complex.fmul.v4f32(<4 x float> %z, <4 x float> %w) #0
+  ret <4 x float> %mul
+}
+
+define <8 x float> @intrinsic_slow_v8f32(<8 x float> %z, <8 x float> %w) {
+; ALL-LABEL: intrinsic_slow_v8f32:
+; ALL:       # %bb.0:
+; ALL-NEXT:    pushq %rax
+; ALL-NEXT:    .cfi_def_cfa_offset 16
+; ALL-NEXT:    vmovaps %ymm1, %ymm2
+; ALL-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; ALL-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm2[1,1,3,3]
+; ALL-NEXT:    # kill: def $xmm0 killed $xmm0 killed $ymm0
+; ALL-NEXT:    # kill: def $xmm2 killed $xmm2 killed $ymm2
+; ALL-NEXT:    callq __mulsc3 at PLT
+; ALL-NEXT:    popq %rax
+; ALL-NEXT:    .cfi_def_cfa_offset 8
+; ALL-NEXT:    retq
+  %mul = call <8 x float> @llvm.experimental.complex.fmul.v8f32(<8 x float> %z, <8 x float> %w)
+  ret <8 x float> %mul
+}
+
+define <8 x float> @intrinsic_fast_v8f32(<8 x float> %z, <8 x float> %w) {
+; ALL-LABEL: intrinsic_fast_v8f32:
+; ALL:       # %bb.0:
+; ALL-NEXT:    vmovshdup {{.*#+}} ymm2 = ymm1[1,1,3,3,5,5,7,7]
+; ALL-NEXT:    vshufps {{.*#+}} ymm3 = ymm0[1,0,3,2,5,4,7,6]
+; ALL-NEXT:    vmulps %ymm2, %ymm3, %ymm2
+; ALL-NEXT:    vmovsldup {{.*#+}} ymm1 = ymm1[0,0,2,2,4,4,6,6]
+; ALL-NEXT:    vfmaddsub213ps {{.*#+}} ymm0 = (ymm1 * ymm0) +/- ymm2
+; ALL-NEXT:    retq
+  %mul = call fast <8 x float> @llvm.experimental.complex.fmul.v8f32(<8 x float> %z, <8 x float> %w)
+  ret <8 x float> %mul
+}
+
+define <8 x float> @intrinsic_limited_v8f32(<8 x float> %z, <8 x float> %w) {
+; ALL-LABEL: intrinsic_limited_v8f32:
+; ALL:       # %bb.0:
+; ALL-NEXT:    vmovshdup {{.*#+}} ymm2 = ymm1[1,1,3,3,5,5,7,7]
+; ALL-NEXT:    vshufps {{.*#+}} ymm3 = ymm0[1,0,3,2,5,4,7,6]
+; ALL-NEXT:    vmulps %ymm2, %ymm3, %ymm2
+; ALL-NEXT:    vmovsldup {{.*#+}} ymm1 = ymm1[0,0,2,2,4,4,6,6]
+; ALL-NEXT:    vfmaddsub213ps {{.*#+}} ymm0 = (ymm1 * ymm0) +/- ymm2
+; ALL-NEXT:    retq
+  %mul = call <8 x float> @llvm.experimental.complex.fmul.v8f32(<8 x float> %z, <8 x float> %w) #0
+  ret <8 x float> %mul
+}
+
+define <16 x float> @intrinsic_slow_v16f32(<16 x float> %z, <16 x float> %w) {
+; FMA-LABEL: intrinsic_slow_v16f32:
+; FMA:       # %bb.0:
+; FMA-NEXT:    pushq %rax
+; FMA-NEXT:    .cfi_def_cfa_offset 16
+; FMA-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; FMA-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm2[1,1,3,3]
+; FMA-NEXT:    # kill: def $xmm0 killed $xmm0 killed $ymm0
+; FMA-NEXT:    # kill: def $xmm2 killed $xmm2 killed $ymm2
+; FMA-NEXT:    callq __mulsc3 at PLT
+; FMA-NEXT:    popq %rax
+; FMA-NEXT:    .cfi_def_cfa_offset 8
+; FMA-NEXT:    retq
+;
+; AVX512VL-LABEL: intrinsic_slow_v16f32:
+; AVX512VL:       # %bb.0:
+; AVX512VL-NEXT:    pushq %rax
+; AVX512VL-NEXT:    .cfi_def_cfa_offset 16
+; AVX512VL-NEXT:    vmovaps %zmm1, %zmm2
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} xmm3 = xmm2[1,1,3,3]
+; AVX512VL-NEXT:    # kill: def $xmm0 killed $xmm0 killed $zmm0
+; AVX512VL-NEXT:    # kill: def $xmm2 killed $xmm2 killed $zmm2
+; AVX512VL-NEXT:    callq __mulsc3 at PLT
+; AVX512VL-NEXT:    popq %rax
+; AVX512VL-NEXT:    .cfi_def_cfa_offset 8
+; AVX512VL-NEXT:    retq
+  %mul = call <16 x float> @llvm.experimental.complex.fmul.v16f32(<16 x float> %z, <16 x float> %w)
+  ret <16 x float> %mul
+}
+
+
+define <16 x float> @intrinsic_fast_v16f32(<16 x float> %z, <16 x float> %w) {
+; FMA-LABEL: intrinsic_fast_v16f32:
+; FMA:       # %bb.0:
+; FMA-NEXT:    vmovshdup {{.*#+}} ymm4 = ymm2[1,1,3,3,5,5,7,7]
+; FMA-NEXT:    vshufps {{.*#+}} ymm5 = ymm0[1,0,3,2,5,4,7,6]
+; FMA-NEXT:    vmulps %ymm4, %ymm5, %ymm4
+; FMA-NEXT:    vmovsldup {{.*#+}} ymm2 = ymm2[0,0,2,2,4,4,6,6]
+; FMA-NEXT:    vfmaddsub213ps {{.*#+}} ymm0 = (ymm2 * ymm0) +/- ymm4
+; FMA-NEXT:    vmovshdup {{.*#+}} ymm2 = ymm3[1,1,3,3,5,5,7,7]
+; FMA-NEXT:    vshufps {{.*#+}} ymm4 = ymm1[1,0,3,2,5,4,7,6]
+; FMA-NEXT:    vmulps %ymm2, %ymm4, %ymm2
+; FMA-NEXT:    vmovsldup {{.*#+}} ymm3 = ymm3[0,0,2,2,4,4,6,6]
+; FMA-NEXT:    vfmaddsub213ps {{.*#+}} ymm1 = (ymm3 * ymm1) +/- ymm2
+; FMA-NEXT:    retq
+;
+; AVX512VL-LABEL: intrinsic_fast_v16f32:
+; AVX512VL:       # %bb.0:
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} zmm2 = zmm1[1,1,3,3,5,5,7,7,9,9,11,11,13,13,15,15]
+; AVX512VL-NEXT:    vshufps {{.*#+}} zmm3 = zmm0[1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14]
+; AVX512VL-NEXT:    vmulps %zmm2, %zmm3, %zmm2
+; AVX512VL-NEXT:    vmovsldup {{.*#+}} zmm1 = zmm1[0,0,2,2,4,4,6,6,8,8,10,10,12,12,14,14]
+; AVX512VL-NEXT:    vfmaddsub213ps {{.*#+}} zmm0 = (zmm1 * zmm0) +/- zmm2
+; AVX512VL-NEXT:    retq
+  %mul = call fast <16 x float> @llvm.experimental.complex.fmul.v16f32(<16 x float> %z, <16 x float> %w)
+  ret <16 x float> %mul
+}
+
+define <16 x float> @intrinsic_limited_v16f32(<16 x float> %z, <16 x float> %w) {
+; FMA-LABEL: intrinsic_limited_v16f32:
+; FMA:       # %bb.0:
+; FMA-NEXT:    vmovshdup {{.*#+}} ymm4 = ymm2[1,1,3,3,5,5,7,7]
+; FMA-NEXT:    vshufps {{.*#+}} ymm5 = ymm0[1,0,3,2,5,4,7,6]
+; FMA-NEXT:    vmulps %ymm4, %ymm5, %ymm4
+; FMA-NEXT:    vmovsldup {{.*#+}} ymm2 = ymm2[0,0,2,2,4,4,6,6]
+; FMA-NEXT:    vfmaddsub213ps {{.*#+}} ymm0 = (ymm2 * ymm0) +/- ymm4
+; FMA-NEXT:    vmovshdup {{.*#+}} ymm2 = ymm3[1,1,3,3,5,5,7,7]
+; FMA-NEXT:    vshufps {{.*#+}} ymm4 = ymm1[1,0,3,2,5,4,7,6]
+; FMA-NEXT:    vmulps %ymm2, %ymm4, %ymm2
+; FMA-NEXT:    vmovsldup {{.*#+}} ymm3 = ymm3[0,0,2,2,4,4,6,6]
+; FMA-NEXT:    vfmaddsub213ps {{.*#+}} ymm1 = (ymm3 * ymm1) +/- ymm2
+; FMA-NEXT:    retq
+;
+; AVX512VL-LABEL: intrinsic_limited_v16f32:
+; AVX512VL:       # %bb.0:
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} zmm2 = zmm1[1,1,3,3,5,5,7,7,9,9,11,11,13,13,15,15]
+; AVX512VL-NEXT:    vshufps {{.*#+}} zmm3 = zmm0[1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14]
+; AVX512VL-NEXT:    vmulps %zmm2, %zmm3, %zmm2
+; AVX512VL-NEXT:    vmovsldup {{.*#+}} zmm1 = zmm1[0,0,2,2,4,4,6,6,8,8,10,10,12,12,14,14]
+; AVX512VL-NEXT:    vfmaddsub213ps {{.*#+}} zmm0 = (zmm1 * zmm0) +/- zmm2
+; AVX512VL-NEXT:    retq
+  %mul = call <16 x float> @llvm.experimental.complex.fmul.v16f32(<16 x float> %z, <16 x float> %w) #0
+  ret <16 x float> %mul
+}
+
+define <2 x double> @intrinsic_slow_v2f64(<2 x double> %z, <2 x double> %w) {
+; ALL-LABEL: intrinsic_slow_v2f64:
+; ALL:       # %bb.0:
+; ALL-NEXT:    pushq %rax
+; ALL-NEXT:    .cfi_def_cfa_offset 16
+; ALL-NEXT:    vmovapd %xmm1, %xmm2
+; ALL-NEXT:    vshufpd {{.*#+}} xmm1 = xmm0[1,0]
+; ALL-NEXT:    vshufpd {{.*#+}} xmm3 = xmm2[1,0]
+; ALL-NEXT:    callq __muldc3 at PLT
+; ALL-NEXT:    vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; ALL-NEXT:    popq %rax
+; ALL-NEXT:    .cfi_def_cfa_offset 8
+; ALL-NEXT:    retq
+  %mul = call <2 x double> @llvm.experimental.complex.fmul.v2f64(<2 x double> %z, <2 x double> %w)
+  ret <2 x double> %mul
+}
+
+define <2 x double> @intrinsic_fast_v2f64(<2 x double> %z, <2 x double> %w) {
+; ALL-LABEL: intrinsic_fast_v2f64:
+; ALL:       # %bb.0:
+; ALL-NEXT:    vshufpd {{.*#+}} xmm2 = xmm1[1,1]
+; ALL-NEXT:    vshufpd {{.*#+}} xmm3 = xmm0[1,0]
+; ALL-NEXT:    vmulpd %xmm2, %xmm3, %xmm2
+; ALL-NEXT:    vmovddup {{.*#+}} xmm1 = xmm1[0,0]
+; ALL-NEXT:    vfmaddsub213pd {{.*#+}} xmm0 = (xmm1 * xmm0) +/- xmm2
+; ALL-NEXT:    retq
+  %mul = call fast <2 x double> @llvm.experimental.complex.fmul.v2f64(<2 x double> %z, <2 x double> %w)
+  ret <2 x double> %mul
+}
+
+define <2 x double> @intrinsic_limited_v2f64(<2 x double> %z, <2 x double> %w) {
+; ALL-LABEL: intrinsic_limited_v2f64:
+; ALL:       # %bb.0:
+; ALL-NEXT:    vshufpd {{.*#+}} xmm2 = xmm1[1,1]
+; ALL-NEXT:    vshufpd {{.*#+}} xmm3 = xmm0[1,0]
+; ALL-NEXT:    vmulpd %xmm2, %xmm3, %xmm2
+; ALL-NEXT:    vmovddup {{.*#+}} xmm1 = xmm1[0,0]
+; ALL-NEXT:    vfmaddsub213pd {{.*#+}} xmm0 = (xmm1 * xmm0) +/- xmm2
+; ALL-NEXT:    retq
+  %mul = call <2 x double> @llvm.experimental.complex.fmul.v2f64(<2 x double> %z, <2 x double> %w) #0
+  ret <2 x double> %mul
+}
+
+define <4 x double> @intrinsic_slow_v4f64(<4 x double> %z, <4 x double> %w) {
+; ALL-LABEL: intrinsic_slow_v4f64:
+; ALL:       # %bb.0:
+; ALL-NEXT:    pushq %rax
+; ALL-NEXT:    .cfi_def_cfa_offset 16
+; ALL-NEXT:    vmovapd %ymm1, %ymm2
+; ALL-NEXT:    vshufpd {{.*#+}} xmm1 = xmm0[1,0]
+; ALL-NEXT:    vshufpd {{.*#+}} xmm3 = xmm2[1,0]
+; ALL-NEXT:    # kill: def $xmm0 killed $xmm0 killed $ymm0
+; ALL-NEXT:    # kill: def $xmm2 killed $xmm2 killed $ymm2
+; ALL-NEXT:    vzeroupper
+; ALL-NEXT:    callq __muldc3 at PLT
+; ALL-NEXT:    vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; ALL-NEXT:    popq %rax
+; ALL-NEXT:    .cfi_def_cfa_offset 8
+; ALL-NEXT:    retq
+  %mul = call <4 x double> @llvm.experimental.complex.fmul.v4f64(<4 x double> %z, <4 x double> %w)
+  ret <4 x double> %mul
+}
+
+define <4 x double> @intrinsic_fast_v4f64(<4 x double> %z, <4 x double> %w) {
+; ALL-LABEL: intrinsic_fast_v4f64:
+; ALL:       # %bb.0:
+; ALL-NEXT:    vshufpd {{.*#+}} ymm2 = ymm1[1,1,3,3]
+; ALL-NEXT:    vshufpd {{.*#+}} ymm3 = ymm0[1,0,3,2]
+; ALL-NEXT:    vmulpd %ymm2, %ymm3, %ymm2
+; ALL-NEXT:    vmovddup {{.*#+}} ymm1 = ymm1[0,0,2,2]
+; ALL-NEXT:    vfmaddsub213pd {{.*#+}} ymm0 = (ymm1 * ymm0) +/- ymm2
+; ALL-NEXT:    retq
+  %mul = call fast <4 x double> @llvm.experimental.complex.fmul.v4f64(<4 x double> %z, <4 x double> %w)
+  ret <4 x double> %mul
+}
+
+define <4 x double> @intrinsic_limited_v4f64(<4 x double> %z, <4 x double> %w) {
+; ALL-LABEL: intrinsic_limited_v4f64:
+; ALL:       # %bb.0:
+; ALL-NEXT:    vshufpd {{.*#+}} ymm2 = ymm1[1,1,3,3]
+; ALL-NEXT:    vshufpd {{.*#+}} ymm3 = ymm0[1,0,3,2]
+; ALL-NEXT:    vmulpd %ymm2, %ymm3, %ymm2
+; ALL-NEXT:    vmovddup {{.*#+}} ymm1 = ymm1[0,0,2,2]
+; ALL-NEXT:    vfmaddsub213pd {{.*#+}} ymm0 = (ymm1 * ymm0) +/- ymm2
+; ALL-NEXT:    retq
+  %mul = call <4 x double> @llvm.experimental.complex.fmul.v4f64(<4 x double> %z, <4 x double> %w) #0
+  ret <4 x double> %mul
+}
+
+define <8 x double> @intrinsic_slow_v8f64(<8 x double> %z, <8 x double> %w) {
+; FMA-LABEL: intrinsic_slow_v8f64:
+; FMA:       # %bb.0:
+; FMA-NEXT:    pushq %rax
+; FMA-NEXT:    .cfi_def_cfa_offset 16
+; FMA-NEXT:    vshufpd {{.*#+}} xmm1 = xmm0[1,0]
+; FMA-NEXT:    vshufpd {{.*#+}} xmm3 = xmm2[1,0]
+; FMA-NEXT:    # kill: def $xmm0 killed $xmm0 killed $ymm0
+; FMA-NEXT:    # kill: def $xmm2 killed $xmm2 killed $ymm2
+; FMA-NEXT:    vzeroupper
+; FMA-NEXT:    callq __muldc3 at PLT
+; FMA-NEXT:    vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; FMA-NEXT:    popq %rax
+; FMA-NEXT:    .cfi_def_cfa_offset 8
+; FMA-NEXT:    retq
+;
+; AVX512VL-LABEL: intrinsic_slow_v8f64:
+; AVX512VL:       # %bb.0:
+; AVX512VL-NEXT:    pushq %rax
+; AVX512VL-NEXT:    .cfi_def_cfa_offset 16
+; AVX512VL-NEXT:    vmovapd %zmm1, %zmm2
+; AVX512VL-NEXT:    vshufpd {{.*#+}} xmm1 = xmm0[1,0]
+; AVX512VL-NEXT:    vshufpd {{.*#+}} xmm3 = xmm2[1,0]
+; AVX512VL-NEXT:    # kill: def $xmm0 killed $xmm0 killed $zmm0
+; AVX512VL-NEXT:    # kill: def $xmm2 killed $xmm2 killed $zmm2
+; AVX512VL-NEXT:    vzeroupper
+; AVX512VL-NEXT:    callq __muldc3 at PLT
+; AVX512VL-NEXT:    vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; AVX512VL-NEXT:    popq %rax
+; AVX512VL-NEXT:    .cfi_def_cfa_offset 8
+; AVX512VL-NEXT:    retq
+  %mul = call <8 x double> @llvm.experimental.complex.fmul.v8f64(<8 x double> %z, <8 x double> %w)
+  ret <8 x double> %mul
+}
+
+define <8 x double> @intrinsic_fast_v8f64(<8 x double> %z, <8 x double> %w) {
+; FMA-LABEL: intrinsic_fast_v8f64:
+; FMA:       # %bb.0:
+; FMA-NEXT:    vshufpd {{.*#+}} ymm4 = ymm2[1,1,3,3]
+; FMA-NEXT:    vshufpd {{.*#+}} ymm5 = ymm0[1,0,3,2]
+; FMA-NEXT:    vmulpd %ymm4, %ymm5, %ymm4
+; FMA-NEXT:    vmovddup {{.*#+}} ymm2 = ymm2[0,0,2,2]
+; FMA-NEXT:    vfmaddsub213pd {{.*#+}} ymm0 = (ymm2 * ymm0) +/- ymm4
+; FMA-NEXT:    vshufpd {{.*#+}} ymm2 = ymm3[1,1,3,3]
+; FMA-NEXT:    vshufpd {{.*#+}} ymm4 = ymm1[1,0,3,2]
+; FMA-NEXT:    vmulpd %ymm2, %ymm4, %ymm2
+; FMA-NEXT:    vmovddup {{.*#+}} ymm3 = ymm3[0,0,2,2]
+; FMA-NEXT:    vfmaddsub213pd {{.*#+}} ymm1 = (ymm3 * ymm1) +/- ymm2
+; FMA-NEXT:    retq
+;
+; AVX512VL-LABEL: intrinsic_fast_v8f64:
+; AVX512VL:       # %bb.0:
+; AVX512VL-NEXT:    vshufpd {{.*#+}} zmm2 = zmm1[1,1,3,3,5,5,7,7]
+; AVX512VL-NEXT:    vshufpd {{.*#+}} zmm3 = zmm0[1,0,3,2,5,4,7,6]
+; AVX512VL-NEXT:    vmulpd %zmm2, %zmm3, %zmm2
+; AVX512VL-NEXT:    vmovddup {{.*#+}} zmm1 = zmm1[0,0,2,2,4,4,6,6]
+; AVX512VL-NEXT:    vfmaddsub213pd {{.*#+}} zmm0 = (zmm1 * zmm0) +/- zmm2
+; AVX512VL-NEXT:    retq
+  %mul = call fast <8 x double> @llvm.experimental.complex.fmul.v8f64(<8 x double> %z, <8 x double> %w)
+  ret <8 x double> %mul
+}
+
+define <8 x double> @intrinsic_limited_v8f64(<8 x double> %z, <8 x double> %w) {
+; FMA-LABEL: intrinsic_limited_v8f64:
+; FMA:       # %bb.0:
+; FMA-NEXT:    vshufpd {{.*#+}} ymm4 = ymm2[1,1,3,3]
+; FMA-NEXT:    vshufpd {{.*#+}} ymm5 = ymm0[1,0,3,2]
+; FMA-NEXT:    vmulpd %ymm4, %ymm5, %ymm4
+; FMA-NEXT:    vmovddup {{.*#+}} ymm2 = ymm2[0,0,2,2]
+; FMA-NEXT:    vfmaddsub213pd {{.*#+}} ymm0 = (ymm2 * ymm0) +/- ymm4
+; FMA-NEXT:    vshufpd {{.*#+}} ymm2 = ymm3[1,1,3,3]
+; FMA-NEXT:    vshufpd {{.*#+}} ymm4 = ymm1[1,0,3,2]
+; FMA-NEXT:    vmulpd %ymm2, %ymm4, %ymm2
+; FMA-NEXT:    vmovddup {{.*#+}} ymm3 = ymm3[0,0,2,2]
+; FMA-NEXT:    vfmaddsub213pd {{.*#+}} ymm1 = (ymm3 * ymm1) +/- ymm2
+; FMA-NEXT:    retq
+;
+; AVX512VL-LABEL: intrinsic_limited_v8f64:
+; AVX512VL:       # %bb.0:
+; AVX512VL-NEXT:    vshufpd {{.*#+}} zmm2 = zmm1[1,1,3,3,5,5,7,7]
+; AVX512VL-NEXT:    vshufpd {{.*#+}} zmm3 = zmm0[1,0,3,2,5,4,7,6]
+; AVX512VL-NEXT:    vmulpd %zmm2, %zmm3, %zmm2
+; AVX512VL-NEXT:    vmovddup {{.*#+}} zmm1 = zmm1[0,0,2,2,4,4,6,6]
+; AVX512VL-NEXT:    vfmaddsub213pd {{.*#+}} zmm0 = (zmm1 * zmm0) +/- zmm2
+; AVX512VL-NEXT:    retq
+  %mul = call <8 x double> @llvm.experimental.complex.fmul.v8f64(<8 x double> %z, <8 x double> %w) #0
+  ret <8 x double> %mul
+}
+
+define <6 x float> @intrinsic_fast_v6f32(<6 x float> %z, <6 x float> %w) {
+; ALL-LABEL: intrinsic_fast_v6f32:
+; ALL:       # %bb.0:
+; ALL-NEXT:    vmovshdup {{.*#+}} ymm2 = ymm1[1,1,3,3,5,5,7,7]
+; ALL-NEXT:    vshufps {{.*#+}} ymm3 = ymm0[1,0,3,2,5,4,7,6]
+; ALL-NEXT:    vmulps %ymm2, %ymm3, %ymm2
+; ALL-NEXT:    vmovsldup {{.*#+}} ymm1 = ymm1[0,0,2,2,4,4,6,6]
+; ALL-NEXT:    vfmaddsub213ps {{.*#+}} ymm0 = (ymm1 * ymm0) +/- ymm2
+; ALL-NEXT:    retq
+  %mul = call fast <6 x float> @llvm.experimental.complex.fmul.v6f32(<6 x float> %z, <6 x float> %w)
+  ret <6 x float> %mul
+}
+
+define <6 x double> @intrinsic_fast_v6f64(<6 x double> %z, <6 x double> %w) {
+; FMA-LABEL: intrinsic_fast_v6f64:
+; FMA:       # %bb.0:
+; FMA-NEXT:    movq %rdi, %rax
+; FMA-NEXT:    vmovlhps {{.*#+}} xmm2 = xmm2[0],xmm3[0]
+; FMA-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; FMA-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
+; FMA-NEXT:    vmovlhps {{.*#+}} xmm1 = xmm6[0],xmm7[0]
+; FMA-NEXT:    vinsertf128 $1, {{[0-9]+}}(%rsp), %ymm1, %ymm1
+; FMA-NEXT:    vshufpd {{.*#+}} ymm2 = ymm0[1,0,3,2]
+; FMA-NEXT:    vshufpd {{.*#+}} ymm3 = ymm1[1,1,3,3]
+; FMA-NEXT:    vmulpd %ymm3, %ymm2, %ymm2
+; FMA-NEXT:    vmovddup {{.*#+}} ymm1 = ymm1[0,0,2,2]
+; FMA-NEXT:    vfmaddsub213pd {{.*#+}} ymm1 = (ymm0 * ymm1) +/- ymm2
+; FMA-NEXT:    vunpcklpd {{.*#+}} xmm0 = xmm4[0],xmm5[0]
+; FMA-NEXT:    vunpcklpd {{.*#+}} xmm2 = xmm5[0],xmm4[0]
+; FMA-NEXT:    vmovapd {{[0-9]+}}(%rsp), %xmm3
+; FMA-NEXT:    vshufpd {{.*#+}} xmm4 = xmm3[1,1]
+; FMA-NEXT:    vmulpd %xmm4, %xmm2, %xmm2
+; FMA-NEXT:    vmovddup {{.*#+}} xmm3 = xmm3[0,0]
+; FMA-NEXT:    vfmaddsub213pd {{.*#+}} xmm3 = (xmm0 * xmm3) +/- xmm2
+; FMA-NEXT:    vmovapd %xmm3, 32(%rdi)
+; FMA-NEXT:    vmovapd %ymm1, (%rdi)
+; FMA-NEXT:    vzeroupper
+; FMA-NEXT:    retq
+;
+; AVX512VL-LABEL: intrinsic_fast_v6f64:
+; AVX512VL:       # %bb.0:
+; AVX512VL-NEXT:    vshufpd {{.*#+}} zmm2 = zmm1[1,1,3,3,5,5,7,7]
+; AVX512VL-NEXT:    vshufpd {{.*#+}} zmm3 = zmm0[1,0,3,2,5,4,7,6]
+; AVX512VL-NEXT:    vmulpd %zmm2, %zmm3, %zmm2
+; AVX512VL-NEXT:    vmovddup {{.*#+}} zmm1 = zmm1[0,0,2,2,4,4,6,6]
+; AVX512VL-NEXT:    vfmaddsub213pd {{.*#+}} zmm0 = (zmm1 * zmm0) +/- zmm2
+; AVX512VL-NEXT:    retq
+  %mul = call fast <6 x double> @llvm.experimental.complex.fmul.v6f64(<6 x double> %z, <6 x double> %w)
+  ret <6 x double> %mul
+}
+
+; Test a vector wider than 512 bits.
+define <32 x float> @intrinsic_fast_v32f32(<32 x float> %z, <32 x float> %w) {
+; FMA-LABEL: intrinsic_fast_v32f32:
+; FMA:       # %bb.0:
+; FMA-NEXT:    vmovshdup {{.*#+}} ymm8 = ymm4[1,1,3,3,5,5,7,7]
+; FMA-NEXT:    vshufps {{.*#+}} ymm9 = ymm0[1,0,3,2,5,4,7,6]
+; FMA-NEXT:    vmulps %ymm8, %ymm9, %ymm8
+; FMA-NEXT:    vmovsldup {{.*#+}} ymm4 = ymm4[0,0,2,2,4,4,6,6]
+; FMA-NEXT:    vfmaddsub213ps {{.*#+}} ymm0 = (ymm4 * ymm0) +/- ymm8
+; FMA-NEXT:    vmovshdup {{.*#+}} ymm4 = ymm5[1,1,3,3,5,5,7,7]
+; FMA-NEXT:    vshufps {{.*#+}} ymm8 = ymm1[1,0,3,2,5,4,7,6]
+; FMA-NEXT:    vmulps %ymm4, %ymm8, %ymm4
+; FMA-NEXT:    vmovsldup {{.*#+}} ymm5 = ymm5[0,0,2,2,4,4,6,6]
+; FMA-NEXT:    vfmaddsub213ps {{.*#+}} ymm1 = (ymm5 * ymm1) +/- ymm4
+; FMA-NEXT:    vmovshdup {{.*#+}} ymm4 = ymm6[1,1,3,3,5,5,7,7]
+; FMA-NEXT:    vshufps {{.*#+}} ymm5 = ymm2[1,0,3,2,5,4,7,6]
+; FMA-NEXT:    vmulps %ymm4, %ymm5, %ymm4
+; FMA-NEXT:    vmovsldup {{.*#+}} ymm5 = ymm6[0,0,2,2,4,4,6,6]
+; FMA-NEXT:    vfmaddsub213ps {{.*#+}} ymm2 = (ymm5 * ymm2) +/- ymm4
+; FMA-NEXT:    vmovshdup {{.*#+}} ymm4 = ymm7[1,1,3,3,5,5,7,7]
+; FMA-NEXT:    vshufps {{.*#+}} ymm5 = ymm3[1,0,3,2,5,4,7,6]
+; FMA-NEXT:    vmulps %ymm4, %ymm5, %ymm4
+; FMA-NEXT:    vmovsldup {{.*#+}} ymm5 = ymm7[0,0,2,2,4,4,6,6]
+; FMA-NEXT:    vfmaddsub213ps {{.*#+}} ymm3 = (ymm5 * ymm3) +/- ymm4
+; FMA-NEXT:    retq
+;
+; AVX512VL-LABEL: intrinsic_fast_v32f32:
+; AVX512VL:       # %bb.0:
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} zmm4 = zmm2[1,1,3,3,5,5,7,7,9,9,11,11,13,13,15,15]
+; AVX512VL-NEXT:    vshufps {{.*#+}} zmm5 = zmm0[1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14]
+; AVX512VL-NEXT:    vmulps %zmm4, %zmm5, %zmm4
+; AVX512VL-NEXT:    vmovsldup {{.*#+}} zmm2 = zmm2[0,0,2,2,4,4,6,6,8,8,10,10,12,12,14,14]
+; AVX512VL-NEXT:    vfmaddsub213ps {{.*#+}} zmm0 = (zmm2 * zmm0) +/- zmm4
+; AVX512VL-NEXT:    vmovshdup {{.*#+}} zmm2 = zmm3[1,1,3,3,5,5,7,7,9,9,11,11,13,13,15,15]
+; AVX512VL-NEXT:    vshufps {{.*#+}} zmm4 = zmm1[1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14]
+; AVX512VL-NEXT:    vmulps %zmm2, %zmm4, %zmm2
+; AVX512VL-NEXT:    vmovsldup {{.*#+}} zmm3 = zmm3[0,0,2,2,4,4,6,6,8,8,10,10,12,12,14,14]
+; AVX512VL-NEXT:    vfmaddsub213ps {{.*#+}} zmm1 = (zmm3 * zmm1) +/- zmm2
+; AVX512VL-NEXT:    retq
+  %mul = call fast <32 x float> @llvm.experimental.complex.fmul.v32f32(<32 x float> %z, <32 x float> %w)
+  ret <32 x float> %mul
+}
+
+attributes #0 = { "complex-range"="limited" }
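For reference, the shuffle plus vfmaddsub sequences checked above compute, per complex element, re = a*c - b*d and im = a*d + b*c over the interleaved real/imaginary layout visible in the test shuffles. A scalar C sketch of that computation is below; the function name and the use of plain arrays are illustrative only.

/* Limited-range complex multiply sketch over an interleaved layout:
   element 2*k holds the real part and element 2*k+1 the imaginary part
   of the k-th complex value.  Name and signature are illustrative. */
void cmul_interleaved_sketch(const float *z, const float *w,
                             float *out, int num_complex) {
  for (int k = 0; k < num_complex; ++k) {
    float a = z[2 * k], b = z[2 * k + 1];
    float c = w[2 * k], d = w[2 * k + 1];
    out[2 * k]     = a * c - b * d; /* real part */
    out[2 * k + 1] = a * d + b * c; /* imaginary part */
  }
}

The vectorized form in the checks performs the same computation on two, four, or eight complex values at a time, depending on the register width.
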
diff --git a/llvm/test/CodeGen/X86/complex-win32.ll b/llvm/test/CodeGen/X86/complex-win32.ll
new file mode 100644
index 000000000000000..c8552d79ec84417
--- /dev/null
+++ b/llvm/test/CodeGen/X86/complex-win32.ll
@@ -0,0 +1,59 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=i386-windows-msvc | FileCheck %s
+
+; Check that we handle the ABI of the complex functions correctly for 32-bit
+; Windows. Compiler-rt only includes __mulsc3/__muldc3, so we only test those.
+
+declare <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float>, <2 x float>)
+declare <2 x double> @llvm.experimental.complex.fmul.v2f64(<2 x double>, <2 x double>)
+
+define <2 x float> @intrinsic_f32(<2 x float> %z, <2 x float> %w) {
+; CHECK-LABEL: intrinsic_f32:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    subl $24, %esp
+; CHECK-NEXT:    flds {{[0-9]+}}(%esp)
+; CHECK-NEXT:    flds {{[0-9]+}}(%esp)
+; CHECK-NEXT:    flds {{[0-9]+}}(%esp)
+; CHECK-NEXT:    flds {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fstps {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fstps {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fstps {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fstps (%esp)
+; CHECK-NEXT:    calll ___mulsc3
+; CHECK-NEXT:    movl %edx, {{[0-9]+}}(%esp)
+; CHECK-NEXT:    movl %eax, {{[0-9]+}}(%esp)
+; CHECK-NEXT:    flds {{[0-9]+}}(%esp)
+; CHECK-NEXT:    flds {{[0-9]+}}(%esp)
+; CHECK-NEXT:    addl $24, %esp
+; CHECK-NEXT:    retl
+  %mul = call <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float> %z, <2 x float> %w)
+  ret <2 x float> %mul
+}
+
+define <2 x double> @intrinsic_f64(<2 x double> %z, <2 x double> %w) {
+; CHECK-LABEL: intrinsic_f64:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    pushl %ebp
+; CHECK-NEXT:    movl %esp, %ebp
+; CHECK-NEXT:    andl $-8, %esp
+; CHECK-NEXT:    subl $56, %esp
+; CHECK-NEXT:    fldl 8(%ebp)
+; CHECK-NEXT:    fldl 16(%ebp)
+; CHECK-NEXT:    fldl 24(%ebp)
+; CHECK-NEXT:    fldl 32(%ebp)
+; CHECK-NEXT:    fstpl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fstpl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fstpl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fstpl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    leal {{[0-9]+}}(%esp), %eax
+; CHECK-NEXT:    movl %eax, (%esp)
+; CHECK-NEXT:    calll ___muldc3
+; CHECK-NEXT:    fldl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fldl {{[0-9]+}}(%esp)
+; CHECK-NEXT:    fxch %st(1)
+; CHECK-NEXT:    movl %ebp, %esp
+; CHECK-NEXT:    popl %ebp
+; CHECK-NEXT:    retl
+  %mul = call <2 x double> @llvm.experimental.complex.fmul.v2f64(<2 x double> %z, <2 x double> %w)
+  ret <2 x double> %mul
+}
diff --git a/llvm/test/CodeGen/X86/complex-win64.ll b/llvm/test/CodeGen/X86/complex-win64.ll
new file mode 100644
index 000000000000000..dca1de220113d3e
--- /dev/null
+++ b/llvm/test/CodeGen/X86/complex-win64.ll
@@ -0,0 +1,44 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-windows-msvc | FileCheck %s
+
+; Check that we handle the ABI of the complex functions correctly for 64-bit
+; Windows. Compiler-rt only includes __mulsc3/__muldc3, so we only test those.
+
+declare <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float>, <2 x float>)
+declare <2 x double> @llvm.experimental.complex.fmul.v2f64(<2 x double>, <2 x double>)
+
+define <2 x float> @intrinsic_f32(<2 x float> %z, <2 x float> %w) nounwind {
+; CHECK-LABEL: intrinsic_f32:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    subq $40, %rsp
+; CHECK-NEXT:    movaps (%rdx), %xmm2
+; CHECK-NEXT:    movaps (%rcx), %xmm0
+; CHECK-NEXT:    movaps %xmm0, %xmm1
+; CHECK-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,1],xmm0[1,1]
+; CHECK-NEXT:    movaps %xmm2, %xmm3
+; CHECK-NEXT:    shufps {{.*#+}} xmm3 = xmm3[1,1],xmm2[1,1]
+; CHECK-NEXT:    callq __mulsc3
+; CHECK-NEXT:    movq %rax, %xmm0
+; CHECK-NEXT:    addq $40, %rsp
+; CHECK-NEXT:    retq
+  %mul = call <2 x float> @llvm.experimental.complex.fmul.v2f32(<2 x float> %z, <2 x float> %w)
+  ret <2 x float> %mul
+}
+
+define <2 x double> @intrinsic_f64(<2 x double> %z, <2 x double> %w) nounwind {
+; CHECK-LABEL: intrinsic_f64:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    subq $56, %rsp
+; CHECK-NEXT:    movaps (%rdx), %xmm3
+; CHECK-NEXT:    movaps (%rcx), %xmm1
+; CHECK-NEXT:    movaps %xmm1, %xmm2
+; CHECK-NEXT:    unpckhpd {{.*#+}} xmm2 = xmm2[1],xmm1[1]
+; CHECK-NEXT:    movhps %xmm3, {{[0-9]+}}(%rsp)
+; CHECK-NEXT:    leaq {{[0-9]+}}(%rsp), %rcx
+; CHECK-NEXT:    callq __muldc3
+; CHECK-NEXT:    movups {{[0-9]+}}(%rsp), %xmm0
+; CHECK-NEXT:    addq $56, %rsp
+; CHECK-NEXT:    retq
+  %mul = call <2 x double> @llvm.experimental.complex.fmul.v2f64(<2 x double> %z, <2 x double> %w)
+  ret <2 x double> %mul
+}
diff --git a/llvm/test/CodeGen/X86/fp16-complex-multiply.ll b/llvm/test/CodeGen/X86/fp16-complex-multiply.ll
new file mode 100644
index 000000000000000..0a9edb8431786fe
--- /dev/null
+++ b/llvm/test/CodeGen/X86/fp16-complex-multiply.ll
@@ -0,0 +1,231 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512fp16 | FileCheck %s
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512fp16,+avx512vl | FileCheck %s
+
+declare <2 x half> @llvm.experimental.complex.fmul.v2f16(<2 x half>, <2 x half>)
+declare <4 x half> @llvm.experimental.complex.fmul.v4f16(<4 x half>, <4 x half>)
+declare <8 x half> @llvm.experimental.complex.fmul.v8f16(<8 x half>, <8 x half>)
+declare <16 x half> @llvm.experimental.complex.fmul.v16f16(<16 x half>, <16 x half>)
+declare <32 x half> @llvm.experimental.complex.fmul.v32f16(<32 x half>, <32 x half>)
+declare <20 x half> @llvm.experimental.complex.fmul.v20f16(<20 x half>, <20 x half>)
+declare <64 x half> @llvm.experimental.complex.fmul.v64f16(<64 x half>, <64 x half>)
+
+; FIXME: llvm.experimental.complex.fmul.v2f16 should be lowered to vfmulcsh
+define <2 x half> @intrinsic_fast_v2f16(<2 x half> %z, <2 x half> %w) {
+; CHECK-LABEL: intrinsic_fast_v2f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vfmulcph %xmm1, %xmm0, %xmm2
+; CHECK-NEXT:    vmovaps %xmm2, %xmm0
+; CHECK-NEXT:    retq
+  %mul = call fast <2 x half> @llvm.experimental.complex.fmul.v2f16(<2 x half> %z, <2 x half> %w)
+  ret <2 x half> %mul
+}
+
+define <4 x half> @intrinsic_fast_v4f16(<4 x half> %z, <4 x half> %w) {
+; CHECK-LABEL: intrinsic_fast_v4f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vfmulcph %xmm1, %xmm0, %xmm2
+; CHECK-NEXT:    vmovaps %xmm2, %xmm0
+; CHECK-NEXT:    retq
+  %mul = call fast <4 x half> @llvm.experimental.complex.fmul.v4f16(<4 x half> %z, <4 x half> %w)
+  ret <4 x half> %mul
+}
+
+define <8 x half> @intrinsic_fast_v8f16(<8 x half> %z, <8 x half> %w) {
+; CHECK-LABEL: intrinsic_fast_v8f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vfmulcph %xmm1, %xmm0, %xmm2
+; CHECK-NEXT:    vmovaps %xmm2, %xmm0
+; CHECK-NEXT:    retq
+  %mul = call fast <8 x half> @llvm.experimental.complex.fmul.v8f16(<8 x half> %z, <8 x half> %w)
+  ret <8 x half> %mul
+}
+
+define <16 x half> @intrinsic_fast_v16f16(<16 x half> %z, <16 x half> %w) {
+; CHECK-LABEL: intrinsic_fast_v16f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vfmulcph %ymm1, %ymm0, %ymm2
+; CHECK-NEXT:    vmovaps %ymm2, %ymm0
+; CHECK-NEXT:    retq
+  %mul = call fast <16 x half> @llvm.experimental.complex.fmul.v16f16(<16 x half> %z, <16 x half> %w)
+  ret <16 x half> %mul
+}
+
+define <32 x half> @intrinsic_fast_v32f16(<32 x half> %z, <32 x half> %w) {
+; CHECK-LABEL: intrinsic_fast_v32f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vfmulcph %zmm1, %zmm0, %zmm2
+; CHECK-NEXT:    vmovaps %zmm2, %zmm0
+; CHECK-NEXT:    retq
+  %mul = call fast <32 x half> @llvm.experimental.complex.fmul.v32f16(<32 x half> %z, <32 x half> %w)
+  ret <32 x half> %mul
+}
+
+define <20 x half> @intrinsic_fast_v20f16(<20 x half> %z, <20 x half> %w) {
+; CHECK-LABEL: intrinsic_fast_v20f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vfmulcph %zmm1, %zmm0, %zmm2
+; CHECK-NEXT:    vmovaps %zmm2, %zmm0
+; CHECK-NEXT:    retq
+  %mul = call fast <20 x half> @llvm.experimental.complex.fmul.v20f16(<20 x half> %z, <20 x half> %w)
+  ret <20 x half> %mul
+}
+
+define <2 x half> @intrinsic_limited_v2f16(<2 x half> %z, <2 x half> %w) {
+; CHECK-LABEL: intrinsic_limited_v2f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vfmulcph %xmm1, %xmm0, %xmm2
+; CHECK-NEXT:    vmovaps %xmm2, %xmm0
+; CHECK-NEXT:    retq
+  %mul = call <2 x half> @llvm.experimental.complex.fmul.v2f16(<2 x half> %z, <2 x half> %w) #0
+  ret <2 x half> %mul
+}
+
+define <4 x half> @intrinsic_limited_v4f16(<4 x half> %z, <4 x half> %w) {
+; CHECK-LABEL: intrinsic_limited_v4f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vfmulcph %xmm1, %xmm0, %xmm2
+; CHECK-NEXT:    vmovaps %xmm2, %xmm0
+; CHECK-NEXT:    retq
+  %mul = call <4 x half> @llvm.experimental.complex.fmul.v4f16(<4 x half> %z, <4 x half> %w) #0
+  ret <4 x half> %mul
+}
+
+define <8 x half> @intrinsic_limited_v8f16(<8 x half> %z, <8 x half> %w) {
+; CHECK-LABEL: intrinsic_limited_v8f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vfmulcph %xmm1, %xmm0, %xmm2
+; CHECK-NEXT:    vmovaps %xmm2, %xmm0
+; CHECK-NEXT:    retq
+  %mul = call <8 x half> @llvm.experimental.complex.fmul.v8f16(<8 x half> %z, <8 x half> %w) #0
+  ret <8 x half> %mul
+}
+
+define <16 x half> @intrinsic_limited_v16f16(<16 x half> %z, <16 x half> %w) {
+; CHECK-LABEL: intrinsic_limited_v16f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vfmulcph %ymm1, %ymm0, %ymm2
+; CHECK-NEXT:    vmovaps %ymm2, %ymm0
+; CHECK-NEXT:    retq
+  %mul = call <16 x half> @llvm.experimental.complex.fmul.v16f16(<16 x half> %z, <16 x half> %w) #0
+  ret <16 x half> %mul
+}
+
+define <32 x half> @intrinsic_limited_v32f16(<32 x half> %z, <32 x half> %w) {
+; CHECK-LABEL: intrinsic_limited_v32f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vfmulcph %zmm1, %zmm0, %zmm2
+; CHECK-NEXT:    vmovaps %zmm2, %zmm0
+; CHECK-NEXT:    retq
+  %mul = call <32 x half> @llvm.experimental.complex.fmul.v32f16(<32 x half> %z, <32 x half> %w) #0
+  ret <32 x half> %mul
+}
+
+define <20 x half> @intrinsic_limited_v20f16(<20 x half> %z, <20 x half> %w) {
+; CHECK-LABEL: intrinsic_limited_v20f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vfmulcph %zmm1, %zmm0, %zmm2
+; CHECK-NEXT:    vmovaps %zmm2, %zmm0
+; CHECK-NEXT:    retq
+  %mul = call <20 x half> @llvm.experimental.complex.fmul.v20f16(<20 x half> %z, <20 x half> %w) #0
+  ret <20 x half> %mul
+}
+
+; Test a vector wider than 512 bits.
+define <64 x half> @intrinsic_limited_v64f16(<64 x half> %z, <64 x half> %w) {
+; CHECK-LABEL: intrinsic_limited_v64f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vfmulcph %zmm2, %zmm0, %zmm4
+; CHECK-NEXT:    vfmulcph %zmm3, %zmm1, %zmm2
+; CHECK-NEXT:    vmovaps %zmm4, %zmm0
+; CHECK-NEXT:    vmovaps %zmm2, %zmm1
+; CHECK-NEXT:    retq
+  %mul = call <64 x half> @llvm.experimental.complex.fmul.v64f16(<64 x half> %z, <64 x half> %w) #0
+  ret <64 x half> %mul
+}
+
+define <2 x half> @intrinsic_slow_v2f16(<2 x half> %z, <2 x half> %w) {
+; CHECK-LABEL: intrinsic_slow_v2f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    pushq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    vmovdqa %xmm1, %xmm2
+; CHECK-NEXT:    vpsrld $16, %xmm0, %xmm1
+; CHECK-NEXT:    vpsrld $16, %xmm2, %xmm3
+; CHECK-NEXT:    callq __mulhc3 at PLT
+; CHECK-NEXT:    popq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    retq
+  %mul = call <2 x half> @llvm.experimental.complex.fmul.v2f16(<2 x half> %z, <2 x half> %w)
+  ret <2 x half> %mul
+}
+
+define <4 x half> @intrinsic_slow_v4f16(<4 x half> %z, <4 x half> %w) {
+; CHECK-LABEL: intrinsic_slow_v4f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    pushq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    vmovdqa %xmm1, %xmm2
+; CHECK-NEXT:    vpsrld $16, %xmm0, %xmm1
+; CHECK-NEXT:    vpsrld $16, %xmm2, %xmm3
+; CHECK-NEXT:    callq __mulhc3 at PLT
+; CHECK-NEXT:    popq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    retq
+  %mul = call <4 x half> @llvm.experimental.complex.fmul.v4f16(<4 x half> %z, <4 x half> %w)
+  ret <4 x half> %mul
+}
+
+define <8 x half> @intrinsic_slow_v8f16(<8 x half> %z, <8 x half> %w) {
+; CHECK-LABEL: intrinsic_slow_v8f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    pushq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    vmovdqa %xmm1, %xmm2
+; CHECK-NEXT:    vpsrld $16, %xmm0, %xmm1
+; CHECK-NEXT:    vpsrld $16, %xmm2, %xmm3
+; CHECK-NEXT:    callq __mulhc3 at PLT
+; CHECK-NEXT:    popq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    retq
+  %mul = call <8 x half> @llvm.experimental.complex.fmul.v8f16(<8 x half> %z, <8 x half> %w)
+  ret <8 x half> %mul
+}
+
+define <16 x half> @intrinsic_slow_v16f16(<16 x half> %z, <16 x half> %w) {
+; CHECK-LABEL: intrinsic_slow_v16f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    pushq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    vmovdqa %ymm1, %ymm2
+; CHECK-NEXT:    vpsrld $16, %xmm0, %xmm1
+; CHECK-NEXT:    vpsrld $16, %xmm2, %xmm3
+; CHECK-NEXT:    # kill: def $xmm0 killed $xmm0 killed $ymm0
+; CHECK-NEXT:    # kill: def $xmm2 killed $xmm2 killed $ymm2
+; CHECK-NEXT:    callq __mulhc3 at PLT
+; CHECK-NEXT:    popq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    retq
+  %mul = call <16 x half> @llvm.experimental.complex.fmul.v16f16(<16 x half> %z, <16 x half> %w)
+  ret <16 x half> %mul
+}
+
+define <32 x half> @intrinsic_slow_v32f16(<32 x half> %z, <32 x half> %w) {
+; CHECK-LABEL: intrinsic_slow_v32f16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    pushq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    vmovdqa64 %zmm1, %zmm2
+; CHECK-NEXT:    vpsrld $16, %xmm0, %xmm1
+; CHECK-NEXT:    vpsrld $16, %xmm2, %xmm3
+; CHECK-NEXT:    # kill: def $xmm0 killed $xmm0 killed $zmm0
+; CHECK-NEXT:    # kill: def $xmm2 killed $xmm2 killed $zmm2
+; CHECK-NEXT:    callq __mulhc3 at PLT
+; CHECK-NEXT:    popq %rax
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    retq
+  %mul = call <32 x half> @llvm.experimental.complex.fmul.v32f16(<32 x half> %z, <32 x half> %w)
+  ret <32 x half> %mul
+}
+
+attributes #0 = { "complex-range"="limited" }
diff --git a/llvm/test/CodeGen/X86/opt-pipeline.ll b/llvm/test/CodeGen/X86/opt-pipeline.ll
index fb8d2335b341066..1284fa70ddf88f2 100644
--- a/llvm/test/CodeGen/X86/opt-pipeline.ll
+++ b/llvm/test/CodeGen/X86/opt-pipeline.ll
@@ -65,6 +65,7 @@
 ; CHECK-NEXT:       Expand reduction intrinsics
 ; CHECK-NEXT:       Natural Loop Information
 ; CHECK-NEXT:       TLS Variable Hoist
+; CHECK-NEXT:       Expand complex intrinsics
 ; CHECK-NEXT:       Interleaved Access Pass
 ; CHECK-NEXT:       X86 Partial Reduction
 ; CHECK-NEXT:       Expand indirectbr instructions


