[llvm] 109e4e3 - [Matrix] Add forward shape propagation and first shape aware lowerings.
Florian Hahn via llvm-commits
llvm-commits at lists.llvm.org
Mon Dec 23 04:52:28 PST 2019
Author: Florian Hahn
Date: 2019-12-23T13:51:56+01:00
New Revision: 109e4e3851e29e0d3357cd1a6a38155928c07d8a
URL: https://github.com/llvm/llvm-project/commit/109e4e3851e29e0d3357cd1a6a38155928c07d8a
DIFF: https://github.com/llvm/llvm-project/commit/109e4e3851e29e0d3357cd1a6a38155928c07d8a.diff
LOG: [Matrix] Add forward shape propagation and first shape aware lowerings.
This patch adds infrastructure for forward shape propagation to
LowerMatrixIntrinsics. It also updates the pass to make use of
the shape information to break up larger vector operations and to
eliminate unnecessary conversion operations between columnwise matrices
and flattened vectors: if shape information is available for an
instruction, lower the operation to a set of instructions operating on
columns. For example, a store of a matrix is broken down into separate
stores for each column. For users that do not have shape
information (e.g. because they do not yet support shape-information-aware
lowering), we pack the result columns into a flat vector and
update those users.
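For a result with both kinds of users we therefore get a mix of both
lowerings. As a rough, hypothetical sketch (the @use callee is made up
purely for illustration; the new propagate-mixed-users.ll test exercises
this situation):
%t = call <4 x double> @llvm.matrix.transpose(<4 x double> %a, i32 2, i32 2)
; the store has shape information and is lowered to per-column stores
store <4 x double> %t, <4 x double>* %P
; @use has no shape information, so it is updated to use the columns of %t
; packed back into a flat <4 x double> vector
call void @use(<4 x double> %t)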
It also adds shape-aware lowering for the first non-intrinsic
instruction: vector stores.
Example:
For
%c = call <4 x double> @llvm.matrix.transpose(<4 x double> %a, i32 2, i32 2)
store <4 x double> %c, <4 x double>* %Ptr
We generate the code below without shape propagation. Note %9, which
combines the columns of the transposed matrix into a flat vector.
%split = shufflevector <4 x double> %a, <4 x double> undef, <2 x i32> <i32 0, i32 1>
%split1 = shufflevector <4 x double> %a, <4 x double> undef, <2 x i32> <i32 2, i32 3>
%1 = extractelement <2 x double> %split, i64 0
%2 = insertelement <2 x double> undef, double %1, i64 0
%3 = extractelement <2 x double> %split1, i64 0
%4 = insertelement <2 x double> %2, double %3, i64 1
%5 = extractelement <2 x double> %split, i64 1
%6 = insertelement <2 x double> undef, double %5, i64 0
%7 = extractelement <2 x double> %split1, i64 1
%8 = insertelement <2 x double> %6, double %7, i64 1
%9 = shufflevector <2 x double> %4, <2 x double> %8, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
store <4 x double> %9, <4 x double>* %Ptr
With this patch, we propagate the 2x2 shape information from the
transpose to the store and we generate the code below. Note that we
store the columns directly and do not need an extra shuffle.
%9 = bitcast <4 x double>* %Ptr to double*
%10 = bitcast double* %9 to <2 x double>*
store <2 x double> %4, <2 x double>* %10, align 8
%11 = getelementptr double, double* %9, i32 2
%12 = bitcast double* %11 to <2 x double>*
store <2 x double> %8, <2 x double>* %12, align 8
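Shape propagation is controlled by the new matrix-propagate-shape option,
which is enabled by default. As a usage sketch (input.ll below is just a
placeholder file name), the propagation can be disabled for comparison
with something like:
opt -lower-matrix-intrinsics -matrix-propagate-shape=false -S < input.ll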
Reviewers: anemet, Gerolf, reames, hfinkel, andrew.w.kaylor
Reviewed By: anemet
Differential Revision: https://reviews.llvm.org/D70897
Added:
llvm/test/Transforms/LowerMatrixIntrinsics/propagate-forward.ll
llvm/test/Transforms/LowerMatrixIntrinsics/propagate-mixed-users.ll
Modified:
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
llvm/test/Transforms/LowerMatrixIntrinsics/bigger-expressions-double.ll
Removed:
################################################################################
diff --git a/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp b/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
index 2bfc9643044f..b3188001e118 100644
--- a/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
+++ b/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
@@ -29,15 +29,20 @@
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"
+#include "llvm/IR/PatternMatch.h"
#include "llvm/InitializePasses.h"
#include "llvm/Pass.h"
#include "llvm/Support/Debug.h"
#include "llvm/Transforms/Scalar.h"
using namespace llvm;
+using namespace PatternMatch;
#define DEBUG_TYPE "lower-matrix-intrinsics"
+static cl::opt<bool> EnableShapePropagation("matrix-propagate-shape",
+ cl::init(true));
+
namespace {
// Given an element pointer \p BasePtr to the start of a (sub) matrix, compute
@@ -104,12 +109,25 @@ Value *computeColumnAddr(Value *BasePtr, Value *Col, Value *Stride,
/// LowerMatrixIntrinsics contains the methods used to lower matrix intrinsics.
///
/// Currently, the lowering for each matrix intrinsic is done as follows:
-/// 1. Split the operand vectors containing an embedded matrix into a set of
-/// column vectors, based on the shape information from the intrinsic.
-/// 2. Apply the transformation described by the intrinsic on the column
-/// vectors, which yields a set of column vectors containing result matrix.
-/// 3. Embed the columns of the result matrix in a flat vector and replace all
-/// uses of the intrinsic result with it.
+/// 1. Propagate the shape information from intrinsics to connected
+/// instructions.
+/// 2. Lower instructions with shape information.
+/// 2.1. Get column vectors for each argument. If we already lowered the
+/// definition of an argument, use the produced column vectors directly.
+/// If not, split the operand vector containing an embedded matrix into
+/// a set of column vectors.
+/// 2.2. Lower the instruction in terms of columnwise operations, which yields
+/// a set of column vectors containing the result matrix. Note that we lower
+/// all instructions that have shape information. Besides the intrinsics,
+/// this includes stores for example.
+/// 2.3. Update uses of the lowered instruction. If we have shape information
+/// for a user, there is nothing to do, as we will look up the result
+/// column matrix when lowering the user. For other uses, we embed the
+/// result matrix in a flat vector and update the use.
+/// 2.4. Cache the result column matrix for the instruction we lowered.
+/// 3. After we lowered all instructions in a function, remove the now
+/// obsolete instructions.
+///
class LowerMatrixIntrinsics {
Function &Func;
const DataLayout &DL;
@@ -130,6 +148,10 @@ class LowerMatrixIntrinsics {
void setColumn(unsigned i, Value *V) { Columns[i] = V; }
size_t getNumColumns() const { return Columns.size(); }
+ size_t getNumRows() const {
+ assert(Columns.size() > 0 && "Cannot call getNumRows without columns");
+ return cast<VectorType>(Columns[0]->getType())->getNumElements();
+ }
const SmallVectorImpl<Value *> &getColumnVectors() const { return Columns; }
@@ -156,28 +178,73 @@ class LowerMatrixIntrinsics {
ShapeInfo(unsigned NumRows = 0, unsigned NumColumns = 0)
: NumRows(NumRows), NumColumns(NumColumns) {}
- ShapeInfo(ConstantInt *NumRows, ConstantInt *NumColumns)
- : NumRows(NumRows->getZExtValue()),
- NumColumns(NumColumns->getZExtValue()) {}
+ ShapeInfo(Value *NumRows, Value *NumColumns)
+ : NumRows(cast<ConstantInt>(NumRows)->getZExtValue()),
+ NumColumns(cast<ConstantInt>(NumColumns)->getZExtValue()) {}
+
+ bool operator==(const ShapeInfo &other) {
+ return NumRows == other.NumRows && NumColumns == other.NumColumns;
+ }
+ bool operator!=(const ShapeInfo &other) { return !(*this == other); }
+
+ /// Returns true if shape-information is defined, meaning both dimensions
+ /// are != 0.
+ operator bool() const {
+ assert(NumRows == 0 || NumColumns != 0);
+ return NumRows != 0;
+ }
};
+ /// Maps instructions to their shape information. The shape information
+ /// describes the shape to be used while lowering. This matches the shape of
+ /// the result value of the instruction, with the only exceptions being store
+ /// instructions and the matrix_columnwise_store intrinsics. For those, the
+ /// shape information indicates that those instructions should be lowered
+ /// using shape information as well.
+ DenseMap<Value *, ShapeInfo> ShapeMap;
+
+ /// List of instructions to remove. While lowering, we are not replacing all
+ /// users of a lowered instruction if shape information is available, so the
+ /// lowered instructions need to be removed after we have finished lowering.
+ SmallVector<Instruction *, 16> ToRemove;
+
+ /// Map from instructions to their produced column matrix.
+ DenseMap<Value *, ColumnMatrixTy> Inst2ColumnMatrix;
+
public:
LowerMatrixIntrinsics(Function &F, TargetTransformInfo &TTI)
: Func(F), DL(F.getParent()->getDataLayout()), TTI(TTI) {}
/// Return the set of column vectors that a matrix value is lowered to.
///
- /// We split the flat vector \p MatrixVal containing a matrix with shape \p SI
- /// into column vectors.
+ /// If we lowered \p MatrixVal, just return the cached result column matrix.
+ /// Otherwise split the flat vector \p MatrixVal containing a matrix with
+ /// shape \p SI into column vectors.
ColumnMatrixTy getMatrix(Value *MatrixVal, const ShapeInfo &SI,
IRBuilder<> Builder) {
VectorType *VType = dyn_cast<VectorType>(MatrixVal->getType());
assert(VType && "MatrixVal must be a vector type");
assert(VType->getNumElements() == SI.NumRows * SI.NumColumns &&
"The vector size must match the number of matrix elements");
+
+ // Check if we lowered MatrixVal using shape information. In that case,
+ // return the existing column matrix, if it matches the requested shape
+ // information. If there is a mismatch, embed the result in a flat
+ // vector and split it later.
+ auto Found = Inst2ColumnMatrix.find(MatrixVal);
+ if (Found != Inst2ColumnMatrix.end()) {
+ ColumnMatrixTy &M = Found->second;
+ // Return the found matrix, if its shape matches the requested shape
+ // information
+ if (SI.NumRows == M.getNumRows() && SI.NumColumns == M.getNumColumns())
+ return M;
+
+ MatrixVal = M.embedInVector(Builder);
+ }
+
+ // Otherwise split MatrixVal.
SmallVector<Value *, 16> SplitVecs;
Value *Undef = UndefValue::get(VType);
-
for (unsigned MaskStart = 0; MaskStart < VType->getNumElements();
MaskStart += SI.NumRows) {
Constant *Mask = createSequentialMask(Builder, MaskStart, SI.NumRows, 0);
@@ -188,41 +255,144 @@ class LowerMatrixIntrinsics {
return {SplitVecs};
}
- // Replace intrinsic calls
- bool VisitCallInst(CallInst *Inst) {
- if (!Inst->getCalledFunction() || !Inst->getCalledFunction()->isIntrinsic())
+ /// If \p V already has a known shape, return false. Otherwise set the shape
+ /// for instructions that support it.
+ bool setShapeInfo(Value *V, ShapeInfo Shape) {
+ assert(Shape && "Shape not set");
+ if (isa<UndefValue>(V) || !supportsShapeInfo(V))
return false;
- switch (Inst->getCalledFunction()->getIntrinsicID()) {
- case Intrinsic::matrix_multiply:
- LowerMultiply(Inst);
- break;
- case Intrinsic::matrix_transpose:
- LowerTranspose(Inst);
- break;
- case Intrinsic::matrix_columnwise_load:
- LowerColumnwiseLoad(Inst);
- break;
- case Intrinsic::matrix_columnwise_store:
- LowerColumnwiseStore(Inst);
- break;
- default:
+ auto SIter = ShapeMap.find(V);
+ if (SIter != ShapeMap.end()) {
+ LLVM_DEBUG(dbgs() << " not overriding existing shape: "
+ << SIter->second.NumRows << " "
+ << SIter->second.NumColumns << " for " << *V << "\n");
return false;
}
- Inst->eraseFromParent();
+
+ ShapeMap.insert({V, Shape});
+ LLVM_DEBUG(dbgs() << " " << Shape.NumRows << " x " << Shape.NumColumns
+ << " for " << *V << "\n");
return true;
}
+ /// Returns true if shape information can be used for \p V. The supported
+ /// instructions must match the instructions that can be lowered by this pass.
+ bool supportsShapeInfo(Value *V) {
+ Instruction *Inst = dyn_cast<Instruction>(V);
+ if (!Inst)
+ return false;
+
+ IntrinsicInst *II = dyn_cast<IntrinsicInst>(Inst);
+ if (II)
+ switch (II->getIntrinsicID()) {
+ case Intrinsic::matrix_multiply:
+ case Intrinsic::matrix_transpose:
+ case Intrinsic::matrix_columnwise_load:
+ case Intrinsic::matrix_columnwise_store:
+ return true;
+ default:
+ return false;
+ }
+ return isa<StoreInst>(Inst);
+ }
+
+ /// Propagate the shape information of instructions to their users.
+ void propagateShapeForward() {
+ // The work list contains instructions for which we can compute the shape,
+ // either based on the information provided by matrix intrinsics or known
+ // shapes of operands.
+ SmallVector<Instruction *, 8> WorkList;
+
+ // Initialize the work list with ops carrying shape information. Initially
+ // only the shape of matrix intrinsics is known.
+ for (BasicBlock &BB : Func)
+ for (Instruction &Inst : BB) {
+ IntrinsicInst *II = dyn_cast<IntrinsicInst>(&Inst);
+ if (!II)
+ continue;
+
+ switch (II->getIntrinsicID()) {
+ case Intrinsic::matrix_multiply:
+ case Intrinsic::matrix_transpose:
+ case Intrinsic::matrix_columnwise_load:
+ case Intrinsic::matrix_columnwise_store:
+ WorkList.push_back(&Inst);
+ break;
+ default:
+ break;
+ }
+ }
+
+ // Pop an element for which we are guaranteed to have at least one of the
+ // operand shapes. Add the shape for this instruction and then add its
+ // users to the work list.
+ LLVM_DEBUG(dbgs() << "Forward-propagate shapes:\n");
+ while (!WorkList.empty()) {
+ Instruction *Inst = WorkList.back();
+ WorkList.pop_back();
+
+ // New entry, set the value and insert operands
+ bool Propagate = false;
+
+ Value *MatrixA;
+ Value *MatrixB;
+ Value *M;
+ Value *N;
+ Value *K;
+ if (match(Inst, m_Intrinsic<Intrinsic::matrix_multiply>(
+ m_Value(MatrixA), m_Value(MatrixB), m_Value(M),
+ m_Value(N), m_Value(K)))) {
+ Propagate = setShapeInfo(Inst, {M, K});
+ } else if (match(Inst, m_Intrinsic<Intrinsic::matrix_transpose>(
+ m_Value(MatrixA), m_Value(M), m_Value(N)))) {
+ // Flip dimensions.
+ Propagate = setShapeInfo(Inst, {N, M});
+ } else if (match(Inst, m_Intrinsic<Intrinsic::matrix_columnwise_store>(
+ m_Value(MatrixA), m_Value(), m_Value(),
+ m_Value(M), m_Value(N)))) {
+ Propagate = setShapeInfo(Inst, {N, M});
+ } else if (match(Inst,
+ m_Intrinsic<Intrinsic::matrix_columnwise_load>(
+ m_Value(), m_Value(), m_Value(M), m_Value(N)))) {
+ Propagate = setShapeInfo(Inst, {M, N});
+ } else if (match(Inst, m_Store(m_Value(MatrixA), m_Value()))) {
+ auto OpShape = ShapeMap.find(MatrixA);
+ if (OpShape != ShapeMap.end())
+ setShapeInfo(Inst, OpShape->second);
+ continue;
+ }
+
+ if (Propagate)
+ for (auto *User : Inst->users())
+ if (ShapeMap.count(User) == 0)
+ WorkList.push_back(cast<Instruction>(User));
+ }
+ }
+
bool Visit() {
+ if (EnableShapePropagation)
+ propagateShapeForward();
+
ReversePostOrderTraversal<Function *> RPOT(&Func);
bool Changed = false;
for (auto *BB : RPOT) {
for (Instruction &Inst : make_early_inc_range(*BB)) {
+ IRBuilder<> Builder(&Inst);
+
if (CallInst *CInst = dyn_cast<CallInst>(&Inst))
Changed |= VisitCallInst(CInst);
+
+ Value *Op1;
+ Value *Op2;
+ if (match(&Inst, m_Store(m_Value(Op1), m_Value(Op2))))
+ Changed |= VisitStore(&Inst, Op1, Op2, Builder);
}
}
+ for (Instruction *Inst : reverse(ToRemove))
+ Inst->eraseFromParent();
+
return Changed;
}
@@ -238,6 +408,7 @@ class LowerMatrixIntrinsics {
return Builder.CreateAlignedStore(ColumnValue, ColumnPtr, Align);
}
+
/// Turns \p BasePtr into an elementwise pointer to \p EltType.
Value *createElementPtr(Value *BasePtr, Type *EltType, IRBuilder<> &Builder) {
unsigned AS = cast<PointerType>(BasePtr->getType())->getAddressSpace();
@@ -245,6 +416,30 @@ class LowerMatrixIntrinsics {
return Builder.CreatePointerCast(BasePtr, EltPtrType);
}
+ /// Replace intrinsic calls
+ bool VisitCallInst(CallInst *Inst) {
+ if (!Inst->getCalledFunction() || !Inst->getCalledFunction()->isIntrinsic())
+ return false;
+
+ switch (Inst->getCalledFunction()->getIntrinsicID()) {
+ case Intrinsic::matrix_multiply:
+ LowerMultiply(Inst);
+ break;
+ case Intrinsic::matrix_transpose:
+ LowerTranspose(Inst);
+ break;
+ case Intrinsic::matrix_columnwise_load:
+ LowerColumnwiseLoad(Inst);
+ break;
+ case Intrinsic::matrix_columnwise_store:
+ LowerColumnwiseStore(Inst);
+ break;
+ default:
+ return false;
+ }
+ return true;
+ }
+
/// Lowers llvm.matrix.columnwise.load.
///
/// The intrinsic loads a matrix from memory using a stride between columns.
@@ -253,9 +448,8 @@ class LowerMatrixIntrinsics {
Value *Ptr = Inst->getArgOperand(0);
Value *Stride = Inst->getArgOperand(1);
auto VType = cast<VectorType>(Inst->getType());
- ShapeInfo Shape(cast<ConstantInt>(Inst->getArgOperand(2)),
- cast<ConstantInt>(Inst->getArgOperand(3)));
Value *EltPtr = createElementPtr(Ptr, VType->getElementType(), Builder);
+ ShapeInfo Shape(Inst->getArgOperand(2), Inst->getArgOperand(3));
ColumnMatrixTy Result;
// Distance between start of one column and the start of the next
@@ -267,22 +461,14 @@ class LowerMatrixIntrinsics {
Result.addColumn(Column);
}
- Inst->replaceAllUsesWith(Result.embedInVector(Builder));
+ finalizeLowering(Inst, Result, Builder);
}
- /// Lowers llvm.matrix.columnwise.store.
- ///
- /// The intrinsic store a matrix back memory using a stride between columns.
- void LowerColumnwiseStore(CallInst *Inst) {
+ void LowerStore(Instruction *Inst, Value *Matrix, Value *Ptr, Value *Stride,
+ ShapeInfo Shape) {
IRBuilder<> Builder(Inst);
- Value *Matrix = Inst->getArgOperand(0);
- Value *Ptr = Inst->getArgOperand(1);
- Value *Stride = Inst->getArgOperand(2);
- ShapeInfo Shape(cast<ConstantInt>(Inst->getArgOperand(3)),
- cast<ConstantInt>(Inst->getArgOperand(4)));
auto VType = cast<VectorType>(Matrix->getType());
Value *EltPtr = createElementPtr(Ptr, VType->getElementType(), Builder);
-
auto LM = getMatrix(Matrix, Shape, Builder);
for (auto C : enumerate(LM.columns())) {
Value *GEP =
@@ -290,6 +476,19 @@ class LowerMatrixIntrinsics {
Shape.NumRows, VType->getElementType(), Builder);
createColumnStore(C.value(), GEP, VType->getElementType(), Builder);
}
+
+ ToRemove.push_back(Inst);
+ }
+
+ /// Lowers llvm.matrix.columnwise.store.
+ ///
+ /// The intrinsic stores a matrix back to memory using a stride between columns.
+ void LowerColumnwiseStore(CallInst *Inst) {
+ Value *Matrix = Inst->getArgOperand(0);
+ Value *Ptr = Inst->getArgOperand(1);
+ Value *Stride = Inst->getArgOperand(2);
+ LowerStore(Inst, Matrix, Ptr, Stride,
+ {Inst->getArgOperand(3), Inst->getArgOperand(4)});
}
/// Extract a column vector of \p NumElts starting at index (\p I, \p J) from
@@ -345,14 +544,33 @@ class LowerMatrixIntrinsics {
return UseFPOp ? Builder.CreateFAdd(Sum, Mul) : Builder.CreateAdd(Sum, Mul);
}
+ /// Cache \p Matrix as result of \p Inst and update the uses of \p Inst. For
+ /// users with shape information, there's nothing to do: they will use the
+ /// cached value when they are lowered. For other users, \p Matrix is
+ /// flattened and the uses are updated to use it. Also marks \p Inst for
+ /// deletion.
+ void finalizeLowering(Instruction *Inst, ColumnMatrixTy Matrix,
+ IRBuilder<> &Builder) {
+ Inst2ColumnMatrix.insert(std::make_pair(Inst, Matrix));
+
+ ToRemove.push_back(Inst);
+ Value *Flattened = nullptr;
+ for (auto I = Inst->use_begin(), E = Inst->use_end(); I != E;) {
+ Use &U = *I++;
+ if (ShapeMap.find(U.getUser()) == ShapeMap.end()) {
+ if (!Flattened)
+ Flattened = Matrix.embedInVector(Builder);
+ U.set(Flattened);
+ }
+ }
+ }
+
/// Lowers llvm.matrix.multiply.
void LowerMultiply(CallInst *MatMul) {
IRBuilder<> Builder(MatMul);
auto *EltType = cast<VectorType>(MatMul->getType())->getElementType();
- ShapeInfo LShape(cast<ConstantInt>(MatMul->getArgOperand(2)),
- cast<ConstantInt>(MatMul->getArgOperand(3)));
- ShapeInfo RShape(cast<ConstantInt>(MatMul->getArgOperand(3)),
- cast<ConstantInt>(MatMul->getArgOperand(4)));
+ ShapeInfo LShape(MatMul->getArgOperand(2), MatMul->getArgOperand(3));
+ ShapeInfo RShape(MatMul->getArgOperand(3), MatMul->getArgOperand(4));
const ColumnMatrixTy &Lhs =
getMatrix(MatMul->getArgOperand(0), LShape, Builder);
@@ -394,8 +612,7 @@ class LowerMatrixIntrinsics {
Result.setColumn(J, insertVector(Result.getColumn(J), I, Sum, Builder));
}
}
-
- MatMul->replaceAllUsesWith(Result.embedInVector(Builder));
+ finalizeLowering(MatMul, Result, Builder);
}
/// Lowers llvm.matrix.transpose.
@@ -404,8 +621,7 @@ class LowerMatrixIntrinsics {
IRBuilder<> Builder(Inst);
Value *InputVal = Inst->getArgOperand(0);
VectorType *VectorTy = cast<VectorType>(InputVal->getType());
- ShapeInfo ArgShape(cast<ConstantInt>(Inst->getArgOperand(1)),
- cast<ConstantInt>(Inst->getArgOperand(2)));
+ ShapeInfo ArgShape(Inst->getArgOperand(1), Inst->getArgOperand(2));
ColumnMatrixTy InputMatrix = getMatrix(InputVal, ArgShape, Builder);
for (unsigned Row = 0; Row < ArgShape.NumRows; ++Row) {
@@ -425,7 +641,17 @@ class LowerMatrixIntrinsics {
Result.addColumn(ResultColumn);
}
- Inst->replaceAllUsesWith(Result.embedInVector(Builder));
+ finalizeLowering(Inst, Result, Builder);
+ }
+
+ bool VisitStore(Instruction *Inst, Value *StoredVal, Value *Ptr,
+ IRBuilder<> &Builder) {
+ auto I = ShapeMap.find(StoredVal);
+ if (I == ShapeMap.end())
+ return false;
+
+ LowerStore(Inst, StoredVal, Ptr, Builder.getInt32(I->second.NumRows), I->second);
+ return true;
}
};
} // namespace
diff --git a/llvm/test/Transforms/LowerMatrixIntrinsics/bigger-expressions-double.ll b/llvm/test/Transforms/LowerMatrixIntrinsics/bigger-expressions-double.ll
index 8fb73ddb5f63..2edc3700dd3d 100644
--- a/llvm/test/Transforms/LowerMatrixIntrinsics/bigger-expressions-double.ll
+++ b/llvm/test/Transforms/LowerMatrixIntrinsics/bigger-expressions-double.ll
@@ -2,15 +2,23 @@
; RUN: opt -lower-matrix-intrinsics -S < %s | FileCheck %s
; RUN: opt -passes='lower-matrix-intrinsics' -S < %s | FileCheck %s
-
define void @transpose_multiply(<9 x double>* %A.Ptr, <9 x double>* %B.Ptr, <9 x double>* %C.Ptr) {
; CHECK-LABEL: @transpose_multiply(
; CHECK-NEXT: entry:
+
+; Load input matrices %A and %B.
+
; CHECK-NEXT: [[A:%.*]] = load <9 x double>, <9 x double>* [[A_PTR:%.*]]
; CHECK-NEXT: [[B:%.*]] = load <9 x double>, <9 x double>* [[B_PTR:%.*]]
+
+; Extract columns from loaded value %A.
+
; CHECK-NEXT: [[SPLIT:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
; CHECK-NEXT: [[SPLIT1:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>
; CHECK-NEXT: [[SPLIT2:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>
+
+; Transpose %A.
+
; CHECK-NEXT: [[TMP0:%.*]] = extractelement <3 x double> [[SPLIT]], i64 0
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <3 x double> undef, double [[TMP0]], i64 0
; CHECK-NEXT: [[TMP2:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 0
@@ -29,192 +37,201 @@ define void @transpose_multiply(<9 x double>* %A.Ptr, <9 x double>* %B.Ptr, <9 x
; CHECK-NEXT: [[TMP15:%.*]] = insertelement <3 x double> [[TMP13]], double [[TMP14]], i64 1
; CHECK-NEXT: [[TMP16:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 2
; CHECK-NEXT: [[TMP17:%.*]] = insertelement <3 x double> [[TMP15]], double [[TMP16]], i64 2
-; CHECK-NEXT: [[TMP18:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> [[TMP11]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
-; CHECK-NEXT: [[TMP19:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <6 x i32> <i32 0, i32 1, i32 2, i32 undef, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP20:%.*]] = shufflevector <6 x double> [[TMP18]], <6 x double> [[TMP19]], <9 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
-; CHECK-NEXT: [[SPLIT3:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
-; CHECK-NEXT: [[SPLIT4:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>
-; CHECK-NEXT: [[SPLIT5:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>
-; CHECK-NEXT: [[SPLIT6:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
-; CHECK-NEXT: [[SPLIT7:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>
-; CHECK-NEXT: [[SPLIT8:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>
-; CHECK-NEXT: [[BLOCK:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP21:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT:%.*]] = insertelement <1 x double> undef, double [[TMP21]], i32 0
+
+; Extract columns from %B.
+
+; CHECK-NEXT: [[SPLIT3:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
+; CHECK-NEXT: [[SPLIT4:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>
+; CHECK-NEXT: [[SPLIT5:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>
+
+; Lower multiply(transpose(%A), %B)
+
+; CHECK-NEXT: [[BLOCK:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP18:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT:%.*]] = insertelement <1 x double> undef, double [[TMP18]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP22:%.*]] = fmul <1 x double> [[BLOCK]], [[SPLAT_SPLAT]]
-; CHECK-NEXT: [[BLOCK9:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP23:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1
+; CHECK-NEXT: [[TMP19:%.*]] = fmul <1 x double> [[BLOCK]], [[SPLAT_SPLAT]]
+; CHECK-NEXT: [[BLOCK6:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT7:%.*]] = insertelement <1 x double> undef, double [[TMP20]], i32 0
+; CHECK-NEXT: [[SPLAT_SPLAT8:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT7]], <1 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP21:%.*]] = fmul <1 x double> [[BLOCK6]], [[SPLAT_SPLAT8]]
+; CHECK-NEXT: [[TMP22:%.*]] = fadd <1 x double> [[TMP19]], [[TMP21]]
+; CHECK-NEXT: [[BLOCK9:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP23:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT10:%.*]] = insertelement <1 x double> undef, double [[TMP23]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT11:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT10]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP24:%.*]] = fmul <1 x double> [[BLOCK9]], [[SPLAT_SPLAT11]]
; CHECK-NEXT: [[TMP25:%.*]] = fadd <1 x double> [[TMP22]], [[TMP24]]
-; CHECK-NEXT: [[BLOCK12:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP26:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT13:%.*]] = insertelement <1 x double> undef, double [[TMP26]], i32 0
+; CHECK-NEXT: [[TMP26:%.*]] = shufflevector <1 x double> [[TMP25]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP27:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP26]], <3 x i32> <i32 3, i32 1, i32 2>
+; CHECK-NEXT: [[BLOCK12:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP28:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT13:%.*]] = insertelement <1 x double> undef, double [[TMP28]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT14:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT13]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP27:%.*]] = fmul <1 x double> [[BLOCK12]], [[SPLAT_SPLAT14]]
-; CHECK-NEXT: [[TMP28:%.*]] = fadd <1 x double> [[TMP25]], [[TMP27]]
-; CHECK-NEXT: [[TMP29:%.*]] = shufflevector <1 x double> [[TMP28]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP30:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP29]], <3 x i32> <i32 3, i32 1, i32 2>
-; CHECK-NEXT: [[BLOCK15:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP31:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT16:%.*]] = insertelement <1 x double> undef, double [[TMP31]], i32 0
+; CHECK-NEXT: [[TMP29:%.*]] = fmul <1 x double> [[BLOCK12]], [[SPLAT_SPLAT14]]
+; CHECK-NEXT: [[BLOCK15:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP30:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT16:%.*]] = insertelement <1 x double> undef, double [[TMP30]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT17:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT16]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP32:%.*]] = fmul <1 x double> [[BLOCK15]], [[SPLAT_SPLAT17]]
-; CHECK-NEXT: [[BLOCK18:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP33:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1
+; CHECK-NEXT: [[TMP31:%.*]] = fmul <1 x double> [[BLOCK15]], [[SPLAT_SPLAT17]]
+; CHECK-NEXT: [[TMP32:%.*]] = fadd <1 x double> [[TMP29]], [[TMP31]]
+; CHECK-NEXT: [[BLOCK18:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP33:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT19:%.*]] = insertelement <1 x double> undef, double [[TMP33]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT20:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT19]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP34:%.*]] = fmul <1 x double> [[BLOCK18]], [[SPLAT_SPLAT20]]
; CHECK-NEXT: [[TMP35:%.*]] = fadd <1 x double> [[TMP32]], [[TMP34]]
-; CHECK-NEXT: [[BLOCK21:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP36:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT22:%.*]] = insertelement <1 x double> undef, double [[TMP36]], i32 0
+; CHECK-NEXT: [[TMP36:%.*]] = shufflevector <1 x double> [[TMP35]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP37:%.*]] = shufflevector <3 x double> [[TMP27]], <3 x double> [[TMP36]], <3 x i32> <i32 0, i32 3, i32 2>
+; CHECK-NEXT: [[BLOCK21:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP38:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT22:%.*]] = insertelement <1 x double> undef, double [[TMP38]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT23:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT22]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP37:%.*]] = fmul <1 x double> [[BLOCK21]], [[SPLAT_SPLAT23]]
-; CHECK-NEXT: [[TMP38:%.*]] = fadd <1 x double> [[TMP35]], [[TMP37]]
-; CHECK-NEXT: [[TMP39:%.*]] = shufflevector <1 x double> [[TMP38]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP40:%.*]] = shufflevector <3 x double> [[TMP30]], <3 x double> [[TMP39]], <3 x i32> <i32 0, i32 3, i32 2>
-; CHECK-NEXT: [[BLOCK24:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP41:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT25:%.*]] = insertelement <1 x double> undef, double [[TMP41]], i32 0
+; CHECK-NEXT: [[TMP39:%.*]] = fmul <1 x double> [[BLOCK21]], [[SPLAT_SPLAT23]]
+; CHECK-NEXT: [[BLOCK24:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP40:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT25:%.*]] = insertelement <1 x double> undef, double [[TMP40]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT26:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT25]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP42:%.*]] = fmul <1 x double> [[BLOCK24]], [[SPLAT_SPLAT26]]
-; CHECK-NEXT: [[BLOCK27:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP43:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1
+; CHECK-NEXT: [[TMP41:%.*]] = fmul <1 x double> [[BLOCK24]], [[SPLAT_SPLAT26]]
+; CHECK-NEXT: [[TMP42:%.*]] = fadd <1 x double> [[TMP39]], [[TMP41]]
+; CHECK-NEXT: [[BLOCK27:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP43:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT28:%.*]] = insertelement <1 x double> undef, double [[TMP43]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT29:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT28]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP44:%.*]] = fmul <1 x double> [[BLOCK27]], [[SPLAT_SPLAT29]]
; CHECK-NEXT: [[TMP45:%.*]] = fadd <1 x double> [[TMP42]], [[TMP44]]
-; CHECK-NEXT: [[BLOCK30:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP46:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT31:%.*]] = insertelement <1 x double> undef, double [[TMP46]], i32 0
+; CHECK-NEXT: [[TMP46:%.*]] = shufflevector <1 x double> [[TMP45]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP47:%.*]] = shufflevector <3 x double> [[TMP37]], <3 x double> [[TMP46]], <3 x i32> <i32 0, i32 1, i32 3>
+; CHECK-NEXT: [[BLOCK30:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP48:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT31:%.*]] = insertelement <1 x double> undef, double [[TMP48]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT32:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT31]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP47:%.*]] = fmul <1 x double> [[BLOCK30]], [[SPLAT_SPLAT32]]
-; CHECK-NEXT: [[TMP48:%.*]] = fadd <1 x double> [[TMP45]], [[TMP47]]
-; CHECK-NEXT: [[TMP49:%.*]] = shufflevector <1 x double> [[TMP48]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP50:%.*]] = shufflevector <3 x double> [[TMP40]], <3 x double> [[TMP49]], <3 x i32> <i32 0, i32 1, i32 3>
-; CHECK-NEXT: [[BLOCK33:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP51:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT34:%.*]] = insertelement <1 x double> undef, double [[TMP51]], i32 0
+; CHECK-NEXT: [[TMP49:%.*]] = fmul <1 x double> [[BLOCK30]], [[SPLAT_SPLAT32]]
+; CHECK-NEXT: [[BLOCK33:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP50:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT34:%.*]] = insertelement <1 x double> undef, double [[TMP50]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT35:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT34]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP52:%.*]] = fmul <1 x double> [[BLOCK33]], [[SPLAT_SPLAT35]]
-; CHECK-NEXT: [[BLOCK36:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP53:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1
+; CHECK-NEXT: [[TMP51:%.*]] = fmul <1 x double> [[BLOCK33]], [[SPLAT_SPLAT35]]
+; CHECK-NEXT: [[TMP52:%.*]] = fadd <1 x double> [[TMP49]], [[TMP51]]
+; CHECK-NEXT: [[BLOCK36:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP53:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT37:%.*]] = insertelement <1 x double> undef, double [[TMP53]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT38:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT37]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP54:%.*]] = fmul <1 x double> [[BLOCK36]], [[SPLAT_SPLAT38]]
; CHECK-NEXT: [[TMP55:%.*]] = fadd <1 x double> [[TMP52]], [[TMP54]]
-; CHECK-NEXT: [[BLOCK39:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP56:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT40:%.*]] = insertelement <1 x double> undef, double [[TMP56]], i32 0
+; CHECK-NEXT: [[TMP56:%.*]] = shufflevector <1 x double> [[TMP55]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP57:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP56]], <3 x i32> <i32 3, i32 1, i32 2>
+; CHECK-NEXT: [[BLOCK39:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP58:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT40:%.*]] = insertelement <1 x double> undef, double [[TMP58]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT41:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT40]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP57:%.*]] = fmul <1 x double> [[BLOCK39]], [[SPLAT_SPLAT41]]
-; CHECK-NEXT: [[TMP58:%.*]] = fadd <1 x double> [[TMP55]], [[TMP57]]
-; CHECK-NEXT: [[TMP59:%.*]] = shufflevector <1 x double> [[TMP58]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP60:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP59]], <3 x i32> <i32 3, i32 1, i32 2>
-; CHECK-NEXT: [[BLOCK42:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP61:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT43:%.*]] = insertelement <1 x double> undef, double [[TMP61]], i32 0
+; CHECK-NEXT: [[TMP59:%.*]] = fmul <1 x double> [[BLOCK39]], [[SPLAT_SPLAT41]]
+; CHECK-NEXT: [[BLOCK42:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP60:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT43:%.*]] = insertelement <1 x double> undef, double [[TMP60]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT44:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT43]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP62:%.*]] = fmul <1 x double> [[BLOCK42]], [[SPLAT_SPLAT44]]
-; CHECK-NEXT: [[BLOCK45:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP63:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1
+; CHECK-NEXT: [[TMP61:%.*]] = fmul <1 x double> [[BLOCK42]], [[SPLAT_SPLAT44]]
+; CHECK-NEXT: [[TMP62:%.*]] = fadd <1 x double> [[TMP59]], [[TMP61]]
+; CHECK-NEXT: [[BLOCK45:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP63:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT46:%.*]] = insertelement <1 x double> undef, double [[TMP63]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT47:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT46]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP64:%.*]] = fmul <1 x double> [[BLOCK45]], [[SPLAT_SPLAT47]]
; CHECK-NEXT: [[TMP65:%.*]] = fadd <1 x double> [[TMP62]], [[TMP64]]
-; CHECK-NEXT: [[BLOCK48:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP66:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT49:%.*]] = insertelement <1 x double> undef, double [[TMP66]], i32 0
+; CHECK-NEXT: [[TMP66:%.*]] = shufflevector <1 x double> [[TMP65]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP67:%.*]] = shufflevector <3 x double> [[TMP57]], <3 x double> [[TMP66]], <3 x i32> <i32 0, i32 3, i32 2>
+; CHECK-NEXT: [[BLOCK48:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP68:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT49:%.*]] = insertelement <1 x double> undef, double [[TMP68]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT50:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT49]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP67:%.*]] = fmul <1 x double> [[BLOCK48]], [[SPLAT_SPLAT50]]
-; CHECK-NEXT: [[TMP68:%.*]] = fadd <1 x double> [[TMP65]], [[TMP67]]
-; CHECK-NEXT: [[TMP69:%.*]] = shufflevector <1 x double> [[TMP68]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP70:%.*]] = shufflevector <3 x double> [[TMP60]], <3 x double> [[TMP69]], <3 x i32> <i32 0, i32 3, i32 2>
-; CHECK-NEXT: [[BLOCK51:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP71:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT52:%.*]] = insertelement <1 x double> undef, double [[TMP71]], i32 0
+; CHECK-NEXT: [[TMP69:%.*]] = fmul <1 x double> [[BLOCK48]], [[SPLAT_SPLAT50]]
+; CHECK-NEXT: [[BLOCK51:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP70:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT52:%.*]] = insertelement <1 x double> undef, double [[TMP70]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT53:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT52]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP72:%.*]] = fmul <1 x double> [[BLOCK51]], [[SPLAT_SPLAT53]]
-; CHECK-NEXT: [[BLOCK54:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP73:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1
+; CHECK-NEXT: [[TMP71:%.*]] = fmul <1 x double> [[BLOCK51]], [[SPLAT_SPLAT53]]
+; CHECK-NEXT: [[TMP72:%.*]] = fadd <1 x double> [[TMP69]], [[TMP71]]
+; CHECK-NEXT: [[BLOCK54:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP73:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT55:%.*]] = insertelement <1 x double> undef, double [[TMP73]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT56:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT55]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP74:%.*]] = fmul <1 x double> [[BLOCK54]], [[SPLAT_SPLAT56]]
; CHECK-NEXT: [[TMP75:%.*]] = fadd <1 x double> [[TMP72]], [[TMP74]]
-; CHECK-NEXT: [[BLOCK57:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP76:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT58:%.*]] = insertelement <1 x double> undef, double [[TMP76]], i32 0
+; CHECK-NEXT: [[TMP76:%.*]] = shufflevector <1 x double> [[TMP75]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP77:%.*]] = shufflevector <3 x double> [[TMP67]], <3 x double> [[TMP76]], <3 x i32> <i32 0, i32 1, i32 3>
+; CHECK-NEXT: [[BLOCK57:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP78:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT58:%.*]] = insertelement <1 x double> undef, double [[TMP78]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT59:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT58]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP77:%.*]] = fmul <1 x double> [[BLOCK57]], [[SPLAT_SPLAT59]]
-; CHECK-NEXT: [[TMP78:%.*]] = fadd <1 x double> [[TMP75]], [[TMP77]]
-; CHECK-NEXT: [[TMP79:%.*]] = shufflevector <1 x double> [[TMP78]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP80:%.*]] = shufflevector <3 x double> [[TMP70]], <3 x double> [[TMP79]], <3 x i32> <i32 0, i32 1, i32 3>
-; CHECK-NEXT: [[BLOCK60:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP81:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT61:%.*]] = insertelement <1 x double> undef, double [[TMP81]], i32 0
+; CHECK-NEXT: [[TMP79:%.*]] = fmul <1 x double> [[BLOCK57]], [[SPLAT_SPLAT59]]
+; CHECK-NEXT: [[BLOCK60:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP80:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT61:%.*]] = insertelement <1 x double> undef, double [[TMP80]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT62:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT61]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP82:%.*]] = fmul <1 x double> [[BLOCK60]], [[SPLAT_SPLAT62]]
-; CHECK-NEXT: [[BLOCK63:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP83:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1
+; CHECK-NEXT: [[TMP81:%.*]] = fmul <1 x double> [[BLOCK60]], [[SPLAT_SPLAT62]]
+; CHECK-NEXT: [[TMP82:%.*]] = fadd <1 x double> [[TMP79]], [[TMP81]]
+; CHECK-NEXT: [[BLOCK63:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP83:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT64:%.*]] = insertelement <1 x double> undef, double [[TMP83]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT65:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT64]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP84:%.*]] = fmul <1 x double> [[BLOCK63]], [[SPLAT_SPLAT65]]
; CHECK-NEXT: [[TMP85:%.*]] = fadd <1 x double> [[TMP82]], [[TMP84]]
-; CHECK-NEXT: [[BLOCK66:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP86:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT67:%.*]] = insertelement <1 x double> undef, double [[TMP86]], i32 0
+; CHECK-NEXT: [[TMP86:%.*]] = shufflevector <1 x double> [[TMP85]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP87:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP86]], <3 x i32> <i32 3, i32 1, i32 2>
+; CHECK-NEXT: [[BLOCK66:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP88:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT67:%.*]] = insertelement <1 x double> undef, double [[TMP88]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT68:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT67]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP87:%.*]] = fmul <1 x double> [[BLOCK66]], [[SPLAT_SPLAT68]]
-; CHECK-NEXT: [[TMP88:%.*]] = fadd <1 x double> [[TMP85]], [[TMP87]]
-; CHECK-NEXT: [[TMP89:%.*]] = shufflevector <1 x double> [[TMP88]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP90:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP89]], <3 x i32> <i32 3, i32 1, i32 2>
-; CHECK-NEXT: [[BLOCK69:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP91:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT70:%.*]] = insertelement <1 x double> undef, double [[TMP91]], i32 0
+; CHECK-NEXT: [[TMP89:%.*]] = fmul <1 x double> [[BLOCK66]], [[SPLAT_SPLAT68]]
+; CHECK-NEXT: [[BLOCK69:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP90:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT70:%.*]] = insertelement <1 x double> undef, double [[TMP90]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT71:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT70]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP92:%.*]] = fmul <1 x double> [[BLOCK69]], [[SPLAT_SPLAT71]]
-; CHECK-NEXT: [[BLOCK72:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP93:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1
+; CHECK-NEXT: [[TMP91:%.*]] = fmul <1 x double> [[BLOCK69]], [[SPLAT_SPLAT71]]
+; CHECK-NEXT: [[TMP92:%.*]] = fadd <1 x double> [[TMP89]], [[TMP91]]
+; CHECK-NEXT: [[BLOCK72:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP93:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT73:%.*]] = insertelement <1 x double> undef, double [[TMP93]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT74:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT73]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP94:%.*]] = fmul <1 x double> [[BLOCK72]], [[SPLAT_SPLAT74]]
; CHECK-NEXT: [[TMP95:%.*]] = fadd <1 x double> [[TMP92]], [[TMP94]]
-; CHECK-NEXT: [[BLOCK75:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP96:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT76:%.*]] = insertelement <1 x double> undef, double [[TMP96]], i32 0
+; CHECK-NEXT: [[TMP96:%.*]] = shufflevector <1 x double> [[TMP95]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP97:%.*]] = shufflevector <3 x double> [[TMP87]], <3 x double> [[TMP96]], <3 x i32> <i32 0, i32 3, i32 2>
+; CHECK-NEXT: [[BLOCK75:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP98:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT76:%.*]] = insertelement <1 x double> undef, double [[TMP98]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT77:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT76]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP97:%.*]] = fmul <1 x double> [[BLOCK75]], [[SPLAT_SPLAT77]]
-; CHECK-NEXT: [[TMP98:%.*]] = fadd <1 x double> [[TMP95]], [[TMP97]]
-; CHECK-NEXT: [[TMP99:%.*]] = shufflevector <1 x double> [[TMP98]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP100:%.*]] = shufflevector <3 x double> [[TMP90]], <3 x double> [[TMP99]], <3 x i32> <i32 0, i32 3, i32 2>
-; CHECK-NEXT: [[BLOCK78:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP101:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT79:%.*]] = insertelement <1 x double> undef, double [[TMP101]], i32 0
+; CHECK-NEXT: [[TMP99:%.*]] = fmul <1 x double> [[BLOCK75]], [[SPLAT_SPLAT77]]
+; CHECK-NEXT: [[BLOCK78:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP100:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT79:%.*]] = insertelement <1 x double> undef, double [[TMP100]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT80:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT79]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP102:%.*]] = fmul <1 x double> [[BLOCK78]], [[SPLAT_SPLAT80]]
-; CHECK-NEXT: [[BLOCK81:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP103:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1
+; CHECK-NEXT: [[TMP101:%.*]] = fmul <1 x double> [[BLOCK78]], [[SPLAT_SPLAT80]]
+; CHECK-NEXT: [[TMP102:%.*]] = fadd <1 x double> [[TMP99]], [[TMP101]]
+; CHECK-NEXT: [[BLOCK81:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP103:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT82:%.*]] = insertelement <1 x double> undef, double [[TMP103]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT83:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT82]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP104:%.*]] = fmul <1 x double> [[BLOCK81]], [[SPLAT_SPLAT83]]
; CHECK-NEXT: [[TMP105:%.*]] = fadd <1 x double> [[TMP102]], [[TMP104]]
-; CHECK-NEXT: [[BLOCK84:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP106:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT85:%.*]] = insertelement <1 x double> undef, double [[TMP106]], i32 0
-; CHECK-NEXT: [[SPLAT_SPLAT86:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT85]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP107:%.*]] = fmul <1 x double> [[BLOCK84]], [[SPLAT_SPLAT86]]
-; CHECK-NEXT: [[TMP108:%.*]] = fadd <1 x double> [[TMP105]], [[TMP107]]
-; CHECK-NEXT: [[TMP109:%.*]] = shufflevector <1 x double> [[TMP108]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP110:%.*]] = shufflevector <3 x double> [[TMP100]], <3 x double> [[TMP109]], <3 x i32> <i32 0, i32 1, i32 3>
-; CHECK-NEXT: [[TMP111:%.*]] = shufflevector <3 x double> [[TMP50]], <3 x double> [[TMP80]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
-; CHECK-NEXT: [[TMP112:%.*]] = shufflevector <3 x double> [[TMP110]], <3 x double> undef, <6 x i32> <i32 0, i32 1, i32 2, i32 undef, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP113:%.*]] = shufflevector <6 x double> [[TMP111]], <6 x double> [[TMP112]], <9 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
-; CHECK-NEXT: store <9 x double> [[TMP113]], <9 x double>* [[C_PTR:%.*]]
+; CHECK-NEXT: [[TMP106:%.*]] = shufflevector <1 x double> [[TMP105]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP107:%.*]] = shufflevector <3 x double> [[TMP97]], <3 x double> [[TMP106]], <3 x i32> <i32 0, i32 1, i32 3>
+
+; Store result columns.
+
+; CHECK-NEXT: [[TMP108:%.*]] = bitcast <9 x double>* [[C_PTR:%.*]] to double*
+; CHECK-NEXT: [[TMP109:%.*]] = bitcast double* [[TMP108]] to <3 x double>*
+; CHECK-NEXT: store <3 x double> [[TMP47]], <3 x double>* [[TMP109]], align 8
+; CHECK-NEXT: [[TMP110:%.*]] = getelementptr double, double* [[TMP108]], i32 3
+; CHECK-NEXT: [[TMP111:%.*]] = bitcast double* [[TMP110]] to <3 x double>*
+; CHECK-NEXT: store <3 x double> [[TMP77]], <3 x double>* [[TMP111]], align 8
+; CHECK-NEXT: [[TMP112:%.*]] = getelementptr double, double* [[TMP108]], i32 6
+; CHECK-NEXT: [[TMP113:%.*]] = bitcast double* [[TMP112]] to <3 x double>*
+; CHECK-NEXT: store <3 x double> [[TMP107]], <3 x double>* [[TMP113]], align 8
; CHECK-NEXT: ret void
;
+
entry:
%a = load <9 x double>, <9 x double>* %A.Ptr
%b = load <9 x double>, <9 x double>* %B.Ptr
@@ -230,11 +247,20 @@ declare <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double>, <9 x
define void @transpose_multiply_add(<9 x double>* %A.Ptr, <9 x double>* %B.Ptr, <9 x double>* %C.Ptr) {
; CHECK-LABEL: @transpose_multiply_add(
; CHECK-NEXT: entry:
+
+; Load input matrices %A and %B.
+
; CHECK-NEXT: [[A:%.*]] = load <9 x double>, <9 x double>* [[A_PTR:%.*]]
; CHECK-NEXT: [[B:%.*]] = load <9 x double>, <9 x double>* [[B_PTR:%.*]]
+
+; Extract columns from loaded value %A.
+
; CHECK-NEXT: [[SPLIT:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
; CHECK-NEXT: [[SPLIT1:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>
; CHECK-NEXT: [[SPLIT2:%.*]] = shufflevector <9 x double> [[A]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>
+
+; Transpose %A.
+
; CHECK-NEXT: [[TMP0:%.*]] = extractelement <3 x double> [[SPLIT]], i64 0
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <3 x double> undef, double [[TMP0]], i64 0
; CHECK-NEXT: [[TMP2:%.*]] = extractelement <3 x double> [[SPLIT1]], i64 0
@@ -253,191 +279,197 @@ define void @transpose_multiply_add(<9 x double>* %A.Ptr, <9 x double>* %B.Ptr,
; CHECK-NEXT: [[TMP15:%.*]] = insertelement <3 x double> [[TMP13]], double [[TMP14]], i64 1
; CHECK-NEXT: [[TMP16:%.*]] = extractelement <3 x double> [[SPLIT2]], i64 2
; CHECK-NEXT: [[TMP17:%.*]] = insertelement <3 x double> [[TMP15]], double [[TMP16]], i64 2
-; CHECK-NEXT: [[TMP18:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> [[TMP11]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
-; CHECK-NEXT: [[TMP19:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <6 x i32> <i32 0, i32 1, i32 2, i32 undef, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP20:%.*]] = shufflevector <6 x double> [[TMP18]], <6 x double> [[TMP19]], <9 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
-; CHECK-NEXT: [[SPLIT3:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
-; CHECK-NEXT: [[SPLIT4:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>
-; CHECK-NEXT: [[SPLIT5:%.*]] = shufflevector <9 x double> [[TMP20]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>
-; CHECK-NEXT: [[SPLIT6:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
-; CHECK-NEXT: [[SPLIT7:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>
-; CHECK-NEXT: [[SPLIT8:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>
-; CHECK-NEXT: [[BLOCK:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP21:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT:%.*]] = insertelement <1 x double> undef, double [[TMP21]], i32 0
+
+; Extract columns from %B.
+
+; CHECK-NEXT: [[SPLIT3:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 0, i32 1, i32 2>
+; CHECK-NEXT: [[SPLIT4:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 3, i32 4, i32 5>
+; CHECK-NEXT: [[SPLIT5:%.*]] = shufflevector <9 x double> [[B]], <9 x double> undef, <3 x i32> <i32 6, i32 7, i32 8>
+
+; Lower multiply(transpose(%A), %B).
+
+; CHECK-NEXT: [[BLOCK:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP18:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT:%.*]] = insertelement <1 x double> undef, double [[TMP18]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP22:%.*]] = fmul <1 x double> [[BLOCK]], [[SPLAT_SPLAT]]
-; CHECK-NEXT: [[BLOCK9:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP23:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1
+; CHECK-NEXT: [[TMP19:%.*]] = fmul <1 x double> [[BLOCK]], [[SPLAT_SPLAT]]
+; CHECK-NEXT: [[BLOCK6:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP20:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT7:%.*]] = insertelement <1 x double> undef, double [[TMP20]], i32 0
+; CHECK-NEXT: [[SPLAT_SPLAT8:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT7]], <1 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP21:%.*]] = fmul <1 x double> [[BLOCK6]], [[SPLAT_SPLAT8]]
+; CHECK-NEXT: [[TMP22:%.*]] = fadd <1 x double> [[TMP19]], [[TMP21]]
+; CHECK-NEXT: [[BLOCK9:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP23:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT10:%.*]] = insertelement <1 x double> undef, double [[TMP23]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT11:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT10]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP24:%.*]] = fmul <1 x double> [[BLOCK9]], [[SPLAT_SPLAT11]]
; CHECK-NEXT: [[TMP25:%.*]] = fadd <1 x double> [[TMP22]], [[TMP24]]
-; CHECK-NEXT: [[BLOCK12:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP26:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT13:%.*]] = insertelement <1 x double> undef, double [[TMP26]], i32 0
+; CHECK-NEXT: [[TMP26:%.*]] = shufflevector <1 x double> [[TMP25]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP27:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP26]], <3 x i32> <i32 3, i32 1, i32 2>
+; CHECK-NEXT: [[BLOCK12:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP28:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT13:%.*]] = insertelement <1 x double> undef, double [[TMP28]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT14:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT13]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP27:%.*]] = fmul <1 x double> [[BLOCK12]], [[SPLAT_SPLAT14]]
-; CHECK-NEXT: [[TMP28:%.*]] = fadd <1 x double> [[TMP25]], [[TMP27]]
-; CHECK-NEXT: [[TMP29:%.*]] = shufflevector <1 x double> [[TMP28]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP30:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP29]], <3 x i32> <i32 3, i32 1, i32 2>
-; CHECK-NEXT: [[BLOCK15:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP31:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT16:%.*]] = insertelement <1 x double> undef, double [[TMP31]], i32 0
+; CHECK-NEXT: [[TMP29:%.*]] = fmul <1 x double> [[BLOCK12]], [[SPLAT_SPLAT14]]
+; CHECK-NEXT: [[BLOCK15:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP30:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT16:%.*]] = insertelement <1 x double> undef, double [[TMP30]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT17:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT16]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP32:%.*]] = fmul <1 x double> [[BLOCK15]], [[SPLAT_SPLAT17]]
-; CHECK-NEXT: [[BLOCK18:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP33:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1
+; CHECK-NEXT: [[TMP31:%.*]] = fmul <1 x double> [[BLOCK15]], [[SPLAT_SPLAT17]]
+; CHECK-NEXT: [[TMP32:%.*]] = fadd <1 x double> [[TMP29]], [[TMP31]]
+; CHECK-NEXT: [[BLOCK18:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP33:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT19:%.*]] = insertelement <1 x double> undef, double [[TMP33]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT20:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT19]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP34:%.*]] = fmul <1 x double> [[BLOCK18]], [[SPLAT_SPLAT20]]
; CHECK-NEXT: [[TMP35:%.*]] = fadd <1 x double> [[TMP32]], [[TMP34]]
-; CHECK-NEXT: [[BLOCK21:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP36:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT22:%.*]] = insertelement <1 x double> undef, double [[TMP36]], i32 0
+; CHECK-NEXT: [[TMP36:%.*]] = shufflevector <1 x double> [[TMP35]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP37:%.*]] = shufflevector <3 x double> [[TMP27]], <3 x double> [[TMP36]], <3 x i32> <i32 0, i32 3, i32 2>
+; CHECK-NEXT: [[BLOCK21:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP38:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT22:%.*]] = insertelement <1 x double> undef, double [[TMP38]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT23:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT22]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP37:%.*]] = fmul <1 x double> [[BLOCK21]], [[SPLAT_SPLAT23]]
-; CHECK-NEXT: [[TMP38:%.*]] = fadd <1 x double> [[TMP35]], [[TMP37]]
-; CHECK-NEXT: [[TMP39:%.*]] = shufflevector <1 x double> [[TMP38]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP40:%.*]] = shufflevector <3 x double> [[TMP30]], <3 x double> [[TMP39]], <3 x i32> <i32 0, i32 3, i32 2>
-; CHECK-NEXT: [[BLOCK24:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP41:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT25:%.*]] = insertelement <1 x double> undef, double [[TMP41]], i32 0
+; CHECK-NEXT: [[TMP39:%.*]] = fmul <1 x double> [[BLOCK21]], [[SPLAT_SPLAT23]]
+; CHECK-NEXT: [[BLOCK24:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP40:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT25:%.*]] = insertelement <1 x double> undef, double [[TMP40]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT26:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT25]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP42:%.*]] = fmul <1 x double> [[BLOCK24]], [[SPLAT_SPLAT26]]
-; CHECK-NEXT: [[BLOCK27:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP43:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 1
+; CHECK-NEXT: [[TMP41:%.*]] = fmul <1 x double> [[BLOCK24]], [[SPLAT_SPLAT26]]
+; CHECK-NEXT: [[TMP42:%.*]] = fadd <1 x double> [[TMP39]], [[TMP41]]
+; CHECK-NEXT: [[BLOCK27:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP43:%.*]] = extractelement <3 x double> [[SPLIT3]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT28:%.*]] = insertelement <1 x double> undef, double [[TMP43]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT29:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT28]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP44:%.*]] = fmul <1 x double> [[BLOCK27]], [[SPLAT_SPLAT29]]
; CHECK-NEXT: [[TMP45:%.*]] = fadd <1 x double> [[TMP42]], [[TMP44]]
-; CHECK-NEXT: [[BLOCK30:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP46:%.*]] = extractelement <3 x double> [[SPLIT6]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT31:%.*]] = insertelement <1 x double> undef, double [[TMP46]], i32 0
+; CHECK-NEXT: [[TMP46:%.*]] = shufflevector <1 x double> [[TMP45]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP47:%.*]] = shufflevector <3 x double> [[TMP37]], <3 x double> [[TMP46]], <3 x i32> <i32 0, i32 1, i32 3>
+; CHECK-NEXT: [[BLOCK30:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP48:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT31:%.*]] = insertelement <1 x double> undef, double [[TMP48]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT32:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT31]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP47:%.*]] = fmul <1 x double> [[BLOCK30]], [[SPLAT_SPLAT32]]
-; CHECK-NEXT: [[TMP48:%.*]] = fadd <1 x double> [[TMP45]], [[TMP47]]
-; CHECK-NEXT: [[TMP49:%.*]] = shufflevector <1 x double> [[TMP48]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP50:%.*]] = shufflevector <3 x double> [[TMP40]], <3 x double> [[TMP49]], <3 x i32> <i32 0, i32 1, i32 3>
-; CHECK-NEXT: [[BLOCK33:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP51:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT34:%.*]] = insertelement <1 x double> undef, double [[TMP51]], i32 0
+; CHECK-NEXT: [[TMP49:%.*]] = fmul <1 x double> [[BLOCK30]], [[SPLAT_SPLAT32]]
+; CHECK-NEXT: [[BLOCK33:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP50:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT34:%.*]] = insertelement <1 x double> undef, double [[TMP50]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT35:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT34]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP52:%.*]] = fmul <1 x double> [[BLOCK33]], [[SPLAT_SPLAT35]]
-; CHECK-NEXT: [[BLOCK36:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP53:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1
+; CHECK-NEXT: [[TMP51:%.*]] = fmul <1 x double> [[BLOCK33]], [[SPLAT_SPLAT35]]
+; CHECK-NEXT: [[TMP52:%.*]] = fadd <1 x double> [[TMP49]], [[TMP51]]
+; CHECK-NEXT: [[BLOCK36:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP53:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT37:%.*]] = insertelement <1 x double> undef, double [[TMP53]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT38:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT37]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP54:%.*]] = fmul <1 x double> [[BLOCK36]], [[SPLAT_SPLAT38]]
; CHECK-NEXT: [[TMP55:%.*]] = fadd <1 x double> [[TMP52]], [[TMP54]]
-; CHECK-NEXT: [[BLOCK39:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP56:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT40:%.*]] = insertelement <1 x double> undef, double [[TMP56]], i32 0
+; CHECK-NEXT: [[TMP56:%.*]] = shufflevector <1 x double> [[TMP55]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP57:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP56]], <3 x i32> <i32 3, i32 1, i32 2>
+; CHECK-NEXT: [[BLOCK39:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP58:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT40:%.*]] = insertelement <1 x double> undef, double [[TMP58]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT41:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT40]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP57:%.*]] = fmul <1 x double> [[BLOCK39]], [[SPLAT_SPLAT41]]
-; CHECK-NEXT: [[TMP58:%.*]] = fadd <1 x double> [[TMP55]], [[TMP57]]
-; CHECK-NEXT: [[TMP59:%.*]] = shufflevector <1 x double> [[TMP58]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP60:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP59]], <3 x i32> <i32 3, i32 1, i32 2>
-; CHECK-NEXT: [[BLOCK42:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP61:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT43:%.*]] = insertelement <1 x double> undef, double [[TMP61]], i32 0
+; CHECK-NEXT: [[TMP59:%.*]] = fmul <1 x double> [[BLOCK39]], [[SPLAT_SPLAT41]]
+; CHECK-NEXT: [[BLOCK42:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP60:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT43:%.*]] = insertelement <1 x double> undef, double [[TMP60]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT44:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT43]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP62:%.*]] = fmul <1 x double> [[BLOCK42]], [[SPLAT_SPLAT44]]
-; CHECK-NEXT: [[BLOCK45:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP63:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1
+; CHECK-NEXT: [[TMP61:%.*]] = fmul <1 x double> [[BLOCK42]], [[SPLAT_SPLAT44]]
+; CHECK-NEXT: [[TMP62:%.*]] = fadd <1 x double> [[TMP59]], [[TMP61]]
+; CHECK-NEXT: [[BLOCK45:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP63:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT46:%.*]] = insertelement <1 x double> undef, double [[TMP63]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT47:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT46]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP64:%.*]] = fmul <1 x double> [[BLOCK45]], [[SPLAT_SPLAT47]]
; CHECK-NEXT: [[TMP65:%.*]] = fadd <1 x double> [[TMP62]], [[TMP64]]
-; CHECK-NEXT: [[BLOCK48:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP66:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT49:%.*]] = insertelement <1 x double> undef, double [[TMP66]], i32 0
+; CHECK-NEXT: [[TMP66:%.*]] = shufflevector <1 x double> [[TMP65]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP67:%.*]] = shufflevector <3 x double> [[TMP57]], <3 x double> [[TMP66]], <3 x i32> <i32 0, i32 3, i32 2>
+; CHECK-NEXT: [[BLOCK48:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP68:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT49:%.*]] = insertelement <1 x double> undef, double [[TMP68]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT50:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT49]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP67:%.*]] = fmul <1 x double> [[BLOCK48]], [[SPLAT_SPLAT50]]
-; CHECK-NEXT: [[TMP68:%.*]] = fadd <1 x double> [[TMP65]], [[TMP67]]
-; CHECK-NEXT: [[TMP69:%.*]] = shufflevector <1 x double> [[TMP68]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP70:%.*]] = shufflevector <3 x double> [[TMP60]], <3 x double> [[TMP69]], <3 x i32> <i32 0, i32 3, i32 2>
-; CHECK-NEXT: [[BLOCK51:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP71:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT52:%.*]] = insertelement <1 x double> undef, double [[TMP71]], i32 0
+; CHECK-NEXT: [[TMP69:%.*]] = fmul <1 x double> [[BLOCK48]], [[SPLAT_SPLAT50]]
+; CHECK-NEXT: [[BLOCK51:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP70:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT52:%.*]] = insertelement <1 x double> undef, double [[TMP70]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT53:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT52]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP72:%.*]] = fmul <1 x double> [[BLOCK51]], [[SPLAT_SPLAT53]]
-; CHECK-NEXT: [[BLOCK54:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP73:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 1
+; CHECK-NEXT: [[TMP71:%.*]] = fmul <1 x double> [[BLOCK51]], [[SPLAT_SPLAT53]]
+; CHECK-NEXT: [[TMP72:%.*]] = fadd <1 x double> [[TMP69]], [[TMP71]]
+; CHECK-NEXT: [[BLOCK54:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP73:%.*]] = extractelement <3 x double> [[SPLIT4]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT55:%.*]] = insertelement <1 x double> undef, double [[TMP73]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT56:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT55]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP74:%.*]] = fmul <1 x double> [[BLOCK54]], [[SPLAT_SPLAT56]]
; CHECK-NEXT: [[TMP75:%.*]] = fadd <1 x double> [[TMP72]], [[TMP74]]
-; CHECK-NEXT: [[BLOCK57:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP76:%.*]] = extractelement <3 x double> [[SPLIT7]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT58:%.*]] = insertelement <1 x double> undef, double [[TMP76]], i32 0
+; CHECK-NEXT: [[TMP76:%.*]] = shufflevector <1 x double> [[TMP75]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP77:%.*]] = shufflevector <3 x double> [[TMP67]], <3 x double> [[TMP76]], <3 x i32> <i32 0, i32 1, i32 3>
+; CHECK-NEXT: [[BLOCK57:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP78:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT58:%.*]] = insertelement <1 x double> undef, double [[TMP78]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT59:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT58]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP77:%.*]] = fmul <1 x double> [[BLOCK57]], [[SPLAT_SPLAT59]]
-; CHECK-NEXT: [[TMP78:%.*]] = fadd <1 x double> [[TMP75]], [[TMP77]]
-; CHECK-NEXT: [[TMP79:%.*]] = shufflevector <1 x double> [[TMP78]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP80:%.*]] = shufflevector <3 x double> [[TMP70]], <3 x double> [[TMP79]], <3 x i32> <i32 0, i32 1, i32 3>
-; CHECK-NEXT: [[BLOCK60:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP81:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT61:%.*]] = insertelement <1 x double> undef, double [[TMP81]], i32 0
+; CHECK-NEXT: [[TMP79:%.*]] = fmul <1 x double> [[BLOCK57]], [[SPLAT_SPLAT59]]
+; CHECK-NEXT: [[BLOCK60:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP80:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT61:%.*]] = insertelement <1 x double> undef, double [[TMP80]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT62:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT61]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP82:%.*]] = fmul <1 x double> [[BLOCK60]], [[SPLAT_SPLAT62]]
-; CHECK-NEXT: [[BLOCK63:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP83:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1
+; CHECK-NEXT: [[TMP81:%.*]] = fmul <1 x double> [[BLOCK60]], [[SPLAT_SPLAT62]]
+; CHECK-NEXT: [[TMP82:%.*]] = fadd <1 x double> [[TMP79]], [[TMP81]]
+; CHECK-NEXT: [[BLOCK63:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP83:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT64:%.*]] = insertelement <1 x double> undef, double [[TMP83]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT65:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT64]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP84:%.*]] = fmul <1 x double> [[BLOCK63]], [[SPLAT_SPLAT65]]
; CHECK-NEXT: [[TMP85:%.*]] = fadd <1 x double> [[TMP82]], [[TMP84]]
-; CHECK-NEXT: [[BLOCK66:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP86:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT67:%.*]] = insertelement <1 x double> undef, double [[TMP86]], i32 0
+; CHECK-NEXT: [[TMP86:%.*]] = shufflevector <1 x double> [[TMP85]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP87:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP86]], <3 x i32> <i32 3, i32 1, i32 2>
+; CHECK-NEXT: [[BLOCK66:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP88:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT67:%.*]] = insertelement <1 x double> undef, double [[TMP88]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT68:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT67]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP87:%.*]] = fmul <1 x double> [[BLOCK66]], [[SPLAT_SPLAT68]]
-; CHECK-NEXT: [[TMP88:%.*]] = fadd <1 x double> [[TMP85]], [[TMP87]]
-; CHECK-NEXT: [[TMP89:%.*]] = shufflevector <1 x double> [[TMP88]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP90:%.*]] = shufflevector <3 x double> undef, <3 x double> [[TMP89]], <3 x i32> <i32 3, i32 1, i32 2>
-; CHECK-NEXT: [[BLOCK69:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP91:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT70:%.*]] = insertelement <1 x double> undef, double [[TMP91]], i32 0
+; CHECK-NEXT: [[TMP89:%.*]] = fmul <1 x double> [[BLOCK66]], [[SPLAT_SPLAT68]]
+; CHECK-NEXT: [[BLOCK69:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP90:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT70:%.*]] = insertelement <1 x double> undef, double [[TMP90]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT71:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT70]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP92:%.*]] = fmul <1 x double> [[BLOCK69]], [[SPLAT_SPLAT71]]
-; CHECK-NEXT: [[BLOCK72:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP93:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1
+; CHECK-NEXT: [[TMP91:%.*]] = fmul <1 x double> [[BLOCK69]], [[SPLAT_SPLAT71]]
+; CHECK-NEXT: [[TMP92:%.*]] = fadd <1 x double> [[TMP89]], [[TMP91]]
+; CHECK-NEXT: [[BLOCK72:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 1>
+; CHECK-NEXT: [[TMP93:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT73:%.*]] = insertelement <1 x double> undef, double [[TMP93]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT74:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT73]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP94:%.*]] = fmul <1 x double> [[BLOCK72]], [[SPLAT_SPLAT74]]
; CHECK-NEXT: [[TMP95:%.*]] = fadd <1 x double> [[TMP92]], [[TMP94]]
-; CHECK-NEXT: [[BLOCK75:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 1>
-; CHECK-NEXT: [[TMP96:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT76:%.*]] = insertelement <1 x double> undef, double [[TMP96]], i32 0
+; CHECK-NEXT: [[TMP96:%.*]] = shufflevector <1 x double> [[TMP95]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP97:%.*]] = shufflevector <3 x double> [[TMP87]], <3 x double> [[TMP96]], <3 x i32> <i32 0, i32 3, i32 2>
+; CHECK-NEXT: [[BLOCK75:%.*]] = shufflevector <3 x double> [[TMP5]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP98:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 0
+; CHECK-NEXT: [[SPLAT_SPLATINSERT76:%.*]] = insertelement <1 x double> undef, double [[TMP98]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT77:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT76]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP97:%.*]] = fmul <1 x double> [[BLOCK75]], [[SPLAT_SPLAT77]]
-; CHECK-NEXT: [[TMP98:%.*]] = fadd <1 x double> [[TMP95]], [[TMP97]]
-; CHECK-NEXT: [[TMP99:%.*]] = shufflevector <1 x double> [[TMP98]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP100:%.*]] = shufflevector <3 x double> [[TMP90]], <3 x double> [[TMP99]], <3 x i32> <i32 0, i32 3, i32 2>
-; CHECK-NEXT: [[BLOCK78:%.*]] = shufflevector <3 x double> [[SPLIT3]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP101:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 0
-; CHECK-NEXT: [[SPLAT_SPLATINSERT79:%.*]] = insertelement <1 x double> undef, double [[TMP101]], i32 0
+; CHECK-NEXT: [[TMP99:%.*]] = fmul <1 x double> [[BLOCK75]], [[SPLAT_SPLAT77]]
+; CHECK-NEXT: [[BLOCK78:%.*]] = shufflevector <3 x double> [[TMP11]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP100:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 1
+; CHECK-NEXT: [[SPLAT_SPLATINSERT79:%.*]] = insertelement <1 x double> undef, double [[TMP100]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT80:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT79]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP102:%.*]] = fmul <1 x double> [[BLOCK78]], [[SPLAT_SPLAT80]]
-; CHECK-NEXT: [[BLOCK81:%.*]] = shufflevector <3 x double> [[SPLIT4]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP103:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 1
+; CHECK-NEXT: [[TMP101:%.*]] = fmul <1 x double> [[BLOCK78]], [[SPLAT_SPLAT80]]
+; CHECK-NEXT: [[TMP102:%.*]] = fadd <1 x double> [[TMP99]], [[TMP101]]
+; CHECK-NEXT: [[BLOCK81:%.*]] = shufflevector <3 x double> [[TMP17]], <3 x double> undef, <1 x i32> <i32 2>
+; CHECK-NEXT: [[TMP103:%.*]] = extractelement <3 x double> [[SPLIT5]], i64 2
; CHECK-NEXT: [[SPLAT_SPLATINSERT82:%.*]] = insertelement <1 x double> undef, double [[TMP103]], i32 0
; CHECK-NEXT: [[SPLAT_SPLAT83:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT82]], <1 x double> undef, <1 x i32> zeroinitializer
; CHECK-NEXT: [[TMP104:%.*]] = fmul <1 x double> [[BLOCK81]], [[SPLAT_SPLAT83]]
; CHECK-NEXT: [[TMP105:%.*]] = fadd <1 x double> [[TMP102]], [[TMP104]]
-; CHECK-NEXT: [[BLOCK84:%.*]] = shufflevector <3 x double> [[SPLIT5]], <3 x double> undef, <1 x i32> <i32 2>
-; CHECK-NEXT: [[TMP106:%.*]] = extractelement <3 x double> [[SPLIT8]], i64 2
-; CHECK-NEXT: [[SPLAT_SPLATINSERT85:%.*]] = insertelement <1 x double> undef, double [[TMP106]], i32 0
-; CHECK-NEXT: [[SPLAT_SPLAT86:%.*]] = shufflevector <1 x double> [[SPLAT_SPLATINSERT85]], <1 x double> undef, <1 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP107:%.*]] = fmul <1 x double> [[BLOCK84]], [[SPLAT_SPLAT86]]
-; CHECK-NEXT: [[TMP108:%.*]] = fadd <1 x double> [[TMP105]], [[TMP107]]
-; CHECK-NEXT: [[TMP109:%.*]] = shufflevector <1 x double> [[TMP108]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP110:%.*]] = shufflevector <3 x double> [[TMP100]], <3 x double> [[TMP109]], <3 x i32> <i32 0, i32 1, i32 3>
-; CHECK-NEXT: [[TMP111:%.*]] = shufflevector <3 x double> [[TMP50]], <3 x double> [[TMP80]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
-; CHECK-NEXT: [[TMP112:%.*]] = shufflevector <3 x double> [[TMP110]], <3 x double> undef, <6 x i32> <i32 0, i32 1, i32 2, i32 undef, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP113:%.*]] = shufflevector <6 x double> [[TMP111]], <6 x double> [[TMP112]], <9 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
+
+; Embed result of multiply into flat vector.
+
+; CHECK-NEXT: [[TMP106:%.*]] = shufflevector <1 x double> [[TMP105]], <1 x double> undef, <3 x i32> <i32 0, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP107:%.*]] = shufflevector <3 x double> [[TMP97]], <3 x double> [[TMP106]], <3 x i32> <i32 0, i32 1, i32 3>
+; CHECK-NEXT: [[TMP108:%.*]] = shufflevector <3 x double> [[TMP47]], <3 x double> [[TMP77]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
+; CHECK-NEXT: [[TMP109:%.*]] = shufflevector <3 x double> [[TMP107]], <3 x double> undef, <6 x i32> <i32 0, i32 1, i32 2, i32 undef, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP110:%.*]] = shufflevector <6 x double> [[TMP108]], <6 x double> [[TMP109]], <9 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
+
+; Load %C and add result of multiply.
+
; CHECK-NEXT: [[C:%.*]] = load <9 x double>, <9 x double>* [[C_PTR:%.*]]
-; CHECK-NEXT: [[RES:%.*]] = fadd <9 x double> [[C]], [[TMP113]]
+; CHECK-NEXT: [[RES:%.*]] = fadd <9 x double> [[C]], [[TMP110]]
; CHECK-NEXT: store <9 x double> [[RES]], <9 x double>* [[C_PTR]]
; CHECK-NEXT: ret void
;
@@ -452,4 +484,3 @@ entry:
store <9 x double> %res, <9 x double>* %C.Ptr
ret void
}
-
diff --git a/llvm/test/Transforms/LowerMatrixIntrinsics/propagate-forward.ll b/llvm/test/Transforms/LowerMatrixIntrinsics/propagate-forward.ll
new file mode 100644
index 000000000000..3c398b319c01
--- /dev/null
+++ b/llvm/test/Transforms/LowerMatrixIntrinsics/propagate-forward.ll
@@ -0,0 +1,44 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -lower-matrix-intrinsics -S < %s | FileCheck %s
+; RUN: opt -passes='lower-matrix-intrinsics' -S < %s | FileCheck %s
+
+; Check that we do not emit shufflevectors to flatten the result of the
+; transpose, and that we instead store the columns directly.
+define void @transpose_store(<8 x double> %a, <8 x double>* %Ptr) {
+; CHECK-LABEL: @transpose_store(
+; CHECK-NEXT: entry:
+; CHECK-NEXT: [[SPLIT:%.*]] = shufflevector <8 x double> [[A:%.*]], <8 x double> undef, <2 x i32> <i32 0, i32 1>
+; CHECK-NEXT: [[SPLIT1:%.*]] = shufflevector <8 x double> [[A]], <8 x double> undef, <2 x i32> <i32 2, i32 3>
+; CHECK-NEXT: [[SPLIT2:%.*]] = shufflevector <8 x double> [[A]], <8 x double> undef, <2 x i32> <i32 4, i32 5>
+; CHECK-NEXT: [[SPLIT3:%.*]] = shufflevector <8 x double> [[A]], <8 x double> undef, <2 x i32> <i32 6, i32 7>
+; CHECK-NEXT: [[TMP0:%.*]] = extractelement <2 x double> [[SPLIT]], i64 0
+; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x double> undef, double [[TMP0]], i64 0
+; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x double> [[SPLIT1]], i64 0
+; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x double> [[TMP1]], double [[TMP2]], i64 1
+; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x double> [[SPLIT2]], i64 0
+; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x double> [[TMP3]], double [[TMP4]], i64 2
+; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x double> [[SPLIT3]], i64 0
+; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x double> [[TMP5]], double [[TMP6]], i64 3
+; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x double> [[SPLIT]], i64 1
+; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x double> undef, double [[TMP8]], i64 0
+; CHECK-NEXT: [[TMP10:%.*]] = extractelement <2 x double> [[SPLIT1]], i64 1
+; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x double> [[TMP9]], double [[TMP10]], i64 1
+; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x double> [[SPLIT2]], i64 1
+; CHECK-NEXT: [[TMP13:%.*]] = insertelement <4 x double> [[TMP11]], double [[TMP12]], i64 2
+; CHECK-NEXT: [[TMP14:%.*]] = extractelement <2 x double> [[SPLIT3]], i64 1
+; CHECK-NEXT: [[TMP15:%.*]] = insertelement <4 x double> [[TMP13]], double [[TMP14]], i64 3
+; CHECK-NEXT: [[TMP16:%.*]] = bitcast <8 x double>* [[PTR:%.*]] to double*
+; CHECK-NEXT: [[TMP17:%.*]] = bitcast double* [[TMP16]] to <4 x double>*
+; CHECK-NEXT: store <4 x double> [[TMP7]], <4 x double>* [[TMP17]], align 8
+; CHECK-NEXT: [[TMP18:%.*]] = getelementptr double, double* [[TMP16]], i32 4
+; CHECK-NEXT: [[TMP19:%.*]] = bitcast double* [[TMP18]] to <4 x double>*
+; CHECK-NEXT: store <4 x double> [[TMP15]], <4 x double>* [[TMP19]], align 8
+; CHECK-NEXT: ret void
+;
+entry:
+ %c = call <8 x double> @llvm.matrix.transpose(<8 x double> %a, i32 2, i32 4)
+ store <8 x double> %c, <8 x double>* %Ptr
+ ret void
+}
+
+declare <8 x double> @llvm.matrix.transpose(<8 x double>, i32, i32)
diff --git a/llvm/test/Transforms/LowerMatrixIntrinsics/propagate-mixed-users.ll b/llvm/test/Transforms/LowerMatrixIntrinsics/propagate-mixed-users.ll
new file mode 100644
index 000000000000..f2bb272ba362
--- /dev/null
+++ b/llvm/test/Transforms/LowerMatrixIntrinsics/propagate-mixed-users.ll
@@ -0,0 +1,53 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -lower-matrix-intrinsics -S < %s | FileCheck %s
+; RUN: opt -passes='lower-matrix-intrinsics' -S < %s | FileCheck %s
+
+; Currently we only lower stores with shape information, but we still need to
+; embed the matrix in a flat vector for function calls and returns.
+define <8 x double> @strided_load_4x4(<8 x double> %in, <8 x double>* %Ptr) {
+; CHECK-LABEL: @strided_load_4x4(
+; CHECK-NEXT: [[SPLIT:%.*]] = shufflevector <8 x double> [[IN:%.*]], <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT: [[SPLIT1:%.*]] = shufflevector <8 x double> [[IN]], <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x double> [[SPLIT]], i64 0
+; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> undef, double [[TMP1]], i64 0
+; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x double> [[SPLIT1]], i64 0
+; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> [[TMP2]], double [[TMP3]], i64 1
+; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x double> [[SPLIT]], i64 1
+; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> undef, double [[TMP5]], i64 0
+; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x double> [[SPLIT1]], i64 1
+; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x double> [[TMP6]], double [[TMP7]], i64 1
+; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x double> [[SPLIT]], i64 2
+; CHECK-NEXT: [[TMP10:%.*]] = insertelement <2 x double> undef, double [[TMP9]], i64 0
+; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x double> [[SPLIT1]], i64 2
+; CHECK-NEXT: [[TMP12:%.*]] = insertelement <2 x double> [[TMP10]], double [[TMP11]], i64 1
+; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x double> [[SPLIT]], i64 3
+; CHECK-NEXT: [[TMP14:%.*]] = insertelement <2 x double> undef, double [[TMP13]], i64 0
+; CHECK-NEXT: [[TMP15:%.*]] = extractelement <4 x double> [[SPLIT1]], i64 3
+; CHECK-NEXT: [[TMP16:%.*]] = insertelement <2 x double> [[TMP14]], double [[TMP15]], i64 1
+; CHECK-NEXT: [[TMP17:%.*]] = shufflevector <2 x double> [[TMP4]], <2 x double> [[TMP8]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT: [[TMP18:%.*]] = shufflevector <2 x double> [[TMP12]], <2 x double> [[TMP16]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT: [[TMP19:%.*]] = shufflevector <4 x double> [[TMP17]], <4 x double> [[TMP18]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT: [[TMP20:%.*]] = bitcast <8 x double>* [[PTR:%.*]] to double*
+; CHECK-NEXT: [[TMP21:%.*]] = bitcast double* [[TMP20]] to <2 x double>*
+; CHECK-NEXT: store <2 x double> [[TMP4]], <2 x double>* [[TMP21]], align 8
+; CHECK-NEXT: [[TMP22:%.*]] = getelementptr double, double* [[TMP20]], i32 2
+; CHECK-NEXT: [[TMP23:%.*]] = bitcast double* [[TMP22]] to <2 x double>*
+; CHECK-NEXT: store <2 x double> [[TMP8]], <2 x double>* [[TMP23]], align 8
+; CHECK-NEXT: [[TMP24:%.*]] = getelementptr double, double* [[TMP20]], i32 4
+; CHECK-NEXT: [[TMP25:%.*]] = bitcast double* [[TMP24]] to <2 x double>*
+; CHECK-NEXT: store <2 x double> [[TMP12]], <2 x double>* [[TMP25]], align 8
+; CHECK-NEXT: [[TMP26:%.*]] = getelementptr double, double* [[TMP20]], i32 6
+; CHECK-NEXT: [[TMP27:%.*]] = bitcast double* [[TMP26]] to <2 x double>*
+; CHECK-NEXT: store <2 x double> [[TMP16]], <2 x double>* [[TMP27]], align 8
+; CHECK-NEXT: call void @foo(<8 x double> [[TMP19]])
+; CHECK-NEXT: ret <8 x double> [[TMP19]]
+;
+ %transposed = call <8 x double> @llvm.matrix.transpose(<8 x double> %in, i32 4, i32 2)
+ store <8 x double> %transposed, <8 x double>* %Ptr
+ call void @foo(<8 x double> %transposed)
+ ret <8 x double> %transposed
+}
+
+declare <8 x double> @llvm.matrix.transpose(<8 x double>, i32, i32)
+
+declare void @foo(<8 x double>)