[llvm] r200576 - [SLPV] Recognize vectorizable intrinsics during SLP vectorization and

Nadav Rotem nrotem at apple.com
Tue Feb 4 15:57:38 PST 2014


Hi Raul, 

Yes, please send a patch for review. Also, please run the llvm test-suite and report any interesting performance changes; it should be really easy to do with the LNT scripts.

Thanks,
Nadav

On Feb 4, 2014, at 2:42 PM, Raul Silvera <rsilvera at google.com> wrote:

> I had a quick look at the original test case from Reid, and we only vectorize in 32-bit mode. The vectorization cost analysis ends up deciding that there is a small gain to be had (incorrectly, I believe).
> 
> The vectorization cost is computed as the cost of the load, the bswap, and the store:
> The load and the store each get cost -3 (one <2 x i64> vector load/store vs. four i32 ones).
> The bswap (including scalarization overhead) gets cost 4 (6 vector - 2 scalar).
> 
> In total we get cost 4 - 3 - 3 = -2, so we vectorize (anything with cost < 0 is vectorized).
> 
> I think the bug is that the scalarization overhead doesn't take into account the cost of legalizing the vector element type, which here incurs additional cost because the element type is i64 in 32-bit mode. I have a patch that adds the cost of legalizing the element type to BasicTTI::getScalarizationOverhead; with it, this vectorization no longer happens (we get cost 12 - 3 - 3 = 6).
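> 
> To illustrate the idea (this is only a sketch, not the actual patch: numLegalPartsFor and the overall shape are made up for illustration; the real change is inside BasicTTI::getScalarizationOverhead):
> 
>   #include "llvm/Analysis/TargetTransformInfo.h"
>   #include "llvm/IR/DerivedTypes.h"
>   #include "llvm/IR/Instruction.h"
>   using namespace llvm;
> 
>   // Made-up helper: how many legal pieces an element type splits into on
>   // the target (e.g. i64 -> 2 pieces in 32-bit mode).
>   static unsigned numLegalPartsFor(Type *EltTy, bool Is64Bit) {
>     return (!Is64Bit && EltTy->isIntegerTy(64)) ? 2 : 1;
>   }
> 
>   // Sketch of a scalarization overhead that scales the per-element
>   // insert/extract cost by the element-type legalization factor.
>   static unsigned scalarizationOverheadSketch(VectorType *VTy, bool Insert,
>                                               bool Extract, bool Is64Bit,
>                                               const TargetTransformInfo &TTI) {
>     unsigned Cost = 0;
>     for (unsigned i = 0, e = VTy->getNumElements(); i != e; ++i) {
>       if (Insert)
>         Cost += TTI.getVectorInstrCost(Instruction::InsertElement, VTy, i);
>       if (Extract)
>         Cost += TTI.getVectorInstrCost(Instruction::ExtractElement, VTy, i);
>     }
>     // The extra factor is what pushes the bswap cost up in the i64-on-32-bit
>     // case above, giving a total of 12 - 3 - 3 = 6 instead of 4 - 3 - 3 = -2.
>     return Cost * numLegalPartsFor(VTy->getElementType(), Is64Bit);
>   }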
> 
> If this sounds OK in principle I can submit a patch for review.
> 
> 
> 
> On Mon, Feb 3, 2014 at 9:39 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> ----- Original Message -----
> > From: "Reid Kleckner" <rnk at google.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: "Raul Silvera" <rsilvera at google.com>, "Chandler Carruth" <chandlerc at gmail.com>, "llvm-commits"
> > <llvm-commits at cs.uiuc.edu>
> > Sent: Monday, February 3, 2014 11:36:55 AM
> > Subject: Re: [llvm] r200576 - [SLPV] Recognize vectorizable intrinsics during SLP vectorization and
> >
> >
> > Thanks Hal! That was exactly it. But a mystery remains: why did we
> > choose to vectorize here when we knew we'd have to expand it later?
> >
> 
> Well, it could be that the cost savings from everything else made it look worthwhile; and maybe it is worthwhile, but I'm skeptical. We should benchmark the resulting code if possible.
> 
>  -Hal
> 
> >
> >
> > On Mon, Feb 3, 2014 at 9:33 AM, Hal Finkel < hfinkel at anl.gov > wrote:
> >
> >
> >
> > ----- Original Message -----
> > > From: "Hal Finkel" < hfinkel at anl.gov >
> > > To: "Raul Silvera" < rsilvera at google.com >
> > > Cc: "Chandler Carruth" < chandlerc at gmail.com >, "llvm-commits" <
> > > llvm-commits at cs.uiuc.edu >
> >
> >
> > > Sent: Monday, February 3, 2014 11:24:26 AM
> > > Subject: Re: [llvm] r200576 - [SLPV] Recognize vectorizable
> > > intrinsics during SLP vectorization and
> > >
> > > ----- Original Message -----
> > > > From: "Hal Finkel" < hfinkel at anl.gov >
> > > > To: "Raul Silvera" < rsilvera at google.com >
> > > > Cc: "Chandler Carruth" < chandlerc at gmail.com >, "llvm-commits"
> > > > < llvm-commits at cs.uiuc.edu >
> > > > Sent: Monday, February 3, 2014 11:13:41 AM
> > > > Subject: Re: [llvm] r200576 - [SLPV] Recognize vectorizable
> > > > intrinsics during SLP vectorization and
> > > >
> > > > ----- Original Message -----
> > > > > From: "Raul Silvera" < rsilvera at google.com >
> > > > > To: "Reid Kleckner" < rnk at google.com >
> > > > > Cc: "Chandler Carruth" < chandlerc at gmail.com >, "llvm-commits"
> > > > > < llvm-commits at cs.uiuc.edu >
> > > > > Sent: Monday, February 3, 2014 10:51:38 AM
> > > > > Subject: Re: [llvm] r200576 - [SLPV] Recognize vectorizable
> > > > > intrinsics during SLP vectorization and
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Thank you Reid. Have you reverted the change?
> > > > >
> > > > >
> > > > > I think the alternatives are to disable vectorization of bswap
> > > > > on 32-bit or teach llvm how to do the expansion.
> > > > >
> > > > >
> > > > > I'm using the same code as the Loop vectorizer to decide what to
> > > > > vectorize, which makes me wonder if the same issue could be hit
> > > > > with loop vectorization. I'll investigate a bit further and
> > > > > figure out what to do next.
> > > > >
> > > >
> > > > Actually, this is a bit strange. X86ISelLowering.cpp has:
> > > >
> > > >   for (int i = MVT::FIRST_VECTOR_VALUETYPE;
> > > >        i <= MVT::LAST_VECTOR_VALUETYPE; ++i) {
> > > >     MVT VT = (MVT::SimpleValueType)i;
> > > >     ...
> > > >     setOperationAction(ISD::BSWAP, VT, Expand);
> > > >     ...
> > > >   }
> > > >
> > > > and so there should be no problem with lowering a vector bswap for
> > > > any vector type, and if there is, that seems like a SDAG bug.
> > >
> > > Or if I had read more carefully, I would have seen that this must be
> > > the case (the code ends up in SelectionDAGLegalize::ExpandBSWAP,
> > > which only handles scalar types). Looks like we just don't handle
> > > this in LegalizeVectorOps.
> >
> > I fixed this in r200705.
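> >
> > For anyone following along: the expansion only needs to produce, for each
> > lane, the same result that SelectionDAGLegalize::ExpandBSWAP's shift/mask
> > sequence computes for a scalar. A rough C++ illustration of that, applied
> > lane by lane (a sketch of the idea only, not the literal r200705 change):
> >
> >   #include <cstdint>
> >
> >   // Byte-swap one 64-bit element the way a shift/mask expansion would.
> >   static uint64_t bswap64(uint64_t V) {
> >     V = ((V & 0x00FF00FF00FF00FFULL) << 8)  | ((V >> 8)  & 0x00FF00FF00FF00FFULL);
> >     V = ((V & 0x0000FFFF0000FFFFULL) << 16) | ((V >> 16) & 0x0000FFFF0000FFFFULL);
> >     return (V << 32) | (V >> 32);
> >   }
> >
> >   // A <2 x i64> bswap is then just the same operation on each lane.
> >   static void bswap_v2i64(uint64_t V[2]) {
> >     V[0] = bswap64(V[0]);
> >     V[1] = bswap64(V[1]);
> >   }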
> >
> > -Hal
> >
> >
> >
> > >
> > > That having been said, there must also be another problem because
> > > BSWAP is marked as expand, and so it should have a high cost. Maybe
> > > vectorizing the rest of the code makes it worthwhile, but this
> > > should be double-checked.
> > >
> > > -Hal
> > >
> > > >
> > > > -Hal
> > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Jan 31, 2014 at 5:40 PM, Reid Kleckner <rnk at google.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > Hey Raul,
> > > > >
> > > > >
> > > > > This patch broke the 32-bit self-host build. The x86 backend
> > > > > claims that the legalizer knows how to expand bswap intrinsics
> > > > > on vector types, but this is not the case. Running llc on the
> > > > > test case I gave produces:
> > > > >
> > > > > Unhandled Expand type in BSWAP!
> > > > > UNREACHABLE executed at ..\lib\CodeGen\SelectionDAG\LegalizeDAG.cpp:2537!
> > > > >
> > > > >
> > > > > I'm going to revert this for now, but what should happen is that
> > > > > we should learn how to expand bswap on vectors.
> > > > >
> > > > >
> > > > > Reid
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Jan 31, 2014 at 5:17 PM, Reid Kleckner <rnk at google.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > This change caused us to vectorize bswap, which we then try to
> > > > > expand on i686, which hits an unreachable. Running llc on this
> > > > > repros:
> > > > >
> > > > >
> > > > >
> > > > > declare <2 x i64> @llvm.bswap.v2i64(<2 x i64>) #8
> > > > > define <2 x i64> @foo(<2 x i64> %v) {
> > > > >   %s = call <2 x i64> @llvm.bswap.v2i64(<2 x i64> %v)
> > > > >   ret <2 x i64> %s
> > > > > }
> > > > > attributes #8 = { nounwind readnone }
> > > > >
> > > > >
> > > > > I don't have a reduced test case of input to the SLP vectorizer
> > > > > yet because I'm new to reducing LLVM IR.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Jan 31, 2014 at 1:14 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Author: chandlerc
> > > > > Date: Fri Jan 31 15:14:40 2014
> > > > > New Revision: 200576
> > > > >
> > > > > URL: http://llvm.org/viewvc/llvm-project?rev=200576&view=rev
> > > > > Log:
> > > > > [SLPV] Recognize vectorizable intrinsics during SLP vectorization
> > > > > and transform accordingly. Based on similar code from Loop
> > > > > vectorization. Subsequent commits will include vectorization of
> > > > > function calls to vector intrinsics and form function calls to
> > > > > vector library calls.
> > > > >
> > > > > Patch by Raul Silvera! (Much delayed due to my not running dcommit)
> > > > >
> > > > > Added:
> > > > > llvm/trunk/test/Transforms/SLPVectorizer/X86/intrinsic.ll
> > > > > Modified:
> > > > > llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp
> > > > >
> > > > > Modified: llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp
> > > > > URL:
> > > > > http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp?rev=200576&r1=200575&r2=200576&view=diff
> > > > > ==============================================================================
> > > > > --- llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp (original)
> > > > > +++ llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp Fri Jan 31 15:14:40 2014
> > > > > @@ -947,6 +947,39 @@ void BoUpSLP::buildTree_rec(ArrayRef<Val
> > > > >       buildTree_rec(Operands, Depth + 1);
> > > > >       return;
> > > > >     }
> > > > > +    case Instruction::Call: {
> > > > > +      // Check if the calls are all to the same vectorizable intrinsic.
> > > > > +      IntrinsicInst *II = dyn_cast<IntrinsicInst>(VL[0]);
> > > > > +      if (II==NULL) {
> > > > > +        newTreeEntry(VL, false);
> > > > > +        DEBUG(dbgs() << "SLP: Non-vectorizable call.\n");
> > > > > +        return;
> > > > > +      }
> > > > > +
> > > > > +      Intrinsic::ID ID = II->getIntrinsicID();
> > > > > +
> > > > > +      for (unsigned i = 1, e = VL.size(); i != e; ++i) {
> > > > > +        IntrinsicInst *II2 = dyn_cast<IntrinsicInst>(VL[i]);
> > > > > +        if (!II2 || II2->getIntrinsicID() != ID) {
> > > > > +          newTreeEntry(VL, false);
> > > > > +          DEBUG(dbgs() << "SLP: mismatched calls:" << *II << "!=" << *VL[i]
> > > > > +                << "\n");
> > > > > +          return;
> > > > > +        }
> > > > > +      }
> > > > > +
> > > > > +      newTreeEntry(VL, true);
> > > > > +      for (unsigned i = 0, e = II->getNumArgOperands(); i != e; ++i) {
> > > > > +        ValueList Operands;
> > > > > +        // Prepare the operand vector.
> > > > > +        for (unsigned j = 0; j < VL.size(); ++j) {
> > > > > +          IntrinsicInst *II2 = dyn_cast<IntrinsicInst>(VL[j]);
> > > > > +          Operands.push_back(II2->getArgOperand(i));
> > > > > +        }
> > > > > +        buildTree_rec(Operands, Depth + 1);
> > > > > +      }
> > > > > +      return;
> > > > > +    }
> > > > >     default:
> > > > >       newTreeEntry(VL, false);
> > > > >       DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");
> > > > > @@ -1072,6 +1105,30 @@ int BoUpSLP::getEntryCost(TreeEntry *E)
> > > > >       int VecStCost = TTI->getMemoryOpCost(Instruction::Store, VecTy, 1, 0);
> > > > >       return VecStCost - ScalarStCost;
> > > > >     }
> > > > > +    case Instruction::Call: {
> > > > > +      CallInst *CI = cast<CallInst>(VL0);
> > > > > +      IntrinsicInst *II = cast<IntrinsicInst>(CI);
> > > > > +      Intrinsic::ID ID = II->getIntrinsicID();
> > > > > +
> > > > > +      // Calculate the cost of the scalar and vector calls.
> > > > > +      SmallVector<Type*, 4> ScalarTys, VecTys;
> > > > > +      for (unsigned op = 0, opc = II->getNumArgOperands(); op!= opc; ++op) {
> > > > > +        ScalarTys.push_back(CI->getArgOperand(op)->getType());
> > > > > +        VecTys.push_back(VectorType::get(CI->getArgOperand(op)->getType(),
> > > > > +                                         VecTy->getNumElements()));
> > > > > +      }
> > > > > +
> > > > > +      int ScalarCallCost = VecTy->getNumElements() *
> > > > > +          TTI->getIntrinsicInstrCost(ID, ScalarTy, ScalarTys);
> > > > > +
> > > > > +      int VecCallCost = TTI->getIntrinsicInstrCost(ID, VecTy, VecTys);
> > > > > +
> > > > > +      DEBUG(dbgs() << "SLP: Call cost "<< VecCallCost - ScalarCallCost
> > > > > +            << " (" << VecCallCost << "-" << ScalarCallCost << ")"
> > > > > +            << " for " << *II << "\n");
> > > > > +
> > > > > +      return VecCallCost - ScalarCallCost;
> > > > > +    }
> > > > >     default:
> > > > >       llvm_unreachable("Unknown instruction");
> > > > >     }
> > > > > @@ -1086,10 +1143,10 @@ bool BoUpSLP::isFullyVectorizableTinyTre
> > > > >     return false;
> > > > >
> > > > >   // Gathering cost would be too much for tiny trees.
> > > > > -  if (VectorizableTree[0].NeedToGather || VectorizableTree[1].NeedToGather)
> > > > > -    return false;
> > > > > +  if (VectorizableTree[0].NeedToGather || VectorizableTree[1].NeedToGather)
> > > > > +    return false;
> > > > >
> > > > > -  return true;
> > > > > +  return true;
> > > > > }
> > > > >
> > > > > int BoUpSLP::getTreeCost() {
> > > > > @@ -1555,6 +1612,32 @@ Value *BoUpSLP::vectorizeTree(TreeEntry
> > > > >       E->VectorizedValue = S;
> > > > >       return propagateMetadata(S, E->Scalars);
> > > > >     }
> > > > > +    case Instruction::Call: {
> > > > > +      CallInst *CI = cast<CallInst>(VL0);
> > > > > +
> > > > > +      setInsertPointAfterBundle(E->Scalars);
> > > > > +      std::vector<Value *> OpVecs;
> > > > > +      for (int j = 0, e = CI->getNumArgOperands(); j < e; ++j) {
> > > > > +        ValueList OpVL;
> > > > > +        for (int i = 0, e = E->Scalars.size(); i < e; ++i) {
> > > > > +          CallInst *CEI = cast<CallInst>(E->Scalars[i]);
> > > > > +          OpVL.push_back(CEI->getArgOperand(j));
> > > > > +        }
> > > > > +
> > > > > +        Value *OpVec = vectorizeTree(OpVL);
> > > > > +        DEBUG(dbgs() << "SLP: OpVec[" << j << "]: " << *OpVec << "\n");
> > > > > +        OpVecs.push_back(OpVec);
> > > > > +      }
> > > > > +
> > > > > +      Module *M = F->getParent();
> > > > > +      IntrinsicInst *II = cast<IntrinsicInst>(CI);
> > > > > +      Intrinsic::ID ID = II->getIntrinsicID();
> > > > > +      Type *Tys[] = { VectorType::get(CI->getType(), E->Scalars.size()) };
> > > > > +      Function *CF = Intrinsic::getDeclaration(M, ID, Tys);
> > > > > +      Value *V = Builder.CreateCall(CF, OpVecs);
> > > > > +      E->VectorizedValue = V;
> > > > > +      return V;
> > > > > +    }
> > > > >     default:
> > > > >       llvm_unreachable("unknown inst");
> > > > >     }
> > > > >
> > > > > Added:
> > > > > llvm/trunk/test/Transforms/SLPVectorizer/X86/intrinsic.ll
> > > > > URL:
> > > > > http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/SLPVectorizer/X86/intrinsic.ll?rev=200576&view=auto
> > > > > ==============================================================================
> > > > > --- llvm/trunk/test/Transforms/SLPVectorizer/X86/intrinsic.ll (added)
> > > > > +++ llvm/trunk/test/Transforms/SLPVectorizer/X86/intrinsic.ll Fri Jan 31 15:14:40 2014
> > > > > @@ -0,0 +1,75 @@
> > > > > +; RUN: opt < %s -basicaa -slp-vectorizer -slp-threshold=-999 -dce -S -mtriple=x86_64-apple-macosx10.8.0 -mcpu=corei7-avx | FileCheck %s
> > > > > +
> > > > > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
> > > > > +target triple = "x86_64-apple-macosx10.8.0"
> > > > > +
> > > > > +declare double @llvm.fabs.f64(double) nounwind readnone
> > > > > +
> > > > > +;CHECK-LABEL: @vec_fabs_f64(
> > > > > +;CHECK: load <2 x double>
> > > > > +;CHECK: load <2 x double>
> > > > > +;CHECK: call <2 x double> @llvm.fabs.v2f64
> > > > > +;CHECK: store <2 x double>
> > > > > +;CHECK: ret
> > > > > +define void @vec_fabs_f64(double* %a, double* %b, double* %c) {
> > > > > +entry:
> > > > > +  %i0 = load double* %a, align 8
> > > > > +  %i1 = load double* %b, align 8
> > > > > +  %mul = fmul double %i0, %i1
> > > > > +  %call = tail call double @llvm.fabs.f64(double %mul) nounwind readnone
> > > > > +  %arrayidx3 = getelementptr inbounds double* %a, i64 1
> > > > > +  %i3 = load double* %arrayidx3, align 8
> > > > > +  %arrayidx4 = getelementptr inbounds double* %b, i64 1
> > > > > +  %i4 = load double* %arrayidx4, align 8
> > > > > +  %mul5 = fmul double %i3, %i4
> > > > > +  %call5 = tail call double @llvm.fabs.f64(double %mul5) nounwind readnone
> > > > > +  store double %call, double* %c, align 8
> > > > > +  %arrayidx5 = getelementptr inbounds double* %c, i64 1
> > > > > +  store double %call5, double* %arrayidx5, align 8
> > > > > +  ret void
> > > > > +}
> > > > > +
> > > > > +declare float @llvm.copysign.f32(float, float) nounwind readnone
> > > > > +
> > > > > +;CHECK-LABEL: @vec_copysign_f32(
> > > > > +;CHECK: load <4 x float>
> > > > > +;CHECK: load <4 x float>
> > > > > +;CHECK: call <4 x float> @llvm.copysign.v4f32
> > > > > +;CHECK: store <4 x float>
> > > > > +;CHECK: ret
> > > > > +define void @vec_copysign_f32(float* %a, float* %b, float* noalias %c) {
> > > > > +entry:
> > > > > +  %0 = load float* %a, align 4
> > > > > +  %1 = load float* %b, align 4
> > > > > +  %call0 = tail call float @llvm.copysign.f32(float %0, float %1) nounwind readnone
> > > > > +  store float %call0, float* %c, align 4
> > > > > +
> > > > > +  %ix2 = getelementptr inbounds float* %a, i64 1
> > > > > +  %2 = load float* %ix2, align 4
> > > > > +  %ix3 = getelementptr inbounds float* %b, i64 1
> > > > > +  %3 = load float* %ix3, align 4
> > > > > +  %call1 = tail call float @llvm.copysign.f32(float %2, float %3) nounwind readnone
> > > > > +  %c1 = getelementptr inbounds float* %c, i64 1
> > > > > +  store float %call1, float* %c1, align 4
> > > > > +
> > > > > +  %ix4 = getelementptr inbounds float* %a, i64 2
> > > > > +  %4 = load float* %ix4, align 4
> > > > > +  %ix5 = getelementptr inbounds float* %b, i64 2
> > > > > +  %5 = load float* %ix5, align 4
> > > > > +  %call2 = tail call float @llvm.copysign.f32(float %4, float %5) nounwind readnone
> > > > > +  %c2 = getelementptr inbounds float* %c, i64 2
> > > > > +  store float %call2, float* %c2, align 4
> > > > > +
> > > > > +  %ix6 = getelementptr inbounds float* %a, i64 3
> > > > > +  %6 = load float* %ix6, align 4
> > > > > +  %ix7 = getelementptr inbounds float* %b, i64 3
> > > > > +  %7 = load float* %ix7, align 4
> > > > > +  %call3 = tail call float @llvm.copysign.f32(float %6, float %7) nounwind readnone
> > > > > +  %c3 = getelementptr inbounds float* %c, i64 3
> > > > > +  store float %call3, float* %c3, align 4
> > > > > +
> > > > > +  ret void
> > > > > +}
> > > > > +
> > > > > +
> > > > > +
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > llvm-commits mailing list
> > > > > llvm-commits at cs.uiuc.edu
> > > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > >
> > > > > Raúl E. Silvera
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > llvm-commits mailing list
> > > > > llvm-commits at cs.uiuc.edu
> > > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > > > >
> > > >
> > > > --
> > > > Hal Finkel
> > > > Assistant Computational Scientist
> > > > Leadership Computing Facility
> > > > Argonne National Laboratory
> > > >
> > > > _______________________________________________
> > > > llvm-commits mailing list
> > > > llvm-commits at cs.uiuc.edu
> > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > > >
> > >
> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> > >
> > > _______________________________________________
> > > llvm-commits mailing list
> > > llvm-commits at cs.uiuc.edu
> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > >
> >
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> >
> > _______________________________________________
> > llvm-commits mailing list
> > llvm-commits at cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> >
> >
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
> 
> 
> -- 
>  Raúl E. Silvera 
> 
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
