[llvm] r200576 - [SLPV] Recognize vectorizable intrinsics during SLP vectorization and

Hal Finkel hfinkel at anl.gov
Mon Feb 3 09:24:26 PST 2014


----- Original Message -----
> From: "Hal Finkel" <hfinkel at anl.gov>
> To: "Raul Silvera" <rsilvera at google.com>
> Cc: "Chandler Carruth" <chandlerc at gmail.com>, "llvm-commits" <llvm-commits at cs.uiuc.edu>
> Sent: Monday, February 3, 2014 11:13:41 AM
> Subject: Re: [llvm] r200576 - [SLPV] Recognize vectorizable intrinsics during SLP vectorization and
> 
> ----- Original Message -----
> > From: "Raul Silvera" <rsilvera at google.com>
> > To: "Reid Kleckner" <rnk at google.com>
> > Cc: "Chandler Carruth" <chandlerc at gmail.com>, "llvm-commits"
> > <llvm-commits at cs.uiuc.edu>
> > Sent: Monday, February 3, 2014 10:51:38 AM
> > Subject: Re: [llvm] r200576 - [SLPV] Recognize vectorizable intrinsics during SLP vectorization and
> > 
> > Thank you, Reid. Have you reverted the change?
> > 
> > I think the alternatives are to disable vectorization of bswap on
> > 32-bit or to teach LLVM how to do the expansion.
> > 
> > I'm using the same code as the Loop vectorizer to decide what to
> > vectorize, which makes me wonder whether the same issue could be hit
> > with loop vectorization. I'll investigate a bit further and figure out
> > what to do next.
> > 
> 
> Actually, this is a bit strange. X86ISelLowering.cpp has:
> 
>   for (int i = MVT::FIRST_VECTOR_VALUETYPE;
>            i <= MVT::LAST_VECTOR_VALUETYPE; ++i) {
>     MVT VT = (MVT::SimpleValueType)i;
>     ...
>     setOperationAction(ISD::BSWAP, VT, Expand);
>     ...
>   }
> 
> and so there should be no problem with lowering a vector bswap for
> any vector type, and if there is, that seems like a SDAG bug.

Or if I had read more carefully, I would have seen that this must indeed be the case: the expansion ends up in SelectionDAGLegalize::ExpandBSWAP, which only handles scalar types. It looks like we just don't handle BSWAP in LegalizeVectorOps.
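
For context, the scalar expansion is just shifts, masks, and ORs. Here is a minimal C++ sketch of the i32 case (not LLVM's code, just the dataflow that ExpandBSWAP builds out of ISD::SHL/SRL/AND/OR nodes):

  #include <cstdint>
  #include <cassert>

  // Reverse the bytes of a 32-bit value with shifts, masks, and ORs --
  // the same node-by-node dataflow that SelectionDAGLegalize::ExpandBSWAP
  // emits for scalar integer types.
  uint32_t bswap32(uint32_t X) {
    return  (X << 24) |
           ((X <<  8) & 0x00FF0000u) |
           ((X >>  8) & 0x0000FF00u) |
            (X >> 24);
  }

  int main() {
    assert(bswap32(0x12345678u) == 0x78563412u);
    return 0;
  }

The same shift/mask/or sequence applied elementwise (with splat shift amounts) would presumably be a legal expansion for the vector types here, which is what a LegalizeVectorOps handler could produce.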

That having been said, there must also be another problem: BSWAP is marked as Expand, and so the cost model should assign a high cost to the vector call. Maybe vectorizing the rest of the code makes it worthwhile, but this should be double-checked.
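
To make the cost question concrete, here is the comparison the patch's getEntryCost performs for a call bundle, with made-up numbers (the real values come from TTI->getIntrinsicInstrCost):

  #include <iostream>

  // Sketch of the SLP net-cost test for an intrinsic bundle, using
  // hypothetical costs. If BSWAP is Expand on the target, the vector
  // intrinsic cost should be large, so the net cost comes out positive
  // (unprofitable) unless other entries in the tree compensate.
  int main() {
    const int NumElements = 2;      // e.g. a <2 x i64> bswap bundle
    const int ScalarBswapCost = 1;  // hypothetical per-element scalar cost
    const int VectorBswapCost = 10; // hypothetical cost of an Expand'd v2i64 bswap

    const int ScalarCallCost = NumElements * ScalarBswapCost;            // 2
    const int VecCallCost = VectorBswapCost;                             // 10
    std::cout << "net cost: " << (VecCallCost - ScalarCallCost) << "\n"; // 8 > 0
    return 0;
  }

A positive net cost should have kept the bundle from vectorizing, so either the bswap cost query is not returning a high value, or profitable neighbors (the loads and stores) outweigh it.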

 -Hal

> 
>  -Hal
> 
> > 
> > On Fri, Jan 31, 2014 at 5:40 PM, Reid Kleckner <rnk at google.com> wrote:
> > 
> > Hey Raul,
> > 
> > This patch broke the 32-bit self-host build. The x86 backend claims
> > that the legalizer knows how to expand bswap intrinsics on vector
> > types, but this is not the case. Running llc on the test case I gave
> > produces:
> > 
> > Unhandled Expand type in BSWAP!
> > UNREACHABLE executed at
> > ..\lib\CodeGen\SelectionDAG\LegalizeDAG.cpp:2537!
> > 
> > 
> > I'm going to revert this for now, but what should happen is that we
> > should learn how to expand bswap on vectors.
> > 
> > 
> > Reid
> > 
> > On Fri, Jan 31, 2014 at 5:17 PM, Reid Kleckner <rnk at google.com> wrote:
> > 
> > This change caused us to vectorize bswap, which we then try to expand
> > on i686, hitting an unreachable. Running llc on this reproduces it:
> > 
> > declare <2 x i64> @llvm.bswap.v2i64(<2 x i64>) #8
> > 
> > define <2 x i64> @foo(<2 x i64> %v) {
> >   %s = call <2 x i64> @llvm.bswap.v2i64(<2 x i64> %v)
> >   ret <2 x i64> %s
> > }
> > 
> > attributes #8 = { nounwind readnone }
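> > 
> > (For illustration, a plausible invocation -- the exact command wasn't
> > given; a 32-bit x86 target where v2i64 is a legal type should reach the
> > unreachable, e.g.:
> > 
> >   llc -mtriple=i686-unknown-linux-gnu -mattr=+sse2 bswap.ll
> > 
> > where bswap.ll holds the IR above.)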
> > 
> > 
> > I don't have a reduced test case of input to the SLP vectorizer yet
> > because I'm new to reducing LLVM IR.
> > 
> > On Fri, Jan 31, 2014 at 1:14 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
> > 
> > Author: chandlerc
> > Date: Fri Jan 31 15:14:40 2014
> > New Revision: 200576
> > 
> > URL: http://llvm.org/viewvc/llvm-project?rev=200576&view=rev
> > Log:
> > [SLPV] Recognize vectorizable intrinsics during SLP vectorization and
> > transform accordingly. Based on similar code from Loop vectorization.
> > Subsequent commits will include vectorization of function calls to
> > vector intrinsics and form function calls to vector library calls.
> > 
> > Patch by Raul Silvera! (Much delayed due to my not running dcommit)
> > 
> > Added:
> >     llvm/trunk/test/Transforms/SLPVectorizer/X86/intrinsic.ll
> > Modified:
> >     llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp
> > 
> > Modified: llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp?rev=200576&r1=200575&r2=200576&view=diff
> > ==============================================================================
> > --- llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp (original)
> > +++ llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp Fri Jan 31 15:14:40 2014
> > @@ -947,6 +947,39 @@ void BoUpSLP::buildTree_rec(ArrayRef<Val
> >        buildTree_rec(Operands, Depth + 1);
> >        return;
> >      }
> > +    case Instruction::Call: {
> > +      // Check if the calls are all to the same vectorizable intrinsic.
> > +      IntrinsicInst *II = dyn_cast<IntrinsicInst>(VL[0]);
> > +      if (II==NULL) {
> > +        newTreeEntry(VL, false);
> > +        DEBUG(dbgs() << "SLP: Non-vectorizable call.\n");
> > +        return;
> > +      }
> > +
> > +      Intrinsic::ID ID = II->getIntrinsicID();
> > +
> > +      for (unsigned i = 1, e = VL.size(); i != e; ++i) {
> > +        IntrinsicInst *II2 = dyn_cast<IntrinsicInst>(VL[i]);
> > +        if (!II2 || II2->getIntrinsicID() != ID) {
> > +          newTreeEntry(VL, false);
> > +          DEBUG(dbgs() << "SLP: mismatched calls:" << *II << "!=" << *VL[i]
> > +                       << "\n");
> > +          return;
> > +        }
> > +      }
> > +
> > +      newTreeEntry(VL, true);
> > +      for (unsigned i = 0, e = II->getNumArgOperands(); i != e; ++i) {
> > +        ValueList Operands;
> > +        // Prepare the operand vector.
> > +        for (unsigned j = 0; j < VL.size(); ++j) {
> > +          IntrinsicInst *II2 = dyn_cast<IntrinsicInst>(VL[j]);
> > +          Operands.push_back(II2->getArgOperand(i));
> > +        }
> > +        buildTree_rec(Operands, Depth + 1);
> > +      }
> > +      return;
> > +    }
> >      default:
> >        newTreeEntry(VL, false);
> >        DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");
> > @@ -1072,6 +1105,30 @@ int BoUpSLP::getEntryCost(TreeEntry *E)
> >        int VecStCost = TTI->getMemoryOpCost(Instruction::Store, VecTy, 1, 0);
> >        return VecStCost - ScalarStCost;
> >      }
> > +    case Instruction::Call: {
> > +      CallInst *CI = cast<CallInst>(VL0);
> > +      IntrinsicInst *II = cast<IntrinsicInst>(CI);
> > +      Intrinsic::ID ID = II->getIntrinsicID();
> > +
> > +      // Calculate the cost of the scalar and vector calls.
> > +      SmallVector<Type*, 4> ScalarTys, VecTys;
> > +      for (unsigned op = 0, opc = II->getNumArgOperands(); op!= opc; ++op) {
> > +        ScalarTys.push_back(CI->getArgOperand(op)->getType());
> > +        VecTys.push_back(VectorType::get(CI->getArgOperand(op)->getType(),
> > +                                         VecTy->getNumElements()));
> > +      }
> > +
> > +      int ScalarCallCost = VecTy->getNumElements() *
> > +          TTI->getIntrinsicInstrCost(ID, ScalarTy, ScalarTys);
> > +
> > +      int VecCallCost = TTI->getIntrinsicInstrCost(ID, VecTy, VecTys);
> > +
> > +      DEBUG(dbgs() << "SLP: Call cost "<< VecCallCost - ScalarCallCost
> > +            << " (" << VecCallCost << "-" << ScalarCallCost << ")"
> > +            << " for " << *II << "\n");
> > +
> > +      return VecCallCost - ScalarCallCost;
> > +    }
> >      default:
> >        llvm_unreachable("Unknown instruction");
> >    }
> > @@ -1086,10 +1143,10 @@ bool BoUpSLP::isFullyVectorizableTinyTre
> >      return false;
> >  
> >    // Gathering cost would be too much for tiny trees.
> > -  if (VectorizableTree[0].NeedToGather || VectorizableTree[1].NeedToGather)
> > -    return false;
> > +  if (VectorizableTree[0].NeedToGather || VectorizableTree[1].NeedToGather)
> > +    return false;
> >  
> > -  return true;
> > +  return true;
> >  }
> >  
> >  int BoUpSLP::getTreeCost() {
> > @@ -1555,6 +1612,32 @@ Value *BoUpSLP::vectorizeTree(TreeEntry
> >        E->VectorizedValue = S;
> >        return propagateMetadata(S, E->Scalars);
> >      }
> > +    case Instruction::Call: {
> > +      CallInst *CI = cast<CallInst>(VL0);
> > +
> > +      setInsertPointAfterBundle(E->Scalars);
> > +      std::vector<Value *> OpVecs;
> > +      for (int j = 0, e = CI->getNumArgOperands(); j < e; ++j) {
> > +        ValueList OpVL;
> > +        for (int i = 0, e = E->Scalars.size(); i < e; ++i) {
> > +          CallInst *CEI = cast<CallInst>(E->Scalars[i]);
> > +          OpVL.push_back(CEI->getArgOperand(j));
> > +        }
> > +
> > +        Value *OpVec = vectorizeTree(OpVL);
> > +        DEBUG(dbgs() << "SLP: OpVec[" << j << "]: " << *OpVec << "\n");
> > +        OpVecs.push_back(OpVec);
> > +      }
> > +
> > +      Module *M = F->getParent();
> > +      IntrinsicInst *II = cast<IntrinsicInst>(CI);
> > +      Intrinsic::ID ID = II->getIntrinsicID();
> > +      Type *Tys[] = { VectorType::get(CI->getType(), E->Scalars.size()) };
> > +      Function *CF = Intrinsic::getDeclaration(M, ID, Tys);
> > +      Value *V = Builder.CreateCall(CF, OpVecs);
> > +      E->VectorizedValue = V;
> > +      return V;
> > +    }
> >      default:
> >        llvm_unreachable("unknown inst");
> >      }
> > 
> > Added: llvm/trunk/test/Transforms/SLPVectorizer/X86/intrinsic.ll
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/SLPVectorizer/X86/intrinsic.ll?rev=200576&view=auto
> > ==============================================================================
> > --- llvm/trunk/test/Transforms/SLPVectorizer/X86/intrinsic.ll (added)
> > +++ llvm/trunk/test/Transforms/SLPVectorizer/X86/intrinsic.ll Fri Jan 31 15:14:40 2014
> > @@ -0,0 +1,75 @@
> > +; RUN: opt < %s -basicaa -slp-vectorizer -slp-threshold=-999 -dce -S -mtriple=x86_64-apple-macosx10.8.0 -mcpu=corei7-avx | FileCheck %s
> > +
> > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
> > +target triple = "x86_64-apple-macosx10.8.0"
> > +
> > +declare double @llvm.fabs.f64(double) nounwind readnone
> > +
> > +;CHECK-LABEL: @vec_fabs_f64(
> > +;CHECK: load <2 x double>
> > +;CHECK: load <2 x double>
> > +;CHECK: call <2 x double> @llvm.fabs.v2f64
> > +;CHECK: store <2 x double>
> > +;CHECK: ret
> > +define void @vec_fabs_f64(double* %a, double* %b, double* %c) {
> > +entry:
> > +  %i0 = load double* %a, align 8
> > +  %i1 = load double* %b, align 8
> > +  %mul = fmul double %i0, %i1
> > +  %call = tail call double @llvm.fabs.f64(double %mul) nounwind readnone
> > +  %arrayidx3 = getelementptr inbounds double* %a, i64 1
> > +  %i3 = load double* %arrayidx3, align 8
> > +  %arrayidx4 = getelementptr inbounds double* %b, i64 1
> > +  %i4 = load double* %arrayidx4, align 8
> > +  %mul5 = fmul double %i3, %i4
> > +  %call5 = tail call double @llvm.fabs.f64(double %mul5) nounwind readnone
> > +  store double %call, double* %c, align 8
> > +  %arrayidx5 = getelementptr inbounds double* %c, i64 1
> > +  store double %call5, double* %arrayidx5, align 8
> > +  ret void
> > +}
> > +
> > +declare float @llvm.copysign.f32(float, float) nounwind readnone
> > +
> > +;CHECK-LABEL: @vec_copysign_f32(
> > +;CHECK: load <4 x float>
> > +;CHECK: load <4 x float>
> > +;CHECK: call <4 x float> @llvm.copysign.v4f32
> > +;CHECK: store <4 x float>
> > +;CHECK: ret
> > +define void @vec_copysign_f32(float* %a, float* %b, float* noalias %c) {
> > +entry:
> > +  %0 = load float* %a, align 4
> > +  %1 = load float* %b, align 4
> > +  %call0 = tail call float @llvm.copysign.f32(float %0, float %1) nounwind readnone
> > +  store float %call0, float* %c, align 4
> > +
> > +  %ix2 = getelementptr inbounds float* %a, i64 1
> > +  %2 = load float* %ix2, align 4
> > +  %ix3 = getelementptr inbounds float* %b, i64 1
> > +  %3 = load float* %ix3, align 4
> > +  %call1 = tail call float @llvm.copysign.f32(float %2, float %3) nounwind readnone
> > +  %c1 = getelementptr inbounds float* %c, i64 1
> > +  store float %call1, float* %c1, align 4
> > +
> > +  %ix4 = getelementptr inbounds float* %a, i64 2
> > +  %4 = load float* %ix4, align 4
> > +  %ix5 = getelementptr inbounds float* %b, i64 2
> > +  %5 = load float* %ix5, align 4
> > +  %call2 = tail call float @llvm.copysign.f32(float %4, float %5) nounwind readnone
> > +  %c2 = getelementptr inbounds float* %c, i64 2
> > +  store float %call2, float* %c2, align 4
> > +
> > +  %ix6 = getelementptr inbounds float* %a, i64 3
> > +  %6 = load float* %ix6, align 4
> > +  %ix7 = getelementptr inbounds float* %b, i64 3
> > +  %7 = load float* %ix7, align 4
> > +  %call3 = tail call float @llvm.copysign.f32(float %6, float %7) nounwind readnone
> > +  %c3 = getelementptr inbounds float* %c, i64 3
> > +  store float %call3, float* %c3, align 4
> > +
> > +  ret void
> > +}
> > +
> > +
> > +
> > 
> > 
> > _______________________________________________
> > llvm-commits mailing list
> > llvm-commits at cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > 
> > --
> > Raúl E. Silvera
> > 
> > _______________________________________________
> > llvm-commits mailing list
> > llvm-commits at cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory



