[llvm] r200576 - [SLPV] Recognize vectorizable intrinsics during SLP vectorization and

Hal Finkel hfinkel at anl.gov
Mon Feb 3 09:13:41 PST 2014


----- Original Message -----
> From: "Raul Silvera" <rsilvera at google.com>
> To: "Reid Kleckner" <rnk at google.com>
> Cc: "Chandler Carruth" <chandlerc at gmail.com>, "llvm-commits" <llvm-commits at cs.uiuc.edu>
> Sent: Monday, February 3, 2014 10:51:38 AM
> Subject: Re: [llvm] r200576 - [SLPV] Recognize vectorizable intrinsics during SLP vectorization and
> 
> Thank you, Reid. Have you reverted the change?
> 
> I think the alternatives are to disable vectorization of bswap on
> 32-bit x86 or to teach LLVM how to do the expansion.
> 
> I'm using the same code as the Loop vectorizer to decide what to
> vectorize, which makes me wonder if the same issue could be hit with
> loop vectorization. I'll investigate a bit further and figure out
> what to do next.
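
For concreteness, the loop-vectorizer analogue would be a loop like the one below. This is a hypothetical reproducer, not one from the report: clang lowers __builtin_bswap64 to @llvm.bswap.i64, so if the loop vectorizer widens the call on a 32-bit x86 target, it should form the same @llvm.bswap.v2i64 that crashed the legalizer here.

  // Hypothetical reproducer (untested assumption, see above): if the loop
  // vectorizer widens this call, it emits a vector llvm.bswap intrinsic,
  // which the 32-bit x86 backend would then have to expand.
  #include <cstddef>
  #include <cstdint>

  void bswap_all(uint64_t *A, std::size_t N) {
    for (std::size_t I = 0; I != N; ++I)
      A[I] = __builtin_bswap64(A[I]); // clang: call to @llvm.bswap.i64
  }
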
Actually, this is a bit strange. X86ISelLowering.cpp has:

  for (int i = MVT::FIRST_VECTOR_VALUETYPE;
           i <= MVT::LAST_VECTOR_VALUETYPE; ++i) {
    MVT VT = (MVT::SimpleValueType)i;
    ...
    setOperationAction(ISD::BSWAP, VT, Expand);
    ...
  }

and so there should be no problem lowering a vector bswap for any vector type; if there is, that seems like a SDAG bug.
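
To be concrete about what the expansion needs to produce: a vector bswap is just a fixed byte permutation within each lane, so it can always be lowered as a byte shuffle (or, worst case, scalarized). A minimal illustration in plain C++ (a sketch of the semantics only, not the legalizer's actual code):

  // Sketch only: bswap on <2 x i64> == reversing the bytes of each 64-bit lane.
  #include <cstdint>
  #include <cstring>

  void bswap_v2i64(uint64_t V[2]) {
    uint8_t Bytes[16], Swapped[16];
    std::memcpy(Bytes, V, sizeof(Bytes));
    for (int Lane = 0; Lane < 2; ++Lane)
      for (int I = 0; I < 8; ++I)
        Swapped[Lane * 8 + I] = Bytes[Lane * 8 + (7 - I)]; // reverse within lane
    std::memcpy(V, Swapped, sizeof(Swapped));
  }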

 -Hal

> 
> On Fri, Jan 31, 2014 at 5:40 PM, Reid Kleckner <rnk at google.com> wrote:
> 
> Hey Raul,
> 
> This patch broke the 32-bit self-host build. The x86 backend claims
> that the legalizer knows how to expand bswap intrinsics on vector
> types, but this is not the case. Running llc on the test case I gave
> produces:
> 
> Unhandled Expand type in BSWAP!
> UNREACHABLE executed at ..\lib\CodeGen\SelectionDAG\LegalizeDAG.cpp:2537!
> 
> I'm going to revert this for now, but ultimately we should teach LLVM
> how to expand bswap on vectors.
> 
> Reid
> 
> On Fri, Jan 31, 2014 at 5:17 PM, Reid Kleckner <rnk at google.com> wrote:
> 
> This change caused us to vectorize bswap, which we then try to expand
> on i686, hitting an unreachable. Running llc on this reproduces it:
> 
> declare <2 x i64> @llvm.bswap.v2i64(<2 x i64>) #8
> 
> define <2 x i64> @foo(<2 x i64> %v) {
>   %s = call <2 x i64> @llvm.bswap.v2i64(<2 x i64> %v)
>   ret <2 x i64> %s
> }
> 
> attributes #8 = { nounwind readnone }
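
(To reproduce, run llc on that file with a 32-bit x86 triple, e.g. llc -mtriple=i686-pc-win32 bswap.ll; that should hit the unreachable. The exact triple and file name here are my guesses, not taken from the report.)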
> 
> I don't have a reduced test case of input to the SLP vectorizer yet
> because I'm new to reducing LLVM IR.
> 
> On Fri, Jan 31, 2014 at 1:14 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
> 
> Author: chandlerc
> Date: Fri Jan 31 15:14:40 2014
> New Revision: 200576
> 
> URL: http://llvm.org/viewvc/llvm-project?rev=200576&view=rev
> Log:
> [SLPV] Recognize vectorizable intrinsics during SLP vectorization and
> transform accordingly. Based on similar code from Loop vectorization.
> Subsequent commits will include vectorization of function calls to
> vector intrinsics and form function calls to vector library calls.
> 
> Patch by Raul Silvera! (Much delayed due to my not running dcommit)
> 
> Added:
> llvm/trunk/test/Transforms/SLPVectorizer/X86/intrinsic.ll
> Modified:
> llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp
> 
> Modified: llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp
> URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp?rev=200576&r1=200575&r2=200576&view=diff
> ==============================================================================
> --- llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp (original)
> +++ llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp Fri Jan 31 15:14:40 2014
> @@ -947,6 +947,39 @@ void BoUpSLP::buildTree_rec(ArrayRef<Val
>        buildTree_rec(Operands, Depth + 1);
>        return;
>      }
> +    case Instruction::Call: {
> +      // Check if the calls are all to the same vectorizable intrinsic.
> +      IntrinsicInst *II = dyn_cast<IntrinsicInst>(VL[0]);
> +      if (II==NULL) {
> +        newTreeEntry(VL, false);
> +        DEBUG(dbgs() << "SLP: Non-vectorizable call.\n");
> +        return;
> +      }
> +
> +      Intrinsic::ID ID = II->getIntrinsicID();
> +
> +      for (unsigned i = 1, e = VL.size(); i != e; ++i) {
> +        IntrinsicInst *II2 = dyn_cast<IntrinsicInst>(VL[i]);
> +        if (!II2 || II2->getIntrinsicID() != ID) {
> +          newTreeEntry(VL, false);
> +          DEBUG(dbgs() << "SLP: mismatched calls:" << *II << "!=" << *VL[i]
> +                << "\n");
> +          return;
> +        }
> +      }
> +
> +      newTreeEntry(VL, true);
> +      for (unsigned i = 0, e = II->getNumArgOperands(); i != e; ++i) {
> +        ValueList Operands;
> +        // Prepare the operand vector.
> +        for (unsigned j = 0; j < VL.size(); ++j) {
> +          IntrinsicInst *II2 = dyn_cast<IntrinsicInst>(VL[j]);
> +          Operands.push_back(II2->getArgOperand(i));
> +        }
> +        buildTree_rec(Operands, Depth + 1);
> +      }
> +      return;
> +    }
>      default:
>        newTreeEntry(VL, false);
>        DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");
> @@ -1072,6 +1105,30 @@ int BoUpSLP::getEntryCost(TreeEntry *E)
>        int VecStCost = TTI->getMemoryOpCost(Instruction::Store, VecTy, 1, 0);
>        return VecStCost - ScalarStCost;
>      }
> +    case Instruction::Call: {
> +      CallInst *CI = cast<CallInst>(VL0);
> +      IntrinsicInst *II = cast<IntrinsicInst>(CI);
> +      Intrinsic::ID ID = II->getIntrinsicID();
> +
> +      // Calculate the cost of the scalar and vector calls.
> +      SmallVector<Type*, 4> ScalarTys, VecTys;
> +      for (unsigned op = 0, opc = II->getNumArgOperands(); op != opc; ++op) {
> +        ScalarTys.push_back(CI->getArgOperand(op)->getType());
> +        VecTys.push_back(VectorType::get(CI->getArgOperand(op)->getType(),
> +                                         VecTy->getNumElements()));
> +      }
> +
> +      int ScalarCallCost = VecTy->getNumElements() *
> +          TTI->getIntrinsicInstrCost(ID, ScalarTy, ScalarTys);
> +
> +      int VecCallCost = TTI->getIntrinsicInstrCost(ID, VecTy, VecTys);
> +
> +      DEBUG(dbgs() << "SLP: Call cost " << VecCallCost - ScalarCallCost
> +            << " (" << VecCallCost << "-" << ScalarCallCost << ")"
> +            << " for " << *II << "\n");
> +
> +      return VecCallCost - ScalarCallCost;
> +    }
>      default:
>        llvm_unreachable("Unknown instruction");
>      }
> @@ -1086,10 +1143,10 @@ bool BoUpSLP::isFullyVectorizableTinyTre
>      return false;
>  
>    // Gathering cost would be too much for tiny trees.
> -  if (VectorizableTree[0].NeedToGather || VectorizableTree[1].NeedToGather)
> -    return false;
> +  if (VectorizableTree[0].NeedToGather || VectorizableTree[1].NeedToGather)
> +    return false;
>  
> -  return true;
> +  return true;
>  }
>  
>  int BoUpSLP::getTreeCost() {
> @@ -1555,6 +1612,32 @@ Value *BoUpSLP::vectorizeTree(TreeEntry
>        E->VectorizedValue = S;
>        return propagateMetadata(S, E->Scalars);
>      }
> +    case Instruction::Call: {
> +      CallInst *CI = cast<CallInst>(VL0);
> +
> +      setInsertPointAfterBundle(E->Scalars);
> +      std::vector<Value *> OpVecs;
> +      for (int j = 0, e = CI->getNumArgOperands(); j < e; ++j) {
> +        ValueList OpVL;
> +        for (int i = 0, e = E->Scalars.size(); i < e; ++i) {
> +          CallInst *CEI = cast<CallInst>(E->Scalars[i]);
> +          OpVL.push_back(CEI->getArgOperand(j));
> +        }
> +
> +        Value *OpVec = vectorizeTree(OpVL);
> +        DEBUG(dbgs() << "SLP: OpVec[" << j << "]: " << *OpVec << "\n");
> +        OpVecs.push_back(OpVec);
> +      }
> +
> +      Module *M = F->getParent();
> +      IntrinsicInst *II = cast<IntrinsicInst>(CI);
> +      Intrinsic::ID ID = II->getIntrinsicID();
> +      Type *Tys[] = { VectorType::get(CI->getType(), E->Scalars.size()) };
> +      Function *CF = Intrinsic::getDeclaration(M, ID, Tys);
> +      Value *V = Builder.CreateCall(CF, OpVecs);
> +      E->VectorizedValue = V;
> +      return V;
> +    }
>      default:
>        llvm_unreachable("unknown inst");
>      }
> 
> Added: llvm/trunk/test/Transforms/SLPVectorizer/X86/intrinsic.ll
> URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/SLPVectorizer/X86/intrinsic.ll?rev=200576&view=auto
> ==============================================================================
> --- llvm/trunk/test/Transforms/SLPVectorizer/X86/intrinsic.ll (added)
> +++ llvm/trunk/test/Transforms/SLPVectorizer/X86/intrinsic.ll Fri Jan 31 15:14:40 2014
> @@ -0,0 +1,75 @@
> +; RUN: opt < %s -basicaa -slp-vectorizer -slp-threshold=-999 -dce -S -mtriple=x86_64-apple-macosx10.8.0 -mcpu=corei7-avx | FileCheck %s
> +
> +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
> +target triple = "x86_64-apple-macosx10.8.0"
> +
> +declare double @llvm.fabs.f64(double) nounwind readnone
> +
> +;CHECK-LABEL: @vec_fabs_f64(
> +;CHECK: load <2 x double>
> +;CHECK: load <2 x double>
> +;CHECK: call <2 x double> @llvm.fabs.v2f64
> +;CHECK: store <2 x double>
> +;CHECK: ret
> +define void @vec_fabs_f64(double* %a, double* %b, double* %c) {
> +entry:
> +  %i0 = load double* %a, align 8
> +  %i1 = load double* %b, align 8
> +  %mul = fmul double %i0, %i1
> +  %call = tail call double @llvm.fabs.f64(double %mul) nounwind readnone
> +  %arrayidx3 = getelementptr inbounds double* %a, i64 1
> +  %i3 = load double* %arrayidx3, align 8
> +  %arrayidx4 = getelementptr inbounds double* %b, i64 1
> +  %i4 = load double* %arrayidx4, align 8
> +  %mul5 = fmul double %i3, %i4
> +  %call5 = tail call double @llvm.fabs.f64(double %mul5) nounwind readnone
> +  store double %call, double* %c, align 8
> +  %arrayidx5 = getelementptr inbounds double* %c, i64 1
> +  store double %call5, double* %arrayidx5, align 8
> +  ret void
> +}
> +
> +declare float @llvm.copysign.f32(float, float) nounwind readnone
> +
> +;CHECK-LABEL: @vec_copysign_f32(
> +;CHECK: load <4 x float>
> +;CHECK: load <4 x float>
> +;CHECK: call <4 x float> @llvm.copysign.v4f32
> +;CHECK: store <4 x float>
> +;CHECK: ret
> +define void @vec_copysign_f32(float* %a, float* %b, float* noalias %c) {
> +entry:
> +  %0 = load float* %a, align 4
> +  %1 = load float* %b, align 4
> +  %call0 = tail call float @llvm.copysign.f32(float %0, float %1) nounwind readnone
> +  store float %call0, float* %c, align 4
> +
> +  %ix2 = getelementptr inbounds float* %a, i64 1
> +  %2 = load float* %ix2, align 4
> +  %ix3 = getelementptr inbounds float* %b, i64 1
> +  %3 = load float* %ix3, align 4
> +  %call1 = tail call float @llvm.copysign.f32(float %2, float %3) nounwind readnone
> +  %c1 = getelementptr inbounds float* %c, i64 1
> +  store float %call1, float* %c1, align 4
> +
> +  %ix4 = getelementptr inbounds float* %a, i64 2
> +  %4 = load float* %ix4, align 4
> +  %ix5 = getelementptr inbounds float* %b, i64 2
> +  %5 = load float* %ix5, align 4
> +  %call2 = tail call float @llvm.copysign.f32(float %4, float %5) nounwind readnone
> +  %c2 = getelementptr inbounds float* %c, i64 2
> +  store float %call2, float* %c2, align 4
> +
> +  %ix6 = getelementptr inbounds float* %a, i64 3
> +  %6 = load float* %ix6, align 4
> +  %ix7 = getelementptr inbounds float* %b, i64 3
> +  %7 = load float* %ix7, align 4
> +  %call3 = tail call float @llvm.copysign.f32(float %6, float %7) nounwind readnone
> +  %c3 = getelementptr inbounds float* %c, i64 3
> +  store float %call3, float* %c3, align 4
> +
> +  ret void
> +}
> +
> +
> 
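
One note on the cost logic in getEntryCost above, with illustrative numbers (actual TTI costs are target-dependent): for a bundle of two scalar @llvm.fabs.f64 calls, ScalarCallCost is 2 times the TTI cost of the scalar intrinsic, and VecCallCost is the TTI cost of one @llvm.fabs.v2f64. If TTI reports 1 for each, the entry cost is 1 - 2 = -1, so the vector form looks profitable.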
> -- 
> Raúl E. Silvera
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory