[PATCH][X86] Teach the backend how to lower vector shift left into multiply rather than scalarizing it.

Wed Feb 12 13:52:00 PST 2014

On Wed, Feb 12, 2014 at 8:53 PM, Nadav Rotem <nrotem at apple.com> wrote:
> Almost there!  You forgot the LoopVectorizer, which should be easy to fix.

True! Sorry....

This new version of the patch should fix also the LoopVectorizer.

Please let me know what you think.

Cheers,
Andrea


> On Feb 12, 2014, at 12:04 PM, Andrea Di Biagio <andrea.dibiagio at gmail.com>
> wrote:
>
> Hi Nadav,
>
> This new patch should fix the Vectorizer as well.
>
> Differences are:
> - method 'getEntryCost' in SLPVectorizer is now able to correctly
> identify uniform and non-uniform constant values.
> - method getInstrCost in the BBVectorize now accepts two extra
> operands of type OperandValueKind.
> - method areInstsCompatible in BBVectorize is now able to correctly
> compute the cost of a merged shift instruction based on the knowledge
> of the operand value kinds.
>  Example 1.
>  given the following pair:
>      %shl1 = shl A, 1
>      %shl2 = shl B, 1
>
>  The operand value kind for the second operand of the merged shift
> will be equal to OK_UniformConstantValue
>
>  Example 2.
>  given the following pair:
>      %shl1 = shl A, 1
>      %shl2 = shl B, 2
>
>  The operand value kind for the second operand of the merged shift
> will be equal to OK_NonUniformConstantValue.
>
>
> Please let me know if ok to submit.
>
> Thanks :-)
> Andrea
>
> P.s.: on a slightly different topic.
> While testing the SLP vectorizer, I noticed that method
> 'vectorizeStoreChain' conservatively limits the length of a
> vectorizable chain of instructions based on the value of the global
> variable 'MinVecRegSize' (which is always equal to 128).
> My question is: when looking for profitable vectorizable trees,
> wouldn't it make more sense to set variable VF based (also) on the
> value returned by TTI::getRegisterBitWidth()? I think this could have
> a positive effect if the target supports for example 256bit vectors
> (so, AVX/AVX2 and AVX512f). That said, I am not an expert of
> SLPVectorizer and therefore I may have misunderstood the behavior of
> that method :-)
>
> On Wed, Feb 12, 2014 at 3:34 AM, Nadav Rotem <nrotem at apple.com> wrote:
>
> Hi Andrea,
>
> You added a new kind of OperandValueKind and you connected the cost model
> computation, but you forgot one big piece: the vectorizer!   The vectorizer
> needs to use the new OperandValueKind in order to efficiently estimate the
> cost of the shifts.  The patches you sent LGTM, but I would also like the
> vectorizer to use the new operand value kind.  :)
>
> Thanks,
> Nadav
>
>
>
> On Feb 11, 2014, at 12:29 PM, Andrea Di Biagio <andrea.dibiagio at gmail.com>
> wrote:
>
> Hi Nadav and Jim,
>
> Sorry for sending another version of the patch but I think I found how
> to properly fix/improve the cost model.
> For simplicity I have split the patch into two: a patch that
> introduces the new lowering rule for vector shifts and a patch that
> improves the x86 cost model.
>
> Test vec_shift6.ll should now cover all the cases where vector shifts
> are expanded into multiply instructions.
>
> I introduced into the cost model a new 'OperandValueKind' to identify
> operands which are constants but not constant splats.
> I called it 'OK_NonUniformConstValue' to distinguish it from
> 'OK_UniformConstantValue' which is used for splat values.
>
> I modified CostModel.cpp to allow returning 'OK_NonUniformConstValue'
> for non splat operands that are instances of ConstantVector or
> ConstantDataVector.
>
> I verified that the cost model still produces the expected results on
> other non x86 targets (ARM and PPC).
> On the X86 backend, I modified X86TargetTransformInfo to produce the
> 'expected' cost for vector shift left instructions that can be lowered
> as a vector multiply.
>
> Finally I added a new test to verify that the output of opt
> '-cost-model -analyze' is valid in the following configurations: SSE2,
> SSE4.1, AVX, AVX2.
>
> I didn't add a cost model for AVX512f, but I think (if it is ok for
> you) that this can be done as a separate patch.
>
> Please let me know what you think.
>
> Thanks,
> Andrea
>
> On Fri, Feb 7, 2014 at 8:51 PM, Andrea Di Biagio
> <andrea.dibiagio at gmail.com> wrote:
>
> Hi Jim and Nadav,
>
> here is a new version of the patch.
>
> The cost model is able to deal with most of the cases where the second
> operand of a SHL is of kind OK_UniformConstantValue.
> As far as I can see, the only missing rule is for the case where we
> have a v16i16 SHL in AVX2.
> I added a rule in X86TTI::getArithmeticInstrCost to update the cost
> for that specific case.
>
> One of the problems I have encountered is that method 'getOperandInfo'
> in CostModel either returns OK_AnyValue or OK_UniformConstantValue (I
> didn't find any case where we return OK_UniformValue..).
> All the shifts optimized by my patch are shifts where the second
> operand is a ConstantDataVector but not a splat. Therefore, function
> getOperandInfo in CostModel.cpp will always return OK_AnyValue for
> them.
>
> In conclusion, I am not sure if I have modified that part correctly.
> Please let me know if you had something else in mind and in case what
> should be the correct way to improve that part.
>
> I investigated other opportunties for applying this transformation to
> other vector types.
> This new patch improves my previous patch since we now know how to
> expand a packed v16i16 shift left into a single AVX2 vpmullw when the
> shift amount is a constant build_vector.
>
> Without AVX2 support, a packed v16i16 shift left is usually decomposed
> into two 128-bits shifts. Each new shift is then properly expanded
> into a multiply instruction (pmullw) thanks to the new transformation
> introduced by this patch.
>
> The backend already knows how to efficiently lower v8i32 shifts with
> AVX2: v8i32 shifts are expanded into VPSLLV instructions.
> Also, the backend already knows how to emit the AVX512f version of
> VPSLLVD and VPSLLVQ in the case of v1632 and v8i64 vectors.
> AVX512f seems to benefit alot from this new transformation (you can
> see it if you compare the output of test avx_shift6.ll when run for
> AVX512 before and after applying my patch).
>
> A vector shift left by constant v32i16 build_vector is initally split
> into two v16i16 shifts during type-legalization. Eventually the new
> algorithm converts the resulting shifts into multiply instructions.
>
> See new test-cases avx_shift6.ll for more details.
>
> Please let me know what you think.
>
> Thanks,
> Andrea
>
> On Thu, Feb 6, 2014 at 11:11 PM, Andrea Di Biagio
> <andrea.dibiagio at gmail.com> wrote:
>
> Hi Jim and Nadav,
>
> thanks for the feedback!
>
> On Thu, Feb 6, 2014 at 10:33 PM, Nadav Rotem <nrotem at apple.com> wrote:
>
> Please remember to update the x86 cost model in
> X86TTI::getArithmeticInstrCost.  In this function you should be able to
> check if the LHS is a constant.
>
>
> Ok, I will update the cost model.
>
> On Feb 6, 2014, at 2:26 PM, Jim Grosbach <grosbach at apple.com> wrote:
>
> Hi Andrea,
>
> This is a very nice improvement, but should do a bit more, I believe.
>
> AVX2 adds 256-bit wide vector versions of these instructions, so if AVX2 is
> available, the same transformation should be applied to v16i16 and v8i32
> shifts. Worth looking to see if AVX512 extends
>
> The test cases should check that when compiling for AVX, the VEX prefixed
> form of the instructions are generated instead of the SSE versions.
>
>
> Sure, I will send a new version of the patch that also improves AVX2
> code generation in the case of v16i16 and v8i32. I'll also have a look
> at AVX512 to identify cases that can also be improved.
>
> On my previous email I forgot to mention that this patch would also
> fix bug 18478 (http://llvm.org/bugs/show_bug.cgi?id=18478)
>
> Thanks again for your time.
> Andrea
>
>
> Hi,
>
> This patch teaches the backend how to efficiently lower a packed
> vector shift left into a packed vector multiply if the vector of shift
> counts is known to be constant (i.e. a constant build_vector).
>
> Instead of expanding a packed shift into a sequence of scalar shifts,
> the backend should try (when possible) to convert the vector shift
> into a vector multiply.
>
> Before this patch, a shift of a MVT::v8i16 vector by a build_vector of
> constants was always scalarized into a long sequence of "vector
> extracts + scalar shifts + vector insert".
> With this patch, if there is SSE2 support, we emit a single vector multiply.
>
> The new x86 test 'vec_shift6.ll' contains some examples of code that
> are affected by this patch.
>
> Please let me know if ok to submit.
>
> Thanks,
> Andrea Di Biagio
> SN Systems - Sony Computer Entertainment Group
> <patch.diff>_______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>
>
>
> <patch-cost-model.diff><patch-v2.diff>
>
>
> <patch-cost-model-v2.diff><patch-v2.diff>
>
>
-------------- next part --------------
Index: lib/Target/X86/X86TargetTransformInfo.cpp
===================================================================

--- lib/Target/X86/X86TargetTransformInfo.cpp	(revision 201259)
+++ lib/Target/X86/X86TargetTransformInfo.cpp	(working copy)
@@ -225,6 +225,13 @@
 
   // Look for AVX2 lowering tricks.
   if (ST->hasAVX2()) {
+    if (ISD == ISD::SHL && LT.second == MVT::v16i16 &&
+        (Op2Info == TargetTransformInfo::OK_UniformConstantValue ||
+         Op2Info == TargetTransformInfo::OK_NonUniformConstantValue))
+      // On AVX2, a packed v16i16 shift left by a constant build_vector
+      // is lowered into a vector multiply (vpmullw).
+      return LT.first;
+
     int Idx = CostTableLookup(AVX2CostTable, ISD, LT.second);
     if (Idx != -1)
       return LT.first * AVX2CostTable[Idx].Cost;
@@ -257,6 +264,20 @@
       return LT.first * SSE2UniformConstCostTable[Idx].Cost;
   }
 
+  if (ISD == ISD::SHL &&
+      Op2Info == TargetTransformInfo::OK_NonUniformConstantValue) {
+    EVT VT = LT.second;
+    if ((VT == MVT::v8i16 && ST->hasSSE2()) ||
+        (VT == MVT::v4i32 && ST->hasSSE41()))
+      // Vector shift left by non uniform constant can be lowered
+      // into vector multiply (pmullw/pmulld).
+      return LT.first;
+    if (VT == MVT::v4i32 && ST->hasSSE2())
+      // A vector shift left by non uniform constant is converted
+      // into a vector multiply; the new multiply is eventually
+      // lowered into a sequence of shuffles and 2 x pmuludq.
+      ISD = ISD::MUL;
+  }
 
   static const CostTblEntry<MVT::SimpleValueType> SSE2CostTable[] = {
     // We don't correctly identify costs of casts because they are marked as
@@ -271,6 +292,7 @@
     { ISD::SHL,  MVT::v8i16,  8*10 }, // Scalarized.
     { ISD::SHL,  MVT::v4i32,  2*5 }, // We optimized this using mul.
     { ISD::SHL,  MVT::v2i64,  2*10 }, // Scalarized.
+    { ISD::SHL,  MVT::v4i64,  4*10 }, // Scalarized. 
 
     { ISD::SRL,  MVT::v16i8,  16*10 }, // Scalarized.
     { ISD::SRL,  MVT::v8i16,  8*10 }, // Scalarized.
@@ -308,6 +330,7 @@
     // We don't have to scalarize unsupported ops. We can issue two half-sized
     // operations and we only need to extract the upper YMM half.
     // Two ops + 1 extract + 1 insert = 4.
+    { ISD::MUL,     MVT::v16i16,   4 },
     { ISD::MUL,     MVT::v8i32,    4 },
     { ISD::SUB,     MVT::v8i32,    4 },
     { ISD::ADD,     MVT::v8i32,    4 },
@@ -323,7 +346,15 @@
 
   // Look for AVX1 lowering tricks.
   if (ST->hasAVX() && !ST->hasAVX2()) {
-    int Idx = CostTableLookup(AVX1CostTable, ISD, LT.second);
+    EVT VT = LT.second;
+
+    // v16i16 and v8i32 shifts by non-uniform constants are lowered into a
+    // sequence of extract + two vector multiply + insert.
+    if (ISD == ISD::SHL && (VT == MVT::v8i32 || VT == MVT::v16i16) &&
+        Op2Info == TargetTransformInfo::OK_NonUniformConstantValue)
+      ISD = ISD::MUL;
+
+    int Idx = CostTableLookup(AVX1CostTable, ISD, VT);
     if (Idx != -1)
       return LT.first * AVX1CostTable[Idx].Cost;
   }
@@ -343,7 +374,7 @@
   // 2x pmuludq, 2x shuffle.
   if (ISD == ISD::MUL && LT.second == MVT::v4i32 && ST->hasSSE2() &&
       !ST->hasSSE41())
-    return 6;
+    return LT.first * 6;
 
   // Fallback to the default implementation.
   return TargetTransformInfo::getArithmeticInstrCost(Opcode, Ty, Op1Info,
Index: lib/Transforms/Vectorize/BBVectorize.cpp
===================================================================
--- lib/Transforms/Vectorize/BBVectorize.cpp	(revision 201259)
+++ lib/Transforms/Vectorize/BBVectorize.cpp	(working copy)
@@ -532,7 +532,11 @@
 
     // Returns the cost of the provided instruction using TTI.
     // This does not handle loads and stores.
-    unsigned getInstrCost(unsigned Opcode, Type *T1, Type *T2) {
+    unsigned getInstrCost(unsigned Opcode, Type *T1, Type *T2,
+                          TargetTransformInfo::OperandValueKind Op1VK = 
+                              TargetTransformInfo::OK_AnyValue,
+                          TargetTransformInfo::OperandValueKind Op2VK =
+                              TargetTransformInfo::OK_AnyValue) {
       switch (Opcode) {
       default: break;
       case Instruction::GetElementPtr:
@@ -562,7 +566,7 @@
       case Instruction::And:
       case Instruction::Or:
       case Instruction::Xor:
-        return TTI->getArithmeticInstrCost(Opcode, T1);
+        return TTI->getArithmeticInstrCost(Opcode, T1, Op1VK, Op2VK);
       case Instruction::Select:
       case Instruction::ICmp:
       case Instruction::FCmp:
@@ -1013,13 +1017,58 @@
       unsigned JCost = getInstrCost(J->getOpcode(), JT1, JT2);
       Type *VT1 = getVecTypeForPair(IT1, JT1),
            *VT2 = getVecTypeForPair(IT2, JT2);
+      TargetTransformInfo::OperandValueKind Op1VK =
+          TargetTransformInfo::OK_AnyValue;
+      TargetTransformInfo::OperandValueKind Op2VK =
+          TargetTransformInfo::OK_AnyValue;
 
+      // On some targets (example X86) the cost of a vector shift may vary
+      // depending on whether the second operand is a Uniform or
+      // NonUniform Constant.
+      switch (I->getOpcode()) {
+      default : break;
+      case Instruction::Shl:
+      case Instruction::LShr:
+      case Instruction::AShr:
+
+        // If both I and J are scalar shifts by constant, then the
+        // merged vector shift count would be either a constant splat value
+        // or a non-uniform vector of constants.
+        if (ConstantInt *CII = dyn_cast<ConstantInt>(I->getOperand(1))) {
+          if (ConstantInt *CIJ = dyn_cast<ConstantInt>(J->getOperand(1)))
+            Op2VK = CII == CIJ ? TargetTransformInfo::OK_UniformConstantValue :
+                               TargetTransformInfo::OK_NonUniformConstantValue;
+        } else {
+          // Check for a splat of a constant or for a non uniform vector
+          // of constants.
+          Value *IOp = I->getOperand(1);
+          Value *JOp = J->getOperand(1);
+          if (ConstantDataVector *CDVI = dyn_cast<ConstantDataVector>(IOp)) {
+            if (ConstantDataVector *CDVJ = dyn_cast<ConstantDataVector>(JOp)) {
+              Op2VK = TargetTransformInfo::OK_NonUniformConstantValue;
+              Constant *SplatValue = CDVI->getSplatValue();
+              if (SplatValue != NULL && SplatValue == CDVJ->getSplatValue())
+                Op2VK = TargetTransformInfo::OK_UniformConstantValue;
+            }
+          }
+
+          if (ConstantVector *CVI = dyn_cast<ConstantVector>(IOp)) {
+            if (ConstantVector *CVJ = dyn_cast<ConstantVector>(JOp)) {
+              Op2VK = TargetTransformInfo::OK_NonUniformConstantValue;
+              Constant *SplatValue = CVI->getSplatValue();
+              if (SplatValue != NULL && SplatValue == CVJ->getSplatValue())
+                Op2VK = TargetTransformInfo::OK_UniformConstantValue;
+            }
+          }
+        }
+      }
+
       // Note that this procedure is incorrect for insert and extract element
       // instructions (because combining these often results in a shuffle),
       // but this cost is ignored (because insert and extract element
       // instructions are assigned a zero depth factor and are not really
       // fused in general).
-      unsigned VCost = getInstrCost(I->getOpcode(), VT1, VT2);
+      unsigned VCost = getInstrCost(I->getOpcode(), VT1, VT2, Op1VK, Op2VK);
 
       if (VCost > ICost + JCost)
         return false;
Index: lib/Transforms/Vectorize/LoopVectorize.cpp
===================================================================
--- lib/Transforms/Vectorize/LoopVectorize.cpp	(revision 201259)
+++ lib/Transforms/Vectorize/LoopVectorize.cpp	(working copy)
@@ -5491,9 +5491,20 @@
       TargetTransformInfo::OK_AnyValue;
     TargetTransformInfo::OperandValueKind Op2VK =
       TargetTransformInfo::OK_AnyValue;
+    Value *Op2 = I->getOperand(1);
 
-    if (isa<ConstantInt>(I->getOperand(1)))
+    // Check for a splat of a constant or for a non uniform vector of constants.
+    if (isa<ConstantInt>(Op2))
       Op2VK = TargetTransformInfo::OK_UniformConstantValue;
+    else if (ConstantDataVector *CDV = dyn_cast<ConstantDataVector>(Op2)) {
+      Op2VK = TargetTransformInfo::OK_NonUniformConstantValue;
+      if (CDV->getSplatValue() != NULL)
+        Op2VK = TargetTransformInfo::OK_UniformConstantValue;
+    } else if (ConstantVector *CV = dyn_cast<ConstantVector>(Op2)) {
+      Op2VK = TargetTransformInfo::OK_NonUniformConstantValue;
+      if (CV->getSplatValue() != NULL)
+        Op2VK = TargetTransformInfo::OK_UniformConstantValue;
+    }
 
     return TTI.getArithmeticInstrCost(I->getOpcode(), VectorTy, Op1VK, Op2VK);
   }
Index: lib/Transforms/Vectorize/SLPVectorizer.cpp
===================================================================
--- lib/Transforms/Vectorize/SLPVectorizer.cpp	(revision 201259)
+++ lib/Transforms/Vectorize/SLPVectorizer.cpp	(working copy)
@@ -1044,12 +1044,26 @@
         TargetTransformInfo::OperandValueKind Op2VK =
             TargetTransformInfo::OK_UniformConstantValue;
 
-        // Check whether all second operands are constant.
-        for (unsigned i = 0; i < VL.size(); ++i)
-          if (!isa<ConstantInt>(cast<Instruction>(VL[i])->getOperand(1))) {
+        // If all operands are exactly the same ConstantInt then set the
+        // operand kind to OK_UniformConstantValue.
+        // If instead not all operands are constants, then set the operand kind
+        // to OK_AnyValue. If all operands are constants but not the same,
+        // then set the operand kind to OK_NonUniformConstantValue.
+        ConstantInt *CInt = NULL;
+        for (unsigned i = 0; i < VL.size(); ++i) {
+          const Instruction *I = cast<Instruction>(VL[i]);
+          if (!isa<ConstantInt>(I->getOperand(1))) {
             Op2VK = TargetTransformInfo::OK_AnyValue;
             break;
           }
+          if (i == 0) {
+            CInt = cast<ConstantInt>(I->getOperand(1));
+            continue;
+          }
+          if (Op2VK == TargetTransformInfo::OK_UniformConstantValue &&
+              CInt != cast<ConstantInt>(I->getOperand(1)))
+            Op2VK = TargetTransformInfo::OK_NonUniformConstantValue;
+        }
 
         ScalarCost =
             VecTy->getNumElements() *
Index: lib/Analysis/CostModel.cpp
===================================================================
--- lib/Analysis/CostModel.cpp	(revision 201259)
+++ lib/Analysis/CostModel.cpp	(working copy)
@@ -98,15 +98,20 @@
   TargetTransformInfo::OperandValueKind OpInfo =
     TargetTransformInfo::OK_AnyValue;
 
-  // Check for a splat of a constant.
+  // Check for a splat of a constant or for a non uniform vector of constants.
   ConstantDataVector *CDV = 0;
-  if ((CDV = dyn_cast<ConstantDataVector>(V)))
+  if ((CDV = dyn_cast<ConstantDataVector>(V))) {
+    OpInfo = TargetTransformInfo::OK_NonUniformConstantValue;
     if (CDV->getSplatValue() != NULL)
       OpInfo = TargetTransformInfo::OK_UniformConstantValue;
+  }
+
   ConstantVector *CV = 0;
-  if ((CV = dyn_cast<ConstantVector>(V)))
+  if ((CV = dyn_cast<ConstantVector>(V))) {
+    OpInfo = TargetTransformInfo::OK_NonUniformConstantValue;
     if (CV->getSplatValue() != NULL)
       OpInfo = TargetTransformInfo::OK_UniformConstantValue;
+  }
 
   return OpInfo;
 }
Index: include/llvm/Analysis/TargetTransformInfo.h
===================================================================
--- include/llvm/Analysis/TargetTransformInfo.h	(revision 201259)
+++ include/llvm/Analysis/TargetTransformInfo.h	(working copy)
@@ -321,9 +321,10 @@
 
   /// \brief Additional information about an operand's possible values.
   enum OperandValueKind {
-    OK_AnyValue,            // Operand can have any value.
-    OK_UniformValue,        // Operand is uniform (splat of a value).
-    OK_UniformConstantValue // Operand is uniform constant.
+    OK_AnyValue,                 // Operand can have any value.
+    OK_UniformValue,             // Operand is uniform (splat of a value).
+    OK_UniformConstantValue,     // Operand is uniform constant.
+    OK_NonUniformConstantValue   // Operand is a non uniform constant value.
   };
 
   /// \return The number of scalar or vector registers that the target has.