[PATCH] Improve Cost model for SLPVectorizer when we have a vector division by power of 2

Thu Aug 21 05:22:35 PDT 2014

Hi Karthik,

sorry for the long post.
I have a few comments related to your patch.

I think your cost tables are incomplete. For example (correct me if I am wrong), the cost of a UDIV/SDIV by a constant splat of powers-of-2 is exactly the same on AVX2 and AVX (and on SSE as well) if we exclude the legalization cost which maybe different for SSE.

Also, the cost of an SDIV would be 4 and not 3:
on AVX for example, you would get a vpsrad + vpsrld + vpaddd + vpsrad.
You would get an extra arithmetic shift at the beginning because you would need to propagate the sign bit. Under the assumption that we are dividing by a non-negative power-of-2 then we are done (i.e. the cost is 4); otherwise we would pay an extra ISD::SUB to negate the result. In your case you don't need to account for the extra ISD::SUB because method APInt::isPowerOf2() would only return true for values that are > 0.

That said, if possible I suggest to reuse the existing cost tables as much as possible rather than defining new tables (see comments below).

I hope this makes sense to you.
-Andrea

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:189-191
@@ -186,5 +188,5 @@
 
   int ISD = TLI->InstructionOpcodeToISD(Opcode);
   assert(ISD && "Invalid opcode");
 
   static const CostTblEntry<MVT::SimpleValueType>
----------------
Here you can add the following check:


```
// Unsigned divisions by powers-of-2 are always reduced to SRL.
if (ISD == ISD::UDIV && TargetTransformInfo::OK_UniformConstantValue &&
    Op2PropInfo == TargetTransformInfo::OP_PowerOf2)
  ISD = ISD::SRL;
```
I had a quick look at the  code and I am pretty sure that this is what we do for every target.

This will allow you to get rid of all your new entries for UDIV in the new cost tables you added for powers-of-2.

In the case of SDIV, each targets may provide its own custom implementation of `sdiv x, pow2`. On X86 we don't override method 'BuildSDIVPow2' and therefore we fall back to the standard expansion of signed divisions by pow2; as a result, we introduce the sequence SRA + SRL + ADD + SRA.

```
if (ISD == ISD::SDIV && TargetTransformInfo::OK_UniformConstantValue &&
    Op2PropInfo == TargetTransformInfo::OP_PowerOf2) {
  // On X86, vector signed division by constants power-of-two are
  // normally expanded to the sequence SRA + SRL + ADD + SRA.
  unsigned Cost = 2 * getArithmeticInstrCost(ISD::SRA, Ty, Op1Info, Op2Info, Opd1PropInfo, Opd2PropInfo);
  Cost += getArithmeticInstrCost(ISD::SRL, Ty, Op1Info, Op2Info, Opd1PropInfo, Opd2PropInfo);
  Cost += getArithmeticInstrCost(ISD::ADD, Ty, Op1Info, Op2Info, Opd1PropInfo, Opd2PropInfo);
  return Cost;
}
```

This will allow you to reuse the existing cost tables without having to add extra cost tables to handle the special case of divisions by powers-of-2.

As a side note:
in the future, we can think of removing the check for TargetTransformInfo::OK_UniformConstantValue to also address non-uniform constants as suggested by Hal.
For example, on AVX2, UDIV by non-uniform constants power-of-2 could be emitted as a single vprslv. That is because we can take advantage of vector shifts with variable count (which we don't have on SSE/AVX).
Unfortunately, this is not happening at the moment and we end up scalarizing the logical shifts instead. This is why we still need that check for now..

Example:
  %div = udiv <4 x i32> %A, <i32 4, i32 8, i32 16, i32 4>

Could be strenght reduced to:
  %div = lshr <4 x i32> %A, <i32 2, i32 3, i32 4, i32 2>

On AVX2, that logical shift right would be emitted as a single vpsrlvd.
Unfortunately this is not happening at the moment and we get a long sequence of
vpextrd + shrl + vpinsrd..

Similarly, On AVX2 we could strongly optimize the SDIV expansion by non-uniform constants power-of-2 (this - again - is not happening at the moment).

http://reviews.llvm.org/D4971