[PATCH 2/3] ARM cost model: Address computation in vector mem ops not free

Thu Feb 7 12:18:56 PST 2013

Hi all,

Let me rebase this patch; the memory cost patch has caused some churn since it was written.

On Feb 7, 2013, at 1:32 PM, Renato Golin <renato.golin at linaro.org> wrote:

> On 7 February 2013 14:31, Arnold <aschwaighofer at apple.com> wrote:
> I agree with you, it is unfortunate. However, I am trying to model an idiosyncrasy of the processor that has a big implication on performance. It is very expensive on swift if you happen to load into a S register, or D sub lane. Two such instructions are not pipelined but sequentialized.
> 
> In that case, the cost will be much more than 2 or 3, no?       
> 
> 

3 is a number based on the architecture; 2 was just saying "assume double the cost". But yes, it is only a rough guess. I don't want to make it too expensive, though, because I have to penalize all inserts, even those that are not really affected (those not coming from a load).
What we are saying here is that the throughput of this instruction is estimated to be three times lower. This matters if the code is dominated by this instruction.
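
To put rough numbers on that, here is a minimal sketch of how the insert cost feeds into the estimate for a scalarized gather. It is a simplification, not the vectorizer's actual cost code: `LaneCosts` and `scalarizedGatherCost` are hypothetical names, and the costs are illustrative relative values rather than entries from the ARM cost tables.

```cpp
#include <cassert>

// Hypothetical relative costs; these are illustrative numbers, not values
// taken from the ARM cost tables.
struct LaneCosts {
  unsigned ScalarLoad;    // cost of one scalar element load
  unsigned InsertElement; // cost of inserting that element into a vector lane
};

// Simplified cost of a scalarized gather of VF elements: each lane needs one
// scalar load and one insert into the vector register.
unsigned scalarizedGatherCost(unsigned VF, LaneCosts C) {
  assert(VF > 0 && "vectorization factor must be positive");
  return VF * (C.ScalarLoad + C.InsertElement);
}
```

With VF = 4 and a load cost of 1, raising the insert cost from 1 to 3 moves the estimate from 8 to 16, so a loop dominated by such gathers looks twice as expensive to vectorize.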

And the code where this matters is not so special. Any code that has a gather in it (and is mostly dominated by it) will suffer from it:

void example14(int **in, int **coeff, int *out) {
  int k,j,i=0;
  for (k = 0; k < K; k++) {
    int sum = 0;
    for (j = 0; j < M; j++)
      for (i = 0; i < N/64; i++)
          sum += in[i+k][j] * coeff[i][j];

    out[k] = sum;
  }

}
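
For reference, this is roughly the shape of the gather when the i loop is widened by four, written out by hand. It is only a sketch: `Vec4` and `gatherColumn` are hypothetical names, and a `std::array` stands in for a vector register.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Stand-in for a 4 x i32 vector register.
using Vec4 = std::array<int, 4>;

// Gather column j from four consecutive rows starting at firstRow. The rows
// are separate allocations, so the four elements are not adjacent in memory
// and cannot be fetched with a single wide load.
Vec4 gatherColumn(int *const *rows, std::size_t firstRow, std::size_t j) {
  assert(rows != nullptr);
  Vec4 v{};
  for (std::size_t lane = 0; lane < 4; ++lane)
    v[lane] = rows[firstRow + lane][j]; // one scalar load plus one lane insert
  return v;
}
```

Each lane costs a scalar load followed by an insert into the vector, which is exactly the load-into-sub-lane pattern I am trying to penalize.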

> Stride has the value of the isConsecutivePtr method:
> 
> Ok, in the original code you had:
> 
> if (Stride < 0)
>   return parent::cost();
> return Cost;
> 
> In this you have:
> 
> if (Stride > 0)
>   return Cost;
> return parent::cost();
> 
> It seems you're missing the case where it's == 0, but I can't tell which way it should go.

We don't get here with a zero stride. This will become clear once I rebase this patch:

if ((Stride = Legality->isConsecutivePtr(PointerOperand)))
         return costOfWideMemInst();

+    /// \return the cost of a vector memory instruction.
+    unsigned costOfWideMemInst() {
+      // We assume that address computation is involved.
+      unsigned Cost = TTI.getAddressComputationCost(VectorTy) +
+        TTI.getMemoryOpCost(Opcode, VectorTy, Alignment, AddressSpace);
+
+      if (Stride > 0)
+        return Cost;
+
+      // Reverse stride.
+      Cost += TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0);
+      return Cost;
+    }
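
The control flow above can be checked in isolation with a small standalone model. This is a sketch under assumptions: `MockTTI` and `costOfMemInst` are hypothetical stand-ins (the real code queries TargetTransformInfo), the cost values are made up, and I am assuming the stride is 1 for forward consecutive, -1 for reverse consecutive and 0 otherwise.

```cpp
#include <cassert>

// Hypothetical stand-in for the TTI queries used in the patch.
struct MockTTI {
  unsigned AddressComputationCost;
  unsigned MemoryOpCost;
  unsigned ReverseShuffleCost;
};

// Model of costOfWideMemInst: only reached for consecutive pointers, so the
// stride is assumed to be either positive (forward) or negative (reverse).
unsigned costOfWideMemInst(const MockTTI &TTI, int Stride) {
  assert(Stride != 0 && "zero stride is filtered out by the caller");
  unsigned Cost = TTI.AddressComputationCost + TTI.MemoryOpCost;
  if (Stride > 0)
    return Cost;
  // Reverse stride needs an extra shuffle to reverse the vector.
  return Cost + TTI.ReverseShuffleCost;
}

// Model of the guard at the call site: a zero stride takes the other
// (scalarized) path and never reaches costOfWideMemInst.
unsigned costOfMemInst(const MockTTI &TTI, int Stride, unsigned ScalarizedCost) {
  if (Stride != 0)
    return costOfWideMemInst(TTI, Stride);
  return ScalarizedCost;
}
```

So the `Stride > 0` check only has to separate forward from reverse; the `== 0` case is handled before the call.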
> 
> I don't think we need a function call for the value 3 here. It is a value just like any other that is returned by TTI.
> 
> What I'm trying to say is that this value seems to come out of the blue. I could be wrong, obviously, but it seems to me that you're experimenting with a micro-benchmark and fine-tuning to your particular example, which is dangerous on a wider perspective.
> 
> I understand that this might be a big hit on a set of examples, but we should get some constants out, just to make it clear that we're not talking about "idealized cycle count", but something else entirely.
> 
> Like:
> 
> const int AVOID_AT_ALL_COSTS = 100;
> const int DANGEROUS_IN_MOST_CASES = 10;
> const int NOT_GOOD_BUT_COULD_BE_OK = 5;
> 
> etc…

I am not sure about this. We are talking about the estimated throughput of an instruction.

> 
> cheers,
> --renato
