[PATCH] D19426: [AArch64] Use the reciprocal estimation machinery

Wed Apr 27 05:04:45 PDT 2016

jmolloy added a subscriber: jmolloy.
jmolloy requested changes to this revision.
jmolloy added a reviewer: jmolloy.
jmolloy added a comment.
This revision now requires changes to proceed.

Hi,

> Then again, I'm testing the waters by opting to use additional features instead of a sequence of plethora of isCPU(). Perhaps it's time to use features, even of some other kind, so that all such nuances remain in the machine descriptions instead of peppered all over the rest of the source code. Perhaps not.

I don't like the use of features here. They are a very large, indiscriminate hammer for when to enable this optimization. I don't know about Exynos M1, but on many chips the decision of whether to use reciprocals or not is contextual.

Often, the iterative SQRT instruction has lower latency than a reciprocal alternative, not only because there is less instruction fetch/dispatch/issue overhead but also because the iterative version can exit early in hardware if the Newton-Raphson (NR) steps converge quickly. A reciprocal alternative needs a fixed number of steps, which must be enough for the worst case. The reciprocal's advantage is that it is fully pipelined, whereas the iterative SQRT might not be.
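To make the "fixed number of steps" point concrete, here is a minimal sketch in portable C of a square root computed from a reciprocal square root estimate plus NR refinement. The function names are hypothetical, and the initial estimate uses the well-known bit-trick constant purely as a stand-in for a hardware estimate instruction; it is not what the AArch64 instructions compute. It assumes a positive, normal float input.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Crude initial estimate of 1/sqrt(x). Stand-in for a hardware estimate
   instruction (assumption: the bit trick, not the real AArch64 result). */
static float rsqrt_estimate(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);      /* reinterpret float as integer */
    bits = 0x5f3759dfu - (bits >> 1);
    float y;
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* sqrt(x) = x * (1/sqrt(x)), refined with a fixed number of NR steps.
   The caller must pick enough steps for the worst-case input; unlike an
   iterative hardware SQRT, this sequence cannot exit early. */
static float sqrt_via_reciprocal(float x, int steps) {
    assert(x > 0.0f);                    /* sketch only handles positive x */
    float y = rsqrt_estimate(x);
    for (int i = 0; i < steps; ++i)
        y = y * (1.5f - 0.5f * x * y * y);   /* one NR step for 1/sqrt(x) */
    return x * y;
}
```

Every call pays for all the multiplies in all `steps` iterations, even when the estimate was already close. That is the latency cost described above, traded for the fact that each multiply is fully pipelined.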

In my experiments, reciprocals are a poor choice in any situation where

  (a) there are few data items to process, or
  (b) the sqrt/div is on the critical path.

So this sequence would be pessimized by changing to reciprocals:

  t = 0;
  for (...) {
    t = t + a[i];
    t /= b[i];
  }

Because the divide is on the critical path and is a loop-carried dependence, the core can never overlap executions of the divide, so changing to reciprocals and extending the critical path would be a loss.

However here:

  for (...) {
    a[i] = a[i] / b[i];
  }

We can vectorize and unroll this loop because its iterations are independent, so the pipelined reciprocal sequences can overlap. Here, reciprocals could be a *significant* win.
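For reference, the two loops above could be written as the following C functions. This is only a sketch: the function names, the float element type and the explicit length n are assumptions added for illustration.

```c
#include <stddef.h>

/* Each iteration needs the previous value of t, so the divides form one
   serial chain and cannot overlap. */
float chained_divide(const float *a, const float *b, size_t n) {
    float t = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        t = t + a[i];
        t /= b[i];
    }
    return t;
}

/* Iterations do not depend on each other, so the divides can be
   overlapped, unrolled or vectorized. */
void elementwise_divide(float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        a[i] = a[i] / b[i];
}
```

In the first function a longer divide sequence adds directly to the time of every iteration. In the second, its latency can be hidden behind the other iterations.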

Your current implementation doesn't distinguish between situations where reciprocals are beneficial and situations where they are not, and it can't, because you've moved the heuristic out of TTI/TLI into Subtarget.
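As a rough illustration of what a contextual decision might look like, here is a sketch in plain C. Everything in it is hypothetical: the struct, the function and the min_items threshold are not LLVM APIs, and the threshold would have to be measured per core. It only encodes conditions (a) and (b) above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one sqrt/div site; not an LLVM type. */
struct div_site {
    size_t trip_count;      /* how many data items the loop processes */
    bool on_critical_path;  /* result feeds a loop-carried dependence */
};

/* Hypothetical heuristic: prefer the reciprocal sequence only when there
   is enough independent work to hide its longer latency. min_items is an
   assumed tuning parameter, not a known value for any core. */
bool prefer_reciprocal(struct div_site site, size_t min_items) {
    if (site.on_critical_path)
        return false;                       /* condition (b) */
    return site.trip_count >= min_items;    /* condition (a) */
}
```

A per-subtarget feature bit can only answer this question once for the whole compilation, whereas a check of this kind needs information about each individual use.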

Cheers,

James

Repository:
  rL LLVM

http://reviews.llvm.org/D19426