[PATCH] D21379: [X86] Heuristic to selectively build Newton-Raphson SQRT estimation

Mon Jul 25 07:39:34 PDT 2016

n.bozhenov added a comment.

Great questions, Sanjay!

> Shouldn't HSW show a latency improvement over IVB from using FMA?

FMA doesn't much affect the results. In most cases the difference between FMA
code and non-FMA code is not crucial. The only case significantly affected by
FMA is scalar SQRT on Haswell where NR got 15% higher throughput with FMA. Here
is the table for throughput-bound FMA code:

  |      |  HSW |  BDW |  SKL |
  |------+------+------+------|
  | x32  | +38% | -12% | -26% |
  | x128 | +12% | +32% | -30% |
  | x256 | +69% | +84% |  +6% |

And the updated table for latency-bound FMA code:

  |      |  HSW |  BDW |  SKL |
  |------+------+------+------|
  | x32  | -32% | -20% | -25% |
  | x128 | -34% | -28% | -25% |
  | x256 | -21% |  +6% |  -2% |

> How many N-R steps are included in your measurements?

I benchmarked the default number of steps which is one refinement step.

> Do the measurements include the change from https://reviews.llvm.org/D21127?

Yes.

> When we enabled the estimate generation code, we knew it had higher latency for SNB/IVB/HSW, but we reasoned that most real-world FP code would care more about throughput.

The idea behind this patch is that throughput bound code is likely to be
vectorized, so for vectorized code we should care about the throughput. But if
the code is scalar that probably means that the code has some kind of dependency
and we should care more about reducing the latency.

You mentioned PR21385, but the PR mostly deals with reciprocal square roots
which are NOT affected by this patch. SQRTSS+DIVSS is a way too heavy
combination even for Skylake. So, this patch is only about using SQRTSS/SQRTPS
instructions to calculate non-reciprocal roots.

> Do you have any benchmark numbers (test-suite, SPEC, etc) for those CPUs that shows a difference?

We can see large performance improvements for some our internal benchmarks. And
they were our motivating examples. As for Specs, the __updated__ version of the
patch doesn't affect any hot code in Spec2000 and Spec2006. Neither it affects
hot code in testsuite/Bullet benchmark (which is a very sqrt-intensive one).

Also, this patch makes difference not only to performance. It also improves the
accuracy in the affected cases. And generally we should prefer more precise code
even with fast-math unless we expect significant performance improvement.

> For the test file, please add RUNs that include the new attributes themselves rather than specifying a CPU.

Done. I've added a few more RUNs to the test to check the attributes themselves.

https://reviews.llvm.org/D21379