[PATCH] D21379: [X86] Heuristic to selectively build Newton-Raphson SQRT estimation

Hal Finkel via llvm-commits llvm-commits at lists.llvm.org
Mon Jul 25 12:57:41 PDT 2016


hfinkel added a comment.

In https://reviews.llvm.org/D21379#494665, @n.bozhenov wrote:

> Great questions, Sanjay!
>
> > Shouldn't HSW show a latency improvement over IVB from using FMA?
>
>
> FMA doesn't affect the results much. In most cases the difference between
>  FMA code and non-FMA code is not crucial. The only case significantly
>  affected by FMA is scalar SQRT on Haswell, where NR got 15% higher
>  throughput with FMA. Here is the table for throughput-bound FMA code:
>
>   |      |  HSW |  BDW |  SKL |
>   |------+------+------+------|
>   | x32  | +38% | -12% | -26% |
>   | x128 | +12% | +32% | -30% |
>   | x256 | +69% | +84% |  +6% |
>   
>
> And the updated table for latency-bound FMA code:
>
>   |      |  HSW |  BDW |  SKL |
>   |------+------+------+------|
>   | x32  | -32% | -20% | -25% |
>   | x128 | -34% | -28% | -25% |
>   | x256 | -21% |  +6% |  -2% |
>   
>   
>
> > How many N-R steps are included in your measurements?
>
>
> I benchmarked the default number of steps, which is one refinement step.
>
> > Do the measurements include the change from https://reviews.llvm.org/D21127?
>
>
> Yes.
>
> > When we enabled the estimate generation code, we knew it had higher latency for SNB/IVB/HSW, but we reasoned that most real-world FP code would care more about throughput.
>
>
> The idea behind this patch is that throughput-bound code is likely to be
>  vectorized, so for vectorized code we should care about throughput. But if
>  the code is scalar, that probably means it has some kind of dependency
>  chain, and we should care more about reducing latency.
>
> You mentioned PR21385, but that PR mostly deals with reciprocal square
>  roots, which are NOT affected by this patch. SQRTSS+DIVSS is far too heavy
>  a combination even for Skylake. So, this patch is only about using
>  SQRTSS/SQRTPS instructions to calculate non-reciprocal roots.
>
> > Do you have any benchmark numbers (test-suite, SPEC, etc) for those CPUs that shows a difference?
>
>
> We can see large performance improvements for some of our internal
>  benchmarks, and they were our motivating examples. As for SPEC, the
>  __updated__ version of the patch doesn't affect any hot code in SPEC2000 or
>  SPEC2006. Nor does it affect hot code in the testsuite/Bullet benchmark
>  (which is a very sqrt-intensive one).
>
> Also, this patch makes a difference not only to performance: it also
>  improves accuracy in the affected cases. And generally we should prefer
>  more precise code even with fast-math, unless we expect a significant
>  performance improvement.
>
> > For the test file, please add RUNs that include the new attributes themselves rather than specifying a CPU.
>
>
> Done. I've added a few more RUNs to the test to check the attributes themselves.


I'd obviously also like to think that any throughput-sensitive code is vectorized (or at least that the vectorizer has unrolled it to hide latency where helpful). In general, this low-latency/low-throughput vs. high-latency/high-throughput tradeoff is exactly the kind of situation that the MachineCombiner is supposed to handle. It might be difficult to use here, however, because we want to insert NR before instruction selection (to take advantage of DAGCombine simplifications and complex ISel patterns), and replacing the estimate/NR sequence in the MachineCombiner might be tricky (especially since the user can select the number of iterations, so matching only the simple case of one iteration would be undesirable). As a result, unfortunately, phase-ordering constraints might prevent using the principled solution here. The heuristic does not seem unreasonable; I just wish we did not have to use it.

In any case, the rationale needs to be much better documented in the code; I don't see any comments in the patch itself about the latency/throughput tradeoff.


https://reviews.llvm.org/D21379




