[PATCH] D21379: [X86] Heuristic to selectively build Newton-Raphson SQRT estimation

Wed Jun 15 13:17:00 PDT 2016

spatel added a comment.

In http://reviews.llvm.org/D21379#458695, @n.bozhenov wrote:

> Below are some figures to justify the change.
>  Experimental Newton-Raphson efficiency for latency-bound code:
>
>   |      |  IVB |  HSW |  BDW |  SKL |
>   |------+------+------+------+------|
>   | x32  | -41% | -40% | -21% | -40% |
>   | x128 | -32% | -32% | -17% | -35% |
>   
>
> Experimental Newton-Raphson efficiency for throughput-bound code:
>
>   |      |  IVB |  HSW |  BDW |  SKL |
>   |------+------+------+------+------|
>   | x32  | +18% | +21% | -17% | -40% |
>   | x128 | +10% | +14% | +28% | -50% |
>   | x256 |      | +68% | +85% |  +3% |
>   

1. Shouldn't HSW show a latency improvement over IVB from using FMA?
2. How many N-R steps are included in your measurements?
3. Do the measurements include the change from http://reviews.llvm.org/D21127?
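
For context on question 2, here is a minimal sketch of why the step count matters: each Newton-Raphson step on a reciprocal square root estimate roughly doubles the number of correct bits, at the cost of a few more dependent multiplies. This is an illustration only, not the sequence the patch or the backend emits. The helper `rsqrt_estimate` is a hypothetical stand-in for a hardware estimate instruction, and its roughly 11-bit precision is an assumption chosen for the example.

```python
import math


def rsqrt_estimate(x: float, bits: int = 12) -> float:
    """Hypothetical stand-in for a hardware rsqrt estimate.

    Computes 1/sqrt(x) exactly, then truncates the mantissa to `bits`
    fractional bits so the result has limited precision (about 11 bits here).
    """
    exact = 1.0 / math.sqrt(x)
    mantissa, exponent = math.frexp(exact)
    scale = 1 << bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)


def sqrt_newton_raphson(x: float, steps: int = 1) -> float:
    """Approximate sqrt(x) as x * rsqrt(x), refining rsqrt with N-R steps."""
    if x == 0.0:
        # Guard for zero: x * rsqrt(x) would otherwise be 0 * inf.
        return 0.0
    y = rsqrt_estimate(x)
    for _ in range(steps):
        # One N-R step for f(y) = 1/y^2 - x: y <- y * (1.5 - 0.5 * x * y^2)
        y = y * (1.5 - 0.5 * x * y * y)
    return x * y


if __name__ == "__main__":
    x = 2.0
    for steps in range(3):
        approx = sqrt_newton_raphson(x, steps)
        rel_err = abs(approx - math.sqrt(x)) / math.sqrt(x)
        print(f"steps={steps} approx={approx!r} rel_err={rel_err:.3e}")
```

Every step sits on the dependency chain of the previous one, so a second step raises accuracy but also lengthens the latency-bound case, which is why the step count affects how the figures above should be read.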

When we enabled the estimate generation code ( https://llvm.org/bugs/show_bug.cgi?id=21385#c32 ), we knew it had higher latency on SNB/IVB/HSW, but we reasoned that most real-world FP code would care more about throughput. This patch proposes to change that behavior for those targets (i.e., favor latency at the expense of throughput). Do you have any benchmark numbers (test-suite, SPEC, etc.) for those CPUs that show a difference?

For the test file, please add RUN lines that specify the new attributes themselves rather than a CPU. That way we'll have coverage for the expected behavior independent of any individual CPU.

http://reviews.llvm.org/D21379