[PATCH] D85709: [InstSimplify] Implement Instruction simplification for X/sqrt(X) to sqrt(X).

Venkataramanan Kumar via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Aug 11 09:42:34 PDT 2020


venkataramanan.kumar.llvm added a comment.

In D85709#2210125 <https://reviews.llvm.org/D85709#2210125>, @spatel wrote:

> In D85709#2210042 <https://reviews.llvm.org/D85709#2210042>, @cameron.mcinally wrote:
>
>> In D85709#2210034 <https://reviews.llvm.org/D85709#2210034>, @lebedev.ri wrote:
>>
>>> In D85709#2209979 <https://reviews.llvm.org/D85709#2209979>, @cameron.mcinally wrote:
>>>
>>>> I'm fairly sure this transform is a performance loss. For a target like Skylake Server, a SQRT(x) can take up to 20 cycles. But an RSQRT(x) is about 6 cycles and a MUL(y) is 4 cycles. We'd be better off with X*RSQRT(X).
>>>
>>> That is up to backends to decide. InstSimplify/InstCombine (and a few others) are canonicalization, target-independent passes.
>>> A single `sqrt(x)` is more canonical IR than `x/sqrt(x)`, because it is fewer instructions and `x` has fewer uses.
>>
>> I agree with that. It should be canonicalized. It would also be good to make sure that the backends have lowering code in place before introducing a 2x performance hit.
>
> I agree with both.
> Knowing that this is often a perf-critical pattern, we've put a lot of SDAG effort into optimization of it for x86 already, but it's a tricky problem because we get into a similar situation to the one we have with FMA. I.e., should we favor latency, throughput, or some combination?
> Here's an example to show that we are trying to make the right decision per-CPU in codegen:
> https://godbolt.org/z/s5GPqc
> Note that the codegen choice can differ between scalar/vector - example in the X86.td file:
>
>   // FeatureFastScalarFSQRT should be enabled if scalar FSQRT has shorter latency
>   // than the corresponding NR code. FeatureFastVectorFSQRT should be enabled if
>   // vector FSQRT has higher throughput than the corresponding NR code.
>   // The idea is that throughput bound code is likely to be vectorized, so for
>   // vectorized code we should care about the throughput of SQRT operations.
>   // But if the code is scalar that probably means that the code has some kind of
>   // dependency and we should care more about reducing the latency.

Agreed. "FeatureFastScalarFSQRT" can be removed if a target considers scalar FSQRT costly. I see it is currently set in "SKXTuning" (Skylake).


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85709/new/

https://reviews.llvm.org/D85709
