[PATCH] D85709: [InstSimplify] Implement Instruction simplification for X/sqrt(X) to sqrt(X).

Sanjay Patel via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Aug 11 07:53:52 PDT 2020


spatel added a comment.

In D85709#2210042 <https://reviews.llvm.org/D85709#2210042>, @cameron.mcinally wrote:

> In D85709#2210034 <https://reviews.llvm.org/D85709#2210034>, @lebedev.ri wrote:
>
>> In D85709#2209979 <https://reviews.llvm.org/D85709#2209979>, @cameron.mcinally wrote:
>>
>>> I'm fairly sure this transform is a performance loss. For a target like Skylake Server, a SQRT(x) can take up to 20 cycles. But a RSQRT(x) is about 6 cycles and a MUL(y) is 4 cycles. We'd be better off with a X*RSQRT(X).
>>
>> That is up to backends to decide. InstSimplify/InstCombine (and a few others) are canonicalization, target-independent passes.
>> A single `sqrt(x)` is more canonical IR than `x/sqrt(x)`, because it's less instructions and `x` has less uses.
>
> I agree with that. It should be canonicalized. It would also be good to make sure that the backends have lowering code in place before introducing a 2x performance hit.

I agree with both.
Knowing that this is often a perf-critical pattern, we've put a lot of SDAG effort into optimization of it for x86 already, but it's a tricky problem because we get into a similar situation that we have with FMA. Ie, should we favor latency, throughput, or some combo?
Here's an example to show that we are trying to make the right decision per-CPU in codegen:
https://godbolt.org/z/s5GPqc
Note that the codegen choice can differ between scalar/vector - example in the X86.td file:

  // FeatureFastScalarFSQRT should be enabled if scalar FSQRT has shorter latency
  // than the corresponding NR code. FeatureFastVectorFSQRT should be enabled if
  // vector FSQRT has higher throughput than the corresponding NR code.
  // The idea is that throughput bound code is likely to be vectorized, so for
  // vectorized code we should care about the throughput of SQRT operations.
  // But if the code is scalar that probably means that the code has some kind of
  // dependency and we should care more about reducing the latency.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85709/new/

https://reviews.llvm.org/D85709



More information about the llvm-commits mailing list