[LLVMdev] [RFC] How to fix sqrt vs llvm.sqrt optimization asymmetry

Mon Nov 11 13:45:18 PST 2013

Hi Hal, all.

I'm not sure why llvm.sqrt is 'special'.  Maybe because there is an SSE 
packed sqrt instruction (SQRTPS) but, AFAIK, no packed sin instruction, 
for example.

As mentioned in a recent mail to this list, I would like llvm.sqrt to be 
defined to return NaN for arguments x < 0.  I believe this would bring it 
more into line with the other intrinsics, and with the libm result, which 
is NaN for x < 0: 
http://pubs.opengroup.org/onlinepubs/007904975/functions/sqrt.html

Cheers,
     Nick

On 10/11/2013 3:36 p.m., Hal Finkel wrote:
> Hello everyone,
>
> The particular motivation for this e-mail is my desire for feedback on how to fix PR17758; but there is a core design issue here, so I'd like a wide audience.
>
> The underlying issue is that, because the semantics of llvm.sqrt are purposefully defined to be different from libm sqrt (unlike all of the other llvm.<libm function> intrinsics) (*), and because autovectorization relies on the vector forms of these intrinsics when vectorizing function calls to libm math functions, we cannot vectorize a libm sqrt() call into a vector llvm.sqrt call. However, in fast-math mode, we'd like to vectorize calls to sqrt, and so I modified Clang to emit calls to llvm.sqrt in fast-math mode for sqrt (and sqrt[fl]). This makes it similar to the libm pow and fma calls, which Clang always transforms into the llvm.pow and llvm.fma intrinsics.
>
> Here's the problem: There is an InstCombine optimization for sqrt (inside visitFPTrunc), and a bunch of optimizations inside SimplifyLibCalls that apply only to the sqrt libm call, and not to the intrinsics. The result, among other things, is PR17758, where fast-math mode actually produces slower code for non-vectorized sqrt calls.
>
> Some questions:
>
>   - Is the asymmetry between optimizations performed on libm calls and their corresponding llvm.<libm function> intrinsics intentional, or just due to a lack of motivation?
>
>   - Even if unintentional, is this asymmetry in any way desirable (for sqrt in particular, or in general)?
>
>   - I can refactor all existing optimizations to be libm-call vs. intrinsics agnostic, but is that the desired solution? If so, any advice on a particularly-nice way to do this would certainly be appreciated.
>
> For example, an alternative solution for PR17758 in particular would be to revert the Clang change, introduce a new intrinsic for sqrt that does match the libm semantics, and have vectorization use that when available.
>
> Another alternative is to revert the Clang change and make autovectorization of libm sqrt -> llvm.sqrt dependent on the NoNaNsFPMath TargetOptions flag (this requires directly exposing parts of TargetOptions to the IR level).
>
> I believe that both of these alternatives also require fixing the inliner to deal properly with fast-math attributes during LTO (unless I can just ignore this for now). This was the objection raised to the TargetOptions solution when I first brought it up.
>
> (*) According to the language reference, the specific difference is, "Unlike sqrt in libm, however, llvm.sqrt has undefined behavior for negative numbers other than -0.0 (which allows for better optimization, because there is no need to worry about errno being set). llvm.sqrt(-0.0) is defined to return -0.0 like IEEE sqrt."
>
> Thanks again,
> Hal
>