[llvm-dev] FENV_ACCESS and floating point LibFunc calls

Thu May 11 18:51:03 PDT 2017

Hi Michael,

My main concern is performance in default mode, so I see the FENV fix as
the side effect. :)

I looked at this very briefly before leaving my dev machine for the day. I
think we're going to need a cmov conversion pass that happens very late
(maybe this already exists but just isn't firing for the cases I looked
at). Trying to convert these any sooner will likely be too harmful to the
DAG. PowerPC has this functionality, so we should already have a model to
share or steal from.

On Thu, May 11, 2017 at 7:30 PM Michael Clark <michaeljclark at mac.com> wrote:

> Hi Sanjay,
>
> I can’t comment on the bug as I don’t have an LLVM bugzilla account. I’ve
> emailed bugs-admin to gain access to bugzilla so I can comment on the bug.
>
> I note that on your bug that you have stated that the branch is faster
> than the conditional move. Faster code is a side effect of the fix in this
> particular case. The real issue is as Andrew mentioned, it that the
> predicate logic; the ISD:FSUB in particular is causing FE_INEXACT accrued
> exception state for exact conversions.
>
> Here is some code that shows the error and has a workaround by forcing the
> compiler to create a branch by using inline asm for the conversion.
>
> https://godbolt.org/g/huIgK1
>
> The bug affects both float to u64 and double to u64 conversions.
>
> Expected output:
>
> $ clang  fcvt-bug.cc
> $ ./a.out
> 1 exact
> 1 inexact
> 1 exact
> 1 inexact
> 1 exact
> 1 inexact
> 1 exact
> 1 inexact
>
>
> Erroneous output:
>
> $ clang -DDISABLE_WORKAROUND fcvt-bug.cc
> $ ./a.out
> 1 exact
> 1 inexact
> 1 inexact
> 1 inexact
> 1 exact
> 1 inexact
> 1 inexact
> 1 inexact
>
>
> Michael.
>
> On 12 May 2017, at 11:36 AM, Sanjay Patel <spatel at rotateright.com> wrote:
>
> Thanks, Andy. I'm not sure how to solve that or my case given the DAG's
> basic-block limit. Probably CodeGenPrepare or SelectionDAGBuilder...or we
> wait until after isel and try to split it up in a machine instruction pass.
>
> I filed my example here:
> https://bugs.llvm.org/show_bug.cgi?id=33013
>
> Feel free to comment there and/or open a new bug for the FP_TO_UINT case.
>
> On Thu, May 11, 2017 at 4:20 PM, Kaylor, Andrew <andrew.kaylor at intel.com>
> wrote:
>
>> Hi Sanjay,
>>
>>
>>
>> The issue that Michael is talking about is that in
>> SelectionDAGLegalize::ExpandNode() we’re doing this:
>>
>>
>>
>>   case ISD::FP_TO_UINT: {
>>
>>     SDValue True, False;
>>
>>     EVT VT =  Node->getOperand(0).getValueType();
>>
>>     EVT NVT = Node->getValueType(0);
>>
>>     APFloat apf(DAG.EVTToAPFloatSemantics(VT),
>>
>>                 APInt::getNullValue(VT.getSizeInBits()));
>>
>>     APInt x = APInt::getSignMask(NVT.getSizeInBits());
>>
>>     (void)apf.convertFromAPInt(x, false, APFloat::rmNearestTiesToEven);
>>
>>     Tmp1 = DAG.getConstantFP(apf, dl, VT);
>>
>>     Tmp2 = DAG.getSetCC(dl, getSetCCResultType(VT),
>>
>>                         Node->getOperand(0),
>>
>>                         Tmp1, ISD::SETLT);
>>
>>     True = DAG.getNode(ISD::FP_TO_SINT, dl, NVT, Node->getOperand(0));
>>
>>     // TODO: Should any fast-math-flags be set for the FSUB?
>>
>>     False = DAG.getNode(ISD::FP_TO_SINT, dl, NVT,
>>
>>                         DAG.getNode(ISD::FSUB, dl, VT,
>>
>>                                     Node->getOperand(0), Tmp1));
>>
>>     False = DAG.getNode(ISD::XOR, dl, NVT, False,
>>
>>                         DAG.getConstant(x, dl, NVT));
>>
>>     Tmp1 = DAG.getSelect(dl, NVT, Tmp2, True, False);
>>
>>     Results.push_back(Tmp1);
>>
>>     break;
>>
>>   }
>>
>>
>>
>> This results in code that speculatively executes the FSUB and then uses a
>> cmov instruction (based on whether or not the original value was less than
>> LLONG_MAX) to select between the direct FP_TO_SINT result and the result of
>> the FSUB.  If we assume that for most data sets a significant majority of
>> values being converted this way will be less than LLONG_MAX then the
>> processor should be able to predict a branch here pretty reliably and so
>> branching code would be faster than code that does a speculative FSUB
>> followed by a cmov.
>>
>>
>>
>> The question is, how do we need to change this to get the back end to
>> generate a branch instead of a cmov.  I’m guessing that there are
>> circumstances in which the target-specific backend would make that decision
>> on its own and sink the FSUB, but for the X86 backend that doesn’t seem to
>> be happening in this case.  My inclination was to just rewrite the
>> ExpandNode() code above to explicitly create a branch, but I haven’t tried
>> it yet.
>>
>>
>>
>> As a side effect, which is what originally drew Michael’s attention to
>> this conversion, the INEXACT floating point flag is set even for exact
>> conversions as a result of the speculative FSUB.  That’s how this got
>> related to FENV_ACCESS support.
>>
>>
>>
>> I don’t know if there has been a bug opened for this issue.  There was an
>> earlier discussion here:
>> https://groups.google.com/forum/#!topic/llvm-dev/Tl9SD-edUJI
>>
>>
>>
>> -Andy
>>
>>
>>
>>
>>
>> *From:* Sanjay Patel [mailto:spatel at rotateright.com]
>> *Sent:* Thursday, May 11, 2017 2:06 PM
>> *To:* Kaylor, Andrew <andrew.kaylor at intel.com>
>> *Cc:* Michael Clark <michaeljclark at mac.com>; llvm-dev <
>> llvm-dev at lists.llvm.org>
>> *Subject:* Re: [llvm-dev] FENV_ACCESS and floating point LibFunc calls
>>
>>
>>
>> Sounds like the select lowering issue is definitely separate from the
>> FENV work.
>>
>> Is there a bug report with a C or IR example? You want to generate
>> compare and branch instead of a cmov for something like this?
>>
>>
>> int foo(float x) {
>>   if (x < 42.0f)
>>     return x;
>>   return 12;
>> }
>>
>> define i32 @foo(float %x) {
>>   %cmp = fcmp olt float %x, 4.200000e+01
>>   %conv = fptosi float %x to i32
>>   %ret = select i1 %cmp, i32 %conv, i32 12
>>   ret i32 %ret
>> }
>>
>> $ clang -O2 cmovfp.c -S -o -
>>     movss    LCPI0_0(%rip), %xmm1    ## xmm1 = mem[0],zero,zero,zero
>>     ucomiss    %xmm0, %xmm1
>>     cvttss2si    %xmm0, %ecx
>>     movl    $12, %eax
>>     cmoval    %ecx, %eax
>>     retq
>>
>>
>>
>> On Thu, May 11, 2017 at 1:28 PM, Kaylor, Andrew via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>> Hi Michael,
>>
>>
>>
>> To be honest I haven’t started working on FP to integer conversions for
>> FENV_ACCESS yet.
>>
>>
>>
>> To some extent I would consider the issue that you found independent of
>> what I’m doing to constrain FP behavior.  That is, I think we ought to make
>> the change you’re asking for even apart from the FENV_ACCESS work.  When I
>> get to the conversions for FENV_ACCESS support it may require some
>> additional constraints, but I think if the branching conversion is usually
>> faster (and it looks like it will be) then that should be the default
>> behavior.
>>
>>
>>
>> I’ll try to look into that.  I’d offer to give you advice on putting
>> together a patch, but I’m still learning my way around the ISel code
>> myself.  I think I know enough to figure out what to do but not enough to
>> tell someone else how to do it without a bunch of wrong turns.
>>
>>
>>
>> -Andy
>>
>>
>>
>>
>>
>> *From:* Michael Clark [mailto:michaeljclark at mac.com]
>> *Sent:* Wednesday, May 10, 2017 7:59 PM
>> *To:* Kaylor, Andrew <andrew.kaylor at intel.com>
>> *Cc:* llvm-dev <llvm-dev at lists.llvm.org>
>> *Subject:* FENV_ACCESS and floating point LibFunc calls
>>
>>
>>
>> Hi Andy,
>>
>>
>>
>> I’m interested to try out your patches…
>>
>>
>>
>> I understand the scope of FENV_ACCESS is relatively wide, however I’m
>> still curious if you managed to figure out how to prevent the
>> SelectionDAGLegalize::ExpandNode() FP_TO_UINT lowering of the FPToUI
>> intrinsic from producing the predicate logic that incorrectly sets the
>> floating point accrued exceptions due to unconditional execution of the
>> ISD::FSUB node attached to the SELECT node. It’s a little above my head to
>> try to solve this issue with my current understanding of LLVM but I could
>> give it a try. I’d need some guidance as to how the lowering of SELECT can
>> be controlled. i.e. where LLVM decides whether and how to lower a select
>> node as a branch vs predicate logic.
>>
>>
>>
>> I’d almost forgotten that we microbenchmarked this and found the
>> branching version is faster with regular input (< LLONG_MAX).
>>
>>
>>
>> - https://godbolt.org/g/ytgk7l
>>
>> All, Where does LLVM decide to lower select as predicate logic vs
>> branches and how does the cost model work? I’m curious about a tuning flag
>> to generate branches instead of computing both values and using conditional
>> moves…
>>
>>
>>
>> Best,
>>
>> Michael.
>>
>>
>>
>> On 11 May 2017, at 11:41 AM, via llvm-dev <llvm-dev at lists.llvm.org>
>> wrote:
>>
>>
>>
>> Hi all,
>>
>> Background
>> I've been working on adding the necessary support to LLVM for clang to be
>> able to support the STDC FENV_ACCESS pragma, which basically allows users
>> to modify rounding mode at runtime and depend on the value of
>> floating-point status flags or to unmask floating point exceptions without
>> unexpected side effects.  I've committed an initial patch (r293226) that
>> adds constrained intrinsics for the basic FP operations, and I have a patch
>> up for review now (https://reviews.llvm.org/D32319) that adds
>> constrained versions of a number of libm-like FP intrinsics.
>>
>> Current problem
>> Now I'm trying to make sure I have a good solution for the way in which
>> the optimizer handles recognized calls to libm functions (sqrt, pow, cos,
>> sin, etc.).  Basically, I need to prevent all passes from making any
>> modifications to these calls that would make assumptions about rounding
>> mode or improperly affect the FP status flags (either suppressing flags
>> that should be present or setting flags that should not be set).  For
>> instance, there are circumstances in which the optimizer will constant fold
>> a call to one of these functions if the value of the arguments are known at
>> compile time, but this constant folding generally assumes the default
>> rounding mode and if the library call would have set a status flag, I need
>> the flag to be set.
>>
>> Question
>> My question is, can/should I just rely on the front end setting the
>> "nobuiltin" attribute for the call site in any location where the FP
>> behavior needs to be restricted?
>>
>> Ideally, I would like to be able to conditionally enable optimizations
>> like constant folding if I am able to prove that the rounding mode, though
>> dynamic, is known for the callsite at compile time (the constrained
>> intrinsics have a mechanism to enable this), but at the moment I am more
>> concerned about correctness and would be willing to sacrifice optimizations
>> to get correct behavior.  Long term, I was thinking that maybe I could do
>> something like attach metadata to indicate rounding mode and exception
>> behavior when they were known.
>>
>> Is there a better way to do this?
>>
>> Thanks,
>> Andy
>>
>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170512/ce5f6bc1/attachment.html>