[cfe-dev] [llvm-dev] Floating point operations with specific rounding and exception properties

Wed Aug 21 09:46:27 PDT 2019

Thank you for sharing your experience. It seems that LICM may also be
unsafe.

> my intuition says that there are far fewer unsafe optimizations than there
> are safe optimizations.

...

> safe code running at the equivalent of -O0 is fairly useless

I believe this is the right viewpoint. There must be a way to get safety
without sacrificing performance.

Thanks,
--Serge

On Wed, Aug 21, 2019 at 9:57 PM Cameron McInally <cameron.mcinally at nyu.edu> wrote:

> On Tue, Aug 20, 2019 at 9:15 PM Serge Pavlov <sepavloff at gmail.com> wrote:
>
>> Which optimization did you find unsafe?
>>
>> Thanks,
>> --Serge
>>
>>
>> On Wed, Aug 21, 2019 at 5:12 AM Cameron McInally <cameron.mcinally at nyu.edu> wrote:
>>
>>> On Tue, Aug 20, 2019 at 1:02 PM Serge Pavlov via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> Hi all,
>>>>
>>>> During the review of https://reviews.llvm.org/D65997
>>>> an issue was revealed that relates to the decision of how the compiler should
>>>> represent constrained floating point operations.
>>>>
>>>> If a floating point operation requires a rounding mode or exception
>>>> behavior different from the default, it should be represented by a
>>>> constrained intrinsic (
>>>> http://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics).
>>>> An important point is that, according to the current design decision, if
>>>> any part of a function contains such an intrinsic, all floating point
>>>> operations in the function must be represented by constrained intrinsics as
>>>> well. This decision is meant to prevent undesired movement of fp operations.
>>>> The discussion is in the thread
>>>> http://lists.llvm.org/pipermail/cfe-dev/2017-August/055325.html,
>>>> the relevant example is:
>>>>
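>>>> #define _GNU_SOURCE  /* feenableexcept/fedisableexcept are glibc extensions */
>>>> #include <fenv.h>
>>>>
>>>> double f(double a, double b, double c) {
>>>>   double d;  /* declared here so it is still in scope at the return */
>>>>   {
>>>> #pragma STDC FENV_ACCESS ON
>>>>     feenableexcept(FE_OVERFLOW);
>>>>     d = a * b;  /* first fmul: evaluated with the overflow trap enabled */
>>>>     fedisableexcept(FE_OVERFLOW);
>>>>   }
>>>>   return c * d;  /* second fmul: evaluated with the trap disabled */
>>>> }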
>>>>
>>>>
>>>> The second fmul must not be hoisted above the fedisableexcept call.
>>>> Constrained intrinsics are expected to help in this case because
>>>> optimization passes do not transform them.
>>>>
>>>> The concern is that using constrained intrinsics in a small region of a
>>>> function forces their use everywhere in the function, including in
>>>> functions that inline it. Because constrained intrinsics inhibit
>>>> optimizations, this can result in performance degradation.
>>>>
>>>> A couple of examples:
>>>> 1. A performance-critical function performs most of its calculations
>>>> in the default fp mode, but at some points it enables fp exceptions
>>>> and performs an action that can trigger such an exception. Using
>>>> constrained intrinsics would result in performance loss, although the code
>>>> that actually needs them is very compact.
>>>> 2. Cores that are used for machine learning usually work with short
>>>> data (half, bfloat16 or even shorter). Rounding control in this case is
>>>> much more important than for big cores; using proper rounding in different
>>>> parts of an algorithm can improve precision. Constrained intrinsics are
>>>> the only way to enforce a particular rounding mode. However, using them
>>>> results in poor optimization, which is intolerable. In such cores the
>>>> rounding mode may be encoded in instructions, so code movement cannot
>>>> break semantics.
>>>>
>>>> The representation of fp operations could be more flexible, so that a user
>>>> would not pay for rounding/exception control with performance degradation.
>>>> For that we need to be able to mix constrained intrinsics and regular fp
>>>> operations in a function.
>>>>
>>>> The question is: how can we prevent fp operations from moving across the
>>>> boundaries of a region where specific rounding and/or exception behavior
>>>> is applied? Any ideas?
>>>>
>>>
>>> Okay, I'll bite...
>>>
>>> Preventing the hoisting of FP arithmetic was one of the driving factors
>>> in creating the constrained intrinsics. If we could solve that problem,
>>> then the constrained intrinsics would be *less* necessary (I say "less"
>>> since there are other problems, but hoisting is one of the significant
>>> ones).
>>>
>>> That said, our out-of-tree FPEnv mode attempts to do just that --
>>> selectively throttle unsafe optimizations. Barring any YDKWYDK's, I intend
>>> to blow the doors off of the constrained intrinsics, performance-wise. :P
>>>
>>>
> Oh, there are quite a lot. I mentioned Hoisting already. Constant Folding
> is a big one. InstCombine and DAGCombine have some issues, like preserving
> masks (op+select masks -- this may be less of a problem with true
> predication). The LoopVectorizer also needed proper masks (not just masked
> loads/stores) for targets that support them. APFloat has some issues (I'm
> intending to upstream fixes for signaling NaNs, if I ever have time). And a
> host of others.
>
> Stepping back a little, the goal of FPEnv-safe compilation is just that...
> to avoid unsafe FP transformations. The constrained intrinsics
> implementation seeks to prevent almost all FP optimizations at first, safe
> and unsafe, and then later add safe optimizations back in. My alternative
> implementation is to find and *very* selectively throttle unsafe
> optimizations -- my intuition says that there are far fewer unsafe
> optimizations than there are safe optimizations. I believe this is the much
> shorter path. So, the two competing implementations are really attacking
> the problem from two different ends. Who gets to the goal first is TBD...
>
> To be completely fair, is my alternative solution the best path for
> upstream LLVM? Maybe, maybe not. The constrained intrinsics will be far
> less buggy in the early stages, since essentially all optimizations are
> quashed. But in the same breath, safe code running at the equivalent of -O0
> is fairly useless (at least to our customers).
>