[PATCH] D20341: [CUDA] Enable fusing FP ops for CUDA by default.

Hal Finkel via cfe-commits cfe-commits at lists.llvm.org
Wed May 18 19:09:13 PDT 2016


hfinkel added a comment.

In http://reviews.llvm.org/D20341#432586, @jlebar wrote:

> > But people also don't expect IEEE compliance on GPUs
>
> Is that true?


Yes.

> You have a lot more experience with this than I do, but my observation of nvidia's hardware is that it's moved to add *more* IEEE compliance as it's matured.  For example, older hardware didn't support denormals, but newer chips do.  Surely that's in response to some users.


This is also true, but user expectations change slowly.

> One of our goals with CUDA in clang is to make device code as similar as possible to host code.  Throwing out IEEE compliance seems counter to that goal.
>
> I also don't see the bright line here.  Like, if we can FMA to our heart's content, where do we draw the line wrt IEEE compliance?  Do we turn on flush-denormals-to-zero by default?  Do we use approximate transcendental functions instead of the more accurate ones?  Do we assume floating point arithmetic is associative?  What is the principle that leads us to do FMAs but not these other optimizations?
>
> In addition, CUDA != GPUs.  Maybe this is something to turn on by default for NVPTX, although I'm still pretty uncomfortable with that.  Prior art in other compilers is interesting, but I think it's notable that clang doesn't do this for any other targets (afaict?) despite the fact that gcc does.
>
> The main argument I see for this is "nvcc does it, and people will think clang is slow if we don't".  That's maybe not a bad argument, but it makes me sad.  :(
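As an aside, those are mostly independent knobs rather than points on a single IEEE-compliance dial. Here is a rough sketch of my own (the flag names are from memory, so double-check them against your toolchain) showing how they are selected separately in CUDA:

  // Illustrative only: the relaxations mentioned above are controlled
  // separately.  Accurate vs. approximate transcendentals are even spelled
  // differently in CUDA source:
  __device__ float accurate(float x)    { return expf(x);   }  // libm-style call
  __device__ float approximate(float x) { return __expf(x); }  // fast hardware approximation
  // Denormal flushing and FMA contraction are compile-time choices, e.g.
  // nvcc's --ftz= and --fmad=, or clang's -fcuda-flush-denormals-to-zero and
  // -ffp-contract=; reassociation needs -ffast-math/-fassociative-math.
  // None of these implies the others.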




In http://reviews.llvm.org/D20341#433344, @tra wrote:

> I don't think using FMA throws away IEEE compliance.
>
> IEEE 754-2008 says:
>
> > A language standard should also define, and require implementations to provide, attributes that allow and disallow value-changing optimizations, separately or collectively, for a block. These optimizations might include, but are not limited to:
> >  ...
> >  ― Synthesis of a fusedMultiplyAdd operation from a multiplication and an addition
>
> It sounds like FMA use is up to the user/language, and the IEEE standard is fine with it either way.


That's correct. FMA formation is allowed, although the default, and how it's done, are unfortunately functions of many aspects of the programming environment (language, target platform, etc.).
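To make the effect concrete, here is a minimal sketch (mine, not part of the patch; the kernel name and values are made up) of why contraction is observable to users:

  #include <cstdio>
  #include <cmath>

  // With x == y, the unfused expression x*x - y*y is exactly 0.0f, while the
  // fused form computes exact(x*x) - rounded(y*y), i.e. the rounding error of
  // the squared term, which can be negative, so sqrtf() of such an expression
  // can turn into NaN once FMAs are formed.
  __global__ void contraction_demo(float x, float y, float *out) {
    out[0] = x * x - y * y;        // may itself be contracted under -ffp-contract=fast
    out[1] = fmaf(x, x, -(y * y)); // the explicitly fused form
  }

  int main() {
    float *d_out, h_out[2];
    cudaMalloc(&d_out, 2 * sizeof(float));
    contraction_demo<<<1, 1>>>(1.1f, 1.1f, d_out);
    cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("unfused: %g  fused: %g\n", h_out[0], h_out[1]);
    cudaFree(d_out);
    return 0;
  }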

> We need to establish what language standard we need to adhere to. The C++ standard itself does not seem to say much about FP precision or any particular FP format.
>
> The C11 standard (ISO/IEC 9899:201x draft, 7.12.2) says:
>
> > The default state (‘‘on’’ or ‘‘off’’) for the [FP_CONTRACT] pragma is implementation-defined.
>
> Nvidia has a fairly detailed description of their FP behavior:
> http://docs.nvidia.com/cuda/floating-point/index.html#fused-multiply-add-fma
>
> > The fused multiply-add operator on the GPU has high performance and increases the accuracy of computations. **No special flags or function calls are needed to gain this benefit in CUDA programs**. Understand that a hardware fused multiply-add operation is not yet available on the CPU, which can cause differences in numerical results.
>
> At the moment it's the most specific guideline I managed to find regarding the expected FP behavior applicable to CUDA.


I think this is the most important point. IEEE allows an implementation choice here, and users who already have working CUDA code have tested that code within that context. This is different from the host's choice (at least on x86), but users already expect this. There is a performance impact, but there's also a numerical impact, and I don't think we do our users any favors by differing from NVIDIA here.
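And contraction is not all-or-nothing for users who do need bit-for-bit control in a particular spot. As a sketch on my part (not something this patch adds): it can be disabled per translation unit with clang's -ffp-contract=off (or nvcc's --fmad=false), and per expression with CUDA's explicitly rounded intrinsics, which the CUDA documentation says are never merged into an FMA:

  // Illustrative helper, not from the patch: __fmul_rn/__fadd_rn are IEEE
  // round-to-nearest single operations that the CUDA docs guarantee will not
  // be contracted, so this stays mul/mul/add even under -ffp-contract=fast
  // (or nvcc's default --fmad=true).
  __device__ float diff_of_products_no_fma(float a, float b, float c, float d) {
    return __fadd_rn(__fmul_rn(a, b), -__fmul_rn(c, d));
  }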


http://reviews.llvm.org/D20341




