[PATCH] D113107: Support of expression granularity for _Float16.

Zahira Ammarguellat via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Wed Nov 17 12:37:37 PST 2021


zahiraam added a comment.

In D113107#3138415 <https://reviews.llvm.org/D113107#3138415>, @rjmccall wrote:

> In D113107#3137921 <https://reviews.llvm.org/D113107#3137921>, @zahiraam wrote:
>
>> In D113107#3136464 <https://reviews.llvm.org/D113107#3136464>, @rjmccall wrote:
>>
>>> Does GCC actually change the formal types of expressions to `float`, or is it doing this "no intermediate casts thing" as some sort of fp_contract-style custom emission of trees of expressions that involve `_Float16`?
>>>
>>> In any case, doing operation-by-operation emulation seems like the right first approach rather than starting by doing the less conformant thing.
>>
>> I have created another patch https://reviews.llvm.org/D114099 that does the first step.
>>
>> Not sure what you mean by "no intermediate casts thing".
>
> I think we keep dancing around this in this review, so let me go back and start from the basics.  There are four approaches I know of for evaluating a homogeneous `_Float16` expression like `a + b + c`:
>
> 1. You can perform each operation with normal `_Float16` semantics.  Ideally, you would have hardware support for this.  If that isn't available, you can emulate the operations in software.  It happens to be true that, for the operations (`+`, `-`, `*`, `/`, `sqrt`) on `_Float16`, this emulation can just involve converting to e.g. `float`, doing the operation, and immediately converting back.  The core property of this approach is that there are no detectable differences from hardware support.
>
> 2. As a slight twist on approach #1, you can ignore the differences between native `_Float16` and emulation with `float`; instead, you just always do arithmetic in `float`.  This potentially changes the result in some cases; e.g. Steve Canon tells me that FMAs on `float` avoid some rounding errors that FMAs on `_Float16` fall subject to.
>
> 3. Approaches #1 and #2 require a lot of intermediate conversions when hardware isn't available.  In our example, `a + b + c` has to be calculated as `(_Float16) ((float) (_Float16) ((float) a + (float) b) + (float) c)`, where the result of one addition is converted down and then converted back again.  You can avoid this by specifically recognizing this pattern and eliminating the conversion from sub-operations that happen to be of type `float`, so that in our example, `a + b + c` would be calculated as `(_Float16) ((float) a + (float) b + (float) c)`.  This is actually allowed by the C standard by default as a form of FP contraction; in fact, I believe C's rules for FP contraction were originally designed for exactly this kind of situation, except that it was emulating `float` with `double` on hardware that only provided arithmetic on the latter.  Obviously, this can change results.
>
> 4. The traditional language rule for `__fp16` is superficially similar to Approach #3 in terms of generated code, but it has some subtle differences in terms of the language.  `__fp16` is immediately promoted to `float` whenever it appears as an arithmetic operand.  What this means is that operations are performed in `float` but then not formally converted back (unless they're used in a context which requires a value of the original type, which entails a normal conversion, just as if you assigned a `double` into a `float` variable).  Thus, for example, `a + b + c` would actually have type `float`, not type `__fp16`.
>
> What this patch is doing to `_Float16` is approach #4, basically treating it like `__fp16`.  That is non-conformant, and it doesn't seem to be what GCC does.  You can see that quite clearly here: https://godbolt.org/z/55oaajoax
>
> What I believe GCC is doing (when not forbidden by `-fexcess-precision`) is approach #3: basically, FP contraction on expressions of `_Float16` type.

@rjmccall Thanks for the clarification; that's very helpful. This patch is indeed doing #4, but that was my understanding from @pengfei: doing for `_Float16`, when avx512fp16 is not available, what `__fp16` does for expressions. But really we want to follow GCC on this.
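To make the difference between approaches #1 and #3 concrete, here is a small emulation sketch (illustrative only, not the patch's code): Python's `struct` `'e'` format performs an IEEE-754 binary16 round-trip, standing in for a truncating conversion to `_Float16`, and Python floats stand in for the wider `float` arithmetic.

```python
import struct

def to_half(x: float) -> float:
    """Round x to IEEE-754 binary16 (round-to-nearest-even) and back.
    Stands in for a truncating conversion to _Float16."""
    return struct.unpack('e', struct.pack('e', x))[0]

a, b, c = 2048.0, 1.0, 1.0  # all exactly representable in binary16

# Approach #1: round back to _Float16 after every operation.
approach1 = to_half(to_half(a + b) + c)

# Approach #3 (contraction): evaluate the whole expression in float and
# round back to _Float16 only once at the end.
approach3 = to_half(a + b + c)

print(approach1)  # 2048.0 -- 2049.0 ties-to-even down to 2048.0, twice
print(approach3)  # 2050.0 -- 2050.0 is exactly representable in binary16
```

The two results differ (2048 vs. 2050), which is exactly why contraction is observable and why GCC gates it behind `-fexcess-precision`.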



================
Comment at: clang/lib/CodeGen/CGExprScalar.cpp:1315
+  if ((SrcType->isHalfType() || iSFloat16Allowed) &&
+      !CGF.getContext().getLangOpts().NativeHalfType) {
     // Cast to FP using the intrinsic if the half type itself isn't supported.
----------------
zahiraam wrote:
> pengfei wrote:
> > zahiraam wrote:
> > > zahiraam wrote:
> > > > pengfei wrote:
> > > > > rjmccall wrote:
> > > > > > pengfei wrote:
> > > > > > > rjmccall wrote:
> > > > > > > > pengfei wrote:
> > > > > > > > > rjmccall wrote:
> > > > > > > > > > Okay, this condition is pretty ridiculous to be repeating in three different places across the compiler.  Especially since you're going to change it when you implement the new option, right?
> > > > > > > > > > 
> > > > > > > > > > Can we state this condition more generally?  I'm not sure why this is so narrowly restricted, and the variable name isn't telling me anything, since `_Float16` must by definition be "allowed" if we have an expression of `_Float16` type.
> > > > > > > > > > since _Float16 must by definition be "allowed" if we have an expression of _Float16 type.
> > > > > > > > > 
> > > > > > > > > _Float16 is allowed only on a few targets. https://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point
> > > > > > > > > By the way, we should update the documentation for X86, since it's not limited to avx512fp16 now.
> > > > > > > > > _Float16 is allowed only on a few targets.
> > > > > > > > 
> > > > > > > > Yes, I know that.  But if `SrcType->isFloat16Type()` is true, we must be on one of those targets, because the type doesn't otherwise exist.
> > > > > > > I see your point now. The problem here is that we want to allow `_Float16` to be used more broadly, but the target doesn't always really support it. Currently, full arithmetic operations are supported only on targets with AVX512FP16.
> > > > > > > We should cast for those targets without AVX512FP16 while avoiding it on targets with AVX512FP16.
> > > > > > I agree that many targets don't natively support arithmetic on this format, but x86 is not the first one that does.  Unless I'm misunderstanding, we already track this property in Clang's TargetInfo as `hasLegalHalfType()`.  `+avx512fp16` presumably ought to set this.
> > > > > > 
> > > > > > I'm not sure what the interaction with the `NativeHalfType` LangOpt is supposed to be here.  My understanding is that that option is just supposed to affect `__fp16`, basically turning it into a proper arithmetic type, i.e. essentially `_Float16`.  Whatever effect you want to apply to `_Float16` should presumably happen even if that option not set.
> > > > > > 
> > > > > > More broadly, I don't think your approach in this patch is correct.  The type of operations on `_Float16` should not change based on whether the target natively supports `_Float16`.  If we need to emulate those operations on targets that don't provide them natively, we should do that at a lower level than the type system.
> > > > > > 
> > > > > > The most appropriate place to do that is going to depend on the exact semantics we want.
> > > > > > 
> > > > > > If we want to preserve `half` semantics exactly regardless of target, we should have Clang's IR generation actually emit `half` operations.  Targets that don't support those operations natively will have to lower at least some of those operations into compiler-rt calls, but that's not at all unprecedented.
> > > > > > 
> > > > > > If we're okay with playing loose for performance reasons, we can promote to `float` immediately around individual arithmetic operations.  IR generation is probably the most appropriate place to do that.  But I'm quite concerned about that making `_Float16` feel like an unpredictable/unportable type; it seems to me that software emulation is much better.
> > > > > > 
> > > > > > If you're proposing the latter, I think you need to raise that more widely than a code review; please make a post on llvm-dev.
> > > > > > we already track this property in Clang's TargetInfo as `hasLegalHalfType()`
> > > > > 
> > > > > That sounds like a good approach. Thank you.
> > > > > 
> > > > > > The type of operations on `_Float16` should not change based on whether the target natively supports `_Float16`. If we need to emulate those operations on targets that don't provide them natively, we should do that at a lower level than the type system.
> > > > > 
> > > > > Unfortunately, we can't do it at a lower level. The reason is (I'm not an expert in the frontend, just recalling from the last discussion with GCC folks) that we have to do expression emulation to respect C/C++ semantics. GCC has the option `-fexcess-precision=16` to match the same result as native instructions, but the default is `-fexcess-precision=fast`, per the language semantics.
> > > > > 
> > > > > > The most appropriate place to do that is going to depend on the exact semantics we want...
> > > > > 
> > > > > Note, we are not simply doing emulation in the frontend. It's the backend's responsibility to emulate a single `half` operation, but it's the frontend's responsibility to choose whether to emit several `half` operations or to emit promote + several `float` operations + truncate. As described in the title, this patch does the latter.
> > > > > Okay, this condition is pretty ridiculous to be repeating in three different places across the compiler.  Especially since you're going to change it when you implement the new option, right?
> > > > > 
> > > > > Can we state this condition more generally?  I'm not sure why this is so narrowly restricted, and the variable name isn't telling me anything, since `_Float16` must by definition be "allowed" if we have an expression of `_Float16` type.
> > > > 
> > > > I agree  :)  I wasn't aware of this flag. I am still in the process of figuring out how `_Float16` works and what exactly is required for it.
> > > > Yes, maybe having 2 separate patches is the right way to go: one that turns `_Float16` on for x86 without the +avx512fp16 feature, and another that does the emulation (half to float -> float arithmetic -> truncation to half). @pengfei?
> > > 
> > > The _Float16 type is supported for both C and C++, on x86 systems with SSE2 enabled. 
> > > From https://gcc.gnu.org/onlinedocs/gcc/Half-Precision.html:
> > > "On x86 targets with SSE2 enabled, **without -mavx512fp16, all operations will be emulated by software emulation** and the float instructions. The default behavior for FLT_EVAL_METHOD is to keep the intermediate result of the operation as 32-bit precision. This may lead to inconsistent behavior between software emulation and AVX512-FP16 instructions. Using -fexcess-precision=16 will force round back after each operation.
> > > 
> > > Using -mavx512fp16 will generate AVX512-FP16 instructions instead of software emulation. The default behavior of FLT_EVAL_METHOD is to round after each operation. The same is true with -fexcess-precision=standard and -mfpmath=sse. If there is no -mfpmath=sse, -fexcess-precision=standard alone does the same thing as before, It is useful for code that does not have _Float16 and runs on the x87 FPU."
> > > This is what we are trying to achieve with this patch.
> > > 
> > I planned to enable `_Float16` and emulation in one patch, i.e., D107082. It's almost all backend work. The only thing I can't do in that patch is emulation at C/C++ expression granularity, which has to leverage FE facilities. That's all I expected for this patch.
> > I'm ashamed I haven't finished it yet and have left confusion here :(
> @pengfei I propose to refactor the code in this patch and in D107082 in order to create the first patch. Then, when that's done, we can work on the emulation?
Please see https://reviews.llvm.org/D114099 and review. Thanks.
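For completeness, the observable effect of approach #4 (the `__fp16` promotion rule described above, which this patch currently implements for `_Float16`) can be sketched the same way. This is an illustrative emulation only: the Python `struct` `'e'` round-trip stands in for a `_Float16` truncation, and unrounded Python floats stand in for values of type `float`.

```python
import struct

def to_half(x: float) -> float:
    # Round to IEEE-754 binary16 and back; stands in for a (_Float16) truncation.
    return struct.unpack('e', struct.pack('e', x))[0]

a, b, c = 2048.0, 0.5, 0.5  # all exactly representable in binary16

# Approach #3: evaluate in float, truncate the full expression once;
# 2049.0 ties-to-even down to 2048.0 in binary16.
approach3 = to_half(a + b + c)

# Approach #4 (__fp16 rule): operands promote to float and the result
# *stays* float; no truncation happens unless the value is stored back
# into a variable of the half type.
approach4 = a + b + c

print(approach3)  # 2048.0
print(approach4)  # 2049.0 -- formally of type float, so the extra bit survives
```

This is why approach #4 is non-conformant for `_Float16`: the expression `a + b + c` no longer has the type the standard requires, and the difference is visible whenever the unrounded result is used directly.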


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D113107/new/

https://reviews.llvm.org/D113107


