[PATCH] D136176: Implement support for option 'fexcess-precision'.

Wed Dec 14 16:43:48 PST 2022

andrew.w.kaylor added inline comments.

================
Comment at: clang/test/CodeGen/X86/fexcess-precision.c:89
+// CHECK-EXT-NEXT:    [[EXT1:%.*]] = fpext half [[TMP1]] to float
+// CHECK-EXT-NEXT:    [[MUL:%.*]] = fmul float [[EXT]], [[EXT1]]
+// CHECK-EXT-NEXT:    [[TMP2:%.*]] = load half, ptr [[C_ADDR]]
----------------
I apologize for taking so long to get back to this. I'm not sure how to get this comment to show up as a response to the questions @rjmccall asked about my earlier remarks here. Apologies if this ends up being redundant.

I said:

> This is not what I'd expect to see. I'd expect the operations to use the half type with explicit truncation inserted where needed.

@rjmccall said:

> Are you suggesting that the frontend emit half operations normally, with some intrinsic to force half precision on casts and assignments, and that a backend pass would aggressively promote operations between those intrinsics? I think that would be a pretty error-prone representation, both in terms of guaranteeing the use of excess precision in some situations (and thus getting stable behavior across compiler releases) and guaranteeing truncation in others (and thus preserving correctness). The frontend would have to carefully emit intrinsics in a bunch of places or else default to introducing a bug.

Maybe what I'm asking is that -fexcess-precision=fast not be an alias of -fexcess-precision=standard. The idea behind 'fast', as I understand it, is to allow the compiler to generate whatever instructions are most efficient for the target. If the target supports native fp16 operations, the fastest code will use those instructions and not use excess precision. If the target does not support fp16, the fastest code would perform the calculations at 32-bit precision and truncate only on casts and assignments (or, even faster, not even then). You can see an analogy for what I mean by looking at what gcc does with 32-bit floats when SSE is disabled: https://godbolt.org/z/xhEGbjG4G

With -fexcess-precision=fast, gcc takes liberties to make the code run as fast as possible. With -fexcess-precision=standard, it truncates to the source type on assignments, casts, and returns.
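
To make the semantic difference concrete, here is a rough sketch of how excess precision can change a result. It is not taken from the patch or from gcc. It uses float/double as a stand-in for half/float because _Float16 is not available on every compiler, the function names are hypothetical, and it assumes IEEE 754 binary32/binary64 with round-to-nearest-even.

```c
/* Intermediates are kept in a wider type (the excess precision) and the
 * value is truncated to the source type only at the return. */
float mul_add_excess(float a, float b, float c) {
    double wide = (double)a * (double)b + (double)c;
    return (float)wide;
}

/* Every intermediate is rounded to the source type. The volatile stores
 * force that rounding and also keep the compiler from fusing the two
 * operations into a single multiply-add. */
float mul_add_strict(float a, float b, float c) {
    volatile float prod = a * b;
    volatile float sum = prod + c;
    return sum;
}
```

Under those assumptions, calling both with a = b = 1.0f + 0x1p-12f and c = -(1.0f + 0x1p-11f) gives 0x1p-24f from mul_add_excess and 0.0f from mul_add_strict: the exact product 1 + 2^-11 + 2^-24 does not fit in a float and rounds to 1 + 2^-11 before the addition in the strict version.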

If you generate code using the half type here, the backend will legalize it and **should** make it as fast as possible. In fact, it looks like currently the X86 backend will insert calls to extend and truncate the values to maintain the semantics of the IR (https://godbolt.org/z/YGnj4cqvv). That's sensible, but it's not what the X86 backend does in the 32-bit float + no SSE case.

The main thing I'd want to say here is that the front end has no business trying to decide what will be fastest for the target and encoding that. At most, the front end should be encoding the semantics in the source and setting some flag in the IR to indicate the excess precision handling that's allowed. This feels to me like it requires a new fast-math flag, maybe 'axp' = "allow excess precision".

The case where the front end is encoding excess precision into the IR feels much more like -ffp-eval-method. When the front end encodes an fpext to float and does a calculation, the optimizer is forced to respect that unless it can prove that doing the calculation at lower precision won't change the result (and it rarely does that). For targets that have native FP16 support, this will mean that -fexcess-precision=fast results in slower code.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136176/new/

https://reviews.llvm.org/D136176