[PATCH] D136176: Implement support for option 'fexcess-precision'.

John McCall via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu Dec 15 14:53:40 PST 2022


rjmccall added inline comments.


================
Comment at: clang/test/CodeGen/X86/fexcess-precision.c:89
+// CHECK-EXT-NEXT:    [[EXT1:%.*]] = fpext half [[TMP1]] to float
+// CHECK-EXT-NEXT:    [[MUL:%.*]] = fmul float [[EXT]], [[EXT1]]
+// CHECK-EXT-NEXT:    [[TMP2:%.*]] = load half, ptr [[C_ADDR]]
----------------
andrew.w.kaylor wrote:
> pengfei wrote:
> > andrew.w.kaylor wrote:
> > > I apologize for taking so long to get back to this. I'm not sure how to get this comment to show up as a response to the questions @rjmccall asked about my earlier remarks here. Apologies if this ends up being redundant.
> > > 
> > > I said:
> > > 
> > > > This is not what I'd expect to see. I'd expect the operations to use the half type with explicit truncation inserted where needed.
> > > 
> > > 
> > > @rjmccall said:
> > > 
> > > > Are you suggesting that the frontend emit half operations normally, with some intrinsic to force half precision on casts and assignments, and that a backend pass would aggressively promote operations between those intrinsics? I think that would be a pretty error-prone representation, both in terms of guaranteeing the use of excess precision in some situations (and thus getting stable behavior across compiler releases) and guaranteeing truncation in others (and thus preserving correctness). The frontend would have to carefully emit intrinsics in a bunch of places or else default to introducing a bug.
> > > 
> > > 
> > > Maybe I'm asking that -fexcess-precision=fast not be an alias of -fexcess-precision=standard. The idea behind 'fast' as I understand it is to allow the compiler to generate whatever instructions are most efficient for the target. If the target supports native fp16 operations, the fastest code will use those instructions and not use excess precision. If the target does not support fp16, the fastest code would be to perform the calculations at 32-bit precision and only truncate on casts and assignments (or, even faster, not even then). You can see an analogy for what I mean by looking at what gcc does with 32-bit floats when SSE is disabled: https://godbolt.org/z/xhEGbjG4G
> > > 
> > > With -fexcess-precision=fast, gcc takes liberties to make the code run as fast as possible. With -fexcess-precision=standard, it truncates to the source value on assignments, casts, or return.
> > > 
> > > If you generate code using the half type here, the backend will legalize it and **should** make it as fast as possible. In fact, it looks like currently the X86 backend will insert calls to extend and truncate the values to maintain the semantics of the IR (https://godbolt.org/z/YGnj4cqvv). That's sensible, but it's not what the X86 backend does in the 32-bit float + no SSE case.
> > > 
> > > The main thing I'd want to say here is that the front end has no business trying to decide what will be fastest for the target and encoding that. At most, the front end should be encoding the semantics in the source and setting some flag in the IR to indicate the excess precision handling that's allowed. This feels to me like it requires a new fast-math flag, maybe 'axp' = "allow excess precision".
> > > 
> > > The case where the front end is encoding excess precision into the IR feels much more like -ffp-eval-method. When the front end encodes an fpext to float and does a calculation, the optimizer is forced to respect that unless it can prove that doing the calculation at lower precision won't change the result (and it rarely does that). For targets that have native FP16 support, this will mean that -fexcess-precision=fast results in slower code.
> > > For targets that have native FP16 support, this will mean that -fexcess-precision=fast results in slower code.
> > 
> > This won't be true. We check `TI.hasLegalHalfType()` before doing excess precision. It may be better to add a RUN line for the `avx512fp16` case to address the concern.
> > 
> > > At most, the front end should be encoding the semantics in the source and setting some flag in the IR to indicate the excess precision handling that's allowed.
> > 
> > I think encoding excess precision into the IR is the easy way to convey where the casts and assignments are. Otherwise, we would have to use intrinsics to pass that information to the backend which, as @rjmccall explained, is more error-prone.
> > 
> > > If you generate code using the half type here, the backend will legalize it and should make it as fast as possible.
> > 
> > I think the current backend implementations in both GCC and LLVM already try to make it as fast as possible, subject to not making the code too complicated.
> > The behavior in the 32-bit float + no SSE case benefits from the x87 feature of always computing with excess precision; in a sense it takes a bug in the IR semantics and treats it as a feature.
> > That said, it is feasible to mimic the same x87-style fast behavior for fp16, but it would take extra work in the backend to globally promote all half types to float just to get that "bug" behavior. I don't think either GCC or we have any interest in that.
> > 
> > In conclusion, I think the only thing the 32-bit float + no SSE case and the fp16 case have in common is that codegen (whether in the FE or the BE) needs to respect the boundaries of casts and assignments under certain conditions. I think the current FE implementation is the best way to solve the problem (without extra BE work).
> > This won't be true. We check TI.hasLegalHalfType() before doing excess precision. It may be better to add a RUN line for the avx512fp16 case to address the concern.
> 
> Oh, I see. I had missed that because I was focusing on the test to understand the expected output. If the front end won't be generating the extend and truncate instructions when the target has a legal half type, that takes care of my primary objection. I'm still not entirely convinced that the front end should be making this decision, but it is a reasonable implementation.
> 
> 
> 
> > The behavior in 32-bit float + no SSE case is benefited from x87 feature that always does excess precision. It is somehow taking bugs as feature from the semantics of the IR.
> > That says, it is feasible to mock the same behavior of x87 special fast behavior for fp16. It needs extra work in the backend to globally promote all half types to float to just get some "bug" behavior. I don't think GCC or us have interest with it.
> 
> I'm not quite sure what you're saying here. I don't consider the fact that we're generating x87 instructions in the 32-bit float + no SSE case to be a bug. That's the behavior most people want. If you're saying that ignoring the IR semantics is a bug even though the result is what we intend, I would agree with that.
> 
> I don't understand why gcc doesn't maintain the same distinction between -fexcess-precision=fast and -fexcess-precision=standard for 16-bit types that they do for 32-bit types, but I don't suppose that's critical.
This is an unfortunate situation where the fast thing to do depends on the target capabilities, and misunderstanding those capabilities in either direction can lead to performance problems.

On the one hand, if the target supports native FP16 operations, you are absolutely right that emitting an excess-precision code pattern will interfere with optimal use of hardware.  The backend can turn `trunc(add.fp32(ext(a), ext(b)))` back into `add.fp16(a,b)`, but it cannot turn `trunc(add.fp32(add.fp32(ext(a), ext(b)), ext(c)))` back into `add.fp16(add.fp16(a,b), c)`, which means it will be stuck with a lot of extra FP16 truncations and extensions (as well as FP32 math).  That is why excess precision is only enabled when the target does not support FP16.  (Of course, this decision is made statically based on whether the *minimum CPU target* supports FP16, not dynamically based on what the actual execution-time CPU can do.)
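To see why the single-operation rewrite is always safe, note that float carries more than twice half's significand bits, so rounding the wide result once more to half matches the native half operation exactly. A small Python sketch (not part of the patch; it uses `struct`'s IEEE half/float codecs, with Python's doubles standing in for exact arithmetic) checks this empirically:

```python
import random
import struct

def rnd(fmt, x):
    # Round the double x to IEEE half ('e') or float ('f'), then widen back.
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

random.seed(0)
for _ in range(100_000):
    a = rnd('e', random.uniform(-1000.0, 1000.0))
    b = rnd('e', random.uniform(-1000.0, 1000.0))
    # The double sum of two halves in this range is exact, so rounding it
    # once to half gives the correctly rounded native half add.
    native = rnd('e', a + b)
    # Excess-precision pattern: round to float first, then truncate to half.
    excess = rnd('e', rnd('f', a + b))
    assert native == excess  # trunc(add.fp32(ext,ext)) == add.fp16 here
```

The same double-rounding argument does not extend to chains of operations, which is exactly the problem with the nested pattern above.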

On the other hand, if the target *doesn't* support native FP16 operations, *not* emitting an excess-precision code pattern will interfere with optimal use of hardware.  The backend can turn `add.fp16(a,b)` into `trunc(add.fp32(ext(a), ext(b)))`, but it cannot turn `add.fp16(add.fp16(a,b), c)` into `trunc(add.fp32(add.fp32(ext(a), ext(b)), ext(c)))` without explicit permission.  Instead, it must emit `trunc(add.fp32(ext(trunc(add.fp32(ext(a), ext(b)))), ext(c)))`, which means it will *also* be stuck doing a lot of extra FP16 truncations and extensions.
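For a concrete feel of why that intermediate truncation is semantically load-bearing (and cannot simply be dropped under `-fexcess-precision=standard`), here is a small Python sketch, not from the patch, using `struct`'s IEEE half codec with doubles standing in for the widened computation:

```python
import struct

def to_half(x):
    # Round the double x to the nearest IEEE half value.
    return struct.unpack('e', struct.pack('e', x))[0]

a, b, c = 2048.0, 1.0, 1.0   # all exactly representable in half

# Strict half arithmetic: round after every operation.  The spacing of
# half values at 2048 is 2, so 2048 + 1 rounds back down to 2048, twice.
strict = to_half(to_half(a + b) + c)

# Excess precision: compute wide, truncate once at the expression boundary.
excess = to_half(a + b + c)

print(strict, excess)   # 2048.0 2050.0
```

Dropping the intermediate truncation changes the observable result from 2048 to 2050, which is why eliminating it has to be an explicitly permitted mode rather than a free optimization.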

Now, if you're willing to do `-fexcess-precision=fast`, you can absolutely make the backend eliminate that intermediate truncation and extension.  And I would agree that if we want to support that mode, we should probably emit FP16 operations out of the frontend instead of the excess-precision pattern that we emit under `-fexcess-precision=standard`.  But currently we do not support that mode, and I do think the most reliable and least error-prone way of implementing `-fexcess-precision=standard` is what we currently do in the frontend.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136176/new/

https://reviews.llvm.org/D136176


