[PATCH] D66092: [CodeGen] Generate constrained fp intrinsics depending on FPOptions

Serge Pavlov via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu Aug 15 23:00:07 PDT 2019

sepavloff added a comment.

> The thing that makes the IR semantically incomplete is that there is nothing there to prevent incorrect code motion of the non-constrained operations. Consider this case:
>   if (someCondition) {
>     #pragma clang fp rounding(downward)
>     fesetround(FE_DOWNWARD);
>     x = y/z;
>     fesetround(FE_TONEAREST);
>   }
>   a = b/c;
> If you generate a regular fdiv instruction for the 'a = b/c;' statement, there is nothing that would prevent it from being hoisted above the call to fesetround() and so it might be rounded incorrectly.

This is a good example, as it demonstrates the intended usage of the pragma: there is a big program in which some small pieces must be executed in a special way. Some notes:

- The user expects that a small change confined to a selected block is local: it does not affect the code outside the block. The specification of the pragma ensures exactly that. If the change affects the entire function (and possibly other functions that use it), it feels wrong.
- The pragma usage in the example is different from what is intended. The purpose of `#pragma clang fp rounding` is to model the C2x `#pragma STDC FENV_ROUND` (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2347.pdf, 7.6.2). Such a pragma would set the rounding mode at the beginning of the block and restore the previous state at the end. That is, the code should look like:

  if (someCondition) {
    #pragma clang fp rounding(downward)
    x = y/z;
  }
  a = b/c;

- What is the issue with moving `a = b/c`? If it moves ahead of the `if` statement, that seems OK, because the rounding mode is the same at that point. It cannot be moved inside the block (where the rounding mode is different), because that would break the semantics. We could consider another example:

  for (i = …) {
    #pragma clang fp rounding(downward)
    a[i] = x/y;
  }

If `x` and `y` are loop invariants, `x/y` could be hoisted out of the loop. However, at the IR level it would be moved as a constrained intrinsic, so the semantics would be preserved.
The issue arises only when an expression is moved into a block where a specific rounding mode is in effect. Something like this:

  z = x*y;
  for (i = …) {
    #pragma clang fp rounding(downward)
    a[i] += z;
  }

Suppose that for some reason `z = x*y` is moved into the loop. In such cases the node that comes from outside the block must be transformed.

- There must be more than one way to prevent undesirable moves. For instance, the `fence` node could be extended so that it prevents floating-point operations from moving across it; such fences could then delimit a region in which a specific floating-point environment is in effect.

> In D66092#1630997 <https://reviews.llvm.org/D66092#1630997>, @sepavloff wrote:
>> Another issue is non-standard rounding. It can be represented by constrained intrinsics only. The rounding does not require restrictions on code motion, so mixture of constrained and unconstrained operation is OK. Replacement of all operations with constrained intrinsics would give poorly optimized code, because compiler does not optimize them. It would be a bad thing if a user adds the pragma to execute a statement with specific rounding mode and loses optimization.
> I agree that loss of optimization would be a bad thing, but I think it's unavoidable. By using non-default rounding modes the user is implicitly accepting some loss of optimization. This may be more than they would have expected, but I can't see any way around it.

Nowadays there are many architectures designed for machine learning tasks. They usually operate on short data types (half, bfloat16, etc.), in which precision is relatively low. Rounding control in this case is much more important than on big cores. Kernel writers do fancy things, using appropriate rounding modes for different pieces of code to gain accuracy. Such processors may encode the rounding mode in their instructions, so the cost of using a specific rounding mode is zero. Loss of performance in this use case is not excusable.

In any case, the impact on performance must be minimized.

  rG LLVM Github Monorepo


