[PATCH] D26708: Fix -f[no-]reciprocal-math -ffast-math interaction, including LTO

Tue Nov 15 15:22:10 PST 2016

wristow created this revision.
wristow added reviewers: spatel, hfinkel.
wristow added a subscriber: llvm-commits.
Herald added a reviewer: tstellarAMD.
Herald added subscribers: mehdi_amini, nhaehnle, nemanjai.

There are some inconsistencies in the handling of fast-math-flags that are
(a) preventing the selective disabling of individual fast-math features and
(b) keeping some features from working properly with LTO.
The proposed change here fixes an immediate problem where reciprocal-math
isn't handled correctly.  This is one small step in handling these issues
that happen for some of the other fast-math-flags as well.

Here is a simple test-case that illustrates the problem for the
reciprocal-math situation.

  extern void use(float x, float y);

  void test(float a, float b, float c)
  {
    float q1 = a / c;
    float q2 = b / c;
    use(q1, q2);
  }

Without -ffast-math, two divisions will be done, and with -ffast-math only
one division will happen, since this will be transformed into:

  float tmp = 1.0f / c;
  float q1 = a * tmp;
  float q2 = b * tmp;
  use(q1, q2);

The bug is that with `-ffast-math -fno-reciprocal-math`, this
reciprocal-transformation is **not** suppressed.

____________________________________________________

tl;dr

The situation is that passing `-ffast-math` on the command-line, results in
passing the following 6 lower-level flags to cc1:

  -menable-no-infs
  -menable-no-nans
  -fno-signed-zeros
  -freciprocal-math
  -fno-trapping-math
  -ffp-contract=fast

and also still passing `-ffast-math` itself to cc1 (the act of passing
`-ffast-math` to cc1 results in the macro `__FAST_MATH__` being defined).

These low level flags can be disabled individually.  As an aside, when
`-ffast-math` is used, and a certain subset of the above are not disabled,
then the switch:

  -menable-unsafe-fp-math

is also passed to cc1.

Ultimately, even when `-fno-reciprocal-math` is passed on the command-line,
the fact that `-ffast-math` is still passed to cc1 ends up setting the flag
`UnsafeAlgebra` in LLVM, which ends up over-riding the user's request to
suppress the reciprocal-math transformation.  The code-change here is a
fairly simple one to deal with this.

Prior to this change, the reciprocal transformations are enabled when
either the `fast` or `arcp` IR-level flags are on.  The philosophy of the
approach taken here is to only enable reciprocal transformations when
`arcp` is on.  To put it another way, rather than an "umbrella" flag such
as `fast` being checked in the back-end (along with an individual flag like
`arcp`), it seems to me that just checking the individual flag expresses
the need more cleanly.  Any fast-math-related transformation that doesn't
have an individual flag (e.g., re-association currently doesn't), should
eventually have an individual flag defined for it, and then that individual
flag should be checked.  In the end, if `-ffast-math` sets the 6
lower-level flags described above, then the equivalent setting of the
individual flags should be equivalent.  That is, ultimately, the following
2 user-commands should produce the same code:

  clang -c -O2 -ffast-math foo.c
  clang -c -O2 -D__FAST_MATH__ -fno-honor-infinities -fno-honor-nans -fno-signed-zeros -freciprocal-math -fno-trapping-math -ffp-contract=fast foo.c

and this proposed change is a small step in that direction.

This is my first venture into this area of LLVM, and I may be
misunderstanding some of the bigger-picture aspects.  Maybe the approach of
controlling the reciprocal transformation strictly by `arcp` (rather than
`arcp` OR `fast`) is counter to some other expectations.  In which case,
I'll be happy to learn more about how this is intended to work.

Related to this being my first venture into this area, although this fixes
the immediate problem, even with it there are some lurking issues (possibly
in Clang rather than LLVM, or maybe in both, but I'm not sure).
Specifically, if the above test-case (that computes the two quotients `q1`
and `q2`) is compiled with `-ffast-math` producing a .ll file, then there
is no indication in the .ll file that the reciprocal transformation is
enabled.  That is, the `arcp` flag is not on the division instructions
(although the `fast` flag is, as expected -- but in this model I'm
describing, the `fast` flag is not to be checked for this).  Continuing to
process that .ll file through to an assembly file shows two divisions
happening, confirming that (incorrectly) the reciprocal-transformation
does **not** happen.  For example:

  $ clang -S -o test_via_ll.ll -emit-llvm -O2 -ffast-math test.c
  $ llc -o test_via_ll.s test_via_ll.ll  # via .ll: -ffast-math did not get the job done
  $ grep div test_via_ll.s
          divss   %xmm2, %xmm0
          divss   %xmm2, %xmm1
  $

Whereas doing the same compilation via a .bc file **does** honor the request
for doing the reciprocal transformation:

  $ clang -c -o test_via_bc.bc -emit-llvm -O2 -ffast-math test.c
  $ llc -o test_via_bc.s test_via_bc.bc
  $ grep div test_via_bc.s               # via .bc: -ffast-math did get the job done
          divss   %xmm2, %xmm3
  $

To be clear, without this proposed change, the approach via the .ll file
does correctly do the reciprocal transformation (as does the .bc approach).
And with my proposed change, the .bc approach continues to work (and the
bug that's the whole point of this patch is fixed), but the .ll approach shown
above fails.  Manually adding the `arcp` flag to test_via_ll.s does result
in the reciprocal transformation being done with the updated compiler, as
expected.

I don't think this .ll issue causes any problems when going through normal
compilations (that is, when producing object files or bitcode files), but
it clearly is trouble when producing .ll files.

https://reviews.llvm.org/D26708

Files:
  include/llvm/IR/Operator.h
  lib/CodeGen/SelectionDAG/DAGCombiner.cpp
  test/CodeGen/AArch64/fdiv-combine.ll
  test/CodeGen/AMDGPU/fdiv.ll
  test/CodeGen/PowerPC/fdiv-combine.ll
  test/CodeGen/X86/fdiv-combine.ll
  test/LTO/X86/Inputs/fast-with-recip.ll
  test/LTO/X86/Inputs/fast-without-recip.ll
  test/LTO/X86/fast-recip.ll
  test/Transforms/InstCombine/fast-math.ll
  test/Transforms/SLPVectorizer/X86/propagate_ir_flags.ll
  unittests/IR/IRBuilderTest.cpp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D26708.78085.patch
Type: text/x-patch
Size: 9357 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20161115/df71a300/attachment.bin>