[llvm-dev] Trouble when suppressing a portion of fast-math-transformations

Thu Sep 28 17:56:01 PDT 2017

Hi all,

In a mailing-list post last November:
  http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html

I raised some concerns that having the IR-level fast-math-flag 'fast' act as an
"umbrella" that implicitly turns on all the lower-level fast-math-flags causes
some fundamental problems.  Those problems arise in situations where a user
wants to disable a portion of the fast-math behavior.
For example, to enable all the fast-math transformations except for the
reciprocal-math transformation, a command like the following is what a user
would expect to work:

  clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp

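But that is not what this command does.  Depending on the compiler version,
the '-fno-reciprocal-math' switch is either ignored, or it disables far more
than just the reciprocal transformation.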

I believe this is a serious problem, but I also want to avoid over-stating the
seriousness.  To be explicit, the problems I'm describing here happen when
'-ffast-math' is used with one or more of the underlying fast-math-related
aspects _disabled_ (like the '-fno-reciprocal-math' example, above).
Conversely, when '-ffast-math' is used "on its own", the situation is fine.
For terminology here, I'll refer to these underlying fast-math-related aspects
(like reciprocal-math, associative-math, math-errno, and others) as
"sub-fast-math" aspects.

I apologize for the length of this post.  I'm putting the summary up front, so
that anyone interested in fast-math issues can quickly get the big-picture of
the issues I'm describing here.

In Summary:

1.  With the change of r297837, the driver now more cleanly handles
    '-ffast-math', and other sub-fast-math switches (like
    '-f[no-]reciprocal-math', '-f[no-]math-errno', and others).

2.  Prior to that change, the disabling of a sub-fast-math switch was often
    ineffective.  So as an example, the following two commands often resulted
    in the same code-gen, even if there were
    fast-math-reciprocal-transformations that were done:
        clang++ -O2 -ffast-math -c foo.cpp
        clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp

3.  Since that change, the disabling of a sub-fast-math switch disables many
    more sub-fast-math transformations than just the one specified.  So now,
    the following two commands often result in very similar (and sometimes
    identical) code-gen:
        clang++ -O2 -c foo.cpp
        clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp

    That is, disabling a single sub-fast-math transformation in some (many?)
    cases now ends up disabling almost all the fast-math transformations.
    This causes a performance hit for people who have been using '-ffast-math'
    with one of the sub-fast-math switches disabled.

4.  To fix this, I think that additional fast-math-flags are likely needed in
    the IR.  Instead of the following set:
            'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
    something like this:
            'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
    would be more useful.  Related to this, the current 'fast' flag which acts
    as an umbrella (enabling 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract') may
    not be needed.  A discussion on this point was raised last November on the
    mailing list:
      http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html

Details:

More details are in that thread from November, but the problem in its entirety
involved both back-end LLVM issues, and front-end Clang (driver) issues.  The
LLVM issues are related to the umbrella aspect of 'fast', along with other
fast-math-flags implementation details (described below).  The front-end
aspects in Clang are related to the driver's handling of '-ffast-math' (which
also had an "umbrella" aspect).  The driver code has been refactored since that
November post, fixing the umbrella aspect of the front-end.  But I never got
around to working on the related back-end issues (nor has anyone else), and the
refactored front-end now results in the back-end issues manifesting
differently, and arguably in a worse way (details on the "worse" aspect,
below).

For reference, the refactored driver code was done in r297837:

  [Driver] Restructure handling of -ffast-math and similar options

To be clear, I'm not at all suggesting that the above change was incorrect.  I
think that refactoring of the driver code is the right thing to do.  An aspect
of this refactoring is that prior to it, when a user passed '-ffast-math' on
the command-line, it was also passed to the cc1 process, even if a
sub-fast-math component was disabled.  With the refactoring, the driver only
passes '-ffast-math' to cc1 when a specific set of sub-fast-math components are
enabled.

More specifically, when a user specifies just '-ffast-math' on the
command-line, the following 7 sub-fast-math switches:
  -fno-honor-infinities
  -fno-honor-nans
  -fno-math-errno
  -fassociative-math
  -freciprocal-math
  -fno-signed-zeros
  -fno-trapping-math

get passed to cc1 (this is true both with the old (pre r297837) and new (since
r297837) compilers).  Furthermore, the "umbrella" '-ffast-math' is also passed
to cc1 in this case of the user specifying just '-ffast-math' on the
command-line (again, in both the old and new compilers).

The difference between the old and new behavior shows up when a user turns on
fast-math but disables one (or more) of the sub-fast-math switches, for
example, as in:

  clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp

then in the old mode '-ffast-math' was still passed to cc1 (acting as an
umbrella, causing trouble), but in the new mode '-ffast-math' is no longer
passed to cc1 in this case.  (In both the old and new modes,
'-freciprocal-math' is not passed to cc1 with this command-line, as you'd
expect.)

What's happening is that in the old mode, it was the user passing '-ffast-math'
on the command-line that resulted in passing the umbrella '-ffast-math' to cc1
(even if all 7 of the sub-fast-math switches were disabled by the user).
Whereas in the new mode, the '-ffast-math' switch is passed to cc1 iff all 7 of
the underlying sub-fast-math switches are enabled.
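
As a rough illustration of that difference, here is a small model of the two
driver behaviors.  This is only a sketch of the rule described above, not the
actual Clang driver code: the names 'SubFastMath', 'fast_math_defaults',
'old_passes_umbrella' and 'new_passes_umbrella' are made up for this example.

```cpp
#include <cassert>

// Hypothetical model of the 7 sub-fast-math switches listed above.
// A member is true when the "fast" variant of that switch is in effect.
struct SubFastMath {
  bool no_honor_infinities = false;
  bool no_honor_nans = false;
  bool no_math_errno = false;
  bool associative_math = false;
  bool reciprocal_math = false;
  bool no_signed_zeros = false;
  bool no_trapping_math = false;

  bool all_enabled() const {
    return no_honor_infinities && no_honor_nans && no_math_errno &&
           associative_math && reciprocal_math && no_signed_zeros &&
           no_trapping_math;
  }
};

// State after a plain '-ffast-math': all 7 sub-switches are on.
inline SubFastMath fast_math_defaults() {
  SubFastMath s;
  s.no_honor_infinities = true;
  s.no_honor_nans = true;
  s.no_math_errno = true;
  s.associative_math = true;
  s.reciprocal_math = true;
  s.no_signed_zeros = true;
  s.no_trapping_math = true;
  return s;
}

// Old rule (pre-r297837): the umbrella is forwarded to cc1 whenever the
// user wrote '-ffast-math', regardless of the sub-switch state.
inline bool old_passes_umbrella(bool user_wrote_ffast_math,
                                const SubFastMath &) {
  return user_wrote_ffast_math;
}

// New rule (since r297837): the umbrella is forwarded to cc1 iff all 7
// sub-switches are enabled.
inline bool new_passes_umbrella(const SubFastMath &s) {
  return s.all_enabled();
}

// Models '-ffast-math -fno-reciprocal-math' under both rules.
inline void driver_model_demo() {
  SubFastMath s = fast_math_defaults();
  s.reciprocal_math = false;            // -fno-reciprocal-math
  assert(old_passes_umbrella(true, s)); // old: umbrella still passed
  assert(!new_passes_umbrella(s));      // new: umbrella dropped
}
```

Calling 'driver_model_demo()' exercises the '-ffast-math -fno-reciprocal-math'
case: the old rule still forwards the umbrella, the new rule does not.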

I'd say that's an improvement in the handling of the switches, and also on the
plus side, I think it makes dealing with the LLVM concerns I raised in November
a little clearer, and so more manageable in some sense.  But on the negative
side, since the new behavior is arguably worse, fixing the back-end issues is
now a higher priority for my customers.

The behavior that is arguably worse, is that when a user enables fast-math, but
attempts to disable one of the sub-fast-math aspects, the old behavior (pre
r297837) was that the sub-fast-math aspect to be disabled, generally (often?)
remained enabled.  The new behavior (since r297837) is that when disabling a
sub-fast-math aspect, that aspect plus many more (possibly often the majority)
of the fast-math transformations are disabled.  So this results in a
performance regression in these fast-math contexts when a sub-fast-math aspect
is disabled, which is why it is a fairly high priority for us.

FTR, r297837 was made during llvm 5.0 development, so the new behavior has the
effect of a performance regression in moving from 4.0 to 5.0.  In describing
things here, I'll compare llvm 4.0 with llvm 5.0 behavior.  But more precisely,
it's pre-r297837 with post-r297837 behavior.

Here is a tiny example, to illustrate it concretely:

$ cat assoc.cpp
//////////// "assoc.cpp" ////////////
float foo(float a, float x)
{
  return ((a + x) - x);  // fastmath reassociation eliminates the arithmetic
}
/////////////////////////////////////
$

When -ffast-math is specified, the reassociation enabled by it allows us to
simply return the first argument (and that reassociation does happen with
'-ffast-math', with both the old and new compilers):

$ clang -c -O2 -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*:    "
       0:       f3 0f 58 c1     addss   %xmm1, %xmm0
       4:       f3 0f 5c c1     subss   %xmm1, %xmm0
       8:       c3      retq
$ clang -c -O2 -ffast-math -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*:    "
       0:       c3      retq
$

FTR, GCC also does the reassociation transformation here when '-ffast-math' is
used, as expected.
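
For anyone wondering why this reassociation needs a fast-math flag at all, here
is a small sketch showing that '(a + x) - x' is not always equal to 'a' in
strict single-precision arithmetic.  The helper name 'add_then_sub' is made up
for this example, and the 'volatile' is only there to force the intermediate
sum to be rounded to 'float' and to keep the compiler from simplifying it.

```cpp
#include <cassert>

// Computes (a + x) - x, forcing the intermediate sum through a float-sized
// volatile so that it is really rounded to single precision.
float add_then_sub(float a, float x) {
  volatile float sum = a + x;
  return sum - x;
}

inline void reassoc_demo() {
  // Small values: the sum is exact, so the result equals 'a'.
  assert(add_then_sub(1.0f, 2.0f) == 1.0f);
  // Near 1.0e8 adjacent floats are 8 apart, so 1.0e8f + 1.0f rounds back to
  // 1.0e8f, and the subtraction then yields 0.0f rather than 1.0f.
  assert(add_then_sub(1.0f, 1.0e8f) == 0.0f);
}
```

So simplifying 'foo(a, x)' to 'a' changes the returned value for inputs like
these, which is why it is only legal when reassociation is allowed.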

But when using '-ffast-math' and disabling a sub-fast-math aspect of it (say
via '-fno-reciprocal-math', '-fno-associative-math', or '-fmath-errno'), both
the old and new compilers exhibit incorrect behavior in some cases.  With the
old compiler, the behavior was that using any of these switches did not disable
the transformation.  Those switches were mostly ineffective.  (Only
'-fno-associative-math' should disable the transformation in this example, so
the fact that the other ones didn't disable it is correct/desired.)  Here is
the old behavior for the above test-case, when some example sub-fast-math
aspects are individually disabled:

$ old/bin/clang --version | grep version
clang version 4.0.0 (tags/RELEASE_400/final)
$ old/bin/clang -c -O2 -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*:    "
       0:       f3 0f 58 c1     addss   %xmm1, %xmm0
       4:       f3 0f 5c c1     subss   %xmm1, %xmm0
       8:       c3      retq
$ old/bin/clang -c -O2 -ffast-math -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*:    "
       0:       c3      retq
$ old/bin/clang -c -O2 -ffast-math -fno-reciprocal-math -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*:    "
       0:       c3      retq
$ old/bin/clang -c -O2 -ffast-math -fno-associative-math -o x.o assoc.cpp # Error
$ llvm-objdump -d x.o | grep "^ .*:    "
       0:       c3      retq
$ old/bin/clang -c -O2 -ffast-math -fmath-errno -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*:    "
       0:       c3      retq
$

So with the old compiler, the case marked 'Error' above is incorrect, in that
the reassociation should be suppressed in that case, but it isn't.

Again FTR, the GCC behavior disables the re-association in the case marked
'Error' above.

Moving on to the new compiler, instead of '-fno-associative-math' being
ineffective, the problem is that when disabling other sub-fast-math aspects
(unrelated to reassociation), the transformation is suppressed, when it should
not be.  Here is the new behavior with that same set of sub-fast-math aspects
individually disabled:

$ new/bin/clang --version | grep version
clang version 5.0.0 (tags/RELEASE_500/final)
$ new/bin/clang -c -O2 -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*:    "
       0:       f3 0f 58 c1     addss   %xmm1, %xmm0
       4:       f3 0f 5c c1     subss   %xmm1, %xmm0
       8:       c3      retq
$ new/bin/clang -c -O2 -ffast-math -o x.o assoc.cpp
$ llvm-objdump -d x.o | grep "^ .*:    "
       0:       c3      retq
$ new/bin/clang -c -O2 -ffast-math -fno-reciprocal-math -o x.o assoc.cpp # Error
$ llvm-objdump -d x.o | grep "^ .*:    "
       0:       f3 0f 58 c1     addss   %xmm1, %xmm0
       4:       f3 0f 5c c1     subss   %xmm1, %xmm0
       8:       c3      retq
$ new/bin/clang -c -O2 -ffast-math -fno-associative-math -o x.o assoc.cpp # Good
$ llvm-objdump -d x.o | grep "^ .*:    "
       0:       f3 0f 58 c1     addss   %xmm1, %xmm0
       4:       f3 0f 5c c1     subss   %xmm1, %xmm0
       8:       c3      retq
$ new/bin/clang -c -O2 -ffast-math -fmath-errno -o x.o assoc.cpp # Error
$ llvm-objdump -d x.o | grep "^ .*:    "
       0:       f3 0f 58 c1     addss   %xmm1, %xmm0
       4:       f3 0f 5c c1     subss   %xmm1, %xmm0
       8:       c3      retq
$

The two cases marked as 'Error' are incorrectly suppressing the re-association.
The case marked as 'Good' is now doing the right thing for this test-case.

Again FTR, the GCC behavior allows the re-association in the cases marked
'Error' above to happen.

__________________________________________________________________

Note that the '-f[no-]associative-math' flag has other problems, reported in
PR27372 (https://bugs.llvm.org/show_bug.cgi?id=27372).  Those "other problems"
are related to the fact that there isn't an LLVM IR fast-math-flag that
explicitly indicates whether reassociation is enabled or disabled.  As a
consequence, the front-end essentially drops that flag on the floor.  The
back-end has no way of explicitly looking for that capability, and so the
back-end implementation instead relies on the "umbrella" aspect of 'fast'
implicitly turning on all the lower-level fast-math-flags.  This is a key
aspect of the problem.  Near the start of this post, I mentioned that the LLVM
issues are related to the umbrella aspect of 'fast', along with other
fast-math-flag implementation details.  The fact that the back-end has no way
of explicitly checking whether reassociation is enabled is what I meant by
those other implementation details.
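
To make that concrete, here is a deliberately simplified model of the
situation.  It is not LLVM's actual data structure or API: the namespace, the
bit assignments and the functions 'encode' and 'may_reassociate' are all
hypothetical.  The only point is that with no dedicated reassociation bit, the
reassociation query can only look at the umbrella bit.

```cpp
#include <cassert>

namespace current_fmf {
// Hypothetical bit assignments, for illustration only.
constexpr unsigned Fast = 1u << 0;  // umbrella; implies all the others
constexpr unsigned NNaN = 1u << 1;
constexpr unsigned NInf = 1u << 2;
constexpr unsigned NSZ = 1u << 3;
constexpr unsigned ARcp = 1u << 4;
constexpr unsigned Contract = 1u << 5;
constexpr unsigned AllSub = NNaN | NInf | NSZ | ARcp | Contract;

// Simplified front-end: the umbrella is set only when every sub-flag is
// requested.  There is no input through which "reassociation is wanted"
// could be expressed separately.
constexpr unsigned encode(unsigned sub_flags) {
  return (sub_flags & AllSub) == AllSub ? (Fast | AllSub)
                                        : (sub_flags & AllSub);
}

// Simplified back-end query: with no reassociation bit, the only thing it
// can test is the umbrella.
constexpr bool may_reassociate(unsigned flags) { return (flags & Fast) != 0; }

inline void umbrella_demo() {
  // Everything on: reassociation is allowed.
  assert(may_reassociate(encode(AllSub)));
  // Drop only 'arcp': the umbrella goes away, and reassociation with it.
  assert(!may_reassociate(encode(AllSub & ~ARcp)));
}
}  // namespace current_fmf
```

In this model, dropping 'arcp' is enough to turn off reassociation, even though
the two have nothing to do with each other.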

Going to a more general discussion of the problem, the documentation of the
fast-math-flags at:
  http://llvm.org/docs/LangRef.html#fast-math-flags

can be described loosely as:

nnan       Allow optimizations to assume the arguments and result are not NaN
ninf       Allow optimizations to assume the arguments and result are not +/-Inf
nsz        Allow optimizations to treat the sign of a zero argument or result
           as insignificant
arcp       Allow optimizations to use the reciprocal of an argument rather than
           perform division
contract   Allow floating-point contraction (e.g. fused multiply-and-add)

And the flag 'fast' is defined there as:

fast       Fast - Allow algebraically equivalent transformations that may
           dramatically change results in floating point (e.g. reassociate).
           This flag implies all the others.

(Side point: Back in November, 'contract' was not an explicit fast-math-flag.
This is a recent change, but it doesn't impact the issue I'm raising here.)

To summarize, and to relate this somewhat back to the November 2016 post:
  http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html

as described in that older post, this means that 'fast' could be described as:

        Very loosely, 'fast' means "all the aggressive FP-transformations that
        are not controlled by one of the other 5, plus it implies all the other
        5".  If for terminology, we call those additional aggressive
        optimizations 'aggr', then we have:

            'fast' == 'aggr' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'

But there isn't a specific flag for 'aggr' (it's just "on" when all the other
flags are "on").  Reassociation is part of these additional 'aggr'
transformations.  Back in November, Hal pointed out that libm transformations
are another part of these 'aggr' transformations.  With that, one possible
direction is to add two more sub-fast-math flags, say 'reassoc' and 'libm':

            'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'

This would allow disabling (for example) 'arcp' without suppressing
reassociation.  Whether there would be a need for an "umbrella" flag 'fast'
that implies all the others is somewhat orthogonal, although personally I feel
it complicates the issue and doesn't provide any significant benefit.  I can
imagine that there is a benefit that I haven't thought of -- I don't claim to
have a deep understanding of the implementation.  So I'd like to hear what
others think.
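
As a sketch of what the proposed direction could look like (again a
hypothetical model, not a patch: the namespace, bit assignments and function
names are invented here, and 'reassoc' and 'libm' are just the working names
used above), each transformation family gets its own bit and each query tests
only that bit:

```cpp
#include <cassert>

namespace proposed_fmf {
// Hypothetical bit assignments, for illustration only.
constexpr unsigned Reassoc = 1u << 0;
constexpr unsigned Libm = 1u << 1;
constexpr unsigned NNaN = 1u << 2;
constexpr unsigned NInf = 1u << 3;
constexpr unsigned NSZ = 1u << 4;
constexpr unsigned ARcp = 1u << 5;
constexpr unsigned Contract = 1u << 6;
constexpr unsigned All = Reassoc | Libm | NNaN | NInf | NSZ | ARcp | Contract;

// Returns 'flags' with the bits in 'dropped' cleared.
constexpr unsigned without(unsigned flags, unsigned dropped) {
  return flags & ~dropped;
}

// Each query looks at exactly one bit; no umbrella is involved.
constexpr bool may_reassociate(unsigned flags) {
  return (flags & Reassoc) != 0;
}
constexpr bool may_use_reciprocal(unsigned flags) {
  return (flags & ARcp) != 0;
}

inline void independent_flags_demo() {
  // Models '-ffast-math -fno-reciprocal-math'.
  const unsigned flags = without(All, ARcp);
  assert(!may_use_reciprocal(flags));  // reciprocal transforms are off
  assert(may_reassociate(flags));      // reassociation is unaffected
}
}  // namespace proposed_fmf
```

With independent bits like these, disabling 'arcp' leaves reassociation (and
everything else) alone, which is the behavior a user would expect.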

One important aspect of this is that it appears to me there are quite a few
fast-math transformations that are enabled only when all the underlying
sub-fast-math flags are on (that is, only when the 'fast' umbrella flag is
set).  That's a key part of the problem of PR27372.  In this context, the
change in behavior from r297837 is that with the old behavior, the following
two commands are almost equivalent (in many cases, they are equivalent):

  $ # Old behavior: The following two commands are nearly identical:
  $ clang -c -O2 -ffast-math foo.cpp
  $ clang -c -O2 -ffast-math -fno-reciprocal-math foo.cpp
  $

Whereas with the new behavior (post-r297837), the following two commands are
almost always equivalent:

  $ # New behavior: The following two commands are nearly identical:
  $ clang -c -O2 foo.cpp
  $ clang -c -O2 -ffast-math -fno-reciprocal-math foo.cpp
  $

(Again, '-fno-reciprocal-math' is just an example of the suppression of a
sub-fast-math aspect here.  '-fno-associative-math' and '-fmath-errno' would
also be good examples.)

Succinctly, if a '-ffast-math' user now disables a sub-fast-math aspect, they
will be frustrated that they end up disabling almost the entire set of
fast-math transformations.  Whereas previously, they would be frustrated that
their attempt of disabling a specific sub-fast-math aspect was ineffective.  So
previously, they might try to "fix a numerical instability" by disabling a
sub-fast-math aspect (and be frustrated by it not being effective), and now if
they try to "fix that numerical instability", they will succeed, but they will
see a performance-hit of losing nearly all the performance gain that
'-ffast-math' was providing.

As an aside, on the PS4 with llvm 4.0 (and earlier) compilers, we've had a few
customers frustrated that '-ffast-math -fno-reciprocal-math' was still doing
reciprocal transformations.  So we've had a private change to make
'-fno-reciprocal-math' suppress the reciprocal optimization.  With a vanilla
llvm 5.0, those customers would see a performance hit (so we have a different
private change to address that).

As a final point here, to give more weight to this, I took a random bit of code
I found on github that has floating-point fast-math opportunities in it, and
experimented with it.  (I just searched for 'mandelbrot', and took the first
thing I found.)  Specifically:

  https://gist.github.com/andrejbauer/7919569

This test-case has a few divisions in it, but it doesn't contain any
reciprocal-transformation opportunities (so '-f[no-]reciprocal-math' should
essentially be a no-op).

The old Clang behavior has the following two commands being nearly identical
(they generate essentially equivalent code -- just some minor register
change):

$ # Old Clang behavior:
$ # No significant difference when -fno-reciprocal-math is added (as desired)
$ clang -S -O2 -ffast-math -o O2fm.s mandelbrot.c
$ clang -S -O2 -ffast-math -fno-reciprocal-math -o O2fm.no_arcp.s mandelbrot.c
$ diff O2fm.s O2fm.no_arcp.s | wc
      4      10      56
$

That is, as expected/desired, the '-fno-reciprocal-math' switch has essentially
no impact on this, since there are no reciprocal transformations being done.
Also as expected, the difference between "plain -O2" and '-O2 -ffast-math' is
more substantial:

$ # Old Clang behavior:
$ # '-O2' vs '-O2 -ffast-math' shows a significant difference (as desired)
$ clang -S -O2 -o O2.s mandelbrot.c
$ diff O2.s O2fm.s | wc
     43     184    1305
$

That is, adding '-ffast-math' to '-O2' is transforming the code, presumably
making it faster (at the cost of a potential loss in numerical accuracy).

With GCC for this example (I used version 4.8.4, which isn't particularly
modern, but I happen to have it handy), I get similar behavior.  For example,
the following two commands produce identical assembly code:

$ gcc -S -O2 -ffast-math -o O2fm.s mandelbrot.c
$ gcc -S -O2 -ffast-math -fno-reciprocal-math -o O2fm.no_arcp.s mandelbrot.c
$ diff O2fm.s O2fm.no_arcp.s
$

and that code is substantially different than the GCC "plain -O2" code:

$ gcc -S -O2 -o O2.s mandelbrot.c
$ diff O2.s O2fm.s | wc
     44     126     719
$

But comparing this to the new Clang behavior, we see that
'-fno-reciprocal-math' is now "disabling too much", as discussed in detail
above for the simple "assoc.cpp" test-case.  Specifically:

$ # New Clang behavior:
$ clang -S -O2 -o O2.s mandelbrot.c
$ clang -S -O2 -ffast-math -o O2fm.s mandelbrot.c
$ clang -S -O2 -ffast-math -fno-reciprocal-math -o O2fm.no_arcp.s mandelbrot.c
$
$ # Adding -ffast-math to -O2 continues to show significant diffs (expected)
$ diff O2.s O2fm.s | wc
     35     105     622
$
$ # too many differences -- should be nearly the same
$ diff O2fm.s O2fm.no_arcp.s | wc
     29      89     526
$

So with the new behavior, even though there are no reciprocal transformation
opportunities, disabling that transformation via '-fno-reciprocal-math'
disables many (most) of the fast-math features.  In fact, comparing plain '-O2'
with '-O2 -ffast-math -fno-reciprocal-math', it's clear that they are virtually
identical with the new Clang behavior.  Specifically, we get only a minor
difference (of swapping of two register operands in a comparison, and changing
the sense of the associated branch) when comparing '-O2' with
'-O2 -ffast-math -fno-reciprocal-math':

$ # New Clang behavior:
$ # nearly identical, but there should be many diffs
$ diff O2.s O2fm.no_arcp.s
188,189c188,189
<       ucomisd %xmm5, %xmm6
<       ja      .LBB0_7
---
>       ucomisd %xmm6, %xmm5
>       jb      .LBB0_7
$

In full disclosure, for this "mandelbrot.c" test-case, I don't know if any of
the changes in code-gen done by us or by GCC when '-ffast-math' is enabled are
helpful (from a performance perspective) or dangerous (from a precise IEEE FP
math perspective).  All I know is that for both us and GCC at -O2, the switch
'-ffast-math' changed the code-gen, and that '-ffast-math -fno-reciprocal-math'
didn't suppress any of those changes for GCC, but it suppressed essentially all
of the changes for us.

For continuity, I'm repeating the summary here (that I had near the beginning).

In Summary:

1.  With the change of r297837, the driver now more cleanly handles
    '-ffast-math', and other sub-fast-math switches (like
    '-f[no-]reciprocal-math', '-f[no-]math-errno', and others).

2.  Prior to that change, the disabling of a sub-fast-math switch was often
    ineffective.  So as an example, the following two commands often resulted
    in the same code-gen, even if there were
    fast-math-reciprocal-transformations that were done:
        clang++ -O2 -ffast-math -c foo.cpp
        clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp

3.  Since that change, the disabling of a sub-fast-math switch disables many
    more sub-fast-math transformations than just the one specified.  So now,
    the following two commands often result in very similar (and sometimes
    identical) code-gen:
        clang++ -O2 -c foo.cpp
        clang++ -O2 -ffast-math -fno-reciprocal-math -c foo.cpp

    That is, disabling a single sub-fast-math transformation in some (many?)
    cases now ends up disabling almost all the fast-math transformations.
    This causes a performance hit for people who have been using '-ffast-math'
    with one of the sub-fast-math switches disabled.

4.  To fix this, I think that additional fast-math-flags are likely needed in
    the IR.  Instead of the following set:
            'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
    something like this:
            'reassoc' + 'libm' + 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract'
    would be more useful.  Related to this, the current 'fast' flag which acts
    as an umbrella (enabling 'nnan' + 'ninf' + 'nsz' + 'arcp' + 'contract') may
    not be needed.  A discussion on this point was raised last November on the
    mailing list:
      http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html

Thanks,
-Warren