[llvm-dev] Update on strict FP status

Wed May 23 07:48:07 PDT 2018

Hello,

at the recent EuroLLVM developer meeting in Bristol I held a BoF
session on the topic "Towards implementing #pragma STDC FENV_ACCESS".
I've also had a number of follow-on discussions both on-site in
Bristol and online since.  This post is intended as a summary of
my current understanding of the set of requirements and
implementation details covering the overall topic.

I'm posting this here in the hope this can serve as a basis for
the various more detailed discussions that are still ongoing
(e.g. in various Phabricator proposals right now).  Any comments
are welcome!

Semantics of #pragma STDC FENV_ACCESS
=====================================

To provide a baseline for the implementation discussion, first an
overview of the features required to handle the strict floating-point
mode defined by the C and IEEE standards:

1. Floating-point rounding modes
2. Default floating-point exception handling
3. Trapping floating-point exception handling

Each of these separate features imposes different constraints on the
optimizations that LLVM may perform involving FP expressions:

1. Floating-point rounding modes

Outside of FENV_ACCESS regions, all FP operations are supposed to be
performed in the "default" rounding mode.

But inside FENV_ACCESS regions, FP operations implicitly depend on
a "current" rounding mode setting, which may be changed by certain
C library calls (plus some platform-specific intrinsics).  In addition,
those calls may be performed within subroutines (as long as those are
also within FENV_ACCESS), so *any* function call within a FENV_ACCESS
region must be considered as potentially changing the rounding mode.

In effect, this means the compiler may not move or combine FP
operations across function call sites.

2. Default floating-point exception handling

Inside FENV_ACCESS regions, every floating-point operation that
causes an exception must be considered to set a "status flag"
associated with this exception type.  Those flags can be queried
using C library calls (plus some platform-specific intrinsics),
and there are other such calls to explicitly set or clear those
flags as well.  As with the rounding modes, those calls may be
performed in subroutines as well, so any function call within a
FENV_ACCESS region must be considered as potentially *using* and
changing the floating-point exception status flags.

The values of the status flags on entry to a FENV_ACCESS region
are to be considered undefined according to the C standard.

Compiler optimizations are supposed to preserve the values of
all exception status bits at any point where they can be
(potentially) inspected by the program, i.e. at all call sites
within FENV_ACCESS regions.  This still allows a number of
optimizations, e.g. to reorder FP operations or combine two
identical operations within a region uninterrupted by calls.
But other optimizations should be avoided, e.g. optimizing
away an unused FP operation may result in an exception flag
now being unset that would otherwise have been set.  The same
applies to floating-point constant folding.

3. Trapping floating-point exception handling

Within a FENV_ACCESS region, library calls may be used to switch
exception handling semantics to a "trapping" mode by setting
corresponding mask bits.  Any subsequent FP instruction that
raises an exception with the associated mask bit set will cause
a trap.  Usually, this will be a hardware trap that is translated
by the operating system into some form of software exception that
can be handled by the application; on Linux systems this takes the
form of a SIGFPE signal.

As above, those mask bits can be set and reset via (operating-
system specific) library calls and/or platform-specific intrinsics,
all of which may also be done within subroutine calls.

In effect, this requires the compiler to treat any floating-point
operation within a FENV_ACCESS region as potentially trapping,
which means the same restrictions apply as with e.g. memory accesses
(cannot be speculated etc.).  However, according to the C standard,
the implementation is not required to preserve the *number* of
different traps, so identical operations may still be combined
(unless there is an intervening function call).

The C standard requires all user code to explicitly switch back
to non-trapping mode for all exceptions whenever leaving a
FENV_ACCESS region (both by "falling off the end" of the region
and by calling a subroutine defined outside of FENV_ACCESS).

Implementation requirements on parts of the compiler
====================================================

A. clang front end

The front end needs to determine which instructions are part of
FENV_ACCESS regions and which are not.  This takes into account
both the semantics of the #pragma as defined by the standard,
and the implementation-defined default rules that apply to code
outside of any #pragma.  GCC currently has the following two
related command-line options:

-frounding-math: Do not assume default rounding mode
-ftrapping-math: Assume FP operations may trap

clang accepts but (basically) ignores those options.  As a first
step, it might make sense to have the FENV_ACCESS default
behavior triggered by these options, even while the front end
does not yet support the actual #pragma.

The front end then needs to transmit the information about
FENV_ACCESS regions to later passes.  However, I believe that
we do not actually have to implement "regions" as such at the
IR level.  Instead, it would be sufficient to track the following
information:

- For each FP operation, whether it is within a FENV_ACCESS region.
- For each call site, whether it is within a FENV_ACCESS region.

The former requires new IR support; the approach currently under
investigation uses the experimental "constrained FP" intrinsics
instead of traditional floating-point operations for this.  The
latter can be done simply by annotating those call sites with an
attribute.
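
To illustrate why these two per-instruction flags can replace
explicit regions, here is a toy model in C.  Everything in it
(struct inst, may_combine, may_delete_if_unused, example_seq) is
invented for this sketch and is not an LLVM data structure or API.
The rules are deliberately conservative simplifications of the
semantics described earlier.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much simplified stand-in for an IR instruction. */
enum inst_kind { INST_FP_OP, INST_CALL };

struct inst {
    enum inst_kind kind;
    bool strict;    /* true if the instruction is inside a FENV_ACCESS region */
    int expr_id;    /* FP ops computing the same expression share an id */
};

/* A strict FP operation may set a status flag or trap, so it must be
   kept even if its result is unused. */
bool may_delete_if_unused(const struct inst *op)
{
    return op->kind == INST_FP_OP && !op->strict;
}

/* Two identical FP operations may be combined only if no strict call
   site lies between them: such a call may change the rounding mode or
   inspect the status flags. */
bool may_combine(const struct inst *seq, size_t i, size_t j)
{
    if (i >= j)
        return false;
    if (seq[i].kind != INST_FP_OP || seq[j].kind != INST_FP_OP)
        return false;
    if (seq[i].expr_id != seq[j].expr_id)
        return false;
    for (size_t k = i + 1; k < j; k++)
        if (seq[k].kind == INST_CALL && seq[k].strict)
            return false;
    return true;
}

/* Example sequence: op, op, strict call, op, then a non-strict op. */
const struct inst example_seq[] = {
    { INST_FP_OP, true,  1 },
    { INST_FP_OP, true,  1 },
    { INST_CALL,  true,  0 },
    { INST_FP_OP, true,  1 },
    { INST_FP_OP, false, 2 },
};
```

In example_seq the first two operations may be combined, but the
second and the fourth may not, since a strict call site separates
them.  No notion of a region is needed to reach that conclusion.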

In addition to that, the front-end itself needs to disable any
early optimizations that do not preserve strict FP semantics,
in particular it must not speculate FP operations if they may
trap.  (Currently, the front end transforms "? :" on floating-
point types into a select IR statement; for trapping FP
operations, an explicit branch must be used instead.)

B. LLVM IR and LLVM common optimizations

As mentioned in the previous section, we need some IR to annotate
FP instructions and call sites within FENV_ACCESS regions.  All
common optimizations then need to respect the strict FP semantics
associated with those regions.

The current approach uses experimental intrinsics.  This has the
advantage that most optimizations never trigger since they don't
even recognize those new intrinsics.  Also, the intrinsics can
be marked as having side-effects and/or being non-speculatable.

The overall effect is that more optimizations are suppressed
than would be strictly necessary.  But this may still be a good
first step, since the result is now safe but maybe not optimal
-- which can be improved upon over time by teaching the specific
semantics of those intrinsics to optimization passes.

However, some open questions remain.  If at some point we want
to model the constrained FP semantics more precisely than just
as "unmodeled side effects", this may have to be reflected at
the IR level directly.  For example, to model rounding mode
behavior, at some point we might require explicit tracking of
data dependencies on the rounding mode by representing the
rounding mode as SSA values defined by function calls and used
by FP intrinsics.  Similarly, to track exception status flags,
they might be modeled as SSA values set by FP intrinsics and
used by function calls.

(There is a possibly related question of how to optimally model
the property of many math library routines that they may access
the "errno" variable but no other memory ...  It might also be
possible to model e.g. exception status as a thread-local "memory"
location that is modified by FP operations, just like errno.)

Another currently unresolved issue is that at the moment nothing
prevents *standard* floating-point operations from being moved
*inside* FENV_ACCESS regions.  This may also be invalid, since
those operations now may cause unexpected traps etc.  (More
specifically, what is invalid is moving any standard FP operation
across a *call site* within a FENV_ACCESS region.)  Note that
this is even an issue if we only support changing the default
(and no actual #pragma) if multiple object files using different
default settings are being linked together using LTO.

This last issue could in theory be solved by having all optimization
passes respect the requirement that floating-point operations may
not be moved across call sites marked with the strict FP attribute.
But that does not appear to be straightforward since it would
introduce a "new" type of dependency that would have to be added
throughout LLVM code.  If this must be avoided, we'd have to
find a way to explicitly track dependencies at the IR level.  In
the extreme, this could end up equivalent to just always using
the constrained intrinsics for everything ...

C. Code generation

In the back end, effects of strict FP mode have to be passed through
to lower-level representations including SelectionDAG and MI.

Currently, the "unmodeled side effect" logic of the constrained
intrinsics is modeled by putting them on the chain during SelectionDAG.
(If we ever model semantics more precisely at the IR level, that
would need to be reflected on SelectionDAG accordingly.)

At the MI level, there is no representation at all.  One option to
fix this would be to model target-specific registers that implement
the IEEE semantics.  Most platforms have registers (or parts of
registers) that hold:
- the current rounding mode
- the exception status flags
- the exception masks (which enable traps)
Marking FP instructions as using and/or defining these registers
would enforce ordering requirements.  It may be too strict in some
cases (e.g. two instructions setting exception status flags may
still be reordered).  On the other hand, I believe if instructions
may actually *trap*, we actually need the hasSideEffects flag even
if register dependencies are modeled.
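
The ordering effect of such register dependencies, including the
"too strict" case, can be sketched with a toy model in C.  The
register names, struct minst and must_stay_ordered are all
hypothetical; real targets name and split these registers
differently, and this is not how MI represents operands.

```c
#include <stdbool.h>

/* Hypothetical FP environment registers, one bit each. */
enum {
    REG_ROUNDING_MODE = 1,
    REG_STATUS_FLAGS  = 2,
    REG_TRAP_MASKS    = 4
};

/* Hypothetical machine instruction: only its FP environment uses/defs. */
struct minst {
    unsigned uses;   /* FP environment registers read    */
    unsigned defs;   /* FP environment registers written */
};

/* Usual dependence test: order matters if either instruction writes
   something the other one reads or writes. */
bool must_stay_ordered(struct minst a, struct minst b)
{
    return (a.defs & (b.uses | b.defs)) != 0 || (a.uses & b.defs) != 0;
}

/* An arithmetic instruction reads the rounding mode and trap masks and
   writes the status flags. */
const struct minst FP_ARITH = { REG_ROUNDING_MODE | REG_TRAP_MASKS,
                                REG_STATUS_FLAGS };

/* An instruction that sets the rounding mode. */
const struct minst SET_ROUNDING = { 0, REG_ROUNDING_MODE };

/* An instruction that does not touch the FP environment at all. */
const struct minst INTEGER_OP = { 0, 0 };
```

Here must_stay_ordered(FP_ARITH, SET_ROUNDING) is true, as desired.
But must_stay_ordered(FP_ARITH, FP_ARITH) is true as well, merely
because both instructions define the status flags, even though two
such instructions could safely be swapped when traps are disabled.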

If we do need hasSideEffects, there is a separate discussion on
whether this can be implemented without each back end having to
duplicate all FP instruction patterns (one with hasSideEffects
and one without), e.g. by having a new feature that allows
describing the side-effect status using an MI operand.

Next steps
==========

I believe it is important to break up the full amount of work
into incremental steps that provide some useful benefits on their
own.  At first, we should be able to get to a state where clang
can be used to build programs that use some (maybe not all) strict
FP features, where the generated code is always correct but may
not always be optimal.  To get there, I think we need at a
minimum:

- Implement clang support for the default flags, e.g. GCC's
  -frounding-math and -ftrapping-math, and always generate
  the constrained intrinsics.  clang should also mark all
  call sites then (as mentioned above).

- For now, add the requirement that LTO is not supported if
  this would cause mixing of strict and non-strict FP code.
  In the alternative, have the LTO pass automatically transform
  any floating-point operation into a constrained intrinsic
  if *any* (other) module already uses the latter.

- At the IR level, complete the set of supported constrained
  FP intrinsics (there are still some missing, see e.g.
  https://reviews.llvm.org/D43515).
  Also, it seems not all variants (e.g. for vector types) are
  supported correctly through codegen (see e.g.
  https://reviews.llvm.org/D46967).

- Allow targets to correctly reflect constrained intrinsics
  semantics at the MI level and final machine code generation
  (see e.g. https://reviews.llvm.org/D45576).

- Review all optimization and codegen passes to verify they
  fully respect strict FP semantics.

Once this is done, we can improve on the solution by:

- Supporting mixing strict and non-strict FP operations
  (would lift the LTO restriction).  (Note: there still seems
  to be some "invention required" here, see above.)

- Actually implementing the #pragma supporting different
  regions within a compilation unit (prereq: support for
  mixing strict and non-strict FP operations).

- Add more optimization of constrained FP intrinsics in
  common optimizers and/or target back ends.

Does this look reasonable?  Please let me know if there's
anything I overlooked, or you have any additional comments
or questions.

Mit freundlichen Gruessen / Best Regards

Ulrich Weigand

--
  Dr. Ulrich Weigand | Phone: +49-7031/16-3727
  STSM, GNU/Linux compilers and toolchain
  IBM Deutschland Research & Development GmbH
  Vorsitzende des Aufsichtsrats: Martina Koederitz | Geschäftsführung: Dirk
Wittkopp
  Sitz der Gesellschaft: Böblingen | Registergericht: Amtsgericht
Stuttgart, HRB 243294