[LLVMdev] API break for out-of-tree targets implementing TargetLoweringBase::isFMAFasterThanMulAndAdd

Mon Jul 8 15:10:24 PDT 2013

Hello,

To any out-of-tree targets, please be aware that I intend to commit a
patch that will break the build of any target implementing
TargetLoweringBase::isFMAFasterThanMulAndAdd, for the reasons
described below. (Basically, the current interface definition is
broken and not followed, and no in-tree target was doing the right
thing with it, so it is unlikely any out-of-tree target is either...)

To un-break your build after this patch goes through, you will need to
rename isFMAFasterThanMulAndAdd to isFMAFasterThanFMulAndFAdd and
ensure that it returns true for any type that eventually legalizes to
a type for which FMAs are faster than FMul + FAdd (which usually means
you have hardware support of the operation.). You can look at in-tree
target implementations as an example.

Please let me know if there are any objections before tomorrow morning.

Stephen

---------- Forwarded message ----------
From: Stephen Lin <swlin at post.harvard.edu>
Date: Sun, Jul 7, 2013 at 9:25 PM
Subject: [PATCH] Resolve issues with fmuladd intrinsic handling across
multiple backends
To: llvm-commits at cs.uiuc.edu

Hi,

While working on another patch, I discovered multiple related issues
with fmuladd intrinsic handling which I believe the attached patch
resolves.

Currently, the operation depends on the target's implementation of the
virtual function TargetLoweringBase::isFMAFasterThanMulAndAdd, the
comments of which currently claims:

-  /// isFMAFasterThanMulAndAdd - Return true if an FMA operation is faster than
-  /// a pair of mul and add instructions. fmuladd intrinsics will be
expanded to
-  /// FMAs when this method returns true (and FMAs are legal),
otherwise fmuladd
-  /// is expanded to mul + add.

The "and FMAs are legal" portion of the above comment is simply a lie;
the legality of FMA operations is not checked before lowering fmuladd
to ISD::FMA; however, the AArch64, SystemZ, and X86 implementations of
this function are coded assuming that legality is checked and thus
simply return true. This results in the following issues:

1. On X86(-64) targets, ISD::FMA nodes are formed when lowering
fmuladd intrinsics even if the subtarget does not support FMA
instructions, leading to laughably bad code generation in some
situations (for example, compiling a call to "@llvm.fmuladd.v16f32(<16
x float> %a, <16 x float> %b, <16 x float> %c)" without "-mattr=+fma"
or "-mattr=+fma4" results in 16 calls for the fmaf libm function
instead of AVX muls and adds.

2. On AArch64 targets, ISD::FMA nodes are formed for operations on
fp128, resulting in a call to a software fp128 FMA implementation
rather than a software fp128 multiply and a software fp128 add. This
does not seem to be the intended behavior given the comment above;
however, I am not sure if this is actually better or worse, since
neither set of operations is supported in hardware...does anyone know
which is likely to be faster?

However, on further investigation, it seems like not checking the
legality of an FMA operation in lowering fmuladd intrinsics is a
feature, not a bug, since it allows formation of FMAs with types like
v16f32, as long as they legalize (via splitting, scalarization,
promotion, etc.) to types that support FMAs; it turns out this case is
explicitly tested for in test/CodeGen/X86/wide-fma-contraction.ll.

So the proper solution, it seems, is for
TargetLoweringInfo::isFMAFasterThanMulAndAdd implementations to check
types and preconditions rather than depending on the caller to do so.
The last in-tree target to implement this function, PowerPC, actually
does check types, however, it only checks for the specific legal
types, and therefore the following occurs:

3. On PowerPC targets, FMAs are not generated from fmuladd intrinsics
on types like v2f32, v8f32, v4f64, etc., even though they promote,
split, scalarize, etc. to types that support hardware FMAs.

This patch resolves all these issues by modifying the implementations
of this virtual to check for subtarget features and for EVTs that
legalize to MVTs that support hardware FMAs. (If I've understood the
legalization rules correctly, then it turns out the latter can always
be done in these targets just by checking the scalar type, but this is
not the correct solution in the general case.)
Comments are adjusted accordingly.

(For the AArch64 backend in particular, I made the assumption, for
now, that FMAs should not be formed on fp128, but this can be changed
with a one-line fix.)

Since all current implementations of this virtual were buggy and the
comment describing its usage was incorrect, I've also decided to
rename it from "TargetLoweringBase::isFMAFasterThanMulAndAdd" to
"TargetLoweringBase::isFMAFasterThanFMulAndFAdd" to force a merge
conflict for any out-of-tree target implementing it; the new name is
more accurate and consistent with the name of other functions in this
interface anyway*. I will send a message to LLVMDev describing the
change and the appropriate fixes required if/when it goes through.
(Suggestions for a better solution to this problem are welcome!)

Also, I took this opportunity to modify DAGCombiner to only check for
type legality when forming ISD::FMA nodes (in unsafe math mode) after
legalization, since this is allowed by the new function interface
definition (as long as isFMAFasterThanFMulAndFAdd returns true for the
type).  In practice with current target implementations this change
seems to be a no-op, since the FMA will be formed after legalization
anyway; however, there doesn't seem to be any harm in forming the FMA
earlier if possible.

No current tests are affected by this change; I've added tests that
the specific symptoms described above are fixed as well as some other
sanity checks, just in case.

Please review and let me know if you have any comments! In particular,
if anyone has an definite knowledge about what the correct behavior of
fmuladd of fp128 on AArch64 should be, please let me know.

Thanks,
Stephen

* fast integer FMA instructions are at least theoretically possible, I
think (?); however, the current comment already indicates that the
function is used specifically to lower fmuladd intrinsics so "FMul"
and "FAdd" are more accurate.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fma-faster-check-types.patch
Type: application/octet-stream
Size: 26781 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130708/52adf721/attachment.obj>