[PATCH] Resolve issues with fmuladd intrinsic handling across multiple backends

Hal Finkel hfinkel at anl.gov
Sun Jul 7 22:13:58 PDT 2013


----- Original Message -----
> Hi,
> 
> While working on another patch, I discovered multiple related issues
> with fmuladd intrinsic handling which I believe the attached patch
> resolves.
> 
> Currently, lowering of the fmuladd intrinsic depends on the target's
> implementation of the virtual function
> TargetLoweringBase::isFMAFasterThanMulAndAdd, whose comment currently
> claims:
> 
> -  /// isFMAFasterThanMulAndAdd - Return true if an FMA operation is faster than
> -  /// a pair of mul and add instructions. fmuladd intrinsics will be expanded to
> -  /// FMAs when this method returns true (and FMAs are legal), otherwise fmuladd
> -  /// is expanded to mul + add.
> 
> The "and FMAs are legal" portion of the above comment is simply a
> lie;
> the legality of FMA operations is not checked before lowering fmuladd
> to ISD::FMA; however, the AArch64, SystemZ, and X86 implementations
> of
> this function are coded assuming that legality is checked and thus
> simply return true. This results in the following issues:
> 
> 1. On X86(-64) targets, ISD::FMA nodes are formed when lowering
> fmuladd intrinsics even if the subtarget does not support FMA
> instructions, leading to laughably bad code generation in some
> situations. For example, compiling a call to
> "@llvm.fmuladd.v16f32(<16 x float> %a, <16 x float> %b, <16 x float> %c)"
> without "-mattr=+fma" or "-mattr=+fma4" results in 16 calls to the
> libm fmaf function instead of AVX muls and adds.
> 
> 2. On AArch64 targets, ISD::FMA nodes are formed for operations on
> fp128, resulting in a call to a software fp128 FMA implementation
> rather than a software fp128 multiply and a software fp128 add. This
> does not seem to be the intended behavior given the comment above;
> however, I am not sure if this is actually better or worse, since
> neither set of operations is supported in hardware...does anyone know
> which is likely to be faster?
> 
> However, on further investigation, it seems that not checking the
> legality of an FMA operation when lowering fmuladd intrinsics is a
> feature, not a bug, since it allows formation of FMAs with types like
> v16f32 as long as they legalize (via splitting, scalarization,
> promotion, etc.) to types that support FMAs; this case is in fact
> explicitly tested for in test/CodeGen/X86/wide-fma-contraction.ll.
> 
> So the proper solution, it seems, is for
> TargetLoweringBase::isFMAFasterThanMulAndAdd implementations to check
> types and preconditions rather than depending on the caller to do so.
> The one remaining in-tree target that implements this function,
> PowerPC, does check types; however, it checks only for the specific
> legal types, and therefore the following occurs:
> 
> 3. On PowerPC targets, FMAs are not generated from fmuladd intrinsics
> on types like v2f32, v8f32, v4f64, etc., even though they promote,
> split, scalarize, etc. to types that support hardware FMAs.

Thanks for working on this! The PPC portions LGTM.

 -Hal

> 
> This patch resolves all of these issues by modifying the
> implementations of this virtual function to check for subtarget
> features and for EVTs that legalize to MVTs that support hardware
> FMAs. (If I've understood the legalization rules correctly, it turns
> out the latter can always be done in these targets just by checking
> the scalar type, but this is not the correct solution in the general
> case.) Comments are adjusted accordingly.
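> 
> For illustration, the X86 override ends up looking roughly like the
> following (a simplified sketch, using the new name discussed below;
> see the patch itself for the exact code and for the analogous
> AArch64, SystemZ, and PowerPC changes):
> 
>   bool X86TargetLowering::isFMAFasterThanFMulAndFAdd(EVT VT) const {
>     // Without the FMA/FMA4 subtarget features there is no hardware
>     // FMA, so fmuladd should be expanded to a multiply and an add.
>     if (!Subtarget->hasFMA() && !Subtarget->hasFMA4())
>       return false;
> 
>     // Check only the scalar type: wide vectors such as v16f32
>     // legalize (by splitting) to vector types with hardware FMA
>     // support.
>     VT = VT.getScalarType();
>     if (!VT.isSimple())
>       return false;
> 
>     switch (VT.getSimpleVT().SimpleTy) {
>     case MVT::f32:
>     case MVT::f64:
>       return true;
>     default:
>       break;
>     }
> 
>     return false;
>   }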
> 
> (For the AArch64 backend in particular, I made the assumption, for
> now, that FMAs should not be formed on fp128, but this can be changed
> with a one-line fix.)
> 
> Since all current implementations of this virtual function were buggy
> and the comment describing its usage was incorrect, I've also decided
> to rename it from "TargetLoweringBase::isFMAFasterThanMulAndAdd" to
> "TargetLoweringBase::isFMAFasterThanFMulAndFAdd" to force a merge
> conflict for any out-of-tree target implementing it; the new name is
> more accurate and consistent with the names of other functions in this
> interface anyway*. I will send a message to LLVMDev describing the
> change and the appropriate fixes required if/when it goes through.
> (Suggestions for a better solution to this problem are welcome!)
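> 
> For reference, the renamed hook in TargetLoweringBase reads roughly
> as follows (a sketch of the declaration and its updated comment, not
> the exact wording of the patch):
> 
>   /// isFMAFasterThanFMulAndFAdd - Return true if an FMA operation is
>   /// faster than a pair of fmul and fadd instructions. fmuladd
>   /// intrinsics will be expanded to FMAs when this method returns
>   /// true, otherwise fmuladd is expanded to fmul + fadd.
>   ///
>   /// NOTE: This may be called before legalization with types for
>   /// which FMA is not legal; implementations should return true as
>   /// long as the type legalizes (by splitting, scalarization,
>   /// promotion, etc.) to types with hardware FMA support.
>   virtual bool isFMAFasterThanFMulAndFAdd(EVT) const {
>     return false;
>   }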
> 
> Also, I took this opportunity to modify DAGCombiner so that, when
> forming ISD::FMA nodes (in unsafe-math mode), it checks for type
> legality only after legalization, since this is allowed by the new
> function interface definition (as long as isFMAFasterThanFMulAndFAdd
> returns true for the type). In practice, with current target
> implementations, this change seems to be a no-op, since the FMA will
> be formed after legalization anyway; however, there doesn't seem to
> be any harm in forming the FMA earlier if possible.
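> 
> In sketch form, the guard on that combine now looks something like
> this (simplified, with variable names as in DAGCombiner::visitFADD;
> not the exact diff):
> 
>   // Form (fma x, y, z) from (fadd (fmul x, y), z). The per-type hook
>   // is always consulted, but ISD::FMA legality is only required once
>   // legalization has run; an illegal-typed FMA formed earlier will
>   // still be split/scalarized/promoted to supported types later.
>   if (DAG.getTarget().Options.UnsafeFPMath &&
>       TLI.isFMAFasterThanFMulAndFAdd(VT) &&
>       (!LegalOperations || TLI.isOperationLegalOrCustom(ISD::FMA, VT))) {
>     if (N0.getOpcode() == ISD::FMUL && N0->hasOneUse())
>       return DAG.getNode(ISD::FMA, SDLoc(N), VT,
>                          N0.getOperand(0), N0.getOperand(1), N1);
>   }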
> 
> No current tests are affected by this change; I've added tests
> verifying that the specific symptoms described above are fixed, as
> well as some other sanity checks, just in case.
> 
> Please review and let me know if you have any comments! In
> particular, if anyone has definite knowledge of what the correct
> behavior of fmuladd on fp128 on AArch64 should be, please let me
> know.
> 
> Thanks,
> Stephen
> 
> * Fast integer FMA instructions are at least theoretically possible,
> I think (?); however, the current comment already indicates that the
> function is used specifically to lower fmuladd intrinsics, so "FMul"
> and "FAdd" are more accurate.
> 
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory


