[llvm-dev] Fusing contract fadd/fsub with normal fmul

Mon Jun 12 11:22:56 PDT 2017

On Mon, Jun 12, 2017 at 9:40 AM, Sanjay Patel <spatel at rotateright.com> wrote:
> For reference, the FMF 'contract' patches are listed here:
> https://bugs.llvm.org/show_bug.cgi?id=25721#c6
>
> If we can make the documentation better, that would certainly be a welcome
> patch.
>
> It would be better to see the IR for your example(s), but I think you'd need

The IR of the scalar loop is
```
if13:                                             ; preds = %scalar.ph, %if13
 %s.124 = phi double [ %51, %if13 ], [ %bc.merge.rdx, %scalar.ph ]
 %"i#672.023" = phi i64 [ %52, %if13 ], [ %bc.resume.val, %scalar.ph ]
 %46 = getelementptr double, double* %13, i64 %"i#672.023"
 %47 = load double, double* %46, align 8
 %48 = getelementptr double, double* %15, i64 %"i#672.023"
 %49 = load double, double* %48, align 8
 %50 = fmul double %47, %49
 %51 = fadd fast double %s.124, %50
 %52 = add nuw nsw i64 %"i#672.023", 1
 %53 = icmp slt i64 %52, %9
 br i1 %53, label %if13, label %L11.outer.split.L11.outer.split.split_crit_edge.outer.loopexit
```

And it can be vectorized to

```
vector.body:                                      ; preds = %vector.body, %vector.ph
 %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
 %vec.phi = phi <4 x double> [ %19, %vector.ph ], [ %40, %vector.body ]
 %vec.phi94 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %41, %vector.body ]
 %vec.phi95 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %42, %vector.body ]
 %vec.phi96 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %43, %vector.body ]
 %20 = getelementptr double, double* %13, i64 %index
 %21 = bitcast double* %20 to <4 x double>*
 %wide.load = load <4 x double>, <4 x double>* %21, align 8
 %22 = getelementptr double, double* %20, i64 4
 %23 = bitcast double* %22 to <4 x double>*
 %wide.load100 = load <4 x double>, <4 x double>* %23, align 8
 %24 = getelementptr double, double* %20, i64 8
 %25 = bitcast double* %24 to <4 x double>*
 %wide.load101 = load <4 x double>, <4 x double>* %25, align 8
 %26 = getelementptr double, double* %20, i64 12
 %27 = bitcast double* %26 to <4 x double>*
 %wide.load102 = load <4 x double>, <4 x double>* %27, align 8
 %28 = getelementptr double, double* %15, i64 %index
 %29 = bitcast double* %28 to <4 x double>*
 %wide.load103 = load <4 x double>, <4 x double>* %29, align 8
 %30 = getelementptr double, double* %28, i64 4
 %31 = bitcast double* %30 to <4 x double>*
 %wide.load104 = load <4 x double>, <4 x double>* %31, align 8
 %32 = getelementptr double, double* %28, i64 8
 %33 = bitcast double* %32 to <4 x double>*
 %wide.load105 = load <4 x double>, <4 x double>* %33, align 8
 %34 = getelementptr double, double* %28, i64 12
 %35 = bitcast double* %34 to <4 x double>*
 %wide.load106 = load <4 x double>, <4 x double>* %35, align 8
 %36 = fmul <4 x double> %wide.load, %wide.load103
 %37 = fmul <4 x double> %wide.load100, %wide.load104
 %38 = fmul <4 x double> %wide.load101, %wide.load105
 %39 = fmul <4 x double> %wide.load102, %wide.load106
 %40 = fadd fast <4 x double> %vec.phi, %36
 %41 = fadd fast <4 x double> %vec.phi94, %37
 %42 = fadd fast <4 x double> %vec.phi95, %38
 %43 = fadd fast <4 x double> %vec.phi96, %39
 %index.next = add i64 %index, 16
 %44 = icmp eq i64 %index.next, %n.vec
 br i1 %44, label %middle.block, label %vector.body
```

If contracting a normal fmul with a fast fadd were allowed, both loops could use FMA.
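
To make concrete what such a contraction would change numerically, here is a rough sketch in Python (chosen only because exact rational arithmetic makes a single rounding easy to model; it is not what LLVM does internally). The helper names `fused_mul_add`, `dot_separate` and `dot_fused` are made up for illustration, and the fused operation is modelled with `fractions.Fraction` rather than a hardware FMA, assuming finite inputs.

```python
from fractions import Fraction


def fused_mul_add(a, b, c):
    """Model of a fused multiply-add: a*b + c computed exactly, rounded once.

    Hypothetical helper; assumes finite inputs (Fraction rejects inf/nan).
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def dot_separate(xs, ys):
    """s += x * y with two roundings per step (separate fmul, then fadd)."""
    s = 0.0
    for x, y in zip(xs, ys):
        s = s + x * y
    return s


def dot_fused(xs, ys):
    """s += x * y with one rounding per step (the contracted form)."""
    s = 0.0
    for x, y in zip(xs, ys):
        s = fused_mul_add(x, y, s)
    return s


# a * b is exactly 1 - 2**-54, which lies halfway between two adjacent
# doubles and rounds to 1.0 when the multiply is rounded on its own.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
xs = [1.0, a]
ys = [-1.0, b]

print(dot_separate(xs, ys))  # 0.0
print(dot_fused(xs, ys))     # -5.551115123125783e-17, i.e. -2**-54
```

With these inputs the separate version loses the low bits of the product before the add, while the fused version keeps them. So the contracted loop is, if anything, closer to the exact sum, but it is still a different result from the one the unflagged fmul would produce on its own.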

> 'contract' on both the fmul and fadd to generate an FMA. Conservatively, we
> wouldn't alter the result if either component somehow required strict FP. To
> vectorize, you probably need 'fast' on both ops because vectorization would
> be changing the order of operations (reassociation).
>
>
> On Fri, Jun 9, 2017 at 9:04 PM, Yichao Yu via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>>
>> Hi,
>>
>> On LLVM 5.0 (current trunk), fadd/fsub and fmul that are both marked
>> with `contract` or `fast` can be merged to a fma instruction by the
>> backend.
>>
>> I'm wondering about the exact semantic of this new flag as well as
>> `fast` and in particular, would it be valid to do this when only the
>> `fadd`/`fsub` (and not the `fmul`) is marked with `contract` or at
>> least `fast`. The reasoning is that doing this will have a similar
>> effect as if the `fadd`/`fsub` is performed not to IEEE spec so a
>> single flag on this instruction should be enough for the
>> transformation.
>>
>> The particular case I'm interested in is vectorized loop with
>> reduction like in pseudo C code `s += a[i] * b[i]`. Our front end will
>> recognize this and mark the `+` as `fast` to enable vectorization.
>> It'll be great if this can enable the reduction to be done with `fma`
>> instructions.
>>
>> Yichao Yu