[llvm] [InstCombine] Canonicalize reassoc contract fmuladd to fmul + fadd (PR #90434)
Joshua Cranmer via llvm-commits
llvm-commits at lists.llvm.org
Wed May 8 12:36:24 PDT 2024
================
@@ -204,6 +204,34 @@ define float @fmuladd_fneg_x_fneg_y_fast(float %x, float %y, float %z) {
ret float %fmuladd
}
+define float @fmuladd_unfold(float %x, float %y, float %z) {
+; CHECK-LABEL: @fmuladd_unfold(
+; CHECK-NEXT: [[TMP1:%.*]] = fmul reassoc contract float [[X:%.*]], [[Y:%.*]]
+; CHECK-NEXT: [[FMULADD:%.*]] = fadd reassoc contract float [[TMP1]], [[Z:%.*]]
+; CHECK-NEXT: ret float [[FMULADD]]
+;
+ %fmuladd = call reassoc contract float @llvm.fmuladd.f32(float %x, float %y, float %z)
+ ret float %fmuladd
+}
+
+define float @fmuladd_unfold_missing_reassoc(float %x, float %y, float %z) {
+; CHECK-LABEL: @fmuladd_unfold_missing_reassoc(
+; CHECK-NEXT: [[FMULADD:%.*]] = call contract float @llvm.fmuladd.f32(float [[X:%.*]], float [[Y:%.*]], float [[Z:%.*]])
+; CHECK-NEXT: ret float [[FMULADD]]
+;
+ %fmuladd = call contract float @llvm.fmuladd.f32(float %x, float %y, float %z)
+ ret float %fmuladd
+}
+
+define float @fmuladd_unfold_missing_contract(float %x, float %y, float %z) {
+; CHECK-LABEL: @fmuladd_unfold_missing_contract(
+; CHECK-NEXT: [[FMULADD:%.*]] = call reassoc float @llvm.fmuladd.f32(float [[X:%.*]], float [[Y:%.*]], float [[Z:%.*]])
+; CHECK-NEXT: ret float [[FMULADD]]
+;
+ %fmuladd = call reassoc float @llvm.fmuladd.f32(float %x, float %y, float %z)
+ ret float %fmuladd
+}
+
----------------
jcranmer-intel wrote:
> I kind of think you do need reassoc here, or otherwise there is a risk of contracting across different statements than was in the original program. There's an implicit arithmetic_fence embedded in the fmuladd
Arguably, by the same logic, you'd also need `arcp` to prevent the `fmul` from being converted into an `fdiv` (if you believe that `arcp` permits bidirectional transformations); not that I think there is any profitable optimization that would do that. Potentially other not-yet-invented FMF as well. Unless you interpret FMF on an `fmuladd` as allowing appropriate rewrites on the component `fmul` and `fadd` that affect one operation but not the other.
> That is, without any fast-math flags this
>
> `%r = call float @llvm.fmuladd(float %b, float %c, float %d)`
>
> is semantically equivalent to this:
>
> ```
> %t1 = fmul contract float %b, %c
> %t2 = fadd contract float %t1, %d
> %r = call float @llvm.arithmetic.fence.f32(float %t2)
> ```
Not quite, I think: you also need arithmetic fences on `%b`, `%c`, and `%d`.
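Spelling that out, the fully fenced expansion would look something like this (a sketch of the intended semantics, not anything the LangRef currently spells out):

```
%b.f = call float @llvm.arithmetic.fence.f32(float %b)
%c.f = call float @llvm.arithmetic.fence.f32(float %c)
%d.f = call float @llvm.arithmetic.fence.f32(float %d)
%t1  = fmul contract float %b.f, %c.f
%t2  = fadd contract float %t1, %d.f
%r   = call float @llvm.arithmetic.fence.f32(float %t2)
```

Without the operand fences, the `contract` on the inner `fmul`/`fadd` could license fusing with operations that produced `%b`, `%c`, or `%d`, which is exactly the cross-statement contraction at issue here.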
> Now, introducing the `contract` fast-math flag raises an interesting question that isn't clearly answered in the current language reference. Do the fast-math flags on an intrinsic apply to the implied operations within the intrinsic or only to the intrinsic as an atomic operation interacting with other operations?
If you start from an IEEE 754 base, `fma` is clearly defined as a core atomic ternary operation, but `fmuladd` exists in a weird limbo state of being half `fma` and half `fmul + fadd`.
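Concretely, the contrast between the two intrinsics per the current LangRef is:

```
; llvm.fma: the IEEE 754 fusedMultiplyAdd, i.e. x*y + z computed with a
; single rounding; it may not be split into a separately rounded
; fmul + fadd.
%r1 = call float @llvm.fma.f32(float %x, float %y, float %z)

; llvm.fmuladd: the backend may lower this either as the fused fma above
; or as a separately rounded fmul + fadd pair, whichever is more
; efficient for the target.
%r2 = call float @llvm.fmuladd.f32(float %x, float %y, float %z)
```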
> On the other hand, if you take the approach that the `contract` flag overrides the implied arithmetic fence, this works.
The simple answer is that having FMF of any kind on an arithmetic fence is kind of a semantics tug-of-war and just shouldn't happen...
> That said, I don't think the fast-math flags should eliminate the implied arithmetic fence. I think both the original canonicalization and this proposed extension to it are illegal. If flang had intended to allow contraction across operations, it shouldn't have used the intrinsic.
... which itself raises the question of whether the arithmetic fence is part of the implied semantics of `fmuladd`, and in particular whether FMF on an `fmuladd` has different effects from FMF on an implied arithmetic fence.
Over the course of this discussion, I think I've leaned more in the direction of not canonicalizing `fmuladd` into `fmul + fadd`, and instead making passes support the instruction as if it were an `fma` (although I note that the motivating issue also seems to fail if you use an `fma` as a reduction).
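For reference, the reduction shape at issue looks roughly like this (a hypothetical loop body; the motivating case comes from frontend-generated code, and the names here are illustrative):

```
loop:
  %acc = phi float [ 0.0, %entry ], [ %acc.next, %loop ]
  %a = load float, ptr %pa
  %b = load float, ptr %pb
  %acc.next = call reassoc contract float @llvm.fmuladd.f32(float %a, float %b, float %acc)
  br i1 %done, label %exit, label %loop
```

Passes that only pattern-match an `fmul`/`fadd` reduction chain fail to recognize this, which is what motivated canonicalizing the call away in the first place.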
[I think we have a clear contender for a topic in next week's FP working group meeting!]
https://github.com/llvm/llvm-project/pull/90434