[PATCH] D124867: [SLP][NFC] Pre-commit test showing vectorization preventing FMA

Wed May 18 09:49:23 PDT 2022

wjschmidt updated this revision to Diff 430413.
wjschmidt retitled this revision from "[SLP][NFC] Pre-commit test showing horizontal reduction preventing FMA" to "[SLP][NFC] Pre-commit test showing vectorization preventing FMA".
wjschmidt edited the summary of this revision.
wjschmidt added a comment.

Thanks for the helpful comments to date!  In this version, I've removed the undefs from the original test and added a second test that has no loop structure.  Today we generate an unprofitable horizontal reduction for both tests.  For the first test, adding cost modeling that constrains the horizontal reduction is enough to let FMAs be generated.  For the second test it is not: once the reduction is suppressed, we instead vectorize the multiplies, which is also unprofitable.  Together, the two tests show that the cost model must account for lost FMAs both when vectorizing for a reduction and when vectorizing a list of multiplies.

I will have two follow-up patches.  The first introduces costing for lost FMAs and applies it to the horizontal reduction; its updated expected results show that the first test is handled properly, while the second test still has vectorized multiplies.  The second patch applies the same costing to the case of vectorizing a list, after which both tests leave the FMA opportunities in place.  Breaking this into two patches hopefully makes it clearer what happens with the tests.
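
To make the costing idea concrete, here is a toy sketch in Python.  It is not the SLP vectorizer's code and not the follow-up patches: the `Candidate` type, the unit costs, and the one-unit-per-lost-FMA penalty are all assumptions chosen for illustration.  It follows the convention that a negative (vector minus scalar) cost means vectorization looks profitable, and shows how charging for lost FMAs can flip that decision for a chain shaped like the second test (four fmul/fadd pairs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of one vectorization candidate."""
    scalar_cost: int  # scalar ops counted without any FMA fusion
    vector_cost: int  # cost of the vectorized replacement
    lost_fmas: int    # fmul/fadd pairs that could have fused if left scalar


def slp_cost(c: Candidate, account_for_fma: bool) -> int:
    """Return vector cost minus scalar cost; negative looks profitable."""
    cost = c.vector_cost - c.scalar_cost
    if account_for_fma:
        # Assumption: a fused pair costs one op instead of two, so the
        # scalar code is really one unit cheaper per FMA that is lost.
        cost += c.lost_fmas
    return cost


def should_vectorize(c: Candidate, threshold: int = 0,
                     account_for_fma: bool = True) -> bool:
    """Vectorize only if the cost is below the negated threshold."""
    return slp_cost(c, account_for_fma) < -threshold


# Four fmul + four fadd at unit cost; the vector cost of 6 is made up.
chain = Candidate(scalar_cost=8, vector_cost=6, lost_fmas=4)
print(should_vectorize(chain, account_for_fma=False))
print(should_vectorize(chain, account_for_fma=True))
```

With these made-up numbers the candidate looks profitable only while the lost FMAs are ignored.  The same adjustment would have to be applied to each kind of candidate (reduction and list) separately, which is why the second test needs both changes.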


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124867/new/

https://reviews.llvm.org/D124867

Files:
  llvm/test/Transforms/SLPVectorizer/X86/slp-fma-loss.ll


Index: llvm/test/Transforms/SLPVectorizer/X86/slp-fma-loss.ll
===================================================================

--- /dev/null
+++ llvm/test/Transforms/SLPVectorizer/X86/slp-fma-loss.ll
@@ -0,0 +1,71 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -slp-vectorizer -S -mcpu=core-avx2 -mtriple=x86_64-unknown-linux-gnu -slp-threshold=-2 < %s | FileCheck %s
+
+; This test checks for a case when a horizontal reduction of floating-point
+; adds may look profitable, but is not because it eliminates generation of
+; floating-point FMAs that would be more profitable.
+
+; FIXME: We generate a horizontal reduction today.
+
+define void @hr() #0 {
+; CHECK-LABEL: @hr(
+; CHECK-NEXT:    br label [[LOOP:%.*]]
+; CHECK:       loop:
+; CHECK-NEXT:    [[PHI0:%.*]] = phi double [ 0.000000e+00, [[TMP0:%.*]] ], [ [[OP_RDX:%.*]], [[LOOP]] ]
+; CHECK-NEXT:    [[CVT0:%.*]] = uitofp i16 0 to double
+; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <4 x double> <double poison, double 0.000000e+00, double 0.000000e+00, double 0.000000e+00>, double [[CVT0]], i32 0
+; CHECK-NEXT:    [[TMP2:%.*]] = fmul fast <4 x double> zeroinitializer, [[TMP1]]
+; CHECK-NEXT:    [[TMP3:%.*]] = call fast double @llvm.vector.reduce.fadd.v4f64(double -0.000000e+00, <4 x double> [[TMP2]])
+; CHECK-NEXT:    [[OP_RDX]] = fadd fast double [[TMP3]], [[PHI0]]
+; CHECK-NEXT:    br i1 true, label [[EXIT:%.*]], label [[LOOP]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret void
+;
+  br label %loop
+
+loop:
+  %phi0 = phi double [ 0.000000e+00, %0 ], [ %add3, %loop ]
+  %cvt0 = uitofp i16 0 to double
+  %mul0 = fmul fast double 0.000000e+00, %cvt0
+  %add0 = fadd fast double %mul0, %phi0
+  %mul1 = fmul fast double 0.000000e+00, 0.000000e+00
+  %add1 = fadd fast double %mul1, %add0
+  %mul2 = fmul fast double 0.000000e+00, 0.000000e+00
+  %add2 = fadd fast double %mul2, %add1
+  %mul3 = fmul fast double 0.000000e+00, 0.000000e+00
+  %add3 = fadd fast double %mul3, %add2
+  br i1 true, label %exit, label %loop
+
+exit:
+  ret void
+}
+
+; This test checks for a case when either a horizontal reduction of
+; floating-point adds, or vectorizing a tree of floating-point multiplies,
+; may look profitable; but both are not because this eliminates generation
+; of floating-point FMAs that would be more profitable.
+
+; FIXME: We generate a horizontal reduction today, and if that's disabled, we
+; still vectorize some of the multiplies.
+
+define double @hr_or_mul() #0 {
+; CHECK-LABEL: @hr_or_mul(
+; CHECK-NEXT:    [[CVT0:%.*]] = uitofp i16 3 to double
+; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <4 x double> poison, double [[CVT0]], i32 0
+; CHECK-NEXT:    [[SHUFFLE:%.*]] = shufflevector <4 x double> [[TMP1]], <4 x double> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP2:%.*]] = fmul fast <4 x double> <double 7.000000e+00, double -4.300000e+01, double 2.200000e-02, double 9.500000e+00>, [[SHUFFLE]]
+; CHECK-NEXT:    [[TMP3:%.*]] = call fast double @llvm.vector.reduce.fadd.v4f64(double -0.000000e+00, <4 x double> [[TMP2]])
+; CHECK-NEXT:    [[OP_RDX:%.*]] = fadd fast double [[TMP3]], [[CVT0]]
+; CHECK-NEXT:    ret double [[OP_RDX]]
+;
+  %cvt0 = uitofp i16 3 to double
+  %mul0 = fmul fast double 7.000000e+00, %cvt0
+  %add0 = fadd fast double %mul0, %cvt0
+  %mul1 = fmul fast double -4.300000e+01, %cvt0
+  %add1 = fadd fast double %mul1, %add0
+  %mul2 = fmul fast double 2.200000e-02, %cvt0
+  %add2 = fadd fast double %mul2, %add1
+  %mul3 = fmul fast double 9.500000e+00, %cvt0
+  %add3 = fadd fast double %mul3, %add2
+  ret double %add3
+}

