[PATCH] D138421: [AArch64][SVE] Enable Tail-Folding. WIP

Sjoerd Meijer via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Nov 21 05:39:05 PST 2022


SjoerdMeijer created this revision.
SjoerdMeijer added reviewers: paulwalker-arm, david-arm, dmgreen, sdesmalen, fhahn.
Herald added subscribers: ctetreau, psnobl, hiraditya, kristof.beyls, tschuett.
Herald added a reviewer: efriedma.
Herald added a project: All.
SjoerdMeijer requested review of this revision.
Herald added a subscriber: pcwang-thead.
Herald added a project: LLVM.

This enables tail-folding for SVE. As you know, tail-folding has great potential to improve codegen: instead of emitting a vector loop plus a scalar epilogue, the runtime checks that go with it, and some setup code for the vector loop, we emit a single predicated vector loop. This can help performance significantly in some cases.
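For anyone less familiar with the transform, here is a scalar model of the difference (a conceptual sketch only, with a fixed `VL` standing in for the runtime SVE vector length and per-lane loops modelling vector operations; this is not the code the vectorizer generates):

```c
#include <stdbool.h>
#include <stddef.h>

#define VL 4  /* stand-in for the runtime SVE vector length */

/* Without tail-folding: a vector main loop plus a scalar epilogue
   that handles the n % VL leftover elements. */
void memset32_epilogue(int *dst, int val, size_t n) {
    size_t i = 0;
    for (; i + VL <= n; i += VL)          /* vector body, modelled per lane */
        for (size_t l = 0; l < VL; ++l)
            dst[i + l] = val;
    for (; i < n; ++i)                    /* scalar tail loop */
        dst[i] = val;
}

/* With tail-folding: one predicated loop. An active-lane mask
   (cf. llvm.get.active.lane.mask) disables out-of-bounds lanes,
   so no epilogue or extra iteration-count checks are needed. */
void memset32_folded(int *dst, int val, size_t n) {
    for (size_t i = 0; i < n; i += VL)
        for (size_t l = 0; l < VL; ++l) {
            bool active = (i + l) < n;    /* the lane predicate */
            if (active)                   /* masked store */
                dst[i + l] = val;
        }
}
```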

I have added WIP (work in progress) to the subject because I am still collecting performance numbers, and I wanted to get your input while I am doing that.

My results so far on a 2x256b SVE implementation:

- 5% uplift for X264 (SPEC INT 2017)
- Neutral for the other apps in SPEC INT 2017.
- 1% uplift on an embedded benchmark. It's not a very representative workload, but it has a few matrix kernels and this 1% is significant for that benchmark, nicely illustrating benefits of tail-folding.
- I've tried the LLVM test-suite, but even generating a baseline shows it is really noisy. I haven't run it with tail-folding yet, because I am not sure I could conclude anything from the numbers.

Next, I will collect numbers for SPEC FP 2017.

This change enables "simple" tail-folding only, so it does not e.g. deal with reductions or first-order recurrences. That seemed like a good first step to me while we gain more experience with this. I am interested to hear whether you have suggestions for workloads or cases that I should check.
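To illustrate why reductions are gated behind their own tail-folding mode (as the fadd_red_fast checks below verify, "simple" leaves them unfolded), here is a scalar model of a tail-folded fadd reduction: inactive tail lanes must contribute the reduction identity, which makes the predication less trivial than for plain loads and stores. Again a conceptual sketch with a fixed `VL`, not generated code:

```c
#include <stddef.h>

#define VL 4  /* stand-in for the runtime vector length */

/* Tail-folded fadd reduction: the lane predicate must ensure that
   inactive lanes add the identity (0.0f), otherwise the final
   horizontal reduction would pick up garbage from the tail lanes. */
float fadd_red_folded(const float *src, size_t n) {
    float lanes[VL] = {0.0f, 0.0f, 0.0f, 0.0f};  /* vector accumulator */
    for (size_t i = 0; i < n; i += VL) {
        for (size_t l = 0; l < VL; ++l) {
            if (i + l < n)          /* active-lane mask */
                lanes[l] += src[i + l];
            /* inactive lanes implicitly add the identity 0.0f */
        }
    }
    float sum = 0.0f;               /* cf. llvm.vector.reduce.fadd */
    for (size_t l = 0; l < VL; ++l)
        sum += lanes[l];
    return sum;
}
```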


https://reviews.llvm.org/D138421

Files:
  llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
  llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll


Index: llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
===================================================================
--- llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
+++ llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
@@ -1,5 +1,6 @@
 ; RUN: opt < %s -loop-vectorize -sve-tail-folding=disabled -S | FileCheck %s -check-prefix=CHECK-NOTF
-; RUN: opt < %s -loop-vectorize -sve-tail-folding=default -S | FileCheck %s -check-prefix=CHECK-NOTF
+; RUN: opt < %s -loop-vectorize -sve-tail-folding=default -S | FileCheck %s -check-prefix=CHECK-SIMPLE
+; RUN: opt < %s -loop-vectorize -sve-tail-folding=simple -S | FileCheck %s -check-prefix=CHECK-SIMPLE
 ; RUN: opt < %s -loop-vectorize -sve-tail-folding=all -S | FileCheck %s -check-prefix=CHECK-TF
 ; RUN: opt < %s -loop-vectorize -sve-tail-folding=disabled+simple+reductions+recurrences -S | FileCheck %s -check-prefix=CHECK-TF
 ; RUN: opt < %s -loop-vectorize -sve-tail-folding=all+noreductions -S | FileCheck %s -check-prefix=CHECK-TF-NORED
@@ -17,6 +18,14 @@
 ; CHECK-NOTF-NOT:     %{{.*}} = phi <vscale x 4 x i1>
 ; CHECK-NOTF:         store <vscale x 4 x i32> %[[SPLAT]], <vscale x 4 x i32>*
 
+; CHECK-SIMPLE-LABEL: @simple_memset(
+; CHECK-SIMPLE:       vector.ph:
+; CHECK-SIMPLE:         %[[INSERT:.*]] = insertelement <vscale x 4 x i32> poison, i32 %val, i32 0
+; CHECK-SIMPLE:         %[[SPLAT:.*]] = shufflevector <vscale x 4 x i32> %[[INSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-SIMPLE:       vector.body:
+; CHECK-SIMPLE:         %[[ACTIVE_LANE_MASK:.*]] = phi <vscale x 4 x i1>
+; CHECK-SIMPLE:         call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> %[[SPLAT]], {{.*}} %[[ACTIVE_LANE_MASK]]
+
 ; CHECK-TF-NORED-LABEL: @simple_memset(
 ; CHECK-TF-NORED:       vector.ph:
 ; CHECK-TF-NORED:         %[[INSERT:.*]] = insertelement <vscale x 4 x i32> poison, i32 %val, i32 0
@@ -73,6 +82,14 @@
 ; CHECK-NOTF:       middle.block:
 ; CHECK-NOTF-NEXT:    call fast float @llvm.vector.reduce.fadd.nxv4f32(float -0.000000e+00, <vscale x 4 x float> %[[ADD]])
 
+; CHECK-SIMPLE-LABEL: @fadd_red_fast
+; CHECK-SIMPLE:       vector.body:
+; CHECK-SIMPLE-NOT:     %{{.*}} = phi <vscale x 4 x i1>
+; CHECK-SIMPLE:         %[[LOAD:.*]] = load <vscale x 4 x float>
+; CHECK-SIMPLE:         %[[ADD:.*]] = fadd fast <vscale x 4 x float> %[[LOAD]]
+; CHECK-SIMPLE:       middle.block:
+; CHECK-SIMPLE-NEXT:    call fast float @llvm.vector.reduce.fadd.nxv4f32(float -0.000000e+00, <vscale x 4 x float> %[[ADD]])
+
 ; CHECK-TF-NORED-LABEL: @fadd_red_fast
 ; CHECK-TF-NORED:       vector.body:
 ; CHECK-TF-NORED-NOT:     %{{.*}} = phi <vscale x 4 x i1>
@@ -141,6 +158,19 @@
 ; CHECK-NOTF:         %[[ADD:.*]] = add nsw <vscale x 4 x i32> %[[LOAD]], %[[SPLICE]]
 ; CHECK-NOTF:         store <vscale x 4 x i32> %[[ADD]]
 
+; CHECK-SIMPLE-LABEL: @add_recur
+; CHECK-SIMPLE:       entry:
+; CHECK-SIMPLE:         %[[PRE:.*]] = load i32, i32* %src, align 4
+; CHECK-SIMPLE:       vector.ph:
+; CHECK-SIMPLE:         %[[RECUR_INIT:.*]] = insertelement <vscale x 4 x i32> poison, i32 %[[PRE]]
+; CHECK-SIMPLE:       vector.body:
+; CHECK-SIMPLE-NOT:     %{{.*}} = phi <vscale x 4 x i1>
+; CHECK-SIMPLE:         %[[VECTOR_RECUR:.*]] = phi <vscale x 4 x i32> [ %[[RECUR_INIT]], %vector.ph ], [ %[[LOAD:.*]], %vector.body ]
+; CHECK-SIMPLE:         %[[LOAD]] = load <vscale x 4 x i32>
+; CHECK-SIMPLE:         %[[SPLICE:.*]] = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %[[VECTOR_RECUR]], <vscale x 4 x i32> %[[LOAD]], i32 -1)
+; CHECK-SIMPLE:         %[[ADD:.*]] = add nsw <vscale x 4 x i32> %[[LOAD]], %[[SPLICE]]
+; CHECK-SIMPLE:         store <vscale x 4 x i32> %[[ADD]]
+
 ; CHECK-TF-NORED-LABEL: @add_recur
 ; CHECK-TF-NORED:       entry:
 ; CHECK-TF-NORED:         %[[PRE:.*]] = load i32, i32* %src, align 4
@@ -220,6 +250,12 @@
 ; CHECK-NOTF:         %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
 ; CHECK-NOTF:         %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
 
+; CHECK-SIMPLE-LABEL: @interleave(
+; CHECK-SIMPLE:       vector.body:
+; CHECK-SIMPLE:         %[[LOAD:.*]] = load <8 x float>, <8 x float>
+; CHECK-SIMPLE:         %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+; CHECK-SIMPLE:         %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+
 ; CHECK-TF-LABEL: @interleave(
 ; CHECK-TF:       vector.body:
 ; CHECK-TF:         %[[LOAD:.*]] = load <8 x float>, <8 x float>
Index: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
===================================================================
--- llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -61,9 +61,7 @@
         Bits = 0;
       else if (TailFoldType == "all")
         Bits = TFAll;
-      else if (TailFoldType == "default")
-        Bits = 0; // Currently defaults to never tail-folding.
-      else if (TailFoldType == "simple")
+      else if (TailFoldType == "default" || TailFoldType == "simple")
         add(TFSimple);
       else if (TailFoldType == "reductions")
         add(TFReductions);


-------------- next part --------------
A non-text attachment was scrubbed...
Name: D138421.476865.patch
Type: text/x-patch
Size: 5395 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20221121/df0dea08/attachment.bin>

