[llvm] [LoopVectorize] Don't discount instructions scalarized due to tail folding (PR #109289)
David Sherwood via llvm-commits
llvm-commits at lists.llvm.org
Fri Oct 18 06:03:12 PDT 2024
david-arm wrote:
I compiled these tests, which all end up using predicated blocks, with the command "clang -O3 -mcpu=generic -fno-unroll-loops -S ./pr109289.c":
```
void lowtrip(short *dest) {
  for (int i = 0; i < 15; i++) {
    dest[i] = 1;
  }
}

void conditional(short *dest, int n) {
  for (int i = 0; i < 16; i++) {
    if (i & 0x1)
      dest[i] = 1;
  }
}
```
Compile this test with "clang -O3 -mcpu=generic -fno-unroll-loops -S ./pr109289_optsize.c":
```
void optsize(short *dest, short val, int n) {
  for (int i = 0; i < n; i++) {
    dest[i] = val;
  }
}
```
The IR before the loop vectoriser looks like this:
== pr109289.c ==
```
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32"
target triple = "aarch64-unknown-linux-gnu"

; Function Attrs: nofree norecurse nosync nounwind memory(argmem: write) uwtable
define dso_local void @lowtrip(ptr nocapture noundef writeonly %dest) local_unnamed_addr #0 {
entry:
  br label %for.body

for.cond.cleanup:                                 ; preds = %for.body
  ret void

for.body:                                         ; preds = %entry, %for.body
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds i16, ptr %dest, i64 %indvars.iv
  store i16 1, ptr %arrayidx, align 2
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, 15
  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
}

; Function Attrs: nofree norecurse nosync nounwind memory(argmem: readwrite) uwtable
define dso_local void @conditional(ptr nocapture noundef writeonly %dest, i32 noundef %n) local_unnamed_addr #0 {
entry:
  br label %for.body

for.cond.cleanup:                                 ; preds = %for.inc
  ret void

for.body:                                         ; preds = %entry, %for.inc
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.inc ]
  %and6 = and i64 %indvars.iv, 1
  %tobool.not = icmp eq i64 %and6, 0
  br i1 %tobool.not, label %for.inc, label %if.then

if.then:                                          ; preds = %for.body
  %arrayidx = getelementptr inbounds i16, ptr %dest, i64 %indvars.iv
  store i16 1, ptr %arrayidx, align 2
  br label %for.inc

for.inc:                                          ; preds = %for.body, %if.then
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, 16
  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
}

attributes #0 = { "target-cpu"="generic" }
```
== pr109289_optsize.c ==
```
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32"
target triple = "aarch64-unknown-linux-gnu"

; Function Attrs: nofree norecurse nosync nounwind memory(argmem: write) uwtable
define dso_local void @optsize(ptr nocapture noundef writeonly %dest, i16 noundef %val, i32 noundef %n) local_unnamed_addr #0 {
entry:
  %cmp3 = icmp sgt i32 %n, 0
  br i1 %cmp3, label %for.body.preheader, label %for.cond.cleanup

for.body.preheader:                               ; preds = %entry
  %wide.trip.count = zext nneg i32 %n to i64
  br label %for.body

for.cond.cleanup.loopexit:                        ; preds = %for.body
  br label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %entry
  ret void

for.body:                                         ; preds = %for.body.preheader, %for.body
  %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds i16, ptr %dest, i64 %indvars.iv
  store i16 %val, ptr %arrayidx, align 2
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond.not, label %for.cond.cleanup.loopexit, label %for.body
}

attributes #0 = { optsize "target-cpu"="generic" }
```
In all 3 tests the vectoriser decides to vectorise as follows:
== lowtrip ==
```
LV: Not allowing scalar epilogue due to low trip count.
LV: Found an estimated cost of 2 for VF 1 For instruction: store i16 1
LV: Scalar loop costs: 4.
LV: Found an estimated cost of 2 for VF 2 For instruction: store i16 1
LV: Vector loop of width 2 costs: 2.
LV: Found an estimated cost of 4 for VF 4 For instruction: store i16 1
LV: Vector loop of width 4 costs: 1.
LV: Found an estimated cost of 8 for VF 8 For instruction: store i16 1
LV: Vector loop of width 8 costs: 1.
LV: Selecting VF: 8.
Executing best plan with VF=8, UF=1
```
== conditional ==
```
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1
LV: Found an estimated cost of 2 for VF 1 For instruction: store i16 1
LV: Scalar loop costs: 4.
LV: Found an estimated cost of 4 for VF 2 For instruction: br i1
LV: Found an estimated cost of 2 for VF 2 For instruction: store i16 1
LV: Vector loop of width 2 costs: 5.
LV: Found an estimated cost of 8 for VF 4 For instruction: br i1
LV: Found an estimated cost of 4 for VF 4 For instruction: store i16 1
LV: Vector loop of width 4 costs: 4.
LV: Found an estimated cost of 16 for VF 8 For instruction: br i1
LV: Found an estimated cost of 8 for VF 8 For instruction: store i16 1
LV: Vector loop of width 8 costs: 4.
LV: Selecting VF: 1.
Executing best plan with VF=8, UF=1
```
== optsize ==
```
LV: Not allowing scalar epilogue due to -Os/-Oz.
LV: Found an estimated cost of 2 for VF 1 For instruction: store i16
LV: Scalar loop costs: 4.
LV: Found an estimated cost of 2 for VF 2 For instruction: store i16
LV: Vector loop of width 2 costs: 2.
LV: Found an estimated cost of 4 for VF 4 For instruction: store i16 %val
LV: Vector loop of width 4 costs: 1.
LV: Found an estimated cost of 8 for VF 8 For instruction: store i16 %val
LV: Vector loop of width 8 costs: 1.
LV: Selecting VF: 8.
Executing best plan with VF=8, UF=1
```
In all three cases we have made the wrong decision, given the code we currently generate. These are some example runtimes on neoverse-v1 with all three functions vectorised (I had to force vectorisation of conditional; I ran each function in a loop many times and timed it):

- lowtrip: 50 ms
- conditional: 31 ms
- optsize: 47 ms

Without vectorisation on neoverse-v1:

- lowtrip: 26 ms
- conditional: 45 ms
- optsize: 26 ms
On the surface it appears that for lowtrip and optsize we shouldn't be vectorising, but for conditional we should. When I dug into this, it turned out that although conditional should suffer from the same problem as lowtrip and optsize, the codegen is actually much better. This is the final IR for lowtrip and conditional:
```
define dso_local void @lowtrip(ptr nocapture noundef writeonly %dest) local_unnamed_addr #0 {
entry:
  br label %vector.body

vector.body:                                      ; preds = %pred.store.continue18, %entry
  %index = phi i64 [ 0, %entry ], [ %index.next, %pred.store.continue18 ]
  %vec.ind = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %entry ], [ %vec.ind.next, %pred.store.continue18 ]
  %0 = icmp ult <8 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15>
  %1 = extractelement <8 x i1> %0, i64 0
  br i1 %1, label %pred.store.if, label %pred.store.continue
  ...

define dso_local void @conditional(ptr nocapture noundef writeonly %dest, i32 noundef %n) local_unnamed_addr #0 {
entry:
  br label %vector.body

vector.body:                                      ; preds = %pred.store.continue20, %entry
  %index = phi i64 [ 0, %entry ], [ %index.next, %pred.store.continue20 ]
  %vec.ind = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %entry ], [ %vec.ind.next, %pred.store.continue20 ]
  %0 = trunc <8 x i64> %vec.ind to <8 x i1>
  %1 = extractelement <8 x i1> %0, i64 0
  br i1 %1, label %pred.store.if, label %pred.store.continue
  ...
```
Ignoring the fact that we're missing an IR optimisation that should detect that the "icmp ult" is guaranteed to return all-true, you can see what has happened after the CodeGenPrepare pass for each test:
```
define dso_local void @lowtrip(ptr nocapture noundef writeonly %dest) local_unnamed_addr #0 {
entry:
  %scevgep = getelementptr i8, ptr %dest, i64 8
  br label %vector.body

vector.body:                                      ; preds = %pred.store.continue18, %entry
  %lsr.iv1 = phi ptr [ %scevgep2, %pred.store.continue18 ], [ %scevgep, %entry ]
  %lsr.iv = phi i64 [ %lsr.iv.next, %pred.store.continue18 ], [ 16, %entry ]
  %vec.ind = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %entry ], [ %vec.ind.next, %pred.store.continue18 ]
  %0 = icmp ult <8 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15>
  %1 = extractelement <8 x i1> %0, i64 0
  br i1 %1, label %pred.store.if, label %pred.store.continue

pred.store.if:                                    ; preds = %vector.body
  %scevgep3 = getelementptr i8, ptr %lsr.iv1, i64 -8
  store i16 1, ptr %scevgep3, align 2, !tbaa !6
  br label %pred.store.continue

pred.store.continue:                              ; preds = %pred.store.if, %vector.body
  %2 = icmp ult <8 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15>
  %3 = extractelement <8 x i1> %2, i64 1
  br i1 %3, label %pred.store.if5, label %pred.store.continue6

pred.store.if5:                                   ; preds = %pred.store.continue
  %scevgep5 = getelementptr i8, ptr %lsr.iv1, i64 -6
  store i16 1, ptr %scevgep5, align 2, !tbaa !6
  br label %pred.store.continue6

pred.store.continue6:                             ; preds = %pred.store.if5, %pred.store.continue
  %4 = icmp ult <8 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15>
  %5 = extractelement <8 x i1> %4, i64 2
  br i1 %5, label %pred.store.if7, label %pred.store.continue8
  ...

define dso_local void @conditional(ptr nocapture noundef writeonly %dest, i32 noundef %n) local_unnamed_addr #0 {
entry:
  %scevgep = getelementptr i8, ptr %dest, i64 8
  br label %vector.body

vector.body:                                      ; preds = %pred.store.continue20, %entry
  %lsr.iv1 = phi ptr [ %scevgep2, %pred.store.continue20 ], [ %scevgep, %entry ]
  %lsr.iv = phi i64 [ %lsr.iv.next, %pred.store.continue20 ], [ 16, %entry ]
  %vec.ind = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %entry ], [ %vec.ind.next, %pred.store.continue20 ]
  %0 = trunc <8 x i64> %vec.ind to <8 x i1>
  %1 = extractelement <8 x i1> %0, i64 0
  br i1 %1, label %pred.store.if, label %pred.store.continue

pred.store.if:                                    ; preds = %vector.body
  %scevgep3 = getelementptr i8, ptr %lsr.iv1, i64 -8
  store i16 1, ptr %scevgep3, align 2, !tbaa !6
  br label %pred.store.continue

pred.store.continue:                              ; preds = %pred.store.if, %vector.body
  %2 = extractelement <8 x i1> %0, i64 1
  br i1 %2, label %pred.store.if7, label %pred.store.continue8

pred.store.if7:                                   ; preds = %pred.store.continue
  %scevgep5 = getelementptr i8, ptr %lsr.iv1, i64 -6
  store i16 1, ptr %scevgep5, align 2, !tbaa !6
  br label %pred.store.continue8

pred.store.continue8:                             ; preds = %pred.store.if7, %pred.store.continue
  %3 = extractelement <8 x i1> %0, i64 2
  br i1 %3, label %pred.store.if9, label %pred.store.continue10
  ...
```
For the lowtrip and optsize cases the "icmp ult" is sunk by the CodeGenPrepare pass into each of the predicated blocks below, which is why the resulting assembly ends up being terrible. However, if we computed the icmp just once in the loop header and then did an extractelement in each predicated block, we'd end up with the same efficient code as the conditional function. In summary, I think we should instead fix the codegen for the lowtrip case. If we do that, then I expect the performance to be better than a scalar loop.
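To illustrate, this is roughly the shape we'd want lowtrip to have after CodeGenPrepare, mirroring what already happens for conditional today (a sketch only; value names like %mask and %lane0 are made up and this is not output from an actual patch):
```
vector.body:
  %vec.ind = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %entry ], [ %vec.ind.next, %pred.store.continue18 ]
  ; the whole mask is computed once per vector iteration...
  %mask = icmp ult <8 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15>
  %lane0 = extractelement <8 x i1> %mask, i64 0
  br i1 %lane0, label %pred.store.if, label %pred.store.continue

pred.store.if:
  ...
  br label %pred.store.continue

pred.store.continue:
  ; ...and each predicated block then only needs a cheap extractelement
  ; of the existing mask, rather than a re-materialised icmp.
  %lane1 = extractelement <8 x i1> %mask, i64 1
  br i1 %lane1, label %pred.store.if5, label %pred.store.continue6
  ...
```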
In this case I agree with Florian that the best solution to the bug you're trying to fix is to limit the scope of the patch to just -Os and to handle it before going through the cost model. The change in this PR applies not only to programs compiled with -Os, but also to low trip count loops and to targets that prefer tail-folding, and it's not obvious to me that always preferring a scalar loop in those cases is the right way to go without further analysis of the impact.
Instead, I think that if you're building with -Os and know that the blocks are going to be predicated (due to tail-folding) and replicated, we should just bail out immediately. For example, I think that `collectInstsToScalarize` sets up a list of `PredicatedBBsAfterVectorization` which you might be able to check in `LoopVectorizationPlanner::plan`.
It looks like there is also some follow-on work:

1. Improve the lowering of lowtrip so that we generate more efficient code, where I hope/expect that vectorising will then be better than scalar.
2. Introduce a middle-end optimisation (perhaps InstCombine?) to detect that
   `%0 = icmp ult <8 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15>`
   returns true for every lane. We know this because the maximum index is 14.
3. Teach the cost model that for the conditional loop vectorising is actually more beneficial than a scalar loop, at least on neoverse-v1.
https://github.com/llvm/llvm-project/pull/109289