[llvm] [LoopVectorize] Don't discount instructions scalarized due to tail folding (PR #109289)

Fri Sep 27 08:38:07 PDT 2024

================
@@ -21,7 +21,31 @@ define void @foo(ptr noalias %a, ptr noalias %b, ptr noalias %c, i64 %N) {
 ; CHECK-NEXT:   vector.body:
 ; CHECK-NEXT:     EMIT vp<[[CAN_IV:%.+]]> = CANONICAL-INDUCTION ir<0>, vp<[[CAN_INC:%.*]]>
 ; CHECK-NEXT:     WIDEN-INDUCTION %iv = phi 0, %iv.next, ir<1>, vp<[[VF]]>
+; CHECK-NEXT:     vp<[[STEPS:%.+]]> = SCALAR-STEPS vp<[[CAN_IV]]>, ir<1>
 ; CHECK-NEXT:     EMIT vp<[[CMP:%.+]]> = icmp ule ir<%iv>, vp<[[BTC]]>
+; CHECK-NEXT:   Successor(s): pred.load
----------------
david-arm wrote:

At first glance this looks worse than before, unless I'm missing something. It looks like previously we were reusing the same predicated blocks to perform both the load and store, i.e. something like

```
  %cmp = icmp ... <4 x i32>
  %lane0 = extractelement <4 x i32> %cmp, i32 0
  br i1 %lane0, label %block1.if, label %block1.continue

%block1.if:
  .. do load ..
  .. do store ..
...
```

whereas now we've essentially split out the loads and stores with duplicate control flow, i.e.

```
  %cmp = icmp ... <4 x i32>
  %lane0 = extractelement <4 x i32> %cmp, i32 0
  br i1 %lane0, label %block1.load.if, label %block1.load.continue

%block1.load.if:
  .. do load ..
...

%stores:
  %lane0.1 = extractelement <4 x i32> %cmp, i32 0
  br i1 %lane0.1, label %block1.store.if, label %block1.store.continue
...
```

I'd expect the extra control flow (a second extract and branch per lane) to hurt performance.
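
To make the comparison concrete, here is a rough scalar C++ model of the two shapes. It is only a sketch, not what the vectorizer emits: `runFused`, `runSplit`, `Result`, the fixed `VF` of 4 and the `+ 1` between the load and the store are all made up for illustration, and counting branches is a crude proxy that says nothing about real cost on a given target.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Assumed vectorization factor, matching the <4 x i32> in the snippets above.
constexpr std::size_t VF = 4;

struct Result {
  std::array<int, VF> out; // memory contents after the predicated stores
  unsigned branches;       // per-lane conditional branches executed
};

// Shape 1: a single predicated block per lane performs the load and the store.
inline Result runFused(const std::array<int, VF> &src,
                       const std::array<bool, VF> &mask) {
  Result r{};
  for (std::size_t lane = 0; lane < VF; ++lane) {
    ++r.branches;            // br i1 %laneN, %blockN.if, %blockN.continue
    if (mask[lane]) {
      int v = src[lane];     // .. do load ..
      r.out[lane] = v + 1;   // .. do store .. (the "+ 1" is a placeholder op)
    }
  }
  return r;
}

// Shape 2: loads and stores live in separate predicated regions, so the
// same mask lane is extracted and branched on twice.
inline Result runSplit(const std::array<int, VF> &src,
                       const std::array<bool, VF> &mask) {
  Result r{};
  std::array<int, VF> loaded{};
  for (std::size_t lane = 0; lane < VF; ++lane) {
    ++r.branches;            // branch guarding the load
    if (mask[lane])
      loaded[lane] = src[lane];
  }
  for (std::size_t lane = 0; lane < VF; ++lane) {
    ++r.branches;            // second branch on the same lane, guarding the store
    if (mask[lane])
      r.out[lane] = loaded[lane] + 1;
  }
  return r;
}
```

In this model both functions leave the same values in `out` for any mask, but `runSplit` executes `2 * VF` conditional branches per vector iteration where `runFused` executes `VF`, which is the overhead I'm worried about.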

https://github.com/llvm/llvm-project/pull/109289