[llvm] [VPlan] Enable vectorization of early-exit loops with unit-stride fault-only-first loads (PR #151300)

Fri Dec 12 01:53:46 PST 2025

https://github.com/lukel97 commented:

I don't think this is going to play well with EVL tail folding, because we'll now have two different transforms trying to convert the plan to variable stepping: convertFFLoadEarlyExitToVLStepping and transformRecipestoEVLRecipes.

At a high level I wonder if we even want to support vp.load.ff without EVL tail folding to begin with. From what I understand, this PR is kind of reimplementing a weaker version of EVL tail folding, since variable stepping is a hard requirement of the vp.load.ff intrinsic that we can't avoid: the intrinsic can reduce the number of lanes read for any reason.
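To make the variable-stepping requirement concrete, here is a minimal scalar sketch of an early-exit search built on a fault-only-first style load. It is not code from the PR: `loadFF`, `findFirst`, and the `cap` parameter are hypothetical names, and `cap` only stands in for whatever makes the load return fewer lanes than requested.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of a fault-only-first load. It may return fewer
// than `avl` elements, but always at least one when avl > 0. `cap` is an
// assumption that stands in for the truncation cause (e.g. a page boundary).
static std::size_t loadFF(const std::uint8_t *src, std::size_t avl,
                          std::size_t cap, std::uint8_t *dst) {
  std::size_t vl = std::min(avl, std::max<std::size_t>(cap, 1));
  std::copy(src, src + vl, dst);
  return vl;
}

// Early-exit search for `needle` in buf. Returns buf.size() if not found.
// Requires vf > 0. The induction variable advances by the vl the load
// actually produced, not by a fixed VF.
static std::size_t findFirst(const std::vector<std::uint8_t> &buf,
                             std::uint8_t needle, std::size_t vf,
                             std::size_t cap) {
  std::vector<std::uint8_t> lanes(vf);
  const std::size_t n = buf.size();
  for (std::size_t i = 0; i < n;) {
    std::size_t avl = std::min(vf, n - i); // requested lanes this iteration
    std::size_t vl = loadFF(buf.data() + i, avl, cap, lanes.data());
    for (std::size_t l = 0; l < vl; ++l)   // only lanes below vl are valid
      if (lanes[l] == needle)
        return i + l;                      // early exit
    i += vl;                               // variable step
  }
  return n;
}
```

Because the step is the returned `vl` rather than VF, the loop already has the shape that EVL tail folding produces, which is why the two transforms would overlap.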

Trying this PR out on some llvm-test-suite benchmarks shows that the generated code always seems to contain LMUL 8 step vectors, which is probably not great for performance:

```diff
+.LBB1197_155:                           #   Parent Loop BB1197_123 Depth=1
+                                        # =>  This Inner Loop Header: Depth=2
+       sub     a2, a0, a1
+       addi    a3, sp, 1280
+       add     a3, a3, a1
+       minu    a2, s11, a2
+       vsetvli zero, a2, e8, m2, ta, ma
+       vle8ff.v        v16, (a3)
+       csrr    a2, vl
+       csrr    a3, vlenb
+       vsetvli a4, zero, e64, m8, ta, ma
+       vid.v   v8
+       vmv.v.v v24, v8
+       vadd.vx v8, v8, a3
+       zext.w  a4, a2
+       vmsltu.vx       v18, v8, a4
+       vmsltu.vx       v8, v24, a4
+       srli    a4, a3, 2
+       vsetvli a5, zero, e8, m2, ta, ma
+       vmseq.vi        v9, v16, 0
+       srli    a3, a3, 3
+       vsetvli zero, a4, e8, mf4, ta, ma
+       vslideup.vx     v8, v18, a3
+       vsetvli a3, zero, e8, m2, ta, ma
+       vmand.mm        v8, v8, v9
+       vcpop.m a3, v8
+       bnez    a3, .LBB1197_157
+# %bb.156:                              #   in Loop: Header=BB1197_155 Depth=2
+       add.uw  a1, a2, a1
+       bne     a1, a0, .LBB1197_155
+.LBB1197_157:                           #   in Loop: Header=BB1197_123 Depth=1
+       snez    a1, a3
+       beqz    a1, .LBB1197_159
+.LBB1197_158:                           #   in Loop: Header=BB1197_123 Depth=1
```
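Read lane by lane, the loop body above appears to compute the following. This is a scalar sketch of one reading of the assembly, not the vectorizer's actual output, and `countActiveZeroLanes` is a made-up name for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts lanes that are both below the vl returned by the fault-only-first
// load and equal to zero. The 64-bit lane index mirrors the e64 step vector
// in the assembly; the data lanes themselves are 8-bit.
static std::size_t countActiveZeroLanes(const std::vector<std::uint8_t> &lanes,
                                        std::uint64_t vl) {
  std::size_t count = 0;
  for (std::uint64_t idx = 0; idx < lanes.size(); ++idx) {
    bool active = idx < vl;          // vid.v + vmsltu.vx
    bool isZero = lanes[idx] == 0;   // vmseq.vi
    if (active && isZero)            // vmand.mm
      ++count;                       // vcpop.m
  }
  return count;
}
```

The 64-bit index per 8-bit data lane is where the LMUL 8 step vectors come from; a loop that sets the vector length directly would not need to materialize the `idx < vl` mask at all.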

So I don't think there's really much reason to emit non-tail-folded early-exit loops if we can tail fold them eventually.

I understand that this is supposed to be an incremental PR, but I think a better ordering might be to start by supporting early-exit loops with tail folding. I think this means we need to address the "variable header mask" TODO here:

```c++
bool LoopVectorizationLegality::canFoldTailByMasking() const {
  // The only loops we can vectorize without a scalar epilogue, are loops with
  // a bottom-test and a single exiting block. We'd have to handle the fact
  // that not every instruction executes on the last iteration.  This will
  // require a lane mask which varies through the vector loop body.  (TODO)
  if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) {
    LLVM_DEBUG(
        dbgs()
        << "LV: Cannot fold tail by masking. Requires a singe latch exit\n");
    return false;
  }

```

I think we can do this if we replace the notion of a header mask with the notion of a per-block header mask. I'll see if I can create an issue to discuss the design in more detail.

https://github.com/llvm/llvm-project/pull/151300