[llvm] [VPlan] Enable vectorization of early-exit loops with unit-stride fault-only-first loads (PR #151300)
Luke Lau via llvm-commits
llvm-commits at lists.llvm.org
Fri Dec 12 01:53:46 PST 2025
https://github.com/lukel97 commented:
I don't think this is going to play well with EVL tail folding, because we'll now have two different transforms trying to convert the plan to variable stepping: convertFFLoadEarlyExitToVLStepping and transformRecipestoEVLRecipes.
At a high level, I wonder if we even want to support vp.load.ff without EVL tail folding to begin with. From what I understand, this PR is kind of reimplementing a weaker version of EVL tail folding, since variable stepping is a hard requirement of the vp.load.ff intrinsic that we can't avoid: the intrinsic can reduce the number of lanes read for any reason.
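To make the variable-stepping requirement concrete, here is a minimal scalar sketch of an early-exit search loop built on a fault-only-first load. It is not the PR's implementation: `modelLoadFF`, `findFirstZero`, and the `maxLanes` parameter are hypothetical names, and `maxLanes` just stands in for "any reason" the load might return fewer lanes than requested.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Result of the modelled load: the lanes actually read and how many there are.
struct FFLoadResult {
  std::vector<uint8_t> data;
  std::size_t vl;
};

// Hypothetical model of a fault-only-first load. It is asked for `avl` lanes
// but may return fewer; `maxLanes` is an assumption of this sketch that
// stands in for a page boundary or any other reason to truncate. At least
// one lane is always returned.
FFLoadResult modelLoadFF(const uint8_t *ptr, std::size_t avl,
                         std::size_t maxLanes) {
  assert(avl > 0 && "the model expects at least one requested lane");
  std::size_t vl = std::min(avl, std::max<std::size_t>(maxLanes, 1));
  return {std::vector<uint8_t>(ptr, ptr + vl), vl};
}

// Early-exit loop: find the index of the first zero byte in buf[0, n).
// Returns n if there is none.
std::size_t findFirstZero(const uint8_t *buf, std::size_t n, std::size_t vf,
                          std::size_t maxLanes) {
  std::size_t i = 0;
  while (i < n) {
    // Request at most VF lanes, clamped to the remaining trip count.
    std::size_t avl = std::min(vf, n - i);
    FFLoadResult r = modelLoadFF(buf + i, avl, maxLanes);
    // Only the first r.vl lanes hold valid data, so only they are compared.
    for (std::size_t lane = 0; lane < r.vl; ++lane)
      if (r.data[lane] == 0)
        return i + lane;
    // The induction variable advances by the returned VL, not by VF.
    i += r.vl;
  }
  return n;
}
```

Because the step is only known after each load, the loop has the same shape as an EVL tail-folded loop even when tail folding was not requested.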
Trying this PR out on some llvm-test-suite benchmarks shows that the generated code always seems to contain LMUL 8 step vectors, which is probably not great for performance:
```diff
+.LBB1197_155: # Parent Loop BB1197_123 Depth=1
+ # => This Inner Loop Header: Depth=2
+ sub a2, a0, a1
+ addi a3, sp, 1280
+ add a3, a3, a1
+ minu a2, s11, a2
+ vsetvli zero, a2, e8, m2, ta, ma
+ vle8ff.v v16, (a3)
+ csrr a2, vl
+ csrr a3, vlenb
+ vsetvli a4, zero, e64, m8, ta, ma
+ vid.v v8
+ vmv.v.v v24, v8
+ vadd.vx v8, v8, a3
+ zext.w a4, a2
+ vmsltu.vx v18, v8, a4
+ vmsltu.vx v8, v24, a4
+ srli a4, a3, 2
+ vsetvli a5, zero, e8, m2, ta, ma
+ vmseq.vi v9, v16, 0
+ srli a3, a3, 3
+ vsetvli zero, a4, e8, mf4, ta, ma
+ vslideup.vx v8, v18, a3
+ vsetvli a3, zero, e8, m2, ta, ma
+ vmand.mm v8, v8, v9
+ vcpop.m a3, v8
+ bnez a3, .LBB1197_157
+# %bb.156: # in Loop: Header=BB1197_155 Depth=2
+ add.uw a1, a2, a1
+ bne a1, a0, .LBB1197_155
+.LBB1197_157: # in Loop: Header=BB1197_123 Depth=1
+ snez a1, a3
+ beqz a1, .LBB1197_159
+.LBB1197_158: # in Loop: Header=BB1197_123 Depth=1
```
So I don't think there's really much reason why we would want to emit non-tail folded early-exit loops if we can tail fold them eventually.
I understand that this is supposed to be an incremental PR, but I think a better ordering might be to start by supporting early-exit loops with tail folding. I think this means we need to address the "variable header mask" TODO here:
```c++
bool LoopVectorizationLegality::canFoldTailByMasking() const {
  // The only loops we can vectorize without a scalar epilogue, are loops with
  // a bottom-test and a single exiting block. We'd have to handle the fact
  // that not every instruction executes on the last iteration. This will
  // require a lane mask which varies through the vector loop body. (TODO)
  if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) {
    LLVM_DEBUG(
        dbgs()
        << "LV: Cannot fold tail by masking. Requires a singe latch exit\n");
    return false;
  }
```
I think we can do this if we replace the notion of a header mask with the notion of a per-block header mask. I'll see if I can create an issue to discuss some of the design of this more.
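As a rough illustration only, one possible reading of a per-block mask is sketched below in scalar C++. This is not a proposed design: `headerMask`, `maskAfterExit`, and `copyUntil` are hypothetical names, and the sketch assumes that the block after the early-exit check should run only for active lanes strictly before the first lane that takes the exit.

```cpp
#include <cstddef>
#include <vector>

using Mask = std::vector<bool>;

// Ordinary tail-folding header mask: lane l is active if l < remaining.
Mask headerMask(std::size_t vf, std::size_t remaining) {
  Mask m(vf);
  for (std::size_t l = 0; l < vf; ++l)
    m[l] = l < remaining;
  return m;
}

// Hypothetical mask for the block after the early-exit check: a lane is
// active only if it is active in the header and no active lane at or before
// it satisfied the exit condition.
Mask maskAfterExit(const Mask &header, const Mask &exitCond) {
  Mask m(header.size());
  bool exited = false;
  for (std::size_t l = 0; l < header.size(); ++l) {
    exited = exited || (header[l] && exitCond[l]);
    m[l] = header[l] && !exited;
  }
  return m;
}

// Model loop: copy src into dst until `key` is found. Returns the index of
// `key`, or src.size() if it is absent. dst must be at least as large as src.
std::size_t copyUntil(const std::vector<int> &src, std::vector<int> &dst,
                      int key, std::size_t vf) {
  std::size_t n = src.size();
  for (std::size_t i = 0; i < n; i += vf) {
    Mask hm = headerMask(vf, n - i);
    // Evaluate the exit condition under the header mask.
    Mask exitCond(vf, false);
    for (std::size_t l = 0; l < vf; ++l)
      if (hm[l])
        exitCond[l] = src[i + l] == key;
    // The store after the exit check runs under the narrower block mask.
    Mask bm = maskAfterExit(hm, exitCond);
    for (std::size_t l = 0; l < vf; ++l)
      if (bm[l])
        dst[i + l] = src[i + l];
    // Leave the loop at the first active lane that took the early exit.
    for (std::size_t l = 0; l < vf; ++l)
      if (hm[l] && exitCond[l])
        return i + l;
  }
  return n;
}
```

In this model the header block and the block after the exit check see different masks within the same vector iteration, which is the "lane mask which varies through the vector loop body" that the TODO refers to.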
https://github.com/llvm/llvm-project/pull/151300