[llvm] [VPlan] Enable vectorization of early-exit loops with unit-stride fault-only-first loads (PR #151300)

Shih-Po Hung via llvm-commits llvm-commits at lists.llvm.org
Tue Dec 16 02:28:15 PST 2025


arcbbb wrote:

> Trying this PR out on some llvm-test-suite benchmarks shows that the generated code always seems to generate LMUL 8 step vectors which is probably not great for performance:
> 
> So I don't think there's really much reason why we would want to emit non-tail folded early-exit loops if we can tail fold them eventually.
> 

If performance is a concern, I’m considering replacing the active‑lane‑mask + reduce_or sequence with a vfirst intrinsic + icmp in CodeGenPrepare. The idea is to rewrite:
```
  %14 = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %11)
  %15 = select <vscale x 16 x i1> %14, <vscale x 16 x i1> %13, <vscale x 16 x i1> zeroinitializer
  %16 = freeze <vscale x 16 x i1> %15
  %17 = call i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1> %16)
```
into:
```
  %tmp0 = call i64 @llvm.riscv.vfirst(<vscale x 16 x i1> %13, i64 %11)
  %tmp1 = icmp sge i64 %tmp0, 0
```
Would this be preferable?
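
Roughly, the match-and-rewrite could look something like the sketch below (untested; the helper name, the overloaded-type order passed to `CreateIntrinsic` for the vfirst declaration, and the XLEN=64 assumption are mine for illustration, not part of this patch):
```
// Untested sketch: match
//   reduce.or(freeze(select(get.active.lane.mask(0, EVL), Cond, zeroinitializer)))
// and rewrite it to riscv.vfirst + icmp in a RISC-V CodeGenPrepare-style pass.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/IntrinsicsRISCV.h"
#include "llvm/IR/PatternMatch.h"

using namespace llvm;
using namespace llvm::PatternMatch;

static bool rewriteAnyOfToVFirst(IntrinsicInst *Reduce) {
  Value *Cond = nullptr, *EVL = nullptr;
  // i1 reduce.or(freeze(select(get.active.lane.mask(0, EVL), Cond, 0)))
  if (!match(Reduce,
             m_Intrinsic<Intrinsic::vector_reduce_or>(m_Freeze(m_Select(
                 m_Intrinsic<Intrinsic::get_active_lane_mask>(m_Zero(),
                                                              m_Value(EVL)),
                 m_Value(Cond), m_Zero())))))
    return false;

  IRBuilder<> Builder(Reduce);
  // vfirst returns the index of the first set lane within the first EVL lanes
  // of Cond, or -1 if none is set (assuming XLEN = 64 here).
  Type *XLenTy = Builder.getInt64Ty();
  Value *First = Builder.CreateIntrinsic(
      Intrinsic::riscv_vfirst, {XLenTy, Cond->getType()},
      {Cond, Builder.CreateZExtOrTrunc(EVL, XLenTy)});
  // Some lane set within EVL  <=>  vfirst result >= 0.
  Value *AnyOf = Builder.CreateICmpSGE(First, ConstantInt::get(XLenTy, 0));
  Reduce->replaceAllUsesWith(AnyOf);
  Reduce->eraseFromParent();
  return true;
}
```
This keeps the vectorized IR target-independent and only strength-reduces the any-of reduction late, once we know vfirst is available.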

> I understand that this is supposed to be an incremental PR, but I think maybe a better ordering might be to start by supporting early exit loops with tail folding. I think this means we need to address the "variable header mask" TODO here:
> 
> I think we can do this if we replace the notion of a header mask with the notion of a per-block header mask. I'll see if I can create an issue to discuss some of the design of this more.

Thanks for flagging tail folding for early-exit loops. That would also be nice to have, and I am keen to see it,
but I think we can tackle it separately. Hopefully it doesn’t block this PR!

https://github.com/llvm/llvm-project/pull/151300
