[llvm] [LV] Add initial legality checks for ee loops with stores (PR #145663)

Mon Aug 11 04:10:32 PDT 2025

================
@@ -530,20 +538,89 @@ class LoopVectorizationLegality {
   /// Returns true if this is an early exit loop that can be vectorized.
   /// Currently, a loop with an uncountable early exit is considered
   /// vectorizable if:
-  ///   1. There are no writes to memory in the loop.
+  ///   1. Writes to memory do not form a dependence with any load used as
+  ///      part of the uncounted exit condition.
   ///   2. The loop has only one early uncountable exit
   ///   3. The early exit block dominates the latch block.
   ///   4. The latch block has an exact exit count.
   ///   5. The loop does not contain reductions or recurrences.
   ///   6. We can prove at compile-time that loops will not contain faulting
-  ///   loads.
+  ///      loads, or that any faulting loads would also occur in a purely
+  ///      scalar loop.
   ///   7. It is safe to speculatively execute instructions such as divide or
-  ///   call instructions.
+  ///      call instructions.
   /// The list above is not based on theoretical limitations of vectorization,
   /// but simply a statement that more work is needed to support these
   /// additional cases safely.
   bool isVectorizableEarlyExitLoop();
 
+  /// When vectorizing an early exit loop containing side effects, we need to
+  /// determine whether an uncounted exit will be taken before any operation
+  /// that has side effects.
+  ///
+  /// Consider a loop like the following:
+  /// for (int i = 0; i < N; ++i) {
+  ///   a[i] = b[i];
+  ///   if (c[i] == 0)
+  ///     break;
+  /// }
+  ///
+  /// We have both a load and a store operation occurring before the condition
+  /// is checked for early termination. We could potentially restrict
+  /// vectorization to cases where we know all addresses are guaranteed to be
+  /// dereferenceable, which would allow the load before the condition check to
+  /// be vectorized.
+  ///
+  /// The store, however, should not execute across all lanes if early
+  /// termination occurs before the end of the vector. We must only store to the
+  /// locations that would have been stored to by a scalar loop. So we need to
+  /// know what the result of 'c[i] == 0' is before performing the vector store,
+  /// with or without masking.
+  ///
+  /// We can either do this by moving the condition load to the top of the
+  /// vector body and using the comparison to create masks for other operations
+  /// in the loop, or by looking ahead one vector iteration and bailing out to
+  /// the scalar loop if an exit would occur.
+  ///
+  /// Using the latter approach (applicable to more targets), we need to hoist
+  /// the first load (of c[0]) out of the loop then rotate the load within the
+  /// loop to the next iteration, remembering to adjust the vector trip count.
+  /// Something like the following:
+  ///
+  /// vec.ph:
+  ///   %ci.0 = load <4 x i32>, ptr %c
+  ///   %cmp.0 = icmp eq <4 x i32> %ci.0, zeroinitializer
+  ///   %any.of.0 = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> %cmp.0)
+  ///   br i1 %any.of.0, label %scalar.ph, label %vec.body
+  /// vec.body:
+  ///   %iv = phi...
+  ///   phi for c[i] if used elsewhere in the loop...
+  ///   other operations in the loop...
+  ///   %iv.next = add i64 %iv, 4
+  ///   %addr.next = getelementptr i32, ptr %c, i64 %iv.next
+  ///   %ci.next = load <4 x i32>, ptr %addr.next
+  ///   %cmp.next = icmp eq <4 x i32> %ci.next, zeroinitializer
+  ///   %any.of.next = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> %cmp.next)
+  ///   iv.next compared with shortened vector tripcount...
+  ///   uncounted condition combined with counted condition...
+  ///   br...
+  ///
+  /// Doing this means the last few iterations will always be performed by a
+  /// scalar loop regardless of which exit is taken, and so vector iterations
+  /// will never execute a memory operation to a location that the scalar loop
+  /// would not have.
+  ///
+  /// This means we must ensure that it is safe to move the load for 'c[i]'
+  /// before other memory operations (or any other observable side effects) in
+  /// the loop.
----------------
david-arm wrote:

This comment looks really helpful, thanks!

Is it worth adding as well that `c[i]` is not permitted to have more than one use for now, because an outside use complicates things since you now have to use `phi for c[i]` instead of `c[i]` itself.

https://github.com/llvm/llvm-project/pull/145663