[PATCH] D150851: [LoopVectorize] Vectorize select-cmp reduction pattern for increasing integer induction variable

Tue Jul 25 01:07:57 PDT 2023

Mel-Chen added a comment.

In D150851#4528089 <https://reviews.llvm.org/D150851#4528089>, @artagnon wrote:

> In D150851#4523028 <https://reviews.llvm.org/D150851#4523028>, @artagnon wrote:
>
>> May I suggest using the logic outlined in `isInductionMinMaxPattern` in D152693 <https://reviews.llvm.org/D152693>?
>
> It's not that simple, as you'd lose the information about `hasNoSignedWrap` from the `AR` of the `trunc`. I've also been playing with attempting to extend this patch to the decreasing IV case, and it seems we've painted ourselves into a corner by being more concerned about the codegen performance than the generality of the patch: the constant IV start is a serious limitation when looking at the decreasing IV case. I think other reviewers like @fhahn and @Ayal have also expressed interest in greater generality (perhaps at the cost of codegen performance).

Yes, this issue is known, which is why I left a TODO in the summary.
I made the following modifications before:

     auto IsIncreasingLoopInduction = [&SE, &Loop](Value *V) {
  +    // FIXME: Should focus on SExt and Trunc only?
  +    if (auto *Cast = dyn_cast<CastInst>(V))
  +      V = Cast->getOperand(0);
  +
       auto *Phi = dyn_cast<PHINode>(V);

However, this approach is not rigorous. 
Regardless of whether the `nsw` flag can still be used to determine whether to use signed max or unsigned max, proving whether the truncated induction variable is monotonically increasing is already a challenge.
Consider the following scenario:

  previous step: i64:0000000000000000000000000000000001111111111111111111111111111111 -> i32:01111111111111111111111111111111
  add one
  current step: i64:0000000000000000000000000000000010000000000000000000000000000000 -> i32:10000000000000000000000000000000 

I have discussed this issue with my colleagues several times, and here are some directions we came up with:

1. Use SCEV overflow check. There is an overflow check generator in PSE that we can use, but it seems to be designed for checking the type of the original induction variable, not the truncated type. Using this approach would require implementing a new check generator and inserting it in the preheader.
2. Prevent the generation of truncated induction variables. In the IndVarSimplifyPass, there is an option called `-indvars-widen-indvars` https://llvm.org/doxygen/IndVarSimplify_8cpp.html#aea2c111bf1f82fd672acaad1931e7e2d. This transformation widens the type of the induction variable to reduce the number of sign/zero-extends. However, if there are users in the loop that require the narrower original type, truncation will occur. Perhaps we can make some adjustments in this area.

  Before IndVarSimplifyPass:
  ; Preheader:
    for.body.preheader:                               ; preds = %entry
      br label %for.body

    ; Loop:
    for.body:                                         ; preds = %for.body.preheader, %for.body
      %i.011 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
      %idx.010 = phi i32 [ %cond, %for.body ], [ %ii, %for.body.preheader ]
      %idxprom = zext i32 %i.011 to i64
      %arrayidx = getelementptr inbounds i64, ptr %a, i64 %idxprom
      %0 = load i64, ptr %arrayidx, align 8, !tbaa !4
      %arrayidx2 = getelementptr inbounds i64, ptr %b, i64 %idxprom
      %1 = load i64, ptr %arrayidx2, align 8, !tbaa !4
      %cmp3 = icmp sgt i64 %0, %1
      %cond = select i1 %cmp3, i32 %i.011, i32 %idx.010
      %inc = add nuw nsw i32 %i.011, 1 
      %cmp = icmp slt i32 %inc, %n
      br i1 %cmp, label %for.body, label %for.cond.cleanup.loopexit, !llvm.loop !8

    ; Exit blocks
    for.cond.cleanup.loopexit:                        ; preds = %for.body
      %cond.lcssa = phi i32 [ %cond, %for.body ]
      br label %for.cond.cleanup

  After IndVarSimplifyPass:
  ; Preheader:
    for.body.preheader:                               ; preds = %entry
      %wide.trip.count = zext i32 %n to i64
      br label %for.body

    ; Loop:
    for.body:                                         ; preds = %for.body.preheader, %for.body
      %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
      %idx.010 = phi i32 [ %cond, %for.body ], [ %ii, %for.body.preheader ]
      %arrayidx = getelementptr inbounds i64, ptr %a, i64 %indvars.iv
      %0 = load i64, ptr %arrayidx, align 8, !tbaa !4
      %arrayidx2 = getelementptr inbounds i64, ptr %b, i64 %indvars.iv
      %1 = load i64, ptr %arrayidx2, align 8, !tbaa !4
      %cmp3 = icmp sgt i64 %0, %1
      %2 = trunc i64 %indvars.iv to i32
      %cond = select i1 %cmp3, i32 %2, i32 %idx.010
      %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
      %exitcond = icmp ne i64 %indvars.iv.next, %wide.trip.count
      br i1 %exitcond, label %for.body, label %for.cond.cleanup.loopexit, !llvm.loop !8

    ; Exit blocks
    for.cond.cleanup.loopexit:                        ; preds = %for.body
      %cond.lcssa = phi i32 [ %cond, %for.body ]
      br label %for.cond.cleanup

1. Only target induction variables that determine the branch condition in the loop latch. With this approach, we can indirectly determine whether the induction variable is signed using loop guards, and directly perform vectorization according to the signed overflow is undefined behavior.

  Signed IV loop guard:
  entry:
    %cmp9 = icmp sgt i32 %n, 0
    br i1 %cmp9, label %for.body.preheader, label %for.cond.cleanup

  Unsigned IV loop guard:
  entry:
    %cmp9.not = icmp eq i32 %n, 0
    br i1 %cmp9.not, label %for.cond.cleanup, label %for.body.preheader

Currently, I personally prefer option 2, but since the performance impact of this modification is uncertain, our final decision is to proceed with option 3.

And of course, the worst-case would be to use a general conditional selecting reduction method to solve the problem.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150851/new/

https://reviews.llvm.org/D150851