[llvm-dev] [RFC] New pass: LoopExitValues

Tue Sep 15 12:27:44 PDT 2015

On Mon, Sep 14, 2015 at 2:10 PM, Wei Mi <wmi at google.com> wrote:
>
> I tried the patch on the matrix_mul testcase and saw where the
> redundency was removed. My patch in http://reviews.llvm.org/D12090
> cannot handle the case. I feel like it is a different problem.
>
> void matrix_mul(unsigned int Size, unsigned int *Dst, unsigned int
> *Src, unsigned int Val) {
>   for (int Outer = 0; Outer < Size; ++Outer)
>     for (int Inner = 0; Inner < Size; ++Inner)
>        Dst[Outer * Size + Inner] = Src[Outer * Size + Inner] * Val;
> }
>
> "Outer * Size + Inner" is an induction variable in inner loop. Its
> SCEV is {{0,+,%Size}<outerloop>,+,1}<innerloop>, so the initial value
> of the induction variable for inner loop is {0, +, %Size}<outerloop>,
> .i.e, an "lsr.iv.next = lsr.iv + %Size" will be generated at the end
> of every iteration of outerloop to prepare the initial value of the
> induction variable in the next outerloop iteration.
>
> However, the iteration number of inner loop happens to be "Size" too,
> so the exit value of "Outer * Size + Inner" in inner loop happens to
> be the same with the initial value of "Outer * Size + Inner" in the
> next iteration of outerloop. So the exit value of the induction
> variable can be reused and "lsr.iv.next = lsr.iv + %Size" which is
> used to compute the initial value of "Outer * Size + Inner" can be
> omitted.
>

Thanks for experimenting with the patch!  If I understand correctly,
you show that the redundancy is inherent in the loop and not a
byproduct of other passes.  Following this insight, changing just one
line of the IR produces assembly *identical* to the effect of the new
LEV pass.  Specifically, in the outer loop, the multiply can be
replaced with a PHI node that recycles the %add value coming from the
inner loop.

Should the front end be responsible for this sort of optimization?

Original:
  %mul = mul i32 %Outer.026, %Size

Replacing with this line has same effect as LEV pass:
  %mul = phi i32 [ %add, %for.cond.cleanup.3 ], [ 0, %entry ]

The original IR for the matrix_mul.c test case:

define void @matrix_mul(i32 %Size, i32* nocapture %Dst, i32* nocapture
readonly %Src, i32 %Val) #0 {
entry:
  %cmp.25 = icmp eq i32 %Size, 0
  br i1 %cmp.25, label %for.cond.cleanup, label %for.body.4.lr.ph

for.body.4.lr.ph:                                 ; preds = %entry,
%for.cond.cleanup.3
  %Outer.026 = phi i32 [ %inc10, %for.cond.cleanup.3 ], [ 0, %entry ]
;****************
  %mul = mul i32 %Outer.026, %Size
;****************
  br label %for.body.4

for.cond.cleanup:                                 ; preds =
%for.cond.cleanup.3, %entry
  ret void

for.cond.cleanup.3:                               ; preds = %for.body.4
  %inc10 = add nuw nsw i32 %Outer.026, 1
  %exitcond28 = icmp eq i32 %inc10, %Size
  br i1 %exitcond28, label %for.cond.cleanup, label %for.body.4.lr.ph

for.body.4:                                       ; preds =
%for.body.4, %for.body.4.lr.ph
  %Inner.024 = phi i32 [ 0, %for.body.4.lr.ph ], [ %inc, %for.body.4 ]
  %add = add i32 %Inner.024, %mul
  %arrayidx = getelementptr inbounds i32, i32* %Src, i32 %add
  %0 = load i32, i32* %arrayidx, align 4, !tbaa !1
  %mul5 = mul i32 %0, %Val
  %arrayidx8 = getelementptr inbounds i32, i32* %Dst, i32 %add
  store i32 %mul5, i32* %arrayidx8, align 4, !tbaa !1
  %inc = add nuw nsw i32 %Inner.024, 1
  %exitcond = icmp eq i32 %inc, %Size
  br i1 %exitcond, label %for.cond.cleanup.3, label %for.body.4
}