[PATCH] D47816: [InstCombine] Stop sinking instructions across function call.

Renlin Li via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Jun 19 09:08:29 PDT 2018


renlin added a comment.

In https://reviews.llvm.org/D47816#1136439, @lebedev.ri wrote:

> In https://reviews.llvm.org/D47816#1136433, @renlin wrote:
>
> > In https://reviews.llvm.org/D47816#1136432, @lebedev.ri wrote:
> >
> > > In https://reviews.llvm.org/D47816#1136426, @renlin wrote:
> > >
> > > > In https://reviews.llvm.org/D47816#1136346, @lebedev.ri wrote:
> > > >
> > > > > I'm seeing `what`, but not `why`.
> > > > >  What is the motivation behind this change?
> > > > >  What problem is it trying to solve?
> > > >
> > > >
> > > > In one of the test cases I have, sinking a load instruction (together with the operands it uses) across an inlined function increases the number of instructions generated.
> > >
> > >
> > > Instructions as in IR instructions? Or the instructions in the final assembly?
> >
> >
> > Final machine assembly.
>
>
> Perhaps you can add an example test?
>  There seems to be some precedent for that:
>
>   llvm/test/CodeGen/X86$ grep -r instcombine | grep RUN
>   2009-03-23-i80-fp80.ll:; RUN: opt < %s -instcombine -S | grep 302245289961712575840256
>   2009-03-23-i80-fp80.ll:; RUN: opt < %s -instcombine -S | grep K40018000000000000000
>   vec_udiv_to_shift.ll:; RUN: opt < %s -instcombine -S | FileCheck %s
>   vec_ins_extract.ll:; RUN: opt < %s -sroa -instcombine | \
>   no-plt-libcalls.ll:; RUN: opt < %s -instcombine -S | FileCheck %s
>
>
> But I must say, right now this sounds like the problem is elsewhere,
>  and this change only papers over it by pessimizing all other cases.


Given the following test case (similar to the one in my initial patch):

  define i64 @inline_func(i64* %in, i32 %radius) readonly noinline {
  ;;define i64 @inline_func(i64* %in, i32 %radius) readonly alwaysinline {
  entry:
    %cmp12 = icmp sgt i32 %radius, 0
    br i1 %cmp12, label %for.body.preheader, label %for.cond.cleanup
  
  for.body.preheader:
    %wide.trip.count = zext i32 %radius to i64
    br label %for.body
  
  for.cond.cleanup:
    %max_val.0.lcssa = phi i64 [ 0, %entry ], [ %max_val.0., %for.body ]
    ret i64 %max_val.0.lcssa
  
  for.body:
    %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
    %max_val.013 = phi i64 [ 0, %for.body.preheader ], [ %max_val.0., %for.body ]
    %arrayidx = getelementptr inbounds i64, i64* %in, i64 %indvars.iv
    %0 = load i64, i64* %arrayidx, align 4
    %cmp1 = icmp eq i64 %max_val.013, %0
    %max_val.0. = select i1 %cmp1, i64 %max_val.013, i64 %0
    %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
    %exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
    br i1 %exitcond, label %for.cond.cleanup, label %for.body
  }
  
  define void @test1(i64* nocapture %out, i64* %in, i32 %w, i32 %n)  {
  entry:
    %idxprom = sext i32 %w to i64
    %arrayidx = getelementptr inbounds i64, i64* %in, i64 %idxprom
    %0 = load i64, i64* %arrayidx, align 4
    %call = tail call i64 @inline_func(i64* %in, i32 %n)
    %cmp = icmp eq i64 %call, -1
    br i1 %cmp, label %if.then, label %if.end
  
  if.then: 
    %cmp1 = icmp eq i64 %0, %call
    %conv = sext i1 %cmp1 to i64
    store i64 %conv, i64* %out, align 4
    br label %if.end
  
  if.end: 
    ret void
  }

Compiled with Clang at -O2 for AArch64, without the change here there are more core-register saves/restores in the stack frame and one extra register move.
The base pointer and the offset have to be kept in callee-saved registers so that their values are preserved across the function call, and register moves are needed to copy them from the argument registers into those callee-saved registers.
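
To make this concrete, this is roughly what @test1 looks like after InstCombine sinks the load (together with the sext and getelementptr it depends on) into %if.then -- a sketch of the sunk form, not exact pass output:

  define void @test1(i64* nocapture %out, i64* %in, i32 %w, i32 %n) {
  entry:
    %call = tail call i64 @inline_func(i64* %in, i32 %n)
    %cmp = icmp eq i64 %call, -1
    br i1 %cmp, label %if.then, label %if.end

  if.then:
    ; the address computation and the load now sit past the call
    %idxprom = sext i32 %w to i64
    %arrayidx = getelementptr inbounds i64, i64* %in, i64 %idxprom
    %0 = load i64, i64* %arrayidx, align 4
    %cmp1 = icmp eq i64 %0, %call
    %conv = sext i1 %cmp1 to i64
    store i64 %conv, i64* %out, align 4
    br label %if.end

  if.end:
    ret void
  }

In this sunk form both %in and %w are live across the call, so each needs a callee-saved register; in the unsunk form only the loaded value %0 is live across the call.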

With the change, only the register holding the loaded value needs to be saved in the prologue.
Even when the function is marked //alwaysinline//, one extra register move is generated to hold the pointer //in//, because the inlined code from inline_func uses and modifies that pointer.

I didn't add this as a code-generation test because checking the exact code-gen output would likely be too fragile.
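
Just to illustrate what an IR-level (rather than codegen) check could look like -- the RUN line and CHECK directives below are only a hypothetical sketch, not part of the patch:

  ; RUN: opt < %s -instcombine -S | FileCheck %s
  ;
  ; With this patch, the load should stay in %entry instead of being
  ; sunk past the call into %if.then.
  ; CHECK-LABEL: define void @test1(
  ; CHECK:       entry:
  ; CHECK:       load i64
  ; CHECK:       tail call i64 @inline_func(
  ; CHECK:       if.then:
  ; CHECK-NOT:   load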

I agree that other optimization passes are involved.
I might have missed something, but I didn't see any other pass that is obviously doing the wrong thing or failing to optimize this particular case.


https://reviews.llvm.org/D47816




