[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines

Wed Feb 1 10:23:06 PST 2017

Hi,

It looks I have found what causes the insertion of the instructions.

*** IR in rL292487 before Loop Vectorize ***:

if.then:                                          ; preds = %for.body5
  %add = shl nsw i64 %i.119, 1
  %cmp917 = icmp slt i64 %add, 8193
  br i1 %cmp917, label %for.body10.preheader, label %for.end14

for.body10.preheader:                             ; preds = %if.then
  br label %for.body10

for.body10:                                       ; preds = %for.body10.preheader, %for.body10
  %k.018 = phi i64 [ %add13, %for.body10 ], [ %add, %for.body10.preheader ]
  %arrayidx11 = getelementptr inbounds [8193 x i8], [8193 x i8]* @main.flags, i64 0, i64 %k.018
  store i8 0, i8* %arrayidx11, align 1, !tbaa !5
  %add13 = add nuw nsw i64 %k.018, %i.119
  %cmp9 = icmp slt i64 %add13, 8193
  br i1 %cmp9, label %for.body10, label %for.end14.loopexit

for.end14.loopexit:                               ; preds = %for.body10
  br label %for.end14

for.end14:                                        ; preds = %for.end14.loopexit, %if.then
  %inc15 = add nsw i32 %count.122, 1
  br label %for.inc16

*** IR in rL292492 before Loop Vectorize ***:

if.then:                                          ; preds = %for.body5
  %cmp917 = icmp slt i64 %i.119, 4097
  br i1 %cmp917, label %for.body10.preheader, label %for.end14

for.body10.preheader:                             ; preds = %if.then
  %add = shl nsw i64 %i.119, 1
  br label %for.body10

for.body10:                                       ; preds = %for.body10.preheader, %for.body10
  %k.018 = phi i64 [ %add13, %for.body10 ], [ %add, %for.body10.preheader ]
  %arrayidx11 = getelementptr inbounds [8193 x i8], [8193 x i8]* @main.flags, i64 0, i64 %k.018
  store i8 0, i8* %arrayidx11, align 1, !tbaa !5
  %add13 = add nuw nsw i64 %k.018, %i.119
  %cmp9 = icmp slt i64 %add13, 8193
  br i1 %cmp9, label %for.body10, label %for.end14.loopexit

for.end14.loopexit:                               ; preds = %for.body10
  br label %for.end14

for.end14:                                        ; preds = %for.end14.loopexit, %if.then
  %inc15 = add nsw i32 %count.122, 1
  br label %for.inc16

****************************************

‘%cmp917 = icmp slt i64 %add, 8193’ is changed to ‘%cmp917 = icmp slt i64 %i.119, 4097’ in ‘if.then’ basic block.
In ScalarEvolution.cpp there is the following code which checks if a loop is guarded by a basic block (in our case the ‘if.then’ basic block):

{code}
ScalarEvolution::ExitLimit
ScalarEvolution::howManyLessThans(const SCEV *LHS, const SCEV *RHS,
                                  const Loop *L, bool IsSigned,
                                  bool ControlsExit, bool AllowPredicates) {
…
  // If the loop entry is guarded by the result of the backedge test of the
  // first loop iteration, then we know the backedge will be taken at least
  // once and so the backedge taken count is as above. If not then we use the
  // expression (max(End,Start)-Start)/Stride to describe the backedge count,
  // as if the backedge is taken at least once max(End,Start) is End and so the
  // result is as above, and if not max(End,Start) is Start so we get a backedge
  // count of zero.
  const SCEV *BECount;
  if (isLoopEntryGuardedByCond(L, Cond, getMinusSCEV(Start, Stride), RHS))
    BECount = BECountIfBackedgeTaken;
  else {
    End = IsSigned ? getSMaxExpr(RHS, Start) : getUMaxExpr(RHS, Start);
    BECount = computeBECount(getMinusSCEV(End, Start), Stride, false);
  }
{code}

As ICMP instructions in the ‘if.then’ basic block and in the ‘for.body10’ basic block use different constants (the constants are the same 8193 in rL292487) ScalarEvolution thinks the ‘if.then’ basic block is not a guardian. This causes the insertion of SMaxExpr.
In fact the execution semantic is not changed. So we need to update ScalarEvolution to support such cases.

Kind regards,
Evgeny Astigeevich
Senior Compiler Engineer
Compilation Tools
ARM

From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Evgeny Astigeevich via llvm-dev
Sent: Wednesday, January 25, 2017 5:53 PM
To: Michael Kuperstein
Cc: llvm-dev; nd
Subject: Re: [llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines

Hi Michael,

Thank you for information. You’ve gave me ideas what to look at.
It looks like SCEV might reconstruct things InstCombine have optimized out to create runtime checks.

Thanks,
Evgeny

From: Michael Kuperstein [mailto:mkuper at google.com]
Sent: Tuesday, January 24, 2017 9:46 PM
To: Sanjay Patel
Cc: Evgeny Astigeevich; llvm-dev; nd
Subject: Re: [llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines

On Tue, Jan 24, 2017 at 1:20 PM, Sanjay Patel via llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:

I started looking at the log files that you attached, and I'm confused. The code that is supposedly causing the perf regression is created by the loop vectorizer, right? Except the bad code is not in the "vector.body", so is there something peculiar about this benchmark that the hot loop is not the vector loop? But there's another mystery: there are no vector ops in the "vector.body"!

I haven't looked at this particular example, but this isn't very rare.

1) The vectorizer actually doubles as an unroller, under some circumstances. So it's conceivable to have a vector.body without vector instructions.

2) From what was pasted above, the "bad" code is in the runtime checks that LV generates.
The general problem is that if we don't know what the loop count of a loop is going to be, we assume it's high, so the cost of the runtime checks is negligible. This fails if we have a loop with a low (but statically unknown) trip-count nested inside a hot loop with a high trip-count. In this case, the overhead from the runtime checks, that end up being inside the outer loop, may cost more than what we gain from vectorization.
To try to avoid this, we have a threshold that prevents us from generating too costly runtime checks, but I guess this particular check is not considered complex enough.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170201/d87878c5/attachment.html>