[llvm-dev] Missing vectorization of loop due to load late in the loop

Wed May 6 23:03:26 PDT 2020

Hi,

On Wed, 2020-05-06 at 22:35 +0000, Finkel, Hal J. wrote:
> > From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of
> > Mikael Holmén via llvm-dev <llvm-dev at lists.llvm.org>
> > Sent: Tuesday, May 5, 2020 8:25 AM
> > To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
> > Subject: [llvm-dev] Missing vectorization of loop due to load late
> > in the loop
> >  
> > Hi,
> > 
> > TL;DR: A loop doesn't get vectorized due to the interaction of
> > loop-
> > rotate, licm and instcombine. What to do about it?
> > 
> > 
> > Full story:
> > 
> > In the benchmarks for our out-of-tree target we have a case that we
> > would like to get vectorized, but currently it isn't. I've done
> > some
> > digging to see why and have some kind of idea what prevents it, but
> > I
> > don't know what the best way to fix it would be so I thought I'd
> > share
> > a reduced version of it to see if anyone here have ideas.
> > 
> > So, what happens can be reproduced on trunk with
> >  opt -O3 -S -o - bbi-39227-reduced.ll
> > 
> > The input program consists of two nested loops where the inner loop
> > loads a value and does some calculations and then the outer loop
> > writes
> > the calculated value somewhere.
> > 
> > With
> >  -debug-only=loop-vectorize
> > 
> > I see this printout from the vectorizer for the inner loop:
> >  LV: Not vectorizing: Found an unidentified PHI   %h.15 = phi i32 [
> > %h.11, %inner.cond.preheader ], [ %h.1, %inner.body ]
> > 
> > When we run the vectorizer the inner loop looks like
> > 
> > inner.body:                                       ; preds =
> > %inner.cond.preheader, %inner.body
> >   %h.15 = phi i32 [ %h.11, %inner.cond.preheader ], [ %h.1,
> > %inner.body
> > ]
> >   %h.pn4 = phi i32* [ %h, %inner.cond.preheader ], [ %hp.1,
> > %inner.body
> > ]
> >   %j.03 = phi i16 [ 0, %inner.cond.preheader ], [ %j.1, %inner.body
> > ]
> >   %real.02 = phi i32 [ 0, %inner.cond.preheader ], [ %sub,
> > %inner.body
> > ]
> >   %hp.1 = getelementptr inbounds i32, i32* %h.pn4, i64 1
> >   %0 = shl i32 %h.15, 16
> >   %conv7 = ashr exact i32 %0, 16
> >   %add = sub i32 %real.02, %h.15
> >   %sub = add i32 %add, %conv7
> >   %j.1 = add nuw nsw i16 %j.03, 1
> >   %h.1 = load i32, i32* %hp.1, align 1, !tbaa !4
> >   %cmp3 = icmp ult i16 %j.03, 99
> >   br i1 %cmp3, label %inner.body, label %inner.end, !llvm.loop !8
> > 
> > And the vectorizer bails out since the load is placed "late" in the
> > loop, so
> >  RecurrenceDescriptor::isFirstOrderRecurrence
> > returns false.
> > 
> > If we just move the load before the definition of %0 in the
> > vectorizer
> > input, then we instead get
> >  LV: We can vectorize this loop!
> 
> That just seems like a bug. Unless I'm missing something, the load
> depends only on %hp.1, there are no use-def or memory dependencies in
> between the load and the phi. The semantics are exactly the same
> between the two placements of the load instruction. The order of
> independent SSA calculations should not affect whether or not the
> vectorizer is able to vectorize the loop.

Great! That's what I thought too, but as I don't know the vectorizer
I'm not sure what requirements are reasonable on its input.

When the load is placed early, it's this piece of code in
LoopVectorizationLegality::canVectorizeInstrs() that saves us:

    if (RecurrenceDescriptor::isFirstOrderRecurrence(Phi, TheLoop,
                                                     SinkAfter, DT)) {
      AllowedExit.insert(Phi);
      FirstOrderRecurrences.insert(Phi);
      continue;
    }

but when it's placed late it doesn't and we end up with "Found an
unidentified PHI".

The vectorizer behavior can be reproduced with
 opt -loop-vectorize -S -o - bbi-39227-lv.ll

I'll write a PR about this then. And if anyone has ideas about how the
LoopVectorizationLegality checks can be relaxed to allow this case too,
please chime in.

Thanks!
Mikael

> 
>  -Hal
> 
> 
> > Originally the loop had two loads, and then one of them was
> > actually
> > placed early in the loop block. That was done by instcombine, by 
> >  InstCombiner::FoldPHIArgLoadIntoPHI.
> > 
> > The reason this doesn't happen for the load we see in the reduced
> > example is because after loop rotation licm hoists the start value
> > of
> > the PHI, so when instcombine tries to do FoldPHIArgLoadIntoPHI, the
> > start value isn't placed in the direct predecessor block of the
> > PHI,
> > and the folding is aborted.
> > 
> > Therefore I tried to squeeze in an additional run of instcombine
> > before
> > the licm run that does that hoisting, and then instcombine does the
> > folding and the load in the inner loop is done early instead of
> > late in
> > the loop. The vectorizer then is happy and accepts to vectorize the
> > loop. I have no idea if inserting another run of instcombine is a
> > good
> > idea though and I see some mixed results in other benchmarks.
> > 
> > So, the question is what really to do about this...
> > 
> > Should the vectorizer vectorize the loop anyway in this case?
> > 
> > Should licm not hoist the initial value of the PHI (but hoisting is
> > in
> > general nice, so...).
> > 
> > Should instcombine realize it can do something about this case even
> > if
> > the initial value of the PHI is "far" from the PHI.
> > 
> > Something completely different?
> > 
> > Any ideas?
> > 
> > Thanks,
> > Mikael
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: bbi-39227-lv.ll
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200507/1f0a0ef6/attachment.ksh>