[llvm-dev] Missing vectorization of loop due to load late in the loop
Mikael Holmén via llvm-dev
llvm-dev at lists.llvm.org
Wed May 6 23:03:26 PDT 2020
Hi,
On Wed, 2020-05-06 at 22:35 +0000, Finkel, Hal J. wrote:
> > From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of
> > Mikael Holmén via llvm-dev <llvm-dev at lists.llvm.org>
> > Sent: Tuesday, May 5, 2020 8:25 AM
> > To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
> > Subject: [llvm-dev] Missing vectorization of loop due to load late
> > in the loop
> >
> > Hi,
> >
> > TL;DR: A loop doesn't get vectorized due to the interaction of
> > loop-
> > rotate, licm and instcombine. What to do about it?
> >
> >
> > Full story:
> >
> > In the benchmarks for our out-of-tree target we have a case that we
> > would like to get vectorized, but currently it isn't. I've done
> > some
> > digging to see why and have some kind of idea what prevents it, but
> > I
> > don't know what the best way to fix it would be so I thought I'd
> > share
> > a reduced version of it to see if anyone here have ideas.
> >
> > So, what happens can be reproduced on trunk with
> > opt -O3 -S -o - bbi-39227-reduced.ll
> >
> > The input program consists of two nested loops where the inner loop
> > loads a value and does some calculations and then the outer loop
> > writes
> > the calculated value somewhere.
> >
> > With
> > -debug-only=loop-vectorize
> >
> > I see this printout from the vectorizer for the inner loop:
> > LV: Not vectorizing: Found an unidentified PHI %h.15 = phi i32 [
> > %h.11, %inner.cond.preheader ], [ %h.1, %inner.body ]
> >
> > When we run the vectorizer the inner loop looks like
> >
> > inner.body: ; preds =
> > %inner.cond.preheader, %inner.body
> > %h.15 = phi i32 [ %h.11, %inner.cond.preheader ], [ %h.1,
> > %inner.body
> > ]
> > %h.pn4 = phi i32* [ %h, %inner.cond.preheader ], [ %hp.1,
> > %inner.body
> > ]
> > %j.03 = phi i16 [ 0, %inner.cond.preheader ], [ %j.1, %inner.body
> > ]
> > %real.02 = phi i32 [ 0, %inner.cond.preheader ], [ %sub,
> > %inner.body
> > ]
> > %hp.1 = getelementptr inbounds i32, i32* %h.pn4, i64 1
> > %0 = shl i32 %h.15, 16
> > %conv7 = ashr exact i32 %0, 16
> > %add = sub i32 %real.02, %h.15
> > %sub = add i32 %add, %conv7
> > %j.1 = add nuw nsw i16 %j.03, 1
> > %h.1 = load i32, i32* %hp.1, align 1, !tbaa !4
> > %cmp3 = icmp ult i16 %j.03, 99
> > br i1 %cmp3, label %inner.body, label %inner.end, !llvm.loop !8
> >
> > And the vectorizer bails out since the load is placed "late" in the
> > loop, so
> > RecurrenceDescriptor::isFirstOrderRecurrence
> > returns false.
> >
> > If we just move the load before the definition of %0 in the
> > vectorizer
> > input, then we instead get
> > LV: We can vectorize this loop!
>
> That just seems like a bug. Unless I'm missing something, the load
> depends only on %hp.1, there are no use-def or memory dependencies in
> between the load and the phi. The semantics are exactly the same
> between the two placements of the load instruction. The order of
> independent SSA calculations should not affect whether or not the
> vectorizer is able to vectorize the loop.
Great! That's what I thought too, but as I don't know the vectorizer
I'm not sure what requirements are reasonable on its input.
When the load is placed early, it's this piece of code in
LoopVectorizationLegality::canVectorizeInstrs() that saves us:
if (RecurrenceDescriptor::isFirstOrderRecurrence(Phi, TheLoop,
SinkAfter, DT)) {
AllowedExit.insert(Phi);
FirstOrderRecurrences.insert(Phi);
continue;
}
but when it's placed late it doesn't and we end up with "Found an
unidentified PHI".
The vectorizer behavior can be reproduced with
opt -loop-vectorize -S -o - bbi-39227-lv.ll
I'll write a PR about this then. And if anyone has ideas about how the
LoopVectorizationLegality checks can be relaxed to allow this case too,
please chime in.
Thanks!
Mikael
>
> -Hal
>
>
> > Originally the loop had two loads, and then one of them was
> > actually
> > placed early in the loop block. That was done by instcombine, by
> > InstCombiner::FoldPHIArgLoadIntoPHI.
> >
> > The reason this doesn't happen for the load we see in the reduced
> > example is because after loop rotation licm hoists the start value
> > of
> > the PHI, so when instcombine tries to do FoldPHIArgLoadIntoPHI, the
> > start value isn't placed in the direct predecessor block of the
> > PHI,
> > and the folding is aborted.
> >
> > Therefore I tried to squeeze in an additional run of instcombine
> > before
> > the licm run that does that hoisting, and then instcombine does the
> > folding and the load in the inner loop is done early instead of
> > late in
> > the loop. The vectorizer then is happy and accepts to vectorize the
> > loop. I have no idea if inserting another run of instcombine is a
> > good
> > idea though and I see some mixed results in other benchmarks.
> >
> > So, the question is what really to do about this...
> >
> > Should the vectorizer vectorize the loop anyway in this case?
> >
> > Should licm not hoist the initial value of the PHI (but hoisting is
> > in
> > general nice, so...).
> >
> > Should instcombine realize it can do something about this case even
> > if
> > the initial value of the PHI is "far" from the PHI.
> >
> > Something completely different?
> >
> > Any ideas?
> >
> > Thanks,
> > Mikael
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: bbi-39227-lv.ll
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200507/1f0a0ef6/attachment.ksh>
More information about the llvm-dev
mailing list