[LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable

Hal Finkel hfinkel at anl.gov
Wed Jul 15 23:58:02 PDT 2015


----- Original Message -----

> From: "Hal Finkel" <hfinkel at anl.gov>
> To: "Chandler Carruth" <chandlerc at google.com>
> Cc: llvmdev at cs.uiuc.edu
> Sent: Thursday, July 16, 2015 1:46:42 AM
> Subject: Re: [LLVMdev] Improving loop vectorizer support for loops
> with a volatile iteration variable

> ----- Original Message -----

> > From: "Chandler Carruth" <chandlerc at google.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: "Hyojin Sung" <hsung at us.ibm.com>, llvmdev at cs.uiuc.edu
> > Sent: Thursday, July 16, 2015 1:06:03 AM
> > Subject: Re: [LLVMdev] Improving loop vectorizer support for loops
> > with a volatile iteration variable

> > On Wed, Jul 15, 2015 at 6:36 PM Hal Finkel <hfinkel at anl.gov> wrote:

> > > > From: "Chandler Carruth" <chandlerc at google.com>
> > > > To: "Hyojin Sung" <hsung at us.ibm.com>, llvmdev at cs.uiuc.edu
> > > > Sent: Wednesday, July 15, 2015 7:34:54 PM
> > > > Subject: Re: [LLVMdev] Improving loop vectorizer support for loops
> > > > with a volatile iteration variable

> > > > On Wed, Jul 15, 2015 at 12:55 PM Hyojin Sung <hsung at us.ibm.com> wrote:

> > > > > Hi all,

> > > > > I would like to propose an improvement of the “almost dead” block
> > > > > elimination in Transforms/Local.cpp so that it will preserve the
> > > > > canonical loop form for loops with a volatile iteration variable.

> > > > > *** Problem statement
> > > > > Nested loops in LCALS Subset B (https://codesign.llnl.gov/LCALS.php)
> > > > > are not vectorized with LLVM -O3 because the LLVM loop vectorizer
> > > > > fails the test of whether the loop latch and the exiting block of a
> > > > > loop are the same. The loops are vectorizable, and they do get
> > > > > vectorized with LLVM -O2

> > > > I would be interested to know why -O2 succeeds here.

> > > > > and also with other commercial compilers (icc, xlc).

> > > > > *** Details
> > > > > These loops ended up with different loop latch and exiting blocks
> > > > > after a series of optimizations, including loop unswitching, jump
> > > > > threading, simplify-the-CFG, and loop simplify. The fundamental
> > > > > problem here is that the above optimizations cannot recognize a
> > > > > loop with a volatile iteration variable and do not preserve its
> > > > > canonical loop structure.

> > > > Ok, meta-level question first:

> > > > Why do we care about the performance of loops with a volatile
> > > > iteration variable?

> > > I don't think we do; however, I think that misses the point. In this
> > > case, the volatile iteration variable is just an easy way to expose
> > > a problem that we have with nested-loop canonicalization and the
> > > vectorizer. To be specific:

> > > This we vectorize just fine:

> > > void foo1(float * restrict x, float * restrict y, float * restrict z) {
> > >   for (int i = 0; i < 1000; ++i) {
> > >     for (int j = 0; j < 1600; ++j) {
> > >       x[j] = y[j] + z[j];
> > >     }
> > >   }
> > > }

> > > And, indeed, this we don't (the only change is adding volatile on i):

> > > void foo2(float * restrict x, float * restrict y, float * restrict z) {
> > >   for (volatile int i = 0; i < 1000; ++i) {
> > >     for (int j = 0; j < 1600; ++j) {
> > >       x[j] = y[j] + z[j];
> > >     }
> > >   }
> > > }

> > > However, this we don't either, and that's a big problem:

> > > int done(float *x, float *y, float *z);
> > >
> > > void foo3(float * restrict x, float * restrict y, float * restrict z) {
> > >   while (!done(x, y, z)) {
> > >     for (int j = 0; j < 1600; ++j) {
> > >       x[j] = y[j] + z[j];
> > >     }
> > >   }
> > > }

> > > And the underlying reason is the same. The IR at the point in time
> > > when the loop vectorizer runs looks like this:

> > > define void @foo3(float* noalias %x, float* noalias %y, float* noalias %z) #0 {
> > > entry:
> > >   %call.14 = tail call i32 @done(float* %x, float* %y, float* %z) #1
> > >   %lnot.15 = icmp eq i32 %call.14, 0
> > >   br i1 %lnot.15, label %for.body.preheader, label %while.end
> > >
> > > for.body.preheader:                     ; preds = %entry
> > >   br label %for.body
> > >
> > > while.cond.loopexit:                    ; preds = %for.body
> > >   %call = tail call i32 @done(float* %x, float* %y, float* %z) #1
> > >   %lnot = icmp eq i32 %call, 0
> > >   br i1 %lnot, label %for.body.backedge, label %while.end.loopexit
> > >
> > > for.body:                               ; preds = %for.body.backedge, %for.body.preheader
> > >   %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, %for.body.backedge ]
> > >   ...
> > >   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
> > >   %exitcond = icmp eq i64 %indvars.iv.next, 1600
> > >   br i1 %exitcond, label %while.cond.loopexit, label %for.body.backedge
> > >
> > > for.body.backedge:                      ; preds = %for.body, %while.cond.loopexit
> > >   %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ]
> > >   br label %for.body
> > >
> > > while.end.loopexit:                     ; preds = %while.cond.loopexit
> > >   br label %while.end
> > >
> > > while.end:                              ; preds = %while.end.loopexit, %entry
> > >   ret void
> > > }

> > > and we can currently only vectorize loops where the loop latch is
> > > also the loop's exiting block. In this case, as in the case with the
> > > volatile variable, vectorization is blocked by this constraint (here
> > > the backedge is from the terminator of %for.body.backedge, but the
> > > loop's exiting block is %for.body). The check in the vectorizer is
> > > explicit:

> > >   // We only handle bottom-tested loops, i.e. loops in which the condition
> > >   // is checked at the end of each iteration. With that we can assume that
> > >   // all instructions in the loop are executed the same number of times.
> > >   if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) {
> > >     ...

> > Thanks for the detailed explanation. This makes it much clearer why
> > we need to handle it. I think it's much better to look at nested
> > loops of this form than at anything to do with volatile -- the latter
> > is too prone to being turned off by other random optimizations.

> > Regarding this problem, it would be interesting to know, based on
> > this explanation, what the desired fix would be. I see at least
> > these options:

> > 1) Canonicalize loops harder to make them look the way the vectorizer
> > wants. If this can be done without causing significant problems, it
> > seems likely the best approach.

> I agree. In this case, we could certainly fold the trivial
> %for.body.backedge block into %for.body, meaning transforming this:

> for.body.backedge:                      ; preds = %while.cond.loopexit, %for.body
>   %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ]
>   br label %for.body
>
> for.body:                               ; preds = %for.body.backedge, %for.body.preheader
>   %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, %for.body.backedge ]
>   ...
>   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
>   %exitcond = icmp eq i64 %indvars.iv.next, 1600
>   br i1 %exitcond, label %while.cond.loopexit, label %for.body.backedge

> into this:

> for.body:                               ; preds = %for.body.preheader, %while.cond.loopexit, %for.body
>   %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ 0, %while.cond.loopexit ], [ %indvars.iv.next, %for.body ]
>   ...
>   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
>   %exitcond = icmp eq i64 %indvars.iv.next, 1600
>   br i1 %exitcond, label %while.cond.loopexit, label %for.body

> and this seems pretty trivial when %for.body.backedge is completely
> empty (as in this case), but if it had non-PHI instructions in it on
> which the existing PHIs in %for.body depended, then maybe this becomes
> less trivial?

Although, based on what I said below, we can't currently vectorize the
case with instructions there anyway, for more fundamental reasons.

-Hal

> > 2) Teach the vectorizer to vectorize without this constraint by
> > instead establishing the actual invariant it cares about.

> It really cares that there is no code between the latch and the exit,
> because such code is not really part of the loop (it only runs once),
> and no code between the exit and the latch (because such code runs for
> one fewer iteration than the code before the exit). At least nothing
> with side effects, I presume.

> -Hal

> > Maybe there is another strategy?

> > > -Hal

> > > > That seems both counter-intuitive and unlikely to be a useful goal.
> > > > We simply don't optimize volatile operations well in *any* part of
> > > > the optimizer, and I'm not sure why we need to start trying to fix
> > > > that. This seems like an irreparably broken benchmark, but perhaps
> > > > there is a motivation I don't yet see.

> > > > Assuming that sufficient motivation arises to try to fix this, see
> > > > my comments below:

> > > > > (1) Loop unswitching generates several empty placeholder BBs,
> > > > > containing only PHI nodes, after separating out a shorter path
> > > > > with no inner-loop execution from the standard path.

> > > > > (2) The jump threading and simplify-the-CFG passes independently
> > > > > call TryToSimplifyUnconditionalBranchFromEmptyBlock() in
> > > > > Transforms/Utils/Local.cpp to get rid of almost-empty BBs.

> > > > > (3) TryToSimplifyUnconditionalBranchFromEmptyBlock() eliminates
> > > > > the placeholder BBs after loop unswitching and merges them into
> > > > > subsequent blocks, including the header of the inner loop. Before
> > > > > eliminating a block, the function checks whether the block is a
> > > > > loop header by looking at its PHI nodes, so that it can be saved,
> > > > > but this test fails for loops with a volatile iteration variable.

> > > > Why does this fail for a volatile iteration variable but not for a
> > > > non-volatile one? I think understanding that will be key to
> > > > understanding how it should be fixed.

> > > > _______________________________________________
> > > > LLVM Developers mailing list
> > > > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory


-- 

Hal Finkel 
Assistant Computational Scientist 
Leadership Computing Facility 
Argonne National Laboratory 

