[llvm-commits] [PATCH] Stack Coloring optimization

Mon Sep 3 19:07:48 PDT 2012

On Mon, 3 Sep 2012 18:29:07 +0100
James Molloy <James.Molloy at arm.com> wrote:

> Hi Hal,
> 
> It is along the same lines, and is very similar. It affects
> PendingLoads in SelectionDAGBuilder.
> 
> Where I've differed from you in algorithm (and I'm still trying to
> prove to myself whether they should be functionally equivalent, yours
> and mine...) is to try and keep as closely as possible to the
> previous behaviour, i.e. bunching up loads but never bunching up
> stores.
> 
> Instead of calculating whether mem ops should be flushed in getRoot
> as you do, I use the AliasSetTracker to maintain a chain root for
> every known nonaliasing set of operations. Target memory intrinsics
> and calls obviously serialize everything, and when AliasSets merge
> their associated roots are TokenFactored.
> 
> That way, we have several chains but the behaviour in each is very
> similar to previously, so the ideal is that it doesn't affect
> performance too much.

Sounds good.
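
The per-alias-set chain idea quoted above can be modelled outside of LLVM with a small union-find structure. This is only a toy sketch under stated assumptions: the class ChainPerAliasSet and all of its members are hypothetical names, not AliasSetTracker or SelectionDAGBuilder APIs, and a "chain" here is just a list of op ids standing in for what a TokenFactor would join.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Toy model only: none of these names are LLVM APIs.
// Each memory location starts in its own alias set with its own chain.
class ChainPerAliasSet {
  std::vector<std::size_t> Parent;     // union-find parent per location
  std::vector<std::vector<int>> Chain; // op ids chained under each set leader

public:
  explicit ChainPerAliasSet(std::size_t NumLocations)
      : Parent(NumLocations), Chain(NumLocations) {
    std::iota(Parent.begin(), Parent.end(), std::size_t{0});
  }

  // Leader of the alias set that currently owns Loc.
  std::size_t find(std::size_t Loc) {
    assert(Loc < Parent.size() && "unknown location");
    while (Parent[Loc] != Loc) {
      Parent[Loc] = Parent[Parent[Loc]]; // path halving
      Loc = Parent[Loc];
    }
    return Loc;
  }

  // Record a memory op on the chain of the set that owns Loc.
  void addOp(std::size_t Loc, int OpId) { Chain[find(Loc)].push_back(OpId); }

  // Two sets turned out to alias: merge them and join their chains,
  // which is the role a TokenFactor plays in the description above.
  void merge(std::size_t A, std::size_t B) {
    std::size_t RA = find(A), RB = find(B);
    if (RA == RB)
      return;
    Parent[RB] = RA;
    Chain[RA].insert(Chain[RA].end(), Chain[RB].begin(), Chain[RB].end());
    Chain[RB].clear();
  }

  // A call or target memory intrinsic serializes everything.
  void serializeAll(int OpId) {
    if (Parent.empty())
      return;
    for (std::size_t I = 1; I < Parent.size(); ++I)
      merge(0, I);
    Chain[find(0)].push_back(OpId);
  }

  const std::vector<int> &chainOf(std::size_t Loc) { return Chain[find(Loc)]; }
};
```

Ops on locations that never merge stay on separate chains, so nothing in this model orders them against each other; that independence is what would let a scheduler reorder them.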

> 
> Indeed, this appears to be the case. Because mine is not as
> wide-ranging an optimisation as yours, the speedups are small (5-8%
> on non-tiny benchmarks), but similarly the regressions are trivial
> (0-1% if my numbers add up).

This was measured on x86 or ARM? I ended up running into problems with
the ILP-scheduling heuristics used for x86.

> 
> In synthetic benchmarks that very closely resemble OpenCL kernels
> (unrolled loops where we often have the idiom "load stuff; do stuff;
> store stuff;" and reordering loads past stores is very important for
> ILP), I have measured around 40% speedup.

Great. These kinds of unrolled kernels were also my motivation for
looking at this.
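
For illustration, the "load stuff; do stuff; store stuff" idiom in an unrolled loop might look like the sketch below. The function scaleUnrolled and its parameters are made up for this example and do not come from either patch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel, unrolled by two. Each half is load; compute; store.
void scaleUnrolled(const float *In, float *Out, std::size_t N, float K) {
  assert((N == 0 || (In && Out)) && "null buffer");
  std::size_t I = 0;
  for (; I + 1 < N; I += 2) {
    float A = In[I];     // load 0
    Out[I] = A * K;      // store 0
    // Load 1 may only be hoisted above store 0 if the compiler can show
    // that In[I + 1] and Out[I] do not alias. With a single chain for all
    // memory ops, the load stays ordered after the store.
    float B = In[I + 1]; // load 1
    Out[I + 1] = B * K;  // store 1
  }
  for (; I < N; ++I)     // remainder iteration
    Out[I] = In[I] * K;
}
```

If In and Out are known not to overlap, both loads can be issued before either store, which is where the extra ILP comes from.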

 -Hal

> 
> Cheers,
> 
> James
> ________________________________________
> From: Hal Finkel [hfinkel at anl.gov]
> Sent: 03 September 2012 17:38
> To: James Molloy
> Cc: Jakob Stoklund Olesen; llvm-commits at cs.uiuc.edu
> Subject: Re: [llvm-commits] [PATCH] Stack Coloring optimization
> 
> On Mon, 03 Sep 2012 14:47:59 +0100
> James Molloy <james.molloy at arm.com> wrote:
> 
> > Hi,
> >
> > I'm interested in this; is this code in trunk at the moment?
> >
> > I've been working on an optimisation to put non-aliasing loads and
> > stores on different chains during selectiondag creation - is this
> > scheduler code supposed to reorder independent loads and stores?
> 
> James,
> 
> Is this different from the patch I proposed last year?
> 
>  -Hal
> 
> >
> > Cheers,
> >
> > James
> >
> > On Thu, 2012-08-30 at 19:57 +0100, Jakob Stoklund Olesen wrote:
> > > On Aug 30, 2012, at 2:51 AM, Nadav Rotem <nrotem at apple.com> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I've been working on a new optimization for reducing the stack
> > > > size.  Currently, when we declare allocas in LLVM IR, these
> > > > allocas are directly translated to stack slots. And when we
> > > > > inline small functions into larger functions, these allocas add
> > > > up and take up lots of space.  In some cases we know that the
> > > > use of the allocas is bounded by disjoint regions.  In this
> > > > optimization we merge multiple disjoint slots into a single
> > > > slot.  LLVM uses the lifetime markers for specifying the
> > > > > regions in which the alloca is used.  This patch propagates the
> > > > lifetime markers through SelectionDAG and makes them pseudo
> > > > ops.  Later, a pre-register-allocator pass constructs live
> > > > > intervals which represent the liveness of different stack
> > > > slots. Next, the pass merges disjoint intervals.  Notice that
> > > > > lifetime markers are not perfect single-entry-single-exit
> > > > regions. They may be removed by optimizations, they may start
> > > > with two markers, and end with one, or even not end at all!
> > > >
> > > > So, why is this done in codegen?  There are a number of reasons.
> > > > First, joining allocas may hinder alias analysis. Second, in the
> > > > future we would like to share the alloca space with spill slots.
> > >
> > > About alias analysis. Andy was just showing me the scheduler's AA
> > > code. It is using the memory operands to find the underlying LLVM
> > > IR object. Loads and stores to different allocas are partitioned
> > > according to their underlying IR object.
> > >
> > > Merging stack slots before the MI scheduler could invalidate this
> > > form of alias analysis since two IR allocas can share a stack
> > > slot.
> > >
> > > /jakob
> > >
> > > _______________________________________________
> > > llvm-commits mailing list
> > > llvm-commits at cs.uiuc.edu
> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > >
> 
> 
> 
> --
> Hal Finkel
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
> 
> 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory