[llvm-commits] [PATCH] Allow SelectionDAGBuilder to reorder loads past stores

Thu Dec 22 16:15:36 PST 2011

On Wed, 2011-12-21 at 10:44 -0600, Sergei Larin wrote:
> Hal, 
> 
>   I actually made the same fix internally (a couple of months ago), and it
> also resulted in severe performance degradation. To solve it for our back
> end (Hexagon) I ended up modifying the scheduler. In fact, I introduced
> our own scheduler (calling it VLIW) to handle the newly available parallelism
> and the resulting register pressure. The result was a significant overall
> performance gain on a wide (internal) test suite, with some kernels gaining
> 40-60%. I tried to accomplish the same with the existing infrastructure, but
> failed. Now you are seeing a similar issue with another architecture. I
> really wonder what your next move shall be.

Sergei,

By enabling load/store reordering, I have been able to extract a similar
performance gain on a set of benchmarks I use internally (especially on
those with partially-unrolled loops). I have the
advantage of being able to, for the most part, use the existing
infrastructure. The PPC 440-style chips that I work with, for example,
are multi-pipeline but in-order and, once the artificial load/store
dependencies are removed, the scoreboard hazard detection works pretty
well. Combining the initial bottom-up scheduling with a post-RA top-down
pass (after full anti-dependence breaking) generates highly-competitive
schedules in many cases.
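
For readers less familiar with the term, here is a minimal sketch of what
scoreboard-style hazard detection does in a top-down pass. It is a toy
model, not the hazard recognizer that LLVM ships: the names Instr,
ToyScoreboard, cyclesToIssueAll and twoPipeExample are all hypothetical,
every instruction is assumed independent, and each one simply occupies one
pipeline for a fixed number of cycles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy instruction: occupies one pipeline for some cycles.
struct Instr {
  std::size_t unit; // index of the pipeline the instruction issues to
  int busyCycles;   // cycles the pipeline stays occupied (assumed >= 1)
};

// Toy scoreboard: remembers the cycle at which each pipeline becomes free.
class ToyScoreboard {
public:
  explicit ToyScoreboard(std::size_t numUnits) : freeAt(numUnits, 0) {}

  // True if issuing in the current cycle would hit a busy pipeline.
  bool hasHazard(const Instr &in) const {
    assert(in.unit < freeAt.size() && "unknown pipeline");
    return freeAt[in.unit] > cycle;
  }

  // Record that the instruction issues in the current cycle.
  void issue(const Instr &in) {
    assert(in.unit < freeAt.size() && "unknown pipeline");
    freeAt[in.unit] = cycle + in.busyCycles;
  }

  void advanceCycle() { ++cycle; }
  int currentCycle() const { return cycle; }

private:
  std::vector<int> freeAt;
  int cycle = 0;
};

// Greedy top-down pass: in each cycle, issue every not-yet-issued
// instruction (scanned in program order) that has no hazard.
// Returns the number of cycles needed to issue everything.
int cyclesToIssueAll(const std::vector<Instr> &instrs, std::size_t numUnits) {
  ToyScoreboard sb(numUnits);
  std::vector<bool> issued(instrs.size(), false);
  std::size_t remaining = instrs.size();
  while (remaining > 0) {
    for (std::size_t i = 0; i < instrs.size(); ++i) {
      if (!issued[i] && !sb.hasHazard(instrs[i])) {
        sb.issue(instrs[i]);
        issued[i] = true;
        --remaining;
      }
    }
    sb.advanceCycle();
  }
  return sb.currentCycle();
}

// Two 2-cycle operations on pipeline 0 and two 1-cycle operations on
// pipeline 1, listed in program order.
std::vector<Instr> twoPipeExample() {
  return {{0, 2}, {0, 2}, {1, 1}, {1, 1}};
}
```

With two pipelines, cyclesToIssueAll(twoPipeExample(), 2) finishes in 3
cycles, because the operations on pipeline 1 fill the slots in which
pipeline 0 is busy. That kind of interleaving is only available to the
scheduler once the operations are not tied together by artificial ordering
dependencies.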

I can certainly understand, however, how the current schedulers would be
suboptimal for your kind of architecture.

My current issue is the ILP scheduling used for the x86
architectures. Because they have no itineraries, the scheduling is
purely heuristic, and the heuristics currently in place were never tuned
without the strict critical-chain load/store ordering. When you hand
this scheduler something else, sometimes it does a great job and
sometimes it does not. I don't expect to have changes accepted into
trunk if they mess up performance on x86, and so I've been working to
retune the heuristics to deal with more-independent loads and stores.
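
To make the difference concrete, here is a rough sketch of how the
dependence graph changes when the strict ordering is dropped. It is only
loosely modeled on the chaining described above and is not code from the
patch: MemOp, strictOrder, aliasOrder, criticalPath and unrolledCopy are
hypothetical names, addresses are abstract integers that alias only when
equal, and the latencies are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical memory operation (not an LLVM type).
struct MemOp {
  bool isStore;
  int addr;    // abstract address; two ops alias only if these are equal
  int latency; // made-up latency in cycles
  int dataDep; // index of the earlier op whose result is consumed, or -1
};

// preds[i] lists the operations that must complete before op i starts.
using Preds = std::vector<std::vector<std::size_t>>;

// Strict ordering: every op follows the last store, and every store
// additionally follows all loads seen since that store.
Preds strictOrder(const std::vector<MemOp> &ops) {
  Preds preds(ops.size());
  int lastStore = -1;
  std::vector<std::size_t> pendingLoads;
  for (std::size_t i = 0; i < ops.size(); ++i) {
    if (ops[i].dataDep >= 0)
      preds[i].push_back(static_cast<std::size_t>(ops[i].dataDep));
    if (lastStore >= 0)
      preds[i].push_back(static_cast<std::size_t>(lastStore));
    if (ops[i].isStore) {
      for (std::size_t l : pendingLoads)
        preds[i].push_back(l);
      pendingLoads.clear();
      lastStore = static_cast<int>(i);
    } else {
      pendingLoads.push_back(i);
    }
  }
  return preds;
}

// Relaxed ordering: besides data dependences, two ops are ordered only
// if they touch the same address and at least one of them is a store.
Preds aliasOrder(const std::vector<MemOp> &ops) {
  Preds preds(ops.size());
  for (std::size_t i = 0; i < ops.size(); ++i) {
    if (ops[i].dataDep >= 0)
      preds[i].push_back(static_cast<std::size_t>(ops[i].dataDep));
    for (std::size_t j = 0; j < i; ++j)
      if (ops[j].addr == ops[i].addr && (ops[j].isStore || ops[i].isStore))
        preds[i].push_back(j);
  }
  return preds;
}

// Length of the longest latency-weighted path through the graph.
int criticalPath(const std::vector<MemOp> &ops, const Preds &preds) {
  std::vector<int> finish(ops.size(), 0);
  int longest = 0;
  for (std::size_t i = 0; i < ops.size(); ++i) {
    int start = 0;
    for (std::size_t p : preds[i]) {
      assert(p < i && "dependences must point backwards in program order");
      start = std::max(start, finish[p]);
    }
    finish[i] = start + ops[i].latency;
    longest = std::max(longest, finish[i]);
  }
  return longest;
}

// Three iterations of a copy loop after unrolling: load a[k], store b[k].
std::vector<MemOp> unrolledCopy() {
  return {
      {false, 0, 3, -1}, {true, 100, 1, 0},
      {false, 1, 3, -1}, {true, 101, 1, 2},
      {false, 2, 3, -1}, {true, 102, 1, 4},
  };
}
```

For unrolledCopy(), the strict graph has a critical path of 12 cycles (one
long load/store chain), while the relaxed graph has a critical path of 4
(three independent load/store pairs). Heuristics tuned against the first
shape are seeing a very different graph in the second.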

>   I have not checked my changes in for the simple reason that we are not
> caught up with the LLVM tip in our internal repository (we are several months
> behind), and Evan has changed the game rules enough (I mean the removal of
> top-down schedulers) for my design to be incompatible with the tip (my
> scheduler is top-down).

Indeed. PPC 970 scheduling was non-existent for a while until I updated
the hazard detector. For the time being, I changed it from a pre- to
post-RA detector, so it is still top-down, but I've not really looked at
how well that does.

>   I still plan to submit my work, but it needs to be changed first, and
> that takes time. 

I am curious to know how you are doing that, algorithmically speaking.
In some cases, just "inverting" the selection logic is sufficient, but
it is not clear to me that is always the case.

>   Finally, what I am trying to say - if you are interested in what I have
> been doing, or you know a better solution for the problem within existing
> infrastructure, I would be very interested in talking about it. 

I am interested in what you've been doing (and I'm sure a number of
other people are interested as well). I don't really have a better
solution for you (unless you can do everything you need with a post-RA
hazard detector, in which case, use that for now).

More generally, however, since Evan has said that they'll be updating
the schedulers in the coming year anyway, we should work (as a
community) to make a clear set of requirements so that, hopefully,
whatever comes out of the design process will work with as many
architectures as possible.

 -Hal

> 
> Thanks.
> 
> 
> Sergei Larin
> 
> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.
> 
> 
> > -----Original Message-----
> > From: llvm-commits-bounces at cs.uiuc.edu [mailto:llvm-commits-
> > bounces at cs.uiuc.edu] On Behalf Of Hal Finkel
> > Sent: Wednesday, December 21, 2011 7:50 AM
> > To: Jakob Stoklund Olesen
> > Cc: llvm-commits at cs.uiuc.edu
> > Subject: Re: [llvm-commits] [PATCH] Allow SelectionDAGBuilder to
> > reorder loads past stores
> > 
> > It turns out that a significant cause of the performance regressions
> > caused by this patch is related to this issue: with the patch applied
> > the scheduler is now free to schedule many more things, especially
> > stores, after calls (especially intrinsics that are expanded to lib
> > calls). This tendency is bad because of the spilling necessary to cross
> > the call boundary. I am working on a proposed solution, and I'll post
> > an updated patch soon.
> > 
> > Thanks again,
> > Hal
> > 
> > On Tue, 2011-12-20 at 12:52 -0600, Hal Finkel wrote:
> > > On Tue, 2011-12-20 at 10:44 -0800, Jakob Stoklund Olesen wrote:
> > > > On Dec 20, 2011, at 9:22 AM, Hal Finkel wrote:
> > > >
> > > > > when I later look at the register map, only XMM0 and XMM1 are ever
> > > > > assigned to vregs, everything else is spilled. This is wrong. Do you
> > > > > have any ideas on what could be going wrong or other things I should
> > > > > examine? Could the register allocator not be accounting correctly for
> > > > > callee-saved registers when computing live-interval interference
> > > > > information?
> > > >
> > > > There are no callee-saved xmm registers.
> > >
> > > Thanks! I was mixing up the Win64 calling convention with the regular
> > > one. That explains things, so, I suppose the right thing to do is to
> > > make sure all stores are flushed before any call (which I think it
> > > already does), and any intrinsic that will be expanded (which it will
> > > not currently do).
> > >
> > >  -Hal
> > >
> > > >
> > > > /jakob
> > > >
> > >
> > 
> > --
> > Hal Finkel
> > Postdoctoral Appointee
> > Leadership Computing Facility
> > Argonne National Laboratory
> 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory