[llvm-commits] [PATCH] Allow SelectionDAGBuilder to reorder loads past stores

Mon Dec 19 12:19:11 PST 2011

On Mon, 2011-12-19 at 10:52 -0800, Evan Cheng wrote:
> Hi Hal,
> 
> IMHO, this is the wrong way to fix this problem. The fundamental issue here is SelectionDAG is being used by the pre-RA scheduler so the conservative DAG building can have a significant impact on the schedule and the performance of generated code. Changing SelectionDAGBuilder to relax this is one way to go about this. But as you can tell, the current scheduler is not quite ready for this and you can see some large performance regressions.
> 

Fair enough, however, I fear that whether they get the load/store
independence information from the DAG or from their own aliasing
analysis, the results will be equally good (or bad) as with the current
patch. Either way, it is the scheduling heuristics that will need
improving.

As far as I can tell, most applications exhibit a performance gain from
this patch. So *if* the regressions are caused by deficiencies in
scheduling heuristics (as opposed to problems with DAGCombine or
Legalize), it might be worthwhile eating the performance regressions for
now, and using them as use cases to improve the scheduling heuristics
(once any incorrect code generation is fixed). In the mean time, most
users should be happier ;)

> We have long understood the current scheduling strategy is wrong. The plan for 2012 is to add a MachineInstr pre-RA scheduler and leave SelectionDAG for legalizer and dag combine. The new scheduler will probably use aliasing information to reorder loads / stores.
> 

The key point here is making sure that the scheduler has enough
information to make full use of the aliasing analysis. This is not
currently the case. The easiest way of doing this would be to make sure
that the scheduler has access to the original IR instructions.
Alternatively, the aliasing analysis could be enhanced to deal with ptr
+offset pairs, etc.

> The problem with changing SelectionDAGBuilder is this can negatively impact optimizations done during dag combine and legalization. So as it is the patch should not be enabled by default. At best, it can be committed but turned off. Do you have a good understanding what caused the severe regressions you have identified?

I have not yet looked into all of the regressions in detail. From what I
have seen, the ILP scheduler, once it is revealed that the loads/stores
are independent, tends to schedule many loads together in a big block,
then a block of computation, then a block of stores. This is (highly)
suboptimal compared to the original sequence in some cases.

I would be fine with submitting the patch such that it is turned off by
default. Especially while there are test-suite failures, I think that it
would need to be this way.

Thanks again,
Hal 

> 
> Evan
> 
> On Dec 19, 2011, at 9:42 AM, Hal Finkel wrote:
> 
> > The current SelectionDAGBuilder does not allow loads to be reordered
> > past stores, and does not allow stores to be reordered. This is a side
> > effect of the way the critical chain is constructed: there is a queue of
> > pending loads that is flushed (in parallel) to the root of the chain
> > upon encountering any store (and that store is also appended to the root
> > of the chain). Among other things, loop unrolling is far less effective
> > than it otherwise could be.
> > 
> > The attached patch allows SelectionDAGBuilder to use the available alias
> > analysis to reorder independent loads and stores. It changes the queue
> > of pending loads into a more general queue of pending memory operations,
> > and flushes, in parallel, all potentially-conflicting loads and stores
> > as necessary.
> > 
> > This can result in a significant performance boost. On my x86_64
> > machine, the average percentage decrease in execution time is ~8% (to
> > calculate my performance numbers from the test suite, I've included only
> > the 174 tests with a base execution time of at least 0.1s; the times of
> > the shorter tests seem noisy on my machine). Of these, 131 showed a
> > performance increase and 36 showed a performance decrease.
> > 
> > The top-5 winners were:
> > MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset - 92%
> > performance increase ( = runtime decrease)
> > MultiSource/Benchmarks/llubenchmark/llu - 47% performance increase
> > MultiSource/Applications/minisat/minisat - 47% performance increase
> > MultiSource/Benchmarks/sim/sim - 40% performance increase
> > MultiSource/Benchmarks/Prolangs-C++/life/life - 35.7% performance
> > increase
> > The top-5 losers were:
> > MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame - 88%
> > performance decrease
> > MultiSource/Benchmarks/VersaBench/beamformer/beamformer - 49%
> > performance decrease
> > MultiSource/Benchmarks/MallocBench/espresso/espresso 47% performance
> > decrease
> > MultiSource/Benchmarks/MiBench/automotive-bitcount/automotive-bitcount -
> > 21% performance decrease
> > MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 20%
> > performance decrease
> > 
> > The patch adds a few new options:
> > max-parallel-chains - replaces the old MaxParallelChains constant)
> > max-load-store-reorder - the maximum size of the reorder buffer -
> > previously it was unlimited, but contained only stores
> > no-reordering-past-stores - invokes the previous behavior
> > 
> > Some of the regression tests had to be updated because the order of some
> > stores changed. For most of these, I just updated the test to reflect
> > the new instruction sequence. The following tests I've marked as XFAIL
> > because they would require larger changes (and I'd like someone with
> > more experience than me to make sure that they really are okay and make
> > any necessary adjustments):
> > CodeGen/X86/2008-02-22-LocalRegAllocBug.ll
> > CodeGen/X86/2010-09-17-SideEffectsInChain.ll
> > CodeGen/X86/lea-recursion.ll
> > 
> > Also, there is one test-suite runtime failure on x86_64:
> > MultiSource/Benchmarks/Ptrdist/ft/ft
> > 
> > And several test-suite runtime failures on i686:
> > MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4
> > SingleSource/Benchmarks/Misc-C++/Large/ray
> > SingleSource/Benchmarks/Misc-C++/stepanov_container
> > SingleSource/Benchmarks/Shootout-C++/lists
> > SingleSource/Benchmarks/Shootout-C++/lists1
> > SingleSource/Benchmarks/Shootout-C++/sieve
> > 
> > Please review (and help with the test-suite failures).
> > 
> > Thank you in advance,
> > Hal
> > 
> > -- 
> > Hal Finkel
> > Postdoctoral Appointee
> > Leadership Computing Facility
> > Argonne National Laboratory
> > 
> > <llvm_lsro-20111219.diff>_______________________________________________
> > llvm-commits mailing list
> > llvm-commits at cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory