[llvm-commits] [PATCH] Allow SelectionDAGBuilder to reorder loads past stores

Mon Dec 19 09:42:09 PST 2011

The current SelectionDAGBuilder does not allow loads to be reordered
past stores, and does not allow stores to be reordered. This is a side
effect of the way the critical chain is constructed: there is a queue of
pending loads that is flushed (in parallel) to the root of the chain
whenever a store is encountered, and that store then becomes the new
root of the chain. Among other things, this makes loop unrolling far
less effective than it could be, because the loads of one unrolled
iteration cannot be moved above the stores of the previous one.
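To make the current scheme concrete, here is a toy C++ model of it. The
types and names (ChainNode, PendingLoadsBuilder, addLoad, addStore) are
hypothetical and only loosely follow the PendingLoads/getRoot() pattern
in SelectionDAGBuilder; it is a sketch of the ordering constraints under
that assumption, not the real SelectionDAG API. Each node records the
nodes it is chained to.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy node: not the real SDNode/SDValue types.
struct ChainNode {
  std::string Kind;               // "entry", "load", "store" or "tokenfactor"
  std::vector<std::size_t> Chain; // nodes that must complete before this one
};

// Toy model of the pending-loads scheme: loads hang off the current root and
// are queued; a store first flushes the queued loads into the root and then
// becomes the new root itself.
class PendingLoadsBuilder {
public:
  PendingLoadsBuilder() { addNode("entry", {}); } // node 0 is the entry node

  std::size_t addLoad() {
    // Loads depend only on the current root, so queued loads may run in
    // parallel with one another.
    std::size_t Id = addNode("load", {Root});
    PendingLoads.push_back(Id);
    return Id;
  }

  std::size_t addStore() {
    // Every store waits for all earlier loads and the previous store...
    std::size_t Id = addNode("store", {getRoot()});
    // ...and every later load or store waits for it.
    Root = Id;
    return Id;
  }

  // Flush pending loads into the root; several loads are joined in parallel
  // by a TokenFactor, a single load becomes the root directly.
  std::size_t getRoot() {
    if (PendingLoads.size() == 1)
      Root = PendingLoads.front();
    else if (PendingLoads.size() > 1)
      Root = addNode("tokenfactor", PendingLoads);
    PendingLoads.clear();
    return Root;
  }

  const ChainNode &node(std::size_t Id) const {
    assert(Id < Nodes.size() && "node index out of range");
    return Nodes[Id];
  }

private:
  std::size_t addNode(std::string Kind, std::vector<std::size_t> Chain) {
    Nodes.push_back({std::move(Kind), std::move(Chain)});
    return Nodes.size() - 1;
  }

  std::vector<ChainNode> Nodes;
  std::vector<std::size_t> PendingLoads;
  std::size_t Root = 0;
};
```

With the sequence load, load, store, load, store, the first store is
chained to a TokenFactor of both loads, and the third load is chained to
the first store even if it reads unrelated memory.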

The attached patch allows SelectionDAGBuilder to use the available alias
analysis to reorder independent loads and stores. It changes the queue
of pending loads into a more general queue of pending memory operations,
and, when necessary, flushes (in parallel) all potentially conflicting
loads and stores.
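A minimal sketch of the new idea, under loud assumptions: alias analysis
is replaced by a hypothetical byte-range overlap test (mayAlias), and
only the conflicting pending operations, together with the old root, are
joined into a new root before the incoming operation is queued. The
names (MemNode, PendingMemOpsBuilder) are illustrative only, and the
patch itself may organize the queue differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy node for the alias-aware model.
struct MemNode {
  std::string Kind;               // "entry", "load", "store" or "tokenfactor"
  std::vector<std::size_t> Chain; // nodes that must complete before this one
  std::int64_t Begin = 0;         // first byte accessed (loads/stores only)
  std::int64_t Size = 0;          // number of bytes accessed
};

class PendingMemOpsBuilder {
public:
  PendingMemOpsBuilder() { addNode("entry", {}, 0, 0); } // node 0 = entry

  std::size_t addLoad(std::int64_t Begin, std::int64_t Size) {
    return addMemOp("load", Begin, Size);
  }
  std::size_t addStore(std::int64_t Begin, std::int64_t Size) {
    return addMemOp("store", Begin, Size);
  }

  const MemNode &node(std::size_t Id) const {
    assert(Id < Nodes.size() && "node index out of range");
    return Nodes[Id];
  }

private:
  // Stand-in for an alias-analysis query: byte ranges overlap.
  static bool mayAlias(const MemNode &A, std::int64_t Begin, std::int64_t Size) {
    return A.Begin < Begin + Size && Begin < A.Begin + A.Size;
  }

  std::size_t addMemOp(const std::string &Kind, std::int64_t Begin,
                       std::int64_t Size) {
    bool IsStore = Kind == "store";
    std::vector<std::size_t> Conflicts, Independent;
    for (std::size_t Id : Pending) {
      const MemNode &Old = Nodes[Id];
      // Two loads never conflict; otherwise only possibly aliasing ones do.
      if ((IsStore || Old.Kind == "store") && mayAlias(Old, Begin, Size))
        Conflicts.push_back(Id);
      else
        Independent.push_back(Id);
    }
    if (!Conflicts.empty()) {
      // Flush only the conflicting operations, in parallel, into the root.
      // The old root is joined too, so nothing flushed earlier is lost.
      Conflicts.insert(Conflicts.begin(), Root);
      Root = addNode("tokenfactor", Conflicts, 0, 0);
      Pending = Independent;
    }
    std::size_t Id = addNode(Kind, {Root}, Begin, Size);
    Pending.push_back(Id);
    return Id;
  }

  std::size_t addNode(const std::string &Kind, std::vector<std::size_t> Chain,
                      std::int64_t Begin, std::int64_t Size) {
    MemNode N;
    N.Kind = Kind;
    N.Chain = std::move(Chain);
    N.Begin = Begin;
    N.Size = Size;
    Nodes.push_back(std::move(N));
    return Nodes.size() - 1;
  }

  std::vector<MemNode> Nodes;
  std::vector<std::size_t> Pending;
  std::size_t Root = 0;
};
```

Here a load from bytes 8..11 issued after a store to bytes 0..3 is
chained only to the entry node, so it may be scheduled before that store,
while a later load from bytes 0..3 still waits for the store. Folding the
old root into each flush is a conservative choice: it keeps every flushed
operation reachable from the root, at the cost of also ordering later,
unrelated operations after it.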

This can result in a significant performance boost. On my x86_64
machine, the average percentage decrease in execution time is ~8%. To
calculate my performance numbers from the test suite, I've included only
the 174 tests with a base execution time of at least 0.1 s, because the
times of the shorter tests seem noisy on my machine. Of these 174 tests,
131 showed a performance increase and 36 showed a performance decrease.
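The mail does not give the exact averaging formula. Assuming it is the
arithmetic mean of per-test percentage decreases in run time, taken over
tests whose baseline time is at least 0.1 s, the summary could be
computed as follows (Timing, Summary and summarize are hypothetical
names).

```cpp
#include <cstddef>
#include <vector>

// One test-suite result: execution times in seconds before and after.
struct Timing {
  double BaseSeconds;
  double PatchedSeconds;
};

struct Summary {
  std::size_t Counted = 0;         // tests that passed the noise threshold
  std::size_t Faster = 0;          // run time went down
  std::size_t Slower = 0;          // run time went up
  double MeanPercentDecrease = 0;  // mean of per-test % decrease in run time
};

// Keep only tests whose baseline time is at least MinBaseSeconds, then
// average the per-test percentage decrease in run time.
Summary summarize(const std::vector<Timing> &Results,
                  double MinBaseSeconds = 0.1) {
  Summary S;
  double Total = 0;
  for (const Timing &T : Results) {
    if (T.BaseSeconds < MinBaseSeconds)
      continue; // short tests are treated as too noisy
    double PercentDecrease =
        100.0 * (T.BaseSeconds - T.PatchedSeconds) / T.BaseSeconds;
    Total += PercentDecrease;
    ++S.Counted;
    if (PercentDecrease > 0)
      ++S.Faster;
    else if (PercentDecrease < 0)
      ++S.Slower;
  }
  if (S.Counted != 0)
    S.MeanPercentDecrease = Total / S.Counted;
  return S;
}
```

Under this reading, a test whose run time drops from 2.0 s to 1.0 s
contributes 50% to the average, and tests with no change count toward
neither the faster nor the slower total.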

The top-5 winners were:
MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset - 92% performance increase (= runtime decrease)
MultiSource/Benchmarks/llubenchmark/llu - 47% performance increase
MultiSource/Applications/minisat/minisat - 47% performance increase
MultiSource/Benchmarks/sim/sim - 40% performance increase
MultiSource/Benchmarks/Prolangs-C++/life/life - 35.7% performance increase

The top-5 losers were:
MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame - 88% performance decrease
MultiSource/Benchmarks/VersaBench/beamformer/beamformer - 49% performance decrease
MultiSource/Benchmarks/MallocBench/espresso/espresso - 47% performance decrease
MultiSource/Benchmarks/MiBench/automotive-bitcount/automotive-bitcount - 21% performance decrease
MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 20% performance decrease

The patch adds three new options:
max-parallel-chains - replaces the old MaxParallelChains constant
max-load-store-reorder - the maximum size of the reorder buffer
(previously it was unlimited, but contained only loads)
no-reordering-past-stores - restores the previous behavior
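One way these three options could gate the pending queue, as a hedged
sketch: ReorderOptions, mustFlushAll and splitIntoTokenFactors are
hypothetical, the default values are illustrative assumptions rather
than the patch's defaults, and max-parallel-chains is assumed to keep
the role of the old constant, namely bounding how many chains a single
TokenFactor joins.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical knobs mirroring the three options; defaults are assumptions.
struct ReorderOptions {
  std::size_t MaxParallelChains = 64;    // widest TokenFactor built at once
  std::size_t MaxLoadStoreReorder = 100; // cap on queued memory operations
  bool NoReorderingPastStores = false;   // true: behave like the old scheme
};

// Should the whole pending queue be flushed before adding a new operation?
bool mustFlushAll(const ReorderOptions &Opts, std::size_t PendingSize,
                  bool NewOpIsStore) {
  if (NewOpIsStore && Opts.NoReorderingPastStores)
    return true; // old scheme: a store waits for everything queued so far
  return PendingSize >= Opts.MaxLoadStoreReorder; // keep the buffer bounded
}

// Split a flush into groups of at most MaxParallelChains operands; each
// group would become one TokenFactor.
std::vector<std::vector<std::size_t>>
splitIntoTokenFactors(const std::vector<std::size_t> &Pending,
                      std::size_t MaxParallelChains) {
  assert(MaxParallelChains > 0 && "a TokenFactor needs at least one operand");
  std::vector<std::vector<std::size_t>> Groups;
  for (std::size_t I = 0; I < Pending.size(); I += MaxParallelChains) {
    std::size_t End = std::min(Pending.size(), I + MaxParallelChains);
    Groups.emplace_back(Pending.begin() + static_cast<std::ptrdiff_t>(I),
                        Pending.begin() + static_cast<std::ptrdiff_t>(End));
  }
  return Groups;
}
```

Bounding the queue keeps the number of alias queries per new operation
limited, and bounding TokenFactor width keeps the DAG from growing very
wide; both trade some reordering freedom for compile time.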

Some of the regression tests had to be updated because the order of some
stores changed. For most of these, I just updated the test to reflect
the new instruction sequence. I've marked the following tests as XFAIL
because they would require larger changes (and I'd like someone with
more experience than me to confirm that the new output really is okay
and to make any necessary adjustments):
CodeGen/X86/2008-02-22-LocalRegAllocBug.ll
CodeGen/X86/2010-09-17-SideEffectsInChain.ll
CodeGen/X86/lea-recursion.ll

Also, there is one test-suite runtime failure on x86_64:
MultiSource/Benchmarks/Ptrdist/ft/ft

And several test-suite runtime failures on i686:
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4
SingleSource/Benchmarks/Misc-C++/Large/ray
SingleSource/Benchmarks/Misc-C++/stepanov_container
SingleSource/Benchmarks/Shootout-C++/lists
SingleSource/Benchmarks/Shootout-C++/lists1
SingleSource/Benchmarks/Shootout-C++/sieve

Please review (and help with the test-suite failures).

Thank you in advance,
Hal

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
A non-text attachment was scrubbed...
Name: llvm_lsro-20111219.diff
Type: text/x-patch
Size: 24543 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20111219/657329d9/attachment.bin>