[llvm-commits] [PATCH] Allow SelectionDAGBuilder to reorder loads past stores

Evan Cheng evan.cheng at apple.com
Mon Dec 19 22:46:30 PST 2011


On Dec 19, 2011, at 8:31 PM, Hal Finkel wrote:

> On Mon, 2011-12-19 at 15:02 -0800, Evan Cheng wrote:
>> On Dec 19, 2011, at 12:19 PM, Hal Finkel wrote:
>> 
>>> On Mon, 2011-12-19 at 10:52 -0800, Evan Cheng wrote:
>>>> Hi Hal,
>>>> 
>>>> IMHO, this is the wrong way to fix this problem. The fundamental issue here is SelectionDAG is being used by the pre-RA scheduler so the conservative DAG building can have a significant impact on the schedule and the performance of generated code. Changing SelectionDAGBuilder to relax this is one way to go about this. But as you can tell, the current scheduler is not quite ready for this and you can see some large performance regressions.
>>>> 
>>> 
>>> Fair enough, however, I fear that whether they get the load/store
>>> independence information from the DAG or from their own aliasing
>>> analysis, the results will be equally good (or bad) as with the current
>>> patch. Either way, it is the scheduling heuristics that will need
>>> improving.
>> 
>> Right. One of the argument for doing scheduling on MachineInstr's and later in the codegen pipeline is so it can make better decisions. For example, currently the pre-RA scheduler does a decent job of estimating register pressure. However, instructions are moved (LICM, sink), deleted (coalescing, CSE) so by definition the scheduler is working with half accurate information.
>> 
>>> 
>>> As far as I can tell, most applications exhibit a performance gain from
>>> this patch. So *if* the regressions are caused by deficiencies in
>>> scheduling heuristics (as opposed to problems with DAGCombine or
>>> Legalize), it might be worthwhile eating the performance regressions for
>>> now, and using them as use cases to improve the scheduling heuristics
>>> (once any incorrect code generation is fixed). In the mean time, most
>>> users should be happier ;)
>> 
>> Unfortunately, we can't allow changes that cause massive regressions. For example, 88% regressions to LAME encoding. Even if that means a lot of other benchmarks are benefiting from the change. These regressions have to be studied, understood, and fixed before the change can be enabled.
>> 
>>> 
>>>> We have long understood the current scheduling strategy is wrong. The plan for 2012 is to add a MachineInstr pre-RA scheduler and leave SelectionDAG for legalizer and dag combine. The new scheduler will probably use aliasing information to reorder loads / stores.
>>>> 
>>> 
>>> The key point here is making sure that the scheduler has enough
>>> information to make full use of the aliasing analysis. This is not
>>> currently the case. The easiest way of doing this would be to make sure
>>> that the scheduler has access to the original IR instructions.
>>> Alternatively, the aliasing analysis could be enhanced to deal with ptr
>>> +offset pairs, etc.
>>> 
>>>> The problem with changing SelectionDAGBuilder is this can negatively impact optimizations done during dag combine and legalization. So as it is the patch should not be enabled by default. At best, it can be committed but turned off. Do you have a good understanding what caused the severe regressions you have identified?
>>> 
>>> I have not yet looked into all of the regressions in detail. From what I
>>> have seen, the ILP scheduler, once it is revealed that the loads/stores
>>> are independent, tends to schedule many loads together in a big block,
>>> then a block of computation, then a block of stores. This is (highly)
>>> suboptimal compared to the original sequence in some cases.
>> 
>> Right and this should be fixed.
> 
> I'll investigate this further, but here is a simple example of what can
> happen. Consider the following loop:
> 
> __attribute__((aligned(16))) float a[LEN],b[LEN],c[LEN];
> 
>                for (int i = 0; i < LEN; i++) {
>                        a[i] = sin(b[i]) + cos(c[i]);
>                }
> 
> where the loop is partially unrolled. Without the patch, this compiles
> to:
>        movss   c-36(,%rbx,4), %xmm0
>        cvtss2sd        %xmm0, %xmm0
>        callq   cos
>        movsd   %xmm0, 16(%rsp)         # 8-byte Spill
>        movss   b-36(,%rbx,4), %xmm0
>        cvtss2sd        %xmm0, %xmm0
>        callq   sin
>        addsd   16(%rsp), %xmm0         # 8-byte Folded Reload
>        cvtsd2ss        %xmm0, %xmm0
>        movss   %xmm0, a-36(,%rbx,4)
> ...
> 
> With the patch, the scheduler is allowed to schedule the loads for the
> next iteration before the store from the pervious iteration, and it
> does. Unfortunately, it does it like this:
>        movss   c-36(,%rbx,4), %xmm0
>        cvtss2sd        %xmm0, %xmm0
>        callq   cos
>        movsd   %xmm0, 64(%rsp)         # 8-byte Spill
>        movss   b-36(,%rbx,4), %xmm0
>        cvtss2sd        %xmm0, %xmm0
>        callq   sin
>        addsd   64(%rsp), %xmm0         # 8-byte Folded Reload
>        cvtsd2ss        %xmm0, %xmm0
>        movss   %xmm0, 56(%rsp)         # 4-byte Spill
>        movss   c-32(,%rbx,4), %xmm0
>        cvtss2sd        %xmm0, %xmm0
>        callq   cos
>        movsd   %xmm0, 64(%rsp)         # 8-byte Spill
>        movss   56(%rsp), %xmm0         # 4-byte Reload
>        movss   %xmm0, a-36(,%rbx,4)
> ...
> 
> So, in short, it introduces an unnecessary stack spill in order to delay
> the store. Any ideas on why it would do this?

Not sure. Are nodes being scheduled across call nodes? Please add -debug-only=pre-RA-sched and compare the schedules.

Evan

> 
>> 
>>> 
>>> I would be fine with submitting the patch such that it is turned off by
>>> default. Especially while there are test-suite failures, I think that it
>>> would need to be this way.
>> 
>> That's a reasonable compromise. Someone will have review the patch carefully first though.
>> 
>> BTW, what's the compile time impact?
> 
> I looked at this quickly with my debug build, and it seemed to have a
> negligible effect (I've put the default cap on the reorder buffer at 8,
> so any effect is bound to be small, it seemed just as small when I set
> the buffer size to 64). I'll do a test suite run with an optimized build
> to get some real timing numbers.
> 
> Thanks again,
> Hal
> 
>> 
>> Evan
>> 
>>> 
>>> Thanks again,
>>> Hal 
>>> 
>>>> 
>>>> Evan
>>>> 
>>>> On Dec 19, 2011, at 9:42 AM, Hal Finkel wrote:
>>>> 
>>>>> The current SelectionDAGBuilder does not allow loads to be reordered
>>>>> past stores, and does not allow stores to be reordered. This is a side
>>>>> effect of the way the critical chain is constructed: there is a queue of
>>>>> pending loads that is flushed (in parallel) to the root of the chain
>>>>> upon encountering any store (and that store is also appended to the root
>>>>> of the chain). Among other things, loop unrolling is far less effective
>>>>> than it otherwise could be.
>>>>> 
>>>>> The attached patch allows SelectionDAGBuilder to use the available alias
>>>>> analysis to reorder independent loads and stores. It changes the queue
>>>>> of pending loads into a more general queue of pending memory operations,
>>>>> and flushes, in parallel, all potentially-conflicting loads and stores
>>>>> as necessary.
>>>>> 
>>>>> This can result in a significant performance boost. On my x86_64
>>>>> machine, the average percentage decrease in execution time is ~8% (to
>>>>> calculate my performance numbers from the test suite, I've included only
>>>>> the 174 tests with a base execution time of at least 0.1s; the times of
>>>>> the shorter tests seem noisy on my machine). Of these, 131 showed a
>>>>> performance increase and 36 showed a performance decrease.
>>>>> 
>>>>> The top-5 winners were:
>>>>> MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset - 92%
>>>>> performance increase ( = runtime decrease)
>>>>> MultiSource/Benchmarks/llubenchmark/llu - 47% performance increase
>>>>> MultiSource/Applications/minisat/minisat - 47% performance increase
>>>>> MultiSource/Benchmarks/sim/sim - 40% performance increase
>>>>> MultiSource/Benchmarks/Prolangs-C++/life/life - 35.7% performance
>>>>> increase
>>>>> The top-5 losers were:
>>>>> MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame - 88%
>>>>> performance decrease
>>>>> MultiSource/Benchmarks/VersaBench/beamformer/beamformer - 49%
>>>>> performance decrease
>>>>> MultiSource/Benchmarks/MallocBench/espresso/espresso 47% performance
>>>>> decrease
>>>>> MultiSource/Benchmarks/MiBench/automotive-bitcount/automotive-bitcount -
>>>>> 21% performance decrease
>>>>> MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 20%
>>>>> performance decrease
>>>>> 
>>>>> The patch adds a few new options:
>>>>> max-parallel-chains - replaces the old MaxParallelChains constant)
>>>>> max-load-store-reorder - the maximum size of the reorder buffer -
>>>>> previously it was unlimited, but contained only stores
>>>>> no-reordering-past-stores - invokes the previous behavior
>>>>> 
>>>>> Some of the regression tests had to be updated because the order of some
>>>>> stores changed. For most of these, I just updated the test to reflect
>>>>> the new instruction sequence. The following tests I've marked as XFAIL
>>>>> because they would require larger changes (and I'd like someone with
>>>>> more experience than me to make sure that they really are okay and make
>>>>> any necessary adjustments):
>>>>> CodeGen/X86/2008-02-22-LocalRegAllocBug.ll
>>>>> CodeGen/X86/2010-09-17-SideEffectsInChain.ll
>>>>> CodeGen/X86/lea-recursion.ll
>>>>> 
>>>>> Also, there is one test-suite runtime failure on x86_64:
>>>>> MultiSource/Benchmarks/Ptrdist/ft/ft
>>>>> 
>>>>> And several test-suite runtime failures on i686:
>>>>> MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4
>>>>> SingleSource/Benchmarks/Misc-C++/Large/ray
>>>>> SingleSource/Benchmarks/Misc-C++/stepanov_container
>>>>> SingleSource/Benchmarks/Shootout-C++/lists
>>>>> SingleSource/Benchmarks/Shootout-C++/lists1
>>>>> SingleSource/Benchmarks/Shootout-C++/sieve
>>>>> 
>>>>> Please review (and help with the test-suite failures).
>>>>> 
>>>>> Thank you in advance,
>>>>> Hal
>>>>> 
>>>>> -- 
>>>>> Hal Finkel
>>>>> Postdoctoral Appointee
>>>>> Leadership Computing Facility
>>>>> Argonne National Laboratory
>>>>> 
>>>>> <llvm_lsro-20111219.diff>_______________________________________________
>>>>> llvm-commits mailing list
>>>>> llvm-commits at cs.uiuc.edu
>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>> 
>>> 
>>> -- 
>>> Hal Finkel
>>> Postdoctoral Appointee
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>> 
>> 
> 
> -- 
> Hal Finkel
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
> 




More information about the llvm-commits mailing list