[llvm-commits] [PATCH] Allow SelectionDAGBuilder to reorder loads past stores

Hal Finkel hfinkel at anl.gov
Tue Dec 20 09:22:20 PST 2011


On Mon, 2011-12-19 at 22:46 -0800, Evan Cheng wrote:
> On Dec 19, 2011, at 8:31 PM, Hal Finkel wrote:
> 
> > On Mon, 2011-12-19 at 15:02 -0800, Evan Cheng wrote:
> >> On Dec 19, 2011, at 12:19 PM, Hal Finkel wrote:
> >> 
> >>> On Mon, 2011-12-19 at 10:52 -0800, Evan Cheng wrote:
> >>>> Hi Hal,
> >>>> 
> >>>> IMHO, this is the wrong way to fix this problem. The fundamental issue here is SelectionDAG is being used by the pre-RA scheduler so the conservative DAG building can have a significant impact on the schedule and the performance of generated code. Changing SelectionDAGBuilder to relax this is one way to go about this. But as you can tell, the current scheduler is not quite ready for this and you can see some large performance regressions.
> >>>> 
> >>> 
> >>> Fair enough; however, I fear that whether the schedulers get the
> >>> load/store independence information from the DAG or from their own
> >>> alias analysis, the results will be about as good (or bad) as with
> >>> the current patch. Either way, it is the scheduling heuristics that
> >>> will need improving.
> >> 
> >> Right. One of the arguments for doing scheduling on MachineInstrs, later in the codegen pipeline, is that it can make better decisions. For example, the current pre-RA scheduler does a decent job of estimating register pressure. However, instructions are later moved (LICM, sinking) and deleted (coalescing, CSE), so by definition the scheduler is working with only half-accurate information.
> >> 
> >>> 
> >>> As far as I can tell, most applications exhibit a performance gain from
> >>> this patch. So *if* the regressions are caused by deficiencies in the
> >>> scheduling heuristics (as opposed to problems with DAGCombine or
> >>> Legalize), it might be worth eating the performance regressions for
> >>> now and using them as use cases for improving the scheduling heuristics
> >>> (once any incorrect code generation is fixed). In the meantime, most
> >>> users should be happier ;)
> >> 
> >> Unfortunately, we can't allow changes that cause massive regressions, such as an 88% regression in LAME encoding, even if a lot of other benchmarks benefit from the change. These regressions have to be studied, understood, and fixed before the change can be enabled.
> >> 
> >>> 
> >>>> We have long understood that the current scheduling strategy is wrong. The plan for 2012 is to add a MachineInstr pre-RA scheduler and leave SelectionDAG to the legalizer and DAG combine. The new scheduler will probably use aliasing information to reorder loads / stores.
> >>>> 
> >>> 
> >>> The key point here is making sure that the scheduler has enough
> >>> information to make full use of the alias analysis; this is not
> >>> currently the case. The easiest way of doing this would be to give the
> >>> scheduler access to the original IR instructions. Alternatively, the
> >>> alias analysis could be enhanced to deal with ptr+offset pairs, etc.
> >>> 
> >>>> The problem with changing SelectionDAGBuilder is that this can negatively impact optimizations done during DAG combine and legalization. So, as it is, the patch should not be enabled by default. At best, it can be committed but turned off. Do you have a good understanding of what caused the severe regressions you have identified?
> >>> 
> >>> I have not yet looked into all of the regressions in detail. From what I
> >>> have seen, once the ILP scheduler learns that the loads/stores are
> >>> independent, it tends to schedule many loads together in a big block,
> >>> then a block of computation, then a block of stores. In some cases this
> >>> is highly suboptimal compared to the original sequence.
> >> 
> >> Right and this should be fixed.
> > 
> > I'll investigate this further, but here is a simple example of what can
> > happen. Consider the following loop:
> > 
> > __attribute__((aligned(16))) float a[LEN],b[LEN],c[LEN];
> > 
> >                for (int i = 0; i < LEN; i++) {
> >                        a[i] = sin(b[i]) + cos(c[i]);
> >                }
> > 
> > where the loop is partially unrolled. Without the patch, this compiles
> > to:
> >        movss   c-36(,%rbx,4), %xmm0
> >        cvtss2sd        %xmm0, %xmm0
> >        callq   cos
> >        movsd   %xmm0, 16(%rsp)         # 8-byte Spill
> >        movss   b-36(,%rbx,4), %xmm0
> >        cvtss2sd        %xmm0, %xmm0
> >        callq   sin
> >        addsd   16(%rsp), %xmm0         # 8-byte Folded Reload
> >        cvtsd2ss        %xmm0, %xmm0
> >        movss   %xmm0, a-36(,%rbx,4)
> > ...
> > 
> > With the patch, the scheduler is allowed to schedule the loads for the
> > next iteration before the store from the previous iteration, and it
> > does. Unfortunately, it does it like this:
> >        movss   c-36(,%rbx,4), %xmm0
> >        cvtss2sd        %xmm0, %xmm0
> >        callq   cos
> >        movsd   %xmm0, 64(%rsp)         # 8-byte Spill
> >        movss   b-36(,%rbx,4), %xmm0
> >        cvtss2sd        %xmm0, %xmm0
> >        callq   sin
> >        addsd   64(%rsp), %xmm0         # 8-byte Folded Reload
> >        cvtsd2ss        %xmm0, %xmm0
> >        movss   %xmm0, 56(%rsp)         # 4-byte Spill
> >        movss   c-32(,%rbx,4), %xmm0
> >        cvtss2sd        %xmm0, %xmm0
> >        callq   cos
> >        movsd   %xmm0, 64(%rsp)         # 8-byte Spill
> >        movss   56(%rsp), %xmm0         # 4-byte Reload
> >        movss   %xmm0, a-36(,%rbx,4)
> > ...
> > 
> > So, in short, it introduces an unnecessary stack spill in order to delay
> > the store. Any ideas on why it would do this?
> 
> Not sure. Are nodes being scheduled across call nodes? Please add -debug-only=pre-RA-sched and compare the schedules.

I've looked at the debugging output, but I'm not sure what I'm looking
for. In the good version, the result is stored immediately, so there is
no issue with how registers are allocated across the calls. The "bad"
version would also work fine without the stack spill: there are plenty
of unused callee-saved registers that could hold that result instead of
spilling it to the stack. The question is why none of them is used and
the value is spilled instead.

Looking at the schedule reveals:
SU(31): 0x56a8db0: f64 = ADDSDrr 0x56e3740, 0x56e2a30 [ORD=21] [ID=31]
SU(30): 0x56a8bb0: f32 = CVTSD2SSrr 0x56a8db0 [ORD=22] [ID=30]
... (other stuff including a call) ...
SU(29): 0x56ada80: ch = MOVSSmr 0x56a97b0, 0x56a95b0, 0x56ad180, 0x56b30f0,
        0x56df500, 0x56a8bb0, 0x564dec0<Mem:ST4[%scevgep106](align=8)(tbaa=!"float")> [ORD=28] [ID=29]

but that does not explain to me why the spill is there. When
SelectionDAGBuilder sees the instruction stream, those calls to cos()
and sin() are not really calls, but are intrinsics that get mapped to
ISD::FCOS and ISD::FSIN, and so they don't trigger any flushing of the
pending-memory-operations queue.
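
To make the behavior I'm describing concrete, here is a rough sketch of the
(conservative) flushing logic; the member and function names are illustrative,
not the actual SelectionDAGBuilder ones. A load only accumulates on a pending
queue, a store merges all pending chains into a new root before being chained
itself, and a pure value node such as ISD::FSIN/ISD::FCOS never asks for the
store root, so it leaves the queue untouched.

  // Illustrative sketch only; assumes the usual CodeGen headers and uses
  // made-up names, not the real SelectionDAGBuilder members.
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/CodeGen/SelectionDAG.h"
  using namespace llvm;

  struct PendingMemState {
    SmallVector<SDValue, 8> PendingLoads; // chains of not-yet-serialized loads
    SDValue Root;                         // current root of the chain

    // Visiting a load just records its chain; nothing is serialized yet.
    void addLoad(SDValue LoadChain) { PendingLoads.push_back(LoadChain); }

    // Visiting a store conservatively flushes everything pending: the load
    // chains are merged (in parallel) with a TokenFactor into a new root,
    // and the store is then chained after that root.
    SDValue getRootForStore(SelectionDAG &DAG, DebugLoc dl) {
      if (!PendingLoads.empty()) {
        Root = DAG.getNode(ISD::TokenFactor, dl, MVT::Other,
                           &PendingLoads[0], PendingLoads.size());
        PendingLoads.clear();
      }
      return Root;
    }

    // ISD::FSIN/ISD::FCOS are plain value nodes with no chain operand, so
    // visiting them never calls getRootForStore() and the queue is left
    // alone.
  };

Since the sin() and cos() nodes never flush the queue, and the patch lets the
store skip the independent loads as well, the store can legally be delayed
past the next iteration's loads, which is exactly what the scheduler then
does.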

When it comes to register allocation, we get something like this:
592B        %vreg15<def> = ADDSDrr %vreg15, %vreg11<kill>; FR64:%vreg15,%vreg11
608B        %vreg16<def> = CVTSD2SSrr %vreg15<kill>; FR32:%vreg16 FR64:%vreg15
624B        ADJCALLSTACKDOWN64 0, %RSP<imp-def>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
640B        %vreg17<def> = MOVSSrm %noreg, 4, %vreg2, <ga:@c+-32>, %noreg; mem:LD4[%scevgep25] FR32:%vreg17 GR64_NOSP:%vreg2
656B        %vreg18<def> = CVTSS2SDrr %vreg17<kill>; FR64:%vreg18 FR32:%vreg17
672B        %XMM0<def> = COPY %vreg18<kill>; FR64:%vreg18
688B        CALL64pcrel32 <es:cos>, %XMM0<kill>, %RAX<imp-def,dead>, %RDX<imp-def,dead>, %RSI<imp-def,dead>, %RDI<imp-def,dead>, %XMM0<imp-def>, %EFLAGS<imp-def,dead>, %RSP<imp-use>, ...
704B        ADJCALLSTACKUP64 0, 0, %RSP<imp-def>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
720B        %vreg19<def> = COPY %XMM0<kill>; FR64:%vreg19
736B        ADJCALLSTACKDOWN64 0, %RSP<imp-def>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
752B        MOVSSmr %noreg, 4, %vreg2, <ga:@a+-36>, %noreg, %vreg16<kill>; mem:ST4[%scevgep106](align=8)(tbaa=!"float") GR64_NOSP:%vreg2 FR32:%vreg16

Then, during allocation, I see this message:
selectOrSplit FR32:%vreg16,3.202614e-01 = [608r,752r:0)  0 at 608r
RS_Assign Cascade 0
wait for second round
queuing new interval: %vreg16,3.202614e-01 = [608r,752r:0)  0 at 608r

and then:
selectOrSplit FR32:%vreg16,3.202614e-01 = [608r,752r:0)  0 at 608r
RS_Split Cascade 0
Analyze counted 2 instrs in 1 blocks, through 0 blocks.
Inline spilling FR32:%vreg16,3.202614e-01 = [608r,752r:0)  0 at 608r
From original %vreg16,3.202614e-01 = [608r,752r:0)  0 at 608r
Merged spilled regs: SS#19 = [608r,752r:0)  0 at invalid
spillAroundUses %vreg16
        rewrite: 608r   %vreg128<def> = CVTSD2SSrr %vreg15<kill>; FR32:%vreg128 FR64:%vreg15
        spilled: 616r   MOVSSmr <fi#19>, 1, %noreg, 0, %noreg, %vreg128<kill>; mem:ST4[FixedStack19] FR32:%vreg128
        interval: %vreg128,inf = [608r,616r:0)  0 at 608r
        reload:  744r   %vreg129<def> = MOVSSrm <fi#19>, 1, %noreg, 0, %noreg; mem:LD4[FixedStack19] FR32:%vreg129
        rewrite: 752r   MOVSSmr %noreg, 4, %vreg104, <ga:@a+-36>, %noreg, %vreg129<kill>; mem:ST4[%scevgep106](align=8)(tbaa=!"float") GR64_NOSP:%vreg104 FR32:%vreg129
        interval: %vreg129,inf = [744r,752r:0)  0 at 744r
queuing new interval: %vreg128,inf = [608r,616r:0)  0 at 608r
queuing new interval: %vreg129,inf = [744r,752r:0)  0 at 744r

When I later look at the register map, only XMM0 and XMM1 are ever
assigned to vregs; everything else is spilled. This is wrong. Do you
have any ideas about what could be going wrong, or other things I should
examine? Could the register allocator be failing to account correctly
for callee-saved registers when computing live-interval interference
information?
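
(In case anyone wants to look at the same dumps: I'm generating them with
roughly the following, flags from memory, where unrolled.ll is a hypothetical
reduced .ll file containing just the unrolled loop above.

  llc -O3 -debug-only=pre-RA-sched unrolled.ll -o unrolled.s 2> sched.log
  llc -O3 -debug-only=regalloc unrolled.ll -o unrolled.s 2> regalloc.log

The schedule excerpt comes from the first log and the selectOrSplit/spill
messages from the second.)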

Thanks again,
Hal

> 
> Evan
> 
> > 
> >> 
> >>> 
> >>> I would be fine with submitting the patch such that it is turned off by
> >>> default. Especially while there are test-suite failures, I think that it
> >>> would need to be this way.
> >> 
> >> That's a reasonable compromise. Someone will have to review the patch carefully first, though.
> >> 
> >> BTW, what's the compile time impact?
> > 
> > I looked at this quickly with my debug build, and it seemed to have a
> > negligible effect. (I've put the default cap on the reorder buffer at 8,
> > so any effect is bound to be small; it seemed just as small when I set
> > the buffer size to 64.) I'll do a test-suite run with an optimized build
> > to get some real timing numbers.
> > 
> > Thanks again,
> > Hal
> > 
> >> 
> >> Evan
> >> 
> >>> 
> >>> Thanks again,
> >>> Hal 
> >>> 
> >>>> 
> >>>> Evan
> >>>> 
> >>>> On Dec 19, 2011, at 9:42 AM, Hal Finkel wrote:
> >>>> 
> >>>>> The current SelectionDAGBuilder does not allow loads to be reordered
> >>>>> past stores, and does not allow stores to be reordered. This is a side
> >>>>> effect of the way the critical chain is constructed: there is a queue of
> >>>>> pending loads that is flushed (in parallel) to the root of the chain
> >>>>> upon encountering any store (and that store is also appended to the root
> >>>>> of the chain). Among other things, this makes loop unrolling far less
> >>>>> effective than it otherwise could be.
> >>>>> 
> >>>>> The attached patch allows SelectionDAGBuilder to use the available alias
> >>>>> analysis to reorder independent loads and stores. It changes the queue
> >>>>> of pending loads into a more general queue of pending memory operations,
> >>>>> and flushes, in parallel, all potentially-conflicting loads and stores
> >>>>> as necessary.
> >>>>> 
> >>>>> This can result in a significant performance boost. On my x86_64
> >>>>> machine, the average percentage decrease in execution time is ~8% (to
> >>>>> calculate my performance numbers from the test suite, I've included only
> >>>>> the 174 tests with a base execution time of at least 0.1s; the times of
> >>>>> the shorter tests seem noisy on my machine). Of these, 131 showed a
> >>>>> performance increase and 36 showed a performance decrease.
> >>>>> 
> >>>>> The top-5 winners were:
> >>>>> MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset - 92%
> >>>>> performance increase ( = runtime decrease)
> >>>>> MultiSource/Benchmarks/llubenchmark/llu - 47% performance increase
> >>>>> MultiSource/Applications/minisat/minisat - 47% performance increase
> >>>>> MultiSource/Benchmarks/sim/sim - 40% performance increase
> >>>>> MultiSource/Benchmarks/Prolangs-C++/life/life - 35.7% performance
> >>>>> increase
> >>>>> The top-5 losers were:
> >>>>> MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame - 88%
> >>>>> performance decrease
> >>>>> MultiSource/Benchmarks/VersaBench/beamformer/beamformer - 49%
> >>>>> performance decrease
> >>>>> MultiSource/Benchmarks/MallocBench/espresso/espresso - 47% performance
> >>>>> decrease
> >>>>> MultiSource/Benchmarks/MiBench/automotive-bitcount/automotive-bitcount -
> >>>>> 21% performance decrease
> >>>>> MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 20%
> >>>>> performance decrease
> >>>>> 
> >>>>> The patch adds a few new options:
> >>>>> max-parallel-chains - replaces the old MaxParallelChains constant
> >>>>> max-load-store-reorder - the maximum size of the reorder buffer
> >>>>> (previously it was unlimited, but contained only stores)
> >>>>> no-reordering-past-stores - restores the previous behavior
> >>>>> 
> >>>>> Some of the regression tests had to be updated because the order of some
> >>>>> stores changed. For most of these, I just updated the test to reflect
> >>>>> the new instruction sequence. The following tests I've marked as XFAIL
> >>>>> because they would require larger changes (and I'd like someone with
> >>>>> more experience than me to make sure that they really are okay and make
> >>>>> any necessary adjustments):
> >>>>> CodeGen/X86/2008-02-22-LocalRegAllocBug.ll
> >>>>> CodeGen/X86/2010-09-17-SideEffectsInChain.ll
> >>>>> CodeGen/X86/lea-recursion.ll
> >>>>> 
> >>>>> Also, there is one test-suite runtime failure on x86_64:
> >>>>> MultiSource/Benchmarks/Ptrdist/ft/ft
> >>>>> 
> >>>>> And several test-suite runtime failures on i686:
> >>>>> MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4
> >>>>> SingleSource/Benchmarks/Misc-C++/Large/ray
> >>>>> SingleSource/Benchmarks/Misc-C++/stepanov_container
> >>>>> SingleSource/Benchmarks/Shootout-C++/lists
> >>>>> SingleSource/Benchmarks/Shootout-C++/lists1
> >>>>> SingleSource/Benchmarks/Shootout-C++/sieve
> >>>>> 
> >>>>> Please review (and help with the test-suite failures).
> >>>>> 
> >>>>> Thank you in advance,
> >>>>> Hal
> >>>>> 
> >>>>> -- 
> >>>>> Hal Finkel
> >>>>> Postdoctoral Appointee
> >>>>> Leadership Computing Facility
> >>>>> Argonne National Laboratory
> >>>>> 
> >>>>> <llvm_lsro-20111219.diff>
> >>>> 
> >>> 
> >>> -- 
> >>> Hal Finkel
> >>> Postdoctoral Appointee
> >>> Leadership Computing Facility
> >>> Argonne National Laboratory
> >>> 
> >> 
> > 
> > -- 
> > Hal Finkel
> > Postdoctoral Appointee
> > Leadership Computing Facility
> > Argonne National Laboratory
> > 
> 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

