[llvm-dev] RFC: Shrink wrapping vs SplitCSR

Thu May 4 11:45:24 PDT 2017

Francis Visoiu Mistrih <fvisoiumistrih at apple.com> writes:

I'm resending this as I forgot to CC llvm-dev on my reply last night....

Hi Francis,

Thanks for the detailed reply!
I took a quick look at the patch, and I'll try to take a closer look at
it tomorrow.

Could you please subscribe me to future patches? Either myself or
someone from my team can help with testing and/or integration on PPC if
you'd like. This is something we've been struggling with for a while and
are anxious to make some progress on it :)

Kit

> Hi Kit,
>
>> On May 2, 2017, at 7:54 PM, Kit Barton via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> 
>> 
>> Hi all, 
>> 
>> We've seen several examples recently of performance opportunities on
>> POWER if we can improve the location of save/restore code for
>> callee-saved registers. Both Nemanja and myself have discussed this with
>> several people, and it seems that there are two possibilities for
>> improving this:
>> 
>>  1. Extend shrink wrapping to make the analysis of callee-saved
>>     registers more precise. 
>>  2. Focus on enabling and (possibly) improving SplitCSR. 
>> 
>> I would like opinions from people on the preferred way to proceed. 
>> 
>> I am leaning toward improving shrink wrapping, at least as a short-term
>> solution. However, I fully admit that this is because I am familiar with
>> the shrink wrapping code and completely naive about SplitCSR and what
>> work would be necessary to get this working well.
>> 
>> My proposal would be to implement the flow sensitive analysis described
>> by Fred Chow (PLDI '88) and make the necessary extensions in shrink
>> wrapping to handle multiple save/restore points. At that point we can do
>> an evaluation to understand the improvements it provides and the impact
>> on compile time. Once we have these results, we can look at the best way
>> to enable it (e.g., option, target opt-in, higher opts, etc.). 
>
> Back in 2009, there was an implementation of Fred Chow’s algorithm, that has been removed in r193749, because it was unused and untested.
>
> I have been working on a new implementation of Fred Chow’s algorithm for a while now.
>
> It seems that this algorithm has been avoided because of the compile time impact that was not worth compared to the performance improvement.
>
> The main reason is that there are probably loops in the CFG. In my
> implementation, we decided that we never want to save / restore inside a loop,
> so we consider loops as a single block. We are using scc_iterators instead of
> MachineLoopInfo in order to handle irreducible loops as well, that are not
> handled by the current SW implementation. That way, we can compute the
> anticipation / availability attributes in linear time.
>
> In terms of correctness, there is one test on X86 that still fails, and two others on AArch64, because of the way compact unwinding encodes register saves.
>
> In terms of compile-time, there are no regressions on CTMark.
>
> In terms of code-size, we get a 0.8% increase on X86_64 and AArch64, mostly because we cannot use push / pop anymore.
>
> For now, we only worked on shrink-wrapping CSRs, but keep the stack setup in the entry block / return blocks, which can give worse results in some cases compared to the current shrink-wrapping pass. I am currently working on fixing this.
>
> For execution-time, not many improvements showed up due to the stack setup and the transformation of push/pop -> mov $reg, (mem)/mov (mem), $reg which can be partially solved.
>
> In terms of what the algorithm can do, and how it can outperform the current one, I got some stats based on where we save / restore, along with the block frequency (with PGO), and we can see a theoretical 8% improvement.
>
> I put an early review here a while ago: https://reviews.llvm.org/D30808, and I
> will update it soon. As you can see, all the PrologEpilogInserter and
> (X86|AArch64)TargetFrameLowering code look horribly hacky, because so many
> things assume the stack setup and callee saved saves will stick together. Fixing
> all those assumptions is going to be the most tricky part of shrink-wrapping,
> not the algorithm itself.
>
> There are some cases where the current shrink-wrapping is better, and that’s in
> cases like in the Fig.3 of Chow’s paper, and that can be probably solved by
> using something similar to what’s described in this paper: Post Register
> Allocation Spill Code Optimization by Christopher Lupo and Kent D. Wilken, which
> uses the PST to optimize saves / restores placement along with a cost model.
>
> Cheers,