[llvm-dev] A thought to improve IPRA

Fri Jul 29 02:37:18 PDT 2016

On Fri, Jul 29, 2016 at 1:36 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:

>
> On Jul 28, 2016, at 12:59 PM, vivek pandya <vivekvpandya at gmail.com> wrote:
>
> I have been working on PGO driven IPRA and I want to measure if this help
> to reduce execution time. So as mentioned earlier the idea is to make cold
> function register usage free i.e saving and restoring all used register by
> such cold function so caller of that function will have more free
> registers. So here I am changing standard callee saved registers set to a
> set which will be decided dynamically based on the actual register usage.
>
>
> Is it different from something like “preserve_all”?
>
I don't think this is different from "preserve_all" because that is what we
really think to do.

>
>
> I am facing few problems to get this working:
> 1 ) While generating CFI for  such function it requires to map Dwarf
> register to LLVM register and even if we force LLVM to use Dwarf register
> number for CFI then also it will be wrong for some register for which
> currently we don't have such mapping for example R8D register on X86 (when
> dealing with actual register usage info we may have such case where R8D is
> being used)
> To fix this I tried to filter the functions which will be optimized by
> putting a constraints that it should have attribute NoUnwind but that does
> not help. Is it possible to disable CFI generation?
>
> 2) R8D is a 48 bit register but pushing and popping such register is not
> allowed and current implementation for CalleeSaved Register also uses
> either 64 bit or 32 bit version of X86 instruction according to target. So
> here I think it may be good to push/pop R8 for R8D (i.e I don't want to
> change current implementation which inserts MI for CSR) for that I need to
> find biggest register for which given register is alias like R8 has R8D as
> alias. How can I find that?
> I tried to use getMatchingSuperReg(unsigned Reg, unsigned SubIdx, const
> TargetRegisterClass *RC) but here I don't know what will be SubIdx for
> given Reg in given RC.
>
>
> If you create a function with a “preserve_all” CC and put some inline
> assembly that clobbers r8d, I expect we’re already generating the correct
> push (outside of IPRA), right?
> I’d start by figuring out how this work.
>
> Yes, correct push will be generated (IPRA is not changing any thing). For
this particular case if inline asm clobbers R15D ( just for simplicity
consider R15)  LLVM will push/pop R15 be cause default CC  as defined in
X86CallingConv.td CSR_64 will return R15 on 64 bit computer and as one of
R15's alias is clobbered LLVM will push/pop R15. But here LLVM does not
push/pop R15D.

So simplest way to achieve desired effect is to change cold function's CC
to "preserve_all" and due to that IPRA will propagate more free registers
to caller of such cold functions.

> —
> Mehdi
>
>
>
> So for example if a function which should be optimized for above
> optimization is having following set of clobbered registers:
> R8D,R8, ECX, EAX, RAX, ESI It should push/pop R8, RCX, RAX, RSI.
>
> Please help!
> - Vivek
>
>
> On Sat, Jul 9, 2016 at 12:26 AM, vivek pandya <vivekvpandya at gmail.com>
> wrote:
>
>>
>>
>> On Sat, Jul 9, 2016 at 12:18 AM, Mehdi Amini <mehdi.amini at apple.com>
>> wrote:
>>
>>>
>>> On Jul 8, 2016, at 11:41 AM, vivek pandya <vivekvpandya at gmail.com>
>>> wrote:
>>>
>>>
>>>
>>> On Fri, Jul 8, 2016 at 11:46 PM, Mehdi Amini <mehdi.amini at apple.com>
>>> wrote:
>>>
>>>>
>>>> On Jul 8, 2016, at 11:12 AM, vivek pandya <vivekvpandya at gmail.com>
>>>> wrote:
>>>>
>>>> Hello LLVM Developers,
>>>>
>>>> I have a thought to improve IPRA and I would like summaries discussion
>>>> on IRC regarding that so we can develop an idea out of that if it really
>>>> helps.
>>>>
>>>> So idea is to have more callee saved registers at infrequently called
>>>> leaf procedures and try provide more registers to procedures which are in
>>>> upper region of the call graph. But as pointed out by Quentin this
>>>> optimization may help in context of "true" IPRA but in our case we may not
>>>> require this. But I think that it can improve performance in current IPRA.
>>>> I explain both arguments ( Quentin's and mine) with following example.
>>>>
>>>> Consider following call sequence A->B->C , here C is very less time
>>>> called leaf procedure while A is called frequently and B may call C based
>>>> on some condition now while propagating actual register usage information
>>>> from C to A we almost clobbered most of the registers so in this case as
>>>> per Quentin's point we does not hurt the performance as we fall back to CC
>>>> but I think we can improve the performance as follows:
>>>> If we mark every register preserved by C (i.e having more spill reloads
>>>> at procedure entry and exit ) and if this can help  at A. Suppose A
>>>> requires more number of distinct registers than CC can provide and if not
>>>> provided it will spill variables to memory. Now if we can provide more
>>>> registers at A by having more spills at C then we can save spill at A which
>>>> can be beneficial because A is frequently called but C is less frequently
>>>> called and thus reducing total number of spill/restore in program execution.
>>>>
>>>> However again effect of this optimization will be limited by the scope
>>>> of current IPRA (i.e one Module only) because we can' really propagate the
>>>> details about more callee saved registers to caller which is defined in
>>>> other module, but still it may helpful.
>>>>
>>>> Any thoughts on this ?
>>>>
>>>>
>>>>
>>>> I think it is interesting, have you considered:
>>>>
>>>> - the code size impact? (C will have a lot of spills)
>>>>
>>> Yes, this needs to be address with some heuristics based on call
>>>  frequency to C and no of clobbers it has. Also can we say that a function
>>> which does not have any kind of call instruction in it's body will have
>>> less clobbers ?
>>>
>>>
>>> I am not sure what you mean.
>>>
>> A function which may do lots of computation but does not required to call
>> any other function may  not have too many simultaneous live ranges thus
>> with very few registers it can be compiled.
>>
>>>
>>> - what if C is cold but all (most) of its call sites are located in
>>>> different modules?
>>>>
>>> Can we user Uses to get no of call site in current module and based on
>>> that we decide to optimize? Again some heuristics .
>>>
>>>
>>> Of course, but what I’m mentioning is exactly what does not work with
>>> that.
>>>
>>
>>>
>>> - an alternative approach where we would break the CGSCC ordering to
>>>> codegen B and A before C, so we would be able to spill minimally when
>>>> performing the codegen for C?
>>>>
>>> Do you here mean marking all preserve for C while code gen for B and
>>> then when we come to C (top-down) we may avoid some spills if C can use
>>> regs which are not really used by B?
>>>
>>>
>>> Yes, but it may be harder to implement for not much gain after all.
>>>
>>> Also this can be applied to a function which is less frequently called
>>> and which may not be a leaf function. It may help.
>>>
>>>
>>> Sure, you can just refer to this as “PGO driven IPRA”.
>>>
>> Ok I will look into this.
>>
>> Vivek
>>
>>>
>>> —
>>> Mehdi
>>>
>>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160729/118b263f/attachment.html>