[llvm-dev] A thought to improve IPRA

Mon Aug 15 21:31:15 PDT 2016

On Tue, Aug 16, 2016 at 9:39 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:

>
> On Aug 15, 2016, at 9:01 PM, vivek pandya <vivekvpandya at gmail.com> wrote:
>
> Hello Mentors,
>
> I did analyze assembly files generated for IPRA + PGO.  (1) I observed
> that I did not considered the scope of the optimization so changing callee
> saved register set for non local function is bad because IPRA can not pass
> this information to other modules.
>
>
> With LTO there shouldn’t be a significant uses from “other modules”.
> Also applying it to local function should not suffer from this.
> A simple heuristic is to look at the callees, and make sure the direct
> callees that are visible are accounting for a large part of the number of
> actually calls to this function.
>
Do you mean to use a simple analyses instead of ProfileSummaryInfo?

>
> (2) applying this change to indirect function also has no effect because
> for such case IPRA is currently not able to propagate regmask.
>
>
> There is no such thing as “indirect function”. There are only “indirect
> calls”.
>
Yes, my mistake.

>
> (3) 'main' function is marked as cold function and this optimization will
> make main's prolog/epilog long thus a cold start is observed
>
>
> The above heuristic addresses this.
>
> (1) and (2) point leads to the constraints which we followed for the
> optimization 'not saving CSR’.
>
>
> I don’t understand this.
>
This is because we can not apply this opt to functions with indirect calls,
also without LTO there is not benefit by applying this to non local
function.

>
> Also with LTO there is no significant improvement ( i.e I tried to address
> (1) with help of LTO)
>
>
> How is LTO not addressing 1?
>
This is just based on some experiments but yes more careful observation is
required.

This also leads to one more question: for the optimization 'not having CSR'
we are not accounting if the caller of such function is hot or not. Because
if due to that optimization we are forcing a hot function to have caller
saved register than it may lead to runtime regression.

-Vivek

>
>
—
> Mehdi
>
>
>
> Conclusion:
> As of now I am not able to find out a good heuristics so that restricting
> no of function for this optimization and overall getting performance
> improvement at acceptable code size change.
> I am currently pausing on this.
>
> Sincerely,
> Vivek
>
> On Sat, Aug 6, 2016 at 1:35 PM, vivek pandya <vivekvpandya at gmail.com>
> wrote:
>
>>
>>
>> On Sat, Aug 6, 2016 at 2:00 AM, Matthias Braun <matze at braunis.de> wrote:
>>
>>> The code in X86TargetLowering::IsEligibleForTailCallOptimization() has
>>> this part:
>>>
>>>   // The callee has to preserve all registers the caller needs to
>>> preserve.
>>>   const X86RegisterInfo *TRI = Subtarget.getRegisterInfo();
>>>   const uint32_t *CallerPreserved = TRI->getCallPreservedMask(MF,
>>> CallerCC);
>>>   if (!CCMatch) {
>>>     const uint32_t *CalleePreserved = TRI->getCallPreservedMask(MF,
>>> CalleeCC);
>>>     if (!TRI->regmaskSubsetEqual(CallerPreserved, CalleePreserved))
>>>       return false;
>>>   }
>>>
>>> Thanks MatzeB for pointing this out.
>>
>>> which usually checks that this is fine. Maybe that code looks at the
>>> regmask of the calling convention rather than the new regmask computed by
>>> IPRA?
>>>
>>
>> Yes I was doing the optimization at RegUsageInfoPropagate and and above
>> code is getting executed at ISel phase that is why effect of new CC is not
>> visible at ISel phase.
>> So I moved this optimization at CodeGenPrepare which happens before ISel.
>> Now I am not getting the above bug. But the result of this optimization is
>> not good ( I mean no positive change in generated code) also due to the
>> condition F.doesNotAccessMemory() && !F.hasLocalLinkage() very few of cold
>> functions are really getting optimized.
>>
>> One more thing how to test test-suite with PGO and also use generated
>> profiles in next run.  I mean any easy way?
>> I just want to be sure if any more kind of case is not covered.
>>
>> -Vivek
>>
>>> - Matthias
>>>
>>> On Aug 4, 2016, at 9:22 PM, vivek pandya via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>
>>> Hello all,
>>>
>>> Adding MatzeB this may be interesting for him.
>>>
>>> I have tried out following way to save most of the registers for cold
>>> functions, in RegisterUsagePropagation.cpp
>>>
>>> if (PSI->isColdFunction(F) && F->doesNotAccessMemory() &&
>>> !F->hasLocalLinkage()) {
>>>       dbgs() << "Cold Function : " << F->getName() << "\n";
>>>       F->setCallingConv(CallingConv::CXX_FAST_TLS);
>>>   }
>>>
>>> previously I was using CallingConv::PreserveMost but it also saves RAX
>>> and that generated bug for functions which returns address to global
>>> objects or some thing similar. CXX_FAST_TLS is very similar to PreserveMost
>>> but it excludes RAX and
>>> RDI. It also excludes XMM*.  There is also one more interesting bug (
>>> actually it is very similar to what we faced while optimizing function for
>>> not saving registers)
>>>
>>> See in SPASS application there is a very simple function :
>>>
>>> BOOL opts_IdIsNull(OPTID Id)
>>> {
>>>   return opts_IdEqual(opts_IdNull(), Id);
>>> }
>>>
>>> This is cold function for a particular input and that is why it is
>>> getting selected for the optimization. But with CXX_FAST_TLS it is getting
>>> generated as follows :
>>>
>>> _opts_IdIsNull:                         ## @opts_IdIsNull
>>> .cfi_startproc
>>> ## BB#0:                                ## %entry
>>> pushq %rbp
>>> Ltmp9:
>>> .cfi_def_cfa_offset 16
>>> Ltmp10:
>>> .cfi_offset %rbp, -16
>>> movq %rsp, %rbp
>>> Ltmp11:
>>> .cfi_def_cfa_register %rbp
>>> pushq %rsi
>>> pushq %rcx
>>> Ltmp12:
>>> .cfi_offset %rcx, -32
>>> Ltmp13:
>>> .cfi_offset %rsi, -24
>>> movl %edi, %eax
>>> movl $-1, %edi
>>> popq %rcx
>>> popq %rsi
>>> popq %rbp
>>> jmp _opts_IdEqual           ## TAILCALL
>>> .cfi_endproc
>>>
>>> Here before tailcall due to odl values are returned back result is
>>> generated wrong.
>>> I am also confused why rcx and rsi is saved/restored ? Is this normal ?
>>> Any suggestion to handle such case?
>>>
>>> Also if we do not want to use a CC directly then can you help me find
>>> proper place such that we can iterate over  parameter list (also how to
>>> iterate over only register which are used for params only?)  and do not put
>>> this in regMask and also not have save/restore  for them. Some hints will
>>> speed up my work.
>>>
>>> Sincerely,
>>> Vivek
>>>
>>>
>>>
>>> On Fri, Jul 29, 2016 at 11:06 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>
>>>>
>>>> ------------------------------
>>>>
>>>> *From: *"vivek pandya" <vivekvpandya at gmail.com>
>>>> *To: *"Hal Finkel" <hfinkel at anl.gov>
>>>> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>, "Quentin Colombet" <
>>>> qcolombet at apple.com>, "Mehdi Amini" <mehdi.amini at apple.com>
>>>> *Sent: *Friday, July 29, 2016 5:02:44 AM
>>>> *Subject: *Re: A thought to improve IPRA
>>>>
>>>>
>>>>
>>>> On Fri, Jul 29, 2016 at 9:01 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>>
>>>>> ------------------------------
>>>>>
>>>>> > From: "vivek pandya" <vivekvpandya at gmail.com>
>>>>> > To: "Mehdi Amini" <mehdi.amini at apple.com>
>>>>> > Cc: "llvm-dev" <llvm-dev at lists.llvm.org>, "Hal Finkel" <
>>>>> hfinkel at anl.gov>, "Quentin Colombet" <qcolombet at apple.com>
>>>>> > Sent: Thursday, July 28, 2016 2:59:02 PM
>>>>> > Subject: Re: A thought to improve IPRA
>>>>> >
>>>>> >
>>>>> > I have been working on PGO driven IPRA and I want to measure if this
>>>>> > help to reduce execution time. So as mentioned earlier the idea is
>>>>> > to make cold function register usage free i.e saving and restoring
>>>>> > all used register by such cold function so caller of that function
>>>>> > will have more free registers. So here I am changing standard callee
>>>>> > saved registers set to a set which will be decided dynamically based
>>>>> > on the actual register usage.
>>>>> >
>>>>> > I am facing few problems to get this working:
>>>>> > 1 ) While generating CFI for such function it requires to map Dwarf
>>>>> > register to LLVM register and even if we force LLVM to use Dwarf
>>>>> > register number for CFI then also it will be wrong for some register
>>>>> > for which currently we don't have such mapping for example R8D
>>>>> > register on X86 (when dealing with actual register usage info we may
>>>>> > have such case where R8D is being used)
>>>>> > To fix this I tried to filter the functions which will be optimized
>>>>> > by putting a constraints that it should have attribute NoUnwind but
>>>>> > that does not help. Is it possible to disable CFI generation?
>>>>>
>>>>> Disabling CFI generation does not seem like the right solution. If the
>>>>> R8D definition, and similar, need DWARF register numbers, then we should
>>>>> fix that (you can try rearranging things and using DwarfRegAlias, or at
>>>>> least for testing, add the same DwarfRegNum as for R8).
>>>>> Adding DwarfRegNum as for R8 does not work because the mapping is
>>>>> currently generated as a sorted array on the first value of the key
>>>>> for DwarfLLVMRegPair. So in the build directory in file
>>>>> X86GenRegisterInfo.inc this will add 2 different entries for mapping LLVM
>>>>> Reg to Dwarf number i.e R8 -> 8 and R8D -> 8 but in the array of mapping
>>>>> Dwarf to LLVM Reg there is only entry as it will add 8 only once.
>>>>>
>>>> >
>>>>> >
>>>>> > 2) R8D is a 48 bit register
>>>>>
>>>>> Why do you say that? For one thing, it is in a register class GR32 and
>>>>> holds only 32-bit values.
>>>>>
>>>> Sorry this is my bad R8D is 32 bit value. To get this working changing
>>>> CC for cold functions to "preserve_all" seems to be easy and safe way. Let
>>>> me know your thought about this.
>>>>
>>>> Sounds like a reasonable thing to try.
>>>>
>>>>  -Hal
>>>>
>>>>
>>>> -Vivek
>>>>
>>>>>
>>>>>  -Hal
>>>>>
>>>>> > but pushing and popping such register is
>>>>> > not allowed and current implementation for CalleeSaved Register also
>>>>> > uses either 64 bit or 32 bit version of X86 instruction according to
>>>>> > target. So here I think it may be good to push/pop R8 for R8D (i.e I
>>>>> > don't want to change current implementation which inserts MI for
>>>>> > CSR) for that I need to find biggest register for which given
>>>>> > register is alias like R8 has R8D as alias. How can I find that?
>>>>> > I tried to use getMatchingSuperReg(unsigned Reg, unsigned SubIdx,
>>>>> > const TargetRegisterClass *RC) but here I don't know what will be
>>>>> > SubIdx for given Reg in given RC.
>>>>> >
>>>>> >
>>>>> > So for example if a function which should be optimized for above
>>>>> > optimization is having following set of clobbered registers:
>>>>> > R8D,R8, ECX, EAX, RAX, ESI It should push/pop R8, RCX, RAX, RSI.
>>>>> >
>>>>> >
>>>>> > Please help!
>>>>> > - Vivek
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Sat, Jul 9, 2016 at 12:26 AM, vivek pandya <
>>>>> > vivekvpandya at gmail.com > wrote:
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Sat, Jul 9, 2016 at 12:18 AM, Mehdi Amini < mehdi.amini at apple.com
>>>>> > > wrote:
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Jul 8, 2016, at 11:41 AM, vivek pandya < vivekvpandya at gmail.com >
>>>>> > wrote:
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Fri, Jul 8, 2016 at 11:46 PM, Mehdi Amini < mehdi.amini at apple.com
>>>>> > > wrote:
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Jul 8, 2016, at 11:12 AM, vivek pandya < vivekvpandya at gmail.com >
>>>>> > wrote:
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > Hello LLVM Developers,
>>>>> >
>>>>> >
>>>>> > I have a thought to improve IPRA and I would like summaries
>>>>> > discussion on IRC regarding that so we can develop an idea out of
>>>>> > that if it really helps.
>>>>> >
>>>>> >
>>>>> > So idea is to have more callee saved registers at infrequently called
>>>>> > leaf procedures and try provide more registers to procedures which
>>>>> > are in upper region of the call graph. But as pointed out by Quentin
>>>>> > this optimization may help in context of "true" IPRA but in our case
>>>>> > we may not require this. But I think that it can improve performance
>>>>> > in current IPRA. I explain both arguments ( Quentin's and mine) with
>>>>> > following example.
>>>>> >
>>>>> >
>>>>> > Consider following call sequence A->B->C , here C is very less time
>>>>> > called leaf procedure while A is called frequently and B may call C
>>>>> > based on some condition now while propagating actual register usage
>>>>> > information from C to A we almost clobbered most of the registers so
>>>>> > in this case as per Quentin's point we does not hurt the performance
>>>>> > as we fall back to CC but I think we can improve the performance as
>>>>> > follows:
>>>>> > If we mark every register preserved by C (i.e having more spill
>>>>> > reloads at procedure entry and exit ) and if this can help at A.
>>>>> > Suppose A requires more number of distinct registers than CC can
>>>>> > provide and if not provided it will spill variables to memory. Now
>>>>> > if we can provide more registers at A by having more spills at C
>>>>> > then we can save spill at A which can be beneficial because A is
>>>>> > frequently called but C is less frequently called and thus reducing
>>>>> > total number of spill/restore in program execution.
>>>>> >
>>>>> >
>>>>> > However again effect of this optimization will be limited by the
>>>>> > scope of current IPRA (i.e one Module only) because we can' really
>>>>> > propagate the details about more callee saved registers to caller
>>>>> > which is defined in other module, but still it may helpful.
>>>>> >
>>>>> >
>>>>> > Any thoughts on this ?
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > I think it is interesting, have you considered:
>>>>> >
>>>>> >
>>>>> > - the code size impact? (C will have a lot of spills)
>>>>> > Yes, this needs to be address with some heuristics based on call
>>>>> > frequency to C and no of clobbers it has. Also can we say that a
>>>>> > function which does not have any kind of call instruction in it's
>>>>> > body will have less clobbers ?
>>>>> >
>>>>> >
>>>>> > I am not sure what you mean.
>>>>> > A function which may do lots of computation but does not required to
>>>>> > call any other function may not have too many simultaneous live
>>>>> > ranges thus with very few registers it can be compiled.
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > - what if C is cold but all (most) of its call sites are located in
>>>>> > different modules?
>>>>> > Can we user Uses to get no of call site in current module and based
>>>>> > on that we decide to optimize? Again some heuristics .
>>>>> >
>>>>> >
>>>>> > Of course, but what I’m mentioning is exactly what does not work with
>>>>> > that.
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > - an alternative approach where we would break the CGSCC ordering to
>>>>> > codegen B and A before C, so we would be able to spill minimally
>>>>> > when performing the codegen for C?
>>>>> > Do you here mean marking all preserve for C while code gen for B and
>>>>> > then when we come to C (top-down) we may avoid some spills if C can
>>>>> > use regs which are not really used by B?
>>>>> >
>>>>> >
>>>>> > Yes, but it may be harder to implement for not much gain after all.
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > Also this can be applied to a function which is less frequently
>>>>> > called and which may not be a leaf function. It may help.
>>>>> >
>>>>> >
>>>>> > Sure, you can just refer to this as “PGO driven IPRA”.
>>>>> > Ok I will look into this.
>>>>> >
>>>>> >
>>>>> > Vivek
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > —
>>>>> > Mehdi
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>>
>>>>> --
>>>>> Hal Finkel
>>>>> Assistant Computational Scientist
>>>>> Leadership Computing Facility
>>>>> Argonne National Laboratory
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Hal Finkel
>>>> Assistant Computational Scientist
>>>> Leadership Computing Facility
>>>> Argonne National Laboratory
>>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160816/26eb8519/attachment.html>