[llvm-dev] A thought to improve IPRA

Sat Aug 6 01:05:31 PDT 2016

On Sat, Aug 6, 2016 at 2:00 AM, Matthias Braun <matze at braunis.de> wrote:

> The code in X86TargetLowering::IsEligibleForTailCallOptimization() has
> this part:
>
>   // The callee has to preserve all registers the caller needs to preserve.
>   const X86RegisterInfo *TRI = Subtarget.getRegisterInfo();
>   const uint32_t *CallerPreserved = TRI->getCallPreservedMask(MF, CallerCC);
>   if (!CCMatch) {
>     const uint32_t *CalleePreserved = TRI->getCallPreservedMask(MF, CalleeCC);
>     if (!TRI->regmaskSubsetEqual(CallerPreserved, CalleePreserved))
>       return false;
>   }
>
> Thanks MatzeB for pointing this out.

> which usually checks that this is fine. Maybe that code looks at the
> regmask of the calling convention rather than the new regmask computed by
> IPRA?
>

Yes. I was doing the optimization in RegUsageInfoPropagate, and the above
code is executed during the ISel phase, which is why the effect of the new
CC is not visible at ISel.
So I moved this optimization to CodeGenPrepare, which runs before ISel.
Now I am not getting the above bug. But the result of this optimization is
not good (I mean there is no positive change in the generated code). Also,
due to the condition F.doesNotAccessMemory() && !F.hasLocalLinkage(), very
few of the cold functions are really getting optimized.
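
A minimal stand-alone sketch of the subset check quoted above may make the
ordering problem easier to see. It models a register mask as an array of
32-bit words in which a set bit means "preserved across the call". RegMask,
setPreserved and maskSubsetEqual are illustrative names, not LLVM's API, and
the register numbers used with them are arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy register mask: bit R set means register number R is preserved
// across the call. This only mimics the idea; it is not LLVM code.
using RegMask = std::vector<uint32_t>;

// Hypothetical helper: mark register number Reg as preserved in Mask.
void setPreserved(RegMask &Mask, unsigned Reg) {
  assert(Reg / 32 < Mask.size() && "register number out of range");
  Mask[Reg / 32] |= (1u << (Reg % 32));
}

// True if every register preserved in Caller is also preserved in Callee,
// i.e. the caller's preserved set is a subset of (or equal to) the callee's.
bool maskSubsetEqual(const RegMask &Caller, const RegMask &Callee) {
  assert(Caller.size() == Callee.size() && "masks must have the same width");
  for (std::size_t I = 0; I < Caller.size(); ++I)
    if ((Caller[I] & Callee[I]) != Caller[I])
      return false;
  return true;
}
```

In this model a tail call is only acceptable when maskSubsetEqual(caller,
callee) holds. If the caller's mask is still the one of its old calling
convention when the check runs, a convention assigned later cannot influence
the result, which matches the behaviour described above.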

One more thing: how can I run the test-suite with PGO and then use the
generated profiles in the next run? Is there an easy way?
I just want to be sure that no other kind of case is left uncovered.

-Vivek

> - Matthias
>
> On Aug 4, 2016, at 9:22 PM, vivek pandya via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>
> Hello all,
>
> Adding MatzeB this may be interesting for him.
>
> I have tried out the following way to save most of the registers for cold
> functions, in RegisterUsagePropagation.cpp:
>
> if (PSI->isColdFunction(F) && F->doesNotAccessMemory() &&
>     !F->hasLocalLinkage()) {
>   dbgs() << "Cold Function : " << F->getName() << "\n";
>   F->setCallingConv(CallingConv::CXX_FAST_TLS);
> }
>
> Previously I was using CallingConv::PreserveMost, but it also saves RAX, and
> that caused a bug for functions which return the address of a global object
> or something similar. CXX_FAST_TLS is very similar to PreserveMost but it
> excludes RAX and RDI. It also excludes XMM*. There is also one more
> interesting bug (actually it is very similar to what we faced while
> optimizing functions for not saving registers).
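
The RAX problem described above can be illustrated with a toy model. This is
only a sketch of one way such a bug could arise, not a description of what
the X86 backend actually emits: the register file, the callee body and the
function callAndReadRAX are all made up for illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using RegFile = std::map<std::string, long>;
using RegSet = std::set<std::string>;

// Toy callee: saves every register in Preserved on entry, writes its return
// value to RAX, clobbers RCX as scratch, then restores the saved registers.
// Returns what the caller would read from RAX afterwards.
long callAndReadRAX(RegFile Regs, const RegSet &Preserved, long Result) {
  assert(Regs.count("RAX") == 1 && "the toy register file must contain RAX");
  RegFile Saved;
  for (const std::string &R : Preserved)
    Saved[R] = Regs[R];              // prologue: save
  Regs["RAX"] = Result;              // body: produce the return value
  Regs["RCX"] = 0;                   // body: clobber a scratch register
  for (const auto &KV : Saved)
    Regs[KV.first] = KV.second;      // epilogue: restore
  return Regs["RAX"];
}
```

With RAX in the preserved set, the epilogue overwrites the return value with
the caller's old RAX. With RAX excluded, the caller sees the value the body
produced.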
>
> In the SPASS application there is a very simple function:
>
> BOOL opts_IdIsNull(OPTID Id)
> {
>   return opts_IdEqual(opts_IdNull(), Id);
> }
>
> This is a cold function for a particular input, and that is why it is
> selected for the optimization. But with CXX_FAST_TLS it is generated as
> follows:
>
> _opts_IdIsNull:                         ## @opts_IdIsNull
> .cfi_startproc
> ## BB#0:                                ## %entry
> pushq %rbp
> Ltmp9:
> .cfi_def_cfa_offset 16
> Ltmp10:
> .cfi_offset %rbp, -16
> movq %rsp, %rbp
> Ltmp11:
> .cfi_def_cfa_register %rbp
> pushq %rsi
> pushq %rcx
> Ltmp12:
> .cfi_offset %rcx, -32
> Ltmp13:
> .cfi_offset %rsi, -24
> movl %edi, %eax
> movl $-1, %edi
> popq %rcx
> popq %rsi
> popq %rbp
> jmp _opts_IdEqual           ## TAILCALL
> .cfi_endproc
>
> Here the old register values are restored just before the tail call, so the
> generated result is wrong.
> I am also confused about why rcx and rsi are saved/restored. Is this normal?
> Any suggestions for handling such a case?
>
> Also, if we do not want to use a CC directly, can you help me find the
> proper place where we can iterate over the parameter list (also, how do we
> iterate over only the registers which are used for params?), keep those
> registers out of the regMask, and also not have save/restore for them? Some
> hints will speed up my work.
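
As a sketch of the last request, assuming the preserved registers and the
parameter registers are both available as plain sets of names: RegSet and
dropParamRegs are hypothetical, and where such a step would live inside LLVM
is exactly the open question above.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

using RegSet = std::set<std::string>;

// Returns the registers that remain preserved once the registers used for
// passing parameters are taken out of the preserved set.
RegSet dropParamRegs(const RegSet &Preserved, const RegSet &ParamRegs) {
  RegSet Out;
  // std::set iterates in sorted order, which set_difference requires.
  std::set_difference(Preserved.begin(), Preserved.end(),
                      ParamRegs.begin(), ParamRegs.end(),
                      std::inserter(Out, Out.end()));
  assert(Out.size() <= Preserved.size() && "difference cannot grow the set");
  return Out;
}
```

The resulting set would then drive both the register mask seen by callers and
the save/restore code in the callee, so that the two stay consistent.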
>
> Sincerely,
> Vivek
>
>
>
> On Fri, Jul 29, 2016 at 11:06 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>>
>> ------------------------------
>>
>> *From: *"vivek pandya" <vivekvpandya at gmail.com>
>> *To: *"Hal Finkel" <hfinkel at anl.gov>
>> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>, "Quentin Colombet" <
>> qcolombet at apple.com>, "Mehdi Amini" <mehdi.amini at apple.com>
>> *Sent: *Friday, July 29, 2016 5:02:44 AM
>> *Subject: *Re: A thought to improve IPRA
>>
>>
>>
>> On Fri, Jul 29, 2016 at 9:01 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>>
>>> ------------------------------
>>>
>>> > From: "vivek pandya" <vivekvpandya at gmail.com>
>>> > To: "Mehdi Amini" <mehdi.amini at apple.com>
>>> > Cc: "llvm-dev" <llvm-dev at lists.llvm.org>, "Hal Finkel" <
>>> hfinkel at anl.gov>, "Quentin Colombet" <qcolombet at apple.com>
>>> > Sent: Thursday, July 28, 2016 2:59:02 PM
>>> > Subject: Re: A thought to improve IPRA
>>> >
>>> >
>>> > I have been working on PGO-driven IPRA and I want to measure whether it
>>> > helps to reduce execution time. As mentioned earlier, the idea is
>>> > to make cold functions free of register usage, i.e. to save and restore
>>> > every register used by such a cold function so that the caller of that
>>> > function has more free registers. So here I am changing the standard
>>> > callee-saved register set to a set which is decided dynamically based
>>> > on the actual register usage.
>>> >
>>> > I am facing a few problems in getting this working:
>>> > 1) Generating CFI for such a function requires mapping a Dwarf
>>> > register to an LLVM register, and even if we force LLVM to use the Dwarf
>>> > register number for CFI, it will still be wrong for registers
>>> > for which we currently don't have such a mapping, for example the R8D
>>> > register on X86 (when dealing with actual register usage info we may
>>> > have cases where R8D is being used).
>>> > To fix this I tried to filter the functions to be optimized
>>> > by adding the constraint that they must have the NoUnwind attribute, but
>>> > that does not help. Is it possible to disable CFI generation?
>>>
>>> Disabling CFI generation does not seem like the right solution. If the
>>> R8D definition, and similar, need DWARF register numbers, then we should
>>> fix that (you can try rearranging things and using DwarfRegAlias, or at
>>> least for testing, add the same DwarfRegNum as for R8).
>>> Adding the same DwarfRegNum as for R8 does not work, because the mapping
>>> is currently generated as an array sorted on the first value (the key)
>>> of DwarfLLVMRegPair. So in the build directory, in the file
>>> X86GenRegisterInfo.inc, this adds 2 different entries for mapping LLVM
>>> Reg to Dwarf number, i.e. R8 -> 8 and R8D -> 8, but the array mapping
>>> Dwarf to LLVM Reg has only one entry, as it adds 8 only once.
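
The asymmetry can be reproduced with a small stand-alone model. It uses
std::map instead of the generated sorted arrays, and it assumes that the
first definition wins in the reverse table. Which register the generated
table actually keeps is not stated in this thread.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical (LLVM register name, DWARF number) definitions.
using RegPair = std::pair<std::string, int>;

// Forward table: one entry per LLVM register, so duplicates of the DWARF
// number are harmless here.
std::map<std::string, int> buildLLVMToDwarf(const std::vector<RegPair> &Defs) {
  std::map<std::string, int> M;
  for (const RegPair &P : Defs) {
    assert(P.second >= 0 && "DWARF numbers are non-negative in this model");
    M[P.first] = P.second;
  }
  return M;
}

// Reverse table keyed by DWARF number: a second register with the same
// number cannot get its own slot. emplace keeps the existing entry.
std::map<int, std::string> buildDwarfToLLVM(const std::vector<RegPair> &Defs) {
  std::map<int, std::string> M;
  for (const RegPair &P : Defs)
    M.emplace(P.second, P.first);
  return M;
}
```

Feeding it R8 -> 8 and R8D -> 8 gives two forward entries but a single
reverse entry, so a lookup by DWARF number 8 can never return R8D.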
>>>
>> >
>>> >
>>> > 2) R8D is a 48 bit register
>>>
>>> Why do you say that? For one thing, it is in a register class GR32 and
>>> holds only 32-bit values.
>>>
>> Sorry, this is my bad; R8D holds a 32-bit value. To get this working,
>> changing the CC for cold functions to "preserve_all" seems to be an easy
>> and safe way. Let me know your thoughts about this.
>>
>> Sounds like a reasonable thing to try.
>>
>>  -Hal
>>
>>
>> -Vivek
>>
>>>
>>>  -Hal
>>>
>>> > but pushing and popping such a register is
>>> > not allowed, and the current implementation for callee-saved registers
>>> > also uses either the 64-bit or the 32-bit version of the X86 instruction,
>>> > according to the target. So here I think it may be good to push/pop R8
>>> > for R8D (i.e. I don't want to change the current implementation which
>>> > inserts MIs for CSRs). For that I need to find the biggest register of
>>> > which the given register is an alias, e.g. R8 has R8D as an alias. How
>>> > can I find that?
>>> > I tried to use getMatchingSuperReg(unsigned Reg, unsigned SubIdx,
>>> > const TargetRegisterClass *RC) but here I don't know what the
>>> > SubIdx will be for a given Reg in a given RC.
>>> >
>>> >
>>> > So, for example, if a function selected for the above
>>> > optimization has the following set of clobbered registers:
>>> > R8D, R8, ECX, EAX, RAX, ESI, it should push/pop R8, RCX, RAX, RSI.
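
The mapping in this example can be sketched without any LLVM machinery,
assuming a small hand-written table from each register to its 64-bit
super-register. The table and the name regsToPushPop are hypothetical; a real
implementation would query the target's register info instead.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hand-written toy table: register name -> widest (64-bit) register that
// contains it. Only the registers from the example are listed.
const std::map<std::string, std::string> &superRegTable() {
  static const std::map<std::string, std::string> Table = {
      {"R8D", "R8"},  {"R8", "R8"},   {"ECX", "RCX"}, {"RCX", "RCX"},
      {"EAX", "RAX"}, {"RAX", "RAX"}, {"ESI", "RSI"}, {"RSI", "RSI"}};
  return Table;
}

// Maps every clobbered register to its widest super-register and drops
// duplicates, keeping the order of first appearance.
std::vector<std::string>
regsToPushPop(const std::vector<std::string> &Clobbered) {
  std::vector<std::string> Out;
  for (const std::string &R : Clobbered) {
    auto It = superRegTable().find(R);
    assert(It != superRegTable().end() && "register missing from toy table");
    if (std::find(Out.begin(), Out.end(), It->second) == Out.end())
      Out.push_back(It->second);
  }
  return Out;
}
```

Called with the clobber list R8D, R8, ECX, EAX, RAX, ESI it yields R8, RCX,
RAX, RSI, which is the push/pop set asked for above.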
>>> >
>>> >
>>> > Please help!
>>> > - Vivek
>>> >
>>> >
>>> >
>>> >
>>> > On Sat, Jul 9, 2016 at 12:26 AM, vivek pandya <
>>> > vivekvpandya at gmail.com > wrote:
>>> >
>>> > On Sat, Jul 9, 2016 at 12:18 AM, Mehdi Amini < mehdi.amini at apple.com
>>> > > wrote:
>>> >
>>> > On Jul 8, 2016, at 11:41 AM, vivek pandya < vivekvpandya at gmail.com >
>>> > wrote:
>>> >
>>> >
>>> >
>>> >
>>> > On Fri, Jul 8, 2016 at 11:46 PM, Mehdi Amini < mehdi.amini at apple.com
>>> > > wrote:
>>> >
>>> > On Jul 8, 2016, at 11:12 AM, vivek pandya < vivekvpandya at gmail.com >
>>> > wrote:
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > Hello LLVM Developers,
>>> >
>>> >
>>> > I have a thought to improve IPRA, and I would like to summarize the
>>> > discussion on IRC regarding it so that we can develop an idea out of
>>> > it if it really helps.
>>> >
>>> >
>>> > So the idea is to have more callee-saved registers in infrequently called
>>> > leaf procedures and to try to provide more registers to procedures which
>>> > are in the upper region of the call graph. As pointed out by Quentin,
>>> > this optimization may help in the context of "true" IPRA, but in our case
>>> > we may not need it. But I think that it can improve performance
>>> > in the current IPRA. I explain both arguments (Quentin's and mine) with
>>> > the following example.
>>> >
>>> >
>>> > Consider the call sequence A->B->C. Here C is a rarely called
>>> > leaf procedure, while A is called frequently and B may call C
>>> > based on some condition. While propagating actual register usage
>>> > information from C to A, we end up with most of the registers clobbered,
>>> > so in this case, as per Quentin's point, we do not hurt performance,
>>> > as we fall back to the CC. But I think we can improve performance as
>>> > follows:
>>> > Suppose we mark every register as preserved by C (i.e. have more spills
>>> > and reloads at procedure entry and exit) and this helps at A.
>>> > Suppose A requires more distinct registers than the CC can
>>> > provide, and if they are not provided it will spill variables to memory.
>>> > Now if we can provide more registers at A by having more spills at C,
>>> > then we can save spills at A, which can be beneficial because A is
>>> > frequently called but C is less frequently called, thus reducing
>>> > the total number of spills/restores during program execution.
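
The trade-off in this example is plain arithmetic: the dynamic number of
spills is the per-call spill count of each function weighted by how often it
is called. A minimal sketch, in which FuncProfile and both helper functions
are made up for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical profile summary for one function.
struct FuncProfile {
  uint64_t Calls;         // how often the function is executed
  uint64_t SpillsPerCall; // spill/reload pairs executed per call
};

// Spills executed by one function over the whole run.
uint64_t weightedSpills(const FuncProfile &F) {
  assert((F.SpillsPerCall == 0 || F.Calls <= UINT64_MAX / F.SpillsPerCall) &&
         "product would overflow");
  return F.Calls * F.SpillsPerCall;
}

// Total spills executed by the hot function A and the cold function C.
uint64_t dynamicSpills(const FuncProfile &A, const FuncProfile &C) {
  return weightedSpills(A) + weightedSpills(C);
}
```

With invented numbers, say A runs 1,000,000 times with 4 spills per call and
C runs 10 times with none, the total is 4,000,000. Moving the cost into C
(A: 0 spills, C: 12 spills per call) gives 120. Real gains depend on the
actual profile and on whether A's spills can in fact be removed.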
>>> >
>>> >
>>> > However, the effect of this optimization will again be limited by the
>>> > scope of the current IPRA (i.e. one Module only), because we can't really
>>> > propagate the details about additional callee-saved registers to a caller
>>> > which is defined in another module, but it may still be helpful.
>>> >
>>> >
>>> > Any thoughts on this ?
>>> >
>>> >
>>> >
>>> >
>>> > I think it is interesting, have you considered:
>>> >
>>> >
>>> > - the code size impact? (C will have a lot of spills)
>>> > Yes, this needs to be addressed with some heuristics based on the call
>>> > frequency of C and the number of clobbers it has. Also, can we say that a
>>> > function which does not have any kind of call instruction in its
>>> > body will have fewer clobbers?
>>> >
>>> >
>>> > I am not sure what you mean.
>>> > A function which may do lots of computation but is not required to
>>> > call any other function may not have too many simultaneously live
>>> > ranges, and thus it can be compiled with very few registers.
>>> >
>>> >
>>> >
>>> > - what if C is cold but all (most) of its call sites are located in
>>> > different modules?
>>> > Can we use Uses to get the number of call sites in the current module
>>> > and decide whether to optimize based on that? Again, some heuristics.
>>> >
>>> >
>>> > Of course, but what I’m mentioning is exactly what does not work with
>>> > that.
>>> >
>>> > - an alternative approach where we would break the CGSCC ordering to
>>> > codegen B and A before C, so we would be able to spill minimally
>>> > when performing the codegen for C?
>>> > Do you mean marking all registers as preserved for C during codegen for
>>> > B, so that when we come to C (top-down) we may avoid some spills if C can
>>> > use regs which are not really used by B?
>>> >
>>> >
>>> > Yes, but it may be harder to implement for not much gain after all.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > Also this can be applied to a function which is less frequently
>>> > called and which may not be a leaf function. It may help.
>>> >
>>> >
>>> > Sure, you can just refer to this as “PGO driven IPRA”.
>>> > Ok I will look into this.
>>> >
>>> >
>>> > Vivek
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > —
>>> > Mehdi
>>> >
>>> >
>>> >
>>> >
>>>
>>> --
>>> Hal Finkel
>>> Assistant Computational Scientist
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>>
>>
>>
>>
>>
>