[llvm-dev] A thought to improve IPRA

Fri Jul 29 10:36:18 PDT 2016

----- Original Message -----

> From: "vivek pandya" <vivekvpandya at gmail.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>, "Quentin Colombet"
> <qcolombet at apple.com>, "Mehdi Amini" <mehdi.amini at apple.com>
> Sent: Friday, July 29, 2016 5:02:44 AM
> Subject: Re: A thought to improve IPRA

> On Fri, Jul 29, 2016 at 9:01 AM, Hal Finkel < hfinkel at anl.gov >
> wrote:

> > ----- Original Message -----
> 

> > > From: "vivek pandya" < vivekvpandya at gmail.com >
> 
> > > To: "Mehdi Amini" < mehdi.amini at apple.com >
> 
> > > Cc: "llvm-dev" < llvm-dev at lists.llvm.org >, "Hal Finkel" <
> > > hfinkel at anl.gov >, "Quentin Colombet" < qcolombet at apple.com >
> 
> > > Sent: Thursday, July 28, 2016 2:59:02 PM
> 
> > > Subject: Re: A thought to improve IPRA
> 
> > >
> 
> > >
> 
> > > I have been working on PGO driven IPRA and I want to measure if
> > > this
> 
> > > help to reduce execution time. So as mentioned earlier the idea
> > > is
> 
> > > to make cold function register usage free i.e saving and
> > > restoring
> 
> > > all used register by such cold function so caller of that
> > > function
> 
> > > will have more free registers. So here I am changing standard
> > > callee
> 
> > > saved registers set to a set which will be decided dynamically
> > > based
> 
> > > on the actual register usage.
> 
> > >
> 
> > > I am facing few problems to get this working:
> 
> > > 1 ) While generating CFI for such function it requires to map
> > > Dwarf
> 
> > > register to LLVM register and even if we force LLVM to use Dwarf
> 
> > > register number for CFI then also it will be wrong for some
> > > register
> 
> > > for which currently we don't have such mapping for example R8D
> 
> > > register on X86 (when dealing with actual register usage info we
> > > may
> 
> > > have such case where R8D is being used)
> 
> > > To fix this I tried to filter the functions which will be
> > > optimized
> 
> > > by putting a constraints that it should have attribute NoUnwind
> > > but
> 
> > > that does not help. Is it possible to disable CFI generation?
> 

> > Disabling CFI generation does not seem like the right solution. If
> > the R8D definition, and similar, need DWARF register numbers, then
> > we should fix that (you can try rearranging things and using
> > DwarfRegAlias, or at least for testing, add the same DwarfRegNum as
> > for R8).
> 
> > Adding DwarfRegNum as for R8 does not work because the mapping is
> > currently generated as a sorted array on the first value of the key
> > for DwarfLLVMRegPair. So in the build directory in file
> > X86GenRegisterInfo.inc this will add 2 different entries for
> > mapping
> > LLVM Reg to Dwarf number i.e R8 -> 8 and R8D -> 8 but in the array
> > of mapping Dwarf to LLVM Reg there is only entry as it will add 8
> > only once.
> 
> > >
> 
> > >
> 
> > > 2) R8D is a 48 bit register
> 

> > Why do you say that? For one thing, it is in a register class GR32
> > and holds only 32-bit values.
> 

> Sorry this is my bad R8D is 32 bit value. To get this working
> changing CC for cold functions to "preserve_all" seems to be easy
> and safe way. Let me know your thought about this.
Sounds like a reasonable thing to try. 

-Hal 

> -Vivek
> > -Hal
> 

> > > but pushing and popping such register is
> 
> > > not allowed and current implementation for CalleeSaved Register
> > > also
> 
> > > uses either 64 bit or 32 bit version of X86 instruction according
> > > to
> 
> > > target. So here I think it may be good to push/pop R8 for R8D
> > > (i.e
> > > I
> 
> > > don't want to change current implementation which inserts MI for
> 
> > > CSR) for that I need to find biggest register for which given
> 
> > > register is alias like R8 has R8D as alias. How can I find that?
> 
> > > I tried to use getMatchingSuperReg(unsigned Reg, unsigned SubIdx,
> 
> > > const TargetRegisterClass *RC) but here I don't know what will be
> 
> > > SubIdx for given Reg in given RC.
> 
> > >
> 
> > >
> 
> > > So for example if a function which should be optimized for above
> 
> > > optimization is having following set of clobbered registers:
> 
> > > R8D,R8, ECX, EAX, RAX, ESI It should push/pop R8, RCX, RAX, RSI.
> 
> > >
> 
> > >
> 
> > > Please help!
> 
> > > - Vivek
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > > On Sat, Jul 9, 2016 at 12:26 AM, vivek pandya <
> 
> > > vivekvpandya at gmail.com > wrote:
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > > On Sat, Jul 9, 2016 at 12:18 AM, Mehdi Amini <
> > > mehdi.amini at apple.com
> 
> > > > wrote:
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > > On Jul 8, 2016, at 11:41 AM, vivek pandya <
> > > vivekvpandya at gmail.com
> > > >
> 
> > > wrote:
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > > On Fri, Jul 8, 2016 at 11:46 PM, Mehdi Amini <
> > > mehdi.amini at apple.com
> 
> > > > wrote:
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > > On Jul 8, 2016, at 11:12 AM, vivek pandya <
> > > vivekvpandya at gmail.com
> > > >
> 
> > > wrote:
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > > Hello LLVM Developers,
> 
> > >
> 
> > >
> 
> > > I have a thought to improve IPRA and I would like summaries
> 
> > > discussion on IRC regarding that so we can develop an idea out of
> 
> > > that if it really helps.
> 
> > >
> 
> > >
> 
> > > So idea is to have more callee saved registers at infrequently
> > > called
> 
> > > leaf procedures and try provide more registers to procedures
> > > which
> 
> > > are in upper region of the call graph. But as pointed out by
> > > Quentin
> 
> > > this optimization may help in context of "true" IPRA but in our
> > > case
> 
> > > we may not require this. But I think that it can improve
> > > performance
> 
> > > in current IPRA. I explain both arguments ( Quentin's and mine)
> > > with
> 
> > > following example.
> 
> > >
> 
> > >
> 
> > > Consider following call sequence A->B->C , here C is very less
> > > time
> 
> > > called leaf procedure while A is called frequently and B may call
> > > C
> 
> > > based on some condition now while propagating actual register
> > > usage
> 
> > > information from C to A we almost clobbered most of the registers
> > > so
> 
> > > in this case as per Quentin's point we does not hurt the
> > > performance
> 
> > > as we fall back to CC but I think we can improve the performance
> > > as
> 
> > > follows:
> 
> > > If we mark every register preserved by C (i.e having more spill
> 
> > > reloads at procedure entry and exit ) and if this can help at A.
> 
> > > Suppose A requires more number of distinct registers than CC can
> 
> > > provide and if not provided it will spill variables to memory.
> > > Now
> 
> > > if we can provide more registers at A by having more spills at C
> 
> > > then we can save spill at A which can be beneficial because A is
> 
> > > frequently called but C is less frequently called and thus
> > > reducing
> 
> > > total number of spill/restore in program execution.
> 
> > >
> 
> > >
> 
> > > However again effect of this optimization will be limited by the
> 
> > > scope of current IPRA (i.e one Module only) because we can'
> > > really
> 
> > > propagate the details about more callee saved registers to caller
> 
> > > which is defined in other module, but still it may helpful.
> 
> > >
> 
> > >
> 
> > > Any thoughts on this ?
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > > I think it is interesting, have you considered:
> 
> > >
> 
> > >
> 
> > > - the code size impact? (C will have a lot of spills)
> 
> > > Yes, this needs to be address with some heuristics based on call
> 
> > > frequency to C and no of clobbers it has. Also can we say that a
> 
> > > function which does not have any kind of call instruction in it's
> 
> > > body will have less clobbers ?
> 
> > >
> 
> > >
> 
> > > I am not sure what you mean.
> 
> > > A function which may do lots of computation but does not required
> > > to
> 
> > > call any other function may not have too many simultaneous live
> 
> > > ranges thus with very few registers it can be compiled.
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > > - what if C is cold but all (most) of its call sites are located
> > > in
> 
> > > different modules?
> 
> > > Can we user Uses to get no of call site in current module and
> > > based
> 
> > > on that we decide to optimize? Again some heuristics .
> 
> > >
> 
> > >
> 
> > > Of course, but what I’m mentioning is exactly what does not work
> > > with
> 
> > > that.
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > > - an alternative approach where we would break the CGSCC ordering
> > > to
> 
> > > codegen B and A before C, so we would be able to spill minimally
> 
> > > when performing the codegen for C?
> 
> > > Do you here mean marking all preserve for C while code gen for B
> > > and
> 
> > > then when we come to C (top-down) we may avoid some spills if C
> > > can
> 
> > > use regs which are not really used by B?
> 
> > >
> 
> > >
> 
> > > Yes, but it may be harder to implement for not much gain after
> > > all.
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > > Also this can be applied to a function which is less frequently
> 
> > > called and which may not be a leaf function. It may help.
> 
> > >
> 
> > >
> 
> > > Sure, you can just refer to this as “PGO driven IPRA”.
> 
> > > Ok I will look into this.
> 
> > >
> 
> > >
> 
> > > Vivek
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 
> > > —
> 
> > > Mehdi
> 
> > >
> 
> > >
> 
> > >
> 
> > >
> 

> > --
> 
> > Hal Finkel
> 
> > Assistant Computational Scientist
> 
> > Leadership Computing Facility
> 
> > Argonne National Laboratory
> 

-- 

Hal Finkel 
Assistant Computational Scientist 
Leadership Computing Facility 
Argonne National Laboratory 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160729/3f0ac05a/attachment.html>