[llvm-dev] [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback

Wed May 11 10:47:54 PDT 2016

----- Original Message -----

> From: "Matthias Braun" <matze at braunis.de>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "vivek pandya" <vivekvpandya at gmail.com>, "llvm-dev"
> <llvm-dev at lists.llvm.org>
> Sent: Wednesday, May 11, 2016 12:46:25 PM
> Subject: Re: [llvm-dev] [GSoC 2016] Interprocedural Register
> Allocation - Introduction and Feedback

> > On May 11, 2016, at 3:31 AM, Hal Finkel via llvm-dev <
> > llvm-dev at lists.llvm.org > wrote:
> 

> > ----- Original Message -----
> 

> > > From: "vivek pandya" < vivekvpandya at gmail.com >
> > 
> 
> > > To: "Mehdi Amini" < mehdi.amini at apple.com >
> > 
> 
> > > Cc: "Hal Finkel" < hfinkel at anl.gov >, "Quentin Colombet" <
> > > qcolombet at apple.com >, "llvm-dev" < llvm-dev at lists.llvm.org >,
> > > "Matthias Braun" < matze at braunis.de >
> > 
> 
> > > Sent: Wednesday, May 11, 2016 3:15:03 AM
> > 
> 
> > > Subject: Re: [GSoC 2016] Interprocedural Register Allocation -
> > > Introduction and Feedback
> > 
> 

> > > Vivek Pandya
> > 
> 

> > > On Wed, May 11, 2016 at 10:02 AM, vivek pandya <
> > > vivekvpandya at gmail.com > wrote:
> > 
> 

> > > > Vivek Pandya
> > > 
> > 
> 

> > > > On Wed, May 11, 2016 at 9:43 AM, Mehdi Amini <
> > > > mehdi.amini at apple.com
> > > > > wrote:
> > > 
> > 
> 

> > > > > > On May 10, 2016, at 6:06 PM, Hal Finkel < hfinkel at anl.gov >
> > > > > > wrote:
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > From: "vivek pandya" < vivekvpandya at gmail.com >
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > To: "llvm-dev" < llvm-dev at lists.llvm.org >, "Mehdi Amini" <
> > > > > > > mehdi.amini at apple.com >, "Hal Finkel" < hfinkel at anl.gov >
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > Cc: "Quentin Colombet" < qcolombet at apple.com >
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > Sent: Tuesday, May 10, 2016 2:59:16 PM
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > Subject: [GSoC 2016] Interprocedural Register Allocation
> > > > > > > -
> > > > > > > Introduction and Feedback
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > Hello LLVM Community,
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > Sorry for the delay; I was busy with final exams.
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > I am Vivek from India. Thanks for choosing my proposal
> > > > > > > for
> > > > > > > Interprocedural Register Allocation (IPRA) in LLVM. Mehdi
> > > > > > > Amini
> > > > > > > and
> > > > > > > Hal Finkel will be mentoring me for this project.
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > IPRA can reduce code size and runtime of programs by
> > > > > > > allocating registers across module and procedure boundaries.
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > I have identified some old but effective research work on
> > > > > > > this
> > > > > > > area.
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > I would like the community's feedback on the feasibility of
> > > > > > > these approaches; I am aiming to implement two of them during
> > > > > > > this project.
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > Here is the list of papers. I have read the first two and would
> > > > > > > like to discuss those approaches first; I will read the other
> > > > > > > two and then start a discussion about them as well. My goal is
> > > > > > > to settle on a concrete implementation plan before 23 May 2016,
> > > > > > > and for that I need the community's help.
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > 1) Compile time ----- Minimizing register usage penalty
> > > > > > > at
> > > > > > > procedure
> > > > > > > calls - http://dl.acm.org/citation.cfm?id=53999
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > ====================================================================
> > > > > > > In this approach, intra-procedural register allocation is used
> > > > > > > as the base, but machine code is generated in a bottom-up
> > > > > > > traversal of the call graph. The inter-procedural effect is
> > > > > > > achieved by propagating register usage information from callee
> > > > > > > to caller (i.e. child to parent in the CallGraph), so that the
> > > > > > > caller can use different registers than the callee and avoid
> > > > > > > load/store costs at procedure calls. This is not as trivial as
> > > > > > > it seems, because of recursive calls, library function usage,
> > > > > > > etc. Also, in the upper region of the call graph the number of
> > > > > > > available registers might drop to zero, in which case the
> > > > > > > technique should fall back to the normal load/store at
> > > > > > > procedure calls. Other difficulties have been identified as
> > > > > > > well; please follow this mail chain:
> > > > > > > https://groups.google.com/d/topic/llvm-dev/HOYAXv3m1LY/discussion
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > My mentor has already provided me with a patch that changes the
> > > > > > > code generation order to a bottom-up call graph traversal, and
> > > > > > > I am working from that point now. Any other help or suggestion
> > > > > > > is always welcome.
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
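A minimal sketch of the bottom-up propagation described in approach 1), written as standalone C++ rather than against LLVM's data structures. Everything named here (`RegSet`, `Func`, `computeClobbers`, the 32-register bitmask, the use of -1 for an unknown callee) is a hypothetical model for illustration. Recursion and external callees are handled in the most conservative way, by assuming every register is clobbered; an SCC-based traversal, as discussed later in the thread, could be more precise.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: one bit per physical register, 32 registers in total.
using RegSet = std::uint32_t;
constexpr RegSet AllRegs = ~RegSet{0};

struct Func {
  RegSet ownUses;            // registers written by the function body itself
  std::vector<int> callees;  // indices into the function table; -1 = unknown
};

enum class State { Unvisited, InProgress, Done };

// Returns the set of registers that a call to function `f` may clobber.
// Callees are visited first (post-order), so information flows from the
// leaves of the call graph towards the roots.
RegSet clobbers(int f, const std::vector<Func> &funcs,
                std::vector<State> &state, std::vector<RegSet> &result) {
  if (f < 0)
    return AllRegs;  // external or library function: nothing is known
  assert(f < static_cast<int>(funcs.size()));
  if (state[f] == State::Done)
    return result[f];
  if (state[f] == State::InProgress)
    return AllRegs;  // recursive cycle: fall back conservatively
  state[f] = State::InProgress;
  RegSet set = funcs[f].ownUses;
  for (int callee : funcs[f].callees)
    set |= clobbers(callee, funcs, state, result);
  state[f] = State::Done;
  result[f] = set;
  return set;
}

// Computes the clobber set of every function in the table.
std::vector<RegSet> computeClobbers(const std::vector<Func> &funcs) {
  std::vector<State> state(funcs.size(), State::Unvisited);
  std::vector<RegSet> result(funcs.size(), 0);
  for (int f = 0; f < static_cast<int>(funcs.size()); ++f)
    clobbers(f, funcs, state, result);
  return result;
}
```

A caller can then keep a value live across a call in any register that is absent from the callee's clobber set, without a save and restore around the call.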

> > > > > > > 2) Link time ----- Global register allocation at link
> > > > > > > time
> > > > > > > -
> > > > > > > http://dl.acm.org/citation.cfm?id=989415
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > ====================================================================
> > > > > > > In this approach (a sort of true IPRA), registers are
> > > > > > > reallocated at link time. The optimization is optional; if it
> > > > > > > is turned off, the code is still compiled with intra-procedural
> > > > > > > allocation. Modules are first compiled normally, but the object
> > > > > > > code is annotated with details that let the linker build the
> > > > > > > call graph and calculate usage information at link time. The
> > > > > > > compiler also writes hints into the object code describing how
> > > > > > > the code should be changed if a particular variable is
> > > > > > > allocated to a different register by the new allocation. The
> > > > > > > linker can then use this information to decide which (global)
> > > > > > > variables should stay in the same register throughout program
> > > > > > > execution. Using the register usage information in the call
> > > > > > > graph, it can also determine which procedures will never be
> > > > > > > active simultaneously, so that the locals of those procedures
> > > > > > > can share registers without loads and stores at procedure
> > > > > > > calls.
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > For this particular method, please help me analyze feasibility:
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > 1) Can LLVM collect the following information at module level
> > > > > > > in MachineIR: a list of procedures in the module, a list of
> > > > > > > locals in each procedure, a list of procedures that a
> > > > > > > particular procedure can call, and a list of the variables
> > > > > > > that procedure references? Each entry in the last two lists
> > > > > > > includes an estimate of the number of times the procedure is
> > > > > > > called or the variable is referenced in each execution of
> > > > > > > this procedure.
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > 2) Can LLVM write informative commands to object files?
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > 3) Is LTO capable of leveraging those commands?
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > In terms of scoping the project for the summer, I
> > > > > > definitely
> > > > > > recommend that you focus on (1) first. If you finish that,
> > > > > > we
> > > > > > can
> > > > > > certainly move on to other things.
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > I'll add +1 here, but I already wrote the same thing on IRC
> > > > > when
> > > > > discussing with Vivek. True IPRA without a proper
> > > > > MachineModule
> > > > > infrastructure won't be doable in my opinion (even with such
> > > > > infrastructure, it may not be trivial in LLVM in general).
> > > > 
> > > 
> > 
> 

> > > > > > Regarding link time, note that any such design would likely
> > > > > > look much different from the one in David Wall's paper,
> > > > > > because our LTO re-codegens everything anyway. The paper says,
> > > > > > "Finally, it keeps us honest as designers of the system; once
> > > > > > we postpone anything until link time, the temptation is great
> > > > > > to postpone everything, ..." - Well, we've long since
> > > > > > succumbed to that temptation when we LTO. C'est la vie.
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > +1 as well, our LTO will benefit naturally from the
> > > > > leaf-to-root
> > > > > information propagation. ThinLTO will be more
> > > > > challenging/interesting though!
> > > > 
> > > 
> > 
> 
> > > > > > > For the first part, a mechanism similar to MachineModulePass
> > > > > > > would be desirable. That may not be possible during this
> > > > > > > project, but perhaps we can make some sort of smaller version
> > > > > > > of it to suit our purpose.
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > I don't think we need to make any kind of MachineModulePass
> > > > > > to
> > > > > > make
> > > > > > this work. Once we alter the visitation order based on the
> > > > > > CGSCC
> > > > > > iteration scheme, we can keep state in-between functions in
> > > > > > the
> > > > > > pre-existing hacky way (using static members of the
> > > > > > relevant
> > > > > > function passes).
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > Sorry, my mistake: by "first part" I meant requirement 1) of the
> > > > link-time approach.
> > > 
> > 
> 
> > > > > I also don't see where/why we need a MachineModule(Pass) for
> > > > > the CGSCC scheme. That said, I'd rather avoid using a function
> > > > > pass with static members if we can instead have a ModuleAnalysis
> > > > > that keeps the results for the functions in the module and is
> > > > > queried by the register allocator somehow.
> > > > 
> > > 
> > 
> 
> > > > > Matthias/Quentin may have other inputs on this aspect.
> > > > 
> > > 
> > 
> 
> > > @Hal, do you mean adding a simple MachineFunction pass that
> > > operates on a register-allocated function and prepares a BitVector
> > > indicating which registers the MachineFunction uses, and then
> > > using this pass as an analysis pass (i.e. simply returning the
> > > static BitVector of clobbered registers when register allocation
> > > for the next function begins)? This part is not very clear to me.
> > > It could be done by scheduling a pass after register allocation in
> > > lib/CodeGen/Passes.cpp:
> > 
> 

> > > void TargetPassConfig::addMachinePasses() {
> > >   ...
> > >   // Run pre-ra passes.
> > >   addPreRegAlloc();
> > >
> > >   // Run register allocation and passes that are tightly coupled
> > >   // with it, including phi elimination and scheduling.
> > >   if (getOptimizeRegAlloc())
> > >     addOptimizedRegAlloc(createRegAllocPass(true));
> > >   else
> > >     addFastRegAlloc(createRegAllocPass(false));
> > >
> > >   // Run post-ra passes.
> > >   addPostRegAlloc();
> > >   // Adding a new pass here which keeps register mask information
> > >   // across function calls.
> > >   ...
> > > }
> > 
> 

> > > But this also requires the current register allocators to use this
> > > information in some way, because RegMaskBits in
> > > LiveIntervalAnalysis.cpp is not static across calls. In other
> > > words, it is not clear to me how to propagate the static info to
> > > the intra-procedural register allocators (if possible, without
> > > disturbing their code).
> > 
> 
> > First, my hope is that we won't need to change the register
> > allocators, as such, in order to make use of this information.
> > Instead, we'll simply be able to alter the register masks generated
> > for the call instructions. These masks will indicate fewer clobbers
> > than might otherwise be present based on the ABI because of
> > information gathered during the codegen of the callee. These masks
> > are generally constructed by the target based on the calling
> > convention.
> > The PowerPC backend, for example, looks like this:
> 

> > // Add a register mask operand representing the call-preserved
> > // registers.
> > const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo();
> > const uint32_t *Mask =
> >     TRI->getCallPreservedMask(DAG.getMachineFunction(), CallConv);
> > assert(Mask && "Missing call preserved mask for calling convention");
> > Ops.push_back(DAG.getRegisterMask(Mask));
> 

> > but it can be more complicated. If you look for uses of
> > 'getRegisterMask' in Target/*/*ISelLowering.cpp, you'll see what I
> > mean. Regardless, the code ends up calling some method in the
> > target's TargetRegisterInfo subclass. These methods generally look
> > something like this:
> 

> > const uint32_t *
> > PPCRegisterInfo::getCallPreservedMask(const MachineFunction &MF,
> >                                       CallingConv::ID CC) const {
> >   const PPCSubtarget &Subtarget = MF.getSubtarget<PPCSubtarget>();
> >   ...
> >   return TM.isPPC64() ? (Subtarget.hasAltivec() ? CSR_SVR464_Altivec_RegMask
> >                                                 : CSR_SVR464_RegMask)
> >                       : (Subtarget.hasAltivec() ? CSR_SVR432_Altivec_RegMask
> >                                                 : CSR_SVR432_RegMask);
> > }
> 

> > In any case, the fundamental idea here is that, when someone calls
> > getCallPreservedMask in order to set the regmask on a call, we
> > might
> > not have to use the CC at all. Instead, if we've already codegened
> > the function, we might use a cache of 'exact' register masks
> > computed during codegen of the potential callees.
> 

> > In order to do this, I think we'll need to provide a function
> > callable from the target's getCallPreservedMask implementation,
> > which can return such an 'exact' regmask when available. I think we
> > need to do it this way for two reasons:
> 

> > 1. Not all of the target code calls getCallPreservedMask, but
> > sometimes calls other similar target-specific functions (e.g.
> > getTLSCallPreservedMask).
> 
> > 2. The targets need to opt-in to this behavior because only the
> > target can know that all register uses are really tagged correctly
> > post "pre-emit".
> 
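The cache-plus-fallback idea above could be sketched roughly as follows. This is standalone C++, not LLVM code: `ExactRegMaskCache`, `callPreservedMask`, the string key and the single 32-bit mask are all hypothetical simplifications (real LLVM register masks are arrays of `uint32_t` in which a set bit means "preserved across the call"). OR-ing the exact mask with the calling-convention mask is an assumption of this sketch, based on the callee restoring any callee-saved register it touches; the thread itself leaves open whether the CC mask is needed at all.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical model: bit i set means register i is preserved across the call.
using RegMask = std::uint32_t;

// Cache of 'exact' preserved-register masks, filled in after each callee
// has been code-generated.
class ExactRegMaskCache {
  std::map<std::string, RegMask> Masks;

public:
  void record(const std::string &Callee, RegMask Preserved) {
    assert(!Callee.empty() && "callee must be named");
    Masks[Callee] = Preserved;
  }

  // Returns nullptr when the callee has not been code-generated yet.
  const RegMask *lookup(const std::string &Callee) const {
    auto It = Masks.find(Callee);
    return It == Masks.end() ? nullptr : &It->second;
  }
};

// Helper a target's mask query could call. Targets that have not opted in,
// and callees with no cached mask, get the plain calling-convention mask.
RegMask callPreservedMask(const ExactRegMaskCache &Cache,
                          const std::string &Callee, RegMask CCMask,
                          bool TargetOptsIn) {
  if (TargetOptsIn)
    if (const RegMask *Exact = Cache.lookup(Callee))
      return *Exact | CCMask;  // assumption: CC-preserved registers stay preserved
  return CCMask;
}
```

Keeping the opt-in flag and the fallback in one helper means every target-specific mask query can route through the same function and still behave exactly as before when no exact mask is known.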

> > Because the target is free to introduce uses of registers at
> > essentially any time, we need to do the scanning for used registers
> > after the "pre-emit" passes run. This can be done by scheduling
> > some
> > simple register-use scanning pass after the call to addPreEmitPass
> > in lib/CodeGen/Passes.cpp.
> 
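A rough model of such a register-use scan, again as standalone C++ rather than an LLVM pass. `Instr`, `scanClobbered` and `preservedMask` are hypothetical names, and registers are modelled as bits of a 32-bit word. The sketch counts both explicitly defined registers and registers clobbered through the register mask of a nested call, since a scan that looked only at explicit operands would miss the latter.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using RegSet = std::uint32_t;  // hypothetical: bit i stands for register i

// Simplified instruction: the registers it explicitly defines and, for
// calls only, the mask of registers preserved across the call.
struct Instr {
  RegSet explicitDefs;
  bool isCall;
  RegSet callPreserved;  // meaningful only when isCall is true
};

// Scans a function body and returns every register it may clobber.
RegSet scanClobbered(const std::vector<Instr> &body) {
  RegSet clobbered = 0;
  for (const Instr &I : body) {
    assert(I.isCall || I.callPreserved == 0);
    clobbered |= I.explicitDefs;
    if (I.isCall)
      clobbered |= ~I.callPreserved;  // everything the call does not preserve
  }
  return clobbered;
}

// The 'exact' mask for calls to this function: all registers it leaves alone.
RegSet preservedMask(const std::vector<Instr> &body) {
  return ~scanClobbered(body);
}
```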

> MachineRegisterInfo maintains linked lists with defs/uses for each
> register so you can determine whether a specific register is used or
> not without scanning.
Does this include regmask-clobbered registers? 

-Hal 

> - Matthias

-- 

Hal Finkel 
Assistant Computational Scientist 
Leadership Computing Facility 
Argonne National Laboratory 