[llvm-dev] [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback

vivek pandya via llvm-dev llvm-dev at lists.llvm.org
Tue May 24 11:00:58 PDT 2016


Hello,

I have written the following code to check, for each register, whether it
is used by a MachineFunction:

const MachineRegisterInfo *MRI = &MF.getRegInfo();
const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
const TargetMachine &TM = MF.getTarget();
const MCRegisterInfo *MCRI = TM.getMCRegisterInfo();
DEBUG(dbgs() << "Function Name : " << MF.getName() << "\n");

// Walk every register class and query each physical register in it.
for (TargetRegisterInfo::regclass_iterator I = TRI->regclass_begin(),
                                           E = TRI->regclass_end();
     I != E; ++I) {
  for (TargetRegisterClass::iterator PReg = (*I)->begin(),
                                     PRegE = (*I)->end();
       PReg != PRegE; ++PReg) {
    DEBUG(dbgs() << "Physical Register : " << MCRI->getName(*PReg)
                 << " is modified " << MRI->isPhysRegModified(*PReg)
                 << "\n");
  }
}
DEBUG(dbgs() << "\n");

The pass executing this code is scheduled at the post-RA stage, but it
reports true for every register, i.e., in each function all registers
appear to be used, except EBP and a few similar ones. Is this a correct way
to get register usage information? I think I have made some mistake;
please help.

Vivek

On Wed, May 18, 2016 at 11:42 PM, Quentin Colombet <qcolombet at apple.com>
wrote:

>
> On May 18, 2016, at 11:00 AM, vivek pandya <vivekvpandya at gmail.com> wrote:
>
>
> On Wed, May 18, 2016 at 11:25 PM, Quentin Colombet <qcolombet at apple.com>
> wrote:
>
>>
>> On May 18, 2016, at 10:46 AM, vivek pandya <vivekvpandya at gmail.com>
>> wrote:
>>
>> On Wed, May 11, 2016 at 4:01 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>
>>>
>>> ------------------------------
>>>
>>> *From: *"vivek pandya" <vivekvpandya at gmail.com>
>>> *To: *"Mehdi Amini" <mehdi.amini at apple.com>
>>> *Cc: *"Hal Finkel" <hfinkel at anl.gov>, "Quentin Colombet" <
>>> qcolombet at apple.com>, "llvm-dev" <llvm-dev at lists.llvm.org>, "Matthias
>>> Braun" <matze at braunis.de>
>>> *Sent: *Wednesday, May 11, 2016 3:15:03 AM
>>> *Subject: *Re: [GSoC 2016] Interprocedural Register Allocation -
>>> Introduction and Feedback
>>>
>>>
>>> On Wed, May 11, 2016 at 10:02 AM, vivek pandya <vivekvpandya at gmail.com>
>>> wrote:
>>>
>>>>
>>>> On Wed, May 11, 2016 at 9:43 AM, Mehdi Amini <mehdi.amini at apple.com>
>>>> wrote:
>>>>
>>>>>
>>>>> On May 10, 2016, at 6:06 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> *From: *"vivek pandya" <vivekvpandya at gmail.com>
>>>>> *To: *"llvm-dev" <llvm-dev at lists.llvm.org>, "Tim Amini Golling" <
>>>>> mehdi.amini at apple.com>, "Hal Finkel" <hfinkel at anl.gov>
>>>>> *Cc: *"Quentin Colombet" <qcolombet at apple.com>
>>>>> *Sent: *Tuesday, May 10, 2016 2:59:16 PM
>>>>> *Subject: *[GSoC 2016] Interprocedural Register Allocation -
>>>>> Introduction and Feedback
>>>>>
>>>>> Hello LLVM Community,
>>>>>
>>>>> Sorry for the delay; I was busy with final exams.
>>>>>
>>>>> I am Vivek from India. Thanks for choosing my proposal for
>>>>> Interprocedural Register Allocation (IPRA) in LLVM. Mehdi Amini and Hal
>>>>> Finkel will be mentoring me for this project.
>>>>>
>>>>> IPRA can reduce the code size and runtime of programs by allocating
>>>>> registers across module and procedure boundaries.
>>>>>
>>>>> I have identified some old but effective research work in this area.
>>>>> I want the community's feedback on the feasibility of these
>>>>> approaches, and I am targeting to implement two of them during this
>>>>> project.
>>>>>
>>>>> Here is the list of papers. I have read the first two, and I would
>>>>> like to discuss their approaches first; I will read the other two
>>>>> papers and then initiate discussion on them as well. My goal is to
>>>>> settle on a concrete implementation plan before 23 May 2016, and for
>>>>> that I need the community's help.
>>>>>
>>>>> 1) Compile time ----- Minimizing register usage penalty at procedure
>>>>> calls - http://dl.acm.org/citation.cfm?id=53999
>>>>> ====================================================================
>>>>> In this approach intra-procedural register allocation is used as the
>>>>> base, but machine code is generated in bottom-up call graph order,
>>>>> and the interprocedural effect is achieved by propagating the
>>>>> register usage information of a callee to its callers (i.e., child to
>>>>> parent in the CallGraph), so that a caller can use different
>>>>> registers than its callees and save load/store cost at procedure
>>>>> calls. This is not as trivial as it seems, due to recursive calls,
>>>>> library function usage, etc. Also, in the upper region of the graph
>>>>> the number of available registers may drop to zero; in that case the
>>>>> technique should fall back to normal loads and stores at procedure
>>>>> calls. Apart from these, other difficulties have been identified;
>>>>> please follow this mail chain:
>>>>> https://groups.google.com/d/topic/llvm-dev/HOYAXv3m1LY/discussion
>>>>> My mentor has already provided me a patch that alters the code
>>>>> generation order to a bottom-up call graph traversal, and I am
>>>>> working from that point now. Any other help/suggestions are always
>>>>> welcome.
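>>>>>
>>>>> To make the propagation step concrete, here is a minimal sketch,
>>>>> under the assumption that functions are codegened in bottom-up call
>>>>> graph order (RegUsageMap and recordUsedRegs are hypothetical names,
>>>>> not existing LLVM API):
>>>>>
>>>>> #include "llvm/ADT/DenseMap.h"
>>>>> #include "llvm/CodeGen/MachineFunction.h"
>>>>> #include "llvm/CodeGen/MachineRegisterInfo.h"
>>>>> #include "llvm/IR/Function.h"
>>>>> #include "llvm/Target/TargetRegisterInfo.h"
>>>>> #include <cstdint>
>>>>> #include <vector>
>>>>> using namespace llvm;
>>>>>
>>>>> // Hypothetical cache: function -> regmask of the registers it
>>>>> // preserves (regmasks have one bit per register; a set bit means
>>>>> // "preserved across the call").
>>>>> static DenseMap<const Function *, std::vector<uint32_t>> RegUsageMap;
>>>>>
>>>>> // Run after codegen of a callee; a caller compiled later can then
>>>>> // use this mask at its call sites instead of the conservative ABI
>>>>> // mask.
>>>>> static void recordUsedRegs(const MachineFunction &MF) {
>>>>>   const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
>>>>>   const MachineRegisterInfo &MRI = MF.getRegInfo();
>>>>>   unsigned NumRegs = TRI->getNumRegs();
>>>>>   // Start from "everything preserved" and clear the bit of every
>>>>>   // physical register this function modifies.
>>>>>   std::vector<uint32_t> Mask((NumRegs + 31) / 32, ~0u);
>>>>>   for (unsigned PReg = 1; PReg < NumRegs; ++PReg)
>>>>>     if (MRI.isPhysRegModified(PReg))
>>>>>       Mask[PReg / 32] &= ~(1u << (PReg % 32));
>>>>>   RegUsageMap[MF.getFunction()] = std::move(Mask);
>>>>> }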
>>>>>
>>>>> 2) Link time ----- Global register allocation at link time -
>>>>> http://dl.acm.org/citation.cfm?id=989415
>>>>> ====================================================================
>>>>> In this approach (a sort of true IPRA), registers are reallocated at
>>>>> link time (the optimization is optional; if turned off, the code is
>>>>> still compiled as per intra-procedural allocation). Modules are first
>>>>> compiled as per normal compilation, but the object code is annotated
>>>>> with details so that the linker can build a call graph and compute
>>>>> usage information at link time. The compiler also writes hints in the
>>>>> object code describing how the code should be changed if a particular
>>>>> variable is allocated to some other register (due to the new
>>>>> allocation). The linker can then use this information to decide which
>>>>> (global) variables need to stay in the same register throughout
>>>>> program execution and, from the register usage information in the
>>>>> call graph, which procedures will never be active simultaneously, so
>>>>> that the locals of those procedures can share registers without loads
>>>>> and stores at procedure calls.
>>>>> For this particular method, help me to analyze feasibility:
>>>>> 1) Can LLVM collect the following information at module level in
>>>>> MachineIR: the list of procedures in the module, the list of locals
>>>>> in each procedure, the list of procedures a particular procedure can
>>>>> call, and the list of variables the procedure references, where each
>>>>> entry in the last two lists includes an estimate of the number of
>>>>> times the procedure is called or the variable is referenced in each
>>>>> execution of this procedure? (See the sketch after this list.)
>>>>> 2) Can LLVM write such annotations to object files?
>>>>> 3) Is LTO capable of leveraging those annotations?
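>>>>>
>>>>> Here is the sketch referenced in question 1): the per-procedure
>>>>> summary the paper's scheme would need. Every type and field below is
>>>>> hypothetical; nothing like this exists in LLVM today.
>>>>>
>>>>> #include <cstdint>
>>>>> #include <string>
>>>>> #include <vector>
>>>>>
>>>>> // Summary emitted per procedure into the object file, from which the
>>>>> // linker can rebuild the call graph and the frequency estimates.
>>>>> struct ProcedureSummary {
>>>>>   std::string Name;
>>>>>   std::vector<std::string> Locals;  // locals competing for registers
>>>>>   struct CallSite {
>>>>>     std::string Callee;
>>>>>     uint64_t CallsPerInvocation;    // estimated call frequency
>>>>>   };
>>>>>   std::vector<CallSite> Calls;
>>>>>   struct VarRef {
>>>>>     std::string Variable;           // referenced (global) variable
>>>>>     uint64_t RefsPerInvocation;     // estimated reference frequency
>>>>>   };
>>>>>   std::vector<VarRef> Refs;
>>>>> };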
>>>>>
>>>>> In terms of scoping the project for the summer, I definitely recommend
>>>>> that you focus on (1) first. If you finish that, we can certainly move on
>>>>> to other things.
>>>>>
>>>>>
>>>>> I'll add +1 here, but I already wrote the same thing on IRC when
>>>>> discussing with Vivek. True IPRA without a proper MachineModule
>>>>> infrastructure won't be doable in my opinion (even with such
>>>>> infrastructure, it may not be trivial in LLVM in general).
>>>>>
>>>>> Regarding link time, note that any such design would likely look much
>>>>> different than in David Wall's paper, however, because our LTO
>>>>> re-codegens everything anyway. The paper says, "Finally, it keeps us
>>>>> honest as designers of the system; once we postpone anything until
>>>>> link time, the temptation is great to postpone everything, ..." -
>>>>> Well, we've long since succumbed to that temptation when we LTO.
>>>>> C'est la vie.
>>>>>
>>>>>
>>>>> +1 as well, our LTO will benefit naturally from the leaf-to-root
>>>>> information propagation. ThinLTO will be more challenging/interesting
>>>>> though!
>>>>>
>>>>> For the first part, a mechanism similar to a MachineModulePass would
>>>>> be desirable; that may not be possible during this project, but
>>>>> perhaps we can make some smaller version of it to suit our purpose.
>>>>>
>>>>> I don't think we need to make any kind of MachineModulePass to make
>>>>> this work. Once we alter the visitation order based on the CGSCC iteration
>>>>> scheme, we can keep state in-between functions in the pre-existing hacky
>>>>> way (using static members of the relevant function passes).
>>>>>
>>>> Sorry, my mistake; by "first part" I mean requirement 1) in the
>>>> link-time approach.
>>>>
>>>>>
>>>>
>>>>> I also don't see where/why we need a MachineModule(Pass) for the
>>>>> CGSCC scheme; that said, I'd rather avoid using a function pass with
>>>>> static members if we can have a ModuleAnalysis that does the
>>>>> bookkeeping of the results for the functions in the module and is
>>>>> queried by the register allocator somehow. Matthias/Quentin may have
>>>>> other inputs on this aspect.
>>>>>
>>>>
>>> @Hal, do you mean to add a simple MachineFunction pass that just
>>> operates on a register-allocated function and prepares a BitVector
>>> indicating which registers are used by the MachineFunction, and then to
>>> use this pass as an analysis pass (i.e., simply return the static
>>> BitVector of clobbered registers when register allocation for the next
>>> function begins; this part is not very clear to me)? That could be done
>>> by scheduling a pass post register allocation in
>>> lib/CodeGen/Passes.cpp:
>>>
>>> void TargetPassConfig::addMachinePasses() {
>>> .
>>> .
>>> .
>>>   // Run pre-ra passes.
>>>   addPreRegAlloc();
>>>
>>>   // Run register allocation and passes that are tightly coupled with
>>>   // it, including phi elimination and scheduling.
>>>   if (getOptimizeRegAlloc())
>>>     addOptimizedRegAlloc(createRegAllocPass(true));
>>>   else
>>>     addFastRegAlloc(createRegAllocPass(false));
>>>
>>>   // Run post-ra passes.
>>>   addPostRegAlloc();
>>>   // Adding a new pass here which keeps register mask information
>>>   // across function calls.
>>> .
>>> .
>>> .
>>> }
>>>
>>> But this also requires the current register allocators to use this
>>> information in some way, because RegMaskBits in
>>> LiveIntervalAnalysis.cpp is not static across calls. I mean, I am not
>>> clear on how to propagate the static info to the intra-procedural
>>> register allocators (if that is possible without disturbing their
>>> code).
>>>
>>> First, my hope is that we won't need to change the register allocators,
>>> as such, in order to make use of this information. Instead, we'll simply be
>>> able to alter the register masks generated for the call instructions. These
>>> masks will indicate fewer clobbers than might otherwise be present based on
>>> the ABI because of information gathered during the codegen of the callee.
>>> These masks are generally constructed by target based on the calling
>>> convention. The PowerPC backend, for example, looks like this:
>>>
>>>   // Add a register mask operand representing the call-preserved
>>>   // registers.
>>>   const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo();
>>>   const uint32_t *Mask =
>>>       TRI->getCallPreservedMask(DAG.getMachineFunction(), CallConv);
>>>   assert(Mask && "Missing call preserved mask for calling convention");
>>>   Ops.push_back(DAG.getRegisterMask(Mask));
>>>
>>> but it can be more complicated. If you look for uses of
>>> 'getRegisterMask' in Target/*/*ISelLowering.cpp, you'll see what I mean.
>>> Regardless, the code ends up calling some method in the target's
>>> TargetRegisterInfo subclass. These methods generally look something like
>>> this:
>>>
>>> const uint32_t *
>>> PPCRegisterInfo::getCallPreservedMask(const MachineFunction &MF,
>>>                                       CallingConv::ID CC) const {
>>>   const PPCSubtarget &Subtarget = MF.getSubtarget<PPCSubtarget>();
>>>   ...
>>>   return TM.isPPC64()
>>>              ? (Subtarget.hasAltivec() ? CSR_SVR464_Altivec_RegMask
>>>                                        : CSR_SVR464_RegMask)
>>>              : (Subtarget.hasAltivec() ? CSR_SVR432_Altivec_RegMask
>>>                                        : CSR_SVR432_RegMask);
>>> }
>>>
>>> In any case, the fundamental idea here is that, when someone calls
>>> getCallPreservedMask in order to set the regmask on a call, we might not
>>> have to use the CC at all. Instead, if we've already codegened the
>>> function, we might use a cache of 'exact' register masks computed during
>>> codegen of the potential callees instead.
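>>>
>>> For illustration only, a sketch of how such a query could be wired up.
>>> The helper lookupExactRegMask is hypothetical, and how the callee's
>>> Function is made available at this point is exactly one of the design
>>> questions to settle:
>>>
>>> #include "llvm/CodeGen/MachineFunction.h"
>>> #include "llvm/IR/CallingConv.h"
>>> #include "llvm/IR/Function.h"
>>> #include "llvm/Target/TargetRegisterInfo.h"
>>> #include <cstdint>
>>> using namespace llvm;
>>>
>>> // Hypothetical lookup into the cache of exact masks; returns null if
>>> // the callee has not been codegened yet.
>>> const uint32_t *lookupExactRegMask(const Function *Callee);
>>>
>>> const uint32_t *
>>> getCallPreservedMaskIPRA(const MachineFunction &CallerMF,
>>>                          const Function *Callee, CallingConv::ID CC) {
>>>   // Prefer the exact mask computed during codegen of the callee.
>>>   if (Callee)
>>>     if (const uint32_t *Exact = lookupExactRegMask(Callee))
>>>       return Exact;
>>>   // Otherwise fall back to the conservative ABI mask for this CC.
>>>   const TargetRegisterInfo *TRI =
>>>       CallerMF.getSubtarget().getRegisterInfo();
>>>   return TRI->getCallPreservedMask(CallerMF, CC);
>>> }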
>>>
>> I am thinking of adding a simple immutable pass,
>> MachineRegisterUsageInfo, similar to MachineBranchProbabilityInfo, that
>> maintains register usage information per function. Can it simply be done
>> by using UsedPhysRegMask from MachineRegisterInfo?
>>
>>
>> No, like the comment said, UsedPhysRegMask gives only the registers
>> clobbered by calls:
>> // This bit vector represents all the registers clobbered by function
>> // calls.
>>
>> You want to build this information yourself on top of
>> MachineRegisterInfo::isPhysRegModified
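>>
>> For example, a minimal sketch using only existing API:
>>
>> #include "llvm/ADT/BitVector.h"
>> #include "llvm/CodeGen/MachineFunction.h"
>> #include "llvm/CodeGen/MachineRegisterInfo.h"
>> #include "llvm/Target/TargetRegisterInfo.h"
>> using namespace llvm;
>>
>> BitVector computeModifiedPhysRegs(const MachineFunction &MF) {
>>   const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
>>   const MachineRegisterInfo &MRI = MF.getRegInfo();
>>   BitVector Modified(TRI->getNumRegs());
>>   // Register number 0 is the invalid "NoRegister" sentinel; skip it.
>>   for (unsigned PReg = 1, E = TRI->getNumRegs(); PReg != E; ++PReg)
>>     if (MRI.isPhysRegModified(PReg))
>>       Modified.set(PReg);
>>   return Modified;
>> }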
>>
> OK, but then the time complexity will be O(n), where n = the number of
> physical registers on the target. Am I correct?
>
>
> Yes, this is correct.
>
>
>> Here getCallPreservedMask will call an API provided by
>> MachineRegisterUsageInfo to obtain the exact register mask, but how can
>> it know whether the function has already been codegened? Or will it
>> query every time getCallPreservedMask is called, and
>> MachineRegisterUsageInfo will return the details if available and
>> otherwise simply return NULL? So the changes will now be in the
>> TargetRegisterInfo implementation for each target, right?
>>
>>
>>> In order to do this, I think we'll need to provide a function callable
>>> from the target's getCallPreservedMask implementation, which can return
>>> such an 'exact' regmask when available. I think we need to do it this way
>>> for two reasons:
>>>
>>>  1. Not all of the target code calls getCallPreservedMask, but sometimes
>>> calls other similar target-specific functions (e.g.
>>> getTLSCallPreservedMask).
>>>  2. The targets need to opt in to this behavior because only the target
>>> can know that all register uses are really tagged correctly post "pre-emit".
>>>
>>> Because the target is free to introduce uses of registers at essentially
>>> any time, we need to do the scanning for used registers after the
>>> "pre-emit" passes run. This can be done by scheduling some simple
>>> register-use scanning pass after the call to addPreEmitPass in
>>> lib/CodeGen/Passes.cpp.
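>>>
>>> I.e., in TargetPassConfig::addMachinePasses, something like this (the
>>> collector pass and its create function are hypothetical):
>>>
>>>   // Pre-emit passes may still introduce new register uses.
>>>   addPreEmitPass();
>>>   // Only now is it safe to scan the final register usage.
>>>   addPass(createRegUsageInfoCollectorPass());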
>>>
>>>
>>> I think this also applies in some way to Mehdi Amini's idea of keeping
>>> a ModulePass for bookkeeping, but then the existing register allocators
>>> would need to change so that the code can query the ModulePass for the
>>> RegMaskBits of a particular function.
>>>
>>> I think that the simplest way to do this is to create an immutable
>>> analysis pass (e.g. BasicAA) that keeps the cache of the computed register
>>> masks. This is somewhat similar in spirit to how the 'AssumptionCache'
>>> analysis works at the IR level. This analysis can then be created by
>>> lib/CodeGen/Passes.cpp early, and then queried and passed around later by
>>> the CodeGen/Target code. Because it is an immutable analysis, it won't get
>>> destroyed until the very end, which is also important because, I imagine,
>>> it will need to own the memory associated with the generated register masks.
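>>>
>>> A minimal sketch of what that immutable analysis could look like,
>>> reusing the MachineRegisterUsageInfo name suggested above (members and
>>> API are hypothetical):
>>>
>>> #include "llvm/ADT/DenseMap.h"
>>> #include "llvm/IR/Function.h"
>>> #include "llvm/Pass.h"
>>> #include <cstdint>
>>> #include <vector>
>>> using namespace llvm;
>>>
>>> // Immutable analysis owning the exact regmasks computed per function.
>>> // Being immutable, it survives until the very end of the pipeline, so
>>> // the masks stay valid while later functions' call lowering queries
>>> // them.
>>> class MachineRegisterUsageInfo : public ImmutablePass {
>>> public:
>>>   static char ID;
>>>   MachineRegisterUsageInfo() : ImmutablePass(ID) {}
>>>
>>>   void storeRegMask(const Function *F, std::vector<uint32_t> Mask) {
>>>     RegMasks[F] = std::move(Mask);
>>>   }
>>>
>>>   // Returns null if F has not been codegened yet.
>>>   const uint32_t *getRegMask(const Function *F) const {
>>>     auto I = RegMasks.find(F);
>>>     return I == RegMasks.end() ? nullptr : I->second.data();
>>>   }
>>>
>>> private:
>>>   DenseMap<const Function *, std::vector<uint32_t>> RegMasks;
>>> };
>>> char MachineRegisterUsageInfo::ID = 0;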
>>>
>>>  -Hal
>>>
>>>
>>> Vivek
>>>
>>>
>>>> Yes, for the register-usage-propagation approach we don't need a
>>>> MachineModulePass.
>>>>
>>>> Vivek
>>>>
>>>>> --
>>>>> Mehdi
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Hal Finkel
>>> Assistant Computational Scientist
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>>
>