[llvm-dev] [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback

Wed May 25 19:26:49 PDT 2016

Yes , we will first finish this approach, it just me being little bit
excited :)

On Thu, May 26, 2016 at 12:31 AM, Hal Finkel <hfinkel at anl.gov> wrote:

>
> ------------------------------
>
> *From: *"Mehdi Amini" <mehdi.amini at apple.com>
> *To: *"vivek pandya" <vivekvpandya at gmail.com>
> *Cc: *"Hal Finkel" <hfinkel at anl.gov>, "llvm-dev" <llvm-dev at lists.llvm.org>,
> "Matthias Braun" <matze at braunis.de>, "Quentin Colombet" <
> qcolombet at apple.com>
> *Sent: *Wednesday, May 25, 2016 1:26:56 AM
> *Subject: *Re: [GSoC 2016] Interprocedural Register Allocation -
> Introduction and Feedback
>
>
>
> Sent from my iPhone
>
> On May 24, 2016, at 11:04 PM, vivek pandya <vivekvpandya at gmail.com> wrote:
>
>
>
> On Wed, May 25, 2016 at 10:46 AM, Mehdi Amini <mehdi.amini at apple.com>
> wrote:
>
>>
>> On May 24, 2016, at 10:08 PM, vivek pandya <vivekvpandya at gmail.com>
>> wrote:
>>
>>
>>
>> On Wed, May 25, 2016 at 10:08 AM, Mehdi Amini <mehdi.amini at apple.com>
>> wrote:
>>
>>>
>>> On May 24, 2016, at 9:17 PM, vivek pandya <vivekvpandya at gmail.com>
>>> wrote:
>>>
>>> Dear Mentors,
>>>
>>> Please help me to understand our plan to implement Interprocedural
>>> Register allocator by propogating register usage info. While writing this
>>> mail I am considering all previous discussion over llvm-dev and IRC.
>>>
>>> 1) A MachineFunction pass to be executed POST-RA to collect the
>>> information about the used Registers.
>>> 2) An Immutable pass which will store reg usage info collected by
>>> previous pass and return it whenever queried.
>>> 3) A Target specific MachineFucntion pass that will use the register
>>> usage info for available for call instrction to achive IPRA. This pass
>>> should run at PRE-RA.
>>>
>>> Relation among above passes:
>>>
>>> 1) pass will store info to 2) pass as well use info for call instruction
>>> found while processing.
>>>
>>> 3) pass only requires to query information from 2) pass.
>>>
>>>
>>> Questions
>>> =========
>>>
>>>  Which pass is responsible for load/store of callee saved register, at
>>> the begining of each function call? And how does it uses RegMask of call
>>> instruction to generate load/store. I think Intra-procedural register
>>> allocator is not responsible to generate load/store around the call site.
>>>
>>> /- - -> (A) - - -> (D)
>>> /
>>> (K)- - ->(T)- - -> (B) - - -> (E)
>>> \
>>> \- - -> (C) - - -> (F)
>>>
>>> So as per our discussion we would require following passes:
>>>
>>> Suppose in given example call graph , register allocation for D is
>>> completed now we have that information available So 3) pass while
>>> processing A , it would collect reg usage info for all callees and OR them
>>> and then it should update A's regmask by going to parant procedure that
>>> actually calls A ??
>>>
>>>
>>> No, Pass 3) is only looking for every call MI in A and updating the
>>> associated regmask by replacing it with the information stored in the
>>> immutable pass.
>>>
>> What is the use of that ? Please be more specific.
>>
>>
>> I'm not sure I get the question. You describe 3 different passes at the
>> beginning, and it seems fairly obvious what is the responsibility of each.
>>
>> This seems quite straightforward to me, so I'm not sure what to explain,
>> here is the logical sequence on D and A:
>>
>> 1) Codegen function D, MachineFunction post-RA: collect the register
>> usage for D. Store the new regmask in the immutable pass
>> 2) Move to function A, perform ISel,  then run the MachineFunction pre-RA
>> that finds the call to D. Query the immutable pass and get the regmask for
>> D. Update the regmask associated with the call.
>>
>> Done.
>>
> I understand the sequence we discussed till now but as I am also reading
> literature about IPRA so I am confused that how this would remove
> load/store of registers (so that callee can clobber it with out worrying
> about caller's content in registers)
>
>
> From the beginning I told you that it is not true IPRA on this aspect:
> callees have to preserve callee-saved register according to the CC, unless
> internal.
>
>
>
> and also how caller will not use register used by callee.
>
>
> Caller can use them, just like it does now: by saving restoring them
> before/after the call if it does.
>
> So this boils down to question what magic is done by just updating reg
> mask at callsite and we do not eve require to intimate this to
> intra-procedural register allocator,
>
>
> Ask yourself how the RA knows how to deal with what to save around calls
> in general (hint: think about what is the use of regmask).
>
>
> this also means that some where a code is written that is responsible for
> inserting store/load for register so that callee can preserve register but
> in presence of IPRA   how it will not add those store/load?
>
>
> What we're doing here is purely an optimization of the calling convention.
> This does not have to be more intrusive...
>
>
>
>
> It seems that my questions have confused Mehdi Amini :P
> Other masters please help us !
>
>
>>
>>
>>
>>
>>>
>>> How reg mask details of call D would be used by Register allocator while
>>> allocating register for A and also not generating load/store for register
>>> being used by A in body of D as we have callee saved convention.
>>>
>>>
>>> I expect all of that to be handled automatically when updating the
>>> regmask.
>>>
>> Please elaborate your both comments, what ever you have think of it to
>> be.
>>
>>
>>>
>>>
>>> How the pass responsible for generating load/store will optimize for the
>>> child node of call graph where it does not require to load/store because
>>> caller will not use register used by callee ? I mean how our IPRA will take
>>> care of this?
>>>
>>>
>>> I don't understand that.
>>>
>> In above example call graph when pass 3) is executed for D it does not do
>> any work because there is not call to any other function in body of D but
>> actually in presence of IPRA there is no need to preserve (store/load) any
>> register in function D ( and also in other such leaf node of call graph).
>> The reason for the same is A now aware of register usage in D so it will
>> not use any register which intersect with D's register usage thus D need
>> not to worry about any thing.
>> Is this make sense?
>>
>>
>> No. D has to obey the calling convention if it is not internal. So any
>> callee-saved register has to be ... saved.
>>
> Why is that required when caller of D does not use any register used by D?
>
>
> You don't know about the callers when you are doing codegen for D.
> Again: we're not trying to perform pure IPRA here.
>
> +1 -- Please stick to the initial plan. The literature contains many
> interesting possibilities, but I think that the register-mask overriding is
> a useful subset that we can accomplish within the summer. Once we have this
> working, we'll all be in a better place to think about more-sophisticated
> possibilities.
>
>  -Hal
>
>
> --
> Mehdi
>
>
>
>>
>> --
>> Mehdi
>>
>>
>>
>>
>>
>>>
>>>
>>> In short I am not much clear with the method for using information to
>>> get effect of IPRA without modifying Register allocator them self(i.e by
>>> updating regMask of call instructions).
>>>
>>> Also 1) pass and 3) pass are seem to intersecting for their work, for
>>> example consider while scanning register usage info for T function the
>>> final register usage info should be <all regs used by T> OR < reg usage
>>> info A > OR <reg usage info B > OR < reg usage info C > because K should
>>> not use any register which is used by T, A ,D, B, C, E, F with out
>>> load/store the relevant paper also discuss this situation and suggest to
>>> fall back to load/store approach. So as we move to upper region of the call
>>> graph it is very likely that enough regiseters are not there to allocate.
>>>
>>>
>>> The calling convention (for anything else than internal function) will
>>> always have some callee-saved registers.
>>> If we have a deep call-graph of internal-only function, then we may
>>> generate a lot of spill at the top. Somehow we may have to think about
>>> driving some heuristic with PGO.
>>>
>>
>> This kind of problem is addressed by this paper Register allocation
>> across procedure and module boundaries -
>> http://dl.acm.org/citation.cfm?id=93551
>> In this paper authors have tried to eliminate load/store only in call
>> intensive regions ( contrast to simple bottom up ) by analyzing call
>> frequency at compile time ( also has facility to plug in Profile info) and
>> rest of the call graph will follow normal register allocator. But yes we
>> will think about this later.
>>
>> I suspect we can think about that a bit later. Let's focus on the simple
>>> for now.
>>>
>>> --
>>> Mehdi
>>>
>>>
>>>
>>>
>>> Please bear with my silly questions.
>>>
>>> Sincerely,
>>> Vivek
>>>
>>>
>>> On Wed, May 25, 2016 at 8:46 AM, vivek pandya <vivekvpandya at gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Wed, May 25, 2016 at 8:44 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>>
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> *From: *"vivek pandya" <vivekvpandya at gmail.com>
>>>>> *To: *"Hal Finkel" <hfinkel at anl.gov>
>>>>> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>, "Matthias Braun" <
>>>>> matze at braunis.de>, "Mehdi Amini" <mehdi.amini at apple.com>, "Quentin
>>>>> Colombet" <qcolombet at apple.com>
>>>>> *Sent: *Tuesday, May 24, 2016 9:34:29 PM
>>>>>
>>>>> *Subject: *Re: [GSoC 2016] Interprocedural Register Allocation -
>>>>> Introduction and Feedback
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 25, 2016 at 3:53 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>>>
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> *From: *"vivek pandya" <vivekvpandya at gmail.com>
>>>>>> *To: *"Quentin Colombet" <qcolombet at apple.com>
>>>>>> *Cc: *"Hal Finkel" <hfinkel at anl.gov>, "llvm-dev" <
>>>>>> llvm-dev at lists.llvm.org>, "Matthias Braun" <matze at braunis.de>,
>>>>>> "Mehdi Amini" <mehdi.amini at apple.com>
>>>>>> *Sent: *Tuesday, May 24, 2016 1:00:58 PM
>>>>>> *Subject: *Re: [GSoC 2016] Interprocedural Register Allocation -
>>>>>> Introduction and Feedback
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have written following code to check each register if it is used by
>>>>>> machineFunction or not :
>>>>>>
>>>>>> MachineRegisterInfo *MRI = &MF.getRegInfo();
>>>>>> TargetRegisterInfo *TRI = (TargetRegisterInfo
>>>>>> *)MF.getSubtarget().getRegisterInfo();
>>>>>>
>>>>>> Some reason you can't use a const pointer here?
>>>>>>
>>>>> MCRegisterInfo is just used to get conventional name of register for
>>>>> given target like AX, BX on X86.
>>>>>
>>>>>>
>>>>>> const TargetMachine &TM = MF.getTarget();
>>>>>> const MCRegisterInfo *MCRI = TM.getMCRegisterInfo();
>>>>>> DEBUG(dbgs() << "Function Name : " << MF.getName() << "\n");
>>>>>>
>>>>>> for(TargetRegisterInfo::regclass_iterator i =
>>>>>> (*TRI).regclass_begin(), e = (*TRI).regclass_end(); i != e; i++ ) {
>>>>>> for(TargetRegisterClass::iterator pregi = (*i)->begin(), prege =
>>>>>> (*i)->end(); pregi != prege; pregi++ ) {
>>>>>> DEBUG( dbgs() << "Physical Register : " << MCRI->getName(*pregi) << "
>>>>>> is modified "<< MRI->isPhysRegModified(*pregi) << " \n");
>>>>>>
>>>>>> Try isPhysRegUsed.
>>>>>>
>>>>> ok
>>>>>
>>>>>>
>>>>>> }
>>>>>> }
>>>>>> DEBUG(dbgs() << "\n");
>>>>>>
>>>>>> The pass which is executing this code is schedule POST-RA stage but
>>>>>> this gives me true for all registers i.e in each function all registers are
>>>>>> being used except EBP and some other similar, Is this a correct way to get
>>>>>> register usage information ? I think I have made some mistake please help.
>>>>>>
>>>>>>
>>>>>> You might look at the implementation of these functions in
>>>>>> lib/CodeGen/MachineRegisterInfo.cpp and figure out if they're returning
>>>>>> true because UsedPhysRegMask.test(PhysReg) is true or because
>>>>>> reg_nodbg_empty(*AliasReg) is true.
>>>>>>
>>>>> Yes that helped now I am getting actual register which have been used
>>>>> by given function, but a little problem
>>>>> The updated code is as shown below :
>>>>> for(TargetRegisterInfo::regclass_iterator i = (*TRI).regclass_begin(),
>>>>> e = (*TRI).regclass_end(); i != e; i++ ) {
>>>>> for(TargetRegisterClass::iterator pregi = (*i)->begin(), prege =
>>>>> (*i)->end(); pregi != prege; pregi++ ) {
>>>>> for (MCRegAliasIterator AliasReg(*pregi, TRI, true);
>>>>> AliasReg.isValid(); ++AliasReg) {
>>>>>    if (!MRI->reg_nodbg_empty(*AliasReg)) {
>>>>>     DEBUG( dbgs() << "Physical Register : " << MCRI->getName(*pregi)
>>>>> << " is used "<< MRI->isPhysRegUsed(*pregi) << " \n");
>>>>>     break; // no need to process more alias
>>>>>    }
>>>>>   }
>>>>> }
>>>>> }
>>>>> But here some registers are getting processed with in different
>>>>> classes (unnecessary processing) Is this only way to iterate through all
>>>>> used register (using RegClass iterator) ? Is there any way to avoid
>>>>> duplicate regs?
>>>>> Of course currently I am just printing but next I am thinking to use a
>>>>> map to track usage info , in that only distinct register info will be
>>>>> stored but still due to loop structure I need to iterate through a single
>>>>> register 3 - 4 times making it time consuming.
>>>>>
>>>>> Yes, I believe you can just do:
>>>>>
>>>>>   for (unsigned Reg = 0; Reg < TRI->getNumRegs(); ++Reg) {
>>>>>
>>>> Oh yes thanks I just forgot that PhyReg starts at 0.
>>>>
>>>>>
>>>>>
>>>>>  -Hal
>>>>>
>>>>>
>>>>> -Vivek
>>>>>
>>>>>>
>>>>>>
>>>>>>  -Hal
>>>>>>
>>>>>>
>>>>>>
>>>>>> Vivek
>>>>>>
>>>>>> On Wed, May 18, 2016 at 11:42 PM, Quentin Colombet <
>>>>>> qcolombet at apple.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> On May 18, 2016, at 11:00 AM, vivek pandya <vivekvpandya at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Vivek Pandya*
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 18, 2016 at 11:25 PM, Quentin Colombet <
>>>>>>> qcolombet at apple.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On May 18, 2016, at 10:46 AM, vivek pandya <vivekvpandya at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *Vivek Pandya*
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, May 11, 2016 at 4:01 PM, Hal Finkel <hfinkel at anl.gov>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> ------------------------------
>>>>>>>>>
>>>>>>>>> *From: *"vivek pandya" <vivekvpandya at gmail.com>
>>>>>>>>> *To: *"Mehdi Amini" <mehdi.amini at apple.com>
>>>>>>>>> *Cc: *"Hal Finkel" <hfinkel at anl.gov>, "Quentin Colombet" <
>>>>>>>>> qcolombet at apple.com>, "llvm-dev" <llvm-dev at lists.llvm.org>,
>>>>>>>>> "Matthias Braun" <matze at braunis.de>
>>>>>>>>> *Sent: *Wednesday, May 11, 2016 3:15:03 AM
>>>>>>>>> *Subject: *Re: [GSoC 2016] Interprocedural Register Allocation -
>>>>>>>>> Introduction and Feedback
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Vivek Pandya*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, May 11, 2016 at 10:02 AM, vivek pandya <
>>>>>>>>> vivekvpandya at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *Vivek Pandya*
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, May 11, 2016 at 9:43 AM, Mehdi Amini <
>>>>>>>>>> mehdi.amini at apple.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On May 10, 2016, at 6:06 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------
>>>>>>>>>>>
>>>>>>>>>>> *From: *"vivek pandya" <vivekvpandya at gmail.com>
>>>>>>>>>>> *To: *"llvm-dev" <llvm-dev at lists.llvm.org>, "Tim Amini Golling"
>>>>>>>>>>> <mehdi.amini at apple.com>, "Hal Finkel" <hfinkel at anl.gov>
>>>>>>>>>>> *Cc: *"Quentin Colombet" <qcolombet at apple.com>
>>>>>>>>>>> *Sent: *Tuesday, May 10, 2016 2:59:16 PM
>>>>>>>>>>> *Subject: *[GSoC 2016] Interprocedural Register Allocation -
>>>>>>>>>>> Introduction and Feedback
>>>>>>>>>>>
>>>>>>>>>>> Hello LLVM Community,
>>>>>>>>>>>
>>>>>>>>>>> Sorry for delay as I was busy in final exams.
>>>>>>>>>>>
>>>>>>>>>>> I am Vivek from India. Thanks for choosing my proposal for
>>>>>>>>>>> Interprocedural Register Allocation (IPRA) in LLVM. Mehdi Amini and Hal
>>>>>>>>>>> Finkel will be mentoring me for this project.
>>>>>>>>>>>
>>>>>>>>>>> IPRA can reduce code size and runtime of programs by allocating
>>>>>>>>>>> register across the module and procedure boundaries.
>>>>>>>>>>>
>>>>>>>>>>> I have identified some old but effective research work on this
>>>>>>>>>>> area.
>>>>>>>>>>> I want community's feedback for feasibility of these approach
>>>>>>>>>>> and I am targeting to implement two of them during this project.
>>>>>>>>>>>
>>>>>>>>>>> Here is list of the papers, I have read first two papers and I
>>>>>>>>>>> would like to discuss those approach first, I will read other two paper
>>>>>>>>>>> then initiate discussion for them as well. All I want is to find out a
>>>>>>>>>>> concrete implementation plan before 23 May, 2016 and for that I need
>>>>>>>>>>> community's help.
>>>>>>>>>>>
>>>>>>>>>>> 1) Compile time ----- Minimizing register usage penalty at
>>>>>>>>>>> procedure calls - http://dl.acm.org/citation.cfm?id=53999
>>>>>>>>>>> ====================================================================In
>>>>>>>>>>> this approach intra-procedural register allocation is used as base but
>>>>>>>>>>> machine code generation order is bottom up traversal of call graph and
>>>>>>>>>>> inter-procedural effect is achieved by propagating register usage
>>>>>>>>>>> information of callee function to caller (i.e child to parent in CallGraph)
>>>>>>>>>>> so that caller can use different registers than callee and can save load
>>>>>>>>>>> store cost at procedure call, this is not trivial as it seems due to
>>>>>>>>>>> recursive calls, library function usage etc. Also for upper region of the
>>>>>>>>>>> graph in this technique available number of registers might become zero in
>>>>>>>>>>> that case it should fall back to normal load store at procedure call. Apart
>>>>>>>>>>> from these difficulties other difficulties have been identified please
>>>>>>>>>>> follow this mail-chain
>>>>>>>>>>> https://groups.google.com/d/topic/llvm-dev/HOYAXv3m1LY/discussion
>>>>>>>>>>> My mentor has already provided me a patch that alters code
>>>>>>>>>>> generation order as per bottom up call graph traversal, I am working from
>>>>>>>>>>> that point now. Any other help/suggestion is always welcomed.
>>>>>>>>>>>
>>>>>>>>>>> 2) Link time ----- Global register allocation at link time -
>>>>>>>>>>> http://dl.acm.org/citation.cfm?id=989415
>>>>>>>>>>> ====================================================================In
>>>>>>>>>>> this particular approach (sort of true IPRA) registers will be reallocated
>>>>>>>>>>> (this optimization will be optional if turned off still code will be
>>>>>>>>>>> compiled as per intra-procedural allocation) at link time. Here modules are
>>>>>>>>>>> first complied as per normal compilation but the object code is annotated
>>>>>>>>>>> with details so that linker can build call graph and also calculate usage
>>>>>>>>>>> information at link time. Compiler also write hints in object code that if
>>>>>>>>>>> particular variable is allocated in some other register ( due to new
>>>>>>>>>>> allocation) then how the code should be changed? Thus linker can use these
>>>>>>>>>>> information to decide which variables (global) need to be in same register
>>>>>>>>>>> through out the program execution and also according to register usage
>>>>>>>>>>> information in call graph which procedure will not be active simultaneously
>>>>>>>>>>> so that locals for that procedures can be in same registers with out load
>>>>>>>>>>> store at procedure calls.
>>>>>>>>>>> For these particular method help me to analyze feasibility:
>>>>>>>>>>> 1) Can llvm collects following information at module level in
>>>>>>>>>>> MachineIR? list of procedures in module, list of locals in procedures, list
>>>>>>>>>>> of procedures that a particular procedure can call, and a list of the
>>>>>>>>>>> variables this procedure references. Each entry in the last two lists
>>>>>>>>>>> includes an estimate of the number of times the procedure is called or the
>>>>>>>>>>> variable is referenced in each execution of this procedure
>>>>>>>>>>> 2) Can llvm write informative commands to object files?
>>>>>>>>>>> 3) Can LTO is capable of leveraging those commands?
>>>>>>>>>>>
>>>>>>>>>>> In terms of scoping the project for the summer, I definitely
>>>>>>>>>>> recommend that you focus on (1) first. If you finish that, we can certainly
>>>>>>>>>>> move on to other things.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'll add +1 here, but I already wrote the same thing on IRC when
>>>>>>>>>>> discussing with Vivek. True IPRA without a proper MachineModule
>>>>>>>>>>> infrastructure won't be doable in my opinion (even with such
>>>>>>>>>>> infrastructure, it may not be trivial in LLVM in general).
>>>>>>>>>>>
>>>>>>>>>>> Regarding link time, note that any such a design would likely
>>>>>>>>>>> look much different than in David Wall's paper however, because our LTO
>>>>>>>>>>> re-codegens everything anyway. The paper says, "Finally, it keeps us honest
>>>>>>>>>>> as designers of the system; once we postpone anything until link time, the
>>>>>>>>>>> temptation is great to postpone everything, ..." - Well, we've long-since
>>>>>>>>>>> succumb to that temptation when we LTO. C'est la vie.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> +1 as well, our LTO will benefit naturally from the leaf-to-root
>>>>>>>>>>> information propagation. ThinLTO will be more challenging/interesting
>>>>>>>>>>> though!
>>>>>>>>>>>
>>>>>>>>>>> For the first part a mechanism similar to MachineModulePass
>>>>>>>>>>> would be desirable but that may not be possible during this project, but if
>>>>>>>>>>> we can make some sort of smaller version of that to suit our purpose.
>>>>>>>>>>>
>>>>>>>>>>> I don't think we need to make any kind of MachineModulePass to
>>>>>>>>>>> make this work. Once we alter the visitation order based on the CGSCC
>>>>>>>>>>> iteration scheme, we can keep state in-between functions in the
>>>>>>>>>>> pre-existing hacky way (using static members of the relevant function
>>>>>>>>>>> passes).
>>>>>>>>>>>
>>>>>>>>>>>  Sorry my mistake here by first part I mean 1) requirement in
>>>>>>>>>> the link time approach.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I also don't see where/why we need a MachineModule(Pass) for the
>>>>>>>>>>> CGSCC scheme, that said I'd rather avoid using a function pass with static
>>>>>>>>>>> members, if we can have a ModuleAnalysis that is bookkeeping the results
>>>>>>>>>>> for functions in the module and queries by the register allocator somehow.
>>>>>>>>>>> Matthias/Quentin may have other inputs on this aspect.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> @Hal do you mean to add a simple MachineFunction pass that will
>>>>>>>>> just operate on register allocated function and prepare a BitVector to
>>>>>>>>> indicate which register is being used by MachineFunction, and then use this
>>>>>>>>> pass as analysis pass (i.e just simply return static BitVector for
>>>>>>>>> clobbered register when register allocation for next function begins. This
>>>>>>>>> part is not much clear to me) this thing can be done by scheduling a pass
>>>>>>>>> post register allocation in lib/CodeGen/Passes.cpp
>>>>>>>>>
>>>>>>>>> void TargetPassConfig::addMachinePasses() {
>>>>>>>>> .
>>>>>>>>> .
>>>>>>>>> .
>>>>>>>>>   // Run pre-ra passes.
>>>>>>>>>   addPreRegAlloc();
>>>>>>>>>
>>>>>>>>>   // Run register allocation and passes that are tightly coupled
>>>>>>>>> with it,
>>>>>>>>>   // including phi elimination and scheduling.
>>>>>>>>>   if (getOptimizeRegAlloc())
>>>>>>>>>     addOptimizedRegAlloc(createRegAllocPass(true));
>>>>>>>>>   else
>>>>>>>>>     addFastRegAlloc(createRegAllocPass(false));
>>>>>>>>>
>>>>>>>>>   // Run post-ra passes.
>>>>>>>>>   addPostRegAlloc();
>>>>>>>>> // Adding a new pass here which keeps register mask information
>>>>>>>>> across function calls.
>>>>>>>>> .
>>>>>>>>> .
>>>>>>>>> .
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> But this also requires current register allocators to use this
>>>>>>>>> information in someway because RegMaskBits in LiveIntervalAnalysis.cpp is
>>>>>>>>> not static across calls. I mean I am not clear for how to propagate static
>>>>>>>>> info to Intra-procedural Register allocators (if possible without
>>>>>>>>> disturbing their code )
>>>>>>>>>
>>>>>>>>> First, my hope is that we won't need to change the register
>>>>>>>>> allocators, as such, in order to make use of this information. Instead,
>>>>>>>>> we'll simply be able to alter the register masks generated for the call
>>>>>>>>> instructions. These masks will indicate fewer clobbers than might otherwise
>>>>>>>>> be present based on the ABI because of information gathered during the
>>>>>>>>> codegen of the callee. These masks are generally constructed by target
>>>>>>>>> based on the calling convention. The PowerPC backend, for example, looks
>>>>>>>>> like this:
>>>>>>>>>
>>>>>>>>>   // Add a register mask operand representing the call-preserved
>>>>>>>>> registers.
>>>>>>>>>   const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo();
>>>>>>>>>   const uint32_t *Mask =
>>>>>>>>>       TRI->getCallPreservedMask(DAG.getMachineFunction(),
>>>>>>>>> CallConv);
>>>>>>>>>   assert(Mask && "Missing call preserved mask for calling
>>>>>>>>> convention");
>>>>>>>>>   Ops.push_back(DAG.getRegisterMask(Mask));
>>>>>>>>>
>>>>>>>>> but it can be more complicated. If you look for uses of
>>>>>>>>> 'getRegisterMask' in Target/*/*ISelLowering.cpp, you'll see what I mean.
>>>>>>>>> Regardless, the code ends up calling some method is the targets
>>>>>>>>> TargetRegisterInfo subclass. These methods generally look something like
>>>>>>>>> this:
>>>>>>>>>
>>>>>>>>> const uint32_t *
>>>>>>>>> PPCRegisterInfo::getCallPreservedMask(const MachineFunction &MF,
>>>>>>>>>                                       CallingConv::ID CC) const {
>>>>>>>>>   const PPCSubtarget &Subtarget = MF.getSubtarget<PPCSubtarget>();
>>>>>>>>>   ...
>>>>>>>>>   return TM.isPPC64() ? (Subtarget.hasAltivec() ?
>>>>>>>>> CSR_SVR464_Altivec_RegMask
>>>>>>>>>                                                 :
>>>>>>>>> CSR_SVR464_RegMask)
>>>>>>>>>                       : (Subtarget.hasAltivec() ?
>>>>>>>>> CSR_SVR432_Altivec_RegMask
>>>>>>>>>                                                 :
>>>>>>>>> CSR_SVR432_RegMask);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> In any case, the fundamental idea here is that, when someone calls
>>>>>>>>> getCallPreservedMask in order to set the regmask on a call, we might not
>>>>>>>>> have to use the CC at all. Instead, if we've already codegened the
>>>>>>>>> function, we might use a cache of 'exact' register masks computed during
>>>>>>>>> codegen of the potential callees instead.
>>>>>>>>>
>>>>>>>> I am thinking to add a simple Immutable pass
>>>>>>>> MachineRegisterUsageInfo similar to MachineBranchProbabilityInfo that can
>>>>>>>> maintain RegisterUsageInformation per function. Can it be simply done by
>>>>>>>> using UsedPhysRegMask from MachineRegisterInfo ??
>>>>>>>>
>>>>>>>>
>>>>>>>> No, like the comment said, UsedPhysRegMask gives only the registers
>>>>>>>> clobbered by calls:
>>>>>>>> // This bit vector represents all the registers clobbered by
>>>>>>>> function calls.
>>>>>>>>
>>>>>>>> You want to build this information yourself on top of
>>>>>>>> MachineRegisterInfo::isPhysRegModified
>>>>>>>>
>>>>>>> Ok but then the time complexity will be O(n) n = number of physical
>>>>>>> register on the target. Am I going correct?
>>>>>>>
>>>>>>>
>>>>>>> Yes, this is correct.
>>>>>>>
>>>>>>>
>>>>>>>> Here getCallPreservedMask will call API provided by
>>>>>>>> MachineRegisterUsageInfo to avail the exact register mask but how it can
>>>>>>>> know that the function is already codegen or it will query each time when
>>>>>>>> getCallPreservedMask is called and of available MachineRegisterUsageInfo
>>>>>>>> will return the details otherwise simply return NULL.
>>>>>>>> So changes will be now in TargetRegisterInfo implementation for
>>>>>>>> each target right ??
>>>>>>>>
>>>>>>>>
>>>>>>>>> In order to do this, I think we'll need to provide a function
>>>>>>>>> callable from the target's getCallPreservedMask implementation, which can
>>>>>>>>> return such an 'exact' regmask when available. I think we need to do it
>>>>>>>>> this way for two reasons:
>>>>>>>>>
>>>>>>>>>  1. Not all of the target code calls getCallPreservedMask, but
>>>>>>>>> sometimes calls other similar target-specific functions (e.g.
>>>>>>>>> getTLSCallPreservedMask).
>>>>>>>>>  2. The targets need to opt-in to this behavior because only the
>>>>>>>>> target can know that all register uses are really tagged correctly post
>>>>>>>>> "pre-emit".
>>>>>>>>>
>>>>>>>>> Because the target is free to introduce uses of registers at
>>>>>>>>> essentially any time, we need to do the scanning for used registers after
>>>>>>>>> the "pre-emit" passes run. This can be done by scheduling some simple
>>>>>>>>> register-use scanning pass after the call to addPreEmitPass in
>>>>>>>>> lib/CodeGen/Passes.cpp.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think this also applies in someway to Mehdi Amini's idea to keep
>>>>>>>>> a ModulePass for bookkeeping but then existing register allocators will be
>>>>>>>>> required to change so that the code can query the ModulePass for
>>>>>>>>> RegMaskBits for particular function.
>>>>>>>>>
>>>>>>>>> I think that the simplest way to do this is to create an immutable
>>>>>>>>> analysis pass (e.g. BasicAA) that keeps the cache of the computed register
>>>>>>>>> masks. This is somewhat similar in spirit to how the 'AssumptionCache'
>>>>>>>>> analysis works at the IR level. This analysis can then be created by
>>>>>>>>> lib/CodeGen/Passes.cpp early, and then queried and passed around later by
>>>>>>>>> the CodeGen/Target code. Because it is an immutable analysis, it won't get
>>>>>>>>> destroyed until the very end, which is also important because, I imagine,
>>>>>>>>> it will need to own the memory associated with the generated register masks.
>>>>>>>>>
>>>>>>>>>  -Hal
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Vivek
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> Yes for propagating register usage approach we don't need
>>>>>>>>>> MachineModulePass
>>>>>>>>>>
>>>>>>>>>> Vivek
>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Mehdi
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Hal Finkel
>>>>>>>>> Assistant Computational Scientist
>>>>>>>>> Leadership Computing Facility
>>>>>>>>> Argonne National Laboratory
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Hal Finkel
>>>>>> Assistant Computational Scientist
>>>>>> Leadership Computing Facility
>>>>>> Argonne National Laboratory
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Hal Finkel
>>>>> Assistant Computational Scientist
>>>>> Leadership Computing Facility
>>>>> Argonne National Laboratory
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160526/27ec29ba/attachment.html>