<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Sep 26, 2019 at 5:31 PM Sriraman Tallam <<a href="mailto:tmsriram@google.com">tmsriram@google.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Thu, Sep 26, 2019 at 5:13 PM Eli Friedman <<a href="mailto:efriedma@quicinc.com" target="_blank">efriedma@quicinc.com</a>> wrote:<br>

><br>

> > -----Original Message-----<br>

> > From: Sriraman Tallam <<a href="mailto:tmsriram@google.com" target="_blank">tmsriram@google.com</a>><br>

> > Sent: Thursday, September 26, 2019 3:24 PM<br>

> > To: Eli Friedman <<a href="mailto:efriedma@quicinc.com" target="_blank">efriedma@quicinc.com</a>><br>

> > Cc: Xinliang David Li <<a href="mailto:xinliangli@gmail.com" target="_blank">xinliangli@gmail.com</a>>; llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>><br>

> > Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link<br>

> > Optimizations<br>

> ><br>

> > On Thu, Sep 26, 2019 at 12:39 PM Eli Friedman <<a href="mailto:efriedma@quicinc.com" target="_blank">efriedma@quicinc.com</a>> wrote:<br>

> > ><br>

> > ><br>

> > ><br>

> > > From: Xinliang David Li <<a href="mailto:xinliangli@gmail.com" target="_blank">xinliangli@gmail.com</a>><br>

> > > Sent: Wednesday, September 25, 2019 5:58 PM<br>

> > > To: Eli Friedman <<a href="mailto:efriedma@quicinc.com" target="_blank">efriedma@quicinc.com</a>><br>

> > > Cc: Sriraman Tallam <<a href="mailto:tmsriram@google.com" target="_blank">tmsriram@google.com</a>>; llvm-dev <llvm-<br>

> > <a href="mailto:dev@lists.llvm.org" target="_blank">dev@lists.llvm.org</a>><br>

> > > Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link<br>

> > Optimizations<br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > > On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman via llvm-dev <llvm-<br>

> > <a href="mailto:dev@lists.llvm.org" target="_blank">dev@lists.llvm.org</a>> wrote:<br>

> > ><br>

> > > My biggest question about this architecture is about when propeller runs basic<br>

> > block reordering within a function.  It seems like a lot of the complexity comes<br>

> > from using the proposed -fbasicblock-sections to generate mangled ELF, and<br>

> > then re-parsing the mangled ELF as a separate step.  I'm not sure that's the right<br>

> > approach, long-term.<br>

> > ><br>

> > > Splitting every basic block into its own section introduces overhead, like you<br>

> > described.  And it's likely more complex on non-x86 targets, which have a<br>

> > greater variety of conditional branches.  And the reordering itself involves a<br>

> > bunch of x86 and ELF-specific code.<br>

> > ><br>

> > > I'd like to suggest an alternative: instead of performing basic block reordering and<br>

> > function splitting by manipulating the ELF files, you could perform reordering<br>

> > and splitting as part of link-time code generation, as an MIR pass that runs just<br>

> > before the assembly printer.  MIR is almost exactly the form you want for this<br>

> > sort of manipulation: it has basic blocks which correspond closely to the final<br>

> > binary, and a high-level representation of branch instructions.<br>

> > ><br>

> > ><br>

> > ><br>

> > > This was considered for Propeller.  A similar approach is currently being explored<br>

> > as an alternative to CSFDO, one that uses PMU samples.<br>

> > ><br>

> > ><br>

> > ><br>

> > > Makes sense.<br>

> > ><br>

> > ><br>

> > ><br>

> > > And it's before the DWARF/CFI emission, so you don't need to worry about<br>

> > fixing them afterwards.  This should take less code overall, and much less target-<br>

> > specific code. And infrastructure for function splitting would be useful for non-<br>

> > Propeller workflows.<br>

> > ><br>

> > > There are some minor downsides to this approach I can think of.  You lose a<br>

> > little flexibility, in that you can't mix blocks from different functions together,<br>

> > but you aren't doing that anyway, from your description?<br>

> > ><br>

> > ><br>

> > ><br>

> > > One of the main design objectives of Propeller is to have the capability to do<br>

> > interprocedural code transformations (reordering, cloning, deduplication, etc.), so<br>

> > this won't be a minor downside. Function/block alignment (for branch<br>

> > misprediction reduction etc) will also need to be done as a global optimization in<br>

> > the future.<br>

> > ><br>

> > ><br>

> > ><br>

> > > Okay, so my suggestion doesn’t work. I’m still concerned the proposed design<br>

> > is going to push us in a direction we don’t want to go.  Particularly, if you’re<br>

> > going to attempt more complicated transforms, the problems caused by the<br>

> > limited information available in an ELF file will become more prominent.  I mean,<br>

> > yes, you can come up with a more complicated implicit contract between the<br>

> > compiler and Propeller about the exact format of Propeller basic blocks, and add<br>

> > additional symbol annotations, and eventually come up with an “IR” that allows<br>

> > Propeller to perform arbitrary code transforms.  But that’s inevitably going to be<br>

> > more complicated, and harder to understand, than a compiler IR designed for<br>

> > optimizations.<br>

> ><br>

> > Thanks for the feedback, I am not sure I fully understand your<br>

> > concerns but let me try to make some of the things clearer:<br>

> ><br>

> > *  Propeller relinks. Specifically, it regenerates ELF object files<br>

> > from MIR.  Even if MIR were serializable, we would still be starting<br>

> > before CFI instruction inserter pass and then regenerate the native<br>

> > ELF objects.<br>

><br>

> Now I'm confused.  Why are you regenerating the ELF files?<br>

<br>

TL;DR: We are regenerating ELF files from optimized IR to keep the<br>

cost of generating basic block sections low.<br>

<br>

If what you describe above is done, where we generate basic block<br>

sections even before we profile, we must conservatively generate<br>

sections for all functions.  This is unnecessarily expensive and we<br>

have shown that generating them on demand based on profiles is much<br>

cheaper.  Since the backend generation can be distributed or<br>

parallelized, regenerating is not expensive.  If we could reduce the<br>

size bloat caused by basic block sections, we could<br>

do what you just suggested here, that is, always build with basic<br>

block sections for all basic blocks.<br>

<br>

><br>

> I thought the workflow was something like the following:<br>

><br>

> 1. The compiler (or LTO codegen) generates ELF files with basic block sections.<br>

<br>

The correction here is that we first generate with basic block labels,<br>

not sections.<br>

<br>

> 2. You link once without propeller optimizations.<br>

<br>

This is correct.<br>

<br>

> 3. You collect the propeller profile.<br>

<br>

This is correct.<br>

<br>

> 4. You use the same ELF files to relink with propeller optimizations.<br>

<br>

We use the same optimized IR files to regenerate the ELF files.  We<br>

could use optimized MIR here but serializability of MIR is not well<br>

supported yet and is part of the larger plan.  We only need to re-run<br>

CFI instruction inserter in MIR.  For future optimizations, we can<br>

plug in here and make small code transformations for optimizations<br>

like prefetch insertion.  If we attempt to do basic block re-ordering<br>

here, we would be limited to intra-procedural basic block reordering<br>

which is sub-optimal.  Hence, we generate basic block sections here<br>

only for the sampled functions, which are a small percentage.<br>

<br>
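A minimal sketch of that selection step, not Propeller's actual code: it assumes a<br>
hypothetical map of per-function sample counts and treats any function with a<br>
nonzero count as "sampled".  The function names and counts are made up.<br>
<pre>
```python
from typing import Dict, List, Tuple


def split_by_samples(samples: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Partition functions into sampled ones and the rest.

    Assumption: a function counts as "sampled" if its sample count is nonzero.
    Sampled functions would be regenerated with basic block sections; the
    others would keep ordinary per-function sections.
    """
    with_bb_sections = sorted(n for n, c in samples.items() if c != 0)
    whole_function = sorted(n for n, c in samples.items() if c == 0)
    return with_bb_sections, whole_function


# Hypothetical sample counts per function, for illustration only.
profile = {"parse": 9120, "emit": 310, "init": 0, "usage": 0, "cleanup": 0}
sampled, unsampled = split_by_samples(profile)
print("basic block sections:", sampled)
print("function sections only:", unsampled)
```
</pre>
Only the functions in the first list would be regenerated with basic block<br>
sections, which is what keeps the size overhead small.<br>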

><br>

> That's sufficient to implement all the optimizations you described, as far as I can tell.<br>

><br>

> But it sounds like that's wrong?  It sounds like what you're describing is:<br>

><br>

> 1. LTO codegen generates MIR files.<br>

> 2. You convert the MIR files to ELF object files, and link without propeller optimizations.  (Does this step have any propeller-specific code generation differences?)<br>

<br>

Yes, like I said above, basic block labels are used here to identify<br>

the basic blocks which are later used in the last step to annotate<br>

profiles.<br>

<br>

> 4. You collect the propeller profile.<br>

> 5. You apply some propeller optimization to the MIR files.<br>

> 6. You convert the optimized MIR files to ELF object files, with basic block sections enabled.<br>

<br>

Right, technically this is high level IR now as MIR is not<br>

serializable but this is ok for discussion.<br>

<br>

> 7. You link the ELF files, applying propeller reordering optimizations.<br>

><br>

> (And since you can't serialize MIR, you actually serialize LLVM IR, and run the code generator when you deserialize, to convert that to MIR.)<br>

><br>

> Is that correct?<br>

<br>

That is correct.<br>

<br>

><br>

> If you have the MIR at the time you're making all the propeller optimization decisions, why is the linker rewriting raw x86 assembly, as opposed to performing the relevant transforms on MIR?<br></blockquote><div><br></div><div>MIR level transformations are not excluded and can be done in Propeller framework, just limited to module level optimizations. Anything Global/Interprocedural needs to happen at (real) link time with assistance of module level annotations.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

MIR is still one module at a time.  We cannot do inter-procedural<br>

basic block layout here.  We can do much more advanced stuff at the<br>

whole-program level in the linker.  The relaxation code is the down<br>

side.<br>

<br>
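As a toy illustration of why whole-program layout is more flexible, the sketch<br>
below places the hot blocks of all functions together and moves cold blocks to<br>
the end.  The Block type, the counts, and the threshold are hypothetical, and<br>
sorting by count is only a stand-in for a real layout algorithm.<br>
<pre>
```python
from typing import List, NamedTuple


class Block(NamedTuple):
    function: str
    label: str
    count: int  # hypothetical execution count taken from a profile


def global_layout(blocks: List[Block], hot_threshold: int) -> List[str]:
    """Order blocks across function boundaries: hot blocks first, then cold.

    Assumption: a block is "hot" if its count reaches hot_threshold.
    Hot blocks are sorted by descending count; cold blocks keep input order.
    """
    hot = [b for b in blocks if b.count >= hot_threshold]
    cold = [b for b in blocks if b not in hot]
    hot.sort(key=lambda b: -b.count)
    return [f"{b.function}.{b.label}" for b in hot + cold]


# Two functions, possibly from different modules (made-up data).
blocks = [
    Block("foo", "entry", 100),
    Block("foo", "error", 0),
    Block("bar", "entry", 90),
    Block("bar", "slow", 1),
]
order = global_layout(blocks, hot_threshold=10)
print(order)
```
</pre>
Here the cold block of foo ends up after the hot entry block of bar, which a<br>
layout that keeps each function contiguous cannot express.<br>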

In more detail: we strongly considered this.  We could run something<br>

like the thin link in ThinLTO to figure out the global layout and hand out<br>

the relevant subsets of the global decision to each module.  This<br>

looked more complicated, and the individual pieces from each module<br>

would still have to be globally laid out again by the linker.  </blockquote><div><br></div><div>This won't work well, as cross-module inlining has not yet happened. Not only will there be problems with profile annotation, but all of the profile's context sensitivity will be lost.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">This<br>

constrains us in what we can do for layout and also does not work<br>

well with future optimizations such as global alignment, as David<br>

pointed out.<br>

<br>

><br>

> Why are you proposing to add a bunch of options to clang to manipulate LLVM code generation, given none of those options are actually used for Propeller workflows?<br>

<br></blockquote><div><br></div><div>Propeller workflows include precise profile annotations, so the options are used there.</div><div><br></div><div>David</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Where do you suggest labelling and section options should exist?  We<br>

treated basic block sections as similar to function sections in<br>

terms of option handling.<br>

<br>

Thanks<br>

Sri<br>

<br>

><br>

> -Eli<br>

</blockquote></div></div>