[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Thu Sep 26 19:02:50 PDT 2019

On Thu, Sep 26, 2019 at 5:31 PM Sriraman Tallam <tmsriram at google.com> wrote:

> On Thu, Sep 26, 2019 at 5:13 PM Eli Friedman <efriedma at quicinc.com> wrote:
> >
> > > -----Original Message-----
> > > From: Sriraman Tallam <tmsriram at google.com>
> > > Sent: Thursday, September 26, 2019 3:24 PM
> > > To: Eli Friedman <efriedma at quicinc.com>
> > > Cc: Xinliang David Li <xinliangli at gmail.com>; llvm-dev <
> llvm-dev at lists.llvm.org>
> > > Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post
> Link
> > > Optimizations
> > >
> > > On Thu, Sep 26, 2019 at 12:39 PM Eli Friedman <efriedma at quicinc.com>
> wrote:
> > > >
> > > >
> > > >
> > > > From: Xinliang David Li <xinliangli at gmail.com>
> > > > Sent: Wednesday, September 25, 2019 5:58 PM
> > > > To: Eli Friedman <efriedma at quicinc.com>
> > > > Cc: Sriraman Tallam <tmsriram at google.com>; llvm-dev <llvm-
> > > dev at lists.llvm.org>
> > > > Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post
> Link
> > > Optimizations
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman via llvm-dev <llvm-
> > > dev at lists.llvm.org> wrote:
> > > >
> > > > My biggest question about this architecture is about when propeller
> runs basic
> > > block reordering within a function.  It seems like a lot of the
> complexity comes
> > > from using the proposed -fbasicblock-sections to generated mangled
> ELF, and
> > > then re-parsing the mangled ELF as a separate step.  I'm not sure
> that's the right
> > > approach, long-term.
> > > >
> > > > Splitting every basic block into its own section introduces
> overhead, like you
> > > described.  And it's likely more complex on non-x86 targets, which
> have a
> > > greater variety of conditional branches.  And the reordering itself
> involves a
> > > bunch of x86 and ELF-specific code.
> > > >
> > > > I'd like to suggest an alternative: instead of perform basic block
> reordering and
> > > function splitting by manipulating the ELF files, you could perform
> reordering
> > > and splitting as part of link-time code generation, as an MIR pass
> that runs just
> > > before the assembly printer.  MIR is almost exactly the form you want
> for this
> > > sort of manipulation: it has basic blocks which correspond closely to
> the final
> > > binary, and a high-level representation of branch instructions.
> > > >
> > > >
> > > >
> > > > This was considered for Propeller.  This  is currently being
> explored in a similar
> > > way as an alternative of CSFDO which uses PMU samples.
> > > >
> > > >
> > > >
> > > > Makes sense.
> > > >
> > > >
> > > >
> > > > And it's before the DWARF/CFI emission, so you don't need to worry
> about
> > > fixing them afterwards.  This should take less code overall, and much
> less target-
> > > specific code. And infrastructure for function splitting would be
> useful for non-
> > > Propeller workflows.
> > > >
> > > > There are some minor downsides to this approach I can think of.  You
> lose a
> > > little flexibility, in that you can't mix blocks from different
> functions together,
> > > but you aren't doing that anyway, from your description?
> > > >
> > > >
> > > >
> > > > One of the main design objectives of Propeller is to have the
> capability to do
> > > interprocedural code transformations (reordering, cloning, dedupping
> etc), so
> > > this won't be a minor downside. Function/block alignment (for branch
> > > misprediction reduction etc) will also need to be done as a global
> optimization in
> > > the future.
> > > >
> > > >
> > > >
> > > > Okay, so my suggestion doesn’t work. I’m still concerned the
> proposed design
> > > is going to push us in a direction we don’t want to go.  Particularly,
> if you’re
> > > going to attempt more complicated transforms, the problems caused by
> the
> > > limited information available in an ELF file will become more
> prominent.  I mean,
> > > yes, you can come up with a more complicated implicit contract between
> the
> > > compiler and Propeller about the exact format of Propeller basic
> blocks, and add
> > > additional symbol annotations, and eventually come up with an “IR”
> that allows
> > > Propeller to perform arbitrary code transforms.  But that’s inevitably
> going to be
> > > more complicated, and harder to understand, than a compiler IR
> designed for
> > > optimizations.
> > >
> > > Thanks for the feedback, I am not sure I fully understand your
> > > concerns but let me try to make some of the things clearer:
> > >
> > > *  Propeller relinks. Specifically, it regenerates ELF object files
> > > from MIR.  Even if MIR were serializable, we would still be starting
> > > before CFI instruction inserter pass and then regenerate the native
> > > ELF objects.
> >
> > Now I'm confused.  Why are you regenerating the ELF files?
>
> TLDR;  We are regenerating ELF files from optimized IR to keep the
> cost of generating basic block sections low.
>
> If what you describe above is done, where we generate basic block
> sections even before we profile, we must conservatively generate
> sections for all functions.  This is unnecessarily expensive and we
> have shown that generating it on demand based on profiles is much
> cheaper.  Since the backend generation can be distributed or
> parallelized, this is not expensive to do.    If we could make the
> cost of basic block sections in terms of size bloats cheaper we could
> do what you just suggested here, that is, always build with basic
> block sections for all basic blocks.
>
> >
> > I thought the workflow was something like the following:
> >
> > 1. The compiler (or LTO codegen) generates ELF files with basic block
> sections.
>
> The correction here is that we generate first with basic block labels
> and not sections.
>
> > 2. You link once without propeller optimizations.
>
> This is correct.
>
> > 3. You collect the propeller profile.
>
> This is correct.
>
> > 4. You use the same ELF files to relink with propeller optimizations.
>
> We use the same optimized IR files to regenerate the ELF files.  We
> could use optimized MIR here but serializability of MIR is not well
> supported yet and is part of the larger plan.  We only need to re-run
> CFI instruction inserter in MIR.  For future optimizations, we can
> plug in here and make small code transformations for optimizations
> like prefetch insertion.  If we attempt to do basic block re-ordering
> here, we would be limited to intra-procedural basic block reordering
> which is sub-optimal.  Hence, we generate basic block sections here
> only for the sampled functions which is a small %age.
>
> >
> > That's sufficient to implement all the optimizations you described, as
> far as I can tell.
> >
> > But it sounds like that's wrong?  It sounds like what you're describing
> is:
> >
> > 1. LTO codegen generates MIR files.
> > 2. You convert the MIR files to ELF object files, and link without
> propeller optimizations.  (Does this step have any propeller-specific code
> generation differences?)
>
> Yes, like I said above, basic block labels are used here to identify
> the basic blocks which are later used in the last step to annotate
> profiles.
>
> > 4. You collect the propeller profile.
> > 5. You apply some propeller optimization to the MIR files.
> > 6. You convert the optimized MIR files to ELF object files, with basic
> block sections enabled.
>
> Right, technically this is high level IR now as MIR is not
> serializable but this is ok for discussion.
>
> > 7. You link the ELF files, applying propeller reordering optimizations.
> >
> > (And since you can't serialize MIR, you actually serialize LLVM IR, and
> run the code generator when you deserialize, to convert that to MIR.)
> >
> > Is that correct?
>
> That is correct.
>
> >
> > If you have the MIR at the time you're making all the propeller
> optimization decisions, why is the linker rewriting raw x86 assembly, as
> opposed to performing the relevant transforms on MIR?
>

MIR level transformations are not excluded and can be done in Propeller
framework, just limited to module level optimizations. Anything
Global/Interprocedural needs to happen at (real) link time with assistance
of module level annotations.

>
> MIR is still one module at a time.  We cannot do inter-procedural
> basic block layout here.  We can do much more advanced stuff at the
> whole-program level in the linker.  The relaxation code is the down
> side.
>
> For more details, We strongly considered this.  We could run something
> like a thin link in thin lto figure out the global layout and hand out
> the relevant  subsets of the global decision  to each module.  This
> looked more complicated and the individual pieces from each module
> should still be globally laid out again by the linker.

This won't work well -- as cross module inlining has not yet happened. Not
only will there be problems with profile annotation, all the profile
context sensitives will be lost.

> This
> constraints us on what we can do for layout and also does not work
> well with future optimizations like global alignment like David
> pointed out.
>
> >
> > Why are you proposing to add a bunch of options to clang to manipulate
> LLVM code generation, given none of those options are actually used for
> Propeller workflows?
>
>
Propeller workflows include precise profile annotations, so the options are
used there.

David

> Where do you suggest labelling and section options should exist?  We
> looked at  basic block sections to be similar to function sections in
> terms of option handling?
>
> Thanks
> Sri
>
> >
> > -Eli
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190926/511eb860/attachment.html>