[LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM

Hal Finkel hfinkel at anl.gov
Sat Feb 15 16:24:12 PST 2014


----- Original Message -----
> From: "Radovan Obradovic" <Radovan.Obradovic at imgtec.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: llvmdev at cs.uiuc.edu, "chandlerc" <chandlerc at google.com>, "Andrew Trick" <atrick at apple.com>, "Zoran Jovanovic"
> <Zoran.Jovanovic at imgtec.com>, "Petar Jovanovic" <Petar.Jovanovic at imgtec.com>, "Rich Fuhler"
> <Rich.Fuhler at imgtec.com>, clattner at apple.com
> Sent: Thursday, December 19, 2013 6:21:16 AM
> Subject: RE: [LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM
> 
> Hal,
> 
> Here is a slightly modified iterative compilation patch that reduces
> amount
> of boilerplate code at the points where the decisions are made.

Thanks! That looks more manageable. There is going to be *a lot* of work involved in getting something like this integrated; but if you're up for that, as a next step, please rebase the patch and upload it to phabricator (see http://llvm.org/docs/Phabricator.html). That will make it easier to review in detail.

 -Hal

> 
> Radovan
> 
> > ________________________________________
> > From: Radovan Obradovic
> > Sent: Tuesday, December 17, 2013 6:33 PM
> > To: Hal Finkel
> > Cc: llvmdev at cs.uiuc.edu; chandlerc; Andrew Trick; Zoran Jovanovic;
> > Petar Jovanovic
> > Subject: RE: [LLVMdev] [RFC] Iterrative compilation framework for
> > Clang/LLVM
> > 
> > Hal,
> > 
> > Thank you for finding interest in our work!
> > Please find my answers inlined below.
> > 
> > >
> > > ________________________________________
> > > From: Hal Finkel [hfinkel at anl.gov]
> > > Sent: Monday, December 16, 2013 4:26 PM
> > > To: Radovan Obradovic
> > > Cc: llvmdev at cs.uiuc.edu; chandlerc; Andrew Trick
> > > Subject: Re: [LLVMdev] [RFC] Iterrative compilation framework for
> > > Clang/LLVM
> > >
> > > Radovan,
> > >
> > > Thanks for posting this! I would really like to have this kind of
> > > functionality available. There are a lot of things to think
> > > through
> > > here; comments inline...
> > >
> > > ----- Original Message -----
> > > > From: "Radovan Obradovic" <Radovan.Obradovic at imgtec.com>
> > > > To: llvmdev at cs.uiuc.edu
> > > > Sent: Monday, December 16, 2013 11:31:21 AM
> > > > Subject: [LLVMdev] [RFC] Iterrative compilation framework for
> > > > Clang/LLVM
> > > >
> > > >
> > > >
> > > >
> > > > This is a first step towards to full iterative compilation
> > > > framework
> > > > for
> > > > Clang/LLVM. The attached patch is work in progress and the idea
> > > > of
> > > > sending
> > > > it to the list earlier rather than later is to open up a
> > > > discussion
> > > > on
> > > > iterative compilation in Clang/LLVM. All comments are welcome,
> > > > especially
> > > > those related to integration of the framework into Clang/LLVM.
> > >
> > > I'm glad you're doing this, and I hope that you'll be willing to
> > > stick though what might be a long design discussion :)
> > 
> > Our belief is that this approach can be beneficial, especially
> > for non-orthogonal architectures (e.g. microMIPS), thus we plan
> > to continue our work on this.
> > 
> > >
> > > >
> > > > Current compilers have pipeline structure in which data from
> > > > one
> > > > phase is
> > > > passed to the following one. But these compilation phases can
> > > > interfere in
> > > > unpredictable way so compilers use conservative heuristics on
> > > > which
> > > > they
> > > > make decisions during the compilation process. Because of
> > > > unpredictability
> > > > of these interferences, compilers sometimes make decisions
> > > > which can
> > > > degrade
> > > > code quality. Iterative compilation could partially resolve
> > > > these
> > > > problems
> > > > by allowing compiler to explore some alternative decisions,
> > > > generate
> > > > code
> > > > for these new decision paths and take the generated code that
> > > > is the
> > > > best
> > > > for a particular criteria eventually.
> > > >
> > > > This approach is not new in the literature, we can mention
> > > > Milepost/CTuning
> > > > [http://ctuning.org] among other projects. In this approach, in
> > > > comparison
> > > > to Milepost/CTuning, the emphasis is on much finer granularity
> > > > by
> > > > changing
> > > > arbitrary compiler decisions, and not only those based on
> > > > values of
> > > > command
> > > > line options, at any point in translation process.
> > > >
> > > > This framework implements iterative approach in which in every
> > > > iteration
> > > > slightly different code is generated. At the end of compilation
> > > > process the
> > > > best emitted code is chosen.
> > > >
> > > > All points at which relevant decisions are made should be
> > > > identified.
> > > > Access to framework data structure (DecisionTree) should be
> > > > provided
> > > > at
> > > > these points. Based on the data from DecisionTree, obtained by
> > > > calling
> > > > getDecision member function, at these points decisions are made
> > > > whether to
> > > > execute default actions (as compiler would do without this
> > > > framework)
> > > > or to
> > > > follow alternative paths.
> > > >
> > > > Let us consider Inliner example:
> > > >
> > > > Original code:
> > > >
> > > > if (!IC) {
> > > > DEBUG(dbgs() << " NOT Inlining: cost=" << IC.getCost()
> > > > << ", thres=" << (IC.getCostDelta() + IC.getCost())
> > > > << ", Call: " << *CS.getInstruction() << "\n");
> > > > return false;
> > > > }
> > > >
> > > > Code after instrumentalization:
> > > >
> > > > bool doInlinging = (bool)IC;
> > > > DecisionTreeProxy *decisionTreeProxy;
> > > > decisionTreeProxy =
> > > > moduleDecisionTreeProxies->getCurrFnDecisionTreeProxy();
> > > > int t = MaxDtPriority - abs(IC.getCostDelta());
> > > > bool decision = false;
> > > > if (t >= 0)
> > > > decision = decisionTreeProxy->getDecision(call);
> > > > if (decision)
> > > > doInlinging = !doInlinging;
> > > >
> > > > if (!doInlinging) {
> > > > DEBUG(dbgs() << " NOT Inlining: cost=" << IC.getCost()
> > > > << ", thres=" << (IC.getCostDelta() + IC.getCost())
> > > > << ", Call: " << *CS.getInstruction() << "\n");
> > > > return false;
> > > > }
> > >
> > > Talking a quick look over the patch, this seems to introduce a
> > > lot
> > > of boilerplate code. Can you think of a design that simplifies
> > > this?
> > 
> > It is necessary to access decision tree at every place where you
> > want to call getDecision function and priority parameter must be
> > calculated which is specific to every such place. It is only a
> > few lines of code but we will consider how to improve this.
> > All suggestions are welcomed.
> > 
> > >
> > > >
> > > > Function getDecision returns true or false. If false is
> > > > returned than
> > > > this
> > > > means to take default path and true means to take alternative
> > > > one.
> > > >
> > > > Compilation process can be represented by sequence of binary
> > > > values
> > > > where
> > > > every value determines decision made at some point. Value FALSE
> > > > means
> > > > to
> > > > follow normal compilation process as original compiler would do
> > > > without
> > > > presence of iterative compilation framework. TRUE value means
> > > > to take
> > > > alternative path or to make opposite decision at that point.
> > > > This
> > > > naturally
> > > > forms tree like structure which we named Decision Tree.
> > > > Traversing the tree from the root to the leaf of the tree
> > > > provides
> > > > all the
> > > > decisions made during one iteration in compilation process.
> > > >
> > > > At the end of each iteration, quality of generated code is
> > > > estimated
> > > > by
> > > > executing newly introduced target dependent pass. Based on
> > > > results
> > > > path for
> > > > the following iteration is calculated. At the moment, this has
> > > > been
> > > > proved
> > > > for MIPS only and it is based on code size.
> > >
> > > I think that, in general, we ought to be able to get an overall
> > > latency estimate from the instruction scheduler.
> > 
> > Criteria to optimize for speed should be added as well. Do you
> > think that latency info form scheduler can be used to estimate
> > execution speed of generated code? Can we guess somehow the
> > execution
> > frequency of basic blocks?
> > 
> > >
> > > >
> > > > Initially iterative compilation framework was implemented in
> > > > CodeGen
> > > > phase
> > > > but large number of code transformations was already performed
> > > > in
> > > > earlier
> > > > phases. This reduces possibilities to improve quality of
> > > > generated
> > > > code.
> > > > Thus iterative compilation framework is applied to Clang driver
> > > > where
> > > > it can
> > > > influence every single compiler phase. This relocation resulted
> > > > in
> > > > some
> > > > implementation issues mostly related to the fact that driver
> > > > and the
> > > > compiler itself (-cc1) are located in different processes. This
> > > > is
> > > > resolved
> > > > by introducing DecisionTreeProxy class that implements
> > > > communication
> > > > between
> > > > these two processes through temporary files.
> > >
> > > I think that this can be solved by introducing some API for the
> > > necessary data flow. There is obviously already a binding here.
> > 
> > This probably needs to be fixed. All suggestions are welcomed.
> > 
> > >
> > > >
> > > > At this point mechanism is applied to two optimizations:
> > > > Inliner and
> > > > Loop
> > > > Invariant Code Motion. Supported platform is MIPS32r2.
> > > >
> > > > Despite small number of instrumented points we got improvements
> > > > on
> > > > some test
> > > > files from CSiBE benchmark. Here are the promising results:
> > > >
> > > > jpeg-6b/jidctflt.c -10.2%
> > > > OpenTCP-1.0.4/system.c -6.7%
> > > > libmspack/test/chmd_md5.c -5.8%
> > > > flex-2.5.31/regex.c -4.8%
> > >
> > > Very interesting. Out of curiosity, any slowdowns?
> > 
> > jidctflt.c is optimized due to the fact that LICM moves some
> > constant definitions out of the loop and increases the register
> > pressure so some spills occur. So optimized version, in this
> > case, should be even faster. But slowdowns are possible and we
> > use option -Oz so we are telling the compiler that code size
> > is the priority.
> > 
> > >
> > > >
> > > > We plan to apply this framework to more compiler phases and to
> > > > some
> > > > other
> > > > target architectures. It is expected to give better results on
> > > > some
> > > > non
> > > > orthogonal architectures like microMIPS. Another direction is
> > > > to
> > > > apply
> > > > machine learning based on some input program features (number
> > > > of
> > > > basic
> > > > blocks, register pressure at some point etc.) and it can be
> > > > used for
> > > > selecting compilation path for the next iteration or estimating
> > > > the
> > > > best
> > > > path for compiling program without using iterative approach
> > > > (with
> > > > only one
> > > > iteration).
> > > >
> > > >
> > > > o
> > > > / \
> > > > / \
> > > > o \
> > > > / \
> > > > / \
> > > > o o
> > > > / \ /
> > > > / \ /
> > > > L1 o o
> > > > / /
> > > > / /
> > > > L2 L3
> > > >
> > > > Figure 1: Decision Tree Example - Symbol 'o' represent decision
> > > > points. If
> > > > getDecision function returns FALSE the left branch is taken
> > > > (default
> > > > path),
> > > > otherwise the right branch is taken.
> > > > 'L1' compilation iteration can be represented as a string
> > > > (FALSE,FALSE,FALSE), 'L2' as (FALSE,FALSE,TRUE,FALSE) and 'L3'
> > > > as (TRUE, FALSE, FALSE).
> > > >
> > > >
> > > > Clearly, we see huge potential in this direction. At the same
> > > > time,
> > > > we are
> > > > aware this would cause compiler to be slower which everyone is
> > > > against, but
> > > > this penalty can be taken when we are in need for most
> > > > optimized
> > > > code.
> > >
> > > What is the overhead of the current implementation? I suspect
> > > that,
> > > in order to control the large number of decision points from
> > > large
> > > regions, we will want some way to limit fine-grained decisions in
> > > the tree in large regions (and only including coarse ones, like
> > > the
> > > inlining of relatively large functions). In small blocks, on the
> > > other hand, we might even include instruction scheduling
> > > decisions
> > > (for in-order cores).
> > 
> > In the current implementation one full compilation per
> > decision is needed. Plan is that, for CodeGen pass, do once all the
> > passes of CodeGen per decision (to reuse bitcode). Machine learning
> > should
> > considerably reduce the need for the large number of iterations.
> > 
> > >
> > > I'd been thinking of experimenting with some iterative scheduling
> > > for my in-order PPC cores (for small blocks in loops), and
> > > integrating it into something like this might be even better.
> > >
> > > >
> > > > What does everyone think?
> > > > Could this approach be considered as a valid mode for
> > > > Clang/LLVM?
> > > >
> > > > Please use this post and the attached patch (it can be applied
> > > > to 3.3
> > > > release - clang should be inside llvm/tools folder) as a
> > > > reference
> > > > and
> > > > not the patch for submission.
> > >
> > > I see in the patch some (seemingly small) changes to the pass
> > > manager
> > > setup. Can you describe exactly what you've done. Chandler is
> > > rewriting the pass infrastructure, and I'd like to understand how
> > > what you've done will or won't fit with the new setup.
> > 
> > Pointer to decision tree proxy has to be passed to every place
> > where getDecision is called and to the pass that calculates
> > function
> > fitness. So pointer to it is added in Pass and PassManagerBase
> > classes.
> > We will provide more detail explanation later.
> > 
> > >
> > > Thanks again,
> > > Hal
> > >
> > > >
> > > > Example command line:
> > > > > clang -target mipsel-unknown-linux -Oz -S -fiterative-comp=3
> > > > > example.c
> > > >
> > > > Thank you. Looking forward to hearing your comments on this.
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > LLVM Developers mailing list
> > > > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> > > >
> > >
> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory



More information about the llvm-dev mailing list