<html dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style id="owaParaStyle" type="text/css">P {margin-top:0;margin-bottom:0;}</style>
</head>
<body ocsi="0" fpstyle="1">
<div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;">Hello Hongbin,<br>
<br>
Thank you for your suggestions. It sounds like something we were looking for<br>
in order to better integrate iterative compilation framework into llvm.<br>
We plan to port this implementation to llvm 3.4 release and then we will try to<br>
follow your recommendations.<br>
<br>
Radovan<br>
<br>
> From: Hongbin Zheng [etherzhhb@gmail.com]<br>
> Sent: Wednesday, December 18, 2013 7:45 PM<br>
> To: Radovan Obradovic<br>
> Cc: Hal Finkel; Zoran Jovanovic; Petar Jovanovic; llvmdev@cs.uiuc.edu<br>
> Subject: Re: [LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM<br>
> <br>
> Hi Radovan,<br>
> I am also interested in the iterative compilation in LLVM, and had implemented<br>
> a simple iterative compilation driver. I guess you do not need to embedded the<br>
> pointer to ModuleDecisionTreeProxies into the core classes of llvm, i.e. the<br>
> PassManager class and the Function class. Instead, you could:<br>
> 1. Implement a special pass manager that runs a set of passes iteratively.<br>
> Implementing the iterative compilation framework based on pass manage<br>
> allows the framework to live outside of clang, and we can use it in opt,<br>
> llc, etc.<br>
> 2. Implement an immutable pass to store the global information<br>
> (i.e. ModuleDecisionTreeProxies if I do not misunderstand your patch)<br>
> across iterations. In your example, once you implemented<br>
> ModuleDecisionTreeProxies as an immutable pass, you can get<br>
> the ModuleDecisionTreeProxies in every pass you need it by simply<br>
> call "getAnalysis<ModuleDecisionTreeProxies>()"<br>
> Thanks<br>
> Hongbin<br>
> PS: I attached our iterative driver, you could have a look if you are interested.<br>
> <br>
> <br>
> On Wed, Dec 18, 2013 at 10:33 AM, Radovan Obradovic <Radovan.Obradovic@imgtec.com> wrote:<br>
> <br>
> Hal,<br>
> <br>
> Thank you for finding interest in our work!<br>
> Please find my answers inlined below.<br>
> <br>
> ><br>
> > ________________________________________<br>
> > From: Hal Finkel [hfinkel@anl.gov]<br>
> > Sent: Monday, December 16, 2013 4:26 PM<br>
> > To: Radovan Obradovic<br>
> > Cc: llvmdev@cs.uiuc.edu; chandlerc; Andrew Trick<br>
> > Subject: Re: [LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM<br>
> ><br>
> > Radovan,<br>
> ><br>
> > Thanks for posting this! I would really like to have this kind of<br>
> > functionality available. There are a lot of things to think through<br>
> > here; comments inline...<br>
> ><br>
> > ----- Original Message -----<br>
> > > From: "Radovan Obradovic" <Radovan.Obradovic@imgtec.com><br>
> > > To: llvmdev@cs.uiuc.edu<br>
> > > Sent: Monday, December 16, 2013 11:31:21 AM<br>
> > > Subject: [LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM<br>
> > ><br>
> > ><br>
> > ><br>
> > ><br>
> > > This is a first step towards to full iterative compilation framework<br>
> > > for<br>
> > > Clang/LLVM. The attached patch is work in progress and the idea of<br>
> > > sending<br>
> > > it to the list earlier rather than later is to open up a discussion<br>
> > > on<br>
> > > iterative compilation in Clang/LLVM. All comments are welcome,<br>
> > > especially<br>
> > > those related to integration of the framework into Clang/LLVM.<br>
> ><br>
> > I'm glad you're doing this, and I hope that you'll be willing to<br>
> > stick though what might be a long design discussion :)<br>
> <br>
> Our belief is that this approach can be beneficial, especially<br>
> for non-orthogonal architectures (e.g. microMIPS), thus we plan<br>
> to continue our work on this.<br>
> <br>
> ><br>
> > ><br>
> > > Current compilers have pipeline structure in which data from one<br>
> > > phase is<br>
> > > passed to the following one. But these compilation phases can<br>
> > > interfere in<br>
> > > unpredictable way so compilers use conservative heuristics on which<br>
> > > they<br>
> > > make decisions during the compilation process. Because of<br>
> > > unpredictability<br>
> > > of these interferences, compilers sometimes make decisions which can<br>
> > > degrade<br>
> > > code quality. Iterative compilation could partially resolve these<br>
> > > problems<br>
> > > by allowing compiler to explore some alternative decisions, generate<br>
> > > code<br>
> > > for these new decision paths and take the generated code that is the<br>
> > > best<br>
> > > for a particular criteria eventually.<br>
> > ><br>
> > > This approach is not new in the literature, we can mention<br>
> > > Milepost/CTuning<br>
> > > [http://ctuning.org] among other projects. In this approach, in<br>
> > > comparison<br>
> > > to Milepost/CTuning, the emphasis is on much finer granularity by<br>
> > > changing<br>
> > > arbitrary compiler decisions, and not only those based on values of<br>
> > > command<br>
> > > line options, at any point in translation process.<br>
> > ><br>
> > > This framework implements iterative approach in which in every<br>
> > > iteration<br>
> > > slightly different code is generated. At the end of compilation<br>
> > > process the<br>
> > > best emitted code is chosen.<br>
> > ><br>
> > > All points at which relevant decisions are made should be identified.<br>
> > > Access to framework data structure (DecisionTree) should be provided<br>
> > > at<br>
> > > these points. Based on the data from DecisionTree, obtained by<br>
> > > calling<br>
> > > getDecision member function, at these points decisions are made<br>
> > > whether to<br>
> > > execute default actions (as compiler would do without this framework)<br>
> > > or to<br>
> > > follow alternative paths.<br>
> > ><br>
> > > Let us consider Inliner example:<br>
> > ><br>
> > > Original code:<br>
> > ><br>
> > > if (!IC) {<br>
> > > DEBUG(dbgs() << " NOT Inlining: cost=" << IC.getCost()<br>
> > > << ", thres=" << (IC.getCostDelta() + IC.getCost())<br>
> > > << ", Call: " << *CS.getInstruction() << "\n");<br>
> > > return false;<br>
> > > }<br>
> > ><br>
> > > Code after instrumentalization:<br>
> > ><br>
> > > bool doInlinging = (bool)IC;<br>
> > > DecisionTreeProxy *decisionTreeProxy;<br>
> > > decisionTreeProxy =<br>
> > > moduleDecisionTreeProxies->getCurrFnDecisionTreeProxy();<br>
> > > int t = MaxDtPriority - abs(IC.getCostDelta());<br>
> > > bool decision = false;<br>
> > > if (t >= 0)<br>
> > > decision = decisionTreeProxy->getDecision(call);<br>
> > > if (decision)<br>
> > > doInlinging = !doInlinging;<br>
> > ><br>
> > > if (!doInlinging) {<br>
> > > DEBUG(dbgs() << " NOT Inlining: cost=" << IC.getCost()<br>
> > > << ", thres=" << (IC.getCostDelta() + IC.getCost())<br>
> > > << ", Call: " << *CS.getInstruction() << "\n");<br>
> > > return false;<br>
> > > }<br>
> ><br>
> > Talking a quick look over the patch, this seems to introduce a lot<br>
> > of boilerplate code. Can you think of a design that simplifies this?<br>
> <br>
> It is necessary to access decision tree at every place where you<br>
> want to call getDecision function and priority parameter must be<br>
> calculated which is specific to every such place. It is only a<br>
> few lines of code but we will consider how to improve this.<br>
> All suggestions are welcomed.<br>
> <br>
> ><br>
> > ><br>
> > > Function getDecision returns true or false. If false is returned than<br>
> > > this<br>
> > > means to take default path and true means to take alternative one.<br>
> > ><br>
> > > Compilation process can be represented by sequence of binary values<br>
> > > where<br>
> > > every value determines decision made at some point. Value FALSE means<br>
> > > to<br>
> > > follow normal compilation process as original compiler would do<br>
> > > without<br>
> > > presence of iterative compilation framework. TRUE value means to take<br>
> > > alternative path or to make opposite decision at that point. This<br>
> > > naturally<br>
> > > forms tree like structure which we named Decision Tree.<br>
> > > Traversing the tree from the root to the leaf of the tree provides<br>
> > > all the<br>
> > > decisions made during one iteration in compilation process.<br>
> > ><br>
> > > At the end of each iteration, quality of generated code is estimated<br>
> > > by<br>
> > > executing newly introduced target dependent pass. Based on results<br>
> > > path for<br>
> > > the following iteration is calculated. At the moment, this has been<br>
> > > proved<br>
> > > for MIPS only and it is based on code size.<br>
> ><br>
> > I think that, in general, we ought to be able to get an overall<br>
> > latency estimate from the instruction scheduler.<br>
> <br>
> Criteria to optimize for speed should be added as well. Do you<br>
> think that latency info form scheduler can be used to estimate<br>
> execution speed of generated code? Can we guess somehow the execution<br>
> frequency of basic blocks?<br>
> <br>
> ><br>
> > ><br>
> > > Initially iterative compilation framework was implemented in CodeGen<br>
> > > phase<br>
> > > but large number of code transformations was already performed in<br>
> > > earlier<br>
> > > phases. This reduces possibilities to improve quality of generated<br>
> > > code.<br>
> > > Thus iterative compilation framework is applied to Clang driver where<br>
> > > it can<br>
> > > influence every single compiler phase. This relocation resulted in<br>
> > > some<br>
> > > implementation issues mostly related to the fact that driver and the<br>
> > > compiler itself (-cc1) are located in different processes. This is<br>
> > > resolved<br>
> > > by introducing DecisionTreeProxy class that implements communication<br>
> > > between<br>
> > > these two processes through temporary files.<br>
> ><br>
> > I think that this can be solved by introducing some API for the<br>
> > necessary data flow. There is obviously already a binding here.<br>
> <br>
> This probably needs to be fixed. All suggestions are welcomed.<br>
> <br>
> ><br>
> > ><br>
> > > At this point mechanism is applied to two optimizations: Inliner and<br>
> > > Loop<br>
> > > Invariant Code Motion. Supported platform is MIPS32r2.<br>
> > ><br>
> > > Despite small number of instrumented points we got improvements on<br>
> > > some test<br>
> > > files from CSiBE benchmark. Here are the promising results:<br>
> > ><br>
> > > jpeg-6b/jidctflt.c -10.2%<br>
> > > OpenTCP-1.0.4/system.c -6.7%<br>
> > > libmspack/test/chmd_md5.c -5.8%<br>
> > > flex-2.5.31/regex.c -4.8%<br>
> ><br>
> > Very interesting. Out of curiosity, any slowdowns?<br>
> <br>
> jidctflt.c is optimized due to the fact that LICM moves some<br>
> constant definitions out of the loop and increases the register<br>
> pressure so some spills occur. So optimized version, in this<br>
> case, should be even faster. But slowdowns are possible and we<br>
> use option -Oz so we are telling the compiler that code size<br>
> is the priority.<br>
> <br>
> ><br>
> > ><br>
> > > We plan to apply this framework to more compiler phases and to some<br>
> > > other<br>
> > > target architectures. It is expected to give better results on some<br>
> > > non<br>
> > > orthogonal architectures like microMIPS. Another direction is to<br>
> > > apply<br>
> > > machine learning based on some input program features (number of<br>
> > > basic<br>
> > > blocks, register pressure at some point etc.) and it can be used for<br>
> > > selecting compilation path for the next iteration or estimating the<br>
> > > best<br>
> > > path for compiling program without using iterative approach (with<br>
> > > only one<br>
> > > iteration).<br>
> > ><br>
> > ><br>
> > > o<br>
> > > / \<br>
> > > / \<br>
> > > o \<br>
> > > / \<br>
> > > / \<br>
> > > o o<br>
> > > / \ /<br>
> > > / \ /<br>
> > > L1 o o<br>
> > > / /<br>
> > > / /<br>
> > > L2 L3<br>
> > ><br>
> > > Figure 1: Decision Tree Example - Symbol 'o' represent decision<br>
> > > points. If<br>
> > > getDecision function returns FALSE the left branch is taken (default<br>
> > > path),<br>
> > > otherwise the right branch is taken.<br>
> > > 'L1' compilation iteration can be represented as a string<br>
> > > (FALSE,FALSE,FALSE), 'L2' as (FALSE,FALSE,TRUE,FALSE) and 'L3'<br>
> > > as (TRUE, FALSE, FALSE).<br>
> > ><br>
> > ><br>
> > > Clearly, we see huge potential in this direction. At the same time,<br>
> > > we are<br>
> > > aware this would cause compiler to be slower which everyone is<br>
> > > against, but<br>
> > > this penalty can be taken when we are in need for most optimized<br>
> > > code.<br>
> ><br>
> > What is the overhead of the current implementation? I suspect that,<br>
> > in order to control the large number of decision points from large<br>
> > regions, we will want some way to limit fine-grained decisions in<br>
> > the tree in large regions (and only including coarse ones, like the<br>
> > inlining of relatively large functions). In small blocks, on the<br>
> > other hand, we might even include instruction scheduling decisions<br>
> > (for in-order cores).<br>
> <br>
> In the current implementation one full compilation per<br>
> decision is needed. Plan is that, for CodeGen pass, do once all the<br>
> passes of CodeGen per decision (to reuse bitcode). Machine learning should<br>
> considerably reduce the need for the large number of iterations.<br>
> <br>
> ><br>
> > I'd been thinking of experimenting with some iterative scheduling<br>
> > for my in-order PPC cores (for small blocks in loops), and<br>
> > integrating it into something like this might be even better.<br>
> ><br>
> > ><br>
> > > What does everyone think?<br>
> > > Could this approach be considered as a valid mode for Clang/LLVM?<br>
> > ><br>
> > > Please use this post and the attached patch (it can be applied to 3.3<br>
> > > release - clang should be inside llvm/tools folder) as a reference<br>
> > > and<br>
> > > not the patch for submission.<br>
> ><br>
> > I see in the patch some (seemingly small) changes to the pass manager<br>
> > setup. Can you describe exactly what you've done. Chandler is<br>
> > rewriting the pass infrastructure, and I'd like to understand how<br>
> > what you've done will or won't fit with the new setup.<br>
> <br>
> Pointer to decision tree proxy has to be passed to every place<br>
> where getDecision is called and to the pass that calculates function<br>
> fitness. So pointer to it is added in Pass and PassManagerBase classes.<br>
> We will provide more detail explanation later.<br>
> <br>
> ><br>
> > Thanks again,<br>
> > Hal<br>
> ><br>
> > ><br>
> > > Example command line:<br>
> > > > clang -target mipsel-unknown-linux -Oz -S -fiterative-comp=3<br>
> > > > example.c<br>
> > ><br>
> > > Thank you. Looking forward to hearing your comments on this.<br>
> > ><br>
> > ><br>
> > ><br>
> > > _______________________________________________<br>
> > > LLVM Developers mailing list<br>
> > > LLVMdev@cs.uiuc.edu http://llvm.cs.uiuc.edu<br>
> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev<br>
> > ><br>
> ><br>
> > --<br>
> > Hal Finkel<br>
> > Assistant Computational Scientist<br>
> > Leadership Computing Facility<br>
> > Argonne National Laboratory<br>
> <br>
> <br>
> _______________________________________________<br>
> LLVM Developers mailing list<br>
> LLVMdev@cs.uiuc.edu http://llvm.cs.uiuc.edu<br>
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev<br>
> <br>
<br>
</div>
</body>
</html>