<html dir="ltr">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<style id="owaParaStyle" type="text/css">P {margin-top:0;margin-bottom:0;}</style>

</head>

<body ocsi="0" fpstyle="1">

<div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;">Hello Hongbin,<br>

<br>

Thank you for your suggestions. It sounds like something we were looking for<br>

in order to better integrate iterative compilation framework into llvm.<br>

We plan to port this implementation to llvm 3.4 release and then we will try to<br>

follow your recommendations.<br>

<br>

Radovan<br>

 <br>

> From: Hongbin Zheng [etherzhhb@gmail.com]<br>

> Sent: Wednesday, December 18, 2013 7:45 PM<br>

> To: Radovan Obradovic<br>

> Cc: Hal Finkel; Zoran Jovanovic; Petar Jovanovic; llvmdev@cs.uiuc.edu<br>

> Subject: Re: [LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM<br>

> <br>

> Hi Radovan,<br>

> I am also interested in the iterative compilation in LLVM, and had implemented<br>

> a simple iterative compilation driver. I guess you do not need to embedded the<br>

> pointer to ModuleDecisionTreeProxies into the core classes of llvm, i.e. the<br>

> PassManager class and the Function class. Instead, you could:<br>

> 1. Implement a special pass manager that runs a set of passes iteratively.<br>

>    Implementing the iterative compilation framework based on pass manage<br>

>    allows the framework to live outside of clang, and we can use it in opt,<br>

>    llc, etc.<br>

> 2. Implement an immutable pass to store the global information<br>

>    (i.e. ModuleDecisionTreeProxies if I do not misunderstand your patch)<br>

>    across iterations. In your example, once you implemented<br>

>    ModuleDecisionTreeProxies as an immutable pass, you can get<br>

>    the ModuleDecisionTreeProxies in every pass you need it by simply<br>

>    call "getAnalysis<ModuleDecisionTreeProxies>()"<br>

> Thanks<br>

> Hongbin<br>

> PS: I attached our iterative driver, you could have a look if you are interested.<br>

> <br>

> <br>

> On Wed, Dec 18, 2013 at 10:33 AM, Radovan Obradovic <Radovan.Obradovic@imgtec.com> wrote:<br>

> <br>

>     Hal,<br>

> <br>

>     Thank you for finding interest in our work!<br>

>     Please find my answers inlined below.<br>

> <br>

>     ><br>

>     > ________________________________________<br>

>     > From: Hal Finkel [hfinkel@anl.gov]<br>

>     > Sent: Monday, December 16, 2013 4:26 PM<br>

>     > To: Radovan Obradovic<br>

>     > Cc: llvmdev@cs.uiuc.edu; chandlerc; Andrew Trick<br>

>     > Subject: Re: [LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM<br>

>     ><br>

>     > Radovan,<br>

>     ><br>

>     > Thanks for posting this! I would really like to have this kind of<br>

>     > functionality available. There are a lot of things to think through<br>

>     > here; comments inline...<br>

>     ><br>

>     > ----- Original Message -----<br>

>     > > From: "Radovan Obradovic" <Radovan.Obradovic@imgtec.com><br>

>     > > To: llvmdev@cs.uiuc.edu<br>

>     > > Sent: Monday, December 16, 2013 11:31:21 AM<br>

>     > > Subject: [LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM<br>

>     > ><br>

>     > ><br>

>     > ><br>

>     > ><br>

>     > > This is a first step towards to full iterative compilation framework<br>

>     > > for<br>

>     > > Clang/LLVM. The attached patch is work in progress and the idea of<br>

>     > > sending<br>

>     > > it to the list earlier rather than later is to open up a discussion<br>

>     > > on<br>

>     > > iterative compilation in Clang/LLVM. All comments are welcome,<br>

>     > > especially<br>

>     > > those related to integration of the framework into Clang/LLVM.<br>

>     ><br>

>     > I'm glad you're doing this, and I hope that you'll be willing to<br>

>     > stick though what might be a long design discussion :)<br>

> <br>

>     Our belief is that this approach can be beneficial, especially<br>

>     for non-orthogonal architectures (e.g. microMIPS), thus we plan<br>

>     to continue our work on this.<br>

> <br>

>     ><br>

>     > ><br>

>     > > Current compilers have pipeline structure in which data from one<br>

>     > > phase is<br>

>     > > passed to the following one. But these compilation phases can<br>

>     > > interfere in<br>

>     > > unpredictable way so compilers use conservative heuristics on which<br>

>     > > they<br>

>     > > make decisions during the compilation process. Because of<br>

>     > > unpredictability<br>

>     > > of these interferences, compilers sometimes make decisions which can<br>

>     > > degrade<br>

>     > > code quality. Iterative compilation could partially resolve these<br>

>     > > problems<br>

>     > > by allowing compiler to explore some alternative decisions, generate<br>

>     > > code<br>

>     > > for these new decision paths and take the generated code that is the<br>

>     > > best<br>

>     > > for a particular criteria eventually.<br>

>     > ><br>

>     > > This approach is not new in the literature, we can mention<br>

>     > > Milepost/CTuning<br>

>     > > [http://ctuning.org] among other projects. In this approach, in<br>

>     > > comparison<br>

>     > > to Milepost/CTuning, the emphasis is on much finer granularity by<br>

>     > > changing<br>

>     > > arbitrary compiler decisions, and not only those based on values of<br>

>     > > command<br>

>     > > line options, at any point in translation process.<br>

>     > ><br>

>     > > This framework implements iterative approach in which in every<br>

>     > > iteration<br>

>     > > slightly different code is generated. At the end of compilation<br>

>     > > process the<br>

>     > > best emitted code is chosen.<br>

>     > ><br>

>     > > All points at which relevant decisions are made should be identified.<br>

>     > > Access to framework data structure (DecisionTree) should be provided<br>

>     > > at<br>

>     > > these points. Based on the data from DecisionTree, obtained by<br>

>     > > calling<br>

>     > > getDecision member function, at these points decisions are made<br>

>     > > whether to<br>

>     > > execute default actions (as compiler would do without this framework)<br>

>     > > or to<br>

>     > > follow alternative paths.<br>

>     > ><br>

>     > > Let us consider Inliner example:<br>

>     > ><br>

>     > > Original code:<br>

>     > ><br>

>     > > if (!IC) {<br>

>     > > DEBUG(dbgs() << " NOT Inlining: cost=" << IC.getCost()<br>

>     > > << ", thres=" << (IC.getCostDelta() + IC.getCost())<br>

>     > > << ", Call: " << *CS.getInstruction() << "\n");<br>

>     > > return false;<br>

>     > > }<br>

>     > ><br>

>     > > Code after instrumentalization:<br>

>     > ><br>

>     > > bool doInlinging = (bool)IC;<br>

>     > > DecisionTreeProxy *decisionTreeProxy;<br>

>     > > decisionTreeProxy =<br>

>     > > moduleDecisionTreeProxies->getCurrFnDecisionTreeProxy();<br>

>     > > int t = MaxDtPriority - abs(IC.getCostDelta());<br>

>     > > bool decision = false;<br>

>     > > if (t >= 0)<br>

>     > > decision = decisionTreeProxy->getDecision(call);<br>

>     > > if (decision)<br>

>     > > doInlinging = !doInlinging;<br>

>     > ><br>

>     > > if (!doInlinging) {<br>

>     > > DEBUG(dbgs() << " NOT Inlining: cost=" << IC.getCost()<br>

>     > > << ", thres=" << (IC.getCostDelta() + IC.getCost())<br>

>     > > << ", Call: " << *CS.getInstruction() << "\n");<br>

>     > > return false;<br>

>     > > }<br>

>     ><br>

>     > Talking a quick look over the patch, this seems to introduce a lot<br>

>     > of boilerplate code. Can you think of a design that simplifies this?<br>

> <br>

>     It is necessary to access decision tree at every place where you<br>

>     want to call getDecision function and priority parameter must be<br>

>     calculated which is specific to every such place. It is only a<br>

>     few lines of code but we will consider how to improve this.<br>

>     All suggestions are welcomed.<br>

> <br>

>     ><br>

>     > ><br>

>     > > Function getDecision returns true or false. If false is returned than<br>

>     > > this<br>

>     > > means to take default path and true means to take alternative one.<br>

>     > ><br>

>     > > Compilation process can be represented by sequence of binary values<br>

>     > > where<br>

>     > > every value determines decision made at some point. Value FALSE means<br>

>     > > to<br>

>     > > follow normal compilation process as original compiler would do<br>

>     > > without<br>

>     > > presence of iterative compilation framework. TRUE value means to take<br>

>     > > alternative path or to make opposite decision at that point. This<br>

>     > > naturally<br>

>     > > forms tree like structure which we named Decision Tree.<br>

>     > > Traversing the tree from the root to the leaf of the tree provides<br>

>     > > all the<br>

>     > > decisions made during one iteration in compilation process.<br>

>     > ><br>

>     > > At the end of each iteration, quality of generated code is estimated<br>

>     > > by<br>

>     > > executing newly introduced target dependent pass. Based on results<br>

>     > > path for<br>

>     > > the following iteration is calculated. At the moment, this has been<br>

>     > > proved<br>

>     > > for MIPS only and it is based on code size.<br>

>     ><br>

>     > I think that, in general, we ought to be able to get an overall<br>

>     > latency estimate from the instruction scheduler.<br>

> <br>

>     Criteria to optimize for speed should be added as well. Do you<br>

>     think that latency info form scheduler can be used to estimate<br>

>     execution speed of generated code? Can we guess somehow the execution<br>

>     frequency of basic blocks?<br>

> <br>

>     ><br>

>     > ><br>

>     > > Initially iterative compilation framework was implemented in CodeGen<br>

>     > > phase<br>

>     > > but large number of code transformations was already performed in<br>

>     > > earlier<br>

>     > > phases. This reduces possibilities to improve quality of generated<br>

>     > > code.<br>

>     > > Thus iterative compilation framework is applied to Clang driver where<br>

>     > > it can<br>

>     > > influence every single compiler phase. This relocation resulted in<br>

>     > > some<br>

>     > > implementation issues mostly related to the fact that driver and the<br>

>     > > compiler itself (-cc1) are located in different processes. This is<br>

>     > > resolved<br>

>     > > by introducing DecisionTreeProxy class that implements communication<br>

>     > > between<br>

>     > > these two processes through temporary files.<br>

>     ><br>

>     > I think that this can be solved by introducing some API for the<br>

>     > necessary data flow. There is obviously already a binding here.<br>

> <br>

>     This probably needs to be fixed. All suggestions are welcomed.<br>

> <br>

>     ><br>

>     > ><br>

>     > > At this point mechanism is applied to two optimizations: Inliner and<br>

>     > > Loop<br>

>     > > Invariant Code Motion. Supported platform is MIPS32r2.<br>

>     > ><br>

>     > > Despite small number of instrumented points we got improvements on<br>

>     > > some test<br>

>     > > files from CSiBE benchmark. Here are the promising results:<br>

>     > ><br>

>     > > jpeg-6b/jidctflt.c -10.2%<br>

>     > > OpenTCP-1.0.4/system.c -6.7%<br>

>     > > libmspack/test/chmd_md5.c -5.8%<br>

>     > > flex-2.5.31/regex.c -4.8%<br>

>     ><br>

>     > Very interesting. Out of curiosity, any slowdowns?<br>

> <br>

>     jidctflt.c is optimized due to the fact that LICM moves some<br>

>     constant definitions out of the loop and increases the register<br>

>     pressure so some spills occur. So optimized version, in this<br>

>     case, should be even faster. But slowdowns are possible and we<br>

>     use option -Oz so we are telling the compiler that code size<br>

>     is the priority.<br>

> <br>

>     ><br>

>     > ><br>

>     > > We plan to apply this framework to more compiler phases and to some<br>

>     > > other<br>

>     > > target architectures. It is expected to give better results on some<br>

>     > > non<br>

>     > > orthogonal architectures like microMIPS. Another direction is to<br>

>     > > apply<br>

>     > > machine learning based on some input program features (number of<br>

>     > > basic<br>

>     > > blocks, register pressure at some point etc.) and it can be used for<br>

>     > > selecting compilation path for the next iteration or estimating the<br>

>     > > best<br>

>     > > path for compiling program without using iterative approach (with<br>

>     > > only one<br>

>     > > iteration).<br>

>     > ><br>

>     > ><br>

>     > > o<br>

>     > > / \<br>

>     > > / \<br>

>     > > o \<br>

>     > > / \<br>

>     > > / \<br>

>     > > o o<br>

>     > > / \ /<br>

>     > > / \ /<br>

>     > > L1 o o<br>

>     > > / /<br>

>     > > / /<br>

>     > > L2 L3<br>

>     > ><br>

>     > > Figure 1: Decision Tree Example - Symbol 'o' represent decision<br>

>     > > points. If<br>

>     > > getDecision function returns FALSE the left branch is taken (default<br>

>     > > path),<br>

>     > > otherwise the right branch is taken.<br>

>     > > 'L1' compilation iteration can be represented as a string<br>

>     > > (FALSE,FALSE,FALSE), 'L2' as (FALSE,FALSE,TRUE,FALSE) and 'L3'<br>

>     > > as (TRUE, FALSE, FALSE).<br>

>     > ><br>

>     > ><br>

>     > > Clearly, we see huge potential in this direction. At the same time,<br>

>     > > we are<br>

>     > > aware this would cause compiler to be slower which everyone is<br>

>     > > against, but<br>

>     > > this penalty can be taken when we are in need for most optimized<br>

>     > > code.<br>

>     ><br>

>     > What is the overhead of the current implementation? I suspect that,<br>

>     > in order to control the large number of decision points from large<br>

>     > regions, we will want some way to limit fine-grained decisions in<br>

>     > the tree in large regions (and only including coarse ones, like the<br>

>     > inlining of relatively large functions). In small blocks, on the<br>

>     > other hand, we might even include instruction scheduling decisions<br>

>     > (for in-order cores).<br>

> <br>

>     In the current implementation one full compilation per<br>

>     decision is needed. Plan is that, for CodeGen pass, do once all the<br>

>     passes of CodeGen per decision (to reuse bitcode). Machine learning should<br>

>     considerably reduce the need for the large number of iterations.<br>

> <br>

>     ><br>

>     > I'd been thinking of experimenting with some iterative scheduling<br>

>     > for my in-order PPC cores (for small blocks in loops), and<br>

>     > integrating it into something like this might be even better.<br>

>     ><br>

>     > ><br>

>     > > What does everyone think?<br>

>     > > Could this approach be considered as a valid mode for Clang/LLVM?<br>

>     > ><br>

>     > > Please use this post and the attached patch (it can be applied to 3.3<br>

>     > > release - clang should be inside llvm/tools folder) as a reference<br>

>     > > and<br>

>     > > not the patch for submission.<br>

>     ><br>

>     > I see in the patch some (seemingly small) changes to the pass manager<br>

>     > setup. Can you describe exactly what you've done. Chandler is<br>

>     > rewriting the pass infrastructure, and I'd like to understand how<br>

>     > what you've done will or won't fit with the new setup.<br>

> <br>

>     Pointer to decision tree proxy has to be passed to every place<br>

>     where getDecision is called and to the pass that calculates function<br>

>     fitness. So pointer to it is added in Pass and PassManagerBase classes.<br>

>     We will provide more detail explanation later.<br>

> <br>

>     ><br>

>     > Thanks again,<br>

>     > Hal<br>

>     ><br>

>     > ><br>

>     > > Example command line:<br>

>     > > > clang -target mipsel-unknown-linux -Oz -S -fiterative-comp=3<br>

>     > > > example.c<br>

>     > ><br>

>     > > Thank you. Looking forward to hearing your comments on this.<br>

>     > ><br>

>     > ><br>

>     > ><br>

>     > > _______________________________________________<br>

>     > > LLVM Developers mailing list<br>

>     > > LLVMdev@cs.uiuc.edu         http://llvm.cs.uiuc.edu<br>

>     > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev<br>

>     > ><br>

>     ><br>

>     > --<br>

>     > Hal Finkel<br>

>     > Assistant Computational Scientist<br>

>     > Leadership Computing Facility<br>

>     > Argonne National Laboratory<br>

> <br>

> <br>

>     _______________________________________________<br>

>     LLVM Developers mailing list<br>

>     LLVMdev@cs.uiuc.edu         http://llvm.cs.uiuc.edu<br>

>     http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev<br>

> <br>

<br>

</div>

</body>

</html>