<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <br>

    <div class="moz-cite-prefix">On 7/16/13 3:32 PM, Nick Lewycky wrote:<br>

    </div>

    <blockquote

cite="mid:CADbEz-hfnqvWBcVSO+xLx_7ujGGD1eHrNnbinakhgn_4Wqpq5Q@mail.gmail.com"

      type="cite">

      <div dir="ltr">On 12 July 2013 15:49, Shuxin Yang <span dir="ltr"><<a

            moz-do-not-send="true" href="mailto:shuxin.llvm@gmail.com"

            target="_blank">shuxin.llvm@gmail.com</a>></span> wrote:<br>

        <div class="gmail_extra">

          <div class="gmail_quote">

            <blockquote class="gmail_quote" style="margin:0px 0px 0px

0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Hi,

              There:<br>

              <br>

                This is the proposal for parallelizing post-ipo stage.

              See the following for details.<br>

              <br>

                I also attach a toy-grade rudimentary implementation.

              This implementation can be<br>

              used to illustrate some concepts here. This patch is not

              going to be committed.<br>

              <br>

                Unfortunately, this weekend I will be too busy to read

              emails. Please do not construe<br>

              delayed response as being rude :-).<br>

              <br>

              Thanks a lot in advance for your time insightful comments!<br>

              <br>

              Shuxin<br>

              <br>

              <br>

              The proposal<br>

              ------------<br>

                It is organized as following:<br>

                 1) background info, if you heard "/usr/bin/ls", please

              skip it<br>

                 2) the motivation of parallelize post-IPO stage<br>

                 3) how to parallelize post-IPO<br>

                 4) the linker problems.<br>

                 5) the toy-grade rudimentary implementation<br>

                 6) misc<br>

              <br>

              1.Some background<br>

              ------------------<br>

              <br>

                The Interprocedural-optimization compilation, aka IPO or

              IPA, typically<br>

              consists of three stages:<br>

              <br>

                S1) pre-IPO<br>

                  Each function goes through some analysis and

              not-very-aggressive optimizations.<br>

                  Some information is collected during this stage, this

              info will be to IPO stages.<br>

                  This info is usually called summary info.<br>

              <br>

                  The result of this stage is "fake-objects" which is

              binary files using<br>

                  some known object format to encapsulate IR as well as

              summary info along with<br>

                  other stuff.<br>

              <br>

                S2) IPO:<br>

                  Compiler works with linker to resolve and merge

              symbols in the "fake-objects"<br>

              <br>

                  Then Interprocedural analyses (IPA) are invoked to

              perform interprocedural<br>

                  analysis either based on the summary-info, or directly

              on the IR.<br>

              <br>

                  Interprocedural optimizations (IPO) are called based

              on the IPA result.<br>

              <br>

                  In some compilers, IPA and IPO are separated. One

              reason is that many IPAs can<br>

                  be directly conduct on the concise summary info, while

              many IPOs need to load<br>

                  IRs and bulky annotation/metadata into memory.<br>

              <br>

                S3) post-IPO:<br>

                  Typically consist of Loop-Nest-Opt, Scalar Opt,

              Code-Gen etc etc. While they<br>

                  are intra-procedural analyses/optimizers, they may

              directly benefit from<br>

                  the info collected in the IPO stages and pass down the

              road.<br>

              <br>

                LLVM collectively call S2 and S3 as "LTO CodeGen", which

              is very confusing.<br>

              <br>

              2. Why parallelize post-IPO stage<br>

              ==============================<br>

              <br>

                R1) To improve the scalarbility<br>

                  It is quite obvious that we are not able to put

              everything about a monster<br>

                  program in memory at once.<br>

              <br>

                  Even if you can afford a expensive computer, the

              address space of a<br>

                  single compiler process cannot accommodate a monster

              program.<br>

              <br>

                R2) to take advantage of ample HW resource to shorten

              compile time.<br>

                R3) make debugging lot easier.<br>

                   One can triage problems in a much smaller partition

              rather than<br>

                 the huge monster program.<br>

              <br>

                This proposal is not able to shoot the goal R1 at this

              moment, because during<br>

              the IPO stage, currently the compiler brings everything

              into memory at once.<br>

              <br>

              3. How to parallelize post-IPO stage<br>

              ====================================<br>

              <br>

                From 5k' high, the concept is very simple, just to<br>

                 step 1).divide the merged IR into small pieces,<br>

                 step 2).and compile each of this pieces independendly.<br>

                 step 3) the objects of each piece are fed back to

              linker to are linked<br>

                         into an executable, or a dynamic lib.<br>

              <br>

                Section 3.1 through 3.3 describe these three steps

              respectively.<br>

            </blockquote>

            <div><br>

            </div>

            <div>Yes, this is one approach. I think others at Google

              have looked into this sort of partitioning with GCC and

              found that the one thing which really helps you pick the

              partitions, is to profile the program and make the

              partitions based on the actual paths of functions seen by

              the profile. I don't think they saw much improvement

              without that. See <a moz-do-not-send="true"

                href="http://research.google.com/pubs/pub36355.html">http://research.google.com/pubs/pub36355.html</a>

              .</div>

            <div><br>

            </div>

            <div>I do want to mention some other things we can do to

              parallelize. You may already know of these and have

              considered them and rejected them before deciding on the

              design you emailed out here, but I want to list them since

              there's already another thread with a different approach

              on the mailing list.</div>

            <div><br>

            </div>

            <div>* Use what we have. LTO partitions at a time,

              directories for instance, on the premise that LTO'ing them

              will produce something smaller than the sum of its inputs.

              Then when you do the final whole-program step, it will be

              receiving smaller inputs than it otherwise would. The

              change to LLVM here is to fix the inliner (or other

              optimizations?) to reduce the chance that LTO produces an

              output larger than its input.<br>

            </div>

            <div><br>

            </div>

            <div>* Parallelize the per-function code generator within a

              single LLVMContext. CodeGen presently operates on a

              per-function basis, and is structured as an analysis over

              llvm IR. There shouldn't be any global state, and there

              shouldn't be any need for locking accesses to the IR since

              nothing will be mutating it. This will even speed up clang

              -O0, and is a solid first step that gets a thread-creation

              API into LLVM.</div>

            <div>  - What happened to "one LLVMContext per thread"?

              Okay, that rule is a little white lie. Always was.

              LLVMContext allows two libraries to use llvm under the

              hood without interfering with each other (even to the

              point of separate maps of types to avoid one library from

              causing slow type lookups in the other). LLVM also doesn't

              have locking around accesses to the IR, and few guarantees

              how many things a single mutating operation will need to

              look at or change, but with those caveats in mind it is

              possible to share a context across threads. Because

              CodeGen is structured as an analysis over the IR without

              mutating the IR, it should work. There's probably still

              global state in the code generator somewhere, but it's not

              a structural problem.</div>

            <div><br>

            </div>

            <div>* Parallelize the function passes, and SCCs that are

              siblings in the call tree (again within a single

              LLVMContext). The gnarly part of this is that globals have

              shared use-lists which are updated as we modify each

              function individually. Those globals either need to have

              locks on their use-lists, replaced with a lockless list,

              or removed entirely. Probably both, as GlobalVariable's

              have use-lists we actually use in the optimizers, but we

              don't actually need the use-list for "i32 0".</div>

            <div><br>

            </div>

            <div>* Parallelize ModulePasses by splitting them into an

              analysis phase and an optimization phase. Make each per-TU

              build emit the .bc as usual plus an analysis-file (for

              instance, call graph, or "which functions reads/modify

              which globals"). Merge all the analysis-files and farm

              them back out to be used as input to the programs

              optimizing each .bc individually -- but now they have

              total knowledge of the whole-program call graph and other

              AA information, etc.</div>

            <div>  - You can combine this with an RPC layer to give each

              worker the ability to ask for the definition of a function

              from another worker. LLVM already supports

              "unmaterialized" functions where the definition is loaded

              lazily. The analysis part should arrange to give us enough

              information to determine whether we want to do the

              inlining, then if we decide to materialize the function we

              get its body from another worker.</div>

            <div><br>

            </div>

            <div>* Parallelize by splitting into different LLVMContexts.

              This spares us the difficulty of adding locks or otherwise

              changing the internals of LLVM, gets us the ability to

              spread the load across multiple machines, and if combined

              with the RPC idea above you can also get good inlining

              without necessarily loading the whole program into memory

              on a single machine.<br>

            </div>

            <div><br>

            </div>

            <div>I'm not planning to do the work any time soon so count

              my vote with that in mind, but if you ask me I think the

              first step should be to parallelize the backend within a

              single LLVMContext first, then to parallelize the function

              passes and CGSCC passes (across siblings only of course)

              second. Removing the use-list from simple constants is a

              very interesting thing to do to decrease lock contention,

              but we may want to do something smarter than just remove

              it -- consider emitting a large constant that is only used

              by an inline function. It is possible to emit a table of

              constants in the same COMDAT group as the function, then

              if the inline function is discarded by the linker the

              constants are discarded with it. I don't have a concrete

              proposal for that.</div>

            <div><br>

            </div>

            <div>Nick</div>

            <div><br>

            </div>

            <br>

          </div>

        </div>

      </div>

    </blockquote>

    <br>

       Thank you for sharing your enlightening thoughts.  I heard some

    ideas before, some are quite new to me. <br>

    I will take this into account as the project move on.  <br>

    <br>

       The motivation of this project is not merely to make compilation

    faster. It is also to:<br>

       - significantly ease trouble shooting -- I was asked to fix LTO

    bugs for several times, <br>

          it almost drive me mad to pin-point the bug in a huge merged

    module. It is definitely an<br>

         unglamorous and painstaking undertaking:-)<br>

    <br>

      -  this is one step toward better scalability. <br>

    <br>

      For now, I don't want to parallelize CodeGen only, as post-IPO

    scalar opt is compile-time hogging as well. <br>

    On the other,  HPC folks may invoke Loop-Nest-Opt/autopar/etc/ in

    the post-IPO stage as well,  those intrinsically <br>

    very slow breeds.  So parallelize the entire post-IPO stage will

    make them happy.  Finer-grained parallelism <br>

    could be promising, however, it is too error-prone:-).<br>

    <br>

    <br>

    <br>

     <br>

        <br>

    <br>

  </body>

</html>