[LLVMdev] [Proposal] Parallelize post-IPO stage.
Shuxin Yang
shuxin.llvm at gmail.com
Tue Jul 16 16:07:05 PDT 2013
On 7/16/13 3:32 PM, Nick Lewycky wrote:
> On 12 July 2013 15:49, Shuxin Yang <shuxin.llvm at gmail.com
> <mailto:shuxin.llvm at gmail.com>> wrote:
>
> Hi there,
>
> This is the proposal for parallelizing the post-IPO stage. See the
> following for details.
>
> I also attach a toy-grade rudimentary implementation. This
> implementation can be used to illustrate some concepts here. This
> patch is not going to be committed.
>
> Unfortunately, this weekend I will be too busy to read emails.
> Please do not construe a delayed response as rudeness :-).
>
> Thanks a lot in advance for your time and insightful comments!
>
> Shuxin
>
>
> The proposal
> ------------
> It is organized as follows:
> 1) background info (please skip it if you already know how IPO works)
> 2) the motivation for parallelizing the post-IPO stage
> 3) how to parallelize the post-IPO stage
> 4) the linker problems
> 5) the toy-grade rudimentary implementation
> 6) misc
>
> 1. Some background
> -------------------
>
> Interprocedural-optimization compilation, aka IPO or IPA, typically
> consists of three stages:
>
> S1) pre-IPO
> Each function goes through some analyses and not-very-aggressive
> optimizations. Some information is collected during this stage;
> this info will be passed on to the IPO stages. This info is
> usually called summary info.
>
> The result of this stage is "fake objects": binary files that use
> some known object format to encapsulate the IR as well as the
> summary info, along with other data.
>
> S2) IPO:
> The compiler works with the linker to resolve and merge symbols
> in the "fake objects".
>
> Then interprocedural analyses (IPA) are invoked to perform
> interprocedural analysis, either based on the summary info or
> directly on the IR.
>
> Interprocedural optimizations (IPO) are then performed based on
> the IPA results.
>
> In some compilers, IPA and IPO are separated. One reason is
> that many IPAs can be conducted directly on the concise summary
> info, while many IPOs need to load IR and bulky
> annotation/metadata into memory.
>
> S3) post-IPO:
> Typically consists of Loop-Nest-Opt, Scalar Opt, Code-Gen, etc.
> While these are intra-procedural analyses/optimizers, they may
> directly benefit from the info collected in the IPO stages and
> passed down the road.
>
> LLVM collectively calls S2 and S3 "LTO CodeGen", which is very
> confusing.
>
> 2. Why parallelize the post-IPO stage
> ==============================
>
> R1) To improve scalability.
> It is quite obvious that we are not able to put everything about
> a monster program in memory at once.
>
> Even if you can afford an expensive computer, the address space
> of a single compiler process cannot accommodate a monster
> program.
>
> R2) To take advantage of ample HW resources to shorten compile time.
> R3) To make debugging a lot easier.
> One can triage problems in a much smaller partition rather than
> in the huge monster program.
>
> This proposal is not able to achieve goal R1 at this moment,
> because during the IPO stage the compiler currently brings
> everything into memory at once.
>
> 3. How to parallelize post-IPO stage
> ====================================
>
> From 5,000 feet up, the concept is very simple:
> step 1) divide the merged IR into small pieces,
> step 2) compile each of these pieces independently,
> step 3) feed the objects of each piece back to the linker, where
> they are linked into an executable or a dynamic lib.
>
> Sections 3.1 through 3.3 describe these three steps respectively;
> a minimal sketch of the overall flow follows.
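>
> To make steps 1) and 2) concrete, here is a minimal sketch, assuming
> a 3.3-era LLVM tree. The round-robin partitioner and the output file
> names are placeholders for illustration; a production partitioner
> would use a smarter heuristic, and internal functions referenced
> across pieces would need promoting to unique external names (elided
> here).
>
>   #include "llvm/ADT/Twine.h"
>   #include "llvm/Bitcode/ReaderWriter.h"
>   #include "llvm/IR/Module.h"
>   #include "llvm/Support/raw_ostream.h"
>   #include "llvm/Transforms/Utils/Cloning.h"
>   #include <string>
>   using namespace llvm;
>
>   // Step 1): split the merged post-IPO module into N pieces. Piece K
>   // keeps the bodies of the functions assigned to it; every other
>   // function body is dropped (deleteBody() also converts the linkage
>   // to external), so the linker resolves cross-piece calls in step 3).
>   static void emitPartitions(const Module *Merged, unsigned N) {
>     for (unsigned K = 0; K != N; ++K) {
>       Module *Piece = CloneModule(Merged);
>       unsigned Idx = 0;
>       for (Module::iterator F = Piece->begin(), E = Piece->end();
>            F != E; ++F, ++Idx)
>         if (!F->isDeclaration() && Idx % N != K)
>           F->deleteBody();           // demote to a declaration
>       // Step 2): each piece is written out and compiled to a real
>       // object file by an independent worker (process or thread).
>       std::string Name = ("part" + Twine(K) + ".bc").str();
>       std::string Err;
>       raw_fd_ostream Out(Name.c_str(), Err);
>       if (Err.empty())
>         WriteBitcodeToFile(Piece, Out);
>       delete Piece;
>     }
>   }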
>
>
> Yes, this is one approach. I think others at Google have looked into
> this sort of partitioning with GCC and found that the one thing that
> really helps you pick the partitions is to profile the program and
> make the partitions based on the actual paths of functions seen by the
> profile. I don't think they saw much improvement without that. See
> http://research.google.com/pubs/pub36355.html .
>
> I do want to mention some other things we can do to parallelize. You
> may already know of these and have considered them and rejected them
> before deciding on the design you emailed out here, but I want to list
> them since there's already another thread with a different approach on
> the mailing list.
>
> * Use what we have. LTO a partition at a time, directories for
> instance, on the premise that LTO'ing them will produce something
> smaller than the sum of their inputs. Then, when you do the final
> whole-program step, it will receive smaller inputs than it otherwise
> would. The change to LLVM here is to fix the inliner (or other
> optimizations?) to reduce the chance that LTO produces an output
> larger than its input.
>
> * Parallelize the per-function code generator within a single
> LLVMContext. CodeGen presently operates on a per-function basis, and
> is structured as an analysis over llvm IR. There shouldn't be any
> global state, and there shouldn't be any need for locking accesses to
> the IR since nothing will be mutating it. This will even speed up
> clang -O0, and is a solid first step that gets a thread-creation API
> into LLVM.
> - What happened to "one LLVMContext per thread"? Okay, that rule is
> a little white lie. Always was. LLVMContext allows two libraries to
> use llvm under the hood without interfering with each other (even to
> the point of separate maps of types, to keep one library from causing
> slow type lookups in the other). LLVM also doesn't have locking around
> accesses to the IR, and makes few guarantees about how many things a
> single mutating operation will need to look at or change, but with
> those caveats in mind it is possible to share a context across
> threads. Because CodeGen is structured as an analysis over the IR
> without mutating it, it should work. There's probably still global
> state in the code generator somewhere, but it's not a structural
> problem.
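>
> Here is a sketch of the shape that change could take -- not of any
> API that exists today. codeGenFunction() is a hypothetical
> thread-safe entry point that runs the existing per-function codegen
> pipeline; C++11 threads are used for brevity.
>
>   #include "llvm/IR/Module.h"
>   #include <thread>
>   #include <vector>
>   using namespace llvm;
>
>   // Hypothetical: runs the per-function codegen pipeline over F,
>   // treating the IR strictly as read-only input.
>   void codeGenFunction(Function &F);
>
>   // All workers share one LLVMContext and one Module; per the
>   // caveats above this is safe only because nothing mutates the IR.
>   void parallelCodeGen(Module &M, unsigned NThreads) {
>     std::vector<Function *> Work;
>     for (Module::iterator F = M.begin(), E = M.end(); F != E; ++F)
>       if (!F->isDeclaration())
>         Work.push_back(&*F);
>     std::vector<std::thread> Threads;
>     for (unsigned T = 0; T != NThreads; ++T)
>       Threads.push_back(std::thread([&Work, T, NThreads] {
>         for (size_t I = T; I < Work.size(); I += NThreads)
>           codeGenFunction(*Work[I]);   // read-only over shared IR
>       }));
>     for (size_t T = 0; T != Threads.size(); ++T)
>       Threads[T].join();
>   }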
>
> * Parallelize the function passes, and SCCs that are siblings in the
> call tree (again within a single LLVMContext). The gnarly part of this
> is that globals have shared use-lists which are updated as we modify
> each function individually. Those globals either need to have locks on
> their use-lists, have them replaced with a lockless list, or have them
> removed entirely. Probably both, as GlobalVariables have use-lists we
> actually use in the optimizers, but we don't actually need the
> use-list for "i32 0". The toy program below shows exactly where the
> race would be.
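>
> In this toy program (3.3-era IR APIs), each worker edits only its own
> function, yet creating the load links a new Use into @g's shared
> use-list. The mutex is hypothetical, just marking where serialization
> would have to happen; nothing like it exists in the tree. Terminators
> are elided to keep the toy short.
>
>   #include "llvm/IR/Constants.h"
>   #include "llvm/IR/DerivedTypes.h"
>   #include "llvm/IR/IRBuilder.h"
>   #include "llvm/IR/LLVMContext.h"
>   #include "llvm/IR/Module.h"
>   #include <mutex>
>   #include <thread>
>   using namespace llvm;
>
>   static std::mutex UseListLock; // hypothetical; no such lock exists
>
>   static void addLoadOfG(Function *F, GlobalVariable *G) {
>     // The new load lives in F alone, but linking its operand Use into
>     // G's use-list touches state shared by every function in the
>     // module -- the update that must be locked, lock-free, or gone.
>     std::lock_guard<std::mutex> Guard(UseListLock);
>     IRBuilder<> B(&F->getEntryBlock(), F->getEntryBlock().begin());
>     B.CreateLoad(G, "g.val");
>   }
>
>   int main() {
>     LLVMContext Ctx;
>     Module M("m", Ctx);
>     Type *I32 = Type::getInt32Ty(Ctx);
>     GlobalVariable *G = new GlobalVariable(
>         M, I32, false, GlobalValue::ExternalLinkage,
>         ConstantInt::get(I32, 0), "g");
>     FunctionType *FTy = FunctionType::get(Type::getVoidTy(Ctx), false);
>     Function *F1 =
>         Function::Create(FTy, GlobalValue::ExternalLinkage, "f1", &M);
>     Function *F2 =
>         Function::Create(FTy, GlobalValue::ExternalLinkage, "f2", &M);
>     BasicBlock::Create(Ctx, "entry", F1);
>     BasicBlock::Create(Ctx, "entry", F2);
>     std::thread T1(addLoadOfG, F1, G), T2(addLoadOfG, F2, G);
>     T1.join();
>     T2.join();
>     return 0;
>   }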
>
> * Parallelize ModulePasses by splitting them into an analysis phase
> and an optimization phase. Make each per-TU build emit the .bc as
> usual plus an analysis file (for instance, the call graph, or "which
> functions read/modify which globals"). Merge all the analysis files
> and farm them back out to be used as input to the programs optimizing
> each .bc individually -- but now they have total knowledge of the
> whole-program call graph and other AA information, etc.
> - You can combine this with an RPC layer to give each worker the
> ability to ask for the definition of a function from another worker.
> LLVM already supports "unmaterialized" functions whose definitions
> are loaded lazily. The analysis part should arrange to give us enough
> information to decide whether we want to do the inlining; then, if we
> decide to materialize the function, we get its body from another
> worker.
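>
> A sketch of the lazy-materialization half of that idea, using 3.3-era
> bitcode APIs; fetchBitcodeFor() is a hypothetical stand-in for the
> RPC transport:
>
>   #include "llvm/Bitcode/ReaderWriter.h"
>   #include "llvm/IR/LLVMContext.h"
>   #include "llvm/IR/Module.h"
>   #include "llvm/Support/MemoryBuffer.h"
>   #include <string>
>   using namespace llvm;
>
>   // Hypothetical RPC call: returns the bitcode of the module that
>   // defines FnName (the caller takes ownership of the buffer).
>   MemoryBuffer *fetchBitcodeFor(const char *FnName);
>
>   // Consult the merged summary first; only if the summary says the
>   // inline is worthwhile do we pay for moving and parsing the body.
>   Function *materializeCallee(LLVMContext &Ctx, const char *FnName) {
>     std::string Err;
>     Module *M = getLazyBitcodeModule(fetchBitcodeFor(FnName), Ctx, &Err);
>     if (!M)
>       return 0;
>     Function *F = M->getFunction(FnName);
>     if (F && F->isMaterializable())
>       F->Materialize(&Err); // reads just this function's body
>     return F;
>   }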
>
> * Parallelize by splitting into different LLVMContexts. This spares us
> the difficulty of adding locks or otherwise changing the internals of
> LLVM, gets us the ability to spread the load across multiple machines,
> and if combined with the RPC idea above you can also get good inlining
> without necessarily loading the whole program into memory on a single
> machine.
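>
> For illustration, a minimal per-worker-context harness with 3.3-era
> APIs; optimizeAndCodeGen() is a stand-in for the post-IPO pipeline,
> and the partition files could come from a splitter like the one
> sketched earlier:
>
>   #include "llvm/ADT/OwningPtr.h"
>   #include "llvm/Bitcode/ReaderWriter.h"
>   #include "llvm/IR/LLVMContext.h"
>   #include "llvm/IR/Module.h"
>   #include "llvm/Support/MemoryBuffer.h"
>   #include <string>
>   #include <thread>
>   #include <vector>
>   using namespace llvm;
>
>   void optimizeAndCodeGen(Module *M); // stand-in, post-IPO pipeline
>
>   // Each worker owns a private LLVMContext, so no IR, type tables,
>   // or constant pools are shared and no locking is needed.
>   static void worker(std::string BcFile) {
>     LLVMContext Ctx;                 // one context per thread
>     OwningPtr<MemoryBuffer> Buf;
>     if (MemoryBuffer::getFile(BcFile, Buf))
>       return;                        // nonzero error_code: read failed
>     std::string Err;
>     if (Module *M = ParseBitcodeFile(Buf.get(), Ctx, &Err)) {
>       optimizeAndCodeGen(M);
>       delete M;
>     }
>   }
>
>   void runWorkers(const std::vector<std::string> &Pieces) {
>     std::vector<std::thread> Threads;
>     for (size_t I = 0; I != Pieces.size(); ++I)
>       Threads.push_back(std::thread(worker, Pieces[I]));
>     for (size_t I = 0; I != Threads.size(); ++I)
>       Threads[I].join();
>   }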
>
> I'm not planning to do the work any time soon, so count my vote with
> that in mind, but if you ask me I think the first step should be to
> parallelize the backend within a single LLVMContext first, then to
> parallelize the function passes and CGSCC passes (across siblings only
> of course) second. Removing the use-list from simple constants is a
> very interesting thing to do to decrease lock contention, but we may
> want to do something smarter than just removing it -- consider
> emitting a large constant that is only used by an inline function. It
> is possible to emit a table of constants in the same COMDAT group as
> the function; then, if the inline function is discarded by the linker,
> the constants are discarded with it. I don't have a concrete proposal
> for that.
>
> Nick
>
>
Thank you for sharing your enlightening thoughts. I had heard some of
these ideas before; some are quite new to me. I will take them into
account as the project moves on.

The motivation of this project is not merely to make compilation
faster. It is also to:
  - significantly ease troubleshooting -- I was asked to fix LTO bugs
    several times, and it almost drove me mad to pinpoint a bug in a
    huge merged module. It is definitely an unglamorous and
    painstaking undertaking :-)
  - take one step toward better scalability.

For now, I don't want to parallelize CodeGen only, as post-IPO scalar
opt is a compile-time hog as well. On the other hand, HPC folks may
invoke Loop-Nest-Opt/autopar/etc. in the post-IPO stage too, and those
are intrinsically very slow breeds. So parallelizing the entire
post-IPO stage will make them happy. Finer-grained parallelism could
be promising; however, it is too error-prone :-).