[LLVMdev] [Proposal] Parallelize post-IPO stage.
Shuxin Yang
shuxin.llvm at gmail.com
Tue Jul 16 16:07:05 PDT 2013
On 7/16/13 3:32 PM, Nick Lewycky wrote:
> On 12 July 2013 15:49, Shuxin Yang <shuxin.llvm at gmail.com
> <mailto:shuxin.llvm at gmail.com>> wrote:
>
> Hi there,
>
> This is the proposal for parallelizing the post-IPO stage. See the
> following for details.
>
> I also attach a toy-grade rudimentary implementation. This
> implementation can be used to illustrate some concepts here. This
> patch is not going to be committed.
>
> Unfortunately, this weekend I will be too busy to read emails.
> Please do not construe a delayed response as rudeness :-).
>
> Thanks a lot in advance for your time and insightful comments!
>
> Shuxin
>
>
> The proposal
> ------------
> It is organized as follows:
> 1) background info (please skip it if you already know how IPO works)
> 2) the motivation for parallelizing the post-IPO stage
> 3) how to parallelize the post-IPO stage
> 4) the linker problems
> 5) the toy-grade rudimentary implementation
> 6) misc
>
> 1. Some background
> -------------------
>
> Interprocedural-optimization compilation, aka IPO or IPA, typically
> consists of three stages:
>
> S1) pre-IPO
> Each function goes through some analyses and not-very-aggressive
> optimizations. Some information is collected during this stage;
> this info will be passed on to the IPO stages. This info is
> usually called summary info.
>
> The result of this stage is "fake objects": binary files that use
> some known object format to encapsulate the IR as well as the
> summary info, along with other data.
>
> S2) IPO:
> The compiler works with the linker to resolve and merge symbols
> in the "fake objects".
>
> Then interprocedural analyses (IPA) are invoked to perform
> interprocedural analysis, either based on the summary info or
> directly on the IR.
>
> Interprocedural optimizations (IPO) are then performed based on
> the IPA results.
>
> In some compilers, IPA and IPO are separated. One reason is
> that many IPAs can be conducted directly on the concise summary
> info, while many IPOs need to load IR and bulky
> annotation/metadata into memory.
>
> S3) post-IPO:
> Typically consists of Loop-Nest-Opt, Scalar Opt, Code-Gen, etc.
> While these are intra-procedural analyses/optimizers, they may
> directly benefit from the info collected in the IPO stages and
> passed down the road.
>
> LLVM collectively calls S2 and S3 "LTO CodeGen", which is very
> confusing.
>
> 2. Why parallelize the post-IPO stage
> ==============================
>
> R1) To improve scalability.
> It is quite obvious that we are not able to put everything about
> a monster program in memory at once.
>
> Even if you can afford an expensive computer, the address space
> of a single compiler process cannot accommodate a monster
> program.
>
> R2) To take advantage of ample HW resources to shorten compile time.
> R3) To make debugging a lot easier.
> One can triage problems in a much smaller partition rather than
> in the huge monster program.
>
> This proposal is not able to achieve goal R1 at this moment,
> because during the IPO stage the compiler currently brings
> everything into memory at once.
>
> 3. How to parallelize post-IPO stage
> ====================================
>
> From 5,000 feet up, the concept is very simple:
> step 1) divide the merged IR into small pieces,
> step 2) compile each of these pieces independently,
> step 3) feed the objects of each piece back to the linker, where
> they are linked into an executable or a dynamic lib.
>
> Sections 3.1 through 3.3 describe these three steps respectively;
> a minimal sketch of the overall flow follows.
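>
> To make steps 1) and 2) concrete, here is a minimal sketch, assuming
> a 3.3-era LLVM tree. The round-robin partitioner and the output file
> names are placeholders for illustration; a production partitioner
> would use a smarter heuristic, and internal functions referenced
> across pieces would need promoting to unique external names (elided
> here).
>
>   #include "llvm/ADT/Twine.h"
>   #include "llvm/Bitcode/ReaderWriter.h"
>   #include "llvm/IR/Module.h"
>   #include "llvm/Support/raw_ostream.h"
>   #include "llvm/Transforms/Utils/Cloning.h"
>   #include <string>
>   using namespace llvm;
>
>   // Step 1): split the merged post-IPO module into N pieces. Piece K
>   // keeps the bodies of the functions assigned to it; every other
>   // function body is dropped (deleteBody() also converts the linkage
>   // to external), so the linker resolves cross-piece calls in step 3).
>   static void emitPartitions(const Module *Merged, unsigned N) {
>     for (unsigned K = 0; K != N; ++K) {
>       Module *Piece = CloneModule(Merged);
>       unsigned Idx = 0;
>       for (Module::iterator F = Piece->begin(), E = Piece->end();
>            F != E; ++F, ++Idx)
>         if (!F->isDeclaration() && Idx % N != K)
>           F->deleteBody();           // demote to a declaration
>       // Step 2): each piece is written out and compiled to a real
>       // object file by an independent worker (process or thread).
>       std::string Name = ("part" + Twine(K) + ".bc").str();
>       std::string Err;
>       raw_fd_ostream Out(Name.c_str(), Err);
>       if (Err.empty())
>         WriteBitcodeToFile(Piece, Out);
>       delete Piece;
>     }
>   }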
>
>
> Yes, this is one approach. I think others at Google have looked into
> this sort of partitioning with GCC and found that the one thing that
> really helps you pick the partitions is to profile the program and
> make the partitions based on the actual paths of functions seen by the
> profile. I don't think they saw much improvement without that. See
> http://research.google.com/pubs/pub36355.html .
>
> I do want to mention some other things we can do to parallelize. You
> may already know of these and have considered them and rejected them
> before deciding on the design you emailed out here, but I want to list
> them since there's already another thread with a different approach on
> the mailing list.
>
> * Use what we have. LTO a partition at a time, directories for
> instance, on the premise that LTO'ing them will produce something
> smaller than the sum of their inputs. Then, when you do the final
> whole-program step, it will receive smaller inputs than it otherwise
> would. The change to LLVM here is to fix the inliner (or other
> optimizations?) to reduce the chance that LTO produces an output
> larger than its input.
>
> * Parallelize the per-function code generator within a single
> LLVMContext. CodeGen presently operates on a per-function basis, and
> is structured as an analysis over llvm IR. There shouldn't be any
> global state, and there shouldn't be any need for locking accesses to
> the IR since nothing will be mutating it. This will even speed up
> clang -O0, and is a solid first step that gets a thread-creation API
> into LLVM.
> - What happened to "one LLVMContext per thread"? Okay, that rule is
> a little white lie. Always was. LLVMContext allows two libraries to
> use llvm under the hood without interfering with each other (even to
> the point of separate maps of types, to keep one library from causing
> slow type lookups in the other). LLVM also doesn't have locking around
> accesses to the IR, and makes few guarantees about how many things a
> single mutating operation will need to look at or change, but with
> those caveats in mind it is possible to share a context across
> threads. Because CodeGen is structured as an analysis over the IR
> without mutating it, it should work. There's probably still global
> state in the code generator somewhere, but it's not a structural
> problem.
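>
> Here is a sketch of the shape that change could take -- not of any
> API that exists today. codeGenFunction() is a hypothetical
> thread-safe entry point that runs the existing per-function codegen
> pipeline; C++11 threads are used for brevity.
>
>   #include "llvm/IR/Module.h"
>   #include <thread>
>   #include <vector>
>   using namespace llvm;
>
>   // Hypothetical: runs the per-function codegen pipeline over F,
>   // treating the IR strictly as read-only input.
>   void codeGenFunction(Function &F);
>
>   // All workers share one LLVMContext and one Module; per the
>   // caveats above this is safe only because nothing mutates the IR.
>   void parallelCodeGen(Module &M, unsigned NThreads) {
>     std::vector<Function *> Work;
>     for (Module::iterator F = M.begin(), E = M.end(); F != E; ++F)
>       if (!F->isDeclaration())
>         Work.push_back(&*F);
>     std::vector<std::thread> Threads;
>     for (unsigned T = 0; T != NThreads; ++T)
>       Threads.push_back(std::thread([&Work, T, NThreads] {
>         for (size_t I = T; I < Work.size(); I += NThreads)
>           codeGenFunction(*Work[I]);   // read-only over shared IR
>       }));
>     for (size_t T = 0; T != Threads.size(); ++T)
>       Threads[T].join();
>   }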
>
> * Parallelize the function passes, and SCCs that are siblings in the
> call tree (again within a single LLVMContext). The gnarly part of this
> is that globals have shared use-lists which are updated as we modify
> each function individually. Those globals either need to have locks on
> their use-lists, have them replaced with a lockless list, or have them
> removed entirely. Probably both, as GlobalVariables have use-lists we
> actually use in the optimizers, but we don't actually need the
> use-list for "i32 0". The toy program below shows exactly where the
> race would be.
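>
> In this toy program (3.3-era IR APIs), each worker edits only its own
> function, yet creating the load links a new Use into @g's shared
> use-list. The mutex is hypothetical, just marking where serialization
> would have to happen; nothing like it exists in the tree. Terminators
> are elided to keep the toy short.
>
>   #include "llvm/IR/Constants.h"
>   #include "llvm/IR/DerivedTypes.h"
>   #include "llvm/IR/IRBuilder.h"
>   #include "llvm/IR/LLVMContext.h"
>   #include "llvm/IR/Module.h"
>   #include <mutex>
>   #include <thread>
>   using namespace llvm;
>
>   static std::mutex UseListLock; // hypothetical; no such lock exists
>
>   static void addLoadOfG(Function *F, GlobalVariable *G) {
>     // The new load lives in F alone, but linking its operand Use into
>     // G's use-list touches state shared by every function in the
>     // module -- the update that must be locked, lock-free, or gone.
>     std::lock_guard<std::mutex> Guard(UseListLock);
>     IRBuilder<> B(&F->getEntryBlock(), F->getEntryBlock().begin());
>     B.CreateLoad(G, "g.val");
>   }
>
>   int main() {
>     LLVMContext Ctx;
>     Module M("m", Ctx);
>     Type *I32 = Type::getInt32Ty(Ctx);
>     GlobalVariable *G = new GlobalVariable(
>         M, I32, false, GlobalValue::ExternalLinkage,
>         ConstantInt::get(I32, 0), "g");
>     FunctionType *FTy = FunctionType::get(Type::getVoidTy(Ctx), false);
>     Function *F1 =
>         Function::Create(FTy, GlobalValue::ExternalLinkage, "f1", &M);
>     Function *F2 =
>         Function::Create(FTy, GlobalValue::ExternalLinkage, "f2", &M);
>     BasicBlock::Create(Ctx, "entry", F1);
>     BasicBlock::Create(Ctx, "entry", F2);
>     std::thread T1(addLoadOfG, F1, G), T2(addLoadOfG, F2, G);
>     T1.join();
>     T2.join();
>     return 0;
>   }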
>
> * Parallelize ModulePasses by splitting them into an analysis phase
> and an optimization phase. Make each per-TU build emit the .bc as
> usual plus an analysis file (for instance, the call graph, or "which
> functions read/modify which globals"). Merge all the analysis files
> and farm them back out to be used as input to the programs optimizing
> each .bc individually -- but now they have total knowledge of the
> whole-program call graph and other AA information, etc.
> - You can combine this with an RPC layer to give each worker the
> ability to ask for the definition of a function from another worker.
> LLVM already supports "unmaterialized" functions whose definitions
> are loaded lazily. The analysis part should arrange to give us enough
> information to decide whether we want to do the inlining; then, if we
> decide to materialize the function, we get its body from another
> worker.
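>
> A sketch of the lazy-materialization half of that idea, using 3.3-era
> bitcode APIs; fetchBitcodeFor() is a hypothetical stand-in for the
> RPC transport:
>
>   #include "llvm/Bitcode/ReaderWriter.h"
>   #include "llvm/IR/LLVMContext.h"
>   #include "llvm/IR/Module.h"
>   #include "llvm/Support/MemoryBuffer.h"
>   #include <string>
>   using namespace llvm;
>
>   // Hypothetical RPC call: returns the bitcode of the module that
>   // defines FnName (the caller takes ownership of the buffer).
>   MemoryBuffer *fetchBitcodeFor(const char *FnName);
>
>   // Consult the merged summary first; only if the summary says the
>   // inline is worthwhile do we pay for moving and parsing the body.
>   Function *materializeCallee(LLVMContext &Ctx, const char *FnName) {
>     std::string Err;
>     Module *M = getLazyBitcodeModule(fetchBitcodeFor(FnName), Ctx, &Err);
>     if (!M)
>       return 0;
>     Function *F = M->getFunction(FnName);
>     if (F && F->isMaterializable())
>       F->Materialize(&Err); // reads just this function's body
>     return F;
>   }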
>
> * Parallelize by splitting into different LLVMContexts. This spares us
> the difficulty of adding locks or otherwise changing the internals of
> LLVM, gets us the ability to spread the load across multiple machines,
> and if combined with the RPC idea above you can also get good inlining
> without necessarily loading the whole program into memory on a single
> machine.
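>
> For illustration, a minimal per-worker-context harness with 3.3-era
> APIs; optimizeAndCodeGen() is a stand-in for the post-IPO pipeline,
> and the partition files could come from a splitter like the one
> sketched earlier:
>
>   #include "llvm/ADT/OwningPtr.h"
>   #include "llvm/Bitcode/ReaderWriter.h"
>   #include "llvm/IR/LLVMContext.h"
>   #include "llvm/IR/Module.h"
>   #include "llvm/Support/MemoryBuffer.h"
>   #include <string>
>   #include <thread>
>   #include <vector>
>   using namespace llvm;
>
>   void optimizeAndCodeGen(Module *M); // stand-in, post-IPO pipeline
>
>   // Each worker owns a private LLVMContext, so no IR, type tables,
>   // or constant pools are shared and no locking is needed.
>   static void worker(std::string BcFile) {
>     LLVMContext Ctx;                 // one context per thread
>     OwningPtr<MemoryBuffer> Buf;
>     if (MemoryBuffer::getFile(BcFile, Buf))
>       return;                        // nonzero error_code: read failed
>     std::string Err;
>     if (Module *M = ParseBitcodeFile(Buf.get(), Ctx, &Err)) {
>       optimizeAndCodeGen(M);
>       delete M;
>     }
>   }
>
>   void runWorkers(const std::vector<std::string> &Pieces) {
>     std::vector<std::thread> Threads;
>     for (size_t I = 0; I != Pieces.size(); ++I)
>       Threads.push_back(std::thread(worker, Pieces[I]));
>     for (size_t I = 0; I != Threads.size(); ++I)
>       Threads[I].join();
>   }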
>
> I'm not planning to do the work any time soon, so count my vote with
> that in mind, but if you ask me I think the first step should be to
> parallelize the backend within a single LLVMContext first, then to
> parallelize the function passes and CGSCC passes (across siblings only
> of course) second. Removing the use-list from simple constants is a
> very interesting thing to do to decrease lock contention, but we may
> want to do something smarter than just removing it -- consider
> emitting a large constant that is only used by an inline function. It
> is possible to emit a table of constants in the same COMDAT group as
> the function; then, if the inline function is discarded by the linker,
> the constants are discarded with it. I don't have a concrete proposal
> for that.
>
> Nick
>
>
Thank you for sharing your enlightening thoughts. I had heard some of
these ideas before; some are quite new to me. I will take them into
account as the project moves on.

The motivation of this project is not merely to make compilation
faster. It is also to:
  - significantly ease troubleshooting -- I was asked to fix LTO bugs
    several times, and it almost drove me mad to pinpoint a bug in a
    huge merged module. It is definitely an unglamorous and
    painstaking undertaking :-)
  - take one step toward better scalability.

For now, I don't want to parallelize CodeGen only, as post-IPO scalar
opt is a compile-time hog as well. On the other hand, HPC folks may
invoke Loop-Nest-Opt/autopar/etc. in the post-IPO stage too, and those
are intrinsically very slow breeds. So parallelizing the entire
post-IPO stage will make them happy. Finer-grained parallelism could
be promising; however, it is too error-prone :-).