<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
<div class="moz-cite-prefix">On 7/16/13 3:32 PM, Nick Lewycky wrote:<br>
</div>
<blockquote
cite="mid:CADbEz-hfnqvWBcVSO+xLx_7ujGGD1eHrNnbinakhgn_4Wqpq5Q@mail.gmail.com"
type="cite">
<div dir="ltr">On 12 July 2013 15:49, Shuxin Yang <span dir="ltr"><<a
moz-do-not-send="true" href="mailto:shuxin.llvm@gmail.com"
target="_blank">shuxin.llvm@gmail.com</a>></span> wrote:<br>
<div class="gmail_extra">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Hi there,<br>
<br>
This is the proposal for parallelizing the post-IPO stage.
See the following for details.<br>
<br>
I also attach a toy-grade, rudimentary implementation. This
implementation can be<br>
used to illustrate some of the concepts here. This patch is not
going to be committed.<br>
<br>
Unfortunately, this weekend I will be too busy to read
emails. Please do not construe<br>
a delayed response as rudeness :-).<br>
<br>
Thanks a lot in advance for your time and insightful comments!<br>
<br>
Shuxin<br>
<br>
<br>
The proposal<br>
------------<br>
It is organized as follows:<br>
1) background info -- if this is as familiar to you as
"/usr/bin/ls", please skip it<br>
2) the motivation for parallelizing the post-IPO stage<br>
3) how to parallelize the post-IPO stage<br>
4) the linker problems<br>
5) the toy-grade rudimentary implementation<br>
6) misc<br>
<br>
1. Some background<br>
------------------<br>
<br>
Interprocedural-optimization compilation, aka IPO or IPA,
typically<br>
consists of three stages:<br>
<br>
S1) pre-IPO<br>
Each function goes through some analysis and some
not-very-aggressive optimizations.<br>
Some information is collected during this stage and passed on
to the IPO stage;<br>
this info is usually called summary info.<br>
<br>
The result of this stage is "fake objects": binary files
that use<br>
a known object format to encapsulate the IR as well as the
summary info, along with<br>
other data.<br>
<br>
S2) IPO:<br>
The compiler works with the linker to resolve and merge
symbols in the "fake objects".<br>
<br>
Then interprocedural analyses (IPA) are invoked to perform
interprocedural<br>
analysis, either based on the summary info or directly on
the IR.<br>
<br>
Interprocedural optimizations (IPO) are then run based on
the IPA results.<br>
<br>
In some compilers, IPA and IPO are separated. One reason is
that many IPAs can<br>
be conducted directly on the concise summary info, while many
IPOs need to load<br>
the IR and bulky annotations/metadata into memory.<br>
<br>
S3) post-IPO:<br>
Typically consists of loop-nest optimization, scalar
optimization, code generation, etc. While these<br>
are intra-procedural analyses/optimizers, they may directly
benefit from<br>
the info collected in the IPO stage and passed down the
road.<br>
<br>
LLVM collectively calls S2 and S3 "LTO CodeGen", which is
very confusing.<br>
<br>
2. Why parallelize the post-IPO stage<br>
=====================================<br>
<br>
R1) To improve scalability.<br>
It is quite obvious that we are not able to put everything
about a monster<br>
program in memory at once.<br>
<br>
Even if you can afford an expensive computer, the address
space of a<br>
single compiler process cannot accommodate a monster
program.<br>
<br>
R2) To take advantage of ample hardware resources to shorten
compile time.<br>
R3) To make debugging a lot easier.<br>
One can triage problems in a much smaller partition rather
than in<br>
the huge monster program.<br>
<br>
This proposal cannot achieve goal R1 at this moment, because
during<br>
the IPO stage the compiler currently brings everything into
memory at once.<br>
<br>
3. How to parallelize the post-IPO stage<br>
========================================<br>
<br>
From 5,000 feet up, the concept is very simple:<br>
step 1) divide the merged IR into small pieces,<br>
step 2) compile each of these pieces independently, and<br>
step 3) feed the objects of each piece back to the linker to
be linked<br>
into an executable or a dynamic library.<br>
<br>
Sections 3.1 through 3.3 describe these three steps
respectively.<br>
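<br>
To make step 1 concrete, here is a minimal C++ sketch of one way
to partition a merged module. It assumes the CloneModule overload
that takes a ShouldCloneDefinition predicate; the round-robin
policy and the handling of global variables are deliberately
naive (the duplicated definitions it produces are exactly the
kind of linker problem section 4 discusses):<br>
<pre>
// Partition a merged module by cloning it N times and keeping a
// disjoint subset of the function definitions in each clone.
#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/Utils/Cloning.h"
#include "llvm/Transforms/Utils/ValueMapper.h"
#include <memory>
#include <vector>

using namespace llvm;

std::vector<std::unique_ptr<Module>>
partitionModule(const Module &Merged, unsigned NumParts) {
  // Naive round-robin assignment of function definitions to partitions.
  // A production version would partition by call-graph locality or by
  // profile data instead.
  DenseMap<const Function *, unsigned> PartOf;
  unsigned Next = 0;
  for (const Function &F : Merged)
    if (!F.isDeclaration())
      PartOf[&F] = Next++ % NumParts;

  std::vector<std::unique_ptr<Module>> Parts;
  for (unsigned I = 0; I != NumParts; ++I) {
    ValueToValueMapTy VMap;
    Parts.push_back(CloneModule(Merged, VMap, [&](const GlobalValue *GV) {
      // Keep this partition's function bodies; other functions become
      // external declarations that the final link (step 3) resolves.
      if (const auto *F = dyn_cast<Function>(GV))
        return PartOf.lookup(F) == I;
      // Naively keep every global variable definition in every part.
      return true;
    }));
  }
  return Parts;
}
</pre>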
</blockquote>
<div><br>
</div>
<div>Yes, this is one approach. I think others at Google
have looked into this sort of partitioning with GCC and
found that the one thing which really helps you pick the
partitions is to profile the program and make the
partitions based on the actual paths of functions seen by
the profile. I don't think they saw much improvement
without that. See <a moz-do-not-send="true"
href="http://research.google.com/pubs/pub36355.html">http://research.google.com/pubs/pub36355.html</a>.</div>
<div><br>
</div>
<div>I do want to mention some other things we can do to
parallelize. You may already know of these and have
considered them and rejected them before deciding on the
design you emailed out here, but I want to list them since
there's already another thread with a different approach
on the mailing list.</div>
<div><br>
</div>
<div>* Use what we have. Run LTO on one partition at a time,
a directory for instance, on the premise that LTO'ing a
partition will produce something smaller than the sum of its
inputs. Then when you do the final whole-program step, it
will be receiving smaller inputs than it otherwise would.
The change to LLVM here is to fix the inliner (or other
optimizations?) to reduce the chance that LTO produces an
output larger than its input.<br>
</div>
<div><br>
</div>
<div>* Parallelize the per-function code generator within a
single LLVMContext. CodeGen presently operates on a
per-function basis, and is structured as an analysis over
llvm IR. There shouldn't be any global state, and there
shouldn't be any need for locking accesses to the IR since
nothing will be mutating it. This will even speed up clang
-O0, and is a solid first step that gets a thread-creation
API into LLVM.</div>
<div> - What happened to "one LLVMContext per thread"?
Okay, that rule is a little white lie. Always was.
LLVMContext allows two libraries to use llvm under the
hood without interfering with each other (even to the
point of separate maps of types, to prevent one library
from causing slow type lookups in the other). LLVM also
doesn't have locking around accesses to the IR, and makes
few guarantees about how many things a single mutating
operation will need to look at or change, but with those
caveats in mind it is possible to share a context across
threads. Because CodeGen is structured as an analysis over
the IR without mutating the IR, it should work. There's
probably still global state in the code generator
somewhere, but it's not a structural problem.</div>
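<div><br>
</div>
<div>A minimal sketch of the threading structure this
describes: one shared context and module, plus worker
threads that treat the IR as read-only. The code-generation
callback is a hypothetical stand-in, since today's CodeGen
entry points are not exposed per-function like this:</div>
<pre>
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

using namespace llvm;

// CodeGenFn is a hypothetical per-function code generator. It must only
// read the IR; nothing here mutates the module, so no locking is needed.
void parallelCodeGen(const Module &M, unsigned NumThreads,
                     std::function<void(const Function &)> CodeGenFn) {
  // Snapshot the work list up front so the worker threads never touch
  // the module's function list itself.
  std::vector<const Function *> Work;
  for (const Function &F : M)
    if (!F.isDeclaration())
      Work.push_back(&F);

  std::atomic<size_t> Next{0}; // lock-free work distribution
  std::vector<std::thread> Threads;
  for (unsigned I = 0; I != NumThreads; ++I)
    Threads.emplace_back([&] {
      for (size_t J = Next++; J < Work.size(); J = Next++)
        CodeGenFn(*Work[J]);
    });
  for (std::thread &T : Threads)
    T.join();
}
</pre>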
<div><br>
</div>
<div>* Parallelize the function passes, and SCCs that are
siblings in the call tree (again within a single
LLVMContext). The gnarly part of this is that globals have
shared use-lists which are updated as we modify each
function individually. Those globals either need to have
locks on their use-lists, replaced with a lockless list,
or removed entirely. Probably both, as GlobalVariables
have use-lists we actually use in the optimizers, but we
don't actually need the use-list for "i32 0".</div>
<div><br>
</div>
<div>* Parallelize ModulePasses by splitting them into an
analysis phase and an optimization phase. Make each per-TU
build emit the .bc as usual plus an analysis file (for
instance, the call graph, or "which functions read/modify
which globals"). Merge all the analysis files and farm
them back out to be used as input to the programs
optimizing each .bc individually -- but now they have
total knowledge of the whole-program call graph and other
AA information, etc.</div>
<div> - You can combine this with an RPC layer to give each
worker the ability to ask for the definition of a function
from another worker. LLVM already supports
"unmaterialized" functions where the definition is loaded
lazily. The analysis part should arrange to give us enough
information to determine whether we want to do the
inlining, then if we decide to materialize the function we
get its body from another worker.</div>
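<div><br>
</div>
<div>As a sketch of what the per-TU analysis phase could emit,
the following dumps a call-graph summary using llvm::CallGraph;
the one-edge-per-line text format (and the idea that a separate
merge step concatenates and deduplicates these files across all
TUs) is made up for illustration:</div>
<pre>
#include "llvm/Analysis/CallGraph.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// Write one "caller -> callee" line per direct call edge in the module.
// Merging the files from all TUs yields the whole-program call graph.
void writeCallGraphSummary(Module &M, raw_ostream &OS) {
  CallGraph CG(M);
  for (const auto &Node : CG) {
    const Function *Caller = Node.first;
    if (!Caller)
      continue; // skip the synthetic external-calling root node
    for (const auto &Edge : *Node.second)
      if (const Function *Callee = Edge.second->getFunction())
        OS << Caller->getName() << " -> " << Callee->getName() << "\n";
  }
}
</pre>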
<div><br>
</div>
<div>* Parallelize by splitting into different LLVMContexts.
This spares us the difficulty of adding locks or otherwise
changing the internals of LLVM, gets us the ability to
spread the load across multiple machines, and if combined
with the RPC idea above you can also get good inlining
without necessarily loading the whole program into memory
on a single machine.<br>
</div>
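<div><br>
</div>
<div>A sketch of that shape, using the current bitcode-reader
API for illustration: each worker materializes its own partition
into a thread-private LLVMContext, so no LLVM-internal locking
is required. ProcessFn stands in for whatever post-IPO pipeline
runs on each partition:</div>
<pre>
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Bitcode/BitcodeReader.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/MemoryBuffer.h"
#include <functional>
#include <thread>
#include <vector>

using namespace llvm;

void processPartitions(ArrayRef<MemoryBufferRef> PartitionBitcode,
                       std::function<void(Module &)> ProcessFn) {
  std::vector<std::thread> Threads;
  for (MemoryBufferRef Buf : PartitionBitcode)
    Threads.emplace_back([Buf, &ProcessFn] {
      LLVMContext Ctx; // one private context per worker: nothing shared
      Expected<std::unique_ptr<Module>> M = parseBitcodeFile(Buf, Ctx);
      if (!M) {
        consumeError(M.takeError()); // real code would report the error
        return;
      }
      ProcessFn(**M);
    });
  for (std::thread &T : Threads)
    T.join();
}
</pre>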
<div><br>
</div>
<div>I'm not planning to do the work any time soon so count
my vote with that in mind, but if you ask me I think the
first step should be to parallelize the backend within a
single LLVMContext first, then to parallelize the function
passes and CGSCC passes (across siblings only of course)
second. Removing the use-list from simple constants is a
very interesting thing to do to decrease lock contention,
but we may want to do something smarter than just remove
it -- consider emitting a large constant that is only used
by an inline function. It is possible to emit a table of
constants in the same COMDAT group as the function, then
if the inline function is discarded by the linker the
constants are discarded with it. I don't have a concrete
proposal for that.</div>
<div><br>
</div>
<div>Nick</div>
<div><br>
</div>
<br>
</div>
</div>
</div>
</blockquote>
<br>
Thank you for sharing your enlightening thoughts. I had heard some
of these ideas before; some are quite new to me. <br>
I will take them into account as the project moves on. <br>
<br>
The motivation of this project is not merely to make compilation
faster. It is also to:<br>
- significantly ease troubleshooting -- I have been asked to fix
LTO bugs several times, and <br>
it almost drove me mad to pinpoint a bug in a huge merged module.
It is definitely an<br>
unglamorous and painstaking undertaking :-)<br>
<br>
- this is one step toward better scalability. <br>
<br>
For now, I don't want to parallelize CodeGen only, as post-IPO
scalar optimization hogs compile time as well. <br>
On the other hand, HPC folks may invoke Loop-Nest-Opt/autopar/etc.
in the post-IPO stage as well, and those are intrinsically <br>
very slow breeds. So parallelizing the entire post-IPO stage will
make them happy. Finer-grained parallelism <br>
could be promising; however, it is too error-prone :-).<br>
<br>
</body>
</html>