[LLVMdev] [Proposal] Parallelize post-IPO stage.

Wed Jul 17 13:06:49 PDT 2013

On 7/17/13 12:35 PM, Diego Novillo wrote:
> On Fri, Jul 12, 2013 at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>
>> 3. How to parallelize post-IPO stage
>> ====================================
>>
>>    From 5k' high, the concept is very simple, just to
>>     step 1).divide the merged IR into small pieces,
>>     step 2).and compile each of this pieces independendly.
>>     step 3) the objects of each piece are fed back to linker to are linked
>>             into an executable, or a dynamic lib.
> You seem to be describing GCC's strategy for whole program
> optimization (http://gcc.gnu.org/wiki/LinkTimeOptimization).

Hi,  Diego:

   Thank you for the comment. Quite honestly, I'm not very familiar with
gcc's LTO implementation, and now I'm not allowed to read any GPL V3 code.

What I'm describing here is somewhat similar to Open64's IPA 
implementation.
I did port gcc before and of course read its document . Based on my
little knowledge of its internal, I think gcc's "whole-program"
mode is almost identical to Open64's IPA in spirit, I guess gcc 
community likely
borrowed  some idea from Open64.

On the other hand, as far as I can understand of the oldish gcc's 
implementation
before I join Apple, I guess the "whole-program mode" in gcc is misnomer.
I don't want call this work as "whole-program mode", I'd like to call 
"big program mode"
or something like that, as I don't care the binary being built see the 
whole-program or
not while the analyses/opt
do care the difference between them.

>
> What is a bit confusing in your description is that you seem to be
> wanting to do more work after IPO is *done*.

IPO is difficult to be parallelized.  What I try to do is to parallelize 
the post-lto compilation
stage, including, optionally LNO, and scalar opt and compile-time 
hogging CodeGen.

> If the optimizations are
> done, then there is no need to parallelize anything.
>
> In GCC, we have two parallel stages: the generation of bytecode (which
> is naturally parallelized by your build system) and the final
> optimization phase, which is parallelized by partitioning the
> callgraph into disjoint subsets.  These subsets are then parallelized
> by spawning the compiler on each of the partitions.
I call the 1st "parallel stage" pre-ipo, and the 2nd "parallel stage" 
post-IPO:-),
and the IPO (which you call WPA) is sandwiched in the middle.

>
> The only sequential part is the actual analysis (what we call WPA or
> whole program analysis).
While not include in the proposal, I was proposing another level 
(earlier) of partition in order
to build outrageously huge programs in order to parallelize the "WPA".  
The benefit
is to parallize the IPO/"WPA", however, the cost is not seeing the 
"whole program".

One of my coworker told me we care about the benefit reaped from seeing 
the "whole program"
than the improved compile-time at the cost of compiling bunch of 
"incomplete program"s independently.

> That is a single threaded phase that makes
> all the optimization decisions, annotates the summaries on each
> callgraph node, partitions the callgraph and then spawns all the
> optimizers to work on each section of the callgraph.
>
This is what I called "post-IPO" :-)