[LLVMdev] [Proposal] Parallelize post-IPO stage.

Wed Jul 17 13:16:17 PDT 2013

On Wed, Jul 17, 2013 at 1:06 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>
> On 7/17/13 12:35 PM, Diego Novillo wrote:
>>
>> On Fri, Jul 12, 2013 at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com>
>> wrote:
>>
>>> 3. How to parallelize post-IPO stage
>>> ====================================
>>>
>>>    From 5k' high, the concept is very simple, just to
>>>     step 1).divide the merged IR into small pieces,
>>>     step 2).and compile each of this pieces independendly.
>>>     step 3) the objects of each piece are fed back to linker to are
>>> linked
>>>             into an executable, or a dynamic lib.
>>
>> You seem to be describing GCC's strategy for whole program
>> optimization (http://gcc.gnu.org/wiki/LinkTimeOptimization).
>
>
> Hi,  Diego:
>
>   Thank you for the comment. Quite honestly, I'm not very familiar with
> gcc's LTO implementation, and now I'm not allowed to read any GPL V3 code.
>
> What I'm describing here is somewhat similar to Open64's IPA implementation.
> I did port gcc before and of course read its document . Based on my
> little knowledge of its internal, I think gcc's "whole-program"
> mode is almost identical to Open64's IPA in spirit, I guess gcc community
> likely
> borrowed  some idea from Open64.
>
> On the other hand, as far as I can understand of the oldish gcc's
> implementation
> before I join Apple, I guess the "whole-program mode" in gcc is misnomer.
> I don't want call this work as "whole-program mode", I'd like to call "big
> program mode"
> or something like that, as I don't care the binary being built see the
> whole-program or
> not while the analyses/opt
> do care the difference between them.
>
>
>>
>> What is a bit confusing in your description is that you seem to be
>> wanting to do more work after IPO is *done*.
>
>
> IPO is difficult to be parallelized.  What I try to do is to parallelize the
> post-lto compilation
> stage, including, optionally LNO, and scalar opt and compile-time hogging
> CodeGen.
>
>
>> If the optimizations are
>> done, then there is no need to parallelize anything.
>>
>> In GCC, we have two parallel stages: the generation of bytecode (which
>> is naturally parallelized by your build system) and the final
>> optimization phase, which is parallelized by partitioning the
>> callgraph into disjoint subsets.  These subsets are then parallelized
>> by spawning the compiler on each of the partitions.
>
> I call the 1st "parallel stage" pre-ipo, and the 2nd "parallel stage"
> post-IPO:-),
> and the IPO (which you call WPA) is sandwiched in the middle.
>
>
>>
>> The only sequential part is the actual analysis (what we call WPA or
>> whole program analysis).
>
> While not include in the proposal, I was proposing another level (earlier)
> of partition in order
> to build outrageously huge programs in order to parallelize the "WPA".  The
> benefit
> is to parallize the IPO/"WPA", however, the cost is not seeing the "whole
> program".
>
> One of my coworker told me we care about the benefit reaped from seeing the
> "whole program"
> than the improved compile-time at the cost of compiling bunch of "incomplete
> program"s independently.
>
>
>
>> That is a single threaded phase that makes
>> all the optimization decisions, annotates the summaries on each
>> callgraph node, partitions the callgraph and then spawns all the
>> optimizers to work on each section of the callgraph.
>>
> This is what I called "post-IPO" :-)
>

OK, yeah, this was horribly confusing. IPA would probably have been a
better choice of term (or WPA) with the 'A' for analysis and the 'O'
for optimization where we're doing the work.

It makes your entire proposal make some sense now ;)

I'll need to go back and reread it with that in mind.

-eric