[llvm] r189297 - Add new API lto_codegen_compile_parallel().

Sun Sep 15 00:30:20 PDT 2013

Interesting. What's the difference (or your opinion) here between, say, parallelizing codegen/post-ipo passes and splitting the module?
Why go for the second rather than the first?

[Xiaofei] The first one is just what I proposed, at almost the same time as Shuxin proposed his idea. We have merged it into the AOSP/llvm-toolchain project; it speeds up back-end code-gen by 3.5X with 4 threads on our device.
We also tried what Shuxin proposed and found that module partition is not a good solution, since "module partition, binary merge" takes considerable time; we abandoned module partition and turned to function-based parallelism (parallelizing the passes).
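
To make the contrast concrete, here is a minimal Python sketch of the two shapes of work. It is not the AOSP patch and uses no LLVM API: compile_function, codegen_function_parallel and codegen_partitioned are hypothetical names, and the "code generation" is a string stand-in. It only illustrates that function-based parallelism yields one ordered output, while partitioning yields several outputs that still need a merge step.

```python
from concurrent.futures import ThreadPoolExecutor


def compile_function(name):
    # Hypothetical stand-in for running the codegen passes on one function.
    return f"; code for {name}\n"


def codegen_function_parallel(functions, threads=4):
    # Function-based parallelism: one module, one output. Functions are
    # compiled concurrently; pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return "".join(pool.map(compile_function, functions))


def codegen_partitioned(functions, parts=4):
    # Module partition: split the functions into sub-modules and compile
    # each into its own output. A later merge/link step is still needed.
    chunks = [functions[i::parts] for i in range(parts)]
    return ["".join(compile_function(f) for f in chunk)
            for chunk in chunks if chunk]


if __name__ == "__main__":
    funcs = ["a", "b", "c", "d", "e"]
    print(codegen_function_parallel(funcs), end="")
    print(len(codegen_partitioned(funcs, parts=2)), "partition outputs")
```

The sketch says nothing about performance; it only shows where the extra merge step appears in the partitioned variant.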

LLVM back-end compilation time is important to our business (we only care about compilation time without LTO). I look forward to the community coming to an agreement on the final solution for parallelizing the back-end codegen passes; any solution is OK. Meanwhile I will keep my proposal open here; we do hope the community can come to a good solution.

Here I attach the earlier discussion and the code we have merged into AOSP/toolchain.
http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-July/063796.html
http://llvm-reviews.chandlerc.com/D1152
https://android-review.googlesource.com/#/c/62308/ 

Thanks
Wan Xiaofei

-----Original Message-----
From: llvm-commits-bounces at cs.uiuc.edu [mailto:llvm-commits-bounces at cs.uiuc.edu] On Behalf Of Eric Christopher
Sent: Wednesday, September 04, 2013 12:08 AM
To: Shuxin Yang
Cc: llvm-commits at cs.uiuc.edu
Subject: Re: [llvm] r189297 - Add new API lto_codegen_compile_parallel().

On Tue, Aug 27, 2013 at 10:42 AM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
> Reverted in 189386.  Once again, I apologize that I didn't follow the
> canonical procedure.
> I personally thought Nick's proposal was clean enough for our system, and
> took for granted that the community would like it.
>

It's not necessarily bad, but the lto library is a bit funky and perhaps a new lto library is what we need :)

> I will not initiate a discussion for now. I'd like to cool things down 
> for a while. (maybe postpone indefinitely).
>
> As with most infrastructure-related projects, partition is
> unglamorous and painstaking work.
> I stepped forward to take it just because we have almost no way to debug
> or investigate LTO.
>

Absolutely.

> For those who are curious about how much we can speed up by partition:
> unfortunately, I can't tell, as the project is not yet completely done.
> My rudimentary (quite stupid, actually) implementation using the make
> utility speeds up the command "clang++ Xalancbmk/*.o -flto" by 39%
> (35s vs 21s; Xalancbmk has 700+ inputs).  It is a bit of a shame for
> partition. But at the very least, each partition is under human control.
> On the other hand, post-IPO scalar optimization is not yet
> parallelized in my rudimentary implementation (i.e. so far only the
> codegen part is parallelized). Surprisingly, the result is very
> consistent with what Xiaofei achieved via multi-threaded code-gen.  As
> far as I can recall, he got a speedup of some 2.9x. In my case, it takes
> about 13s before code-gen starts,
> meaning the speedup of the code-gen is about (35-13)/(21-13) = 2.75x
> (code-gen plus the linker's post-processing takes 35-13 = 22s serially).
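
The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a worked example using the timings quoted in the message (35s serial, 21s parallel, about 13s before code-gen starts); the function name codegen_speedup is made up for illustration.

```python
def codegen_speedup(total_serial, total_parallel, pre_codegen):
    # Subtract the part that runs before code-gen (not parallelized here)
    # and compare only the remaining code-gen + linker time.
    serial_codegen = total_serial - pre_codegen
    parallel_codegen = total_parallel - pre_codegen
    return serial_codegen / parallel_codegen


overall = 35 / 21                            # whole-command speedup
codegen_only = codegen_speedup(35, 21, 13)   # (35-13)/(21-13)

print(f"{overall:.2f}x overall, {codegen_only:.2f}x codegen")
# prints "1.67x overall, 2.75x codegen"
```

The gap between the two figures comes from the roughly 13s serial portion, which bounds the overall gain however fast code-gen becomes.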
>

Interesting. What's the difference (or your opinion) here between, say, parallelizing codegen/post-ipo passes and splitting the module?
Why go for the second rather than the first?

-eric

>
> On 8/27/13 12:27 AM, Shuxin Yang wrote:
>
> On 8/26/13 11:19 PM, Chandler Carruth wrote:
>
> On Mon, Aug 26, 2013 at 5:53 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>>
>> We certainly need a way to feed multiple resulting objects back to the linker.
>> There are a couple of ways to this end:
>>
>>    1) "ld -r all-resulting-obj-on-disk -o result.o", and feed the
>>       single object file (i.e. result.o) back to the linker
>>
>>    2) keep the resulting objects in memory buffers, and feed the
>>       buffers back to the linker (as proposed by Nick)
>>
>>    3) as with GNU gold, save the resulting objects on disk, and
>>       feed these disk files back to the linker one by one.
>>
>>     I'm big linker nut. I don't know which way works better.  I try to
>> use 1) as a workaround for the time being, before 2) is available.
>> People at Apple disagree with my engineering approach.
>>
>>     From the compiler's perspective,
>>     o. 1) is not just a workaround; 3) is certainly better than 1).
>>     o. 2) will win if the program being compiled is small- or
>>        medium-sized.
>>        With huge programs, it will be difficult for the compiler to
>>        decide when and how to "spill" some stuff
>>        from memory to disk.  Folks at Apple iterate and reiterate that we
>>        only consider the case where the entire
>>        program can be loaded in memory. So, the added difficulty for the
>>        compiler does not seem to be a
>>        problem for the workload we care about.
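
As a small illustration of option 1) above, the following Python sketch only builds the "ld -r" command line that would merge several partition objects into a single relocatable object; it does not run the linker. The helper name relocatable_link_command and the file names p1.o, p2.o, p3.o are placeholders, not part of the patch under discussion.

```python
import shlex


def relocatable_link_command(objects, output="result.o", linker="ld"):
    # Option 1: combine the partition objects into one relocatable
    # object with "ld -r", so the linker sees a single file again.
    if not objects:
        raise ValueError("need at least one object file")
    return [linker, "-r", *objects, "-o", output]


cmd = relocatable_link_command(["p1.o", "p2.o", "p3.o"])
print(shlex.join(cmd))
# prints "ld -r p1.o p2.o p3.o -o result.o"
```

Options 2) and 3) skip this extra merge and hand the objects to the linker directly, from memory buffers or from disk.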
>
>
> Shuxin, I'm not sure what you're trying to accomplish here, but I 
> don't think this is the right approach.
>
> First, you seem to be pursuing a partitioning scheme for parallelizing 
> LTO work despite *no* consensus that this is the correct approach
>
> I sent a proposal a long time ago; as far as I can tell from the
> mailing list, there was no objection at all.
> Actually, my approach is not new at all. It is almost a "std" way
> to perform partition. It looks similar to all the LTOs I have worked or played with before.
> It just needs some LLVM flavor.  But this change has nothing to do with the
> partition implementation; it just adds an interface.
>
> in any of the community discussions I can find. Please don't commit 
> code toward a design that the community has expressed serious 
> reservations about without review.
>
> Second, you are committing a new API to the set of the stable C APIs 
> that libLTO exposes without a thorough discussion on the mailing list.
>
> Sorry, I thought this was pretty much an Apple thing, as no other system uses
> this API.
> I will revert tomorrow, and initiate a discussion.
>
> The APIs are more or less divided into two classes: one for Unix + gold, the
> other for OSX + Apple LD.
> I don't like the way it is, and I don't like such APIs at all (I
> mean all of them).
>  I used to argue we would be better off having a symbol-related interface
> instead of an LTO-related API.
>  But the community does not buy my point.  As I have little knowledge
> about LLVM, I have to keep an open mind and adapt to LLVM thinking,
> but it certainly takes some time.
>
> It is possible I have missed this discussion, but I did look and 
> failed to find anything that seems to resemble a review, much less an 
> LGTM. If I have missed it, I apologize and please direct me at the 
> thread. I bring this up because the specific interface seems surprising to me.
>
> Third, you are justifying the particular approach with a deflection to 
> some discussion within Apple or with those developers you work with at Apple.
> While this may in fact be the motivation for this patch, the open 
> source community is often not party to these discussions. ;]
>
> That is true:-)
>
> It would help us if you would just give the specific basis rather than 
> referencing a discussion that we weren't involved with. As it happens, 
> I suspect I agree with these "Folks in Apple" that it is useful to 
> specifically optimize for the case that an entire program fits into 
> memory, bypassing the filesystem.
>
> You bet!
>
> I debated with them. No chance to win. Why don't you suspect in the
> first place:-).
> But "folks in Apple" argue that is the plan for the future.  It does not
> seem to be a pretty lame argument, as the current implementation of LTO brings
> everything into memory.
>
> However, there are many paths to that end result. From the little
> information in the commit log there isn't really enough to tell why 
> *this* is the necessary path forward (in fact, I'm somewhat confident it isn't).
>
> In concept, there is only one alternative: compile the merged
> module into multiple objects, and feed the objects back to the linker.
>
>
>
>
> So, to get back to Eric's original question: what is the motivation
> for this API, its expected actual usage, and the reason why it is
> important to stub out in this way now?
>
> The motivation is: the existing LTO compiles the merged module into a
> *single* object;
>   with this new API, it becomes possible to compile the merged module into
> *multiple* objects.
>   I'm wondering if this is clear now.
>
>    For instance, suppose the command line is "clang -flto a.o b.bc c.o d.bc"
> (*.o are real objects, and *.bc are bitcode).
>   The existing LTO will merge b.bc and d.bc into t.bc (the merged module),
> compile the merged t.bc into t.o, and feed t.o back to the
> linker, which combines a.o, c.o and t.o into a.out.
>
>    The new API will make the compiler convert t.bc into p1.o,
> p2.o, ..., and feed these p*.o back to the linker, which
>   combines a.o, c.o and the p*.o into a.out.
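
The example above can be traced with a short Python sketch that computes which files reach the final link in each scheme. It is only an illustration: plan_link is a hypothetical helper, the file names t.o, p1.o, p2.o follow the message, and it assumes the final link with the new API includes the p*.o files alongside a.o and c.o.

```python
def plan_link(inputs, partitions=2):
    # Split the command-line inputs into real objects and bitcode files.
    objects = [f for f in inputs if f.endswith(".o")]
    bitcode = [f for f in inputs if f.endswith(".bc")]

    # Existing LTO: all bitcode is merged into t.bc and compiled to one t.o.
    single = objects + (["t.o"] if bitcode else [])

    # Proposed API: the merged module is compiled into several objects.
    parts = [f"p{i}.o" for i in range(1, partitions + 1)] if bitcode else []
    multi = objects + parts
    return single, multi


single, multi = plan_link(["a.o", "b.bc", "c.o", "d.bc"])
print(single)  # ['a.o', 'c.o', 't.o']
print(multi)   # ['a.o', 'c.o', 'p1.o', 'p2.o']
```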
>
>
>
>
>
>
>
> Better yet, could we have that discussion before growing the set of 
> stable APIs that we claim to never regress?
>
>
> Sure. Sorry about that. I actually don't want to touch the lto_xxx()
> API for now.  I just wanted to work around the limitation of
> the linker, and wait for the new ld. But Bob didn't buy my argument:-).
>
>
>
_______________________________________________
llvm-commits mailing list
llvm-commits at cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits