[llvm] r189297 - Add new API lto_codegen_compile_parallel().

Fri Sep 20 19:30:34 PDT 2013

Steve:

One more thing I need to clarify: this patch is not only for Android but for LLVM itself; it just happens to improve the Android use case. The patch passed many test suites, and the Android toolchain is only a small part of them. This is not the proper place to discuss the Android project, though technical input is always welcome anytime and anywhere.

After rounds of discussion, the community will eventually arrive at the best solution for improving code generation speed (e.g. Shuxin's proposal also sounds very good in theory).

Thanks
Wan Xiaofei

-----Original Message-----
From: Eric Christopher [mailto:echristo at gmail.com] 
Sent: Saturday, September 21, 2013 12:52 AM
To: Wan, Xiaofei
Cc: Stephen Hines; Shuxin Yang; Chandler Carruth; llvm-commits at cs.uiuc.edu
Subject: Re: [llvm] r189297 - Add new API lto_codegen_compile_parallel().

FWIW I am (and have mentioned a few times) in favor of the parallel passes approach and it's great to have some data from people that have tried that approach.

Thanks Xiaofei!

-eric

On Thu, Sep 19, 2013 at 6:38 PM, Wan, Xiaofei <xiaofei.wan at intel.com> wrote:
> Steve:
>
>
>
> Sorry for causing a misunderstanding here; I am not trying to convince 
> the community to accept this patch. As you said, this is just an 
> experimental project; verifying it with only small test coverage is 
> not enough.
>
>
>
> This is just for discussion, to show that parallelizing passes is 
> feasible, since Shuxin proposed another solution. I think the 
> community will work out the most suitable solution to improve code 
> generation, so I don't mind which patch ends up upstream.
>
>
>
> Thanks
> Wan Xiaofei
>
>
>
> From: Stephen Hines [mailto:srhines at google.com]
> Sent: Friday, September 20, 2013 8:47 AM
> To: Wan, Xiaofei
> Cc: Eric Christopher; Shuxin Yang; Chandler Carruth; 
> llvm-commits at cs.uiuc.edu
>
>
> Subject: Re: [llvm] r189297 - Add new API lto_codegen_compile_parallel().
>
>
>
> Although this was merged into an AOSP project, I want to make it clear 
> that this is *NOT* the official LLVM toolchain for Android (and thus 
> does not constitute endorsement of this patch). That repository is an 
> experimental branch for a 20% project at Google that wanted to try out 
> the patch. Please do not use unofficial sources to try to convince the 
> LLVM community that your patch has been accepted/verified by Android.
>
>
>
> We will continue to only accept upstream patches for rebasing our 
> Android LLVM sources. When this patch or something different gets 
> accepted as the proper way to improve code generation performance, we 
> will be using the same patch as upstream.
>
>
>
> Thanks,
>
> Steve
>
>
>
> On Sun, Sep 15, 2013 at 12:30 AM, Wan, Xiaofei <xiaofei.wan at intel.com>
> wrote:
>
> Interesting. What's the difference (or your opinion) here between, 
> say, parallelizing codegen/post-ipo passes and splitting the module?
> Why go for the second rather than the first?
>
> [Xiaofei] The first one is just what I have proposed, at almost the 
> same time as Shuxin proposed his idea; we have merged it into the 
> AOSP/llvm-toolchain project; it speeds up back-end code-gen by 
> 3.5X with 4 threads on our device.
> We also tried what Shuxin proposed and found that module partitioning 
> is not a good solution, since the "module partition, binary merge" 
> steps take considerable time; we abandoned module partitioning and 
> turned to function-based parallelism (parallelizing passes).
>
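For a rough sense of what the quoted "3.5X for 4 threads" implies, here is a back-of-the-envelope sketch in Python. It assumes Amdahl's law applies cleanly (no threading overhead), which nobody in the thread claims; the function name is hypothetical and used only for illustration.

```python
def amdahl_parallel_fraction(speedup, threads):
    """Solve Amdahl's law, speedup = 1 / ((1 - p) + p / threads), for p."""
    return (1 - 1 / speedup) / (1 - 1 / threads)

# Figures quoted in the message above: 3.5X with 4 threads.
reported_speedup = 3.5
threads = 4

# Fraction of the ideal 4X that was actually achieved.
efficiency = reported_speedup / threads

# Share of back-end time that would have to run in parallel
# for Amdahl's law to yield the reported speedup.
parallel_fraction = amdahl_parallel_fraction(reported_speedup, threads)
serial_fraction = 1 - parallel_fraction

print(f"efficiency:        {efficiency:.3f}")
print(f"parallel fraction: {parallel_fraction:.3f}")
print(f"serial fraction:   {serial_fraction:.3f}")
```

Under that assumption, roughly 95% of the measured back-end time would be running in parallel, with about 5% remaining serial.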
> LLVM back-end compilation time is important to our business (we only 
> care about compilation time without LTO); I look forward to the 
> community reaching agreement on a final solution to parallelize 
> the back-end codegen passes; any solution is OK. Meanwhile I will keep 
> my proposal open here; we do hope the community can arrive at a good solution.
>
> Here are links to the earlier discussion and the code we have merged 
> into AOSP/toolchain.
> http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-July/063796.html
> http://llvm-reviews.chandlerc.com/D1152
> https://android-review.googlesource.com/#/c/62308/
>
> Thanks
> Wan Xiaofei
>
>
>
> -----Original Message-----
> From: llvm-commits-bounces at cs.uiuc.edu 
> [mailto:llvm-commits-bounces at cs.uiuc.edu] On Behalf Of Eric 
> Christopher
> Sent: Wednesday, September 04, 2013 12:08 AM
> To: Shuxin Yang
> Cc: llvm-commits at cs.uiuc.edu
> Subject: Re: [llvm] r189297 - Add new API lto_codegen_compile_parallel().
>
> On Tue, Aug 27, 2013 at 10:42 AM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>> Reverted in r189386.  Once again, I apologize for not following the 
>> canonical procedure.
>> I personally thought Nick's proposal was clean enough for our system, 
>> and took for granted that the community would like it.
>>
>
> It's not necessarily bad, but the lto library is a bit funky and 
> perhaps a new lto library is what we need :)
>
>> I will not initiate a discussion for now. I'd like to cool things 
>> down for a while. (maybe postpone indefinitely).
>>
>> As with most infrastructure-related projects, partitioning is 
>> unglamorous and painstaking work.
>> I stepped forward to take it on just because we have almost no way to 
>> debug or investigate LTO.
>>
>
> Absolutely.
>
>> For those who are curious about how much we can speed up by 
>> partitioning: unfortunately, I can't tell, as the project is not yet 
>> completely done. My rudimentary (quite stupid, actually) 
>> implementation using the make utility speeds up the command 
>> "clang++ Xalancbmk/*.o -flto" by 39% (35s vs 21s; Xalancbmk has 700+ 
>> inputs).  It is a bit of a shame for partition. But at the very 
>> least, each partition is under human control.
>> On the other hand, post-IPO scalar optimization is not yet 
>> parallelized in my rudimentary implementation (i.e. so far only 
>> the codegen part is parallelized). Surprisingly, the result is very 
>> consistent with what Xiaofei achieved via multi-threaded code-gen.  
>> As far as I can recall, he got a speedup of some 2.9x. In my case, it 
>> takes about 13s before code-gen starts,
>> meaning the speedup of code-gen is about (35-13)/(21-13) = 2.75x.
>> (Code-gen plus the linker's post-processing takes 35-13s.)
>>
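The arithmetic in the paragraph above can be checked with a short Python sketch. It uses only the rounded timings quoted there (35s, 21s, and 13s before code-gen starts); the variable names are illustrative.

```python
# Timings quoted in the message (seconds, rounded).
total_before = 35.0   # whole link, serial code-gen
total_after = 21.0    # whole link, partitioned/parallel code-gen
pre_codegen = 13.0    # time spent before code-gen starts (not parallelized)

# Reduction in wall-clock time for the whole command.
time_saved = (total_before - total_after) / total_before

# Speedup of the whole command, and of the code-gen phase alone.
overall_speedup = total_before / total_after
codegen_speedup = (total_before - pre_codegen) / (total_after - pre_codegen)

print(f"time saved:      {time_saved:.0%}")
print(f"overall speedup: {overall_speedup:.2f}x")
print(f"codegen speedup: {codegen_speedup:.2f}x")
```

With these rounded inputs the wall-clock reduction comes out at 40%, close to the 39% reported, which was presumably computed from unrounded timings. The code-gen figure matches the 2.75x given in the message.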
>
> Interesting. What's the difference (or your opinion) here between, 
> say, parallelizing codegen/post-ipo passes and splitting the module?
> Why go for the second rather than the first?
>
> -eric
>
>>
>> On 8/27/13 12:27 AM, Shuxin Yang wrote:
>>
>> On 8/26/13 11:19 PM, Chandler Carruth wrote:
>>
>> On Mon, Aug 26, 2013 at 5:53 PM, Shuxin Yang <shuxin.llvm at gmail.com>
>> wrote:
>>>
>>> We certainly need a way to feed multiple resulting objects back to 
>>> the linker. There are a couple of ways to do this:
>>>
>>>    1) "ld -r all-resulting-obj-on-disk -o result.o", and feed the 
>>>       single object file (i.e. result.o) back to the linker
>>>
>>>    2) keep the resulting objects in memory buffers, and feed the 
>>>       buffers back to the linker (as proposed by Nick)
>>>
>>>    3) As with GNU gold, save the resulting objects on disk, and 
>>>       feed these disk files back to the linker one by one.
>>>
>>>    I'm big linker nut. I don't know which way works better.  I try 
>>> to use 1) as a workaround for the time being, before 2) is 
>>> available. People at Apple disagree with my engineering approach.
>>>
>>>    From the compiler's perspective,
>>>    o. 1) is not just a workaround, 3) is certainly better than 1).
>>>    o. 2) will win if the program being compiled is small- or 
>>>       medium-sized.
>>>       With huge programs, it will be difficult for the compiler to 
>>>       decide when and how to "spill" some stuff from memory to disk.
>>>       Folks at Apple iterate and reiterate that we only consider the 
>>>       case where the entire program can be loaded in memory. So the 
>>>       added difficulty for the compiler does not seem to be a 
>>>       problem for the workload we care about.
>>
>>
>> Shuxin, I'm not sure what you're trying to accomplish here, but I 
>> don't think this is the right approach.
>>
>> First, you seem to be pursuing a partitioning scheme for 
>> parallelizing LTO work despite *no* consensus that this is the 
>> correct approach
>>
>> I sent a proposal a long time ago; as far as I can understand from 
>> the mailing list, there was no objection at all.
>> Actually, my approach is not new at all. It is almost a "std" way 
>> to perform partitioning. It looks similar to all the LTOs I have 
>> worked/played with before. It just needs some LLVM flavor.  But this 
>> change has nothing to do with the partition implementation; it just 
>> adds an interface.
>>
>> in any of the community discussions I can find. Please don't commit 
>> code toward a design that the community has expressed serious 
>> reservations about without review.
>>
>> Second, you are committing a new API to the set of the stable C APIs 
>> that libLTO exposes without a thorough discussion on the mailing list.
>>
>> Sorry, I thought this was pretty much an Apple thing, as no other 
>> system uses this API.
>> I will revert tomorrow, and initiate a discussion.
>>
>> The APIs are almost divided into two classes: one for Unix+gold, the 
>> other for OSX + Apple LD.
>> I don't like the way it is, and I don't like such APIs at all (I 
>> mean all of them).
>>  I used to argue that we are better off having a symbol-related 
>> interface instead of an LTO-related API.
>>  But the community does not buy my point.  As I have little knowledge 
>> of LLVM, I have to keep an open mind and adapt to LLVM-thinking, 
>> but it certainly takes some time.
>>
>> It is possible I have missed this discussion, but I did look and 
>> failed to find anything that seems to resemble a review, much less an 
>> LGTM. If I have missed it, I apologize and please direct me at the 
>> thread. I bring this up because the specific interface seems 
>> surprising to me.
>>
>> Third, you are justifying the particular approach with a deflection 
>> to some discussion within Apple or with those developers you work 
>> with at Apple.
>> While this may in fact be the motivation for this patch, the open 
>> source community is often not party to these discussions. ;]
>>
>> That is true:-)
>>
>> It would help us if you would just give the specific basis rather 
>> than referencing a discussion that we weren't involved with. As it 
>> happens, I suspect I agree with these "Folks in Apple" that it is 
>> useful to specifically optimize for the case that an entire program 
>> fits into memory, bypassing the filesystem.
>>
>> You bet!
>>
>> I debated with them. No chance to win. Why don't you suspect in the 
>> first place:-).
>> But "folks in Apple" argue that is the plan for the future.  It does 
>> not seem to be a pretty lame argument, as the current implementation 
>> of LTO brings everything into memory.
>>
>> | However, there are many paths to that end result. From the little
>> information in the commit log there isn't really enough to tell why
>> *this* is the necessary path forward (in fact, I'm somewhat confident 
>> it isn't).
>>
>> In concept, there is only one alternative: compile the merged 
>> module into multiple objects, and feed the objects back to the linker.
>>
>>
>>
>>
>> So, to get back to Eric's original question: what is the motivation 
>> for this API, its expected actual usage, and the reason why it is 
>> important to stub out in this way now?
>>
>> The motivation is: the existing LTO compiles the merged module into a
>> *single* object;
>>   this new API makes it possible to compile the merged module into
>> *multiple* objects.
>>   I'm wondering if this is clear now.
>>
>>    For instance, suppose the command line is "clang -flto a.o b.bc 
>> c.o d.bc"
>> (*.o are real objects, and *.bc are bitcode).
>>   Existing LTO will merge b.bc and d.bc into t.bc (the merged module), 
>> compile the merged t.bc into t.o, and feed t.o back to the 
>> linker, which combines a.o, c.o and t.o into a.out.
>>
>>    The new API will trigger the compiler to convert t.bc into p1.o, 
>> p2.o, ..., and feed these p*.o back to the linker, which
>>   combines a.o, c.o and the p*.o into a.out.
>>
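The single-object versus multiple-object flow described above can be sketched as a toy Python model. This is not libLTO code: the round-robin partition policy, the function names, and the "function" lists are all hypothetical stand-ins, and codegen() merely pretends to emit an object file.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(functions, n_parts):
    """Split function names round-robin into n_parts groups (hypothetical policy)."""
    return [functions[i::n_parts] for i in range(n_parts)]

def codegen(index, part):
    """Stand-in for real code generation: returns an object name and its contents."""
    return f"p{index}.o", sorted(part)

def compile_parallel(merged_functions, n_parts):
    """Compile each non-empty partition of the merged module on its own worker."""
    parts = [p for p in partition(merged_functions, n_parts) if p]
    with ThreadPoolExecutor(max_workers=max(1, len(parts))) as pool:
        return list(pool.map(codegen, range(1, len(parts) + 1), parts))

# Pretend contents of the merged module t.bc (from b.bc and d.bc).
merged = ["b_main", "b_helper", "d_init", "d_run"]

# Existing flow: one merged module -> one object (t.o).
single_link_inputs = ["a.o", "c.o", "t.o"]

# Flow enabled by the new API: one merged module -> several objects (p1.o, p2.o, ...).
objects = compile_parallel(merged, 2)
link_inputs = ["a.o", "c.o"] + [name for name, _ in objects]

print(link_inputs)
```

In both flows the linker still sees the real objects a.o and c.o; the only difference is whether the bitcode side arrives as one object or several.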
>>
>>
>>
>>
>>
>>
>> Better yet, could we have that discussion before growing the set of 
>> stable APIs that we claim to never regress?
>>
>>
>> Sure. Sorry about that. I actually don't want to touch the lto_xxx() 
>> API for now.  I just want to work around the limitation in 
>> the linker, and wait for the new ld. But Bob didn't buy my argument:-).
>>
>>
>>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits