[llvm] r189297 - Add new API lto_codegen_compile_parallel().

Eric Christopher echristo at gmail.com
Mon Oct 7 10:52:34 PDT 2013


Sorry for the delay in coming back to this, but do you think it'd be
possible to get this patch cleaned up for inclusion into LLVM itself?
Even if we're just going to look at the various different options, it'd
be nice to have this one available as a possibility.

Thanks!

-eric

On Fri, Sep 20, 2013 at 7:30 PM, Wan, Xiaofei <xiaofei.wan at intel.com> wrote:
> Steve:
>
> One more thing I need to clarify: this patch is not only for Android but for LLVM itself; it just happens to improve the Android use case. This patch passed lots of test suites, and Android/toolchain is only a small part of them. This is not the proper place to discuss the Android project, but of course technical input is always welcome anytime and anywhere.
>
> After several rounds of discussion, the community will eventually arrive at the best solution for improving code generation speed (e.g. Shuxin's proposal also sounds very good in theory).
>
> Thanks
> Wan Xiaofei
>
> -----Original Message-----
> From: Eric Christopher [mailto:echristo at gmail.com]
> Sent: Saturday, September 21, 2013 12:52 AM
> To: Wan, Xiaofei
> Cc: Stephen Hines; Shuxin Yang; Chandler Carruth; llvm-commits at cs.uiuc.edu
> Subject: Re: [llvm] r189297 - Add new API lto_codegen_compile_parallel().
>
> FWIW I am (and have mentioned a few times that I am) in favor of the parallel-passes approach, and it's great to have some data from people who have tried that approach.
>
> Thanks Xiaofei!
>
> -eric
>
> On Thu, Sep 19, 2013 at 6:38 PM, Wan, Xiaofei <xiaofei.wan at intel.com> wrote:
>> Steve:
>>
>>
>>
>> Sorry for causing a misunderstanding here; I am not trying to convince
>> the community to accept this patch. As you said, this is just an
>> experimental project; being verified only with small test coverage is
>> not enough.
>>
>>
>>
>> This is just for discussion, to demonstrate the feasibility of pass
>> parallelism, since Shuxin has proposed another solution. I think the
>> community will work out the most appropriate solution for improving
>> code generation, so I don't mind which patch ends up upstream.
>>
>>
>>
>> Thanks
>> Wan Xiaofei
>>
>>
>>
>> From: Stephen Hines [mailto:srhines at google.com]
>> Sent: Friday, September 20, 2013 8:47 AM
>> To: Wan, Xiaofei
>> Cc: Eric Christopher; Shuxin Yang; Chandler Carruth;
>> llvm-commits at cs.uiuc.edu
>>
>>
>> Subject: Re: [llvm] r189297 - Add new API lto_codegen_compile_parallel().
>>
>>
>>
>> Although this was merged into an AOSP project, I want to make it clear
>> that this is *NOT* the official LLVM toolchain for Android (and thus
>> does not constitute endorsement of this patch). That repository is an
>> experimental branch for a 20% project at Google that wanted to try out
>> the patch. Please do not use unofficial sources to try to convince the
>> LLVM community that your patch has been accepted/verified by Android.
>>
>>
>>
>> We will continue to only accept upstream patches for rebasing our
>> Android LLVM sources. When this patch or something different gets
>> accepted as the proper way to improve code generation performance, we
>> will be using the same patch as upstream.
>>
>>
>>
>> Thanks,
>>
>> Steve
>>
>>
>>
>> On Sun, Sep 15, 2013 at 12:30 AM, Wan, Xiaofei <xiaofei.wan at intel.com>
>> wrote:
>>
>> Interesting. What's the difference (or your opinion) here between,
>> say, parallelizing codegen/post-ipo passes and splitting the module?
>> Why go for the second rather than the first?
>>
>> [Xiaofei] The first one is just what I have proposed, almost at the
>> same time as Shuxin proposed his idea; we have merged it into the
>> AOSP/llvm-toolchain project, and it improves back-end code generation
>> by 3.5x with 4 threads on our device.
>> We tried what Shuxin proposed and found that module partitioning is not
>> a good solution, since the "module partition, binary merge" steps take
>> quite a lot of time; we abandoned module partitioning and turned to
>> function-based parallelism (parallelizing the passes).
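
For illustration only, here is a minimal conceptual sketch of that
function-based approach (this is not the actual AOSP patch; the list of
function names and the doCodeGen callback are hypothetical stand-ins for
LLVM's real codegen pipeline):

    // Conceptual sketch: split a module's functions into per-thread chunks and
    // run a codegen callback on each chunk concurrently. The real patch
    // parallelizes LLVM's codegen passes; doCodeGen() stands in for that work.
    #include <cstddef>
    #include <functional>
    #include <string>
    #include <thread>
    #include <vector>

    static void compileFunctionsInParallel(
        const std::vector<std::string> &functions, unsigned numThreads,
        const std::function<void(const std::string &)> &doCodeGen) {
      std::vector<std::thread> workers;
      for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
          // Static round-robin partition: thread t handles functions t, t+N, ...
          for (std::size_t i = t; i < functions.size(); i += numThreads)
            doCodeGen(functions[i]);
        });
      }
      for (std::thread &w : workers)
        w.join(); // all per-function codegen must finish before objects are emitted
    }

The 3.5x figure quoted above comes from the real patch running with 4
threads, not from this sketch.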
>>
>> LLVM back-end compilation time is important to our business (we only
>> care about compilation time without LTO). I am hoping the community
>> can come to agreement on the final solution for parallelizing the
>> back-end codegen passes; any solution is OK. Meanwhile I will keep my
>> proposal open here; we do hope the community can arrive at a good solution.
>>
>> Here I attach the earlier discussion and the code we have merged into
>> AOSP/toolchain.
>> http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-July/063796.html
>> http://llvm-reviews.chandlerc.com/D1152
>> https://android-review.googlesource.com/#/c/62308/
>>
>> Thanks
>> Wan Xiaofei
>>
>>
>>
>> -----Original Message-----
>> From: llvm-commits-bounces at cs.uiuc.edu
>> [mailto:llvm-commits-bounces at cs.uiuc.edu] On Behalf Of Eric
>> Christopher
>> Sent: Wednesday, September 04, 2013 12:08 AM
>> To: Shuxin Yang
>> Cc: llvm-commits at cs.uiuc.edu
>> Subject: Re: [llvm] r189297 - Add new API lto_codegen_compile_parallel().
>>
>> On Tue, Aug 27, 2013 at 10:42 AM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>>> Reverted in r189386. Once again, I apologize for not following the
>>> canonical procedure.
>>> I personally think Nick's proposal is clean enough for our system,
>>> and took it for granted that the community would like it.
>>>
>>
>> It's not necessarily bad, but the lto library is a bit funky and
>> perhaps a new lto library is what we need :)
>>
>>> I will not initiate a discussion for now. I'd like to cool things
>>> down for a while (maybe postponing it indefinitely).
>>>
>>> As with most infrastructure-related projects, partitioning is
>>> unglamorous, painstaking work.
>>> I stepped forward to take it on only because we have almost no way to
>>> debug or investigate LTO.
>>>
>>
>> Absolutely.
>>
>>> For those who are curious about how much we can speed things up by
>>> partitioning: unfortunately, I can't tell yet, as the project is not
>>> completely done. My rudimentary (quite stupid, actually) implementation,
>>> which drives the partitions with the make utility, speeds up the command
>>> "clang++ Xalancbmk/*.o -flto"
>>> by 39% (35s vs. 21s; Xalancbmk has 700+ inputs). That is a bit of a
>>> shame for partitioning, but at the very least each partition is under
>>> human control.
>>> On the other hand, post-IPO scalar optimization is not yet parallelized
>>> in my rudimentary implementation (i.e. so far only the codegen part is
>>> parallelized). Surprisingly, the result is very consistent with what
>>> Xiaofei achieved via multi-threaded code-gen.
>>> As far as I can recall, his speedup was about 2.9x. In my case, it takes
>>> about 13s before code-gen starts.
>>> That means the speedup of the code-gen itself is about (35-13)/(21-13) = 2.75x.
>>> (Code-gen plus the linker's post-processing take the remaining 35-13 = 22s
>>> in the serial case.)
>>>
>>
>> Interesting. What's the difference (or your opinion) here between,
>> say, parallelizing codegen/post-ipo passes and splitting the module?
>> Why go for the second rather than the first?
>>
>> -eric
>>
>>>
>>> On 8/27/13 12:27 AM, Shuxin Yang wrote:
>>>
>>> On 8/26/13 11:19 PM, Chandler Carruth wrote:
>>>
>>> On Mon, Aug 26, 2013 at 5:53 PM, Shuxin Yang <shuxin.llvm at gmail.com>
>>> wrote:
>>>>
>>>> We certainly need a way to feed multiple resulting objects back to the
>>>> linker. There are a couple of ways to this end:
>>>>
>>>>    1) "ld -r all-resulting-objs-on-disk -o result.o", and feed the
>>>>       single object file (i.e. result.o) back to the linker;
>>>>
>>>>    2) keep the resulting objects in memory buffers, and feed those
>>>>       buffers back to the linker (as proposed by Nick);
>>>>
>>>>    3) as with GNU gold, save the resulting objects on disk, and feed
>>>>       these disk files back to the linker one by one.
>>>>
>>>>    I'm no great linker expert, so I don't know which way works better.
>>>> I am trying to use 1) as a workaround for the time being, before 2) is
>>>> available. People at Apple disagree with my engineering approach.
>>>>
>>>>    From the compiler's perspective:
>>>>    o. 1) is not just a workaround, but 3) is certainly better than 1).
>>>>    o. 2) will win if the program being compiled is small or medium-sized.
>>>>       With huge programs, it will be difficult for the compiler to decide
>>>>       when and how to "spill" some of this data from memory to disk.
>>>>       Folks at Apple reiterate that we only consider the case where the
>>>>       entire program can be loaded into memory, so the added difficulty
>>>>       for the compiler does not seem to be a problem for the workload we
>>>>       care about.
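
For illustration, a minimal sketch of option 1), assuming hypothetical
per-partition outputs named p1.o and p2.o: they are merged into a single
relocatable object with "ld -r", and only result.o is handed to the final
link:

    // Sketch of option 1): combine hypothetical per-partition objects into a
    // single relocatable object so the outer link still sees one extra input.
    #include <cstdlib>
    #include <string>
    #include <vector>

    int main() {
      // Made-up names for the per-partition codegen outputs.
      const std::vector<std::string> partitionObjs = {"p1.o", "p2.o"};

      std::string cmd = "ld -r";   // -r: emit a relocatable object
      for (const std::string &obj : partitionObjs)
        cmd += " " + obj;
      cmd += " -o result.o";       // result.o is then fed to the final link

      return std::system(cmd.c_str()) == 0 ? 0 : 1;
    }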
>>>
>>>
>>> Shuxin, I'm not sure what you're trying to accomplish here, but I
>>> don't think this is the right approach.
>>>
>>> First, you seem to be pursuing a partitioning scheme for
>>> parallelizing LTO work despite *no* consensus that this is the
>>> correct approach
>>>
>>> I sent a proposal a long time ago, and as far as I can tell from the
>>> mailing list there was no objection at all.
>>> Actually, my approach is not new at all. It is almost the standard way
>>> to perform partitioning, and it looks similar to all the LTO
>>> implementations I have worked with or played with before. It just needs
>>> some LLVM flavor. But this change has nothing to do with the partition
>>> implementation; it just adds an interface.
>>>
>>> in any of the community discussions I can find. Please don't commit
>>> code toward a design that the community has expressed serious
>>> reservations about without review.
>>>
>>> Second, you are committing a new API to the set of the stable C APIs
>>> that libLTO exposes without a thorough discussion on the mailing list.
>>>
>>> Sorry, I thought this was a pretty Apple-specific thing, as no other
>>> system uses this API.
>>> I will revert tomorrow and initiate a discussion.
>>>
>>> The APIs are roughly divided into two classes: one for Unix + gold, the
>>> other for OS X + Apple's ld.
>>> I don't like the way it is, and I don't like such APIs at all (I mean
>>> all of them).
>>> I used to argue that we would be better off having a symbol-related
>>> interface instead of an LTO-related API, but the community did not buy
>>> my point. As I have little knowledge of LLVM, I have to keep an open
>>> mind and adapt to LLVM thinking, but that certainly takes some time.
>>>
>>> It is possible I have missed this discussion, but I did look and
>>> failed to find anything that seems to resemble a review, much less an
>>> LGTM. If I have missed it, I apologize; please direct me to the
>>> thread. I bring this up because the specific interface seems
>>> surprising to me.
>>>
>>> Third, you are justifying the particular approach with a deflection
>>> to some discussion within Apple or with those developers you work
>>> with at Apple.
>>> While this may in fact be the motivation for this patch, the open
>>> source community is often not party to these discussions. ;]
>>>
>>> That is true:-)
>>>
>>> It would help us if you would just give the specific basis rather
>>> than referencing a discussion that we weren't involved with. As it
>>> happens, I suspect I agree with these "Folks in Apple" that it is
>>> useful to specifically optimize for the case that an entire program
>>> fits into memory, bypassing the filesystem.
>>>
>>> You bet!
>>>
>>> I debated with them; there was no chance to win. Why didn't you suspect
>>> so in the first place :-)
>>> But the "folks at Apple" argue that that is the plan for the future. It
>>> does not seem to be a lame argument, as the current implementation of
>>> LTO brings everything into memory.
>>>
>>> | However, there are many paths to that end result. From the little
>>> information in the commit log there isn't really enough to tell why
>>> *this* is the necessary path forward (in fact, I'm somewhat confident
>>> it isn't).
>>>
>>> In concept, there is only one alternative: compile the merged module
>>> into multiple objects, and feed those objects back to the linker.
>>>
>>>
>>>
>>>
>>> So, to get back to Eric's original question: what is the motivation
>>> for this API, its expected actual usage, and the reason why it is
>>> important to stub out in this way now?
>>>
>>> The motivation is: the existing LTO compiles the merged module into a
>>> *single* object; with this new API, it becomes possible to compile the
>>> merged module into *multiple* objects.
>>> I hope this is clear now.
>>>
>>> For instance, suppose the command line is "clang -flto a.o b.bc c.o d.bc"
>>> (the *.o are real objects and the *.bc are bitcode).
>>> The existing LTO will merge b.bc and d.bc into t.bc (the merged module),
>>> compile the merged t.bc into t.o, and feed t.o back to the linker, which
>>> combines a.o, c.o and t.o into a.out.
>>>
>>> The new API will instead have the compiler compile t.bc into p1.o,
>>> p2.o, ..., and feed these p*.o back to the linker, which combines a.o,
>>> c.o and the p*.o into a.out.
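
For illustration, the existing single-object path through the libLTO C API
looks roughly like the sketch below (error handling omitted). The exact
signature of the reverted lto_codegen_compile_parallel() is not quoted in
this thread, so it is only described in a comment:

    // Sketch of the *existing* single-object path: the merged module (t.bc in
    // the example above) is compiled into one native object in a memory buffer.
    #include <llvm-c/lto.h>
    #include <cstdio>

    int main() {
      lto_code_gen_t cg = lto_codegen_create();

      // In the example above these would be b.bc and d.bc.
      lto_codegen_add_module(cg, lto_module_create("b.bc"));
      lto_codegen_add_module(cg, lto_module_create("d.bc"));

      size_t len = 0;
      const void *obj = lto_codegen_compile(cg, &len); // one buffer == one t.o
      if (obj)
        std::printf("generated one native object, %zu bytes\n", len);

      // The reverted lto_codegen_compile_parallel() from r189297 would instead
      // hand back several such buffers (p1.o, p2.o, ...) for the linker to
      // consume; its exact signature is not shown in this thread.
      lto_codegen_dispose(cg);
      return 0;
    }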
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Better yet, could we have that discussion before growing the set of
>>> stable APIs that we claim to never regress?
>>>
>>>
>>> Sure, sorry about that. I actually don't want to touch the lto_xxx()
>>> API for now. I just want to work around a limitation in the linker and
>>> wait for the new ld, but Bob didn't buy my argument :-).
>>>
>>>
>>>
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>



