[llvm] r189297 - Add new API lto_codegen_compile_parallel().

Thu Sep 19 18:38:38 PDT 2013

Steve:

Sorry for causing the misunderstanding here; I am not trying to convince the community to accept this patch. As you said, this is just an experimental project; verification with only small test coverage is not enough.

This is just for discussion, to show that pass-level parallelism is possible, since Shuxin proposed another solution. I think the community will work out the most appropriate solution to improve code generation, so I don't mind which patch ends up upstream.

Thanks
Wan Xiaofei

From: Stephen Hines [mailto:srhines at google.com]
Sent: Friday, September 20, 2013 8:47 AM
To: Wan, Xiaofei
Cc: Eric Christopher; Shuxin Yang; Chandler Carruth; llvm-commits at cs.uiuc.edu
Subject: Re: [llvm] r189297 - Add new API lto_codegen_compile_parallel().

Although this was merged into an AOSP project, I want to make it clear that this is *NOT* the official LLVM toolchain for Android (and thus does not constitute endorsement of this patch). That repository is an experimental branch for a 20% project at Google that wanted to try out the patch. Please do not use unofficial sources to try to convince the LLVM community that your patch has been accepted/verified by Android.

We will continue to only accept upstream patches for rebasing our Android LLVM sources. When this patch or something different gets accepted as the proper way to improve code generation performance, we will be using the same patch as upstream.

Thanks,
Steve

On Sun, Sep 15, 2013 at 12:30 AM, Wan, Xiaofei <xiaofei.wan at intel.com> wrote:
Interesting. What's the difference (or your opinion) here between, say, parallelizing codegen/post-ipo passes and splitting the module?
Why go for the second rather than the first?
[Xiaofei] The first one is just what I proposed, almost at the same time as Shuxin proposed his idea; we have merged it into the AOSP/llvm-toolchain project; it speeds up back-end code-gen by 3.5X with 4 threads on our device.
We tried what Shuxin proposed and found that module partitioning is not a good solution, since "module partition, binary merge" takes considerable time; we abandoned module partitioning and turned to function-based parallelism (parallelizing passes).

LLVM back-end compilation time is important to our business (we only care about compilation time without LTO); I look forward to the community coming to an agreement on the final solution for parallelizing the back-end codegen passes; any solution is OK. Meanwhile I will keep my proposal open here; we do hope the community can arrive at a good solution.

Here are the earlier discussion and the code we have merged into AOSP/toolchain.
http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-July/063796.html
http://llvm-reviews.chandlerc.com/D1152
https://android-review.googlesource.com/#/c/62308/

Thanks
Wan Xiaofei

-----Original Message-----
From: llvm-commits-bounces at cs.uiuc.edu [mailto:llvm-commits-bounces at cs.uiuc.edu] On Behalf Of Eric Christopher
Sent: Wednesday, September 04, 2013 12:08 AM
To: Shuxin Yang
Cc: llvm-commits at cs.uiuc.edu
Subject: Re: [llvm] r189297 - Add new API lto_codegen_compile_parallel().

On Tue, Aug 27, 2013 at 10:42 AM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
> Reverted in 189386.  Once again, I apologize that I didn't follow the
> canonical procedure.
> I personally thought Nick's proposal was clean enough for our system, and
> took for granted that the community would like it.
>

It's not necessarily bad, but the lto library is a bit funky and perhaps a new lto library is what we need :)

> I will not initiate a discussion for now. I'd like to cool things down
> for a while. (maybe postpone indefinitely).
>
> As with most infrastructure-related projects, partitioning is
> unglamorous and painstaking work.
> I stepped forward to take it on just because we have almost no way to
> debug or investigate LTO.
>

Absolutely.

> For those who are curious about how much we can speed up by partitioning:
> unfortunately, I can't tell,
> as the project is not yet completely done. My rudimentary (quite
> stupid, actually) implementation using the make utility speeds up the
> command "clang++ Xalancbmk/*.o -flto"
> by 39% (35s vs 21s; Xalancbmk has 700+ inputs).  It is a bit of a shame
> for partitioning. But at the very least, each partition is under human control.
> On the other hand, post-IPO scalar optimization is not yet
> parallelized in my rudimentary implementation (i.e. so far only
> the codegen part is parallelized). Surprisingly, the result is very
> consistent with what Xiaofei achieved via multi-threaded code-gen.  As
> far as I can recall, his speedup was some 2.9x. In my case, it takes about
> 13s before code-gen starts,
> meaning the speedup of code-gen is about (35-13)/(21-13) = 2.75x.
> (Code-gen plus the linker's post-processing takes 35-13 = 22s without partitioning.)
>
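
The arithmetic in the quoted paragraph can be checked with a small sketch. It assumes only the three rounded timings given in the message (35s, 21s, 13s); the variable and function names are illustrative and do not come from any LLVM tool.

```python
# Rounded timings from the quoted message, in seconds.
serial_total = 35.0    # "clang++ Xalancbmk/*.o -flto" without partitioning
parallel_total = 21.0  # same command with partitioned, parallel code-gen
pre_codegen = 13.0     # time before code-gen starts (not parallelized)


def codegen_speedup(serial, parallel, fixed):
    """Speedup of the parallelized portion only, excluding the fixed prefix."""
    return (serial - fixed) / (parallel - fixed)


def wall_clock_reduction(serial, parallel):
    """Fraction of total wall-clock time saved."""
    return (serial - parallel) / serial


print(codegen_speedup(serial_total, parallel_total, pre_codegen))
print(wall_clock_reduction(serial_total, parallel_total))
```

With these rounded inputs the overall reduction comes out at about 40%, close to the 39% reported, while the code-gen portion alone speeds up by the 2.75x stated; the fixed 13s prefix is what keeps the overall figure so much lower.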

Interesting. What's the difference (or your opinion) here between, say, parallelizing codegen/post-ipo passes and splitting the module?
Why go for the second rather than the first?

-eric

>
> On 8/27/13 12:27 AM, Shuxin Yang wrote:
>
> On 8/26/13 11:19 PM, Chandler Carruth wrote:
>
> On Mon, Aug 26, 2013 at 5:53 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>>
>> We certainly need a way to feed multiple resulting objects back to the linker.
>> There are a couple of ways
>> to this end:
>>
>>    1) "ld -r all-resulting-obj-on-disk -o result.o", and feed the
>> only object file (i.e. the result.o)
>>        back to the linker
>>
>>    2) keep the resulting objects in memory buffers, and feed the
>> buffers back to the linker
>>        (as proposed by Nick)
>>
>>    3) As with GNU gold, save the resulting objects on disk, and
>> feed these disk files back to the linker one by one.
>>
>>     I'm not a big linker nut. I don't know which way works better.  I tried to
>> use
>> 1) as a workaround for the time being before 2) is available. People
>> at Apple disagree with my engineering approach.
>>
>>     From the compiler's perspective,
>>     o. 1) is not just a workaround; 3) is certainly better than 1).
>>     o. 2) will win if the program being compiled is small- or
>> medium-sized.
>>         With huge programs, it will be difficult for the compiler to
>> decide when and how to "spill" some stuff
>>         from memory to disk.  Folks at Apple iterate and reiterate that we
>> only consider the case where the entire
>>        program can be loaded in memory. So the added difficulty for
>> the compiler does not seem to be a
>>        problem for the workload we care about.
>
>
> Shuxin, I'm not sure what you're trying to accomplish here, but I
> don't think this is the right approach.
>
> First, you seem to be pursuing a partitioning scheme for parallelizing
> LTO work despite *no* consensus that this is the correct approach
>
> I sent a proposal a long time ago; as far as I can tell from the
> mailing list, there was no objection at all.
> Actually, my approach is not new at all. It is almost a "std" way
> to perform partitioning. It looks similar to all the LTOs I have worked or played with before.
> It just needs some LLVM flavor.  But this change has nothing to do with the
> partition implementation; it just adds an interface.
>
> in any of the community discussions I can find. Please don't commit
> code toward a design that the community has expressed serious
> reservations about without review.
>
> Second, you are committing a new API to the set of the stable C APIs
> that libLTO exposes without a thorough discussion on the mailing list.
>
> Sorry, I thought this was pretty much an Apple thing, as no other system uses
> this API.
> I will revert tomorrow, and initiate a discussion.
>
> The APIs are almost divided into two classes: one for Unix + gold, the
> other for OSX + Apple LD.
> I don't like the way it is, and I don't like such APIs at all (I
> mean all of them).
>  I used to argue that we are better off having a symbol-related interface
> instead of an LTO-related API.
>  But the community does not buy my point.  As I have little knowledge
> about LLVM, I have to keep an open mind and adapt to LLVM thinking,
> but it certainly takes some time.
>
> It is possible I have missed this discussion, but I did look and
> failed to find anything that seems to resemble a review, much less an
> LGTM. If I have missed it, I apologize and please direct me at the
> thread. I bring this up because the specific interface seems surprising to me.
>
> Third, you are justifying the particular approach with a deflection to
> some discussion within Apple or with those developers you work with at Apple.
> While this may in fact be the motivation for this patch, the open
> source community is often not party to these discussions. ;]
>
> That is true:-)
>
> It would help us if you would just give the specific basis rather than
> referencing a discussion that we weren't involved with. As it happens,
> I suspect I agree with these "Folks in Apple" that it is useful to
> specifically optimize for the case that an entire program fits into
> memory, bypassing the filesystem.
>
> You bet!
>
> I debated with them. No chance to win. Why didn't you suspect it in the
> first place :-).
> But "folks in Apple" argue that this is the plan for the future.  It does not
> seem to be a pretty lame argument, as the current implementation of LTO brings
> everything into memory.
>
> However, there are many paths to that end result. From the little
> information in the commit log there isn't really enough to tell why
> *this* is the necessary path forward (in fact, I'm somewhat confident it isn't).
>
> In concept, there is only one alternative: compile the merged
> module into multiple objects, and feed the objects back to the linker.
>
>
>
>
> So, to get back to Eric's original question: what is the motivation
> for this API, its expected actual usage, and the reason why it is
> important to stub out in this way now?
>
> The motivation is: the existing LTO compiles the merged module into a
> *single* object;
>   this new API enables compiling the merged module into
> *multiple* objects.
>   I'm wondering if this is clear now.
>
>    For instance, suppose the command line is "clang -flto a.o b.bc c.o d.bc"
> (*.o are real objects, and *.bc are bitcode).
>   The existing LTO will merge b.bc and d.bc into t.bc (the merged module),
> compile the merged t.bc into t.o, and feed t.o back to the
> linker, which combines a.o, c.o and t.o into a.out.
>
>    The new API will trigger the compiler to convert t.bc into p1.o,
> p2.o, ..., and feed these p*.o back to the linker, which
>   combines a.o, c.o and the p*.o into a.out.
>
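
The two flows described in the quoted example can be modelled with a toy sketch. It does not call libLTO or any linker; it only tracks file names. The function name `lto_link_inputs` and the `partitions` parameter are hypothetical, while `t.o` and `p1.o, p2.o, ...` follow the naming used in the message.

```python
def lto_link_inputs(inputs, partitions=1):
    """Return the object files the linker would finally combine (toy model)."""
    objects = [f for f in inputs if f.endswith(".o")]   # real object files
    bitcode = [f for f in inputs if f.endswith(".bc")]  # LLVM bitcode files
    if not bitcode:
        return objects
    # All bitcode inputs are merged into one module (t.bc in the message).
    if partitions == 1:
        generated = ["t.o"]  # existing behaviour: a single object
    else:
        # Proposed behaviour: the merged module becomes several objects.
        generated = [f"p{i}.o" for i in range(1, partitions + 1)]
    return objects + generated


cmdline = ["a.o", "b.bc", "c.o", "d.bc"]
print(lto_link_inputs(cmdline))                # single-object flow
print(lto_link_inputs(cmdline, partitions=2))  # multi-object flow
```

In both cases the real objects pass through untouched; the only difference is how many objects stand in for the merged bitcode when the result is handed back to the linker.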
>
>
>
>
>
>
> Better yet, could we have that discussion before growing the set of
> stable APIs that we claim to never regress?
>
>
> Sure. Sorry about that. I actually don't want to touch the lto_xxx()
> API for now.  I just wanted to work around the limitation of
> the linker, and wait for the new ld. But Bob didn't buy my argument :-).
>
>
>
_______________________________________________
llvm-commits mailing list
llvm-commits at cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
