[LLVMdev] [Proposal] Parallelize post-IPO stage.
Evan Cheng
evan.cheng at apple.com
Tue Jul 16 05:23:36 PDT 2013
Thanks for the proposal. This is important work which is one step towards making LTO more applicable for large applications. Some comments inline.
On Jul 12, 2013, at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>
>
> 3.1.1 Figure out Partition scheme
> ----------------------------------
> We randomly pick some functions and put them in a partition.
> It would be nice to perform some optimization at this point. One opt
> in my mind is to reorder functions in order to reduce the working set and
> improve locality.
>
> Unfortunately, this opt seems to be a bit blind at this time, because
> - the CallGraph is not annotated with estimated or profiled frequency.
> - some linkers don't respect the order. It seems they just
> remember the function order of the pristine input obj/fake-obj,
> and enforce this order at the final link (link-exec/shared-lib) stage.
>
> Anyway, I will try to ignore all these problems and perform partitioning
> via the following steps. Maybe we will have some luck on some platforms:
>
> o. DFS the call-graph, ignoring the self-recursive edges; if frequency
> information is available, prioritize the edges (i.e. the corresponding
> call-sites) such that frequent edges are visited first.
>
> o. Cut the DFS spanning tree obtained from the previous step bottom-up.
> Each cut/partition contains a reasonable # of functions, and the aggregate
> size of the functions in the partition should not exceed a predefined
> threshold.
I'd like to see more details about this step. How do you determine the "reasonable # of functions"? How do you define the threshold? It has to be the same for a given target / platform regardless of the configuration of the build machine, right?
>
> o. Repeat the previous step until the call-graph's DFS spanning tree
> is empty.
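> The steps above could be sketched roughly like this (plain C++ over a toy
> call-graph; `Function`, `sizeInBytes`, and the threshold are illustrative
> assumptions, not LLVM's real CallGraph data structures):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Toy call-graph node; a real implementation would wrap llvm::CallGraphNode.
struct Function {
  std::string name;
  size_t sizeInBytes;              // estimated code size
  std::vector<Function*> callees;  // ideally sorted by call frequency
};

// Post-order (bottom-up) DFS over the call graph, ignoring self-recursive
// edges; each time the accumulated size crosses the threshold, close the
// current partition (i.e. cut the spanning tree) and start a new one.
static std::vector<std::vector<Function*>>
partition(Function *root, size_t threshold) {
  std::vector<std::vector<Function*>> parts(1);
  size_t curSize = 0;
  std::set<Function*> visited;

  std::function<void(Function*)> dfs = [&](Function *f) {
    if (!visited.insert(f).second)
      return;
    for (Function *callee : f->callees)
      if (callee != f)              // skip self-recursive edges
        dfs(callee);
    parts.back().push_back(f);      // post-order: callees land first
    curSize += f->sizeInBytes;
    if (curSize >= threshold) {     // cut here, open a new partition
      parts.emplace_back();
      curSize = 0;
    }
  };
  dfs(root);
  if (parts.back().empty())
    parts.pop_back();
  return parts;
}
```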
>
> 3.1.2 Partition transformation
> ------------------------------
>
> This is a bit involved. There are a bunch of problems we have to tackle.
> 1) When the use/def of a symbol are separated into different modules,
> its attributes, like linkage and visibility, need to be changed
> as well.
>
> [Example 1] If a symbol is flagged as "internal" to the module where
> it is defined, the linkage needs to be changed into "internal"
> to the executable/lib being compiled.
>
> [Example 2] For compile-time constants, their initialized value
> needs to be cloned to the partitions where it is referenced.
> The rationale is to let the post-IPO passes take advantage
> of the initialized value to squeeze out some performance.
>
> In order not to bloat the code size, the cloned constant should
> be marked "don't emit". [end of eg2]
>
> Being able to precisely update symbols' attributes is not only
> vital to correctness, it has a significant impact on
> performance as well.
>
> I have not yet done a thorough investigation of this issue. My
> rudimentary implementation simply flags a symbol "external" when its
> use/def are separated into different modules. I believe this is one
> of the most difficult parts of this work. I guess it is going to
> take a long time to become stable.
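> The rudimentary rule described here can be modeled with a toy symbol
> table (plain C++; the `Symbol` record and partition numbering are
> illustrative assumptions, not LLVM's GlobalValue machinery):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

enum class Linkage { Internal, External };

// Toy symbol-table entry: which partition defines a symbol, which use it.
struct Symbol {
  int defPartition = 0;
  std::set<int> usePartitions;
  Linkage linkage = Linkage::Internal;
};

// The rudimentary rule from the text: if a symbol's definition and any of
// its uses end up in different partitions, it can no longer stay
// module-internal, so promote it to external linkage.
static void fixupLinkage(std::map<std::string, Symbol> &symtab) {
  for (auto &[name, sym] : symtab) {
    for (int user : sym.usePartitions) {
      if (user != sym.defPartition) {
        sym.linkage = Linkage::External;
        break;
      }
    }
  }
}
```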
>
> 2) In order to compile each partition in a separate thread (see
> Section 3.2), we have to put the partitions in distinct LLVMContexts.
>
> I could be wrong, but I couldn't find code which is able to
> perform function cloning across LLVMContexts.
>
> My workaround in the patch is to perform function cloning in
> one LLVMContext (but in a different Module, of course), then
> save the module to a disk file, and load it into memory using a
> new LLVMContext.
>
> It is a bit circuitous and expensive.
Do you plan to fix this? What are the issues that prevented function cloning across multiple LLVMContexts?
Evan
>
> One random observation:
> Currently, function-scoped static variables are considered
> "global variables". When cloning a function with a static variable,
> the compiler has no idea whether the static variable is used only by
> the function being cloned, and hence cannot confidently separate the
> function from the variable.
>
> I guess it would be nice if we organized symbols by scope
> instead of lifetime; it would be convenient for this situation.
>
> 3.2 Compile partitions independently
> --------------------------------------
>
> There are two camps: one camp advocates compiling partitions via multiple
> processes, the other favors multiple threads.
>
> Inside Apple's compiler teams, I'm the only one belonging to the 1st camp.
> I think while multi-process sounds a bit redneck, it has its advantages
> for this purpose, and while multi-threading is certainly more eye-popping,
> it has its advantages as well.
>
> The advantages of multi-process are:
> 1) It is easier to implement; each process runs in its own address space.
> We don't need to worry about them interfering with each other.
>
> 2) Huge, if not unlimited, address space.
>
> The disadvantage is that it's expensive. But I guess the cost is
> almost negligible compared to the overall IPO compilation.
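> A minimal sketch of the multi-process camp, assuming a POSIX system
> (`codegen` here is a placeholder for exec'ing the real post-IPO codegen
> on one partition; it is not part of the proposal's patch):

```cpp
#include <sys/wait.h>  // waitpid, WIFEXITED
#include <unistd.h>    // fork, _exit
#include <string>
#include <vector>

// Placeholder: a real driver would execvp() llc (or equivalent) here.
static int codegen(const std::string &bcFile) {
  (void)bcFile;
  return 0;
}

// Fork one child per partition; each child gets its own address space,
// so there are no sharing hazards between the compilations.
static bool compileAllForked(const std::vector<std::string> &parts) {
  std::vector<pid_t> kids;
  for (const auto &bc : parts) {
    pid_t pid = fork();
    if (pid == 0)
      _exit(codegen(bc));   // child: compile one partition, then exit
    kids.push_back(pid);
  }
  bool ok = true;
  for (pid_t pid : kids) {  // parent: reap children, check exit status
    int status = 0;
    waitpid(pid, &status, 0);
    ok = ok && WIFEXITED(status) && WEXITSTATUS(status) == 0;
  }
  return ok;
}
```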
>
> The advantages of multi-threading I can imagine are:
> 1) it sounds fancy
> 2) it is light-weight
> 3) inter-thread communication is easier than IPC.
>
> Its disadvantages are:
> 1) We will often come across race conditions, and it can take an
> awfully long time to figure them out. While the code is supposed
> to be multi-thread safe, we might miss some tricky cases.
> Troubleshooting a race condition is a nightmare.
>
> 2) Small address space. This is a big problem if the compiler
> is built 32-bit. In that case, the compiler is not able to bring
> lots of stuff into memory even if the HW does
> provide ample memory.
>
> 3) The thread-safe run-time library is more expensive.
> I once linked a compiler using -lpthread (I did not have to) on a
> UNIX platform, and saw the compiler slow down by about 1/3.
>
> I'm not able to convince the folks in the other camp, and neither are they
> able to convince me. I have decided to implement both. Fortunately, this
> part is not difficult; it seems rather easy to crank out each one within a
> short period of time. It would be interesting to compare them side-by-side
> and see which camp loses :-). On the other hand, if we run into race-condition
> problems, we can choose the multi-process version as a fallback.
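> The multi-thread camp's version is equally small to sketch (again,
> `compilePartition` is a stand-in for the real per-partition codegen; in
> the actual design each worker would own a distinct LLVMContext):

```cpp
#include <atomic>
#include <string>
#include <thread>
#include <vector>

// Stand-in for running the post-IPO pipeline on one partition.
static void compilePartition(const std::string &bcFile,
                             std::atomic<int> &done) {
  (void)bcFile;           // ... codegen for bcFile would happen here ...
  done.fetch_add(1);
}

// Launch one worker thread per partition and join them all; returns the
// number of partitions that finished.
static int compileAll(const std::vector<std::string> &partitions) {
  std::atomic<int> done{0};
  std::vector<std::thread> workers;
  for (const auto &bc : partitions)
    workers.emplace_back(compilePartition, std::cref(bc), std::ref(done));
  for (auto &t : workers)
    t.join();
  return done.load();
}
```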
>
> Regardless of which technique is used to compile the partitions
> independently, in order to judiciously and adaptively choose an appropriate
> parallelism factor, the compiler certainly needs a library which is able to
> figure out the load the entire system is under. I don't know if there is
> such a magic library or not.
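> Absent such a library, one crude heuristic is to start from the hardware
> thread count and back off by the current load average. The sketch below
> assumes a POSIX system with getloadavg(); the back-off policy itself is
> just an illustration, not a recommendation:

```cpp
#include <algorithm>
#include <cstdlib>   // getloadavg (POSIX)
#include <thread>

// Pick a parallelism factor: number of hardware threads minus the current
// 1-minute load average, never dropping below 1. If load information is
// unavailable, fall back to the plain core count.
static unsigned pickParallelFactor() {
  unsigned cores = std::max(1u, std::thread::hardware_concurrency());
  double load[1] = {0.0};
  if (getloadavg(load, 1) < 1)
    return cores;                       // no load info available
  unsigned busy = static_cast<unsigned>(load[0]);
  return cores > busy ? cores - busy : 1u;
}
```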
>
> 4. the tale of two kinds of linker
> ----------------------------------
>
> As far as I can tell, LLVM supports two kinds of linkers for its IPO
> compilation, and the support is embodied by two sets of APIs/interfaces.
>
> o. Interface 1, the stuff named lto_xxxx().
> o. The GNU gold interface.
> The compiler interacts with GNU gold via the adapter implemented
> in tools/gold/gold-plugin.cpp.
>
> This adapter calls interface-1 to control the IPO process.
> It does not have to call the interface APIs; I think it is definitely
> OK for it to call internal functions.
>
> The compiler used to generate a single object file from the merged
> IR; now it will generate multiple of them, one for each partition.
>
> So, interface 1 is *NOT* sufficient any more.
>
> For gold linker users, it is easy to make them happy just by
> hacking the adapter, informing the linker of the new input object
> files. This is done transparently; the users don't need to install a new ld.
>
> For those systems which invoke ld interacting with libLTO.{so,dylib},
> the linker has to accept the new APIs I added to interface-1 in order to
> enable the new functionality. Or maybe we can invoke '/the/path/to/ld -r *.o -o merged.o'
> and feed the merged.o to the linker (this would keep the interface
> intact)? Unfortunately, it does not work at all: how can I know the path
> to the ld? The libLTO.{so,dylib} is invoked as a plugin; it cannot see the argv.
> How about hacking around this by adding a nasty flag pointing to the right ld?
> Well, it works. However, I don't believe many people would like to do it this
> way; that means I would lose the huge number of "QA" folks working hard on this compiler.
>
> What's wrong with interface-1? The ld side is more active than
> the compiler side; however, in concept the IPO is driven by the compiler side.
> This means this interface changes over time.
>
> In contrast, the gold interface (as I reverse-engineered it from the adapter
> code) is more symbol-centric, taking little of the IPO business into account.
> That interface is simple and stable.
>
> 5. the rudimentary implementation
> ---------------------------------
>
> I made it work for bzip2 in cpu2kint yesterday. bzip2 is a "tiny"
> program; I intentionally lowered the partition size to get 3 partitions.
> There is no comment in the code, and it definitely needs a rewrite. I
> just checked the correctness (with the ref input), and I didn't measure how much
> it degrades the performance (due to the problem I have not yet had a chance
> to tackle; see Section 3.1.2, the symbol-attribute stuff).
>
> The control flow basically is:
> 1) Add a module pass to the IPO pass-manager, and figure
> out the partition scheme.
>
> 2) Physically partition the merged module.
> The IR and the obj of each partition are placed in a new dir, "llvmipo" by default.
>
> --
> ls llvmipo/
> Makefile merged part1.bc part1.o part2.bc part2.o part3.bc part3.o
> --
>
> 3) For demo purposes, I drive the post-IPO stage via a makefile, which
> encapsulates hacks and other nasty stuff.
>
> NOTE that the post-IPO passes in my hack contain only CodeGen passes. We need to
> reorganize PassManagerBuilder::populateLTOPassManager(), which intermingles
> IPO passes with intra-procedural scalar passes; we need to separate them and move
> the intra-procedural scalar passes to the post-IPO stage.
>
>
> .PHONY = all
>
> BC  = part1.bc part2.bc part3.bc
> OBJ = ${BC:.bc=.o}
>
> all : merged
>
> %.o : %.bc
> 	$(HOME)/tmp/lto.llc -filetype=obj $+ -o $@
>
> merged : $(OBJ)
> 	/usr/bin/ld $+ -r -o $@
>
>
> 4) As the Makefile suggests, the *.o files of the partitions are linked into a
> single object "merged" and fed back to the linker.
>
>
> 6. Miscellaneous
> ===========
> Will partitioning degrade performance in theory? I think it depends on the
> definition of performance. If performance means execution time, I guess it does not.
> However, if performance includes code size, I think it may have some negative impact.
> Following are a few scenarios:
>
> - constants generated by the post-IPO passes are not shared across partitions
> - dead functions may be detected during the post-IPO stage, and they may not be deleted.
>
>
>
>
> <post-ipo.patch>_______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev