[LLVMdev] [Proposal] Parallelize post-IPO stage.
Evan Cheng
evan.cheng at apple.com
Tue Jul 16 05:23:36 PDT 2013
Thanks for the proposal. This is important work which is one step towards making LTO more applicable for large applications. Some comments inline.
On Jul 12, 2013, at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>
>
> 3.1.1 Figure out Partition scheme
> ----------------------------------
> We randomly pick some functions and put them in a partition.
> It would be nice to perform some optimization at this point. One opt
> in my mind is to reorder functions in order to reduce the working set and
> improve locality.
>
> Unfortunately, this opt seems to be a bit blind at this time, because
> - the CallGraph is not annotated with estimated or profiled frequency.
> - some linkers don't respect the order. It seems they just
> remember the function order of the pristine input obj/fake-obj,
> and enforce this order at the final link (link-exec/shared-lib) stage.
>
> Anyway, I will try to ignore all these problems and perform partitioning
> via the following steps. Maybe we will have some luck on some platforms:
>
> o. DFS the call-graph, ignoring the self-recursive edges; if frequency
> information is available, prioritize the edges (i.e. the corresponding
> call-sites) such that frequent edges are visited first.
>
> o. Cut the DFS spanning tree obtained from the previous step bottom-up.
> Each cut/partition contains a reasonable # of functions, and the aggregate
> size of the functions in the partition should not exceed a predefined
> threshold.
I'd like to see more details about this step. How do you determine the "reasonable # of functions"? How do you define the threshold? It has to be the same for a given target / platform regardless of the configuration of the build machine, right?
>
> o. Repeat the previous step until the call-graph's DFS spanning tree
> is empty.
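> The steps above could be sketched roughly like this (plain C++ over a toy
> call-graph; `Function`, `sizeInBytes`, and the threshold are illustrative
> assumptions, not LLVM's real CallGraph data structures):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Toy call-graph node; a real implementation would wrap llvm::CallGraphNode.
struct Function {
  std::string name;
  size_t sizeInBytes;              // estimated code size
  std::vector<Function*> callees;  // ideally sorted by call frequency
};

// Post-order (bottom-up) DFS over the call graph, ignoring self-recursive
// edges; each time the accumulated size crosses the threshold, close the
// current partition (i.e. cut the spanning tree) and start a new one.
static std::vector<std::vector<Function*>>
partition(Function *root, size_t threshold) {
  std::vector<std::vector<Function*>> parts(1);
  size_t curSize = 0;
  std::set<Function*> visited;

  std::function<void(Function*)> dfs = [&](Function *f) {
    if (!visited.insert(f).second)
      return;
    for (Function *callee : f->callees)
      if (callee != f)              // skip self-recursive edges
        dfs(callee);
    parts.back().push_back(f);      // post-order: callees land first
    curSize += f->sizeInBytes;
    if (curSize >= threshold) {     // cut here, open a new partition
      parts.emplace_back();
      curSize = 0;
    }
  };
  dfs(root);
  if (parts.back().empty())
    parts.pop_back();
  return parts;
}
```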
>
> 3.1.2 Partition transformation
> ------------------------------
>
> This is a bit involved. There are a bunch of problems we have to tackle.
> 1) When the use/def of a symbol are separated into different modules,
> its attributes, like linkage and visibility, need to be changed
> as well.
>
> [Example 1] If a symbol is flagged as "internal" to the module where
> it is defined, the linkage needs to be changed into "internal"
> to the executable/lib being compiled.
>
> [Example 2] For compile-time constants, their initialized value
> needs to be cloned to the partitions where it is referenced.
> The rationale is to let the post-IPO passes take advantage
> of the initialized value to squeeze out some performance.
>
> In order not to bloat the code size, the cloned constant should
> be marked "don't emit". [end of eg2]
>
> Being able to precisely update symbols' attributes is not only
> vital to correctness, it has a significant impact on
> performance as well.
>
> I have not yet done a thorough investigation of this issue. My
> rudimentary implementation simply flags a symbol "external" when its
> use/def are separated into different modules. I believe this is one
> of the most difficult parts of this work. I guess it is going to
> take a long time to become stable.
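> The rudimentary rule described here can be modeled with a toy symbol
> table (plain C++; the `Symbol` record and partition numbering are
> illustrative assumptions, not LLVM's GlobalValue machinery):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

enum class Linkage { Internal, External };

// Toy symbol-table entry: which partition defines a symbol, which use it.
struct Symbol {
  int defPartition = 0;
  std::set<int> usePartitions;
  Linkage linkage = Linkage::Internal;
};

// The rudimentary rule from the text: if a symbol's definition and any of
// its uses end up in different partitions, it can no longer stay
// module-internal, so promote it to external linkage.
static void fixupLinkage(std::map<std::string, Symbol> &symtab) {
  for (auto &[name, sym] : symtab) {
    for (int user : sym.usePartitions) {
      if (user != sym.defPartition) {
        sym.linkage = Linkage::External;
        break;
      }
    }
  }
}
```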
>
> 2) In order to compile each partition in a separate thread (see
> Section 3.2), we have to put the partitions in distinct LLVMContexts.
>
> I could be wrong, but I couldn't find code which is able to
> perform function cloning across LLVMContexts.
>
> My workaround in the patch is to perform function cloning in
> one LLVMContext (but in a different Module, of course), then
> save the module to a disk file, and load it into memory using a
> new LLVMContext.
>
> It is a bit circuitous and expensive.
Do you plan to fix this? What are the issues that prevented function cloning across multiple LLVMContexts?
Evan
>
> One random observation:
> Currently, function-scoped static variables are considered
> "global variables". When cloning a function with a static variable,
> the compiler has no idea whether the static variable is used only by
> the function being cloned, and hence cannot confidently separate the
> function from the variable.
>
> I guess it would be nice if we organized symbols by scope
> instead of lifetime; it would be convenient for this situation.
>
> 3.2 Compile partitions independently
> --------------------------------------
>
> There are two camps: one camp advocates compiling partitions via multiple
> processes, the other favors multiple threads.
>
> Inside Apple's compiler teams, I'm the only one belonging to the 1st camp.
> I think while multi-process sounds a bit redneck, it has its advantages
> for this purpose, and while multi-threading is certainly more eye-popping,
> it has its advantages as well.
>
> The advantages of multi-process are:
> 1) It is easier to implement; each process runs in its own address space.
> We don't need to worry about them interfering with each other.
>
> 2) Huge, if not unlimited, address space.
>
> The disadvantage is that it's expensive. But I guess the cost is
> almost negligible compared to the overall IPO compilation.
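> A minimal sketch of the multi-process camp, assuming a POSIX system
> (`codegen` here is a placeholder for exec'ing the real post-IPO codegen
> on one partition; it is not part of the proposal's patch):

```cpp
#include <sys/wait.h>  // waitpid, WIFEXITED
#include <unistd.h>    // fork, _exit
#include <string>
#include <vector>

// Placeholder: a real driver would execvp() llc (or equivalent) here.
static int codegen(const std::string &bcFile) {
  (void)bcFile;
  return 0;
}

// Fork one child per partition; each child gets its own address space,
// so there are no sharing hazards between the compilations.
static bool compileAllForked(const std::vector<std::string> &parts) {
  std::vector<pid_t> kids;
  for (const auto &bc : parts) {
    pid_t pid = fork();
    if (pid == 0)
      _exit(codegen(bc));   // child: compile one partition, then exit
    kids.push_back(pid);
  }
  bool ok = true;
  for (pid_t pid : kids) {  // parent: reap children, check exit status
    int status = 0;
    waitpid(pid, &status, 0);
    ok = ok && WIFEXITED(status) && WEXITSTATUS(status) == 0;
  }
  return ok;
}
```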
>
> The advantages of multi-threading I can imagine are:
> 1) it sounds fancy
> 2) it is light-weight
> 3) inter-thread communication is easier than IPC.
>
> Its disadvantages are:
> 1) We will often come across race conditions, and it can take an
> awfully long time to figure them out. While the code is supposed
> to be multi-thread safe, we might miss some tricky cases.
> Troubleshooting a race condition is a nightmare.
>
> 2) Small address space. This is a big problem if the compiler
> is built 32-bit. In that case, the compiler is not able to bring
> lots of stuff into memory even if the HW does
> provide ample memory.
>
> 3) The thread-safe run-time library is more expensive.
> I once linked a compiler using -lpthread (I did not have to) on a
> UNIX platform, and saw the compiler slow down by about 1/3.
>
> I'm not able to convince the folks in the other camp, and neither are they
> able to convince me. I have decided to implement both. Fortunately, this
> part is not difficult; it seems rather easy to crank out each one within a
> short period of time. It would be interesting to compare them side-by-side
> and see which camp loses :-). On the other hand, if we run into race-condition
> problems, we can choose the multi-process version as a fallback.
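> The multi-thread camp's version is equally small to sketch (again,
> `compilePartition` is a stand-in for the real per-partition codegen; in
> the actual design each worker would own a distinct LLVMContext):

```cpp
#include <atomic>
#include <string>
#include <thread>
#include <vector>

// Stand-in for running the post-IPO pipeline on one partition.
static void compilePartition(const std::string &bcFile,
                             std::atomic<int> &done) {
  (void)bcFile;           // ... codegen for bcFile would happen here ...
  done.fetch_add(1);
}

// Launch one worker thread per partition and join them all; returns the
// number of partitions that finished.
static int compileAll(const std::vector<std::string> &partitions) {
  std::atomic<int> done{0};
  std::vector<std::thread> workers;
  for (const auto &bc : partitions)
    workers.emplace_back(compilePartition, std::cref(bc), std::ref(done));
  for (auto &t : workers)
    t.join();
  return done.load();
}
```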
>
> Regardless of which technique is used to compile the partitions
> independently, in order to judiciously and adaptively choose an appropriate
> parallelism factor, the compiler certainly needs a library which is able to
> figure out the load the entire system is under. I don't know if there is
> such a magic library or not.
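> Absent such a library, one crude heuristic is to start from the hardware
> thread count and back off by the current load average. The sketch below
> assumes a POSIX system with getloadavg(); the back-off policy itself is
> just an illustration, not a recommendation:

```cpp
#include <algorithm>
#include <cstdlib>   // getloadavg (POSIX)
#include <thread>

// Pick a parallelism factor: number of hardware threads minus the current
// 1-minute load average, never dropping below 1. If load information is
// unavailable, fall back to the plain core count.
static unsigned pickParallelFactor() {
  unsigned cores = std::max(1u, std::thread::hardware_concurrency());
  double load[1] = {0.0};
  if (getloadavg(load, 1) < 1)
    return cores;                       // no load info available
  unsigned busy = static_cast<unsigned>(load[0]);
  return cores > busy ? cores - busy : 1u;
}
```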
>
> 4. the tale of two kinds of linker
> ----------------------------------
>
> As far as I can tell, LLVM supports two kinds of linkers for its IPO
> compilation, and the support is embodied by two sets of APIs/interfaces.
>
> o. Interface 1, the stuff named lto_xxxx().
> o. The GNU gold interface.
> The compiler interacts with GNU gold via the adapter implemented
> in tools/gold/gold-plugin.cpp.
>
> This adapter calls interface-1 to control the IPO process.
> It does not have to call the interface APIs; I think it is definitely
> OK for it to call internal functions.
>
> The compiler used to generate a single object file from the merged
> IR; now it will generate multiple of them, one for each partition.
>
> So, interface 1 is *NOT* sufficient any more.
>
> For gold linker users, it is easy to make them happy just by
> hacking the adapter, informing the linker of the new input object
> files. This is done transparently; the users don't need to install a new ld.
>
> For those systems which invoke ld interacting with libLTO.{so,dylib},
> the linker has to accept the new APIs I added to interface-1 in order to
> enable the new functionality. Or maybe we can invoke '/the/path/to/ld -r *.o -o merged.o'
> and feed the merged.o to the linker (this would keep the interface
> intact)? Unfortunately, it does not work at all: how can I know the path
> to the ld? The libLTO.{so,dylib} is invoked as a plugin; it cannot see the argv.
> How about hacking around this by adding a nasty flag pointing to the right ld?
> Well, it works. However, I don't believe many people would like to do it this
> way; that means I would lose the huge number of "QA" folks working hard on this compiler.
>
> What's wrong with interface-1? The ld side is more active than
> the compiler side; however, in concept the IPO is driven by the compiler side.
> This means this interface changes over time.
>
> In contrast, the gold interface (as I reverse-engineered it from the adapter
> code) is more symbol-centric, taking little of the IPO business into account.
> That interface is simple and stable.
>
> 5. the rudimentary implementation
> ---------------------------------
>
> I made it work for bzip2 in cpu2kint yesterday. bzip2 is a "tiny"
> program; I intentionally lowered the partition size to get 3 partitions.
> There is no comment in the code, and it definitely needs a rewrite. I
> just checked the correctness (with the ref input), and I didn't measure how much
> it degrades the performance (due to the problem I have not yet had a chance
> to tackle; see Section 3.1.2, the symbol-attribute stuff).
>
> The control flow basically is:
> 1) Add a module pass to the IPO pass-manager, and figure
> out the partition scheme.
>
> 2) Physically partition the merged module.
> The IR and the obj of each partition are placed in a new dir, "llvmipo" by default.
>
> --
> ls llvmipo/
> Makefile merged part1.bc part1.o part2.bc part2.o part3.bc part3.o
> --
>
> 3) For demo purposes, I drive the post-IPO stage via a makefile, which
> encapsulates hacks and other nasty stuff.
>
> NOTE that the post-IPO passes in my hack contain only CodeGen passes. We need to
> reorganize PassManagerBuilder::populateLTOPassManager(), which intermingles
> IPO passes with intra-procedural scalar passes; we need to separate them and move
> the intra-procedural scalar passes to the post-IPO stage.
>
>
> .PHONY = all
>
> BC  = part1.bc part2.bc part3.bc
> OBJ = ${BC:.bc=.o}
>
> all : merged
>
> %.o : %.bc
> 	$(HOME)/tmp/lto.llc -filetype=obj $+ -o $@
>
> merged : $(OBJ)
> 	/usr/bin/ld $+ -r -o $@
>
>
> 4) As the Makefile suggests, the *.o files of the partitions are linked into a
> single object "merged" and fed back to the linker.
>
>
> 6. Miscellaneous
> ===========
> Will partitioning degrade performance in theory? I think it depends on the
> definition of performance. If performance means execution time, I guess it does not.
> However, if performance includes code size, I think it may have some negative impact.
> Following are a few scenarios:
>
> - constants generated by the post-IPO passes are not shared across partitions
> - dead functions may be detected during the post-IPO stage, and they may not be deleted.
>
>
>
>
> <post-ipo.patch>_______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev