[LLVMdev] [Proposal] Parallelize post-IPO stage.

Evan Cheng evan.cheng at apple.com
Tue Jul 16 10:40:49 PDT 2013

On Jul 16, 2013, at 10:37 AM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:

> On 7/16/13 5:23 AM, Evan Cheng wrote:
>> Thanks for the proposal. This is important work which is one step towards making LTO more applicable for large applications. Some comments inline.
>> On Jul 12, 2013, at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>>> 3.1.1 Figure out Partition scheme
>>> ----------------------------------
>>>  We randomly pick some functions and put them in a partition.
>>> It would be nice to perform some optimization at this point. One opt
>>> in my mind is to reorder functions in order to reduce the working set and
>>> improve locality.
>>>  Unfortunately, this opt seems to be a bit blind at this time, because
>>>    - the CallGraph is not annotated with estimated or profiled frequency.
>>>    - some linkers don't respect the order. It seems they just
>>>      remember the function order of the pristine input obj/fake-obj,
>>>      and enforce this order at the final link (link-exec/shared-lib) stage.
>>>  Anyway, I will ignore all these problems and try to perform the partition
>>> via the following steps. Maybe we will have some luck on some platforms:
>>>  o. DFS the call-graph, ignoring the self-recursive edges; if frequency is
>>>     available, prioritize the edges (i.e. the corresponding call-sites)
>>>     such that frequent edges are visited first.
>>>  o. Cut the DFS spanning tree obtained from the previous step bottom-up.
>>>     Each cut/partition contains a reasonable # of functions, and the aggregate
>>>     size of the functions in the partition should not exceed a predefined
>>>     threshold.
>> I'd like to see more details about this step. How do you determine the "reasonable # of functions"? How do you define the threshold? It has to be the same for a given target / platform regardless of the configuration of the build machine, right?
> Say, each partition should contain:
>    - no more than 100 functions, and
>    - functions whose total size does not exceed the pre-defined threshold.
> These thresholds can be overridden on the command line.

But how do you arrive at the thresholds? And are they fixed thresholds, or determined based on an analysis of function size, etc.?


>>> o. repeat the previous step until the Call-graph's DFS spanning tree
>>>     is empty.
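For illustration, the partition scheme quoted above (DFS of the call graph skipping self-recursive edges, then bottom-up cuts bounded by function count and aggregate size) might be sketched roughly as follows. This is a minimal sketch, not the proposal's implementation: the graph representation, the size metric, and the two threshold values are all hypothetical.

```python
# Sketch (hypothetical model): DFS the call graph, ignoring self-recursive
# edges, then cut the spanning tree bottom-up so each partition stays under
# a function-count and an aggregate-size threshold.

MAX_FUNCS = 100        # illustrative per-partition function count (overridable)
MAX_SIZE = 50_000      # illustrative per-partition aggregate size threshold

def partition_call_graph(call_graph, sizes, roots):
    """call_graph: {func: [callees]}, sizes: {func: estimated size}."""
    partitions, visited = [], set()

    def dfs(fn):
        """Post-order DFS; returns the not-yet-partitioned subtree under fn."""
        visited.add(fn)
        subtree = []
        for callee in call_graph.get(fn, []):
            if callee != fn and callee not in visited:   # skip self-recursion
                subtree.extend(dfs(callee))
        subtree.append(fn)
        # Bottom-up cut: emit a partition once either threshold is reached.
        if (len(subtree) >= MAX_FUNCS
                or sum(sizes[f] for f in subtree) >= MAX_SIZE):
            partitions.append(subtree)
            return []
        return subtree

    for root in roots:
        if root not in visited:
            rest = dfs(root)
            if rest:                  # leftover subtree forms its own partition
                partitions.append(rest)
    return partitions
```

With these made-up thresholds, a partition is emitted as soon as a subtree crosses either bound, and the remainder at each root forms its own partition, so every reachable function lands in exactly one partition.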
>>> 3.1.2 Partition transformation
>>> ------------------------------
>>>  This is a bit involved. There are a bunch of problems we have to tackle.
>>>  1) When the use/def of a symbol are separated into different modules,
>>>     its attributes, like linkage and visibility, need to be changed
>>>     as well.
>>>      [Example 1] If a symbol is flagged as "internal" to the module where
>>>     it is defined, the linkage needs to be changed into "internal"
>>>     to the executable/lib being compiled.
>>>      [Example 2] For compile-time constants, their initialized value
>>>     needs to be cloned to the partitions where it is referenced.
>>>     The rationale is to let the post-ipo passes take advantage
>>>     of the initialized value to squeeze out some performance.
>>>      In order not to bloat the code size, the cloned constant should
>>>     be marked "don't emit". [end of eg2]
>>>       Being able to precisely update symbols' attributes is not only
>>>     vital to correctness, it has a significant impact on
>>>     performance as well.
>>>       I have not yet taken a thorough investigation of this issue. My
>>>     rudimentary implementation simply flags a symbol "external" when its
>>>     use/def are separated into different modules. I believe this is one
>>>     of the most difficult parts of this work. I guess it is going to
>>>     take a long time to become stable.
>>>  2) In order to compile each partition in a separate thread (see
>>>     Section 3.2), we have to put the partitions in distinct LLVMContexts.
>>>     I could be wrong, but I don't find code which is able to
>>>     perform function cloning across LLVMContexts.
>>>     My workaround in the patch is to perform function cloning in
>>>    one LLVMContext (but in a different Module, of course), then
>>>    save the module to a disk file, and load it into memory using a
>>>    new LLVMContext.
>>>     It is a bit circuitous and expensive.
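The linkage/visibility adjustment from [Example 1] in point 1) above can be sketched abstractly. Note this is a toy model, not LLVM's actual symbol representation: the `Symbol` class and the promotion rule are illustrative assumptions about what "internal to the module" vs. "internal to the executable/lib" could mean in practice.

```python
# Sketch (hypothetical model): when a symbol's def and uses are split across
# partitions, module-local ("internal") linkage must be promoted so sibling
# partitions can reference the symbol, while hidden visibility can keep it
# from being exported from the final executable/shared library.

from dataclasses import dataclass

@dataclass
class Symbol:
    name: str
    linkage: str      # "internal" or "external" (toy model)
    visibility: str   # "default" or "hidden" (toy model)

def promote_for_partitioning(sym, def_partition, use_partitions):
    """Adjust attributes when a symbol's def and uses land in different
    partitions; leave purely partition-local symbols untouched."""
    if any(p != def_partition for p in use_partitions):
        if sym.linkage == "internal":
            sym.linkage = "external"    # visible to sibling partitions
            sym.visibility = "hidden"   # but still internal to the final image
    return sym
```

A symbol defined and used within one partition keeps its original attributes; only cross-partition references trigger the promotion.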
>> Do you plan to fix this? What are the issues that prevented function cloning across multiple LLVMContexts?
>> Evan
> We may fix it; I don't know for sure if it is a big gain at this moment.
> The issue is that, as far as I can tell, the current code base does not have functions supporting copying
> IR across different LLVMContexts.
> For example, when it copies an instruction from src to dest,
> it checks the "src", looks at its Type, derives the LLVMContext from that Type, and uses
> the same context for the dest. So, we need to change the existing code.
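The round-trip workaround described in the thread (clone within one context, write the module out, then re-read it under a fresh context) follows a generic serialize/deserialize pattern. The sketch below is only an analogy: it uses Python's pickle as a stand-in for LLVM's bitcode write/read, since the objects, like IR tied to an LLVMContext, cannot simply be handed to another "context" by reference.

```python
# Analogy (not LLVM code): moving data into a fresh "context" by
# serializing and deserializing it, mirroring the patch's workaround of
# saving the cloned module to disk and reloading it under a new LLVMContext.

import pickle

def move_to_new_context(module_in_old_ctx):
    blob = pickle.dumps(module_in_old_ctx)   # "save the module to a disk file"
    return pickle.loads(blob)                # "load it using a new LLVMContext"
```

The result is an equal but fully independent copy, which is exactly why the workaround is correct yet, as noted above, circuitous and expensive: the whole module is encoded and decoded just to sever its ties to the original context.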
