[LLVMdev] [Proposal] Parallelize post-IPO stage.
evan.cheng at apple.com
Tue Jul 16 10:40:49 PDT 2013
On Jul 16, 2013, at 10:37 AM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
> On 7/16/13 5:23 AM, Evan Cheng wrote:
>> Thanks for the proposal. This is important work which is one step towards making LTO more applicable for large applications. Some comments inline.
>> On Jul 12, 2013, at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>>> 3.1.1 Figure out Partition scheme
>>> we randomly pick some functions and put them in a partition.
>>> It would be nice to perform some optimization at this moment. One opt
>>> in my mind is to reorder functions in order to reduce working-set and
>>> improve locality.
>>> Unfortunately, this opt seems to be a bit blind at this time, because
>>> - the CallGraph is not annotated with estimated or profiled frequency.
>>> - some linkers don't respect the order. It seems they just
>>> remember the function order of the pristine input obj/fake-obj,
>>> and enforce this order at the final link (link-exec/shared-lib) stage.
>>> Anyway, I try to ignore all these problems, and perform the partition
>>> via the following steps. Maybe we will have some luck on some platforms:
>>> o. DFS the call-graph, ignoring the self-recursive edges; if freq is
>>> available, prioritize the edges (i.e. the corresponding call-sites)
>>> such that frequent edges are visited first.
>>> o. Cut the DFS spanning tree obtained from the previous step bottom-up.
>>> Each cut/partition contains a reasonable # of functions, and the aggregate
>>> size of the functions in the partition should not exceed a predefined threshold.
>> I'd like to see more details about this step. How do you determine the "reasonable # of functions"? How do you define the threshold? It has to be the same for a given target / platform regardless of the configuration of the build machine, right?
> Say, each module should contain:
> - no more than 100 functions, and
> - functions whose total size does not exceed the pre-defined threshold.
> These thresholds can be overridden on the command line.
But how do you come about the thresholds? And are they fixed thresholds or determined based on analysis of function size, etc.?
>>> o. repeat the previous step until the Call-graph's DFS spanning tree
>>> is empty.
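The two steps above (DFS ordering plus bottom-up size-bounded cuts) can be sketched with a toy, self-contained call graph rather than LLVM's actual CallGraph; the `Node` type, the threshold values, and the greedy post-order cut are illustrative assumptions, not the proposed implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy call-graph node: name, estimated code size, callees.
// (Illustrative stand-in for llvm::CallGraphNode.)
struct Node {
  std::string Name;
  size_t Size;
  std::vector<Node*> Callees;
};

// Hypothetical thresholds; per the discussion above, these would be
// overridable from the command line.
static const size_t MaxFuncsPerPartition = 100;
static const size_t MaxSizePerPartition  = 4096;

typedef std::vector<std::vector<std::string> > PartitionList;

// Post-order DFS: callees are emitted before their caller, so cutting
// the order greedily approximates cutting the spanning tree bottom-up.
static void visit(Node *N, std::map<Node*, bool> &Seen,
                  std::vector<Node*> &PostOrder) {
  if (Seen[N]) return;           // also skips self-recursive edges
  Seen[N] = true;
  for (size_t i = 0; i < N->Callees.size(); ++i)
    visit(N->Callees[i], Seen, PostOrder);
  PostOrder.push_back(N);
}

static PartitionList partition(std::vector<Node*> &Roots) {
  std::map<Node*, bool> Seen;
  std::vector<Node*> PostOrder;
  for (size_t i = 0; i < Roots.size(); ++i)
    visit(Roots[i], Seen, PostOrder);

  PartitionList Parts;
  std::vector<std::string> Cur;
  size_t CurSize = 0;
  for (size_t i = 0; i < PostOrder.size(); ++i) {
    Node *N = PostOrder[i];
    if (!Cur.empty() &&
        (Cur.size() >= MaxFuncsPerPartition ||
         CurSize + N->Size > MaxSizePerPartition)) {
      Parts.push_back(Cur);      // close the current cut
      Cur.clear();
      CurSize = 0;
    }
    Cur.push_back(N->Name);
    CurSize += N->Size;
  }
  if (!Cur.empty())
    Parts.push_back(Cur);
  return Parts;
}
```

Because the order is post-order, leaves of the spanning tree tend to land in earlier partitions than their callers, which matches the "cut bottom-up" description.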
>>> 3.1.2 Partition transformation
>>> This is a bit involved. There are a bunch of problems we have to tackle.
>>> 1) When the use/def of a symbol are separated into different modules,
>>> its attributes, like linkage and visibility, need to be changed
>>> as well.
>>> [Example 1] If a symbol is flagged as "internal" to the module where
>>> it is defined, the linkage needs to be changed so that the symbol
>>> is instead "internal" to the executable/lib being compiled.
>>> [Example 2] For compile-time constants, their initialized value
>>> needs to be cloned to the partitions where it is referenced.
>>> The rationale is to let the post-ipo passes take advantage
>>> of the initialized value to squeeze out some performance.
>>> In order not to bloat the code size, the cloned constant should be
>>> marked "don't emit". [end of eg2]
>>> Being able to precisely update symbols' attributes is not only
>>> vital to correctness, it has a significant impact on
>>> performance as well.
>>> I have not yet taken a thorough investigation of this issue. My
>>> rudimentary implementation simply flags a symbol "external" when its
>>> use/def are separated into different modules. I believe this is one
>>> of the most difficult parts of this work. I guess it is going to
>>> take a long time to become stable.
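The attribute rewrite in Examples 1 and 2 can be modeled outside of LLVM as a small table update; the `Symbol` struct, enum names, and `DontEmit` flag below are illustrative assumptions, not LLVM's actual GlobalValue API:

```cpp
#include <cassert>
#include <string>

// Simplified model of a global symbol's attributes (not llvm::GlobalValue).
enum Linkage    { Internal, External };
enum Visibility { Default, Hidden };

struct Symbol {
  std::string Name;
  Linkage     Link;
  Visibility  Vis;
  bool        IsConstant;   // compile-time constant with a known initializer
  bool        DontEmit;     // cloned copy: usable by opts, never emitted
  int         DefPartition; // partition holding the definition
};

// Example 1: if a use ends up in a different partition than the def,
// an internal symbol must become external so the partitions can link,
// but hidden so it stays "internal to the executable/lib".
void promoteForCrossPartitionUse(Symbol &S, int UsePartition) {
  if (S.DefPartition == UsePartition || S.Link != Internal)
    return;
  S.Link = External;
  S.Vis = Hidden;
}

// Example 2: clone a compile-time constant into the using partition so
// post-ipo passes can fold its initializer; mark the clone "don't emit"
// so the code size does not grow.
Symbol cloneConstantInto(const Symbol &S, int UsePartition) {
  assert(S.IsConstant);
  Symbol Clone = S;
  Clone.DefPartition = UsePartition;
  Clone.DontEmit = true;
  return Clone;
}
```

Note that the original constant keeps its home partition and is still emitted there; only the clones carry the "don't emit" mark.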
>>> 2) In order to compile each partition in a separate thread (see
>>> Section 3.2), we have to put the partitions in distinct LLVMContexts.
>>> I could be wrong, but I cannot find code that is able to
>>> perform function cloning across LLVMContexts.
>>> My workaround in the patch is to perform the function cloning in
>>> one LLVMContext (but in a different Module, of course), then
>>> save the module to a disk file, and load it into memory using a
>>> new LLVMContext.
>>> This is a bit circuitous and expensive.
>> Do you plan to fix this? What are the issues that prevented function cloning across multiple LLVMContexts?
> We may fix it; I don't know for sure whether it is a big gain at this moment.
> The issue is that, as far as I can tell, the current code base does not support copying
> IR across different LLVMContexts.
> For example, when it copies an instruction from src to dest,
> it checks the "src", takes a look at its Type, derives the LLVMContext from the Type, and uses
> the same context for the dest. So, we need to change the existing code.
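The limitation described above can be illustrated with a toy model of context-uniqued types (this is an analogy, not LLVM's actual implementation): every type is interned in exactly one context, so a naive per-value copy that takes its Type from the source drags the source context into the destination. The save-and-reload workaround sidesteps this by re-creating everything inside the new context:

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy stand-in for LLVMContext: types are uniqued per context.
struct Context;
struct Type {
  std::string Name;
  Context    *Owner;  // a Type belongs to exactly one context
};

struct Context {
  std::map<std::string, Type*> Types;
  Type *getType(const std::string &Name) {
    Type *&T = Types[Name];
    if (!T) { T = new Type(); T->Name = Name; T->Owner = this; }
    return T;
  }
};

struct Value {
  Type *Ty;
  int   Payload;
};

// Naive clone, mirroring the code path described above: the type (and
// hence the context) is derived from the *source* value.
Value naiveClone(const Value &Src) {
  Value Dst;
  Dst.Ty = Src.Ty;        // still owned by the source context!
  Dst.Payload = Src.Payload;
  return Dst;
}

// The "save to disk and reload" workaround, modeled as re-materializing
// the value from its serialized form inside the destination context.
Value roundTripClone(const Value &Src, Context &DstCtx) {
  std::string Serialized = Src.Ty->Name; // stands in for the bitcode file
  Value Dst;
  Dst.Ty = DstCtx.getType(Serialized);   // re-created in DstCtx
  Dst.Payload = Src.Payload;
  return Dst;
}
```

In real LLVM the round trip would go through bitcode serialization rather than a type name, but the shape of the fix is the same: nothing owned by the old context may survive into the new one.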