[cfe-dev] [llvm-dev] Your help needed: List of LLVM Open Projects 2017

Xinliang David Li via cfe-dev cfe-dev at lists.llvm.org
Mon Jan 16 20:25:01 PST 2017


Xray data are collected post-inlining, but the problem is that inlining
decisions are different between profile-gen and profile-use compilations.
This makes it hard to match call trace data from prof-gen run with
post-inline callgraph of the profile-use build.

David

On Mon, Jan 16, 2017 at 7:41 PM, Sean Silva <chisophugis at gmail.com> wrote:

> Would it make sense for xray instrumentation be part of
> -fprofile-generate? PGO will affect inlining decisions etc for the
> optimized binary, but the collected traces during the instrumented build
> would still have quite a bit of useful information.
>
> -- Sean Silva
>
>
> On Mon, Jan 16, 2017 at 4:33 PM, Xinliang David Li <xinliangli at gmail.com>
> wrote:
>
>> Google GCC records profile data (dynamic callgraph) in a special named
>> section in ELF object file to be consumed by the plugin. Those sections
>> will be discarded later by the linker.
>>
>> There are pros and cons of using xray for layout purpose.  The call trace
>> from xray is certainly more powerful for layout purpose, but it adds
>> addtional complexity to the optimized build process.  You would need to
>> collect xray trace profile on the optimized binary (presumably built with
>> PGO already) and rebuild without xray nop insertion and function layout.
>>
>> David
>>
>> On Mon, Jan 16, 2017 at 3:40 PM, Sean Silva via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>>
>>>
>>> On Mon, Jan 16, 2017 at 3:34 PM, Sean Silva <chisophugis at gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Jan 16, 2017 at 3:32 PM, Sean Silva <chisophugis at gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Mon, Jan 16, 2017 at 2:31 PM, Davide Italiano <davide at freebsd.org>
>>>>> wrote:
>>>>>
>>>>>> On Mon, Jan 16, 2017 at 2:07 PM, Mehdi Amini <mehdi.amini at apple.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > On Jan 16, 2017, at 1:47 PM, Sean Silva <chisophugis at gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Mon, Jan 16, 2017 at 1:25 PM, Davide Italiano <
>>>>>> davide at freebsd.org> wrote:
>>>>>> >>
>>>>>> >> On Mon, Jan 16, 2017 at 12:31 PM, Sean Silva via llvm-dev
>>>>>> >> <llvm-dev at lists.llvm.org> wrote:
>>>>>> >> > Do we have any open projects on LLD?
>>>>>> >> >
>>>>>> >> > I know we usually try to avoid any big "projects" and mainly
>>>>>> add/fix
>>>>>> >> > things
>>>>>> >> > in response to user needs, but just wondering if somebody has
>>>>>> any ideas.
>>>>>> >> >
>>>>>> >>
>>>>>> >> I'm not particularly active in lld anymore, but the last big item
>>>>>> I'd
>>>>>> >> like to see implemented is Pettis-Hansen layout.
>>>>>> >> http://perso.ensta-paristech.fr/~bmonsuez/Cours/B6-4/Article
>>>>>> s/papers15.pdf
>>>>>> >> (mainly because it improves performances of the final executable).
>>>>>> >> GCC/gold have an implementation of the algorithm that can be used
>>>>>> as
>>>>>> >> base. I'll expand if anybody is interested.
>>>>>> >> Side note: I'd like to propose a couple of llvm projects as well,
>>>>>> I'll
>>>>>> >> sit down later today and write them.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > I’m not sure, can you confirm that such layout optimization on ELF
>>>>>> requires
>>>>>> > -ffunction-sections?
>>>>>> >
>>>>>>
>>>>>> For the non-LTO case, I think so.
>>>>>>
>>>>>> > Also, for clang on OSX the best layout we could get is to order
>>>>>> functions in
>>>>>> > the order in which they get executed at runtime.
>>>>>> >
>>>>>>
>>>>>> That's what we already do for lld. We collect and order file (run a
>>>>>> profiler) and pass that to the linker that lays out functions
>>>>>> accordingly.
>>>>>> This is to improve startup time for a class of startup-time-sensitive
>>>>>> operations. The algorithm proposed by Pettis (allegedly) aims to
>>>>>> reduce the TLB misses as it tries to lay out hot functions (or
>>>>>> functions that are likely to  be called together near in the final
>>>>>> binary).
>>>>>>
>>>>>
>>>>> IIRC from when I looked at the paper a while ago, it is mostly just a
>>>>> "huffman tree construction" type algorithm (agglomerating based on highest
>>>>> probability) and assumes that if two functions are hot then they are likely
>>>>> to be needed together. This is not always the case.
>>>>>
>>>>> E.g. consider a server that accepts RPC requests and based on those
>>>>> requests either does Foo or Bar which are largely disjoint. It's entirely
>>>>> possible for the top two functions of the profile to be one in Foo and one
>>>>> in Bar, but laying them out near each other doesn't make sense since there
>>>>> is never locality (for a given RPC, either Foo or Bar gets run). A static
>>>>> call graph analysis can provide the needed signals to handle this case
>>>>> better.
>>>>>
>>>>>
>>>> Hence you said "allegedly" :) I know we've talked about this before.
>>>> Just wanted to put the backstory of the "allegedly" on the list.
>>>>
>>>
>>> Looks like I remembered this wrong. The algorithm in section 3.2 of the
>>> paper is call-graph aware. It does do greedy coalescing like a Huffman tree
>>> construction algorithms, but constrains the available coalescing operations
>>> at each step by call graph adjacency (in fact, what it is "greedy" about is
>>> the hotness of the edges between call graph nodes and not the nodes
>>> themselves).
>>>
>>> -- Sean Silva
>>>
>>>
>>>>
>>>> -- Sean Silva
>>>>
>>>>
>>>>> -- Sean Silva
>>>>>
>>>>>
>>>>>>
>>>>>> >
>>>>>> > For FullLTO it is conceptually pretty easy to get profile data we
>>>>>> need for
>>>>>> > this, but I'm not sure about the ThinLTO case.
>>>>>> >
>>>>>> > Teresa, Mehdi,
>>>>>> >
>>>>>> > Are there any plans (or things already working!) for getting
>>>>>> profile data
>>>>>> > from ThinLTO in a format that the linker can use for code layout? I
>>>>>> assume
>>>>>> > that profile data is being used already to guide importing, so it
>>>>>> may just
>>>>>> > be a matter of siphoning that off.
>>>>>> >
>>>>>> >
>>>>>> > I’m not sure what kind of “profile information” is needed, and what
>>>>>> makes it
>>>>>> > easier for MonolithicLTO compared to ThinLTO?
>>>>>> >
>>>>>> > Or maybe that layout code should be inside LLVM; maybe part of the
>>>>>> general
>>>>>> > LTO interface? It looks like the current gcc plugin calls back into
>>>>>> gcc for
>>>>>> > the actual layout algorithm itself (function call
>>>>>> > find_pettis_hansen_function_layout) rather than the reordering
>>>>>> logic living
>>>>>> > in the linker:
>>>>>> > https://android.googlesource.com/toolchain/gcc/+/3f73d6ef904
>>>>>> 58b45bbbb33ef4c2b174d4662a22d/gcc-4.6/function_reordering_pl
>>>>>> ugin/function_reordering_plugin.c
>>>>>> >
>>>>>> >
>>>>>> > I was thinking about this: could this be done by reorganizing the
>>>>>> module
>>>>>> > itself for LTO?
>>>>>> >
>>>>>> > That wouldn’t help non-LTO and ThinLTO though.
>>>>>>
>>>>>> This is a dimension that I think can be explored. The fact that it
>>>>>> wouldn't help with other modes of operation is completely orthogonal,
>>>>>> in particular until it's proven that this kind of optimization makes
>>>>>> sense with ThinLTO (and if it doesn't, it can be an optimization ran
>>>>>> only during full LTO).
>>>>>>
>>>>>> --
>>>>>> Davide
>>>>>>
>>>>>> "There are no solved problems; there are only problems that are more
>>>>>> or less solved" -- Henri Poincare
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20170116/efeb5388/attachment.html>


More information about the cfe-dev mailing list