[llvm-dev] [RFC] Order File Instrumentation

Thu Jan 17 21:24:44 PST 2019

On Thu, Jan 17, 2019 at 8:19 PM Manman Ren <manman.ren at gmail.com> wrote:

>
>
> On Thu, Jan 17, 2019 at 3:58 PM Xinliang David Li <davidxl at google.com>
> wrote:
>
>>
>>
>> On Thu, Jan 17, 2019 at 3:25 PM Manman Ren <manman.ren at gmail.com> wrote:
>>
>>>
>>>
>>> On Thu, Jan 17, 2019 at 2:47 PM Xinliang David Li <davidxl at google.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Jan 17, 2019 at 2:32 PM Manman Ren <manman.ren at gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, Jan 17, 2019 at 10:53 AM Xinliang David Li <davidxl at google.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Manman,
>>>>>>
>>>>>> Ordering profiling is certainly something very useful to have to
>>>>>> startup time performance. GCC has something similar.
>>>>>>
>>>>>> In terms of implementation, it is possible to simply extend the edge
>>>>>> profiling counters by 1 for each function, and instrument the function to
>>>>>> record the time stamp the first time the function is executed. The overhead
>>>>>> will be minimized and you can leverage all the other existing support in
>>>>>> profiling runtime.
>>>>>>
>>>>>
>>>>> Hi David,
>>>>>
>>>>> Just to clarify, are you suggesting to add an edge profiling counter
>>>>> per function to record the time stamp? Where are the edge profiling
>>>>> counters defined?
>>>>>
>>>>
>>>> There is no needed to define the counter explicitly. What is needed is
>>>> to introduce a new intrinsic to update the order counter, and the InstProf
>>>> lowerer will create the counter for you. See
>>>> Transforms/Instrumentation/InstrProfiling.cpp.   +Vedant Kumar
>>>> <vsk at apple.com>
>>>>
>>>
>>> We need the instrumentation to be run late as an IR pass, so it has all
>>> the function symbols. Are you suggesting a front-end instrumentation by
>>> mentioning InstrProfiling.cpp?
>>>
>>>
>>
>> The lowering is actually shared by both FE instrumentation and IR PGO.
>>  For ordering profiling, it needs to run the pass after the inliner pass in
>> theory (which is later even than the current IR PGO instrumentation pass).
>> One way to achieve the same effect is lower the time stamp counter update
>> intrinsic in a separate pass (after inlining).  In the lowerer, the counter
>> update intrinsic can be eliminated if it is in an inline instance, but this
>> is not ideal either as the inlining decisions may be affected due to
>> instrumentation.
>>
>> To combine the ordering profiling requirements (post inlining) and the
>> benefit of sharing (runtime, tooling etc),  the proposed implementation can
>> be modified with the following changes:
>> 0) keep the phase ordering in the patch
>> 1) the instrumentation just does insert intrinsics
>> 2) enhance lowering to handle new intrinsics.
>>
>
>> I assume ordering profiling is not intended to work together with PGO
>> instrumentation (to make sure inlining decisions are the same), so the you
>> end up with a counter section with one counter per out of line function.
>> The rest will just work. The changes to runtime will be minimal.
>>
>
> Got it, this should work! The overhead will be one counter for each
> function, and we also need to save the function names in the instrumented
> binary, right?
>

There is no need to save function names -- it is handled by existing
lowerer and runtime. That is why the size overhead is larger than one
counter per function.  The static function names are also qualified with
the enclosing module name.

> Other than binary size, the other concern is the size of the profile data,
> as we need to download it from device. The profile data will be one counter
> for each function plus all the function names?
>

There is per-function meta data recorded in llvm_prf_data section. Function
names are compressed and stored in the names section.  We don't have it
now, but it is possible to add an option to not load the name section into
memory at all, but the llvm_profdata tool needs to be taught to load the
name section from the executable (as the profile data file won't have the
name section).

If PGO can be used on the device, then the size overhead should not be a
concern here -- otherwise we have an usability issue with PGO here.

It is also reasonable to assume that instrumented builds are not intended
to be run on low-end devices  with small memory (which is unlike release
builds).

>
> In the proposed implementation, the profile data will be the size of the
> circular buffer, which should be the maximal number of functions that can
> be executed during startup.
>
>
>> If  PGO (IR or FE) + Ordering profiling is a desired use case, it can be
>> made working too -- the lowerer needs to update the per function counter
>> array size (appending one).
>>
>>
> There may be a use case for IR Instr + ordering profiling. One important
> usage of IR Instr is to guide hot/cold splitting for startup time.
> It is always simpler if we only need to maintain one instrumentation
> build! But that depends on how much overhead IR Instr adds in terms of
> binary size, profile data size, and runtime perf.
>

This use case makes it more attractive to unify the new feature with
existing framework.

David

>
> Manman
>
>
>> David
>>
>>
>>
>>>
>>>>
>>>>> So the difference will be where we store the profile information and
>>>>> in what format?
>>>>>
>>>>>
>>>> Time stamp can be simply a uint64 value just like any edge/block
>>>> profile count.   llvm_profdata tool can be taught to dump the ordering file
>>>> to be exported to the linker.
>>>>
>>>>
>>>>> With the suggested approach, we need to allocate one time stamp for
>>>>> each function, what is implemented is a pair of numbers for each executed
>>>>> function. The runtime performance can be different as well, the suggested
>>>>> approach gets the time stamp, and saves it to memory, what is implemented
>>>>> is saving the pair of numbers and incrementing a counter.
>>>>>
>>>>>
>>>> The runtime cost is probably not much for either approach. The
>>>> suggested approach eliminates the trouble to maintain/implement a different
>>>> way to identify functions.
>>>>
>>>
>>> Agree that using existing infra is better!
>>>
>>>
>>
>>
>>
>>
>>> Manman
>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>> Another possibility is to use xray to implement the functionality --
>>>>>> xray is useful for trace like profiling by design.
>>>>>>
>>>>>
>>>>> We have not looked into XRay. We need something with low binary size
>>>>> penalty and low runtime perf degradation, not sure if XRay is a good fit!
>>>>>
>>>>>
>>>> xray can achieve very low runtime overhead. +Dean Michael Berris
>>>> <dberris at google.com> for additional comments.
>>>>
>>>> David
>>>>
>>>>
>>>>
>>>>
>>>>> Thanks,
>>>>> Manman
>>>>>
>>>>>
>>>>>> David
>>>>>>
>>>>>> On Thu, Jan 17, 2019 at 10:24 AM Manman Ren <manman.ren at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Order file is used to teach ld64 how to order the functions in a
>>>>>>> binary. If we put all functions executed during startup together in the
>>>>>>> right order, we will greatly reduce the page faults during startup.
>>>>>>>
>>>>>>> To generate order file for iOS apps, we usually use dtrace, but some
>>>>>>> apps have various startup scenarios that we want to capture in the order
>>>>>>> file. dtrace approach is not easy to automate, it is hard to capture the
>>>>>>> different ways of starting an app without automation. Instrumented builds
>>>>>>> however can be deployed to phones and profile data can be automatically
>>>>>>> collected.
>>>>>>>
>>>>>>> For the Facebook app, by looking at the startup distribution, we are
>>>>>>> expecting a big win out of the order file instrumentation, from 100ms to
>>>>>>> 500ms+, in startup time.
>>>>>>>
>>>>>>> The basic idea of the pass is to use a circular buffer to log the
>>>>>>> execution ordering of the functions. We only log the function when it is
>>>>>>> first executed. Instead of logging the symbol name of the function, we log
>>>>>>> a pair of integers, with one integer specifying the module id, and the
>>>>>>> other specifying the function id within the module.
>>>>>>>
>>>>>>> In this pass, we add three global variables:
>>>>>>> (1) an order file buffer
>>>>>>> The order file buffer is a circular buffer at its own llvm section.
>>>>>>> Each entry is a pair of integers, with one integer specifying the module
>>>>>>> id, and the other specifying the function id within the module.
>>>>>>> (2) a bitmap for each module: one bit for each function to say if
>>>>>>> the function is already executed;
>>>>>>> (3) a global index to the buffer
>>>>>>>
>>>>>>> At the function prologue, if the function has not been executed (by
>>>>>>> checking the bitmap), log the module id and the function id, then
>>>>>>> atomically increase the index.
>>>>>>>
>>>>>>> This pass is intended to be used as a ThinLTO pass or a LTO pass. It
>>>>>>> maps each module to a distinct integer, it also generate a mapping file so
>>>>>>> we can decode the function symbol name from the pair of ids.
>>>>>>>
>>>>>>> clang has '-finstrument-function-entry-bare' which inserts a
>>>>>>> function call and is not as efficient.
>>>>>>>
>>>>>>> Three patches are attached, for llvm, clang, and compiler-rt
>>>>>>> respectively.
>>>>>>>
>>>>>>> TODO:
>>>>>>> (1) Migrate to the new pass manager with a shim for the legacy pass
>>>>>>> manager.
>>>>>>> (2) For the order file buffer, consider always emitting definitions,
>>>>>>> making them LinkOnceODR with a COMDAT group.
>>>>>>> (3) Add testing case for clang/compiler-rt patches.
>>>>>>> (4) Add utilities to deobfuscate the profile dump.
>>>>>>> (5) The size of the buffer is currently hard-coded (
>>>>>>> INSTR_ORDER_FILE_BUFFER_SIZE).
>>>>>>>
>>>>>>> Thanks Kamal for contributing to the patches! Thanks to Aditya and
>>>>>>> Saleem for doing an initial review pass over the patches!
>>>>>>>
>>>>>>> Manman
>>>>>>>
>>>>>>>
>>>>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190117/15cae2ca/attachment-0001.html>