[llvm-dev] [RFC] Order File Instrumentation

Fri Jan 18 21:37:18 PST 2019

On Fri, Jan 18, 2019 at 9:19 PM Manman Ren <manman.ren at gmail.com> wrote:

>
>
> On Fri, Jan 18, 2019 at 9:10 PM Manman Ren <manman.ren at gmail.com> wrote:
>
>>
>>
>> On Fri, Jan 18, 2019 at 4:11 PM Xinliang David Li <davidxl at google.com>
>> wrote:
>>
>>>
>>>
>>> On Fri, Jan 18, 2019 at 3:56 PM Manman Ren <manman.ren at gmail.com> wrote:
>>>
>>>> Some background information first, then a quick summary of what we have
>>>> discussed so far!
>>>>
>>>> Background: The Facebook app is one of the biggest iOS apps. Because of
>>>> this, we want the instrumentation to be as lightweight as possible in terms
>>>> of binary size, profile data size, and runtime performance. The plan to
>>>> improve Facebook app start up time is to (1) implement order file
>>>> instrumentation to be as light as possible, (2) push the order file
>>>> instrumentation to internal users first, and then to external beta users if
>>>> the overhead is low, (3) enable PGO instrumentation to collect information
>>>> to guide hot/cold splitting, and (4) push PGO instrumentation to internal
>>>> users.
>>>>
>>>> There are a few alternatives we have discussed:
>>>> (A) What is proposed in the initial email: Log (module id, function id)
>>>> into a circular buffer in its own profile section when a function is first
>>>> executed.
>>>>
>>>> (B) Re-use the existing infra of a per-function counter to record the
>>>> timestamp when a function is first executed.
>>>> Compared to option (A), the runtime overhead for option (B) should be
>>>> higher since we will be calling timestamp for each function that is
>>>> executed at startup time,
>>>>
>>>
>>> The 'timestamp' can be just a global index. Since there is one
>>> counter per func, the counter can be initialized to '-1' so that you
>>> don't need to use a bitmap to track whether the function has been invoked.
>>> In other words, the runtime overhead of B) could be lower :)
>>>
>>
>> That actually works! We only care about the ordering of the functions.
>> But the concerns about profile data size and binary size still exist :]
>>
>
> The runtime overhead should be similar, as we still need to check whether
> the counter is "-1" before saving the global index. We don't need the
> separate bitmap though. Also, the counter can be initialized to 0 and the
> global index can start from 1.
>

If we don't need the bitmap, then the two approaches are converging!

David
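
A minimal sketch of the per-function counter variant discussed above, using
the "initialize to 0, index starts from 1" convention. The names
order_counter, next_index, record_first_entry and ORDER_NUM_FUNCS are
hypothetical and are not part of any existing LLVM runtime; a real pass would
emit the equivalent code inline at function entry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define ORDER_NUM_FUNCS 4u /* hypothetical: number of instrumented functions */

/* One slot per function; 0 means "not executed yet". */
static _Atomic uint32_t order_counter[ORDER_NUM_FUNCS];
/* Global index; the first function to run is assigned 1. */
static _Atomic uint32_t next_index = 1;

/* Hypothetical hook that instrumentation would run at function entry. */
static inline void record_first_entry(uint32_t func_id) {
    assert(func_id < ORDER_NUM_FUNCS);
    /* Fast path: this function has already been recorded. */
    if (atomic_load_explicit(&order_counter[func_id],
                             memory_order_relaxed) != 0)
        return;
    /* Claim the next global index. */
    uint32_t idx =
        atomic_fetch_add_explicit(&next_index, 1, memory_order_relaxed);
    /* Store it only if the slot is still 0, so the first writer wins. */
    uint32_t expected = 0;
    atomic_compare_exchange_strong_explicit(&order_counter[func_id],
                                            &expected, idx,
                                            memory_order_relaxed,
                                            memory_order_relaxed);
}

/* Read back the recorded index (0 if the function never ran). */
static inline uint32_t first_entry_index(uint32_t func_id) {
    return atomic_load_explicit(&order_counter[func_id],
                                memory_order_relaxed);
}
```

Sorting functions by their non-zero index would give the startup order. If
two threads race on the same function, one claimed index is discarded, which
leaves a gap in the numbering but does not change the relative order.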

>
>
>>> David
>>>
>>>
>>>
>>>> and the binary and the profile data will be larger since it needs one
>>>> number for each function plus additional overhead in the per-function
>>>> metadata recorded in llvm_prf_data. The buffer size for option (A) is
>>>> controllable: it only needs to be as large as the number of functions
>>>> executed at startup.
>>>>
>>>
>> Do you have a rough estimate of how much overhead the per-function
>> metadata adds?
>>
>> Manman
>>
>>>
>>>> For the Facebook app, we expect that the number of functions executed
>>>> during startup is 1/3 to 1/2 of all functions in the binary. Profile data
>>>> size is important since we need to upload the profile data from device to
>>>> server.
>>>>
>>>> The plus side of (B) is that it reuses the existing infra!
>>>>
>>>> In terms of integration with PGO instrumentation, both (A) and (B)
>>>> should work. For (B), we need to increase the number of per function
>>>> counters by one. For (A), they will be in different sections.
>>>>
>>>> (C) XRay
>>>> We have not looked into this, but would like to hear more about it!
>>>>
>>>> (D) -finstrument-functions-after-inlining or
>>>> -finstrument-function-entry-bare
>>>> We are worried about the runtime overhead of calling a separate
>>>> function when starting up the App.
>>>>
>>>> Thanks,
>>>> Manman
>>>>
>>>> On Fri, Jan 18, 2019 at 2:01 PM Chris Bieneman <chris.bieneman at me.com>
>>>> wrote:
>>>>
>>>>> I would love to see this kind of order profiling support. Using dtrace
>>>>> to generate function orders is actually really problematic because dtrace
>>>>> made tradeoffs in implementation allowing it to ignore probe execution if
>>>>> the performance impact is too great on the system. This can result in
>>>>> dtrace being non-deterministic which is not ideal for generating
>>>>> optimization data.
>>>>>
>>>>> Additionally, if order generation could be enabled at the same time as
>>>>> PGO generation, that would be a great solution for generating profile data
>>>>> for optimizing clang itself. Clang has some scripts and build-system goop
>>>>> under utils/perf-training that can generate order files using dtrace and
>>>>> PGO data; it would be great to apply this technique to those tools too.
>>>>>
>>>>> -Chris
>>>>>
>>>>> > On Jan 18, 2019, at 2:43 AM, Hans Wennborg via llvm-dev
>>>>> > <llvm-dev at lists.llvm.org> wrote:
>>>>> >
>>>>> > On Thu, Jan 17, 2019 at 7:24 PM Manman Ren via llvm-dev
>>>>> > <llvm-dev at lists.llvm.org> wrote:
>>>>> >>
>>>>> >> An order file is used to teach ld64 how to order the functions in a
>>>>> >> binary. If we put all functions executed during startup together in
>>>>> >> the right order, we will greatly reduce the page faults during startup.
>>>>> >>
>>>>> >> To generate an order file for iOS apps, we usually use dtrace, but
>>>>> >> some apps have various startup scenarios that we want to capture in
>>>>> >> the order file. The dtrace approach is not easy to automate, and it is
>>>>> >> hard to capture the different ways of starting an app without
>>>>> >> automation. Instrumented builds, however, can be deployed to phones,
>>>>> >> and profile data can be automatically collected.
>>>>> >>
>>>>> >> For the Facebook app, by looking at the startup distribution, we
>>>>> >> are expecting a big win out of the order file instrumentation, from
>>>>> >> 100ms to 500ms+, in startup time.
>>>>> >>
>>>>> >> The basic idea of the pass is to use a circular buffer to log the
>>>>> >> execution ordering of the functions. We only log the function when it
>>>>> >> is first executed. Instead of logging the symbol name of the function,
>>>>> >> we log a pair of integers, with one integer specifying the module id,
>>>>> >> and the other specifying the function id within the module.
>>>>> >
>>>>> > [...]
>>>>> >
>>>>> >> clang has '-finstrument-function-entry-bare' which inserts a
>>>>> >> function call and is not as efficient.
>>>>> >
>>>>> > Can you elaborate on why this existing functionality is not efficient
>>>>> > enough for you?
>>>>> >
>>>>> > For Chrome on Windows, we use -finstrument-functions-after-inlining
>>>>> > to insert calls at function entry (after inlining) to a function
>>>>> > which captures the addresses in a buffer, and later symbolizes and
>>>>> > dumps them to an order file that we feed the linker. We use a similar
>>>>> > approach for Chrome on Android, but I'm not as familiar with the
>>>>> > details there.
>>>>> >
>>>>> > Thanks,
>>>>> > Hans
>>>>> > _______________________________________________
>>>>> > LLVM Developers mailing list
>>>>> > llvm-dev at lists.llvm.org
>>>>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>
>>>>>
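
For comparison, option (A) from the quoted proposal, logging a (module id,
function id) pair into a circular buffer on first execution, could look
roughly like the sketch below. The buffer size, the per-function "seen" flags
for a single module, and all names (order_buf, buf_pos, seen,
log_first_entry) are assumptions made for illustration and are not taken
from the actual patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define LOG_BUF_SIZE 8u  /* hypothetical buffer capacity */
#define LOG_NUM_FUNCS 4u /* hypothetical functions in this module */

struct order_entry {
    uint32_t module_id;
    uint32_t func_id;
};

static struct order_entry order_buf[LOG_BUF_SIZE]; /* circular buffer */
static _Atomic uint32_t buf_pos;                   /* next write position */
static _Atomic uint8_t seen[LOG_NUM_FUNCS];        /* one flag per function */

/* Hypothetical hook that instrumentation would run at function entry. */
static inline void log_first_entry(uint32_t module_id, uint32_t func_id) {
    assert(func_id < LOG_NUM_FUNCS);
    /* Fast path: the function has already been logged. */
    if (atomic_load_explicit(&seen[func_id], memory_order_relaxed))
        return;
    /* Only the caller that flips the flag from 0 to 1 logs the entry. */
    if (atomic_exchange_explicit(&seen[func_id], 1, memory_order_relaxed))
        return;
    uint32_t pos =
        atomic_fetch_add_explicit(&buf_pos, 1, memory_order_relaxed);
    /* Wrap around when more entries arrive than the buffer can hold. */
    order_buf[pos % LOG_BUF_SIZE] =
        (struct order_entry){module_id, func_id};
}
```

Here the profile payload is just the buffer, so its size is bounded by
LOG_BUF_SIZE rather than by the total number of functions in the binary.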