[llvm-dev] [RFC] Order File Instrumentation

Xinliang David Li via llvm-dev llvm-dev at lists.llvm.org
Wed Jan 23 09:35:40 PST 2019


The plan (decouple from ThinLTO and use a circular buffer + bitmap to reduce
memory usage) sounds good to me.

David

On Wed, Jan 23, 2019 at 9:29 AM Manman Ren <manman.ren at gmail.com> wrote:

>
> I chatted with David offline during the weekend. Thanks for the great
> discussions, David!
>
> The trimmed-down version of the current infra will require 2 x 8 bytes for
> each function, while the circular buffer implementation requires 4 bytes (2
> bytes for module id, 2 bytes for function id) for each startup function. For
> the Facebook app, that means the profile data will be about 8 times larger.
> Since we want to push the instrumented build to external test users, we are
> trying to minimize the amount of data uploaded from device to servers.
>
> The circular buffer implementation currently uses (module id, function
> id), which only works in ThinLTO mode. David suggested decoupling from
> ThinLTO by using the 8-byte MD5 of each function name.
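>
> A rough sketch of how that key could be derived (purely illustrative; it
> only assumes LLVM's existing MD5Hash helper):
>
>   #include <cstdint>
>   #include "llvm/ADT/StringRef.h"
>   #include "llvm/Support/MD5.h"
>
>   // Lower 64 bits of the MD5 of the mangled function name; stable across
>   // modules and link modes, so no ThinLTO module id is needed.
>   static uint64_t getOrderFileKey(llvm::StringRef MangledName) {
>     return llvm::MD5Hash(MangledName);
>   }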
>
> I plan to revise the existing patches to decouple them from ThinLTO,
> following David's suggestion. Let me know if you have questions about the
> general approach!
>
> Thanks,
> Manman
>
> On Fri, Jan 18, 2019 at 9:37 PM Xinliang David Li <davidxl at google.com>
> wrote:
>
>>
>>
>> On Fri, Jan 18, 2019 at 9:19 PM Manman Ren <manman.ren at gmail.com> wrote:
>>
>>>
>>>
>>> On Fri, Jan 18, 2019 at 9:10 PM Manman Ren <manman.ren at gmail.com> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Jan 18, 2019 at 4:11 PM Xinliang David Li <davidxl at google.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Jan 18, 2019 at 3:56 PM Manman Ren <manman.ren at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Some background information first, then a quick summary of what we
>>>>>> have discussed so far!
>>>>>>
>>>>>> Background: the Facebook app is one of the biggest iOS apps. Because of
>>>>>> this, we want the instrumentation to be as lightweight as possible in terms
>>>>>> of binary size, profile data size, and runtime performance. The plan to
>>>>>> improve Facebook app start up time is to (1) implement order file
>>>>>> instrumentation to be as light as possible, (2) push the order file
>>>>>> instrumentation to internal users first, and then to external beta users if
>>>>>> the overhead is low, (3) enable PGO instrumentation to collect information
>>>>>> to guide hot/cold splitting, and (4) push PGO instrumentation to internal
>>>>>> users.
>>>>>>
>>>>>> There are a few alternatives we have discussed:
>>>>>> (A) What is proposed in the initial email: Log (module id, function
>>>>>> id) into a circular buffer in its own profile section when a function is
>>>>>> first executed.
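>>>>>>
>>>>>> Roughly, the per-function logging in (A) amounts to something like the
>>>>>> sketch below (names and sizes are illustrative only, and thread-safety
>>>>>> is elided):
>>>>>>
>>>>>>   #include <cstdint>
>>>>>>
>>>>>>   // Illustrative sizes; the real pass would size these for the build.
>>>>>>   constexpr uint32_t kOrderBufferEntries   = 1 << 17;
>>>>>>   constexpr uint32_t kNumInstrumentedFuncs = 1 << 20;
>>>>>>
>>>>>>   struct OrderEntry { uint16_t ModuleId, FuncId; }; // 4 bytes per entry
>>>>>>   static OrderEntry OrderBuffer[kOrderBufferEntries]; // own profile section
>>>>>>   static uint32_t   OrderCursor = 0;                  // next free slot
>>>>>>   static uint8_t    SeenBitmap[kNumInstrumentedFuncs / 8]; // 1 bit/function
>>>>>>
>>>>>>   static inline void logFirstExecution(uint16_t M, uint16_t F,
>>>>>>                                        uint32_t BitIdx) {
>>>>>>     if (SeenBitmap[BitIdx >> 3] & (1u << (BitIdx & 7)))
>>>>>>       return;                                          // already logged
>>>>>>     SeenBitmap[BitIdx >> 3] |= 1u << (BitIdx & 7);
>>>>>>     OrderBuffer[OrderCursor++ % kOrderBufferEntries] = OrderEntry{M, F};
>>>>>>   }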
>>>>>>
>>>>>> (B) Re-use existing infra of a per function counter to record the
>>>>>> timestamp when a function is first executed
>>>>>> Compared to option (A), the runtime overhead for option (B) should be
>>>>>> higher, since we will be calling a timestamp routine for each function
>>>>>> that is executed at startup time,
>>>>>>
>>>>>
>>>>> The 'timestamp' can be just a global index. Since there is one
>>>>> counter per function, the counter can be initialized to '-1' so that you
>>>>> don't need a bitmap to track whether the function has been invoked or not.
>>>>> In other words, the runtime overhead of (B) could be lower :)
>>>>>
>>>>
>>>> That actually works! We only care about the ordering of the functions.
>>>> But the concerns about profile data size and binary size still exist :]
>>>>
>>>
>>> The runtime should be similar, as we still need to check whether the
>>> counter is "-1" before saving the global index. We don't need the separate
>>> bitmap, though. Also, the counter can be initialized to 0 and the global
>>> index can start from 1.
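>>>
>>> As a minimal sketch of the converged idea (illustrative names, a single
>>> function shown, atomicity simplified):
>>>
>>>   #include <atomic>
>>>   #include <cstdint>
>>>
>>>   static std::atomic<uint32_t> OrderGlobalIndex{0}; // first id handed out is 1
>>>   static uint32_t OrderCounterForFoo = 0;           // one counter per function
>>>
>>>   void foo() {
>>>     if (OrderCounterForFoo == 0)                    // 0 means "not seen yet"
>>>       OrderCounterForFoo = ++OrderGlobalIndex;      // record execution order
>>>     // ... original body of foo ...
>>>   }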
>>>
>>
>> If we don't need bitmap, then the two approaches are converging !
>>
>> David
>>
>>>
>>>
>>>>> David
>>>>>
>>>>>
>>>>>
>>>>>> and the binary and the profile data will be larger, since it needs one
>>>>>> number for each function plus additional overhead in the per-function
>>>>>> metadata recorded in llvm_prf_data. The buffer size for option (A) is
>>>>>> controllable; it only needs to hold the functions executed at startup.
>>>>>>
>>>>>
>>>> Do you have a rough estimate of how much overhead the per-function
>>>> metadata adds?
>>>>
>>>> Manman
>>>>
>>>>>
>>>>>> For the Facebook app, we expect that the number of functions executed
>>>>>> during startup is 1/3 to 1/2 of all functions in the binary. Profile data
>>>>>> size is important since we need to upload the profile data from device to
>>>>>> server.
>>>>>>
>>>>>> The plus side is to reuse the existing infra!
>>>>>>
>>>>>> In terms of integration with PGO instrumentation, both (A) and (B)
>>>>>> should work. For (B), we need to increase the number of per-function
>>>>>> counters by one. For (A), the order data and the PGO counters will be in
>>>>>> different sections.
>>>>>>
>>>>>> (C) XRay
>>>>>> We have not looked into this, but would like to hear more about it!
>>>>>>
>>>>>> (D) -finstrument-functions-after-inlining or
>>>>>> -finstrument-function-entry-bare
>>>>>> We are worried about the runtime overhead of calling a separate
>>>>>> function when starting up the App.
>>>>>>
>>>>>> Thanks,
>>>>>> Manman
>>>>>>
>>>>>> On Fri, Jan 18, 2019 at 2:01 PM Chris Bieneman <chris.bieneman at me.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I would love to see this kind of order profiling support. Using
>>>>>>> dtrace to generate function orders is actually really problematic because
>>>>>>> dtrace's implementation made tradeoffs that allow it to drop probe
>>>>>>> executions if the performance impact on the system is too great. This can
>>>>>>> make dtrace non-deterministic, which is not ideal for generating
>>>>>>> optimization data.
>>>>>>>
>>>>>>> Additionally, if order generation could be enabled at the same time
>>>>>>> as PGO generation, that would be a great solution for generating profile
>>>>>>> data for optimizing clang itself. Clang has some scripts and build-system
>>>>>>> goop under utils/perf-training that can generate order files using dtrace
>>>>>>> and PGO data; it would be great to apply this technique to those tools too.
>>>>>>>
>>>>>>> -Chris
>>>>>>>
>>>>>>> > On Jan 18, 2019, at 2:43 AM, Hans Wennborg via llvm-dev <
>>>>>>> > llvm-dev at lists.llvm.org> wrote:
>>>>>>> >
>>>>>>> > On Thu, Jan 17, 2019 at 7:24 PM Manman Ren via llvm-dev
>>>>>>> > <llvm-dev at lists.llvm.org> wrote:
>>>>>>> >>
>>>>>>> >> An order file is used to teach ld64 how to lay out the functions in
>>>>>>> >> a binary. If we put all the functions executed during startup together,
>>>>>>> >> in the right order, we greatly reduce page faults during startup.
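>>>>>>> >>
>>>>>>> >> For example (symbol names made up), an order file is just a text
>>>>>>> >> file listing symbols in the desired layout order, passed to ld64
>>>>>>> >> via -order_file:
>>>>>>> >>
>>>>>>> >>   _main
>>>>>>> >>   __ZN3App23didFinishLaunchingEv
>>>>>>> >>   __ZN9HomeFeed4loadEv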
>>>>>>> >>
>>>>>>> >> To generate an order file for iOS apps, we usually use dtrace, but
>>>>>>> >> some apps have various startup scenarios that we want to capture in
>>>>>>> >> the order file. The dtrace approach is not easy to automate, and it is
>>>>>>> >> hard to capture the different ways of starting an app without
>>>>>>> >> automation. Instrumented builds, however, can be deployed to phones,
>>>>>>> >> and profile data can be collected automatically.
>>>>>>> >>
>>>>>>> >> For the Facebook app, based on the startup-time distribution, we
>>>>>>> >> expect a big win from the order file instrumentation: 100ms to 500ms+
>>>>>>> >> off the startup time.
>>>>>>> >>
>>>>>>> >> The basic idea of the pass is to use a circular buffer to log the
>>>>>>> >> execution ordering of the functions. We log a function only when it is
>>>>>>> >> first executed. Instead of logging the symbol name of the function, we
>>>>>>> >> log a pair of integers, with one integer specifying the module id and
>>>>>>> >> the other specifying the function id within the module.
>>>>>>> >
>>>>>>> > [...]
>>>>>>> >
>>>>>>> >> clang has '-finstrument-function-entry-bare', which inserts a
>>>>>>> >> function call and is not as efficient.
>>>>>>> >
>>>>>>> > Can you elaborate on why this existing functionality is not
>>>>>>> > efficient enough for you?
>>>>>>> >
>>>>>>> > For Chrome on Windows, we use -finstrument-functions-after-inlining
>>>>>>> > to insert calls at function entry (after inlining) that call a
>>>>>>> > function which captures the addresses in a buffer, and later
>>>>>>> > symbolizes and dumps them to an order file that we feed the linker.
>>>>>>> > We use a similar approach for Chrome on Android, but I'm not as
>>>>>>> > familiar with the details there.
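>>>>>>> >
>>>>>>> > (A heavily simplified sketch of the hook side; the buffer handling
>>>>>>> > here is illustrative, not Chrome's actual code.)
>>>>>>> >
>>>>>>> >   #include <atomic>
>>>>>>> >   #include <cstddef>
>>>>>>> >
>>>>>>> >   static constexpr size_t kMaxEntries = 1 << 20;  // illustrative cap
>>>>>>> >   static void* g_entries[kMaxEntries];
>>>>>>> >   static std::atomic<size_t> g_cursor{0};
>>>>>>> >
>>>>>>> >   // Called at every (post-inlining) function entry; must not be
>>>>>>> >   // instrumented itself.
>>>>>>> >   extern "C" __attribute__((no_instrument_function))
>>>>>>> >   void __cyg_profile_func_enter(void* this_fn, void* /*call_site*/) {
>>>>>>> >     size_t i = g_cursor.fetch_add(1, std::memory_order_relaxed);
>>>>>>> >     if (i < kMaxEntries)
>>>>>>> >       g_entries[i] = this_fn;  // symbolized and deduplicated offline
>>>>>>> >   }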
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Hans