[llvm-dev] [RFC] Order File Instrumentation

Manman Ren via llvm-dev llvm-dev at lists.llvm.org
Fri Jan 18 21:19:12 PST 2019

On Fri, Jan 18, 2019 at 9:10 PM Manman Ren <manman.ren at gmail.com> wrote:

> On Fri, Jan 18, 2019 at 4:11 PM Xinliang David Li <davidxl at google.com>
> wrote:
>> On Fri, Jan 18, 2019 at 3:56 PM Manman Ren <manman.ren at gmail.com> wrote:
>>> Some background information first, then a quick summary of what we have
>>> discussed so far!
>>> Background: The Facebook app is one of the biggest iOS apps. Because of
>>> this, we want the instrumentation to be as lightweight as possible in terms
>>> of binary size, profile data size, and runtime performance. The plan to
>>> improve Facebook app start up time is to (1) implement order file
>>> instrumentation to be as light as possible, (2) push the order file
>>> instrumentation to internal users first, and then to external beta users if
>>> the overhead is low, (3) enable PGO instrumentation to collect information
>>> to guide hot/cold splitting, and (4) push PGO instrumentation to internal
>>> users.
>>> There are a few alternatives we have discussed:
>>> (A) What is proposed in the initial email: Log (module id, function id)
>>> into a circular buffer in its own profile section when a function is first
>>> executed.
>>> (B) Re-use existing infra of a per function counter to record the
>>> timestamp when a function is first executed
>>> Compared to option (A), the runtime overhead for option (B) should be
>>> higher since we will be calling timestamp for each function that is
>>> executed at startup time,
>> The 'timestamp' can be just a global index. Since there is one
>> counter per func, the counter can be initialized to be '-1' so that you
>> don't need to use bitmap to track if the function has been invoked or not.
>> In other words, the runtime overhead of B) could be lower :)
> That actually works! We only care about the ordering of the functions. But
> the concerns about profile data size and binary size still exist :]

The runtime overhead should be similar, since we still need to check whether
the counter is "-1" before saving the global index. We don't need the
separate bitmap, though. Also, the counter can be initialized to 0 and the
global index can start from 1.
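
A minimal sketch of that variant of option (B), in C, with the counter
initialized to 0 and the global index starting from 1. All names here
(NUM_FUNCS, first_exec_index, next_index, record_entry) are made up for
illustration and are not the existing profile runtime. The sketch is
single-threaded; a real version would presumably need atomics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the compiler-rt profile runtime. */
#define NUM_FUNCS 4u

/* One slot per function. Static storage is zero-initialized,
   and 0 means "not executed yet". */
static uint32_t first_exec_index[NUM_FUNCS];

/* Next index to hand out; starts at 1 so that 0 stays reserved. */
static uint32_t next_index = 1;

/* What the instrumentation would do at each function entry. */
static void record_entry(uint32_t func_id) {
    assert(func_id < NUM_FUNCS);
    if (first_exec_index[func_id] == 0)            /* first execution only */
        first_exec_index[func_id] = next_index++;  /* save the global index */
}
```

Sorting the functions by their non-zero slot value would then give the
startup order; functions whose slot is still 0 never ran.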

>> David
>>> and the binary and the profile data will be larger since it needs one
>>> number for each function plus additional overhead in the per-function
>>> metadata recorded in llvm_prf_data. The buffer size for option (A) is
>>> controllable: it needs to be the number of functions executed at startup.
> Do you have a rough estimate of how much overhead the per-function
> metadata adds?
> Manman
>>> For the Facebook app, we expect that the number of functions executed
>>> during startup is 1/3 to 1/2 of all functions in the binary. Profile data
>>> size is important since we need to upload the profile data from device to
>>> server.
>>> The plus side of (B) is that it reuses the existing infra!
>>> In terms of integration with PGO instrumentation, both (A) and (B)
>>> should work. For (B), we need to increase the number of per function
>>> counters by one. For (A), they will be in different sections.
>>> (C) XRay
>>> We have not looked into this, but would like to hear more about it!
>>> (D) -finstrument-functions-after-inlining or
>>> -finstrument-function-entry-bare
>>> We are worried about the runtime overhead of calling a separate function
>>> when starting up the App.
>>> Thanks,
>>> Manman
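
For comparison, option (A) could be sketched roughly as follows in C:
a fixed-size circular buffer of (module id, function id) pairs, plus a
per-function flag so that only the first execution is logged. The names
(BUF_SIZE, THIS_MODULE_ID, order_buf, executed, log_first_exec) and the
sizes are assumptions for illustration, a byte array stands in for the
bitmap, and the sketch ignores threading.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names and sizes for illustration only. */
#define BUF_SIZE  8u   /* power of two, so wrap-around is a simple mask */
#define NUM_FUNCS 4u   /* functions in this module */

static const uint32_t THIS_MODULE_ID = 7;

typedef struct {
    uint32_t module_id;
    uint32_t func_id;
} order_entry;

static order_entry order_buf[BUF_SIZE];  /* the circular buffer */
static uint32_t buf_count;               /* total entries logged so far */
static uint8_t executed[NUM_FUNCS];      /* 1 once a function has been logged */

/* What the instrumentation would do at each function entry. */
static void log_first_exec(uint32_t func_id) {
    assert(func_id < NUM_FUNCS);
    if (executed[func_id])
        return;                          /* already logged, nothing to do */
    executed[func_id] = 1;
    order_entry *e = &order_buf[buf_count++ & (BUF_SIZE - 1)];
    e->module_id = THIS_MODULE_ID;
    e->func_id = func_id;
}
```

Reading order_buf from slot 0 up to buf_count gives the first-execution
order, as long as the buffer has not wrapped around.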
>>> On Fri, Jan 18, 2019 at 2:01 PM Chris Bieneman <chris.bieneman at me.com>
>>> wrote:
>>>> I would love to see this kind of order profiling support. Using dtrace
>>>> to generate function orders is actually really problematic because dtrace
>>>> made tradeoffs in implementation allowing it to ignore probe execution if
>>>> the performance impact is too great on the system. This can result in
>>>> dtrace being non-deterministic which is not ideal for generating
>>>> optimization data.
>>>> Additionally if order generation could be enabled at the same time as
>>>> PGO generation that would be a great solution for generating profile data
>>>> for optimizing clang itself. Clang has some scripts and build-system goop
>>>> under utils/perf-training that can generate order files using dtrace and
>>>> PGO data, it would be great to apply this technique to those tools too.
>>>> -Chris
>>>> > On Jan 18, 2019, at 2:43 AM, Hans Wennborg via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>> >
>>>> > On Thu, Jan 17, 2019 at 7:24 PM Manman Ren via llvm-dev
>>>> > <llvm-dev at lists.llvm.org> wrote:
>>>> >>
>>>> >> Order file is used to teach ld64 how to order the functions in a
>>>> binary. If we put all functions executed during startup together in the
>>>> right order, we will greatly reduce the page faults during startup.
>>>> >>
>>>> >> To generate order file for iOS apps, we usually use dtrace, but some
>>>> apps have various startup scenarios that we want to capture in the order
>>>> file. The dtrace approach is not easy to automate, and it is hard to capture the
>>>> different ways of starting an app without automation. Instrumented builds
>>>> however can be deployed to phones and profile data can be automatically
>>>> collected.
>>>> >>
>>>> >> For the Facebook app, by looking at the startup distribution, we are
>>>> expecting a big win out of the order file instrumentation, from 100ms to
>>>> 500ms+, in startup time.
>>>> >>
>>>> >> The basic idea of the pass is to use a circular buffer to log the
>>>> execution ordering of the functions. We only log the function when it is
>>>> first executed. Instead of logging the symbol name of the function, we log
>>>> a pair of integers, with one integer specifying the module id, and the
>>>> other specifying the function id within the module.
>>>> >
>>>> > [...]
>>>> >
>>>> >> clang has '-finstrument-function-entry-bare' which inserts a
>>>> function call and is not as efficient.
>>>> >
>>>> > Can you elaborate on why this existing functionality is not efficient
>>>> > enough for you?
>>>> >
>>>> > For Chrome on Windows, we use -finstrument-functions-after-inlining to
>>>> > insert calls at function entry (after inlining) that calls a function
>>>> > which captures the addresses in a buffer, and later symbolizes and
>>>> > dumps them to an order file that we feed the linker. We use a similar
>>>> > approach for Chrome on Android, but I'm not as familiar with the
>>>> > details there.
>>>> >
>>>> > Thanks,
>>>> > Hans