[llvm-dev] Add support for in-process profile merging in profile-runtime

Mon Feb 29 12:02:51 PST 2016

On Sun, Feb 28, 2016 at 12:13 AM, Xinliang David Li <davidxl at google.com>
wrote:

>
> On Sat, Feb 27, 2016 at 6:50 PM, Sean Silva <chisophugis at gmail.com> wrote:
>
>> I have thought about this issue too, in the context of games. We may want
>> to turn profiling on only for certain frames (essentially, this is many
>> small profile runs).
>>
>> However, I have not seen it demonstrated that this kind of refined data
>> collection will actually improve PGO results in practice.
>> The evidence I do have though is that IIRC Apple have found that almost
>> all of the benefits of PGO for the Clang binary can be gotten with a
>> handful of training runs of Clang. Are your findings different?
>>
>
> We have a very wide customer base, so we cannot claim one usage model is
> sufficient for all users. For instance, we have users who control profile
> dumping at a fine granularity (programmatically), as you described above.
> There are also other possible use cases, such as dumping the profiles for
> different periodic phases into files associated with those phases. Later,
> each phase's profile data can be merged with a different weight.
>
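A minimal sketch of what merging per-phase profiles with different weights amounts to, assuming each phase's profile is just a flat array of counters. The function name `merge_weighted` and the array layout are hypothetical and only for illustration; this is not the llvm-profdata implementation, and the sketch ignores counter overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add `weight` copies of one phase's counters into
 * the merged profile. Overflow is not handled in this sketch. */
void merge_weighted(uint64_t *merged, const uint64_t *phase, size_t n,
                    uint64_t weight) {
  assert(merged != NULL && phase != NULL);
  for (size_t i = 0; i < n; ++i)
    merged[i] += phase[i] * weight;
}
```

For example, merging a startup phase with weight 1 and a steady-state phase with weight 3 makes the steady-state counts dominate the merged profile.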
>
>>
>> Also, in general, I am very wary of file locking. This can cause huge
>> amounts of slowdown for a build and has potential portability problems.
>>
>
> I don't see much slowdown with a clang build using instrumented clang as
> the build compiler. With file locking and profile merging enabled, the
> build time on my local machine is:
>
> real    18m22.737s
> user    293m18.924s
> sys     9m55.532s
>
> If profile merging/locking is disabled (i.e., letting the profile dumpers
> clobber/overwrite each other's output), the real time is about 14m.
>
>
>> I don't see it as a substantially better solution than wrapping clang in
>> a script that runs clang and then just calls llvm-profdata to do the
>> merging. Running llvm-profdata is cheap compared to doing locking in a
>> highly parallel situation like a build.
>>
>
> That would require synchronization for merging too.
>
> From Justin's email, it looks like there is a key point I have not made
> clear: the online profile merge is a very simple raw-profile-to-raw-profile
> merge, which is very fast. The end result of the profile run is still in
> raw format. The raw-to-indexed merge is still needed, but instead of
> merging thousands of raw profiles, which can be very slow, this model
> needs only one raw profile as input.
>

I think that __llvm_profile_merge_buffers in the runtime would be a useful
primitive if it can be implemented simply (or
__llvm_profile_load_counters_from_buffer, perhaps). If you could post a
patch for that part as a first incremental step that would be a good
starting point for concrete discussion.

In combination with the buffer API and reset_counters this is all that is
needed for very fine-grained counter capture.
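
A minimal sketch of how a merge primitive plus a counter reset could support per-frame capture. It models the counters as a plain array instead of calling the real runtime; `live_counters`, `merge_counters`, `reset_counters`, and `capture_frame` are hypothetical names for illustration, not the compiler-rt API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_COUNTERS 4

/* Stand-in for the live counters that instrumented code increments. */
static uint64_t live_counters[NUM_COUNTERS];

/* Hypothetical merge primitive: add src counters into dst. */
void merge_counters(uint64_t *dst, const uint64_t *src, size_t n) {
  assert(dst != NULL && src != NULL);
  for (size_t i = 0; i < n; ++i)
    dst[i] += src[i];
}

/* Hypothetical reset: zero the live counters. */
void reset_counters(void) {
  memset(live_counters, 0, sizeof live_counters);
}

/* After a profiled frame: fold the live counters into the accumulated
 * buffer, then reset so the next frame starts from zero. */
void capture_frame(uint64_t *accum) {
  merge_counters(accum, live_counters, NUM_COUNTERS);
  reset_counters();
}
```

Frames that should not contribute to the profile would call only the reset step, discarding whatever was counted during them.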

-- Sean Silva

>
> thanks,
>
> David
>
>
>>
>>
>> -- Sean Silva
>>
>> On Sat, Feb 27, 2016 at 6:02 PM, Xinliang David Li via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> One of the main features missing from the Clang/LLVM profile runtime is
>>> support for online/in-process profile merging. Profile data collected
>>> for different workloads of the same executable binary must be gathered
>>> and merged later by the offline post-processing tool. This limitation
>>> makes it hard to handle cases where the instrumented binary needs to be
>>> run with a large number of small workloads, possibly in parallel. For
>>> instance, to do PGO for clang, we may choose to build a large project
>>> with the instrumented Clang binary. This is difficult because:
>>>  1) to keep profiles from different runs from overwriting each other, %p
>>> substitution needs to be specified in either the command line or an
>>> environment variable so that each process dumps its profile data into
>>> its own file, named using its pid. This creates a huge demand for disk
>>> storage. For instance, clang's raw profile size is typically 80M -- if the
>>> instrumented clang is used to build a medium to large size project (such as
>>> clang itself), profile data can easily use up hundreds of gigabytes of
>>> local storage.
>>>  2) pids can also be recycled. This means that some of the profile data
>>> may be overwritten without anyone noticing.
>>>
>>> The way to solve this problem is to allow profile data to be merged in
>>> process.  I have a prototype implementation and plan to send it out for
>>> review soon after some cleanup. By default, profile merging is off; it
>>> can be turned on with a user option or via an environment variable.
>>> The following summarizes the issues involved in adding this feature:
>>>  1. the target platform needs to have file locking support
>>>  2. there needs to be an efficient way to identify the profile data and
>>> associate it with the binary using a binary/profdata signature;
>>>  3. Currently, without merging, profile data from shared libraries
>>> (including dlopen/dlclose ones) is concatenated into the primary profile
>>> file. This can complicate matters, as the merger needs to find the
>>> matching shared libs and also needs to avoid unnecessary data
>>> movement/copying;
>>>  4. value profile data is variable in length even for the same binary.
>>>
>>> All the above issues are resolved, and a clang self-build with the
>>> instrumented binary passes (with both -j1 and high parallelism).
>>>
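A minimal sketch of the lock-then-merge idea behind issue 1, assuming a POSIX system and a profile file that is nothing more than a flat array of 64-bit counters. The function `merge_into_file` and the file layout are hypothetical; the real raw profile format has a header and data records, and the signature check from issue 2 is omitted here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: add n in-memory counters to the counters stored in
 * the file at `path`, holding an exclusive lock for the whole update.
 * Returns 0 on success, -1 on error. */
int merge_into_file(const char *path, const uint64_t *counters, size_t n) {
  assert(path != NULL && counters != NULL);
  int fd = open(path, O_RDWR | O_CREAT, 0644);
  if (fd < 0)
    return -1;

  struct flock lk = {0};
  lk.l_type = F_WRLCK;
  lk.l_whence = SEEK_SET; /* l_start = 0, l_len = 0: lock the whole file */
  if (fcntl(fd, F_SETLKW, &lk) != 0) { /* blocks until the lock is free */
    close(fd);
    return -1;
  }

  int rc = 0;
  for (size_t i = 0; i < n && rc == 0; ++i) {
    uint64_t on_disk = 0;
    off_t off = (off_t)(i * sizeof on_disk);
    ssize_t got = pread(fd, &on_disk, sizeof on_disk, off);
    if (got < 0) {
      rc = -1;
      break;
    }
    if (got != (ssize_t)sizeof on_disk)
      on_disk = 0; /* new or short file: treat the counter as zero */
    on_disk += counters[i];
    if (pwrite(fd, &on_disk, sizeof on_disk, off) != (ssize_t)sizeof on_disk)
      rc = -1;
  }

  close(fd); /* closing the descriptor also releases the fcntl lock */
  return rc;
}
```

Each exiting process would call such a helper once, so a parallel build ends with a single merged file instead of one file per pid.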
>>> If you have any concerns, please let me know.
>>>
>>> thanks,
>>>
>>> David
>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>
>