[llvm-dev] Add support for in-process profile merging in profile-runtime

Sean Silva via llvm-dev llvm-dev at lists.llvm.org
Tue Mar 1 15:34:06 PST 2016


Hi David,

This is wonderful data and demonstrates the viability of this feature. I
think this has alleviated the concerns regarding file locking.

As for the implementation of the feature, I think we will probably want the
following incremental steps:
a) implement the core merging logic and add to the buffer API a primitive
for merging two buffers
b) implement the file system glue to extend this to the filesystem APIs
(write_file etc.)
c) implement a profile filename format string which generates a random
number mod a specified amount (strawman:
`LLVM_PROFILE_FILE=default.profraw.%7u`, which generates a _u_nique number
mod 7; in general it is `%<N>u`). A rough sketch of this expansion is below.

 b) depends on a), but c) can be done in parallel with both.
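
To make (c) concrete, here is a minimal sketch of how the runtime could
expand such a pattern, assuming for illustration that the slot is derived
from the pid modulo the pool size rather than a random number (the real
syntax and selection strategy may differ; all names below are illustrative
only, not an actual compiler-rt API):

/* Illustrative sketch only: expand a pattern such as
 * "default.profraw.%7u" into e.g. "default.profraw.3" by replacing
 * "%<N>u" with (getpid() % N). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Returns 0 on success, -1 if no valid "%<N>u" directive is found. */
static int expand_pool_pattern(const char *pattern, char *out, size_t out_size) {
  const char *pct = strchr(pattern, '%');
  if (!pct)
    return -1;
  char *end;
  long pool = strtol(pct + 1, &end, 10);
  if (pool <= 0 || *end != 'u')
    return -1;
  unsigned slot = (unsigned)(getpid() % pool);       /* slot in [0, N) */
  snprintf(out, out_size, "%.*s%u%s",
           (int)(pct - pattern), pattern, slot, end + 1);
  return 0;
}

With a pool of N files plus in-process merging, at most N writers ever
contend for the same profile file.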

Does this seem feasible?

-- Sean Silva

On Tue, Mar 1, 2016 at 2:55 PM, Xinliang David Li <davidxl at google.com>
wrote:

> I have implemented the profile pool idea from Mehdi, and collected
> performance data related to profile merging and file locking.  The
> following is the experiment setup:
>
> 1) the machine has 32 logical cores (Intel Sandy Bridge, 64 GB memory)
> 2) the workload is a clang self-build (~3.3K files to be built), and the
> instrumented binary is Clang.
> 3) ninja parallelism: -j32
>
> File systems tested (on Linux):
> 1) a local file system on a SSD drive
> 2) tmpfs
> 3) a local file system on a hard disk
> 4) an internal distributed file system
>
> Configurations tested:
> 1) all processes dump to the same profile data file without locking (this
> configuration of course produces useless profile data in the end, but it
> serves as the performance baseline)
> 2) profile-merging enabled with pool sizes: 1, 2, 3, 4, 5, 10, and 32
> 3) using LLVM_PROFILE_FILE=..._%p to enable each process to dump its own
> copy of profile data (resulting in ~3.2K profile data files in the end).
> This configuration was tested only on some file systems due to size/quota
> constraints.
>
> Here is a very high level summary of the experiment results. Not
> surprisingly, the longer the write latency, the more file-locking
> contention there is. In some cases file locking has close to zero overhead,
> while on file systems with high write latencies it can affect performance
> negatively. In such cases, using a small pool of profile files can
> completely recover the performance. The required pool size is capped at a
> small value (which depends on many factors: write latency, the rate at
> which instrumented processes exit, I/O throughput/network bandwidth, etc.).
>
> 1) SSD
>
> The performance is almost identical across *ALL* the test configurations.
> The real time needed to complete the full self build is ~13m10s.  There is
> no visible file contention with file locking enabled even with pool size ==
> 1.
>
> 2) tmpfs
>
> Only tested with the following configs:
> a) shared profile with no merge
> b) with merge (pool == 1) and with merge (pool == 2)
>
> Not surprisingly, the result is similar to the SSD case -- the build
> consistently finished in a little more than 13m.
>
> 3) HDD
>
> With this configuration, file locking starts to show some impact -- the
> writes are slow enough to introduce contention.
>
> a) Shared profile without merging: ~13m10s
> b) with merging
>    b.1) pool size == 1:  ~18m20s
>    b.2) pool size == 2:  ~16m30s
>    b.3) pool size == 3:  ~15m55s
>    b.4) pool size == 4:  ~16m20s
>    b.5) pool size == 5:  ~16m42s
> c) >3000 profile files without merging (%p): ~16m50s
>
> Increasing the size of the merge pool increases dumping parallelism -- the
> performance improves initially, but above a pool size of 4 it starts to
> degrade gradually. The HDD I/O throughput is saturated at that point, and
> increasing parallelism does not help any more.
>
> In short, with profile merging we need to dump only 3 profile files to
> achieve the same build performance as dumping >3000 files (the current
> default behavior).
>
> 4) An internal file system using network attached storage
>
> In this file system, file writes have relatively long latency compared
> with local file systems. The backend storage servers do dynamic load
> balancing so that very high I/O throughput can be achieved with high
> parallelism (on both the FE/client side and the backend).
>
> a) Single profile without profile merging: ~60m
> b) Profile merging enabled:
>     b.1) pool size == 1:  ~80m
>     b.2) pool size == 2:  ~47m
>     b.3) pool size == 3:  ~43m
>     b.4) pool size == 4:  ~40m40s
>     b.5) pool size == 5:  ~38m50s
>     b.6) pool size == 10: ~36m48s
>     b.7) pool size == 32: ~36m24s
> c) >3000 profile files without profile merging (%p): ~35m24s
>
> b.6), b.7), and c) have the best performance overall.
>
> Unlike in the HDD case, a) performs poorly here, due to low parallelism in
> the storage backend.
>
> With file dumping parallelism, the performance flattens out once the pool
> size >= 10. This is because the client (ninja+clang) system has reached its
> peak and becomes the new performance bottleneck.
>
> Again, with profile merging, we need only 10 profile data files to achieve
> the same performance as the default behavior, which requires >3000 files to
> be dumped.
>
> thanks,
>
> David
>
>
>
>
> On Sun, Feb 28, 2016 at 12:13 AM, Xinliang David Li <davidxl at google.com>
> wrote:
>
>>
>> On Sat, Feb 27, 2016 at 6:50 PM, Sean Silva <chisophugis at gmail.com>
>> wrote:
>>
>>> I have thought about this issue too, in the context of games. We may want
>>> to turn profiling on only for certain frames (essentially, this amounts to
>>> many small profile runs).
>>>
>>> However, I have not seen it demonstrated that this kind of refined data
>>> collection will actually improve PGO results in practice. The evidence I
>>> do have, though, is that IIRC Apple found that almost all of the benefits
>>> of PGO for the Clang binary can be obtained with a handful of training
>>> runs of Clang. Are your findings different?
>>>
>>
>> We have a very wide customer base, so we cannot claim that one usage model
>> is sufficient for all users. For instance, we have users using fine-grained
>> profile dumping control (programmatically), as you described above. There
>> are also other possible use cases, such as dumping the profiles for
>> different periodic phases into files associated with those phases; later,
>> the different phases' profile data can be merged with different weights.
>>
>>
>>>
>>> Also, in general, I am very wary of file locking. This can cause huge
>>> amounts of slowdown for a build and has potential portability problems.
>>>
>>
>> I don't see much slowdown with a clang build using an instrumented clang
>> as the build compiler. With file locking and profile merging enabled, the
>> build time on my local machine looks like:
>>
>> real    18m22.737s
>> user    293m18.924s
>> sys     9m55.532s
>>
>> If profile merging/locking is disabled (i.e., the profile dumpers
>> clobber/write over each other), the real time is about 14m.
>>
>>
>>> I don't see it as a substantially better solution than wrapping clang in
>>> a script that runs clang and then just calls llvm-profdata to do the
>>> merging. Running llvm-profdata is cheap compared to doing locking in a
>>> highly parallel situation like a build.
>>>
>>
>> That would require synchronization for merging too.
>>
>> From Justin's email, it looks like there is a key point I have not made
>> clear: the online profile merge is a very simple raw-profile-to-raw-profile
>> merge, which is super fast. The end result of the profile run is still in
>> raw format. The raw-to-indexed merge is still needed -- but instead of
>> merging thousands of raw profiles, which can be very slow, with this model
>> only one raw profile input is needed.
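>>
>> To illustrate why the raw-to-raw step is cheap, here is a minimal sketch of
>> the core operation, assuming both profiles come from the same binary and
>> their counters form a flat uint64_t array (header/signature checks and the
>> exact layout are elided; the names are illustrative, not the actual runtime
>> API):
>>
>> /* Illustrative sketch only: merging two raw profiles produced by the same
>>  * binary is essentially element-wise addition of their counters. */
>> #include <stdint.h>
>> #include <stddef.h>
>>
>> static void merge_raw_counters(uint64_t *dst, const uint64_t *src,
>>                                size_t num_counters) {
>>   for (size_t i = 0; i < num_counters; ++i)
>>     dst[i] += src[i];   /* one linear pass; no hashing or re-indexing */
>> }
>>
>> The more expensive raw-to-indexed conversion still happens once, offline,
>> via llvm-profdata.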
>>
>> thanks,
>>
>> David
>>
>>
>>>
>>>
>>> -- Sean Silva
>>>
>>> On Sat, Feb 27, 2016 at 6:02 PM, Xinliang David Li via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> One of the main missing features in the Clang/LLVM profile runtime is the
>>>> lack of support for online/in-process profile merging. Profile data
>>>> collected for different workloads for the same executable binary needs to
>>>> be collected and merged later by the offline post-processing tool. This
>>>> limitation makes it hard to handle cases where the instrumented binary
>>>> needs to be run with a large number of small workloads, possibly in
>>>> parallel. For instance, to do PGO for clang, we may choose to build a
>>>> large project with the instrumented Clang binary. This is problematic
>>>> because:
>>>>  1) to keep profiles from different runs from overwriting one another, %p
>>>> substitution needs to be specified either on the command line or in an
>>>> environment variable so that each process can dump profile data into its
>>>> own file, named using its pid. This creates a huge disk storage
>>>> requirement. For instance, clang's raw profile size is typically ~80 MB --
>>>> if the instrumented clang is used to build a medium-to-large project (such
>>>> as clang itself), the profile data can easily use up hundreds of gigabytes
>>>> of local storage (roughly 80 MB x ~3,300 compile processes, i.e. about
>>>> 260 GB, for a clang self-build).
>>>>  2) pids can also be recycled. This means that some of the profile data
>>>> may be overwritten without being noticed.
>>>>
>>>> The way to solve this problem is to allow profile data to be merged in
>>>> process. I have a prototype implementation and plan to send it out for
>>>> review soon after some cleanups. By default, profile merging is off, and
>>>> it can be turned on with a user option or via an environment variable.
>>>> The following summarizes the issues involved in adding this feature:
>>>>  1. the target platform needs to have file locking support (a rough
>>>> sketch of the lock-and-merge sequence appears after this list);
>>>>  2. there needs to be an efficient way to identify the profile data and
>>>> associate it with the binary, using a binary/profdata signature;
>>>>  3. currently, without merging, profile data from shared libraries
>>>> (including dlopen/dlclose ones) is concatenated into the primary profile
>>>> file. This can complicate matters, as the merger needs to find the
>>>> matching shared libraries and also needs to avoid unnecessary data
>>>> movement/copying;
>>>>  4. value profile data is variable in length, even for the same binary.
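>>>>
>>>> For issue 1, here is a minimal sketch of what the lock-and-merge sequence
>>>> at dump time might look like, assuming a POSIX flock()-style advisory
>>>> lock; the function names and on-disk layout handling are illustrative
>>>> only, not the actual compiler-rt implementation:
>>>>
>>>> /* Illustrative sketch only: open (or create) the shared profile file,
>>>>  * take an exclusive lock, fold the existing on-disk counters into the
>>>>  * in-memory ones, rewrite the file, and release the lock.  Header and
>>>>  * signature checks (issue 2) and value profile data (issue 4) are
>>>>  * elided. */
>>>> #include <fcntl.h>
>>>> #include <stdint.h>
>>>> #include <stdlib.h>
>>>> #include <sys/file.h>
>>>> #include <unistd.h>
>>>>
>>>> static int dump_with_merge(const char *path, uint64_t *counters,
>>>>                            size_t num_counters) {
>>>>   int fd = open(path, O_RDWR | O_CREAT, 0644);
>>>>   if (fd < 0)
>>>>     return -1;
>>>>   if (flock(fd, LOCK_EX) != 0) {        /* serialize concurrent dumpers */
>>>>     close(fd);
>>>>     return -1;
>>>>   }
>>>>   size_t bytes = num_counters * sizeof(uint64_t);
>>>>   uint64_t *old = malloc(bytes);
>>>>   ssize_t got = old ? read(fd, old, bytes) : -1;
>>>>   if (got == (ssize_t)bytes)            /* a profile already exists: merge */
>>>>     for (size_t i = 0; i < num_counters; ++i)
>>>>       counters[i] += old[i];
>>>>   free(old);
>>>>   lseek(fd, 0, SEEK_SET);               /* rewrite the merged counters */
>>>>   (void)write(fd, counters, bytes);     /* error handling elided */
>>>>   flock(fd, LOCK_UN);
>>>>   close(fd);
>>>>   return 0;
>>>> }
>>>>
>>>> A real implementation would also need to match the binary/profdata
>>>> signature before merging (issue 2) and cope with the variable-length
>>>> value profile data (issue 4).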
>>>>
>>>> All of the above issues are resolved, and a clang self-build with the
>>>> instrumented binary passes (with both -j1 and high parallelism).
>>>>
>>>> If you have any concerns, please let me know.
>>>>
>>>> thanks,
>>>>
>>>> David
>>>>
>>>>
>>>>
>>>>
>>>
>>
>