[llvm-dev] Add support for in-process profile merging in profile-runtime

Xinliang David Li via llvm-dev llvm-dev at lists.llvm.org
Tue Mar 1 15:41:14 PST 2016


Sounds reasonable. My design of c) is different in many ways (e.g., using
getpid()%PoolSize), but we can defer that discussion to code review.
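
For reference, here is a minimal illustrative sketch (not the prototype
itself) of picking a pool slot with getpid()%PoolSize; the pattern-expansion
helper and its name are hypothetical:

/* Illustrative only: expand a "%7u"-style pattern by picking a pool slot
 * from the pid, so concurrent processes spread across PoolSize files. */
#include <stdio.h>
#include <unistd.h>

static void expand_pool_name(char *Buf, size_t Size, const char *Base,
                             unsigned PoolSize) {
  unsigned Slot = (unsigned)getpid() % PoolSize; /* getpid()%PoolSize */
  snprintf(Buf, Size, "%s.%u", Base, Slot);      /* e.g. default.profraw.3 */
}

int main(void) {
  char Name[256];
  expand_pool_name(Name, sizeof(Name), "default.profraw", 7);
  printf("%s\n", Name);
  return 0;
}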

thanks,

David

On Tue, Mar 1, 2016 at 3:34 PM, Sean Silva via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Hi David,
>
> This is wonderful data and demonstrates the viability of this feature. I
> think this has alleviated the concerns regarding file locking.
>
> As far as the implementation of the feature, I think we will probably want
> the following incremental steps:
> a) implement the core merging logic and add to the buffer API a primitive
> for merging two buffers (a rough sketch of such a primitive is below)
> b) implement the file system glue to extend this to the filesystem APIs
> (write_file etc.)
> c) implement a profile filename format string which generates a random
> number mod a specified amount (strawman:
> `LLVM_PROFILE_FILE=default.profraw.%7u`, which generates a _u_nique number
> mod 7. Of course, in general it is `%<N>u`)
>
>  b) depends on a), but c) can be done in parallel with both.
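>
> (For illustration only, a rough sketch of the kind of primitive a) could
> expose; the function name and the counters-only scope are hypothetical and
> would be settled in code review:)
>
> /* Hypothetical primitive: merge the counters of raw profile buffer Src
>  * into Dst.  Both buffers come from the same binary, so the counter layout
>  * matches; a real implementation would also validate headers and handle
>  * value profile data. */
> #include <stddef.h>
> #include <stdint.h>
>
> int merge_raw_counters(uint64_t *DstCounters, const uint64_t *SrcCounters,
>                        size_t NumCounters) {
>   for (size_t I = 0; I < NumCounters; ++I)
>     DstCounters[I] += SrcCounters[I]; /* execution counts simply add */
>   return 0;
> }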
>
> Does this seem feasible?
>
> -- Sean Silva
>
> On Tue, Mar 1, 2016 at 2:55 PM, Xinliang David Li <davidxl at google.com>
> wrote:
>
>> I have implemented the profile pool idea from Mehdi, and collected
>> performance data related to profile merging and file locking.  The
>> following is the experiment setup:
>>
>> 1) the machine has 32 logical cores (Intel Sandy Bridge machine, 64G memory)
>> 2) the workload is a clang self-build (~3.3K files to be built), and the
>> instrumented binary is Clang.
>> 3) ninja parallelism: -j32
>>
>> File systems tested (on Linux):
>> 1) a local file system on a SSD drive
>> 2) tmpfs
>> 3) a local file system on a hard disk
>> 4) an internal distributed file system
>>
>> Configurations tested:
>> 1) all processes dump to the same profile data file without locking (this
>> configuration of course produces useless profile data in the end, but it
>> serves as the performance baseline)
>> 2) profile-merging enabled with pool sizes: 1, 2, 3, 4, 5, 10, and 32
>> 3) using LLVM_PROFILE_FILE=..._%p to enable each process to dump its own
>> copy of profile data (resulting in ~3.2K profile data files in the end).
>> This configuration was only tested on some file systems due to size/quota
>> constraints.
>>
>> Here is a very high level summary of the experiment results. The longer
>> the write latency, the more file-locking contention there is (which is not
>> surprising). In some cases, file locking has close to zero overhead, while
>> on file systems with high write latencies, file locking can affect
>> performance negatively. In such cases, using a small pool of profile files
>> can completely recover the performance. The required pool size is capped at
>> a small value (which depends on many different factors: write latency, the
>> rate at which instrumented processes exit, I/O throughput/network
>> bandwidth, etc.).
>>
>> 1) SSD
>>
>> The performance is almost identical across *ALL* the test configurations.
>> The real time needed to complete the full self build is ~13m10s.  There is
>> no visible file contention with file locking enabled even with pool size ==
>> 1.
>>
>> 2) tmpfs
>>
>> Only tested with the following configs:
>> a) shared profile with no merge
>> b) with merge (pool == 1), with merge (pool == 2)
>>
>> Not surprisingly, the result is similar to the SSD case -- the build
>> consistently finished in a little more than 13m.
>>
>> 3) HDD
>>
>> With this configuration, file locking starts to show some impact -- writes
>> are slow enough to introduce contention.
>>
>> a) Shared profile without merging: ~13m10s
>> b) with merging
>>    b.1) pool size == 1:  ~18m20s
>>    b.2) pool size == 2:  ~16m30s
>>    b.3) pool size == 3:  ~15m55s
>>    b.4) pool size == 4:  ~16m20s
>>    b.5) pool size == 5:  ~16m42s
>> c) >3000 profile files without merging (%p): ~16m50s
>>
>> Increasing the size of the merge pool increases dumping parallelism -- the
>> performance improves initially, but above 4 it starts to degrade gradually.
>> The HDD I/O throughput is saturated at that point, and increasing
>> parallelism does not help any more.
>>
>> In short, with profile merging, we only need to dump 3 profile files to
>> achieve the same build performance as the current default behavior, which
>> dumps >3000 files.
>>
>> 4) An internal file system using network attached storage
>>
>> On this file system, file writes have relatively long latency compared
>> with local file systems. The backend storage servers do dynamic load
>> balancing, so very high I/O throughput can be achieved with high
>> parallelism (on both the frontend/client side and the backend).
>>
>> a) Single profile without profile merging: ~60m
>> b) Profile merging enabled:
>>     b.1) pool size == 1:  ~80m
>>     b.2) pool size == 2:  ~47m
>>     b.3) pool size == 3:  ~43m
>>     b.4) pool size == 4:  ~40m40s
>>     b.5) pool size == 5:  ~38m50s
>>     b.6) pool size == 10: ~36m48s
>>     b.7) pool size == 32: ~36m24s
>> c) >3000 profile files without profile merging (%p): ~35m24s
>>
>> b.6), b.7), and c) have the best performance of all.
>>
>> Unlike in the HDD case, a) performs poorly here due to the low parallelism
>> it allows in the storage backend.
>>
>> With file-dumping parallelism, the performance flattens out when the pool
>> size is >= 10. This is because the client (ninja+clang) system has reached
>> its peak and become the new performance bottleneck.
>>
>> Again, with profile merging, we only need 10 profile data files to achieve
>> the same performance as the default behavior, which requires >3000 files to
>> be dumped.
>>
>> thanks,
>>
>> David
>>
>>
>>
>>
>> On Sun, Feb 28, 2016 at 12:13 AM, Xinliang David Li <davidxl at google.com>
>> wrote:
>>
>>>
>>> On Sat, Feb 27, 2016 at 6:50 PM, Sean Silva <chisophugis at gmail.com>
>>> wrote:
>>>
>>>> I have thought about this issue too, in the context of games. We may
>>>> want to turn profiling on only for certain frames (essentially, this is
>>>> many small profile runs).
>>>>
>>>> However, I have not seen it demonstrated that this kind of refined data
>>>> collection will actually improve PGO results in practice.
>>>> The evidence I do have, though, is that IIRC Apple has found that almost
>>>> all of the benefits of PGO for the Clang binary can be obtained with a
>>>> handful of training runs of Clang. Are your findings different?
>>>>
>>>
>>> We have a very wide customer base, so we cannot claim that one usage model
>>> is sufficient for all users. For instance, we have users who use
>>> fine-grained profile dumping control (programmatically), as you described
>>> above. There are also other possible use cases, such as dumping profiles
>>> for different periodic phases into files associated with those phases;
>>> later, each phase's profile data can be merged with different weights.
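>>>
>>> (For illustration, a minimal sketch of such programmatic per-phase
>>> dumping, assuming the runtime's __llvm_profile_* entry points; the
>>> phase-runner helper itself is hypothetical:)
>>>
>>> /* Sketch: dump a separate raw profile per phase; the per-phase files can
>>>  * later be merged offline with different weights. */
>>> extern void __llvm_profile_set_filename(const char *Name);
>>> extern int __llvm_profile_write_file(void);
>>> extern void __llvm_profile_reset_counters(void);
>>>
>>> void run_phase(const char *ProfileName, void (*Phase)(void)) {
>>>   __llvm_profile_reset_counters();          /* start this phase from zero */
>>>   Phase();                                  /* run the workload phase     */
>>>   __llvm_profile_set_filename(ProfileName); /* per-phase output file      */
>>>   __llvm_profile_write_file();              /* dump this phase's profile  */
>>> }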
>>>
>>>
>>>>
>>>> Also, in general, I am very wary of file locking. This can cause huge
>>>> amounts of slowdown for a build and has potential portability problems.
>>>>
>>>
>>> I don't see much slowdown with a clang build using an instrumented clang
>>> as the build compiler. With file locking and profile merging enabled, the
>>> build time on my local machine looks like:
>>>
>>> real    18m22.737s
>>> user    293m18.924s
>>> sys     9m55.532s
>>>
>>> If profile merging/locking is disabled (i.e., the profile dumpers are
>>> allowed to clobber/write over each other), the real time is about 14m.
>>>
>>>
>>>> I don't see it as a substantially better solution than wrapping clang
>>>> in a script that runs clang and then just calls llvm-profdata to do the
>>>> merging. Running llvm-profdata is cheap compared to doing locking in a
>>>> highly parallel situation like a build.
>>>>
>>>
>>> That would require synchronization for merging too.
>>>
>>> From Justin's email, it looks like there is a key point I have not made
>>> clear: the online profile merge is a very simple raw-profile-to-raw-profile
>>> merge, which is super fast. The end result of the profile run is still in
>>> raw format. The raw-to-indexed merge is still needed -- but instead of
>>> merging thousands of raw profiles, which can be very slow, with this model
>>> only one raw profile input is needed.
>>>
>>> thanks,
>>>
>>> David
>>>
>>>
>>>>
>>>>
>>>> -- Sean Silva
>>>>
>>>> On Sat, Feb 27, 2016 at 6:02 PM, Xinliang David Li via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>> One of the main missing features in the Clang/LLVM profile runtime is the
>>>>> lack of support for online/in-process profile merging. Profile data
>>>>> collected for different workloads of the same executable binary needs to
>>>>> be merged later by the offline post-processing tool. This limitation
>>>>> makes it hard to handle cases where the instrumented binary needs to be
>>>>> run with a large number of small workloads, possibly in parallel. For
>>>>> instance, to do PGO for clang, we may choose to build a large project
>>>>> with the instrumented Clang binary. This is problematic because:
>>>>>  1) to avoid profiles from different runs overwriting one another, the
>>>>> %p substitution needs to be specified in either the command line or an
>>>>> environment variable so that each process can dump profile data into its
>>>>> own file, named using its pid. This creates a huge demand on disk
>>>>> storage. For instance, clang's raw profile size is typically 80M -- if
>>>>> the instrumented clang is used to build a medium to large project (such
>>>>> as clang itself), the profile data can easily use up hundreds of
>>>>> gigabytes of local storage (see the rough arithmetic below).
>>>>>  2) pids can also be recycled. This means that some of the profile data
>>>>> may be overwritten without being noticed.
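>>>>>
>>>>> (Rough arithmetic for 1): a clang self-build runs roughly 3,300 compile
>>>>> processes, and at ~80 MB of raw profile per process that is about
>>>>> 3,300 x 80 MB, i.e. around 260 GB.)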
>>>>>
>>>>> The way to solve this problem is to allow profile data to be merged in
>>>>> process.  I have a prototype implementation and plan to send it out for
>>>>> review soon after some cleanups. By default, profile merging is off, and
>>>>> it can be turned on with a user option or via an environment variable.
>>>>> The following summarizes the issues involved in adding this feature:
>>>>>  1. the target platform needs to have file locking support (a rough
>>>>> sketch of the locked merge-on-dump flow follows after this list);
>>>>>  2. there needs to be an efficient way to identify the profile data and
>>>>> associate it with the binary, using a binary/profdata signature;
>>>>>  3. currently, without merging, profile data from shared libraries
>>>>> (including dlopen/dlclose ones) is concatenated into the primary profile
>>>>> file. This can complicate matters, as the merger needs to find the
>>>>> matching shared libs and also needs to avoid unnecessary data
>>>>> movement/copying;
>>>>>  4. value profile data is variable in length even for the same binary.
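>>>>>
>>>>> (For illustration only -- not the prototype -- the locked merge-on-dump
>>>>> flow behind 1. might look roughly like the POSIX sketch below; it handles
>>>>> counters only, with no header or value-data handling:)
>>>>>
>>>>> #include <fcntl.h>
>>>>> #include <stddef.h>
>>>>> #include <stdint.h>
>>>>> #include <sys/file.h>
>>>>> #include <unistd.h>
>>>>>
>>>>> /* Sketch: take an exclusive lock, add any existing on-disk counters
>>>>>  * into ours (the raw-to-raw merge), then write the merged counters
>>>>>  * back. */
>>>>> int dump_with_merge(const char *Path, uint64_t *Counters, size_t N) {
>>>>>   int FD = open(Path, O_RDWR | O_CREAT, 0644);
>>>>>   if (FD < 0)
>>>>>     return -1;
>>>>>   flock(FD, LOCK_EX);                   /* serialize concurrent dumpers */
>>>>>   uint64_t Old;
>>>>>   for (size_t I = 0; I < N; ++I)
>>>>>     if (read(FD, &Old, sizeof(Old)) == (ssize_t)sizeof(Old))
>>>>>       Counters[I] += Old;               /* raw-to-raw merge: just add   */
>>>>>   lseek(FD, 0, SEEK_SET);
>>>>>   ssize_t Written = write(FD, Counters, N * sizeof(uint64_t));
>>>>>   flock(FD, LOCK_UN);
>>>>>   close(FD);
>>>>>   return Written < 0 ? -1 : 0;
>>>>> }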
>>>>>
>>>>> All of the above issues are resolved, and a clang self-build with the
>>>>> instrumented binary passes (with both -j1 and high parallelism).
>>>>>
>>>>> If you have any concerns, please let me know.
>>>>>
>>>>> thanks,
>>>>>
>>>>> David
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>
>