[llvm-dev] Add support for in-process profile merging in profile-runtime

Sean Silva via llvm-dev llvm-dev at lists.llvm.org
Tue Mar 1 15:54:41 PST 2016


On Tue, Mar 1, 2016 at 3:41 PM, Xinliang David Li <xinliangli at gmail.com>
wrote:

> Sounds reasonable. My design of c) is different in many ways (e.g., using
> getpid() % PoolSize), but we can defer that discussion to code review.
>

I like that (e.g. support %7p in addition to %p).
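
For illustration, here is a rough sketch of how such a pattern could be
expanded, mapping each process to a pool slot via getpid() % N as you
suggested (hypothetical helper, not the actual runtime code; I'm using the
%Nu spelling from the strawman below):

  #include <ctype.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* Expand a "%Nu"-style specifier into getpid() % N, so
     "default.profraw.%7u" becomes e.g. "default.profraw.3". */
  static void expandPoolPattern(const char *Pattern, char *Out, size_t Size) {
    size_t O = 0;
    for (const char *P = Pattern; *P && O + 1 < Size; ++P) {
      if (*P == '%' && isdigit((unsigned char)P[1])) {
        char *End;
        unsigned long N = strtoul(P + 1, &End, 10);
        if (N > 0 && *End == 'u') {
          int W = snprintf(Out + O, Size - O, "%lu",
                           (unsigned long)getpid() % N);
          if (W < 0 || (size_t)W >= Size - O)
            break;
          O += (size_t)W;
          P = End; /* the loop's ++P then skips past the 'u' */
          continue;
        }
      }
      Out[O++] = *P;
    }
    Out[O] = '\0';
  }

With this, LLVM_PROFILE_FILE=default.profraw.%7u would spread processes over
at most 7 profile files.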

-- Sean Silva


>
> thanks,
>
> David
>
> On Tue, Mar 1, 2016 at 3:34 PM, Sean Silva via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hi David,
>>
>> This is wonderful data and demonstrates the viability of this feature. I
>> think this has alleviated the concerns regarding file locking.
>>
>> As for the implementation of the feature, I think we will probably want
>> the following incremental steps:
>> a) implement the core merging logic and add a primitive to the buffer API
>> for merging two buffers (a rough sketch follows below)
>> b) implement the file system glue to extend this to the filesystem APIs
>> (write_file etc.)
>> c) implement a profile filename format string which generates a random
>> number mod a specified amount (strawman:
>> `LLVM_PROFILE_FILE=default.profraw.%7u`, which generates a _u_nique number
>> mod 7; in general it is `%<N>u`)
>>
>>  b) depends on a), but c) can be done in parallel with both.
>>
>> Does this seem feasible?
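>>
>> For concreteness, a minimal sketch of the kind of buffer-merge primitive
>> a) calls for, assuming both buffers come from the same binary so the
>> counter region is just an array of 64-bit counters that can be summed
>> element-wise (headers, data records and name tables are elided):
>>
>>   #include <stddef.h>
>>   #include <stdint.h>
>>
>>   /* Sum Src's counters into Dst, saturating instead of wrapping. */
>>   static void mergeCounterRegions(uint64_t *Dst, const uint64_t *Src,
>>                                   size_t NumCounters) {
>>     for (size_t I = 0; I < NumCounters; ++I) {
>>       uint64_t Sum = Dst[I] + Src[I];
>>       Dst[I] = (Sum < Dst[I]) ? UINT64_MAX : Sum;
>>     }
>>   }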
>>
>> -- Sean Silva
>>
>> On Tue, Mar 1, 2016 at 2:55 PM, Xinliang David Li <davidxl at google.com>
>> wrote:
>>
>>> I have implemented the profile pool idea from Mehdi, and collected
>>> performance data related to profile merging and file locking.  The
>>> following is the experiment setup:
>>>
>>> 1) the machine has 32 logical cores (Intel sandybridge machine/64G
>>> memory)
>>> 2) the workload is clang self build (~3.3K files to be built), and the
>>> instrumented binary is Clang.
>>> 3) ninja parallelism j32
>>>
>>> File systems tested (on linux)
>>> 1) a local file system on a SSD drive
>>> 2) tmpfs
>>> 3) a local file system on a hard disk
>>> 4) an internal distributed file system
>>>
>>> Configurations tested:
>>> 1) all processes dump to the same profile data file without locking
>>> (this configuration of course produces useless profile data in the end, but
>>> it serves as the performance baseline)
>>> 2) profile-merging enabled with pool sizes : 1, 2, 3, 4, 5, 10, and 32
>>> 3) using LLVM_PROFILE_FILE=..._%p to enable each process to dump its own
>>> copy of profile data (resulting in ~3.2K profile data files in the end).
>>> This configuration is only tested on some FS due to size/quota constraints.
>>>
>>> Here is a very high level summary of the experiment results. The longer
>>> the write latency, the more file locking contention there is (which is not
>>> surprising). In some cases file locking has close to zero overhead, while
>>> on file systems with high write latencies it can affect performance
>>> negatively. In such cases, using a small pool of profile files can
>>> completely recover the performance. The required pool size is capped at a
>>> small value (which depends on many factors: write latency, the rate at
>>> which instrumented processes retire, IO throughput/network bandwidth,
>>> etc.).
>>>
>>> 1) SSD
>>>
>>> The performance is almost identical across *ALL* the test
>>> configurations. The real time needed to complete the full self build is
>>> ~13m10s.  There is no visible file contention with file locking enabled
>>> even with pool size == 1.
>>>
>>> 2) tmpfs
>>>
>>> Only tested with the following configs:
>>> a) shared profile with no merge
>>> b) with merge (pool == 1), with merge (pool == 2)
>>>
>>> Not surprisingly, the result is similar to the SSD case -- the build
>>> consistently finished in a little more than 13m.
>>>
>>> 3) HDD
>>>
>>> With this configuration, file locking starts to show some impact -- the
>>> writes are slow enough to introduce contention.
>>>
>>> a) Shared profile without merging: ~13m10s
>>> b) with merging
>>>    b.1) pool size == 1:  ~18m20s
>>>    b.2) pool size == 2:  ~16m30s
>>>    b.3) pool size == 3:  ~15m55s
>>>    b.4) pool size == 4:  ~16m20s
>>>    b.5) pool size == 5:  ~16m42s
>>> c) >3000 profile file without merging (%p) : ~16m50s
>>>
>>> Increasing the size of the merge pool increases dumping parallelism -- the
>>> performance improves initially, but above 4 it starts to degrade
>>> gradually. The HDD IO throughput is saturated at that point, and
>>> increasing parallelism does not help any more.
>>>
>>> In short, with profile merging we only need to dump 3 profile files to
>>> achieve the same build performance as dumping >3000 files (the current
>>> default behavior).
>>>
>>> 4) An internal file system using network attached storage
>>>
>>> In this file system, file writes have relatively long latency compared
>>> with local file systems. The backend storage server does dynamic load
>>> balancing, so it can achieve very high IO throughput with high parallelism
>>> (on both the FE/client side and the backend).
>>>
>>> a) Single profile without profile merging : ~60m
>>> b) Profile merging enabled:
>>>     b.1) pool size == 1:  ~80m
>>>     b.2) pool size == 2:  ~47m
>>>     b.3) pool size == 3:  ~43m
>>>     b.4) pool size == 4:  ~40m40s
>>>     b.5) pool size == 5:  ~38m50s
>>>     b.6) pool size == 10: ~36m48s
>>>     b.7) pool size == 32: ~36m24s
>>> c) >3000 profile file without profile merging (%p): ~35m24s
>>>
>>> b.6), b.7) and c) have the best performance among all.
>>>
>>> Unlike in the HDD case, a) has poor performance here -- due to low
>>> parallelism in the storage backend.
>>>
>>> With file dumping parallelism, the performance flattens out once the pool
>>> size is >= 10. This is because the client (ninja+clang) system has reached
>>> its peak and becomes the new performance bottleneck.
>>>
>>> Again, with profile merging we only need 10 profile data files to achieve
>>> the same performance as the default behavior, which requires >3000 files
>>> to be dumped.
>>>
>>> thanks,
>>>
>>> David
>>>
>>>
>>>
>>>
>>> On Sun, Feb 28, 2016 at 12:13 AM, Xinliang David Li <davidxl at google.com>
>>> wrote:
>>>
>>>>
>>>> On Sat, Feb 27, 2016 at 6:50 PM, Sean Silva <chisophugis at gmail.com>
>>>> wrote:
>>>>
>>>>> I have thought about this issue too, in the context of games. We may
>>>>> want to turn profiling on only for certain frames (essentially, this is
>>>>> many small profile runs).
>>>>>
>>>>> However, I have not seen it demonstrated that this kind of refined
>>>>> data collection will actually improve PGO results in practice.
>>>>> The evidence I do have, though, is that IIRC Apple found that almost all
>>>>> of the benefit of PGO for the Clang binary can be obtained with a
>>>>> handful of training runs of Clang. Are your findings different?
>>>>>
>>>>
>>>> We have a very wide customer base, so we cannot claim that one use model
>>>> is sufficient for all users. For instance, we have users who use
>>>> fine-grained profile dumping control (programmatically), as you described
>>>> above. There are also other possible use cases, such as dumping profiles
>>>> for different periodic phases into files associated with those phases;
>>>> later, each phase's profile data can be merged with different weights.
>>>>
>>>>
>>>>>
>>>>> Also, in general, I am very wary of file locking. This can cause huge
>>>>> amounts of slowdown for a build and has potential portability problems.
>>>>>
>>>>
>>>> I don't see much slowdown with a clang build using instrumented clang as
>>>> the build compiler. With file locking and profile merging enabled, the
>>>> build time on my local machine looks like:
>>>>
>>>> real    18m22.737s
>>>> user    293m18.924s
>>>> sys     9m55.532s
>>>>
>>>> If profile merging/locking is disabled (i.e., the profile dumpers
>>>> clobber/write over each other), the real time is about 14m.
>>>>
>>>>
>>>>> I don't see it as a substantially better solution than wrapping clang
>>>>> in a script that runs clang and then just calls llvm-profdata to do the
>>>>> merging. Running llvm-profdata is cheap compared to doing locking in a
>>>>> highly parallel situation like a build.
>>>>>
>>>>
>>>> That would require synchronization for merging too.
>>>>
>>>> From Justin's email, it looks like there is a key point I have not made
>>>> clear: the on-line profile merge is a very simple raw-profile-to-raw-profile
>>>> merge, which is super fast. The end result of the profile run is still in
>>>> raw format. The raw-to-indexed merge is still needed -- but instead of
>>>> merging thousands of raw profiles, which can be very slow, with this
>>>> model only one raw profile input is needed.
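>>>>
>>>> To make the flow concrete, here is a simplified sketch of the locked dump
>>>> path (the counter region is modeled as a flat array, most error handling
>>>> is elided, and the names are placeholders -- not the actual runtime code):
>>>>
>>>>   #include <fcntl.h>
>>>>   #include <stdint.h>
>>>>   #include <sys/file.h>
>>>>   #include <unistd.h>
>>>>
>>>>   #define NUM_COUNTERS 4096          /* stand-in for the counter section */
>>>>   static uint64_t LiveCounters[NUM_COUNTERS];
>>>>
>>>>   int dumpWithMerge(const char *Filename) {
>>>>     int FD = open(Filename, O_RDWR | O_CREAT, 0644);
>>>>     if (FD < 0)
>>>>       return -1;
>>>>     flock(FD, LOCK_EX);              /* serialize with other processes */
>>>>
>>>>     /* Fold whatever a previous process already dumped into our counters. */
>>>>     uint64_t OnDisk[NUM_COUNTERS];
>>>>     if (read(FD, OnDisk, sizeof(OnDisk)) == (ssize_t)sizeof(OnDisk))
>>>>       for (unsigned I = 0; I < NUM_COUNTERS; ++I)
>>>>         LiveCounters[I] += OnDisk[I];
>>>>
>>>>     /* Write the merged result back, still in raw (not indexed) format. */
>>>>     lseek(FD, 0, SEEK_SET);
>>>>     (void)write(FD, LiveCounters, sizeof(LiveCounters));
>>>>
>>>>     flock(FD, LOCK_UN);
>>>>     close(FD);
>>>>     return 0;
>>>>   }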
>>>>
>>>> thanks,
>>>>
>>>> David
>>>>
>>>>
>>>>>
>>>>>
>>>>> -- Sean Silva
>>>>>
>>>>> On Sat, Feb 27, 2016 at 6:02 PM, Xinliang David Li via llvm-dev <
>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>>> One of the main missing features in the Clang/LLVM profile runtime is
>>>>>> support for online/in-process profile merging. Profile data collected
>>>>>> for different workloads of the same executable binary currently has to
>>>>>> be dumped separately and merged later by the offline post-processing
>>>>>> tool. This limitation makes it hard to handle cases where the
>>>>>> instrumented binary needs to be run with a large number of small
>>>>>> workloads, possibly in parallel. For instance, to do PGO for clang, we
>>>>>> may choose to build a large project with the instrumented Clang binary.
>>>>>> This is problematic because:
>>>>>>  1) to avoid profiles from different runs overriding each other, %p
>>>>>> substitution needs to be specified on the command line or in an
>>>>>> environment variable so that each process dumps profile data into its
>>>>>> own file, named using its pid. This creates a huge disk storage
>>>>>> requirement. For instance, clang's raw profile size is typically ~80M --
>>>>>> if the instrumented clang is used to build a medium to large project
>>>>>> (such as clang itself), the profile data can easily use up hundreds of
>>>>>> gigabytes of local storage (~80 MB x ~3,300 compiles is roughly 260 GB
>>>>>> for a clang self build).
>>>>>>  2) pids can also be recycled, which means that some of the profile
>>>>>> data may be overwritten without being noticed.
>>>>>>
>>>>>> The way to solve this problem is to allow profile data to be merged in
>>>>>> process. I have a prototype implementation and plan to send it out for
>>>>>> review soon after some cleanups. By default, profile merging is off, and
>>>>>> it can be turned on with a user option or via an environment variable.
>>>>>> The following summarizes the issues involved in adding this feature:
>>>>>>  1. the target platform needs to have file locking support;
>>>>>>  2. there needs to be an efficient way to identify the profile data and
>>>>>> associate it with the binary, using a binary/profdata signature (a rough
>>>>>> sketch follows the list);
>>>>>>  3. currently, without merging, profile data from shared libraries
>>>>>> (including dlopen/dlclose ones) is concatenated into the primary profile
>>>>>> file. This can complicate matters, as the merger needs to find the
>>>>>> matching shared libs and also needs to avoid unnecessary data
>>>>>> movement/copying;
>>>>>>  4. value profile data is variable in length even for the same binary.
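>>>>>>
>>>>>> A rough sketch for 2 (hypothetical header layout and names, not the
>>>>>> actual raw-profile format): compare a signature recorded in the on-disk
>>>>>> profile with the one baked into the running binary, and only merge when
>>>>>> they match.
>>>>>>
>>>>>>   #include <stdint.h>
>>>>>>
>>>>>>   typedef struct {
>>>>>>     uint64_t Magic;
>>>>>>     uint64_t Version;
>>>>>>     uint64_t BinarySignature; /* e.g. a hash over the instrumented
>>>>>>                                  functions' names and counter layout */
>>>>>>   } RawHeaderSketch;
>>>>>>
>>>>>>   /* Merge only if the on-disk profile was produced by this binary. */
>>>>>>   static int profileMatchesBinary(const RawHeaderSketch *OnDisk,
>>>>>>                                   const RawHeaderSketch *InProcess) {
>>>>>>     return OnDisk->Magic == InProcess->Magic &&
>>>>>>            OnDisk->Version == InProcess->Version &&
>>>>>>            OnDisk->BinarySignature == InProcess->BinarySignature;
>>>>>>   }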
>>>>>>
>>>>>> All the above issues are resolved, and the clang self build with the
>>>>>> instrumented binary passes (with both -j1 and high parallelism).
>>>>>>
>>>>>> If you have any concerns, please let me know.
>>>>>>
>>>>>> thanks,
>>>>>>
>>>>>> David
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>