[llvm-dev] Add support for in-process profile merging in profile-runtime
Xinliang David Li via llvm-dev
llvm-dev at lists.llvm.org
Sun Feb 28 10:59:37 PST 2016
On Sun, Feb 28, 2016 at 10:45 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:
> On Feb 28, 2016, at 12:46 AM, Xinliang David Li via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> Justin, looks like there is some misunderstanding in my email. I want to
> clarify it here first:
> 1) I am not proposing changing the default profile dumping model as used
> today. The online merging is totally optional;
> 2) the on-line profile merging does not convert from raw to indexed
> format. It does very simple raw-to-raw merging using existing runtime APIs.
> 3) the change to existing profile runtime code is just a few lines. All
> the new functionality is isolated in one new file. It will become clear
> when the patch is sent out later.
> My inline replies below:
> On Sat, Feb 27, 2016 at 11:44 PM, Justin Bogner <mail at justinbogner.com> wrote:
>> Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org> writes:
>> > One of the main missing features in Clang/LLVM profile runtime is the
>> > lack of support for online/in-process profile merging. Profile data
>> > for different workloads for the same executable binary need to be
>> > collected separately and merged later by the offline post-processing
>> > tool. This limitation makes it hard to handle cases where the
>> > instrumented binary needs to be run with a large number of small
>> > workloads, possibly in parallel. For instance, to do
>> > PGO for clang, we may choose to build a large project with the
>> > instrumented Clang binary. This is because
>> > 1) to avoid profiles from different runs overwriting each other, %p
>> > substitution needs to be specified in either the command line or an
>> > environment variable so that each process can dump profile data
>> > into its own file named using its pid.
>> ... or you can specify a more specific name that describes what's under
>> test, instead of %p.
> yes -- but the problem still exists -- each training process will need its
> own copy of raw profile.
>> > This will create a huge disk storage requirement. For
>> > instance, clang's raw profile size is typically 80M -- if the
>> > instrumented clang is used to build a medium to large size project
>> > (such as clang itself), profile data can easily use up hundreds of
>> > gigabytes of local storage.
>> This argument is kind of confusing. It says that one profile is
>> typically 80M, then claims that this uses 100s of GB of data. That
>> only makes sense, I suppose, if you run 1000 profiling runs without
>> merging the data in between. Is that what
>> you're talking about, or did I miss something?
> Yes. For instance, first build a clang with
> -fprofile-instr-generate=prof.data.%p, and use this instrumented clang to
> build another large project such as clang itself. The second build will
> produce tons of profile data.
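To put rough numbers on this, here is a back-of-the-envelope sketch. The 80 MB figure is the one quoted above; the number of compiler invocations (3000) is purely an assumption for illustration, not a measurement of any real build.

```python
# 80 MB per raw profile comes from the thread; the process count is hypothetical.
RAW_PROFILE_MB = 80
compile_processes = 3000  # assumed number of compiler invocations in the build


def total_profile_gb(profile_mb: int, processes: int) -> float:
    """Disk space in GB (1 GB = 1000 MB) if every process keeps its own raw profile."""
    return profile_mb * processes / 1000


print(f"{total_profile_gb(RAW_PROFILE_MB, compile_processes):.0f} GB")  # prints "240 GB"
```

With merging into a single raw profile, the footprint would stay near the size of one profile regardless of the process count.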
>> > 2) pid can also be recycled. This means that some of the profile data
>> > may be overwritten without being noticed.
>> > The way to solve this problem is to allow profile data to be merged in
>> > process.
>> I'm not convinced. Can you provide some more concrete examples of where
>> the out of process merging model fails? This was a *very deliberate*
>> design decision in how clang's profiling works, and most of the
>> subsequent decisions have been based on this initial one. Changing it
>> has far reaching effects.
> I am not proposing changing the out of process merging -- it is still
> needed. What I meant is that, in a scenario where the instrumented binaries
> are run multiple times (using their existing running harness), there is
> no good/automatic way of making sure different processes' profile data won't
> have name conflicts.
> Could the profile file be named from a hash of the profile data themselves?
That will solve the name conflict problem -- but the problem with larger
resource requirement is still there :)
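As a minimal sketch of the hash-naming idea (not something the profile runtime does today): the function name, the "prof" prefix, and the choice of SHA-256 truncated to 16 hex digits are all assumptions made for illustration.

```python
import hashlib


def hashed_profile_name(raw_profile: bytes, prefix: str = "prof") -> str:
    """Derive a file name from the profile contents instead of the pid (hypothetical scheme)."""
    digest = hashlib.sha256(raw_profile).hexdigest()[:16]
    return f"{prefix}.{digest}.profraw"


# Two runs with different counters get different names, so neither overwrites the other.
name_a = hashed_profile_name(b"counters from run A")
name_b = hashed_profile_name(b"counters from run B")
print(name_a != name_b)
```

Note that two runs producing byte-identical profiles would map to the same file and be counted once, and every distinct run still leaves its own full-size raw profile on disk.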
> Even with that I still like the direction you're heading to. To avoid
> contention on the "file locking", could there be a pool of output files? I
> don't know how to map a new process to one output file without a lock
> somewhere though.
We can certainly introduce something like that when lock contention becomes
an issue (which is a great idea by the way, and quite straightforward to
implement). From my initial experiment, it does not seem to be a problem.
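One way the pool idea could look, as a sketch only: each process tries the pool files in order with a non-blocking lock and takes the first free one. The function name, the pool-N.profraw naming, and the pid-based fallback are assumptions for illustration; this uses flock, so it still relies on a lock, just one that does not wait.

```python
import fcntl
import os
import tempfile


def acquire_pool_file(pool_dir: str, pool_size: int):
    """Return (locked file object, index) for the first pool file that is not busy."""
    os.makedirs(pool_dir, exist_ok=True)
    paths = [os.path.join(pool_dir, f"pool-{i}.profraw") for i in range(pool_size)]
    for index, path in enumerate(paths):
        f = open(path, "a+b")
        try:
            # LOCK_NB makes flock fail immediately instead of waiting for the holder.
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            f.close()
            continue
        return f, index
    # Every pool file is busy: fall back to waiting on one chosen from the pid.
    index = os.getpid() % pool_size
    f = open(paths[index], "a+b")
    fcntl.flock(f, fcntl.LOCK_EX)
    return f, index


# Demo: the second acquisition skips the file the first one still holds.
pool_dir = tempfile.mkdtemp()
first_file, first_index = acquire_pool_file(pool_dir, pool_size=2)
second_file, second_index = acquire_pool_file(pool_dir, pool_size=2)
print(first_index, second_index)
```

The lock is released when the returned file object is closed. The pool files would then be merged offline, so the number of raw profiles is bounded by the pool size rather than by the number of processes.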
> Take clang's self build (using instrumented clang as build compiler for
> profile bootstrapping) as an example. Ideally this should all be done
> transparently -- i.e., set the instrumented compiler as the build compiler,
> run ninja or make and things will just work, but with the current default
> profile dumping mode, it can fail in many different ways:
> 1) Just run ninja/make -- all clang processes will dump profile into the
> same file concurrently -- the result is a corrupted profile -- FAIL
> 2) run ninja with LLVM_PROFILE_FILE=....%p
> 2.1) failure mode #1 --> really slow build due to heavy IO, or running
> out of disk space
> 2.2) failure mode #2 --> pid recycling leading to profile file name
> conflict -- profiles get overwritten and we lose data
> Suppose 2) above finally succeeds, the user will have to merge thousands
> of raw profiles into an indexed profile.
> With the proposed profile on-line merging, you just need to use the
> instrumented clang, and one merged raw profile is automagically produced
> in the end. The raw to indexed merge is also much faster.
> The online merge feature has a huge advantage when considering integrating
> the instrumented binary with existing make systems or load-testing harnesses
> -- it is almost seamless.
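The general shape of raw-to-raw merging under a file lock can be sketched as follows. This is a toy model, not the actual runtime code or the real raw profile layout: the "profile" here is assumed to be a flat array of little-endian 64-bit counters, and the size check merely stands in for a proper binary/profdata signature match.

```python
import fcntl
import os
import struct
import tempfile


def merge_counters(path: str, counters: list[int]) -> list[int]:
    """Add this process's counters into the shared file while holding an exclusive lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+b") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # waits until any other writer has finished
        layout = f"<{len(counters)}Q"  # toy layout: N unsigned 64-bit counters
        data = f.read()
        if data:
            if len(data) != struct.calcsize(layout):
                raise ValueError("existing profile does not match this binary")
            existing = struct.unpack(layout, data)
        else:
            existing = (0,) * len(counters)
        merged = [old + new for old, new in zip(existing, counters)]
        f.seek(0)
        f.write(struct.pack(layout, *merged))
        f.truncate()
        return merged  # closing the file on exit from the with-block drops the lock


# Demo: two "processes" dump into the same file and the counters accumulate.
profile_path = os.path.join(tempfile.mkdtemp(), "merged.profraw")
merge_counters(profile_path, [1, 2, 3])
result = merge_counters(profile_path, [10, 20, 30])
print(result)  # [11, 22, 33]
```

However many processes run, the directory ends up with one file of fixed size, which is the property that makes the approach attractive for large parallel builds.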
>> > I have a prototype implementation and plan to send it out for review
>> > soon after some clean ups. By default, the profile merging is off and
>> > it can be turned on with a user option or via an environment variable.
>> > The following summarizes the issues involved in adding this feature:
>> > 1. the target platform needs to have file locking support
>> > 2. there needs to be an efficient way to identify the profile data and
>> > associate it with the binary using binary/profdata signature;
>> > 3. Currently without merging, profile data from shared libraries
>> > (including dlopen/dlclose ones) are concatenated into the primary
>> > profile file. This can complicate matters, as the merger also needs to
>> > find the matching shared libs, and the merger also needs to avoid
>> > unnecessary data movement/copy;
>> > 4. value profile data is variable in length even for the same binary.
>> If we actually want this, we should reconsider the design of having a
>> raw vs processed profiling format. The raw profile format is
>> specifically designed to be fast to write out and not to consider
>> merging profiles at all. This feature would make it nearly as
>> complicated as the processed format and lose all of the advantages of
>> making them different.
> See above -- all the nice raw profile dumping mechanism is still kept --
> there won't be a change of that.
>> > All the above issues are resolved and clang self build with instrumented
>> > binary passes (with both j1 and high parallelism).
>> > If you have any concerns, please let me know.