[llvm-dev] Add support for in-process profile merging in profile-runtime

Tue Mar 1 15:33:31 PST 2016

On Mon, Feb 29, 2016 at 8:41 AM, Vedant Kumar <vsk at apple.com> wrote:

> + 1 to Sean's suggestion of using a wrapper script to call profdata merge.
>
> David, does that work for your use case?
>

 Every user who needs this use model will be forced to implement this
solution which does not scale. The proposed feature is relatively simple to
do.

>
> Some inline comments ---
>
> > On Feb 28, 2016, at 10:45 AM, Mehdi Amini via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> >
> >>
> >> On Feb 28, 2016, at 12:46 AM, Xinliang David Li via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> >>
> >> Justin, looks like there is some misunderstanding in my email. I want
> to clarify it here first:
> >>
> >> 1) I am not proposing changing the default profile dumping model as
> used today. The online merging is totally optional;
> >> 2) the on-line profile merging is not doing conversion from raw to
> index format. It does very simple raw-to-raw merging using existing runtime
> APIs.
> >> 3) the change to existing profile runtime code is just a few lines. All
> the new functionality is isolated in one new file. It will become clear
> when the patch is sent out later.
> >>
> >> My inline replies below:
> >>
> >>
> >> On Sat, Feb 27, 2016 at 11:44 PM, Justin Bogner <mail at justinbogner.com>
> wrote:
> >> Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org> writes:
> >> > One of the main missing features in Clang/LLVM profile runtime is the
> lack of
> >> > support for online/in-process profile merging support. Profile data
> collected
> >> > for different workloads for the same executable binary need to be
> collected
> >> > and merged later by the offline post-processing tool.  This
> limitation makes
> >> > it hard to handle cases where the instrumented binary needs to be run
> with
> >> > large number of small workloads, possibly in parallel.  For instance,
> to do
> >> > PGO for clang, we may choose to  build  a large project with the
> instrumented
> >> > Clang binary. This is because
> >> >
> >> > 1) to avoid profile from different runs from overriding others, %p
> >> >    substitution needs to be specified in either the command line or an
> >> >    environment variable so that different process can dump profile
> data
> >> >    into its own file named using pid.
> >>
> >> ... or you can specify a more specific name that describes what's under
> >> test, instead of %p.
> >>
> >> yes -- but the problem still exists -- each training process will need
> its own copy of raw profile.
> >>
> >>
> >> >    This will create huge requirement on the disk storage. For
> >> >    instance, clang's raw profile size is typically 80M -- if the
> >> >    instrumented clang is used to build a medium to large size project
> >> >    (such as clang itself), profile data can easily use up hundreds of
> >> >    Gig bytes of local storage.
> >>
> >> This argument is kind of confusing. It says that one profile is
> >> typicially 80M, then claims that this uses 100s of GB of data. From
> >> these statements that only makes sense I suppose that's true if you run
> >> 1000 profiling runs without merging the data in between. Is that what
> >> you're talking about, or did I miss something?
> >>
> >> Yes. For instance,  first build a  clang with
> -fprofile-instr-generate=prof.data.%p, and use this instrumented clang to
> build another large project such as clang itself. The second build will
> produce tons of profile data.
> >>
> >>
> >> > 2) pid can also be recycled. This means that some of the profile data
> may be
> >> >    overridden without being noticed.
> >> >
> >> > The way to solve this problem is to allow profile data to be merged in
> >> > process.
> >>
> >> I'm not convinced. Can you provide some more concrete examples of where
> >> the out of process merging model fails? This was a *very deliberate*
> >> design decision in how clang's profiling works, and most of the
> >> subsequent decisions have been based on this initial one. Changing it
> >> has far reaching effects.
> >>
> >> I am not proposing changing the out of process merging -- it is still
> needed. What I meant is that, in a scenario where the instrumented binaries
> are running multiple times (using their existing running harness), there is
> no good/automatic way of making sure different process's profile data won't
> have name conflict.
> >
> > Could the profile file be named from a hash of the profile data
> themselves?
>
> IIUC David's plan is to repeatedly use the same on-disk raw-profile.
>
>
> > Even with that I still like the direction you're heading to. To avoid
> contention on the "file locking", could there be a pool of output files? I
> don't know how to map a new process to one output file without a lock
> somewhere though.
> >
> > --
> > Mehdi
> >
> >
> >>
> >> Using clang's self build (using instrumented clang as build compiler
> for profile bootstrapping) as an example.  Ideally this should all be done
> transparently  -- i.e, set the instrumented compiler as the build compiler,
> run ninja or make and things will just work, but with the current default
> profile dumping mode, it can fail in many different ways:
> >> 1) Just run ninja/make -- all clang processes will dump profile into
> the same file concurrently -- the result is a corrupted profile -- FAIL
> >> 2) run ninja with LLVM_PROFILE_FILE=....%p
> >>    2.1) failure mode #1 --> really slow build due to large IO; or
> running out of diskspace
> >>    2.2) failure mode #2 --> pid recyling leading to profile file name
> conflict -- profile overwriting happens and we loss data
> >>
> >> Suppose 2) above finally succeeds, the user will have to merge
> thousands of raw profiles to indexed profile.
> >>
> >> With the proposed profile on-line merging, you just need to use the
> instrumented clang, and one merged raw profile data automagically produced
> in the end. The raw to indexed merge is also much faster.
> >>
> >> The online merge feature has a huge advantage when considering
> integrating the instrumented binary with existing make systems or
> loadtesting harness -- it is almost seamless.
> >>
> >>
> >>
> >> > I have a prototype implementation and plan to send it out for review
> >> > soon after some clean ups. By default, the profiling merging is off
> and it can
> >> > be turned on with an user option or via an environment variable. The
> following
> >> > summarizes the issues involved in adding this feature:
> >> > 1. the target platform needs to have file locking support
> >> > 2. there needs an efficient way to identify the profile data and
> associate it
> >> >    with the binary using binary/profdata signature;
> >> > 3. Currently without merging, profile data from shared libraries
> >> >    (including dlopen/dlcose ones) are concatenated into the primary
> >> >    profile file. This can complicate matters, as the merger also
> needs to
> >> >    find the matching shared libs, and the merger also needs to avoid
> >> >    unnecessary data movement/copy;
> >> > 4. value profile data is variable in length even for the same binary.
> >>
> >> If we actually want this, we should reconsider the design of having a
> >> raw vs processed profiling format. The raw profile format is
> >> specifically designed to be fast to write out and not to consider
> >> merging profiles at all. This feature would make it nearly as
> >> complicated as the processed format and lose all of the advantages of
> >> making them different.
> >>
> >> See above -- all the nice raw profile dumping mechanism is still kept
> -- there won't be a change of that.
> >>
> >> thanks,
> >>
> >> David
> >>
> >>
> >>
> >> > All the above issues are resolved and clang self build with
> instrumented
> >> > binary passes (with both j1 and high parallelism).
> >> >
> >> > If you have any concerns, please let me know.
> >>
> >> _______________________________________________
> >> LLVM Developers mailing list
> >> llvm-dev at lists.llvm.org
> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >
> > However, I have not seen it demonstrated that this kind of refined data
> collection will actually improve PGO results in practice.
>
> That's an important question to ask. But, if users do not want profiles
> from certain runs, shouldn't they either (1) pass
> LLVM_PROFILE_FILE=/dev/null, or (2) not instrument the binary?
>

Having ways to skip profile for certain runs is probably orthogonal to this
feature. Yes, using LLVM_PROFILE_FILE=/dev/null can be one way to do it.

>
>
> > Also, in general, I am very wary of file locking. This can cause huge
> amounts of slowdown for a build and has potential portability problems.
>
> Yeah, these things aren't great, but as long as they are opt-in I think
> it'd be reasonable.
>
> Also to be fair, writing out tons of raw profiles to separate files is a
> lot slower than overwriting the same raw profile.
>
>
> > I don't see it as a substantially better solution than wrapping clang in
> a script that runs clang and then just calls llvm-profdata to do the
> merging. Running llvm-profdata is cheap compared to doing locking in a
> highly parallel situation like a build.
>
> ^ + 1.
>
>
Doing merging using llvm-profdata requires locking for the indexed data
file too. There are many issues:
1) we don't want to teach llvm-profdata to do locking, or
2) currently seems not possible to pass the file lock ownership from parent
wrapper to llvm-profdata
3) llvm-profdata is a host tool which may not run on the target system
4) compared with raw merging/dumping, profile conversion is slower which
had more negative impact on parallel build.

thanks,

David

> vedant
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160301/441aff95/attachment.html>