[llvm-dev] Add support for in-process profile merging in profile-runtime

Xinliang David Li via llvm-dev llvm-dev at lists.llvm.org
Tue Mar 1 15:46:14 PST 2016


On Tue, Mar 1, 2016 at 3:41 PM, Vedant Kumar <vsk at apple.com> wrote:

>
> > On Mar 1, 2016, at 3:33 PM, Xinliang David Li <davidxl at google.com>
> wrote:
> >
> >
> >
> > On Mon, Feb 29, 2016 at 8:41 AM, Vedant Kumar <vsk at apple.com> wrote:
> > + 1 to Sean's suggestion of using a wrapper script to call profdata
> merge.
> >
> > David, does that work for your use case?
> >
> >  Every user who needs this use model will be forced to implement this
> solution which does not scale. The proposed feature is relatively simple to
> do.
> >
> > Some inline comments ---
> >
> > > On Feb 28, 2016, at 10:45 AM, Mehdi Amini via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> > >
> > >>
> > >> On Feb 28, 2016, at 12:46 AM, Xinliang David Li via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> > >>
> > >> Justin, looks like there is some misunderstanding in my email. I want
> to clarify it here first:
> > >>
> > >> 1) I am not proposing changing the default profile dumping model as
> used today. The online merging is totally optional;
> > >> 2) the on-line profile merging is not doing conversion from raw to
> index format. It does very simple raw-to-raw merging using existing runtime
> APIs.
> > >> 3) the change to existing profile runtime code is just a few lines.
> All the new functionality is isolated in one new file. It will become clear
> when the patch is sent out later.
> > >>
> > >> My inline replies below:
> > >>
> > >>
> > >> On Sat, Feb 27, 2016 at 11:44 PM, Justin Bogner <
> mail at justinbogner.com> wrote:
> > >> Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org> writes:
> > >> > One of the main missing features in Clang/LLVM profile runtime is
> the lack of
> > >> > support for online/in-process profile merging support. Profile data
> collected
> > >> > for different workloads for the same executable binary need to be
> collected
> > >> > and merged later by the offline post-processing tool.  This
> limitation makes
> > >> > it hard to handle cases where the instrumented binary needs to be
> run with
> > >> > large number of small workloads, possibly in parallel.  For
> instance, to do
> > >> > PGO for clang, we may choose to  build  a large project with the
> instrumented
> > >> > Clang binary. This is because
> > >> >
> > >> > 1) to avoid profile from different runs from overriding others, %p
> > >> >    substitution needs to be specified in either the command line or
> an
> > >> >    environment variable so that different process can dump profile
> data
> > >> >    into its own file named using pid.
> > >>
> > >> ... or you can specify a more specific name that describes what's
> under
> > >> test, instead of %p.
> > >>
> > >> yes -- but the problem still exists -- each training process will
> need its own copy of raw profile.
> > >>
> > >>
> > >> >    This will create huge requirement on the disk storage. For
> > >> >    instance, clang's raw profile size is typically 80M -- if the
> > >> >    instrumented clang is used to build a medium to large size
> project
> > >> >    (such as clang itself), profile data can easily use up hundreds
> of
> > >> >    Gig bytes of local storage.
> > >>
> > >> This argument is kind of confusing. It says that one profile is
> > >> typicially 80M, then claims that this uses 100s of GB of data. From
> > >> these statements that only makes sense I suppose that's true if you
> run
> > >> 1000 profiling runs without merging the data in between. Is that what
> > >> you're talking about, or did I miss something?
> > >>
> > >> Yes. For instance,  first build a  clang with
> -fprofile-instr-generate=prof.data.%p, and use this instrumented clang to
> build another large project such as clang itself. The second build will
> produce tons of profile data.
> > >>
> > >>
> > >> > 2) pid can also be recycled. This means that some of the profile
> data may be
> > >> >    overridden without being noticed.
> > >> >
> > >> > The way to solve this problem is to allow profile data to be merged
> in
> > >> > process.
> > >>
> > >> I'm not convinced. Can you provide some more concrete examples of
> where
> > >> the out of process merging model fails? This was a *very deliberate*
> > >> design decision in how clang's profiling works, and most of the
> > >> subsequent decisions have been based on this initial one. Changing it
> > >> has far reaching effects.
> > >>
> > >> I am not proposing changing the out of process merging -- it is still
> needed. What I meant is that, in a scenario where the instrumented binaries
> are running multiple times (using their existing running harness), there is
> no good/automatic way of making sure different process's profile data won't
> have name conflict.
> > >
> > > Could the profile file be named from a hash of the profile data
> themselves?
> >
> > IIUC David's plan is to repeatedly use the same on-disk raw-profile.
> >
> >
> > > Even with that I still like the direction you're heading to. To avoid
> contention on the "file locking", could there be a pool of output files? I
> don't know how to map a new process to one output file without a lock
> somewhere though.
> > >
> > > --
> > > Mehdi
> > >
> > >
> > >>
> > >> Using clang's self build (using instrumented clang as build compiler
> for profile bootstrapping) as an example.  Ideally this should all be done
> transparently  -- i.e, set the instrumented compiler as the build compiler,
> run ninja or make and things will just work, but with the current default
> profile dumping mode, it can fail in many different ways:
> > >> 1) Just run ninja/make -- all clang processes will dump profile into
> the same file concurrently -- the result is a corrupted profile -- FAIL
> > >> 2) run ninja with LLVM_PROFILE_FILE=....%p
> > >>    2.1) failure mode #1 --> really slow build due to large IO; or
> running out of diskspace
> > >>    2.2) failure mode #2 --> pid recyling leading to profile file name
> conflict -- profile overwriting happens and we loss data
> > >>
> > >> Suppose 2) above finally succeeds, the user will have to merge
> thousands of raw profiles to indexed profile.
> > >>
> > >> With the proposed profile on-line merging, you just need to use the
> instrumented clang, and one merged raw profile data automagically produced
> in the end. The raw to indexed merge is also much faster.
> > >>
> > >> The online merge feature has a huge advantage when considering
> integrating the instrumented binary with existing make systems or
> loadtesting harness -- it is almost seamless.
> > >>
> > >>
> > >>
> > >> > I have a prototype implementation and plan to send it out for review
> > >> > soon after some clean ups. By default, the profiling merging is off
> and it can
> > >> > be turned on with an user option or via an environment variable.
> The following
> > >> > summarizes the issues involved in adding this feature:
> > >> > 1. the target platform needs to have file locking support
> > >> > 2. there needs an efficient way to identify the profile data and
> associate it
> > >> >    with the binary using binary/profdata signature;
> > >> > 3. Currently without merging, profile data from shared libraries
> > >> >    (including dlopen/dlcose ones) are concatenated into the primary
> > >> >    profile file. This can complicate matters, as the merger also
> needs to
> > >> >    find the matching shared libs, and the merger also needs to avoid
> > >> >    unnecessary data movement/copy;
> > >> > 4. value profile data is variable in length even for the same
> binary.
> > >>
> > >> If we actually want this, we should reconsider the design of having a
> > >> raw vs processed profiling format. The raw profile format is
> > >> specifically designed to be fast to write out and not to consider
> > >> merging profiles at all. This feature would make it nearly as
> > >> complicated as the processed format and lose all of the advantages of
> > >> making them different.
> > >>
> > >> See above -- all the nice raw profile dumping mechanism is still kept
> -- there won't be a change of that.
> > >>
> > >> thanks,
> > >>
> > >> David
> > >>
> > >>
> > >>
> > >> > All the above issues are resolved and clang self build with
> instrumented
> > >> > binary passes (with both j1 and high parallelism).
> > >> >
> > >> > If you have any concerns, please let me know.
> > >>
> > >> _______________________________________________
> > >> LLVM Developers mailing list
> > >> llvm-dev at lists.llvm.org
> > >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > >
> > > _______________________________________________
> > > LLVM Developers mailing list
> > > llvm-dev at lists.llvm.org
> > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > >
> > > However, I have not seen it demonstrated that this kind of refined
> data collection will actually improve PGO results in practice.
> >
> > That's an important question to ask. But, if users do not want profiles
> from certain runs, shouldn't they either (1) pass
> LLVM_PROFILE_FILE=/dev/null, or (2) not instrument the binary?
> >
> > Having ways to skip profile for certain runs is probably orthogonal to
> this feature. Yes, using LLVM_PROFILE_FILE=/dev/null can be one way to do
> it.
> >
> >
> >
> > > Also, in general, I am very wary of file locking. This can cause huge
> amounts of slowdown for a build and has potential portability problems.
> >
> > Yeah, these things aren't great, but as long as they are opt-in I think
> it'd be reasonable.
> >
> > Also to be fair, writing out tons of raw profiles to separate files is a
> lot slower than overwriting the same raw profile.
> >
> >
> > > I don't see it as a substantially better solution than wrapping clang
> in a script that runs clang and then just calls llvm-profdata to do the
> merging. Running llvm-profdata is cheap compared to doing locking in a
> highly parallel situation like a build.
> >
> > ^ + 1.
> >
> >
> > Doing merging using llvm-profdata requires locking for the indexed data
> file too. There are many issues:
> > 1) we don't want to teach llvm-profdata to do locking, or
> > 2) currently seems not possible to pass the file lock ownership from
> parent wrapper to llvm-profdata
>
> Is this step needed to make sure another process doesn't overwrite a
> temporary raw profile before it's merged into the accumulated profile?
>
>
Instrumented binary process produces temporary raw profile data will use
tricks like %p that won't collide with other process (such temp files will
be deleted by the wrapper after merging). It is the indexed profile
produced by llvm-profdata that needs synchronization in this scheme.

David


>
> > 3) llvm-profdata is a host tool which may not run on the target system
>
> IMO this is the best reason to extend compiler-rt to write out accumulated
> raw profiles.
>
>
> > 4) compared with raw merging/dumping, profile conversion is slower which
> had more negative impact on parallel build.
>
> Good point.
>
> vedant
>
> >
> > thanks,
> >
> > David
> >
> >
> > vedant
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160301/fbe422fe/attachment-0001.html>


More information about the llvm-dev mailing list