[llvm-dev] RFC: Pass to prune redundant profiling instrumentation

Xinliang David Li via llvm-dev llvm-dev at lists.llvm.org
Fri Mar 11 12:57:14 PST 2016


On Fri, Mar 11, 2016 at 12:47 PM, Vedant Kumar <vsk at apple.com> wrote:

> There have been a lot of responses. I'll try to summarize the thread and
> respond to some of the questions/feedback.
>
>
> Summary
> =======
>
> 1. We should teach GlobalDCE to strip out instrumented functions which the
>    inliner cannot delete.
>
> 2. Sean suggests adding metadata to loads/stores of counters to convey that
>    they do not alias normal program data.
>
>    I'm not familiar with AA, but off-hand this seems like a good idea.
>
> 3. Sean also suggests improving the optimizer to registerize counters when
>    possible.
>
>    Ditto, seems like a good idea.
>
> 4. Sean has an alternate proposal for counter pruning which works by
>    remapping FE-generated counters to counters generated by the IR
>    instrumentation pass. David likes this proposal.
>
>
I did not express a preference for this translation scheme; rather, I proposed
using IR-based instrumentation in the longer term :)


>    I have some questions about this but am willing to try it. IMO, the
>    pruning pass I've suggested could complement this nicely -- I don't view
>    them as mutually exclusive (though I could be wrong).
>
> 5. There seems to be consensus over the need to improve performance of
>    FE-instrumented programs.
>
> 6. David is concerned that the proposed pruning pass is too narrow in scope,
>    and would like to see results on a larger benchmark. I'll try to get some
>    numbers.
>
> 7. David mentioned some optimizations that are possible if we make coverage
>    information "yes or no" (per line). Justin pointed out that this doesn't
>    work for a few of Apple's use cases.
>

What Justin said, IIUC, is that it is not doable with the way implicit counts
are currently generated. This does not seem to be related to Apple's use
cases.

David



>
>
> Clarifications
> ==============
>
> 1. In my original email, two of my numbers were incorrect. I listed the
>    sizes of the __llvm_prf_alias sections, not the number of actual aliases.
>    The corrected lines are:
>
> O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, _42_ aliases)
> O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, _44_ aliases)
>
> 2.
>
> >>  Determine which profile counters are essential.
> >
> > What is an "essential counter"?
>
> The pass I proposed works like this:
>
>   for (PC : ProfileCounters) {
>     if (Update-Sites-Of(PC) == Update-Sites-Of(Other-PC)) {
>       mark PC as "essential"
>       mark Other-PC as "redundant" (i.e., an alias of PC)
>     }
>   }
>
> Where Update-Sites-Of(PC) is the set of basic blocks in which PC is updated.
> If multiple counters are updated at every update site of PC, they are all
> marked as aliases of PC in one pass.
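>
> To make the grouping concrete, here is a small, self-contained C++ sketch of
> the core idea (illustrative only, not the actual pass; the counter names and
> block IDs are made up). It buckets counters by their update-site sets and
> keeps one representative per bucket:
>
>   #include <iostream>
>   #include <set>
>   #include <map>
>   #include <string>
>   #include <utility>
>   #include <vector>
>
>   using UpdateSites = std::set<int>; // IDs of the blocks updating a counter
>
>   int main() {
>     // Hypothetical input: which basic blocks update which counter.
>     std::vector<std::pair<std::string, UpdateSites>> Updates = {
>         {"profc_f2", {0, 3}},
>         {"profc_f1", {0, 3}}, // same update sites as profc_f2 -> redundant
>         {"profc_g", {1}},
>     };
>
>     std::map<UpdateSites, std::string> Essential;
>     // Pairs of {redundant counter, the essential counter it aliases}.
>     std::vector<std::pair<std::string, std::string>> Aliases;
>
>     for (const auto &KV : Updates) {
>       auto It = Essential.find(KV.second);
>       if (It == Essential.end())
>         Essential[KV.second] = KV.first;           // first one seen: essential
>       else
>         Aliases.push_back({KV.first, It->second}); // later ones: redundant
>     }
>
>     for (const auto &A : Aliases)
>       std::cout << A.first << " is an alias of " << A.second << "\n";
>   }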
>
>
> 3.
>
> >> Erase all updates to redundant profile counters.
> >
> > When you say "erase", you actually need to replace the multiple counter
> > increments with *a new* counter, right? I.e. when replacing:
> >
> >     instrprof_increment(profc_f2);
> >     instrprof_increment(profc_f1);
> >
> > you need to emit:
> >
> >     instrprof_increment(profc_f2f1_fused);
>
>
> No. The pass only erases updates to counters which it can prove are
> truly redundant. There is no need to create a new (or fused) counter
> because profc_f1 == profc_f2 in all possible profiles.
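>
> Concretely, for the f1/f2 example from my original mail, the pruned module
> would look roughly like this (a sketch, using the { Dst, Src } alias record
> from the proposal):
>
>   void f2() {
>     instrprof_increment(profc_f2);  // kept: profc_f2 is "essential"
>     // update of profc_f1 erased: profc_f1 is an alias of profc_f2
>   }
>
>   // alias record: { Dst: &profc_f1, Src: &profc_f2 } -- at exit, the
>   // runtime fills profc_f1's count from profc_f2.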
>
> 4.
>
> >> The complexity of this pass is O(N*M), where N is the number of profile
> >> counters, and M is the average number of updates per counter.
> >
> > What you're describing here is not clear to me. But I may have been
> > totally misunderstanding the whole concept :)
>
> Hopefully the pseudo-code helps? Alternatively I can put a WIP patch up
> on Phab.
>
> > It should be just O(N) where N is the number of instructions.
>
> The complexity of the pass is not upper-bounded by the number of
> instructions in the program because certain loads and stores can be visited
> more than once.
>
>
> FE to IR Counter Remapping
> ==========================
>
> I have a question about this plan:
>
> > for each CFG edge:
> >     record which FE counters have ended up associated with it
> > remove FE counters
> > run IR instrumentation pass
> > emit a side table mapping IR instr counters to FE counters
>
> Currently, -instrprof happens early in the pipeline. IIUC this is done to
> allow the optimizer to work with load+add+stores, instead of profile update
> intrinsics.
>
> Say we introduce a counter remapping pass like the one Sean suggested. It
> should be run before -instrprof so that we don't waste time lowering a
> bunch of instrprof_increment intrinsics which we'll have to throw away later.
>
> But that means that the CFGs that the counter remapping pass operates on
> won't reflect changes made by the inliner (or any other optimizations which
> alter the CFG), right?
>
> ISTM the pruning pass I've proposed is useful whether we're doing FE-based
> instrumentation _or_ late instrumentation. Since it operates on loads+stores
> directly, it can clean up redundant counter increments at any point in the
> pipeline (after -instrprof).
>
>
> vedant
>
>
> > On Mar 10, 2016, at 7:48 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
> >
> >>
> >> On Mar 10, 2016, at 7:21 PM, Vedant Kumar via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> >>
> >> Hi,
> >>
> >> I'd like to add a new pass to LLVM which removes redundant profile
> counter
> >> updates. The goal is to speed up code coverage testing and profile
> generation
> >> for PGO.
> >>
> >> I'm sending this email out to describe my approach, share some early
> results,
> >> and gather feedback.
> >>
> >>
> >> Problem Overview
> >> ================
> >>
> >> A profile counter is redundant if it's incremented in exactly the same
> basic
> >> blocks as some other profile counter. Consider the following module:
> >>
> >>   static void f1() {
> >>     instrprof_increment(profc_f1);
> >>   }
> >>
> >>   void f2() {
> >>     instrprof_increment(profc_f2);
> >>     f1();
> >>   }
> >>
> >> Once the inliner runs and deletes f1, we're left with:
> >>
> >>   void f2() {
> >>     instrprof_increment(profc_f2);
> >>     instrprof_increment(profc_f1);
> >>   }
> >>
> >> Now we can say profc_f1 is redundant (or, an alias for profc_f2).
> >>
> >> I've noticed that frontend-based instrumentation can generate many
> redundant
> >> profile counters. This degrades performance and increases code size.
> We can
> >> address the problem by getting rid of redundant counter updates. The
> trick is
> >> to make sure we get back the same profiles.
> >>
> >>
> >> Proposed Solution
> >> =================
> >>
> >> I propose a pruning pass which takes the following steps:
> >>
> >> 1. Delete functions with local linkage and only one use, if that use is
> in
> >>    a profile data record.
> >>
> >>    These functions are left around by the inliner (it doesn't know that
> >>    they're safe to delete). Deleting them reduces code size and
> simplifies
> >>    subsequent analysis of profile counters.
> >
> > I think that this is something GlobalDCE should be taught about (if it
> > does not know already).
> >
> >
> >>
> >> 2. Determine which profile counters are essential.
> >
> > What is an "essential counter"?
> >
> >>
> >> 3. Erase all updates to redundant profile counters.
> >
> > When you say "erase", you actually need to replace the multiple counter
> > increments with *a new* counter, right? I.e. when replacing:
> >
> >     instrprof_increment(profc_f2);
> >     instrprof_increment(profc_f1);
> >
> > you need to emit:
> >
> >     instrprof_increment(profc_f2f1_fused);
> >
> >
> >> 4. Emit the aliases into a new section in the binary.
> >>
> >>    Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some
> >>    changes in compiler-rt are required to walk the alias section and fill
> >>    in the correct execution counts at program exit time.
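> >>
> >>    A rough sketch of what the exit-time fix-up in compiler-rt could look
> >>    like (the __start_/__stop_ section symbols and the function name here
> >>    are illustrative, not the actual runtime interface):
> >>
> >>      #include <cstdint>
> >>
> >>      struct __llvm_prf_alias { uint64_t *Dst; const uint64_t *Src; };
> >>
> >>      // Hypothetical section bounds provided by the linker for the
> >>      // __llvm_prf_alias section.
> >>      extern __llvm_prf_alias __start___llvm_prf_alias[];
> >>      extern __llvm_prf_alias __stop___llvm_prf_alias[];
> >>
> >>      // Would run at program exit (e.g. registered with atexit).
> >>      static void resolveCounterAliases() {
> >>        for (__llvm_prf_alias *A = __start___llvm_prf_alias;
> >>             A != __stop___llvm_prf_alias; ++A)
> >>          *A->Dst = *A->Src; // copy the essential count into the alias
> >>      }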
> >>
> >> This pass needs to be run after the inliner in order to be effective.
> >>
> >> The complexity of this pass is O(N*M), where N is the number of profile
> >> counters, and M is the average number of updates per counter.
> >
> > What you're describing here is not clear to me. But I may have been
> > totally misunderstanding the whole concept :)
> >
> >
> > --
> > Mehdi
> >
> >
> >> In practice it is
> >> a bit faster, since we can skip the analysis of counters which are
> discovered to
> >> be redundant early on in the process.
> >>
> >>
> >> Early Results
> >> =============
> >>
> >> The pruning pass results in 25% speed improvement in the example
> program above
> >> (where f2 is called in a loop 10^8 times).
> >>
> >> Here is a slightly less contrived example:
> >>
> >> #include <vector>
> >> #include <algorithm>
> >> #include <cstdlib>
> >>
> >> static void escape(void *p) {
> >>   asm volatile("" : : "g"(p) : "memory");
> >> }
> >>
> >> int main(int argc, char **argv) {
> >>   std::vector<int> V(atoi(argv[1]));
> >>   escape(reinterpret_cast<void *>(V.data()));
> >>   std::sort(V.begin(), V.end());
> >>   return V[0];
> >> }
> >>
> >> I get the following results on my desktop (10^8 elements, 5 runs each):
> >>
> >> O3:                           0.262s
> >> O3 + PGOInstr:                0.663s
> >> O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, 672 aliases)
> >> O3 + CoverageInstr:           0.690s
> >> O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, 688
> aliases)
> >>
> >>
> >> Next Steps?
> >> ===========
> >>
> >> Is the performance of instrumented code something we think we need to
> fix?
> >> What's an acceptable compile-time overhead for running this pruning
> pass? Is
> >> the general approach a non-starter for anybody?
> >>
> >> I'd like to get some feedback and gauge interest before pursuing this
> further.
> >> Possible next steps include benchmarking instrumented versions of clang
> and
> >> swift on the relevant test suites, running performance tests from lnt,
> running
> >> compile-time tests, and measuring any code size differences.
> >>
> >>
> >> thanks
> >> vedant
> >>
> >> _______________________________________________
> >> LLVM Developers mailing list
> >> llvm-dev at lists.llvm.org
> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>