[llvm-dev] RFC: Pass to prune redundant profiling instrumentation

Xinliang David Li via llvm-dev llvm-dev at lists.llvm.org
Fri Mar 11 19:26:31 PST 2016


On Fri, Mar 11, 2016 at 7:04 PM, Sean Silva <chisophugis at gmail.com> wrote:

>
>
> On Fri, Mar 11, 2016 at 6:00 PM, Vedant Kumar <vsk at apple.com> wrote:
>
>>
>> > On Mar 11, 2016, at 5:28 PM, Sean Silva <chisophugis at gmail.com> wrote:
>> >
>> >
>> >
>> > On Fri, Mar 11, 2016 at 12:47 PM, Vedant Kumar <vsk at apple.com> wrote:
>> > There have been a lot of responses. I'll try to summarize the thread
>> > and respond to some of the questions/feedback.
>> >
>> >
>> > Summary
>> > =======
>> >
>> > 1. We should teach GlobalDCE to strip out instrumented functions which
>> >    the inliner cannot delete.
>> >
>> > 2. Sean suggests adding metadata to loads/stores of counters to convey
>> >    that they do not alias normal program data.
>> >
>> >    I'm not familiar with AA, but off-hand this seems like a good idea.
>> >
>> > 3. Sean also suggests improving the optimizer to registerize counters
>> >    when possible.
>> >
>> >    Ditto, seems like a good idea.
>> >
>> > 4. Sean has an alternate proposal for counter pruning which works by
>> >    remapping FE-generated counters to counters generated by the IR
>> >    instrumentation pass. David likes this proposal.
>> >
>> >    I have some questions about this but am willing to try it. IMO, the
>> >    pruning pass I've suggested could complement this nicely -- I don't
>> >    view them as mutually exclusive (though I could be wrong).
>> >
>> > 5. There seems to be consensus over the need to improve performance of
>> >    FE-instrumented programs.
>> >
>> > 6. David is concerned that the proposed pruning pass is too narrow in
>> >    scope, and would like to see results on a larger benchmark. I'll try
>> >    to get some numbers.
>> >
>> > 7. David mentioned some optimizations that are possible if we make
>> >    coverage information "yes or no" (per line). Justin pointed out that
>> >    this doesn't work for a few of Apple's use cases.
>> >
>> >
>> > Clarifications
>> > ==============
>> >
>> > 1. In my original email, two of my numbers were incorrect. I listed the
>> >    sizes of the __llvm_prf_alias sections, not the number of actual
>> >    aliases. The corrected lines are:
>> >
>> > O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, _42_ aliases)
>> > O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, _44_ aliases)
>> >
>> > 2.
>> >
>> > >>  Determine which profile counters are essential.
>> > >
>> > > What is an "essential counter"?
>> >
>> > The pass I proposed works like this:
>> >
>> >   for (PC : ProfileCounters) {
>> >     if (Update-Sites-Of(PC) == Update-Sites-Of(Other-PC)) {
>> >       mark PC as "essential"
>> >       mark Other-PC as "redundant" (i.e., an alias of PC)
>> >     }
>> >   }
>> >
>> > Where Update-Sites-Of(PC) is the set of basic blocks in which PC is
>> > updated. If multiple counters are updated at every update site of PC,
>> > they are all marked as aliases of PC in one pass.
>> >
>> >
>> > 3.
>> >
>> > >> Erase all updates to redundant profile counters.
>> > >
>> > > When you are saying "erase", you need to actually replace the
>> > > multiple counter increment with *a new* counter, right? I.e. when
>> > > replacing:
>> > >
>> > >     instrprof_increment(profc_f2);
>> > >     instrprof_increment(profc_f1);
>> > >
>> > > you need to emit:
>> > >
>> > >     instrprof_increment(profc_f2f1_fused);
>> >
>> >
>> > No. The pass only erases updates to counters which it can prove are
>> > truly redundant. There is no need to create a new (or fused) counter
>> > because profc_f1 == profc_f2 in all possible profiles.
>> >
>> > Oh, this makes it substantially less useful I would think (and I
>> > misunderstood your original proposal the same way that Mehdi did here).
>> > I guess I'm not sure what situations your proposed transformation would
>> > catch. The only one I can think of is something like:
>> > - function foo() has an entry counter
>> > - function bar() calls foo() from a BB that has a counter
>> > - function bar() is the only caller of foo() and calls it in only one BB.
>> > Can you add some debug dumping to your pass and give some real-world
>> > examples of counters that it merges?
>>
>> I still have the pruned IR from the example I brought up at the start
>> of the thread:
>>
>>   https://ghostbin.com/paste/qus2s
>>
>> Look for the __llvm_prf_alias section.
>>
>
> Could you maybe digest that a bit for us to give source-level examples of
> what those counters are that end up aliased?
>
> Also, in the linked IR there is still a tremendous amount of counter
> redundancy being left on the table that a "counter fusion" approach would
> avoid.
> E.g.
> https://ghostbin.com/paste/qus2s#L306 - 6 counter updates in the same BB
> https://ghostbin.com/paste/qus2s#L286 - 3 counter updates in the same BB
> https://ghostbin.com/paste/qus2s#L261 - 2 counter updates in the same BB
> https://ghostbin.com/paste/qus2s#L191 - 5 counter updates in the same BB
> https://ghostbin.com/paste/qus2s#L435 - 9 counter updates in the same BB
> https://ghostbin.com/paste/qus2s#L373 - 2 counter updates in the same BB
>
> It seems like the "identical counter merge" leaves a lot on the table that
> even a simple BB-at-a-time counter fusion approach would get right.
>
> Also, notice that in that IR there are 21 BBs but 65 counters (even after
> identical counter merge). This indicates that eliminating the original
> counters entirely (besides recording a way to reconstruct them, but that
> can be compressed) might actually reduce the total amount of data in the
> final binary.
>
>
>>
>>
>> > The fused counter approach (or the one I suggested, which is just a
>> > generalization of that (see below)) seems like it can reduce the number
>> > of counter updates much more.
>> >
>> >
>> > 4.
>> >
>> > >> The complexity of this pass is O(N*M), where N is the number of
>> > >> profile counters, and M is the average number of updates per
>> > >> counter.
>> > >
>> > > What you're describing here is not clear to me? But I may have been
>> > > totally misunderstanding the whole concept :)
>> >
>> > Hopefully the pseudo-code helps? Alternatively I can put a WIP patch up
>> > on Phab.
>> >
>> > > It should be just O(N) where N is the number of instructions.
>> >
>> > The complexity of the pass is not upper-bounded by the number of
>> > instructions in the program because certain loads and stores can be
>> > visited more than once.
>> >
>> >
>> > FE to IR Counter Remapping
>> > ==========================
>> >
>> > I have a question about this plan:
>> >
>> > > for each CFG edge:
>> > >     record which FE counters have ended up associated with it
>> > > remove FE counters
>> > > run IR instrumentation pass
>> > > emit a side table mapping IR instr counters to FE counters
>> >
>> > Currently, -instrprof happens early in the pipeline. IIUC this is done
>> > to allow the optimizer to work with load+add+stores, instead of profile
>> > update intrinsics.
>> >
>> > Say we introduce a counter remapping pass like the one Sean suggested.
>> > It should be run before -instrprof so that we don't waste time lowering
>> > a bunch of instrprof_increment intrinsics which we'll have to throw
>> > away later.
>> >
>> > But that means that the CFGs that the counter remapping pass operates
>> > on won't reflect changes made by the inliner (or any other
>> > optimizations which alter the CFG), right?
>> >
>> > ISTM the pruning pass I've proposed is useful whether we're doing
>> > FE-based instrumentation _or_ late instrumentation. Since it operates
>> > on loads+stores directly, it can clean up redundant counter increments
>> > at any point in the pipeline (after -instrprof).
>> >
>> >
>> > What I suggested would operate on loads/stores and could be used for
>> > cleanup at any point in the pipeline too.
>> >
>> > It would just be a generic transformation "okay, we have some set of
>> > counter updates chosen earlier in the pipeline that are now mangled /
>> > made redundant by further optimization; let's re-run counter placement
>> > and emit some side tables mapping the newly chosen counters to the
>> > original counters". I'm not sure how complicated it is to reconstruct
>> > one set of counters from the other (just linear combinations, I think?
>> > So hopefully not too complicated).
>>
>> I think I understand what you're proposing.
>>
>> To not break code coverage, we would need to emit the 'side table' you
>> described into the binary (so that we can correctly evaluate counter
>> expressions).
>>
>
> Yes. Otherwise the original counters cannot be reconstructed. This is no
> different from emitting the aliases needed by the "identical counter merge"
> approach.
>
>
>>
>> While I think this is feasible and can get better results, it might not
>> *also* be simpler than running the pass I described. IMO what you've
>> described would take more work to implement and would put a higher
>> burden on the compiler (the pipeline looks something like -instrprof,
>> -remove-counters, -pgo-instr-gen, -remap-counters).
>> That's not a bad thing -- just something to be aware of.
>>
>
> FWIW, I was imagining that this would be a single pass that would call
> into the IR instr gen as a utility (some refactoring would probably be
> needed, but that's doable).
>
> I expect it to be more work to implement the fully general thing I
> described. But implementing a BB-at-a-time counter fusion I think would
> likely be not much more complicated than "identical counter merge". It
> would just be:
>
> for each BB:
>     collect all counter updates in this BB, together with their
>       multiplicities
>     create new counter
>     emit side-table data that relates the new counter to an array of
>       (other counter, multiplicity of update)
>
> The runtime just emits the side-table and then llvm-profdata does:
>
> for each counter C:
>     for (otherCounter, multiplicity) in side-table[C]:
>         counters[otherCounter] += multiplicity * counters[C]
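>
> A rough C++ sketch of that llvm-profdata step (the side-table layout
> here is hypothetical, chosen just to pin down the arithmetic):
>
>     #include <cstdint>
>     #include <vector>
>
>     struct SideTableEntry {
>       unsigned OtherCounter; // index of an original counter
>       uint64_t Multiplicity; // updates folded into the fused counter
>     };
>
>     // Reconstruct the original counters from the fused counters' counts.
>     void applySideTable(
>         const std::vector<std::vector<SideTableEntry>> &SideTable,
>         const std::vector<uint64_t> &FusedCounts,
>         std::vector<uint64_t> &OriginalCounts) {
>       for (size_t C = 0; C != FusedCounts.size(); ++C)
>         for (const SideTableEntry &E : SideTable[C])
>           OriginalCounts[E.OtherCounter] += E.Multiplicity * FusedCounts[C];
>     }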
>
>
There are other issues that can complicate the matter.

1) The algorithm assumes that each source counter has only one update
site -- but it may have more than one update site due to inlining (see
the sketch below). Only the counts collected at the update site in the
out-of-line function instance can be safely propagated to the 'fused'
counters.
2) The 'source' BB may not even have a 'source' counter owned by the
caller function, so 'implicit' counters would need to be introduced.
3) The side table can be huge for large applications.
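
To illustrate 1), reusing the f1/f2 example from the original mail
(hypothetical code): after inlining, profc_f1 can end up with several
update sites, not one.

    void f2() {                        // f1 inlined here
      instrprof_increment(profc_f2);
      instrprof_increment(profc_f1);   // update site 1
    }
    void f3() {                        // f1 also inlined here
      instrprof_increment(profc_f3);
      instrprof_increment(profc_f1);   // update site 2
    }
    void f1() {                        // out-of-line instance
      instrprof_increment(profc_f1);   // update site 3
    }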

IMO, doing post-cleanup (other than enhancing generic optimizations that
can be applied to other globals) is not the right approach.

David




> -- Sean Silva
>
>
>>
>> vedant
>>
>>
>> > In other words, what I was suggesting is basically a generalization of
>> > Mehdi's "fused" counter transformation.
>> >
>> > Ideally this would be a reusable transformation. We could run it one or
>> > multiple times throughout the optimization pipeline, and probably want
>> > to always run it at least once at the very end before codegen. (note
>> > that you only ever need at most a single side-table mapping the
>> > "current" counters to the "original" counters no matter how many times
>> > you re-run counter placement, since every time you re-run counter
>> > placement you are emitting a completely new side-table and deleting the
>> > old one, and the side table always reconstructs the original counters).
>> >
>> > The availability of this facility would enable frontends to be as naive
>> > as they want about their counter placement, which is probably a net win
>> > for LLVM as a toolkit. It would be analogous to the alloca trick that
>> > frontends use to avoid having to construct SSA themselves: the
>> > existence of mem2reg allows them to not have to worry about it.
>> >
>> > -- Sean Silva
>> >
>> >
>> >
>> > vedant
>> >
>> >
>> > > On Mar 10, 2016, at 7:48 PM, Mehdi Amini <mehdi.amini at apple.com>
>> > > wrote:
>> > >
>> > >>
>> > >> On Mar 10, 2016, at 7:21 PM, Vedant Kumar via llvm-dev
>> > >> <llvm-dev at lists.llvm.org> wrote:
>> > >>
>> > >> Hi,
>> > >>
>> > >> I'd like to add a new pass to LLVM which removes redundant profile
>> > >> counter updates. The goal is to speed up code coverage testing and
>> > >> profile generation for PGO.
>> > >>
>> > >> I'm sending this email out to describe my approach, share some early
>> > >> results, and gather feedback.
>> > >>
>> > >>
>> > >> Problem Overview
>> > >> ================
>> > >>
>> > >> A profile counter is redundant if it's incremented in exactly the
>> > >> same basic blocks as some other profile counter. Consider the
>> > >> following module:
>> > >>
>> > >>   static void f1() {
>> > >>     instrprof_increment(profc_f1);
>> > >>   }
>> > >>
>> > >>   void f2() {
>> > >>     instrprof_increment(profc_f2);
>> > >>     f1();
>> > >>   }
>> > >>
>> > >> Once the inliner runs and deletes f1, we're left with:
>> > >>
>> > >>   void f2() {
>> > >>     instrprof_increment(profc_f2);
>> > >>     instrprof_increment(profc_f1);
>> > >>   }
>> > >>
>> > >> Now we can say profc_f1 is redundant (or, an alias for profc_f2).
>> > >>
>> > >> I've noticed that frontend-based instrumentation can generate many
>> > >> redundant profile counters. This degrades performance and increases
>> > >> code size. We can address the problem by getting rid of redundant
>> > >> counter updates. The trick is to make sure we get back the same
>> > >> profiles.
>> > >>
>> > >>
>> > >> Proposed Solution
>> > >> =================
>> > >>
>> > >> I propose a pruning pass which takes the following steps:
>> > >>
>> > >> 1. Delete functions with local linkage and only one use, if that use
>> > >>    is in a profile data record.
>> > >>
>> > >>    These functions are left around by the inliner (it doesn't know
>> > >>    that they're safe to delete). Deleting them reduces code size and
>> > >>    simplifies subsequent analysis of profile counters.
>> > >
>> > > I think that this is something GlobalDCE should be taught about (if
>> > > it does not know already).
>> > >
>> > >
>> > >>
>> > >> 2. Determine which profile counters are essential.
>> > >
>> > > What is an "essential counter"?
>> > >
>> > >>
>> > >> 3. Erase all updates to redundant profile counters.
>> > >
>> > > When you are saying "erase", you need to actually replace the
>> > > multiple counter increment with *a new* counter, right? I.e. when
>> > > replacing:
>> > >
>> > >     instrprof_increment(profc_f2);
>> > >     instrprof_increment(profc_f1);
>> > >
>> > > you need to emit:
>> > >
>> > >     instrprof_increment(profc_f2f1_fused);
>> > >
>> > >
>> > >> 4. Emit the aliases into a new section in the binary.
>> > >>
>> > >>    Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some
>> > >>    changes in compiler-rt are required to walk the alias section and
>> > >>    fill in the correct execution counts at program exit time.
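>> > >>
>> > >>    A minimal C++ sketch of that runtime walk (assuming ELF-style
>> > >>    __start_/__stop_ section symbols; the exact record layout and
>> > >>    function name are illustrative):
>> > >>
>> > >>      #include <cstdint>
>> > >>
>> > >>      struct Alias { uint64_t *Dst; const uint64_t *Src; };
>> > >>      extern Alias __start___llvm_prf_alias[];
>> > >>      extern Alias __stop___llvm_prf_alias[];
>> > >>
>> > >>      // At exit, copy each essential counter's value into the
>> > >>      // redundant counter that aliases it.
>> > >>      static void fixupAliasedCounters() {
>> > >>        for (Alias *A = __start___llvm_prf_alias;
>> > >>             A != __stop___llvm_prf_alias; ++A)
>> > >>          *A->Dst = *A->Src;
>> > >>      }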
>> > >>
>> > >> This pass needs to be run after the inliner in order to be effective.
>> > >>
>> > >> The complexity of this pass is O(N*M), where N is the number of
>> > >> profile counters, and M is the average number of updates per
>> > >> counter.
>> > >
>> > > What you're describing here is not clear to me? But I may have been
>> > > totally misunderstanding the whole concept :)
>> > >
>> > >
>> > > --
>> > > Mehdi
>> > >
>> > >
>> > >> In practice it is a bit faster, since we can skip the analysis of
>> > >> counters which are discovered to be redundant early on in the
>> > >> process.
>> > >>
>> > >>
>> > >> Early Results
>> > >> =============
>> > >>
>> > >> The pruning pass results in a 25% speed improvement in the example
>> > >> program above (where f2 is called in a loop 10^8 times).
>> > >>
>> > >> Here is a slightly less contrived example:
>> > >>
>> > >> #include <vector>
>> > >> #include <algorithm>
>> > >> #include <cstdlib>
>> > >>
>> > >> static void escape(void *p) {
>> > >>   asm volatile("" : : "g"(p) : "memory");
>> > >> }
>> > >>
>> > >> int main(int argc, char **argv) {
>> > >>   std::vector<int> V(atoi(argv[1]));
>> > >>   escape(reinterpret_cast<void *>(V.data()));
>> > >>   std::sort(V.begin(), V.end());
>> > >>   return V[0];
>> > >> }
>> > >>
>> > >> I get the following results on my desktop (10^8 elements, 5 runs
>> > >> each):
>> > >>
>> > >> O3:                           0.262s
>> > >> O3 + PGOInstr:                0.663s
>> > >> O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, 672 aliases)
>> > >> O3 + CoverageInstr:           0.690s
>> > >> O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, 688 aliases)
>> > >>
>> > >>
>> > >> Next Steps?
>> > >> ===========
>> > >>
>> > >> Is the performance of instrumented code something we think we need
>> > >> to fix? What's an acceptable compile-time overhead for running this
>> > >> pruning pass? Is the general approach a non-starter for anybody?
>> > >>
>> > >> I'd like to get some feedback and gauge interest before pursuing
>> > >> this further. Possible next steps include benchmarking instrumented
>> > >> versions of clang and swift on the relevant test suites, running
>> > >> performance tests from lnt, running compile-time tests, and
>> > >> measuring any code size differences.
>> > >>
>> > >>
>> > >> thanks
>> > >> vedant
>> > >>
>> > >> _______________________________________________
>> > >> LLVM Developers mailing list
>> > >> llvm-dev at lists.llvm.org
>> > >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> >
>> >
>>
>>
>