[llvm-dev] RFC: Pass to prune redundant profiling instrumentation

Fri Mar 11 19:43:22 PST 2016

On Fri, Mar 11, 2016 at 7:26 PM, Xinliang David Li <davidxl at google.com>
wrote:

>
>
> On Fri, Mar 11, 2016 at 7:04 PM, Sean Silva <chisophugis at gmail.com> wrote:
>
>>
>>
>> On Fri, Mar 11, 2016 at 6:00 PM, Vedant Kumar <vsk at apple.com> wrote:
>>
>>>
>>> > On Mar 11, 2016, at 5:28 PM, Sean Silva <chisophugis at gmail.com> wrote:
>>> >
>>> >
>>> >
>>> > On Fri, Mar 11, 2016 at 12:47 PM, Vedant Kumar <vsk at apple.com> wrote:
>>> > There have been a lot of responses. I'll try to summarize the thread
>>> and respond
>>> > to some of the questions/feedback.
>>> >
>>> >
>>> > Summary
>>> > =======
>>> >
>>> > 1. We should teach GlobalDCE to strip out instrumented functions which
>>> the
>>> >    inliner cannot delete.
>>> >
>>> > 2. Sean suggests adding metadata to loads/stores of counters to convey
>>> that
>>> >    they do not alias normal program data.
>>> >
>>> >    I'm not familiar with AA, but off-hand this seems like a good idea.
>>> >
>>> > 3. Sean also suggests improving the optimizer to registerize counters
>>> when
>>> >    possible.
>>> >
>>> >    Ditto, seems like a good idea.
>>> >
>>> > 4. Sean has an alternate proposal for counter pruning which works by
>>> remapping
>>> >    FE-generated counters to counters generated by the IR
>>> instrumentation pass.
>>> >    David likes this proposal.
>>> >
>>> >    I have some questions about this but am willing to try it. IMO, the
>>> pruning
>>> >    pass I've suggested could complement this nicely -- I don't view
>>> them as
>>> >    mutually exclusive (though I could be wrong).
>>> >
>>> > 5. There seems to be consensus over the need to improve performance of
>>> >    FE-instrumented programs.
>>> >
>>> > 6. David is concerned that the proposed pruning pass is too narrow in
>>> scope,
>>> >    and would like to see results on a larger benchmark. I'll try to
>>> get some
>>> >    numbers.
>>> >
>>> > 7. David mentioned some optimizations that are possible if we make
>>> coverage
>>> >    information "yes or no" (per line). Justin pointed out that this
>>> doesn't
>>> >    work for a few of Apple's use cases.
>>> >
>>> >
>>> > Clarifications
>>> > ==============
>>> >
>>> > 1. In my original email, two of my numbers were incorrect. I listed the
>>> >    sizes of the __llvm_prf_alias sections, not the number of actual
>>> aliases.
>>> >    The corrected lines are:
>>> >
>>> > O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, _42_
>>> aliases)
>>> > O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, _44_
>>> aliases)
>>> >
>>> > 2.
>>> >
>>> > >>  Determine which profile counters are essential.
>>> > >
>>> > > What is an "essential counter"?
>>> >
>>> > The pass I proposed works like this:
>>> >
>>> >   for (PC : ProfileCounters) {
>>> >     if (Update-Sites-Of(PC) == Update-Sites-Of(Other-PC)) {
>>> >       mark PC as "essential"
>>> >       mark Other-PC as "redundant" (i.e, alias of PC)
>>> >     }
>>> >   }
>>> >
>>> > Where Update-Sites-Of(PC) is the set of basic blocks in which PC is
>>> updated.
>>> > If there are multiple counters updated in each update site for PC,
>>> they are
>>> > all marked as aliases in one pass.
>>> >
>>> >
>>> > 3.
>>> >
>>> > >> Erase all updates to redundant profile counters.
>>> > >
>>> > > When you are saying "erase", you need to actually replace the
>>> multiple counter increment with *a new* counter, right? I.e. when replacing:
>>> > >
>>> > >     instrprof_increment(profc_f2);
>>> > >     instrprof_increment(profc_f1);
>>> > >
>>> > > you need to emit:
>>> > >
>>> > >     instrprof_increment(profc_f2f1_fused);
>>> >
>>> >
>>> > No. The pass only erases updates to counters which it can prove are
>>> > truly redundant. There is no need to create a new (or fused) counter
>>> > because profc_f1 == profc_f2 in all possible profiles.
>>> >
>>> > Oh, this makes it substantially less useful I would think (and I
>>> misunderstood your original proposal the same way that Mehdi did here).
>>> > I guess I'm not sure what situations your proposed transformation
>>> would catch.
>>> > The only one I can think of is something like:
>>> > - function foo() has an entry counter
>>> > - function bar() calls foo() from a BB that has a counter
>>> > - function bar() is the only caller of foo() and calls it in only one
>>> BB.
>>> > Can you add some debug dumping to your pass and give some real-world
>>> examples of counters that it merges?
>>>
>>> I still have the pruned IR from the example I brought up at the start
>>> of the thread:
>>>
>>>   https://ghostbin.com/paste/qus2s
>>>
>>> Look for the __llvm_prf_alias section.
>>>
>>
>> Could you maybe digest that a bit for us to give source-level examples of
>> what those counters are that end up aliased?
>>
>> Also, in the linked IR there is still a tremendous amount of counter
>> redundancy being left on the table that a "counter fusion" approach would
>> avoid.
>> E.g.
>> https://ghostbin.com/paste/qus2s#L306 - 6 counter updates in the same BB
>> https://ghostbin.com/paste/qus2s#L286 - 3 counter updates in the same BB
>> https://ghostbin.com/paste/qus2s#L261 - 2 counter updates in the same BB
>> https://ghostbin.com/paste/qus2s#L191 - 5 counter updates in the same BB
>> https://ghostbin.com/paste/qus2s#L435 - 9 counter updates in the same BB
>> https://ghostbin.com/paste/qus2s#L373 - 2 counter updates in the same BB
>>
>> It seems like the "identical counter merge" leaves a lot on the table
>> that even a simple BB-at-a-time counter fusion approach would get right.
>>
>> Also, notice that in that IR there are 21 BB's but 65 counters (even
>> after identical counter merge). This indicates that actually eliminating
>> the original counters entirely (besides recording a way to reconstruct
>> them, but that can be compressed) might actually reduce the total amount of
>> data in the final binary.
>>
>>
>>>
>>>
>>> > The fused counter approach (or the one I suggested which is just a
>>> generalization of that (see below)) seems like it can reduce the number of
>>> counter updates much more.
>>> >
>>> >
>>> > 4.
>>> >
>>> > >> The complexity of this pass is O(N*M), where N is the number of
>>> profile
>>> > >> counters, and M is the average number of updates per counter.
>>> > >
>>> > > What you're describing here is not clear to me? But I may have been
>>> totally misunderstanding the whole concept :)
>>> >
>>> > Hopefully the pseudo-code helps? Alternatively I can put a WIP patch up
>>> > on Phab.
>>> >
>>> > > It should be just O(N) where N is the number of instructions.
>>> >
>>> > The complexity of the pass is not upper-bounded by the number of
>>> > instructions in the program because certain loads and stores can be
>>> visited
>>> > more than once.
>>> >
>>> >
>>> > FE to IR Counter Remapping
>>> > ==========================
>>> >
>>> > I have a question about this plan:
>>> >
>>> > > for each CFG edge:
>>> > >     record which FE counters have ended up associated with it
>>> > > remove FE counters
>>> > > run IR instrumentation pass
>>> > > emit a side table mapping IR instr counters to FE counters
>>> >
>>> > Currently, -instrprof happens early in the pipeline. IIUC this is done
>>> to
>>> > allow the optimizer to work with load+add+stores, instead of profile
>>> update
>>> > intrinsics.
>>> >
>>> > Say we introduce a counter remapping pass like the one Sean suggested.
>>> It
>>> > should be run before -instrprof so that we don't waste time lowering a
>>> bunch
>>> > of instrprof_increment intrinsics which we'll have to throw away later.
>>> >
>>> > But that means that the CFGs that the counter remapping pass operates
>>> on won't
>>> > reflect changes made by the inliner (or any other optimizations which
>>> alter the
>>> > CFG), right?
>>> >
>>> > ISTM the pruning pass I've proposed is useful whether we're doing
>>> FE-based
>>> > instrumentation _or_ late instrumentation. Since it operates on
>>> loads+stores
>>> > directly, it can clean up redundant counter increments at any point in
>>> the
>>> > pipeline (after -instrprof).
>>> >
>>> >
>>> > What I suggested would operate on loads/stores and could be used for
>>> cleanup at any point in the pipeline too.
>>> >
>>> > It would just be a generic transformation "okay, we have some set of
>>> counter updates chosen earlier in the pipeline that are now mangled / made
>>> redundant by further optimization; let's re-run counter placement and emit
>>> some side tables mapping the newly chosen counters to the original
>>> counters". I'm not sure how complicated it is to reconstruct one set of
>>> counters from the other (just linear combinations, I think? So hopefully
>>> not too complicated).
>>>
>>> I think I understand what you're proposing.
>>>
>>> To not break code coverage, we would need to emit the 'side table' you
>>> described
>>> into the binary (so that we can correctly evaluate counter expressions).
>>>
>>
>> Yes. Otherwise the original counters cannot be reconstructed. This is no
>> different from emitting the aliases needed by the "identical counter merge"
>> approach.
>>
>>
>>>
>>> While I think this is feasible and can get better results, it might not
>>> *also* be
>>> simpler than running pass I described. IMO what you've described would
>>> take more
>>> work to implement and would put a higher burden on the compiler (the
>>> pipeline looks
>>> something like -instrprof, -remove-counters, -pgo-instr-gen,
>>> -remap-counters).
>>> That's not a bad thing -- just something to be aware of.
>>>
>>
>> FWIW, I was imagining that this would be a single pass that would call
>> into the IR instr gen as a utility (some refactoring would probably be
>> needed, but that's doable).
>>
>> I expect it to be more work to implement the fully general thing I
>> described. But implementing a BB-at-a-time counter fusion I think would
>> likely be not much more complicated than "identical counter merge". It
>> would just be:
>>
>> for each BB:
>>     collect all counter updates in this BB, together with their
>> multiplicities
>>     create new counter
>>     emit side-table data that relates the new counter to an array of
>> (other counter, multiplicity of update)
>>
>> The runtime just emits the side-table and then llvm-profdata does:
>>
>> for each counter C:
>>     for (otherCounter, multiplicity) in side-table[C]:
>>         counters[otherCounter] += multiplicity * counters[C]
>>
>>
> There are other issues that can complicate the matter.
>
> 1) The assumption in the algorithm is that the source counter has only one
> update site -- but instead it may have more than one sites update due to
> inlining.  Only the counts collected in the out-of-line function instance's
> update site can be safely propagated to the 'fused' counters.
>

We may be miscommunicating. `otherCounter` can take on the same value in
different iterations of `for each counter C:`. So the counter that is
present in the out-of-line version would be updated by the fused counters
from other BB's where it has been inlined, just as if there had in fact
been an increment at runtime in those other BB's.

> 2) the 'source' BB may not even have a 'source' counter owned by the
> caller function created -- so 'implicit' counters need to be used
>

Yes, but I don't think the handling of this will be substantially more
complicated than aliases in the "merge identical counters" approach. More
likely, the correct factoring would be to have a clear distinction between
the counters that are incremented at runtime and the counters that the
original instrumentor (esp. frontends) wants to read in the final profdata
file. This seems like infrastructure needed for various kinds of
post-cleanup approaches.

> 3) The side table can be huge for large applications.
>

Yes, but the counter array is also huge, which could be substantially
reduced. Also the text size would be reduced. It is unclear to me that it
will actually cause significant growth.

>
> IMO, doing post-cleanup (other than enhancing generic optimizations that
> can be applied to other globals)  is not the right approach.
>

I don't disagree. I am just providing an example that *if* we want
post-cleanup, there are more effective ways than "merge identical counters".

I am actually not hugely interested in FE instrumentation now that we have
IR instrumentation. IR instrumentation seems like clearly the right
approach for the use cases that I am concerned about.
However, if I were strongly invested in FE instrumentation, I think that a
cleanup pass makes sense. The more general approach I propose would be a
way to translate FE instrumentation to IR instrumentation, thus we can
concentrate on IR instr.

-- Sean Silva

>
> David
>
>
>
>
>> -- Sean Silva
>>
>>
>>>
>>> vedant
>>>
>>>
>>> > In other words, what I was suggesting is basically a generalization of
>>> Mehdi's "fused" counter transformation.
>>> >
>>> > Ideally this would be a reusable transformation. We could run it one
>>> or multiple times throughout the optimization pipeline, and probably want
>>> to always run it at least once at the very end before codegen. (note that
>>> you only ever need at most a single side-table mapping the "current"
>>> counters to the "original" counters no matter how many times you re-run
>>> counter placement, since every time you re-run counter placement you are
>>> emitting a completely new side-table and deleting the old one, and the side
>>> table always reconstructs the original counters).
>>> >
>>> > The availability of this facility would enable frontends to be as
>>> naive as they want about their counter placement, which is probably a net
>>> win for LLVM as a toolkit. It would be analogous to the alloca trick that
>>> frontends use to avoid having to construct SSA themselves: the existence of
>>> mem2reg allows them to not have to worry about it.
>>> >
>>> > -- Sean Silva
>>> >
>>> >
>>> >
>>> > vedant
>>> >
>>> >
>>> > > On Mar 10, 2016, at 7:48 PM, Mehdi Amini <mehdi.amini at apple.com>
>>> wrote:
>>> > >
>>> > >>
>>> > >> On Mar 10, 2016, at 7:21 PM, Vedant Kumar via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>> > >>
>>> > >> Hi,
>>> > >>
>>> > >> I'd like to add a new pass to LLVM which removes redundant profile
>>> counter
>>> > >> updates. The goal is to speed up code coverage testing and profile
>>> generation
>>> > >> for PGO.
>>> > >>
>>> > >> I'm sending this email out to describe my approach, share some
>>> early results,
>>> > >> and gather feedback.
>>> > >>
>>> > >>
>>> > >> Problem Overview
>>> > >> ================
>>> > >>
>>> > >> A profile counter is redundant if it's incremented in exactly the
>>> same basic
>>> > >> blocks as some other profile counter. Consider the following module:
>>> > >>
>>> > >>   local void f1() {
>>> > >>     instrprof_increment(profc_f1);
>>> > >>   }
>>> > >>
>>> > >>   void f2() {
>>> > >>     instrprof_increment(profc_f2);
>>> > >>     f1();
>>> > >>   }
>>> > >>
>>> > >> Once the inliner runs and deletes f1, we're left with:
>>> > >>
>>> > >>   void f2() {
>>> > >>     instrprof_increment(profc_f2);
>>> > >>     instrprof_increment(profc_f1);
>>> > >>   }
>>> > >>
>>> > >> Now we can say profc_f1 is redundant (or, an alias for profc_f2).
>>> > >>
>>> > >> I've noticed that frontend-based instrumentation can generate many
>>> redundant
>>> > >> profile counters. This degrades performance and increases code
>>> size.  We can
>>> > >> address the problem by getting rid of redundant counter updates.
>>> The trick is
>>> > >> to make sure we get back the same profiles.
>>> > >>
>>> > >>
>>> > >> Proposed Solution
>>> > >> =================
>>> > >>
>>> > >> I propose a pruning pass which takes the following steps:
>>> > >>
>>> > >> 1. Delete functions with local linkage and only one use, if that
>>> use is in
>>> > >>    a profile data record.
>>> > >>
>>> > >>    These functions are left around by the inliner (it doesn't know
>>> that
>>> > >>    they're safe to delete). Deleting them reduces code size and
>>> simplifies
>>> > >>    subsequent analysis of profile counters.
>>> > >
>>> > > I think that this is something global-DCE should be teached about
>>> (if it does not know already).
>>> > >
>>> > >
>>> > >>
>>> > >> 2. Determine which profile counters are essential.
>>> > >
>>> > > What is an "essential counter"?
>>> > >
>>> > >>
>>> > >> 3. Erase all updates to redundant profile counters.
>>> > >
>>> > > When you are saying "erase", you need to actually replace the
>>> multiple counter increment with *a new* counter, right? I.e. when replacing:
>>> > >
>>> > >     instrprof_increment(profc_f2);
>>> > >     instrprof_increment(profc_f1);
>>> > >
>>> > > you need to emit:
>>> > >
>>> > >     instrprof_increment(profc_f2f1_fused);
>>> > >
>>> > >
>>> > >> 4. Emit the aliases into a new section in the binary.
>>> > >>
>>> > >>    Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some
>>> changes
>>> > >>    in compiler-rt are required to walk the alias section and fill
>>> in the
>>> > >>    correct execution counts at program exit time.
>>> > >>
>>> > >> This pass needs to be run after the inliner in order to be
>>> effective.
>>> > >>
>>> > >> The complexity of this pass is O(N*M), where N is the number of
>>> profile
>>> > >> counters, and M is the average number of updates per counter.
>>> > >
>>> > > What you're describing here is not clear to me? But I may have been
>>> totally misunderstanding the whole concept :)
>>> > >
>>> > >
>>> > > --
>>> > > Mehdi
>>> > >
>>> > >
>>> > >> In practice it is
>>> > >> a bit faster, since we can skip the analysis of counters which are
>>> discovered to
>>> > >> be redundant early on in the process.
>>> > >>
>>> > >>
>>> > >> Early Results
>>> > >> =============
>>> > >>
>>> > >> The pruning pass results in 25% speed improvement in the example
>>> program above
>>> > >> (where f2 is called in a loop 10^8 times).
>>> > >>
>>> > >> Here is a slightly less contrived example:
>>> > >>
>>> > >> #include <vector>
>>> > >> #include <algorithm>
>>> > >> #include <cstdlib>
>>> > >>
>>> > >> static void escape(void *p) {
>>> > >>   asm volatile("" : : "g"(p) : "memory");
>>> > >> }
>>> > >>
>>> > >> int main(int argc, char **argv) {
>>> > >>   std::vector<int> V(atoi(argv[1]));
>>> > >>   escape(reinterpret_cast<void *>(V.data()));
>>> > >>   std::sort(V.begin(), V.end());
>>> > >>   return V[0];
>>> > >> }
>>> > >>
>>> > >> I get the following results on my desktop (10^8 elements, 5 runs
>>> each):
>>> > >>
>>> > >> O3:                           0.262s
>>> > >> O3 + PGOInstr:                0.663s
>>> > >> O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, 672
>>> aliases)
>>> > >> O3 + CoverageInstr:           0.690s
>>> > >> O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, 688
>>> aliases)
>>> > >>
>>> > >>
>>> > >> Next Steps?
>>> > >> ===========
>>> > >>
>>> > >> Is the performance of instrumented code something we think we need
>>> to fix?
>>> > >> What's an acceptable compile-time overhead for running this pruning
>>> pass? Is
>>> > >> the general approach a non-starter for anybody?
>>> > >>
>>> > >> I'd like to get some feedback and gauge interest before pursuing
>>> this further.
>>> > >> Possible next steps include benchmarking instrumented versions of
>>> clang and
>>> > >> swift on the relevant test suites, running performance tests from
>>> lnt, running
>>> > >> compile-time tests, and measuring any code size differences.
>>> > >>
>>> > >>
>>> > >> thanks
>>> > >> vedant
>>> > >>
>>> > >> _______________________________________________
>>> > >> LLVM Developers mailing list
>>> > >> llvm-dev at lists.llvm.org
>>> > >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>> >
>>> >
>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160311/7ba95ab7/attachment.html>