[llvm-dev] RFC: Pass to prune redundant profiling instrumentation

Vedant Kumar via llvm-dev llvm-dev at lists.llvm.org
Fri Mar 11 12:47:43 PST 2016


There have been a lot of responses. I'll try to summarize the thread and respond
to some of the questions/feedback.


Summary
=======

1. We should teach GlobalDCE to strip out instrumented functions which the
   inliner cannot delete.

2. Sean suggests adding metadata to loads/stores of counters to convey that
   they do not alias normal program data.

   I'm not familiar with AA, but off-hand this seems like a good idea.

3. Sean also suggests improving the optimizer to registerize counters when
   possible.

   Ditto, seems like a good idea.

4. Sean has an alternate proposal for counter pruning which works by remapping
   FE-generated counters to counters generated by the IR instrumentation pass.
   David likes this proposal.

   I have some questions about this but am willing to try it. IMO, the pruning
   pass I've suggested could complement this nicely -- I don't view them as
   mutually exclusive (though I could be wrong).

5. There seems to be consensus on the need to improve the performance of
   FE-instrumented programs.

6. David is concerned that the proposed pruning pass is too narrow in scope,
   and would like to see results on a larger benchmark. I'll try to get some
   numbers.

7. David mentioned some optimizations that are possible if we make coverage
   information "yes or no" (per line). Justin pointed out that this doesn't
   work for a few of Apple's use cases.


Clarifications
==============

1. In my original email, two of my numbers were incorrect. I listed the
   sizes of the __llvm_prf_alias sections, not the number of actual aliases.
   The corrected lines are:

O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, _42_ aliases)
O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, _44_ aliases)

2. 

>>  Determine which profile counters are essential.
> 
> What is an "essential counter"?

The pass I proposed works like this:

  for (PC : ProfileCounters) {
    for (OtherPC : ProfileCounters visited after PC) {
      if (Update-Sites-Of(PC) == Update-Sites-Of(OtherPC)) {
        mark PC as "essential"
        mark OtherPC as "redundant" (i.e., an alias of PC)
      }
    }
  }

Here, Update-Sites-Of(PC) is the set of basic blocks in which PC is updated.
If several counters are updated in exactly the same set of blocks, they are
all marked as aliases of a single essential counter in one pass.
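
For concreteness, here is a minimal C++ sketch of that grouping step. This
is not code from the pass: Counter and BlockSet are stand-ins for the real
types (counter GlobalVariables and sets of BasicBlocks), and findAliases is
a hypothetical helper.

  #include <map>
  #include <set>

  // Stand-ins for the real types: a counter would be a GlobalVariable*,
  // and a BlockSet would hold BasicBlock*s.
  using Counter = int;
  using BlockSet = std::set<int>;

  // Group counters by their update-site sets. The first counter seen with
  // a given set becomes "essential"; any later counter with an identical
  // set is recorded as a redundant alias of it.
  std::map<Counter, Counter>
  findAliases(const std::map<Counter, BlockSet> &UpdateSites) {
    std::map<BlockSet, Counter> Essential;
    std::map<Counter, Counter> Aliases;
    for (const auto &Entry : UpdateSites) {
      auto Inserted = Essential.insert({Entry.second, Entry.first});
      if (!Inserted.second)
        Aliases[Entry.first] = Inserted.first->second;
    }
    return Aliases;
  }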


3.

>> Erase all updates to redundant profile counters.
> 
> When you are saying "erase", you need to actually replace the multiple counter increment with *a new* counter, right? I.e. when replacing:
> 
>     instrprof_increment(profc_f2);
>     instrprof_increment(profc_f1);
> 
> you need to emit:
> 
>     instrprof_increment(profc_f2f1_fused);


No. The pass only erases updates to counters which it can prove are
truly redundant. There is no need to create a new (or fused) counter
because profc_f1 == profc_f2 in all possible profiles.
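
Concretely, the erased form of the quoted example would keep only the
essential update (a sketch; the alias entry uses the { Dst, Src } tuple
format from the original proposal):

  void f2() {
    instrprof_increment(profc_f2);
    // The update to profc_f1 is erased. An alias entry
    // { profc_f1, profc_f2 } tells the runtime to copy profc_f2's count
    // into profc_f1 at program exit.
  }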

4.

>> The complexity of this pass is O(N*M), where N is the number of profile
>> counters, and M is the average number of updates per counter.
> 
> What you're describing here is not clear to me? But I may have been totally misunderstanding the whole concept :)

Hopefully the pseudo-code helps? Alternatively I can put a WIP patch up
on Phab.

> It should be just O(N) where N is the number of instructions.

The complexity of the pass is not upper-bounded by the number of
instructions in the program, because certain loads and stores can be
visited more than once: a block that updates several counters is scanned
once for each counter being analyzed, so its counter loads and stores are
revisited.


FE to IR Counter Remapping
==========================

I have a question about this plan:

> for each CFG edge:
>     record which FE counters have ended up associated with it
> remove FE counters
> run IR instrumentation pass
> emit a side table mapping IR instr counters to FE counters

Currently, -instrprof happens early in the pipeline. IIUC this is done to
allow the optimizer to work with load+add+stores, instead of profile update
intrinsics.

Say we introduce a counter remapping pass like the one Sean suggested. It
should be run before -instrprof so that we don't waste time lowering a bunch
of instrprof_increment intrinsics which we'll have to throw away later.

But that means that the CFGs that the counter remapping pass operates on won't
reflect changes made by the inliner (or any other optimizations which alter the
CFG), right?

ISTM the pruning pass I've proposed is useful whether we're doing FE-based
instrumentation _or_ late instrumentation. Since it operates on loads+stores
directly, it can clean up redundant counter increments at any point in the
pipeline (after -instrprof).
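
To illustrate what "operating on loads+stores" means, a pruning pass could
recover the counter updated by a lowered increment roughly like this. This
is a hypothetical helper, not an existing LLVM API; it assumes the usual
__profc_ name prefix on counter globals.

  #include "llvm/IR/GlobalVariable.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // -instrprof lowers each increment into a load of a counter slot, an
  // add, and a store back. Given the store, try to recover the counter
  // global it updates. (A real pass would also need to look through GEPs
  // into the counter array; stripPointerCasts only handles bitcasts and
  // zero-index GEPs.)
  static GlobalVariable *getUpdatedCounter(StoreInst *SI) {
    Value *Ptr = SI->getPointerOperand()->stripPointerCasts();
    auto *GV = dyn_cast<GlobalVariable>(Ptr);
    if (!GV || !GV->getName().startswith("__profc_"))
      return nullptr;
    auto *Add = dyn_cast<BinaryOperator>(SI->getValueOperand());
    if (!Add || Add->getOpcode() != Instruction::Add)
      return nullptr;
    return GV;
  }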


vedant


> On Mar 10, 2016, at 7:48 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
> 
>> 
>> On Mar 10, 2016, at 7:21 PM, Vedant Kumar via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> 
>> Hi,
>> 
>> I'd like to add a new pass to LLVM which removes redundant profile counter
>> updates. The goal is to speed up code coverage testing and profile generation
>> for PGO.
>> 
>> I'm sending this email out to describe my approach, share some early results,
>> and gather feedback.
>> 
>> 
>> Problem Overview
>> ================
>> 
>> A profile counter is redundant if it's incremented in exactly the same basic
>> blocks as some other profile counter. Consider the following module:
>> 
>>   local void f1() {
>>     instrprof_increment(profc_f1);
>>   }
>> 
>>   void f2() {
>>     instrprof_increment(profc_f2);
>>     f1();
>>   }
>> 
>> Once the inliner runs and deletes f1, we're left with:
>> 
>>   void f2() {
>>     instrprof_increment(profc_f2);
>>     instrprof_increment(profc_f1);
>>   }
>> 
>> Now we can say profc_f1 is redundant (or, an alias for profc_f2).
>> 
>> I've noticed that frontend-based instrumentation can generate many redundant
>> profile counters. This degrades performance and increases code size.  We can
>> address the problem by getting rid of redundant counter updates. The trick is
>> to make sure we get back the same profiles.
>> 
>> 
>> Proposed Solution
>> =================
>> 
>> I propose a pruning pass which takes the following steps:
>> 
>> 1. Delete functions with local linkage and only one use, if that use is in
>>    a profile data record.
>> 
>>    These functions are left around by the inliner (it doesn't know that
>>    they're safe to delete). Deleting them reduces code size and simplifies
>>    subsequent analysis of profile counters.
> 
> I think that this is something global-DCE should be taught about (if it does not know already).
> 
> 
>> 
>> 2. Determine which profile counters are essential.
> 
> What is an "essential counter"?
> 
>> 
>> 3. Erase all updates to redundant profile counters.
> 
> When you are saying "erase", you need to actually replace the multiple counter increment with *a new* counter, right? I.e. when replacing:
> 
>     instrprof_increment(profc_f2);
>     instrprof_increment(profc_f1);
> 
> you need to emit:
> 
>     instrprof_increment(profc_f2f1_fused);
> 
> 
>> 4. Emit the aliases into a new section in the binary.
>> 
>>    Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some changes
>>    in compiler-rt are required to walk the alias section and fill in the
>>    correct execution counts at program exit time.
>> 
>> This pass needs to be run after the inliner in order to be effective.
>> 
>> The complexity of this pass is O(N*M), where N is the number of profile
>> counters, and M is the average number of updates per counter.
> 
> What you're describing here is not clear to me? But I may have been totally misunderstanding the whole concept :)
> 
> 
> -- 
> Mehdi
> 
> 
>> In practice it is
>> a bit faster, since we can skip the analysis of counters which are discovered to
>> be redundant early on in the process.
>> 
>> 
>> Early Results
>> =============
>> 
>> The pruning pass results in 25% speed improvement in the example program above
>> (where f2 is called in a loop 10^8 times).
>> 
>> Here is a slightly less contrived example:
>> 
>> #include <vector>
>> #include <algorithm>
>> #include <cstdlib>
>> 
>> static void escape(void *p) {
>>   asm volatile("" : : "g"(p) : "memory");
>> }
>> 
>> int main(int argc, char **argv) { 
>>   std::vector<int> V(atoi(argv[1])); 
>>   escape(reinterpret_cast<void *>(V.data())); 
>>   std::sort(V.begin(), V.end()); 
>>   return V[0];
>> } 
>> 
>> I get the following results on my desktop (10^8 elements, 5 runs each):
>> 
>> O3:                           0.262s
>> O3 + PGOInstr:                0.663s
>> O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, 672 aliases)
>> O3 + CoverageInstr:           0.690s
>> O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, 688 aliases)
>> 
>> 
>> Next Steps?
>> ===========
>> 
>> Is the performance of instrumented code something we think we need to fix?
>> What's an acceptable compile-time overhead for running this pruning pass? Is
>> the general approach a non-starter for anybody?
>> 
>> I'd like to get some feedback and gauge interest before pursuing this further.
>> Possible next steps include benchmarking instrumented versions of clang and
>> swift on the relevant test suites, running performance tests from lnt, running
>> compile-time tests, and measuring any code size differences.
>> 
>> 
>> thanks
>> vedant
>> 
