[llvm-dev] RFC: Pass to prune redundant profiling instrumentation

Mehdi Amini via llvm-dev llvm-dev at lists.llvm.org
Thu Mar 10 19:48:04 PST 2016


> On Mar 10, 2016, at 7:21 PM, Vedant Kumar via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> Hi,
> 
> I'd like to add a new pass to LLVM which removes redundant profile counter
> updates. The goal is to speed up code coverage testing and profile generation
> for PGO.
> 
> I'm sending this email out to describe my approach, share some early results,
> and gather feedback.
> 
> 
> Problem Overview
> ================
> 
> A profile counter is redundant if it's incremented in exactly the same basic
> blocks as some other profile counter. Consider the following module:
> 
>    static void f1() {
>      instrprof_increment(profc_f1);
>    }
> 
>    void f2() {
>      instrprof_increment(profc_f2);
>      f1();
>    }
> 
> Once the inliner runs and deletes f1, we're left with:
> 
>    void f2() {
>      instrprof_increment(profc_f2);
>      instrprof_increment(profc_f1);
>    }
> 
> Now we can say profc_f1 is redundant (or, equivalently, an alias of profc_f2).
> 
> I've noticed that frontend-based instrumentation can generate many redundant
> profile counters. This degrades performance and increases code size.  We can
> address the problem by getting rid of redundant counter updates. The trick is
> to make sure we get back the same profiles.
> 
> 
> Proposed Solution
> =================
> 
> I propose a pruning pass which takes the following steps:
> 
>  1. Delete functions with local linkage and only one use, if that use is in
>     a profile data record.
> 
>     These functions are left around by the inliner (it doesn't know that
>     they're safe to delete). Deleting them reduces code size and simplifies
>     subsequent analysis of profile counters.

I think this is something global-DCE should be taught about (if it doesn't know already).
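The single-use check in step 1 might be modeled abstractly like this; the data layout and the `prunable` predicate below are illustrative assumptions, not LLVM's actual API:

```python
# Hypothetical model of step 1: a function with local linkage whose only
# remaining use is a profile data record is safe to delete (the inliner
# left it behind but cannot prove this itself).
def prunable(func):
    return func["linkage"] == "local" and func["uses"] == ["profile_data"]

funcs = [
    {"name": "f1", "linkage": "local", "uses": ["profile_data"]},
    {"name": "f2", "linkage": "external", "uses": []},
]

dead = [f["name"] for f in funcs if prunable(f)]
print(dead)  # ['f1']
```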


> 
>  2. Determine which profile counters are essential.

What is an "essential counter"?

> 
>  3. Erase all updates to redundant profile counters.

When you say "erase", you need to actually replace the multiple counter increments with *a new* counter, right? I.e. when replacing:

     instrprof_increment(profc_f2);
     instrprof_increment(profc_f1);

you need to emit:

     instrprof_increment(profc_f2f1_fused);


>  4. Emit the aliases into a new section in the binary.
> 
>     Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some changes
>     in compiler-rt are required to walk the alias section and fill in the
>     correct execution counts at program exit time.
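To make the exit-time alias walk concrete, here is a minimal sketch with counters and alias records modeled as plain Python data. The record layout follows the { Dst, Src } tuples described above; everything else (names, the dict-based storage) is an illustrative assumption, not the proposed compiler-rt code:

```python
# Model: counters live in a dict keyed by name; each alias record names
# a Dst counter (whose updates were erased) and the Src counter that
# was actually incremented in its place.
counters = {"profc_f2": 100_000_000, "profc_f1": 0}  # profc_f1 was pruned

# Alias records emitted by the pruning pass: (Dst, Src) pairs.
alias_section = [("profc_f1", "profc_f2")]

def fixup_aliases(counters, alias_section):
    """At program exit, copy each Src count into its Dst so the written
    profile matches what unpruned instrumentation would have produced."""
    for dst, src in alias_section:
        counters[dst] = counters[src]

fixup_aliases(counters, alias_section)
print(counters["profc_f1"])  # now matches profc_f2
```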
> 
> This pass needs to be run after the inliner in order to be effective.
> 
> The complexity of this pass is O(N*M), where N is the number of profile
> counters, and M is the average number of updates per counter.

What you're describing here is not clear to me, but I may have totally misunderstood the whole concept :)


-- 
Mehdi


> In practice it is
> a bit faster, since we can skip analyzing counters that are discovered to be
> redundant early in the process.
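The redundancy analysis could be sketched abstractly as a grouping problem: counters incremented in exactly the same basic blocks fall into one equivalence class, and all but one representative per class are redundant. The data layout and names below are illustrative, not the actual pass:

```python
# Hypothetical sketch of redundancy detection: two counters are
# redundant if they are incremented in exactly the same basic blocks.
from collections import defaultdict

def find_aliases(counter_updates):
    """counter_updates maps counter name -> frozenset of basic-block IDs
    where that counter is incremented. Returns {redundant: essential}."""
    by_blocks = defaultdict(list)
    for counter, blocks in counter_updates.items():
        by_blocks[blocks].append(counter)
    aliases = {}
    for group in by_blocks.values():
        rep = group[0]  # keep one essential counter per equivalence class
        for dup in group[1:]:
            aliases[dup] = rep  # dup's updates can be erased
    return aliases

updates = {
    "profc_f2": frozenset({"f2.entry"}),
    "profc_f1": frozenset({"f2.entry"}),  # f1 was inlined into f2
    "profc_g":  frozenset({"g.entry"}),
}
print(find_aliases(updates))  # profc_f1 aliases profc_f2
```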
> 
> 
> Early Results
> =============
> 
> The pruning pass results in a 25% speed improvement in the example program above
> (where f2 is called in a loop 10^8 times).
> 
> Here is a slightly less contrived example:
> 
>  #include <vector>
>  #include <algorithm>
>  #include <cstdlib>
> 
>  static void escape(void *p) {
>    asm volatile("" : : "g"(p) : "memory");
>  }
> 
>  int main(int argc, char **argv) { 
>    std::vector<int> V(atoi(argv[1])); 
>    escape(reinterpret_cast<void *>(V.data())); 
>    std::sort(V.begin(), V.end()); 
>    return V[0];
>  } 
> 
> I get the following results on my desktop (10^8 elements, 5 runs each):
> 
>  O3:                           0.262s
>  O3 + PGOInstr:                0.663s
>  O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, 672 aliases)
>  O3 + CoverageInstr:           0.690s
>  O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, 688 aliases)
> 
> 
> Next Steps?
> ===========
> 
> Is the performance of instrumented code something we think we need to fix?
> What's an acceptable compile-time overhead for running this pruning pass? Is
> the general approach a non-starter for anybody?
> 
> I'd like to get some feedback and gauge interest before pursuing this further.
> Possible next steps include benchmarking instrumented versions of clang and
> swift on the relevant test suites, running performance tests from lnt, running
> compile-time tests, and measuring any code size differences.
> 
> 
> thanks
> vedant
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


