[llvm-dev] RFC: Pass to prune redundant profiling instrumentation

Vedant Kumar via llvm-dev llvm-dev at lists.llvm.org
Thu Mar 10 19:21:18 PST 2016


Hi,

I'd like to add a new pass to LLVM which removes redundant profile counter
updates. The goal is to speed up code coverage testing and profile generation
for PGO.

I'm sending this email out to describe my approach, share some early results,
and gather feedback.
 

Problem Overview
================

A profile counter is redundant if it's incremented in exactly the same basic
blocks as some other profile counter. Consider the following module:

    static void f1() {
      instrprof_increment(profc_f1);
    }

    void f2() {
      instrprof_increment(profc_f2);
      f1();
    }

Once the inliner runs and deletes f1, we're left with:

    void f2() {
      instrprof_increment(profc_f2);
      instrprof_increment(profc_f1);
    }

Now we can say profc_f1 is redundant (or, an alias for profc_f2).

I've noticed that frontend-based instrumentation can generate many redundant
profile counters. This degrades performance and increases code size. We can
address the problem by getting rid of redundant counter updates, so long as we
still produce exactly the same profiles.


Proposed Solution
=================

I propose a pruning pass which takes the following steps:

  1. Delete functions with local linkage and only one use, if that use is in
     a profile data record.

     These functions are left around by the inliner (it doesn't know that
     they're safe to delete). Deleting them reduces code size and simplifies
     subsequent analysis of profile counters.

  2. Determine which profile counters are essential (i.e., not an alias of
     some other counter).

  3. Erase all updates to redundant profile counters.

  4. Emit the aliases into a new section in the binary.

     Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some changes
     in compiler-rt are required to walk the alias section and fill in the
     correct execution counts at program exit time.

This pass needs to be run after the inliner in order to be effective.

The complexity of this pass is O(N*M), where N is the number of profile
counters, and M is the average number of updates per counter. In practice it is
a bit faster, since we can skip the analysis of counters which are discovered to
be redundant early on in the process.


Early Results
=============

The pruning pass yields a 25% speedup on the example program above (where f2
is called 10^8 times in a loop).

Here is a slightly less contrived example:

  #include <vector>
  #include <algorithm>
  #include <cstdlib>

  static void escape(void *p) {
    asm volatile("" : : "g"(p) : "memory");
  }

  int main(int argc, char **argv) { 
    std::vector<int> V(atoi(argv[1])); 
    escape(reinterpret_cast<void *>(V.data())); 
    std::sort(V.begin(), V.end()); 
    return V[0];
  } 

I get the following results on my desktop (10^8 elements, 5 runs each):

  O3:                           0.262s
  O3 + PGOInstr:                0.663s
  O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, 672 aliases)
  O3 + CoverageInstr:           0.690s
  O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, 688 aliases)


Next Steps?
===========

Is the performance of instrumented code something we think we need to fix?
What's an acceptable compile-time overhead for running this pruning pass? Is
the general approach a non-starter for anybody?

I'd like to get some feedback and gauge interest before pursuing this further.
Possible next steps include benchmarking instrumented versions of clang and
swift on the relevant test suites, running performance tests from lnt, running
compile-time tests, and measuring any code size differences.


thanks
vedant


