[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
Sean Silva via llvm-dev
llvm-dev at lists.llvm.org
Thu Mar 10 22:13:34 PST 2016
On Thu, Mar 10, 2016 at 9:42 PM, Xinliang David Li <xinliangli at gmail.com>
wrote:
>
>
> On Thu, Mar 10, 2016 at 8:33 PM, Sean Silva via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>>
>>
>> On Thu, Mar 10, 2016 at 7:21 PM, Vedant Kumar via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> Hi,
>>>
>>> I'd like to add a new pass to LLVM which removes redundant profile
>>> counter updates. The goal is to speed up code coverage testing and
>>> profile generation for PGO.
>>>
>>
>> We may want to have a focused discussion about this goal, rather than a
>> particular suggestion. There are a lot of things we can do. Off the top of
>> my head, some are:
>>
>> 1. add some sort of alias annotation (such as an independent TBAA root
>> node) to all the counter increment memory instructions to tell the
>> optimizer they don't interfere with the usual loads and stores.
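>>
>> A minimal sketch of what idea 1 could look like via LLVM's MDBuilder
>> (the root/type names are hypothetical; this is not what the
>> instrumentation pass emits today):
>>
>>   #include "llvm/IR/LLVMContext.h"
>>   #include "llvm/IR/MDBuilder.h"
>>   #include "llvm/IR/Metadata.h"
>>   using namespace llvm;
>>
>>   // Build a TBAA access tag under a dedicated root, so alias analysis
>>   // can prove counter updates never interfere with user memory.
>>   static MDNode *getCounterTBAATag(LLVMContext &Ctx) {
>>     MDBuilder MDB(Ctx);
>>     MDNode *Root = MDB.createTBAARoot("profile counters");
>>     MDNode *Ty = MDB.createTBAAScalarTypeNode("counter", Root);
>>     return MDB.createTBAAStructTagNode(Ty, Ty, /*Offset=*/0);
>>   }
>>
>>   // Each counter load/store the instrumentation emits then gets:
>>   //   I->setMetadata(LLVMContext::MD_tbaa, Tag);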
>>
>> 2. communicate to the optimizer that counters can be registerized. In a
>> loop like:
>> for (int i = 0; i < N; i++) {
>>   if (foo())
>>     bar();
>>   else
>>     baz();
>> }
>> we perform O(N) counter increments (i.e. load, increment, store) last I
>> checked. However, if the counters are in registers, then we only perform
>> O(1) memory operations. This can dramatically reduce the pressure on the
>> CPU's load/store units and also relieve cross-core cache line ping-pong
>> when two cores are executing the same code. Note that the latter benefit is
>> attained even if we ultimately end up spilling the counters due to
>> increased register pressure.
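>>
>> Conceptually the transformation for the loop above would be (a sketch;
>> the __profc_* names are made up for illustration):
>>
>>   uint64_t CntThen = 0, CntElse = 0; // counters live in registers
>>   for (int i = 0; i < N; i++) {
>>     if (foo()) { ++CntThen; bar(); }
>>     else       { ++CntElse; baz(); }
>>   }
>>   // O(1) memory updates instead of O(N):
>>   __profc_then += CntThen;
>>   __profc_else += CntElse;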
>>
>> I actually don't know what is preventing the usual optimization pipeline
>> from getting 2 right.
>>
>
> Call mod-ref. We need to teach the optimizer that a counter owned by
> the current function (if the function can be proved non-recursive in
> some way) cannot be modified by any other calls.
>
I don't think that's a sufficient explanation. Consider the following
example:
Sean:~/tmp % cat testprofile.cpp
int foo(int n) {
  unsigned Ret = 42;
  for (int i = 0; i < n; i++) {
    if (i % 100) {
      Ret += 789;
    } else {
      Ret *= (283 + Ret);
    }
  }
  return Ret;
}
Sean:~/tmp % ~/pg/release/bin/clang++ -o - -fprofile-instr-generate testprofile.cpp -S -emit-llvm -O2 >foo.ll
Sean:~/tmp % ~/pg/release/bin/opt -view-cfg foo.ll
[image: CFG of foo.ll as rendered by opt -view-cfg; original attachment:
http://lists.llvm.org/pipermail/llvm-dev/attachments/20160310/f2c960ae/attachment-0001.png]
-- Sean Silva
>
> David
>
>
>
>
>>
>>
>>>
>>> I'm sending this email out to describe my approach, share some early
>>> results, and gather feedback.
>>>
>>>
>>> Problem Overview
>>> ================
>>>
>>> A profile counter is redundant if it's incremented in exactly the same
>>> basic
>>> blocks as some other profile counter. Consider the following module:
>>>
>>> local void f1() {
>>>   instrprof_increment(profc_f1);
>>> }
>>>
>>> void f2() {
>>>   instrprof_increment(profc_f2);
>>>   f1();
>>> }
>>>
>>> Once the inliner runs and inlines f1 into f2, we're left with:
>>>
>>> void f2() {
>>>   instrprof_increment(profc_f2);
>>>   instrprof_increment(profc_f1);
>>> }
>>>
>>> Now we can say profc_f1 is redundant (or an alias for profc_f2).
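>>>
>>> After pruning, only one update would remain, with the alias recorded
>>> on the side (a sketch of the intended result):
>>>
>>> void f2() {
>>>   instrprof_increment(profc_f2);
>>>   // update of profc_f1 erased; the recorded alias
>>>   // {Dst: profc_f1, Src: profc_f2} lets the runtime
>>>   // reconstruct profc_f1's count at exit.
>>> }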
>>>
>>> I've noticed that frontend-based instrumentation can generate many
>>> redundant profile counters. This degrades performance and increases
>>> code size. We can address the problem by getting rid of redundant
>>> counter updates. The trick is to make sure we get back the same
>>> profiles.
>>>
>>>
>>> Proposed Solution
>>> =================
>>>
>>> I propose a pruning pass which takes the following steps:
>>>
>>> 1. Delete functions with local linkage and only one use, if that use
>>>    is in a profile data record.
>>>
>>>    These functions are left around by the inliner (it doesn't know
>>>    that they're safe to delete). Deleting them reduces code size and
>>>    simplifies subsequent analysis of profile counters.
>>>
>>> 2. Determine which profile counters are essential.
>>>
>>> 3. Erase all updates to redundant profile counters.
>>>
>>> 4. Emit the aliases into a new section in the binary.
>>>
>>>    Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some
>>>    changes in compiler-rt are required to walk the alias section and
>>>    fill in the correct execution counts at program exit time (a sketch
>>>    of the runtime side follows below).
>>>
>>> This pass needs to be run after the inliner in order to be effective.
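>>>
>>> A minimal sketch of the runtime side (all names hypothetical; the
>>> actual compiler-rt changes may look different):
>>>
>>> #include <stdint.h>
>>>
>>> typedef struct { uint64_t *Dst; const uint64_t *Src; } ProfAlias;
>>>
>>> /* At program exit, walk the alias section and copy each source
>>>    counter's value over the pruned destination counter. */
>>> static void applyCounterAliases(const ProfAlias *Begin,
>>>                                 const ProfAlias *End) {
>>>   for (const ProfAlias *A = Begin; A != End; ++A)
>>>     *A->Dst = *A->Src;
>>> }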
>>>
>>> The complexity of this pass is O(N*M), where N is the number of
>>> profile counters, and M is the average number of updates per counter.
>>> In practice it is a bit faster, since we can skip the analysis of
>>> counters which are discovered to be redundant early on in the process.
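>>>
>>> One way to picture the analysis (a sketch; sortedUpdateSites and
>>> recordAlias are hypothetical helpers): key each counter by the basic
>>> blocks that update it, and alias any counter whose key was already
>>> seen.
>>>
>>> std::map<std::vector<const BasicBlock *>, GlobalVariable *> Canonical;
>>> for (GlobalVariable *Counter : Counters) {   // N counters
>>>   auto Key = sortedUpdateSites(Counter);     // visits M updates
>>>   auto Ins = Canonical.insert({Key, Counter});
>>>   if (!Ins.second)                           // key seen before:
>>>     recordAlias(/*Dst=*/Counter, /*Src=*/Ins.first->second);
>>> }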
>>>
>>
>> I think a conceptually simpler design is something like:
>>
>>   for each CFG edge:
>>     record which FE counters have ended up associated with it
>>   remove FE counters
>>   run IR instrumentation pass
>>   emit a side table mapping IR instr counters to FE counters (more
>>   generally: how to reconstruct FE counters from the IR counters)
>>
>> The problem is simply reduced to the IR instrumentation pass.
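>>
>> The side table might look something like this (a purely hypothetical
>> layout): each FE counter is recovered as a sum over IR edge counters.
>>
>>   struct FECounterRemap {
>>     uint64_t FEIndex;              // index into the FE counter array
>>     std::vector<uint64_t> IREdges; // IR counters whose sum recovers it
>>   };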
>>
>>
>>>
>>>
>>> Early Results
>>> =============
>>>
>>> The pruning pass results in a 25% speed improvement in the example
>>> program above (where f2 is called in a loop 10^8 times).
>>>
>>> Here is a slightly less contrived example:
>>>
>>> #include <vector>
>>> #include <algorithm>
>>> #include <cstdlib>
>>>
>>> static void escape(void *p) {
>>>   asm volatile("" : : "g"(p) : "memory");
>>> }
>>>
>>> int main(int argc, char **argv) {
>>>   std::vector<int> V(atoi(argv[1]));
>>>   escape(reinterpret_cast<void *>(V.data()));
>>>   std::sort(V.begin(), V.end());
>>>   return V[0];
>>> }
>>>
>>> I get the following results on my desktop (10^8 elements, 5 runs each):
>>>
>>>   O3:                            0.262s
>>>   O3 + PGOInstr:                 0.663s
>>>   O3 + PGOInstr + Pruning:       0.606s  (8.6% performance win, 672 aliases)
>>>   O3 + CoverageInstr:            0.690s
>>>   O3 + CoverageInstr + Pruning:  0.610s  (11.6% performance win, 688 aliases)
>>>
>>>
>>> Next Steps?
>>> ===========
>>>
>>> Is the performance of instrumented code something we think we need
>>> to fix?
>>>
>>
>> With frontend instrumentation it definitely is.
>>
>>> What's an acceptable compile-time overhead for running this pruning pass?
>>
>>
>> Instrumented builds are "special" anyway so a fair slowdown is probably
>> acceptable. I.e. this doesn't affect regular developer compilation times.
>> As a baseline, maybe compare compile time of `-fprofile-instr-generate -O2`
>> vs. `-O2`. If `-fprofile-instr-generate -O2` is 10% slower than the
>> control, then that indicates that a 5% extra slowdown is probably
>> reasonable for a substantial reduction in profiling overhead (which can
>> result in a qualitative improvement to the actual usability of the
>> instrumented program).
>>
>> -- Sean Silva
>>
>>
>>
>>> Is the general approach a non-starter for anybody?
>>>
>>> I'd like to get some feedback and gauge interest before pursuing this
>>> further. Possible next steps include benchmarking instrumented
>>> versions of clang and swift on the relevant test suites, running
>>> performance tests from lnt, running compile-time tests, and measuring
>>> any code size differences.
>>>
>>>
>>> thanks
>>> vedant
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>