[llvm-dev] RFC: Pass to prune redundant profiling instrumentation

Thu Mar 10 21:42:44 PST 2016

On Thu, Mar 10, 2016 at 8:33 PM, Sean Silva via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

>
>
> On Thu, Mar 10, 2016 at 7:21 PM, Vedant Kumar via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hi,
>>
>> I'd like to add a new pass to LLVM which removes redundant profile counter
>> updates. The goal is to speed up code coverage testing and profile
>> generation
>> for PGO.
>>
>
> We may want to have a focused discussion about this goal, rather than a
> particular suggestion. There are a lot of things we can do. Off the top of
> my head, some are:
>
> 1. add some sort of alias annotation (such as an independent TBAA root
> node) to all the counter increment memory instructions to tell the
> optimizer they don't interfere with the usual loads and stores.
>
> 2. communicate to the optimizer that counters can be registerized. In a
> loop like:
> for (int i = 0; i < N; i++) {
>   if (foo())
>     bar();
>   else
>     baz();
> }
> we perform O(N) counter increments (i.e. load, increment, store) last I
> checked. However, if the counters are in registers, then we only perform
> O(1) memory operations. This can dramatically reduce the pressure on the
> CPU's load/store units and also relieve cross-core cache line ping-pong
> when two cores are executing the same code. Note that the latter benefit is
> attained even if we ultimately end up spilling the counters due to
> increased register pressure.
>
> I actually don't know what is preventing the usual optimization pipeline
> from getting 2 right.
>

Call Mod-ref.  We need to teach the optimizer that the counter owned by the
current function (if the function is proved to be non-recursive in some
way) can not be modified by any other calls.

David

>
>
>>
>> I'm sending this email out to describe my approach, share some early
>> results,
>> and gather feedback.
>>
>>
>> Problem Overview
>> ================
>>
>> A profile counter is redundant if it's incremented in exactly the same
>> basic
>> blocks as some other profile counter. Consider the following module:
>>
>>     local void f1() {
>>       instrprof_increment(profc_f1);
>>     }
>>
>>     void f2() {
>>       instrprof_increment(profc_f2);
>>       f1();
>>     }
>>
>> Once the inliner runs and deletes f1, we're left with:
>>
>>     void f2() {
>>       instrprof_increment(profc_f2);
>>       instrprof_increment(profc_f1);
>>     }
>>
>> Now we can say profc_f1 is redundant (or, an alias for profc_f2).
>>
>> I've noticed that frontend-based instrumentation can generate many
>> redundant
>> profile counters. This degrades performance and increases code size.  We
>> can
>> address the problem by getting rid of redundant counter updates. The
>> trick is
>> to make sure we get back the same profiles.
>>
>>
>> Proposed Solution
>> =================
>>
>> I propose a pruning pass which takes the following steps:
>>
>>   1. Delete functions with local linkage and only one use, if that use is
>> in
>>      a profile data record.
>>
>>      These functions are left around by the inliner (it doesn't know that
>>      they're safe to delete). Deleting them reduces code size and
>> simplifies
>>      subsequent analysis of profile counters.
>>
>>   2. Determine which profile counters are essential.
>>
>>   3. Erase all updates to redundant profile counters.
>>
>>   4. Emit the aliases into a new section in the binary.
>>
>>      Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some
>> changes
>>      in compiler-rt are required to walk the alias section and fill in the
>>      correct execution counts at program exit time.
>>
>> This pass needs to be run after the inliner in order to be effective.
>>
>> The complexity of this pass is O(N*M), where N is the number of profile
>> counters, and M is the average number of updates per counter. In practice
>> it is
>> a bit faster, since we can skip the analysis of counters which are
>> discovered to
>> be redundant early on in the process.
>>
>
> I think a conceptually simpler design is something like:
>
> for each CFG edge:
>     record which FE counters have ended up associated with it
> remove FE counters
> run IR instrumentation pass
> emit a side table mapping IR instr counters to FE counters (more
> generally: how to reconstruct FE counters from the IR counters)
>
> The problem is simply reduced to the IR instrumentation pass.
>
>
>>
>>
>> Early Results
>> =============
>>
>> The pruning pass results in 25% speed improvement in the example program
>> above
>> (where f2 is called in a loop 10^8 times).
>>
>> Here is a slightly less contrived example:
>>
>>   #include <vector>
>>   #include <algorithm>
>>   #include <cstdlib>
>>
>>   static void escape(void *p) {
>>     asm volatile("" : : "g"(p) : "memory");
>>   }
>>
>>   int main(int argc, char **argv) {
>>     std::vector<int> V(atoi(argv[1]));
>>     escape(reinterpret_cast<void *>(V.data()));
>>     std::sort(V.begin(), V.end());
>
>     return V[0];
>>   }
>>
>> I get the following results on my desktop (10^8 elements, 5 runs each):
>>
>>   O3:                           0.262s
>>   O3 + PGOInstr:                0.663s
>>   O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, 672 aliases)
>>   O3 + CoverageInstr:           0.690s
>>   O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, 688
>> aliases)
>>
>>
>> Next Steps?
>> ===========
>>
>> Is the performance of instrumented code something we think we need to fix?
>>
>
> With frontend instrumentation it definitely is.
>
> What's an acceptable compile-time overhead for running this pruning pass?
>
>
> Instrumented builds are "special" anyway so a fair slowdown is probably
> acceptable. I.e. this doesn't affect regular developer compilation times.
> As a baseline, maybe compare compile time of `-fprofile-instr-generate -O2`
> vs. `-O2`. If `-fprofile-instr-generate -O2` is 10% slower than the
> control, then that indicates that a 5% extra slowdown is probably
> reasonable for a substantial reduction in profiling overhead (which can
> result in a qualitative improvement to the actual usability of the
> instrumented program).
>
> -- Sean Silva
>
>
>
>> Is
>> the general approach a non-starter for anybody?
>>
>> I'd like to get some feedback and gauge interest before pursuing this
>> further.
>> Possible next steps include benchmarking instrumented versions of clang
>> and
>> swift on the relevant test suites, running performance tests from lnt,
>> running
>> compile-time tests, and measuring any code size differences.
>>
>>
>> thanks
>> vedant
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160310/27b05662/attachment.html>