[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
Xinliang David Li via llvm-dev
llvm-dev at lists.llvm.org
Thu Mar 10 22:34:12 PST 2016
I saw your example has a function call in the loop.

For this example, I think the reason is that we don't yet do speculative PRE,
which usually requires profile information. Note that the counter update is
done conditionally in a branch, while the counter update of block %8 is fully
optimized away.
David
On Thu, Mar 10, 2016 at 10:13 PM, Sean Silva <chisophugis at gmail.com> wrote:
>
>
> On Thu, Mar 10, 2016 at 9:42 PM, Xinliang David Li <xinliangli at gmail.com>
> wrote:
>
>>
>>
>> On Thu, Mar 10, 2016 at 8:33 PM, Sean Silva via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>>
>>>
>>> On Thu, Mar 10, 2016 at 7:21 PM, Vedant Kumar via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'd like to add a new pass to LLVM which removes redundant profile
>>>> counter
>>>> updates. The goal is to speed up code coverage testing and profile
>>>> generation
>>>> for PGO.
>>>>
>>>
>>> We may want to have a focused discussion about this goal, rather than a
>>> particular suggestion. There are a lot of things we can do. Off the top of
>>> my head, some are:
>>>
>>> 1. Add some sort of alias annotation (such as an independent TBAA root
>>> node) to all the counter increment memory instructions to tell the
>>> optimizer they don't interfere with the usual loads and stores (see the
>>> sketch after this list).
>>>
>>> 2. Communicate to the optimizer that counters can be registerized. In a
>>> loop like:
>>>
>>>   for (int i = 0; i < N; i++) {
>>>     if (foo())
>>>       bar();
>>>     else
>>>       baz();
>>>   }
>>>
>>> we perform O(N) counter increments (i.e., load, increment, store), last I
>>> checked. However, if the counters are in registers, then we only perform
>>> O(1) memory operations. This can dramatically reduce the pressure on the
>>> CPU's load/store units and also relieve cross-core cache-line ping-pong
>>> when two cores are executing the same code. Note that the latter benefit
>>> is attained even if we ultimately end up spilling the counters due to
>>> increased register pressure.
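>>>
>>> For 1, here's a minimal sketch of the tagging, assuming the standard
>>> MDBuilder TBAA helpers; tagCounterAccesses and the way the counter
>>> loads/stores get collected are hypothetical. (Caveat: a private TBAA root
>>> only proves disjointness against TBAA-tagged user accesses.)
>>>
>>>   #include "llvm/ADT/ArrayRef.h"
>>>   #include "llvm/IR/Instruction.h"
>>>   #include "llvm/IR/LLVMContext.h"
>>>   #include "llvm/IR/MDBuilder.h"
>>>   using namespace llvm;
>>>
>>>   // Tag every counter load/store with a TBAA tag hanging off a private
>>>   // root, so alias analysis can prove the counters never alias the
>>>   // program's own (TBAA-tagged) loads and stores.
>>>   static void tagCounterAccesses(LLVMContext &Ctx,
>>>                                  ArrayRef<Instruction *> CounterAccesses) {
>>>     MDBuilder MDB(Ctx);
>>>     MDNode *Root = MDB.createTBAARoot("__llvm_profile_counters");
>>>     MDNode *Ty = MDB.createTBAAScalarTypeNode("profile counter", Root);
>>>     MDNode *Tag = MDB.createTBAAStructTagNode(Ty, Ty, /*Offset=*/0);
>>>     for (Instruction *I : CounterAccesses)
>>>       I->setMetadata(LLVMContext::MD_tbaa, Tag);
>>>   }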
>>>
>>> I actually don't know what is preventing the usual optimization pipeline
>>> from getting 2 right.
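>>>
>>> Concretely, "getting 2 right" would mean the optimizer rewriting the loop
>>> roughly like this (hand-written sketch; the __profc_foo array and its
>>> layout are hypothetical stand-ins for the instrumentation's counters, and
>>> the rewrite is only sound if nothing in the loop can touch them):
>>>
>>>   #include <cstdint>
>>>
>>>   uint64_t __profc_foo[3]; // stand-in for the real counter array
>>>   bool foo(); void bar(); void baz();
>>>
>>>   void registerizedLoop(int N) {
>>>     // One load and one store per counter instead of O(N) memory ops;
>>>     // also avoids cross-core cache-line ping-pong on the counters.
>>>     uint64_t CThen = __profc_foo[1], CElse = __profc_foo[2];
>>>     for (int i = 0; i < N; i++) {
>>>       if (foo()) { ++CThen; bar(); }
>>>       else       { ++CElse; baz(); }
>>>     }
>>>     __profc_foo[1] = CThen;
>>>     __profc_foo[2] = CElse;
>>>   }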
>>>
>>
>> Call mod-ref. We need to teach the optimizer that a counter owned by the
>> current function (if the function can be proved non-recursive in some way)
>> cannot be modified by any other call.
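>>
>> Schematically (hypothetical names), the barrier looks like this:
>>
>>   #include <cstdint>
>>
>>   uint64_t __profc_f2[2]; // counters owned by this function
>>   void bar();             // opaque call: conservatively may clobber memory
>>
>>   void f2(int n) {
>>     for (int i = 0; i < n; i++) {
>>       __profc_f2[1]++; // reloaded and re-stored every iteration, because...
>>       bar();           // ...mod-ref says bar() might read or write __profc_f2
>>     }
>>   }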
>>
>
> I don't think that's a sufficient explanation. Consider the following
> example:
>
> Sean:~/tmp % cat testprofile.cpp
> int foo(int n) {
>   unsigned Ret = 42;
>   for (int i = 0; i < n; i++) {
>     if (i % 100) {
>       Ret += 789;
>     } else {
>       Ret *= (283 + Ret);
>     }
>   }
>   return Ret;
> }
>
> Sean:~/tmp % ~/pg/release/bin/clang++ -o - -fprofile-instr-generate
> testprofile.cpp -S -emit-llvm -O2 >foo.ll
> Sean:~/tmp % ~/pg/release/bin/opt -view-cfg foo.ll
>
>
>
> [image: CFG of foo() as rendered by -view-cfg; see PNG attachment:
> http://lists.llvm.org/pipermail/llvm-dev/attachments/20160310/b2dc0526/attachment-0001.png]
>
>
>
> -- Sean Silva
>
>
>>
>> David
>>
>>
>>
>>
>>>
>>>
>>>>
>>>> I'm sending this email out to describe my approach, share some early
>>>> results,
>>>> and gather feedback.
>>>>
>>>>
>>>> Problem Overview
>>>> ================
>>>>
>>>> A profile counter is redundant if it's incremented in exactly the same
>>>> basic
>>>> blocks as some other profile counter. Consider the following module:
>>>>
>>>> local void f1() {
>>>>   instrprof_increment(profc_f1);
>>>> }
>>>>
>>>> void f2() {
>>>>   instrprof_increment(profc_f2);
>>>>   f1();
>>>> }
>>>>
>>>> Once the inliner runs and deletes f1, we're left with:
>>>>
>>>> void f2() {
>>>>   instrprof_increment(profc_f2);
>>>>   instrprof_increment(profc_f1);
>>>> }
>>>>
>>>> Now we can say profc_f1 is redundant (or rather, an alias for profc_f2).
>>>>
>>>> I've noticed that frontend-based instrumentation can generate many
>>>> redundant
>>>> profile counters. This degrades performance and increases code size.
>>>> We can
>>>> address the problem by getting rid of redundant counter updates. The
>>>> trick is
>>>> to make sure we get back the same profiles.
>>>>
>>>>
>>>> Proposed Solution
>>>> =================
>>>>
>>>> I propose a pruning pass which takes the following steps:
>>>>
>>>> 1. Delete functions with local linkage and only one use, if that use
>>>> is in
>>>> a profile data record.
>>>>
>>>> These functions are left around by the inliner (it doesn't know
>>>> that
>>>> they're safe to delete). Deleting them reduces code size and
>>>> simplifies
>>>> subsequent analysis of profile counters.
>>>>
>>>> 2. Determine which profile counters are essential (a sketch of this
>>>> analysis appears below).
>>>>
>>>> 3. Erase all updates to redundant profile counters.
>>>>
>>>> 4. Emit the aliases into a new section in the binary.
>>>>
>>>> Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some changes
>>>> in compiler-rt are required to walk the alias section and fill in the
>>>> correct execution counts at program exit time (see the runtime sketch
>>>> below).
>>>>
>>>> This pass needs to be run after the inliner in order to be effective.
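>>>>
>>>> Here is a minimal sketch of the redundancy analysis in step 2, modeling
>>>> counters as integer IDs and their update sites as sets of basic-block
>>>> IDs. The names and representation are hypothetical, not the actual pass:
>>>>
>>>>   #include <map>
>>>>   #include <set>
>>>>   #include <vector>
>>>>
>>>>   // Two counters updated in exactly the same basic blocks are redundant
>>>>   // duplicates; keep one "essential" leader per distinct update set.
>>>>   std::map<int, int>
>>>>   findAliases(const std::vector<std::set<int>> &UpdateSites) {
>>>>     std::map<std::set<int>, int> Leader; // update set -> essential ctr
>>>>     std::map<int, int> Alias;            // redundant ctr -> its leader
>>>>     for (int C = 0; C != (int)UpdateSites.size(); ++C) {
>>>>       auto [It, IsNew] = Leader.try_emplace(UpdateSites[C], C);
>>>>       if (!IsNew)
>>>>         Alias[C] = It->second; // same blocks as an earlier counter
>>>>     }
>>>>     return Alias;
>>>>   }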
>>>>
>>>> The complexity of this pass is O(N*M), where N is the number of profile
>>>> counters, and M is the average number of updates per counter. In
>>>> practice it is
>>>> a bit faster, since we can skip the analysis of counters which are
>>>> discovered to
>>>> be redundant early on in the process.
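>>>>
>>>> For the compiler-rt side of step 4, the exit-time fix-up could look
>>>> something like this sketch. The section name and the linker-provided
>>>> bounds symbols are hypothetical and platform-specific:
>>>>
>>>>   #include <cstdint>
>>>>
>>>>   struct CounterAlias { uint64_t *Dst; uint64_t *Src; };
>>>>
>>>>   // Hypothetical linker-provided bounds of the alias section.
>>>>   extern CounterAlias __start___llvm_prf_alias[];
>>>>   extern CounterAlias __stop___llvm_prf_alias[];
>>>>
>>>>   // At program exit, copy each essential counter's value into the pruned
>>>>   // counter it stands in for, so the emitted profile is unchanged.
>>>>   static void fixupAliasedCounters() {
>>>>     for (CounterAlias *A = __start___llvm_prf_alias;
>>>>          A != __stop___llvm_prf_alias; ++A)
>>>>       *A->Dst = *A->Src;
>>>>   }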
>>>>
>>>
>>> I think a conceptually simpler design is something like:
>>>
>>>   for each CFG edge:
>>>     record which FE counters have ended up associated with it
>>>   remove FE counters
>>>   run IR instrumentation pass
>>>   emit a side table mapping IR instr counters to FE counters (more
>>>   generally: how to reconstruct FE counters from the IR counters)
>>>
>>> The problem is simply reduced to the IR instrumentation pass.
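>>>
>>> The side table could be as simple as, per FE counter, the list of IR
>>> counters whose sum reconstructs it. A sketch with a hypothetical
>>> representation:
>>>
>>>   #include <cstdint>
>>>   #include <vector>
>>>
>>>   // One entry per FE counter: indices of the IR counters whose sum
>>>   // reconstructs the FE count.
>>>   struct FECounterEntry { std::vector<unsigned> IRCounterIdx; };
>>>
>>>   uint64_t reconstructFECount(const FECounterEntry &E,
>>>                               const std::vector<uint64_t> &IRCounts) {
>>>     uint64_t Sum = 0;
>>>     for (unsigned Idx : E.IRCounterIdx)
>>>       Sum += IRCounts[Idx];
>>>     return Sum;
>>>   }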
>>>
>>>
>>>>
>>>>
>>>> Early Results
>>>> =============
>>>>
>>>> The pruning pass results in a 25% speed improvement in the example
>>>> program above (where f2 is called in a loop 10^8 times).
>>>>
>>>> Here is a slightly less contrived example:
>>>>
>>>> #include <vector>
>>>> #include <algorithm>
>>>> #include <cstdlib>
>>>>
>>>> static void escape(void *p) {
>>>>   asm volatile("" : : "g"(p) : "memory");
>>>> }
>>>>
>>>> int main(int argc, char **argv) {
>>>>   std::vector<int> V(atoi(argv[1]));
>>>>   escape(reinterpret_cast<void *>(V.data()));
>>>>   std::sort(V.begin(), V.end());
>>>>   return V[0];
>>>> }
>>>>
>>>> I get the following results on my desktop (10^8 elements, 5 runs each):
>>>>
>>>>   O3:                            0.262s
>>>>   O3 + PGOInstr:                 0.663s
>>>>   O3 + PGOInstr + Pruning:       0.606s (8.6% performance win, 672 aliases)
>>>>   O3 + CoverageInstr:            0.690s
>>>>   O3 + CoverageInstr + Pruning:  0.610s (11.6% performance win, 688 aliases)
>>>>
>>>>
>>>> Next Steps?
>>>> ===========
>>>>
>>>> Is the performance of instrumented code something we think we need to
>>>> fix?
>>>>
>>>
>>> With frontend instrumentation it definitely is.
>>>
>>>> What's an acceptable compile-time overhead for running this pruning pass?
>>>
>>>
>>> Instrumented builds are "special" anyway, so a fair slowdown is probably
>>> acceptable; i.e., this doesn't affect regular developer compilation times.
>>> As a baseline, maybe compare the compile time of `-fprofile-instr-generate
>>> -O2` vs. `-O2`. If `-fprofile-instr-generate -O2` is 10% slower than the
>>> control, that indicates that a 5% extra slowdown is probably reasonable
>>> for a substantial reduction in profiling overhead (which can result in a
>>> qualitative improvement to the actual usability of the instrumented
>>> program).
>>>
>>> -- Sean Silva
>>>
>>>
>>>
>>>> Is
>>>> the general approach a non-starter for anybody?
>>>>
>>>> I'd like to get some feedback and gauge interest before pursuing this
>>>> further.
>>>> Possible next steps include benchmarking instrumented versions of clang
>>>> and swift on the relevant test suites, running performance tests from
>>>> LNT, running compile-time tests, and measuring any code size differences.
>>>>
>>>>
>>>> thanks
>>>> vedant
>>>>
>>>
>>>
>>>
>>
>