[llvm-dev] RFC: Pass to prune redundant profiling instrumentation

Xinliang David Li via llvm-dev llvm-dev at lists.llvm.org
Fri Mar 11 08:44:47 PST 2016


On Fri, Mar 11, 2016 at 12:20 AM, Sean Silva <chisophugis at gmail.com> wrote:

>
> On Thu, Mar 10, 2016 at 10:36 PM, Xinliang David Li <xinliangli at gmail.com>
> wrote:
>
>>
>>
>> On Thu, Mar 10, 2016 at 10:13 PM, Sean Silva <chisophugis at gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Thu, Mar 10, 2016 at 9:42 PM, Xinliang David Li <xinliangli at gmail.com
>>> > wrote:
>>>
>>>>
>>>>
>>>> On Thu, Mar 10, 2016 at 8:33 PM, Sean Silva via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, Mar 10, 2016 at 7:21 PM, Vedant Kumar via llvm-dev <
>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'd like to add a new pass to LLVM which removes redundant profile
>>>>>> counter
>>>>>> updates. The goal is to speed up code coverage testing and profile
>>>>>> generation
>>>>>> for PGO.
>>>>>>
>>>>>
>>>>> We may want to have a focused discussion about this goal, rather than
>>>>> a particular suggestion. There are a lot of things we can do. Off the top
>>>>> of my head, some are:
>>>>>
>>>>> 1. add some sort of alias annotation (such as an independent TBAA root
>>>>> node) to all the counter increment memory instructions to tell the
>>>>> optimizer they don't interfere with the usual loads and stores.
>>>>>
>>>>> 2. communicate to the optimizer that counters can be registerized. In
>>>>> a loop like:
>>>>> for (int i = 0; i < N; i++) {
>>>>>   if (foo())
>>>>>     bar();
>>>>>   else
>>>>>     baz();
>>>>> }
>>>>> we perform O(N) counter increments (i.e. load, increment, store) last
>>>>> I checked. However, if the counters are in registers, then we only perform
>>>>> O(1) memory operations. This can dramatically reduce the pressure on the
>>>>> CPU's load/store units and also relieve cross-core cache line ping-pong
>>>>> when two cores are executing the same code. Note that the latter benefit is
>>>>> attained even if we ultimately end up spilling the counters due to
>>>>> increased register pressure.
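>>>>>
>>>>> For concreteness, a hand-written sketch of the registerized form I have
>>>>> in mind (the __profc_* names below are just illustrative, not what the
>>>>> instrumentation emits today):
>>>>>
>>>>> uint64_t c_if = 0, c_else = 0;  // counters live in registers
>>>>> for (int i = 0; i < N; i++) {
>>>>>   if (foo()) {
>>>>>     c_if++;
>>>>>     bar();
>>>>>   } else {
>>>>>     c_else++;
>>>>>     baz();
>>>>>   }
>>>>> }
>>>>> __profc_if += c_if;             // O(1) memory operations after the loop
>>>>> __profc_else += c_else;         // instead of O(N) load/inc/store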
>>>>>
>>>>> I actually don't know what is preventing the usual optimization
>>>>> pipeline from getting 2 right.
>>>>>
>>>>
>>>> Call mod-ref.  We need to teach the optimizer that a counter owned by
>>>> the current function (if the function is proved to be non-recursive in
>>>> some way) cannot be modified by any other call.
>>>>
>>>
>>> I don't think that's a sufficient explanation. Consider the following
>>> example:
>>>
>>> Sean:~/tmp % cat testprofile.cpp
>>> int foo(int n) {
>>>   unsigned Ret = 42;
>>>   for (int i = 0; i < n; i++) {
>>>     if (i % 100) {
>>>       Ret += 789;
>>>     } else {
>>>       Ret *= (283 + Ret);
>>>     }
>>>   }
>>>   return Ret;
>>> }
>>>
>>> Sean:~/tmp % ~/pg/release/bin/clang++ -o - -fprofile-instr-generate
>>> testprofile.cpp -S -emit-llvm  -O2 >foo.ll
>>> Sean:~/tmp % ~/pg/release/bin/opt -view-cfg foo.ll
>>>
>>>
>>>
>> (resent)
>>
>> I saw that your example has a function call in the loop, hence my comment
>> about mod-ref.
>>
>> For this example, the reason seems to be that we don't yet do speculative
>> PRE, which usually requires profile information. Note that the update is
>> done conditionally in a branch.  Also note that the counter update for
>> block %8 is fully optimized away.
>>
>
> (forgive my ignorance of compiler theory)
>
> Should we generally expect the optimizer to get this right at some point?
> (i.e., is it a bug in the optimizer?) Or do you think a special pass is
> needed to get the desired effect for counters in normal use of
> instrumentation?
> It seems like counter updates will inherently happen only conditionally,
> and so this kind of situation will be common.
>

This is a limitation of the compiler. It is not specific to counter
variables -- the same applies to user-defined globals. There might be a bug
tracking this.



>
> Counters seem like they have special semantics that allow more
> flexibility. For example, the call mod-ref problem is actually not a
> problem, due to commutativity. We know that the callee will at most
> increment the counter, so as long as the load/increment/store sequence is
> not interrupted by a call (which is easy to avoid), we can freely move it
> across calls. (Actually, in theory a call into the profile lib could read
> the counters, but in practice I expect this not to be an issue.)
>
>
Good observation, but the assumption that calls don't access the counter
can cause problems -- for instance, register promotion of a counter
variable across a call when the counter is actually updated by a call in
the loop body. At the end of the loop, the updates from those calls will be
lost.
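
As a rough illustration of the hazard (the counter name and helper below
are made up for the example, not what the instrumentation actually emits):

    #include <cstdint>

    extern uint64_t __profc_f;   // hypothetical counter variable

    void loop_body();            // assume this also increments __profc_f

    void caller(int n) {
      uint64_t tmp = __profc_f;  // counter promoted to a register
      for (int i = 0; i < n; i++) {
        tmp += 1;                // the caller's increment, registerized
        loop_body();             // the callee's increments to __profc_f ...
      }
      __profc_f = tmp;           // ... are overwritten by this final store
    }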

David


> -- Sean Silva
>
>
>>
>> David
>>
>>
>>
>>>
>>> -- Sean Silva
>>>
>>>
>>>>
>>>> David
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> I'm sending this email out to describe my approach, share some early
>>>>>> results,
>>>>>> and gather feedback.
>>>>>>
>>>>>>
>>>>>> Problem Overview
>>>>>> ================
>>>>>>
>>>>>> A profile counter is redundant if it's incremented in exactly the
>>>>>> same basic
>>>>>> blocks as some other profile counter. Consider the following module:
>>>>>>
>>>>>>     local void f1() {
>>>>>>       instrprof_increment(profc_f1);
>>>>>>     }
>>>>>>
>>>>>>     void f2() {
>>>>>>       instrprof_increment(profc_f2);
>>>>>>       f1();
>>>>>>     }
>>>>>>
>>>>>> Once the inliner runs and deletes f1, we're left with:
>>>>>>
>>>>>>     void f2() {
>>>>>>       instrprof_increment(profc_f2);
>>>>>>       instrprof_increment(profc_f1);
>>>>>>     }
>>>>>>
>>>>>> Now we can say profc_f1 is redundant (or, an alias for profc_f2).
>>>>>>
>>>>>> I've noticed that frontend-based instrumentation can generate many
>>>>>> redundant
>>>>>> profile counters. This degrades performance and increases code size.
>>>>>> We can
>>>>>> address the problem by getting rid of redundant counter updates. The
>>>>>> trick is
>>>>>> to make sure we get back the same profiles.
>>>>>>
>>>>>>
>>>>>> Proposed Solution
>>>>>> =================
>>>>>>
>>>>>> I propose a pruning pass which takes the following steps:
>>>>>>
>>>>>>   1. Delete functions with local linkage and only one use, if that
>>>>>> use is in
>>>>>>      a profile data record.
>>>>>>
>>>>>>      These functions are left around by the inliner (it doesn't know
>>>>>> that
>>>>>>      they're safe to delete). Deleting them reduces code size and
>>>>>> simplifies
>>>>>>      subsequent analysis of profile counters.
>>>>>>
>>>>>>   2. Determine which profile counters are essential.
>>>>>>
>>>>>>   3. Erase all updates to redundant profile counters.
>>>>>>
>>>>>>   4. Emit the aliases into a new section in the binary.
>>>>>>
>>>>>>      Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some
>>>>>> changes
>>>>>>      in compiler-rt are required to walk the alias section and fill
>>>>>> in the
>>>>>>      correct execution counts at program exit time.
>>>>>>
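>>>>>>      As a rough sketch (with hypothetical names and section layout,
>>>>>>      not a final interface), the exit-time fixup in compiler-rt could
>>>>>>      look like:
>>>>>>
>>>>>>      struct CounterAlias {     // one { Dst, Src } alias record
>>>>>>        uint64_t *Dst;          // pruned counter to fill in at exit
>>>>>>        uint64_t *Src;          // surviving counter it aliases
>>>>>>      };
>>>>>>
>>>>>>      // Assume the linker provides bounds for the new alias section.
>>>>>>      extern CounterAlias AliasSectionBegin[], AliasSectionEnd[];
>>>>>>
>>>>>>      static void fixupAliasedCounters(void) {
>>>>>>        for (CounterAlias *A = AliasSectionBegin; A != AliasSectionEnd; ++A)
>>>>>>          *A->Dst = *A->Src;    // copy the execution count over
>>>>>>      }
>>>>>>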
>>>>>> This pass needs to be run after the inliner in order to be effective.
>>>>>>
>>>>>> The complexity of this pass is O(N*M), where N is the number of
>>>>>> profile
>>>>>> counters, and M is the average number of updates per counter. In
>>>>>> practice it is
>>>>>> a bit faster, since we can skip the analysis of counters which are
>>>>>> discovered to
>>>>>> be redundant early on in the process.
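>>>>>>
>>>>>> To make the analysis in step 2 concrete, here is an illustrative (not
>>>>>> final) sketch of the grouping: counters are keyed by the exact set of
>>>>>> basic blocks that update them, the first counter seen for a given set
>>>>>> is kept, and later ones are recorded as aliases of it.
>>>>>>
>>>>>>   std::map<std::set<const BasicBlock *>, GlobalVariable *> Essential;
>>>>>>   // {Dst, Src} pairs destined for the alias section.
>>>>>>   std::vector<std::pair<GlobalVariable *, GlobalVariable *>> Aliases;
>>>>>>
>>>>>>   void classify(GlobalVariable *Counter,
>>>>>>                 const std::set<const BasicBlock *> &UpdateSites) {
>>>>>>     auto It = Essential.find(UpdateSites);
>>>>>>     if (It == Essential.end())
>>>>>>       Essential.emplace(UpdateSites, Counter);   // essential counter
>>>>>>     else
>>>>>>       Aliases.emplace_back(Counter, It->second); // redundant; alias it
>>>>>>   }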
>>>>>>
>>>>>
>>>>> I think a conceptually simpler design is something like:
>>>>>
>>>>> for each CFG edge:
>>>>>     record which FE counters have ended up associated with it
>>>>> remove FE counters
>>>>> run IR instrumentation pass
>>>>> emit a side table mapping IR instr counters to FE counters (more
>>>>> generally: how to reconstruct FE counters from the IR counters)
>>>>>
>>>>> The problem is simply reduced to the IR instrumentation pass.
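>>>>>
>>>>> As a rough illustration of the side table I have in mind (names made
>>>>> up), each entry would record how to rebuild one FE counter from the IR
>>>>> counters, e.g. as a list of IR counters whose sum gives the FE count:
>>>>>
>>>>> struct FECounterRemap {
>>>>>   uint64_t *FECounter;                 // frontend counter to rebuild
>>>>>   std::vector<uint64_t *> IRCounters;  // IR counters whose sum yields it
>>>>> };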
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Early Results
>>>>>> =============
>>>>>>
>>>>>> The pruning pass results in a 25% speed improvement in the example
>>>>>> program above (where f2 is called in a loop 10^8 times).
>>>>>>
>>>>>> Here is a slightly less contrived example:
>>>>>>
>>>>>>   #include <vector>
>>>>>>   #include <algorithm>
>>>>>>   #include <cstdlib>
>>>>>>
>>>>>>   static void escape(void *p) {
>>>>>>     asm volatile("" : : "g"(p) : "memory");
>>>>>>   }
>>>>>>
>>>>>>   int main(int argc, char **argv) {
>>>>>>     std::vector<int> V(atoi(argv[1]));
>>>>>>     escape(reinterpret_cast<void *>(V.data()));
>>>>>>     std::sort(V.begin(), V.end());
>>>>>
>>>>>     return V[0];
>>>>>>   }
>>>>>>
>>>>>> I get the following results on my desktop (10^8 elements, 5 runs
>>>>>> each):
>>>>>>
>>>>>>   O3:                           0.262s
>>>>>>   O3 + PGOInstr:                0.663s
>>>>>>   O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, 672 aliases)
>>>>>>   O3 + CoverageInstr:           0.690s
>>>>>>   O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, 688 aliases)
>>>>>>
>>>>>>
>>>>>> Next Steps?
>>>>>> ===========
>>>>>>
>>>>>> Is the performance of instrumented code something we think we need to
>>>>>> fix?
>>>>>>
>>>>>
>>>>> With frontend instrumentation it definitely is.
>>>>>
>>>>> What's an acceptable compile-time overhead for running this pruning
>>>>>> pass?
>>>>>
>>>>>
>>>>> Instrumented builds are "special" anyway so a fair slowdown is
>>>>> probably acceptable. I.e. this doesn't affect regular developer compilation
>>>>> times. As a baseline, maybe compare compile time of
>>>>> `-fprofile-instr-generate -O2` vs. `-O2`. If `-fprofile-instr-generate -O2`
>>>>> is 10% slower than the control, then that indicates that a 5% extra
>>>>> slowdown is probably reasonable for a substantial reduction in profiling
>>>>> overhead (which can result in a qualitative improvement to the actual
>>>>> usability of the instrumented program).
>>>>>
>>>>> -- Sean Silva
>>>>>
>>>>>
>>>>>
>>>>>> Is
>>>>>> the general approach a non-starter for anybody?
>>>>>>
>>>>>> I'd like to get some feedback and gauge interest before pursuing this
>>>>>> further.
>>>>>> Possible next steps include benchmarking instrumented versions of
>>>>>> clang and
>>>>>> swift on the relevant test suites, running performance tests from
>>>>>> lnt, running
>>>>>> compile-time tests, and measuring any code size differences.
>>>>>>
>>>>>>
>>>>>> thanks
>>>>>> vedant
>>>>>>
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> llvm-dev at lists.llvm.org
>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>