[llvm-dev] RFC: Pass to prune redundant profiling instrumentation

Sean Silva via llvm-dev llvm-dev at lists.llvm.org
Fri Mar 11 00:20:28 PST 2016


On Thu, Mar 10, 2016 at 10:36 PM, Xinliang David Li <xinliangli at gmail.com>
wrote:

>
>
> On Thu, Mar 10, 2016 at 10:13 PM, Sean Silva <chisophugis at gmail.com>
> wrote:
>
>>
>>
>> On Thu, Mar 10, 2016 at 9:42 PM, Xinliang David Li <xinliangli at gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Thu, Mar 10, 2016 at 8:33 PM, Sean Silva via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Mar 10, 2016 at 7:21 PM, Vedant Kumar via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'd like to add a new pass to LLVM which removes redundant profile
>>>>> counter
>>>>> updates. The goal is to speed up code coverage testing and profile
>>>>> generation
>>>>> for PGO.
>>>>>
>>>>
>>>> We may want to have a focused discussion about this goal, rather than a
>>>> particular suggestion. There are a lot of things we can do. Off the top of
>>>> my head, some are:
>>>>
>>>> 1. add some sort of alias annotation (such as an independent TBAA root
>>>> node) to all the counter increment memory instructions to tell the
>>>> optimizer they don't interfere with the usual loads and stores.
>>>>
>>>> 2. communicate to the optimizer that counters can be registerized. In a
>>>> loop like:
>>>> for (int i = 0; i < N; i++) {
>>>>   if (foo())
>>>>     bar();
>>>>   else
>>>>     baz();
>>>> }
>>>> we perform O(N) counter increments (i.e. load, increment, store) last I
>>>> checked. However, if the counters are in registers, then we only perform
>>>> O(1) memory operations. This can dramatically reduce the pressure on the
>>>> CPU's load/store units and also relieve cross-core cache line ping-pong
>>>> when two cores are executing the same code. Note that the latter benefit is
>>>> attained even if we ultimately end up spilling the counters due to
>>>> increased register pressure.
>>>>
>>>> I actually don't know what is preventing the usual optimization
>>>> pipeline from getting 2 right.
>>>>
>>>
>>> Call Mod-ref.  We need to teach the optimizer that the counter owned by
>>> the current function (if the function is proved to be non-recursive in some
>>> way) can not be modified by any other calls.
>>>
>>
>> I don't think that's a sufficient explanation. Consider the following
>> example:
>>
>> Sean:~/tmp % cat testprofile.cpp
>> int foo(int n) {
>>   unsigned Ret = 42;
>>   for (int i = 0; i < n; i++) {
>>     if (i % 100) {
>>       Ret += 789;
>>     } else {
>>       Ret *= (283 + Ret);
>>     }
>>   }
>>   return Ret;
>> }
>>
>> Sean:~/tmp % ~/pg/release/bin/clang++ -o - -fprofile-instr-generate
>> testprofile.cpp -S -emit-llvm  -O2 >foo.ll
>> Sean:~/tmp % ~/pg/release/bin/opt -view-cfg foo.ll
>>
>>
>>
> (resent)
>
> I saw your example has function call in the loop, thus the comment about
> Modref.
>
> For this example, the reason seems to that we don't yet do speculative PRE
> which usually requires profile information. Note that the update is
> conditionally done in a branch.  Also note that the counter update of the
> block %8 is fully optimized away.
>

(forgive my ignorance of compiler theory)

Should we generally expect the optimizer to get this right at some point?
(i.e. is it a bug in the optimizer?) Or do you think that a special pass is
needed to get the desired effect for counters for normal use of
instrumentation?
It seems like inherently counter updates will happen only conditionally,
and so this kind of situation will be common.

Counters seem like they have special semantics that allows more
flexibility. For example, the call mod-ref problem is actually not a
problem due to commutativity. We know that the callee will at most
increment the counter, so as long as the load,increment,store is not
interrupted by a call (which is easy to avoid) then we can freely move it
across calls. (actually, in theory a call into the profile lib could read
the counters, but in practice I expect this to not be an issue).

-- Sean Silva


>
> David
>
>
>
>>
>> -- Sean Silva
>>
>>
>>>
>>> David
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>>>
>>>>> I'm sending this email out to describe my approach, share some early
>>>>> results,
>>>>> and gather feedback.
>>>>>
>>>>>
>>>>> Problem Overview
>>>>> ================
>>>>>
>>>>> A profile counter is redundant if it's incremented in exactly the same
>>>>> basic
>>>>> blocks as some other profile counter. Consider the following module:
>>>>>
>>>>>     local void f1() {
>>>>>       instrprof_increment(profc_f1);
>>>>>     }
>>>>>
>>>>>     void f2() {
>>>>>       instrprof_increment(profc_f2);
>>>>>       f1();
>>>>>     }
>>>>>
>>>>> Once the inliner runs and deletes f1, we're left with:
>>>>>
>>>>>     void f2() {
>>>>>       instrprof_increment(profc_f2);
>>>>>       instrprof_increment(profc_f1);
>>>>>     }
>>>>>
>>>>> Now we can say profc_f1 is redundant (or, an alias for profc_f2).
>>>>>
>>>>> I've noticed that frontend-based instrumentation can generate many
>>>>> redundant
>>>>> profile counters. This degrades performance and increases code size.
>>>>> We can
>>>>> address the problem by getting rid of redundant counter updates. The
>>>>> trick is
>>>>> to make sure we get back the same profiles.
>>>>>
>>>>>
>>>>> Proposed Solution
>>>>> =================
>>>>>
>>>>> I propose a pruning pass which takes the following steps:
>>>>>
>>>>>   1. Delete functions with local linkage and only one use, if that use
>>>>> is in
>>>>>      a profile data record.
>>>>>
>>>>>      These functions are left around by the inliner (it doesn't know
>>>>> that
>>>>>      they're safe to delete). Deleting them reduces code size and
>>>>> simplifies
>>>>>      subsequent analysis of profile counters.
>>>>>
>>>>>   2. Determine which profile counters are essential.
>>>>>
>>>>>   3. Erase all updates to redundant profile counters.
>>>>>
>>>>>   4. Emit the aliases into a new section in the binary.
>>>>>
>>>>>      Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some
>>>>> changes
>>>>>      in compiler-rt are required to walk the alias section and fill in
>>>>> the
>>>>>      correct execution counts at program exit time.
>>>>>
>>>>> This pass needs to be run after the inliner in order to be effective.
>>>>>
>>>>> The complexity of this pass is O(N*M), where N is the number of profile
>>>>> counters, and M is the average number of updates per counter. In
>>>>> practice it is
>>>>> a bit faster, since we can skip the analysis of counters which are
>>>>> discovered to
>>>>> be redundant early on in the process.
>>>>>
>>>>
>>>> I think a conceptually simpler design is something like:
>>>>
>>>> for each CFG edge:
>>>>     record which FE counters have ended up associated with it
>>>> remove FE counters
>>>> run IR instrumentation pass
>>>> emit a side table mapping IR instr counters to FE counters (more
>>>> generally: how to reconstruct FE counters from the IR counters)
>>>>
>>>> The problem is simply reduced to the IR instrumentation pass.
>>>>
>>>>
>>>>>
>>>>>
>>>>> Early Results
>>>>> =============
>>>>>
>>>>> The pruning pass results in 25% speed improvement in the example
>>>>> program above
>>>>> (where f2 is called in a loop 10^8 times).
>>>>>
>>>>> Here is a slightly less contrived example:
>>>>>
>>>>>   #include <vector>
>>>>>   #include <algorithm>
>>>>>   #include <cstdlib>
>>>>>
>>>>>   static void escape(void *p) {
>>>>>     asm volatile("" : : "g"(p) : "memory");
>>>>>   }
>>>>>
>>>>>   int main(int argc, char **argv) {
>>>>>     std::vector<int> V(atoi(argv[1]));
>>>>>     escape(reinterpret_cast<void *>(V.data()));
>>>>>     std::sort(V.begin(), V.end());
>>>>
>>>>     return V[0];
>>>>>   }
>>>>>
>>>>> I get the following results on my desktop (10^8 elements, 5 runs each):
>>>>>
>>>>>   O3:                           0.262s
>>>>>   O3 + PGOInstr:                0.663s
>>>>>   O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, 672
>>>>> aliases)
>>>>>   O3 + CoverageInstr:           0.690s
>>>>>   O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, 688
>>>>> aliases)
>>>>>
>>>>>
>>>>> Next Steps?
>>>>> ===========
>>>>>
>>>>> Is the performance of instrumented code something we think we need to
>>>>> fix?
>>>>>
>>>>
>>>> With frontend instrumentation it definitely is.
>>>>
>>>> What's an acceptable compile-time overhead for running this pruning
>>>>> pass?
>>>>
>>>>
>>>> Instrumented builds are "special" anyway so a fair slowdown is probably
>>>> acceptable. I.e. this doesn't affect regular developer compilation times.
>>>> As a baseline, maybe compare compile time of `-fprofile-instr-generate -O2`
>>>> vs. `-O2`. If `-fprofile-instr-generate -O2` is 10% slower than the
>>>> control, then that indicates that a 5% extra slowdown is probably
>>>> reasonable for a substantial reduction in profiling overhead (which can
>>>> result in a qualitative improvement to the actual usability of the
>>>> instrumented program).
>>>>
>>>> -- Sean Silva
>>>>
>>>>
>>>>
>>>>> Is
>>>>> the general approach a non-starter for anybody?
>>>>>
>>>>> I'd like to get some feedback and gauge interest before pursuing this
>>>>> further.
>>>>> Possible next steps include benchmarking instrumented versions of
>>>>> clang and
>>>>> swift on the relevant test suites, running performance tests from lnt,
>>>>> running
>>>>> compile-time tests, and measuring any code size differences.
>>>>>
>>>>>
>>>>> thanks
>>>>> vedant
>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160311/1ba731c9/attachment.html>


More information about the llvm-dev mailing list