[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
Sean Silva via llvm-dev
llvm-dev at lists.llvm.org
Thu Mar 10 22:13:34 PST 2016
On Thu, Mar 10, 2016 at 9:42 PM, Xinliang David Li <xinliangli at gmail.com>
wrote:
>
>
> On Thu, Mar 10, 2016 at 8:33 PM, Sean Silva via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>>
>>
>> On Thu, Mar 10, 2016 at 7:21 PM, Vedant Kumar via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> Hi,
>>>
>>> I'd like to add a new pass to LLVM which removes redundant profile
>>> counter updates. The goal is to speed up code coverage testing and
>>> profile generation for PGO.
>>>
>>
>> We may want to have a focused discussion about this goal, rather than a
>> particular suggestion. There are a lot of things we can do. Off the top of
>> my head, some are:
>>
>> 1. add some sort of alias annotation (such as an independent TBAA root
>> node) to all the counter increment memory instructions to tell the
>> optimizer they don't interfere with the usual loads and stores.
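>>
>> A minimal sketch of what idea 1 could look like via LLVM's MDBuilder
>> (the root/type names are hypothetical; this is not what the
>> instrumentation pass emits today):
>>
>>   #include "llvm/IR/LLVMContext.h"
>>   #include "llvm/IR/MDBuilder.h"
>>   #include "llvm/IR/Metadata.h"
>>   using namespace llvm;
>>
>>   // Build a TBAA access tag under a dedicated root, so alias analysis
>>   // can prove counter updates never interfere with user memory.
>>   static MDNode *getCounterTBAATag(LLVMContext &Ctx) {
>>     MDBuilder MDB(Ctx);
>>     MDNode *Root = MDB.createTBAARoot("profile counters");
>>     MDNode *Ty = MDB.createTBAAScalarTypeNode("counter", Root);
>>     return MDB.createTBAAStructTagNode(Ty, Ty, /*Offset=*/0);
>>   }
>>
>>   // Each counter load/store the instrumentation emits then gets:
>>   //   I->setMetadata(LLVMContext::MD_tbaa, Tag);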
>>
>> 2. communicate to the optimizer that counters can be registerized. In a
>> loop like:
>> for (int i = 0; i < N; i++) {
>>   if (foo())
>>     bar();
>>   else
>>     baz();
>> }
>> we perform O(N) counter increments (i.e. load, increment, store) last I
>> checked. However, if the counters are in registers, then we only perform
>> O(1) memory operations. This can dramatically reduce the pressure on the
>> CPU's load/store units and also relieve cross-core cache line ping-pong
>> when two cores are executing the same code. Note that the latter benefit is
>> attained even if we ultimately end up spilling the counters due to
>> increased register pressure.
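>>
>> Conceptually the transformation for the loop above would be (a sketch;
>> the __profc_* names are made up for illustration):
>>
>>   uint64_t CntThen = 0, CntElse = 0; // counters live in registers
>>   for (int i = 0; i < N; i++) {
>>     if (foo()) { ++CntThen; bar(); }
>>     else       { ++CntElse; baz(); }
>>   }
>>   // O(1) memory updates instead of O(N):
>>   __profc_then += CntThen;
>>   __profc_else += CntElse;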
>>
>> I actually don't know what is preventing the usual optimization pipeline
>> from getting 2 right.
>>
>
> Call mod-ref. We need to teach the optimizer that a counter owned by
> the current function (if the function can be proved non-recursive in
> some way) cannot be modified by any other calls.
>
I don't think that's a sufficient explanation. Consider the following
example:
Sean:~/tmp % cat testprofile.cpp
int foo(int n) {
  unsigned Ret = 42;
  for (int i = 0; i < n; i++) {
    if (i % 100) {
      Ret += 789;
    } else {
      Ret *= (283 + Ret);
    }
  }
  return Ret;
}
Sean:~/tmp % ~/pg/release/bin/clang++ -o - -fprofile-instr-generate testprofile.cpp -S -emit-llvm -O2 >foo.ll
Sean:~/tmp % ~/pg/release/bin/opt -view-cfg foo.ll
[image: CFG of foo.ll as rendered by opt -view-cfg; original attachment:
http://lists.llvm.org/pipermail/llvm-dev/attachments/20160310/f2c960ae/attachment-0001.png]
-- Sean Silva
>
> David
>
>
>
>
>>
>>
>>>
>>> I'm sending this email out to describe my approach, share some early
>>> results, and gather feedback.
>>>
>>>
>>> Problem Overview
>>> ================
>>>
>>> A profile counter is redundant if it's incremented in exactly the same
>>> basic
>>> blocks as some other profile counter. Consider the following module:
>>>
>>> local void f1() {
>>>   instrprof_increment(profc_f1);
>>> }
>>>
>>> void f2() {
>>>   instrprof_increment(profc_f2);
>>>   f1();
>>> }
>>>
>>> Once the inliner runs and inlines f1 into f2, we're left with:
>>>
>>> void f2() {
>>>   instrprof_increment(profc_f2);
>>>   instrprof_increment(profc_f1);
>>> }
>>>
>>> Now we can say profc_f1 is redundant (or an alias for profc_f2).
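>>>
>>> After pruning, only one update would remain, with the alias recorded
>>> on the side (a sketch of the intended result):
>>>
>>> void f2() {
>>>   instrprof_increment(profc_f2);
>>>   // update of profc_f1 erased; the recorded alias
>>>   // {Dst: profc_f1, Src: profc_f2} lets the runtime
>>>   // reconstruct profc_f1's count at exit.
>>> }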
>>>
>>> I've noticed that frontend-based instrumentation can generate many
>>> redundant profile counters. This degrades performance and increases
>>> code size. We can address the problem by getting rid of redundant
>>> counter updates. The trick is to make sure we get back the same
>>> profiles.
>>>
>>>
>>> Proposed Solution
>>> =================
>>>
>>> I propose a pruning pass which takes the following steps:
>>>
>>> 1. Delete functions with local linkage and only one use, if that use
>>>    is in a profile data record.
>>>
>>>    These functions are left around by the inliner (it doesn't know
>>>    that they're safe to delete). Deleting them reduces code size and
>>>    simplifies subsequent analysis of profile counters.
>>>
>>> 2. Determine which profile counters are essential.
>>>
>>> 3. Erase all updates to redundant profile counters.
>>>
>>> 4. Emit the aliases into a new section in the binary.
>>>
>>>    Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some
>>>    changes in compiler-rt are required to walk the alias section and
>>>    fill in the correct execution counts at program exit time (a sketch
>>>    of the runtime side follows below).
>>>
>>> This pass needs to be run after the inliner in order to be effective.
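>>>
>>> A minimal sketch of the runtime side (all names hypothetical; the
>>> actual compiler-rt changes may look different):
>>>
>>> #include <stdint.h>
>>>
>>> typedef struct { uint64_t *Dst; const uint64_t *Src; } ProfAlias;
>>>
>>> /* At program exit, walk the alias section and copy each source
>>>    counter's value over the pruned destination counter. */
>>> static void applyCounterAliases(const ProfAlias *Begin,
>>>                                 const ProfAlias *End) {
>>>   for (const ProfAlias *A = Begin; A != End; ++A)
>>>     *A->Dst = *A->Src;
>>> }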
>>>
>>> The complexity of this pass is O(N*M), where N is the number of
>>> profile counters, and M is the average number of updates per counter.
>>> In practice it is a bit faster, since we can skip the analysis of
>>> counters which are discovered to be redundant early on in the process.
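>>>
>>> One way to picture the analysis (a sketch; sortedUpdateSites and
>>> recordAlias are hypothetical helpers): key each counter by the basic
>>> blocks that update it, and alias any counter whose key was already
>>> seen.
>>>
>>> std::map<std::vector<const BasicBlock *>, GlobalVariable *> Canonical;
>>> for (GlobalVariable *Counter : Counters) {   // N counters
>>>   auto Key = sortedUpdateSites(Counter);     // visits M updates
>>>   auto Ins = Canonical.insert({Key, Counter});
>>>   if (!Ins.second)                           // key seen before:
>>>     recordAlias(/*Dst=*/Counter, /*Src=*/Ins.first->second);
>>> }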
>>>
>>
>> I think a conceptually simpler design is something like:
>>
>>   for each CFG edge:
>>     record which FE counters have ended up associated with it
>>   remove FE counters
>>   run IR instrumentation pass
>>   emit a side table mapping IR instr counters to FE counters (more
>>   generally: how to reconstruct FE counters from the IR counters)
>>
>> The problem is simply reduced to the IR instrumentation pass.
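>>
>> The side table might look something like this (a purely hypothetical
>> layout): each FE counter is recovered as a sum over IR edge counters.
>>
>>   struct FECounterRemap {
>>     uint64_t FEIndex;              // index into the FE counter array
>>     std::vector<uint64_t> IREdges; // IR counters whose sum recovers it
>>   };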
>>
>>
>>>
>>>
>>> Early Results
>>> =============
>>>
>>> The pruning pass results in a 25% speed improvement in the example
>>> program above (where f2 is called in a loop 10^8 times).
>>>
>>> Here is a slightly less contrived example:
>>>
>>> #include <vector>
>>> #include <algorithm>
>>> #include <cstdlib>
>>>
>>> static void escape(void *p) {
>>>   asm volatile("" : : "g"(p) : "memory");
>>> }
>>>
>>> int main(int argc, char **argv) {
>>>   std::vector<int> V(atoi(argv[1]));
>>>   escape(reinterpret_cast<void *>(V.data()));
>>>   std::sort(V.begin(), V.end());
>>>   return V[0];
>>> }
>>>
>>> I get the following results on my desktop (10^8 elements, 5 runs each):
>>>
>>>   O3:                            0.262s
>>>   O3 + PGOInstr:                 0.663s
>>>   O3 + PGOInstr + Pruning:       0.606s  (8.6% performance win, 672 aliases)
>>>   O3 + CoverageInstr:            0.690s
>>>   O3 + CoverageInstr + Pruning:  0.610s  (11.6% performance win, 688 aliases)
>>>
>>>
>>> Next Steps?
>>> ===========
>>>
>>> Is the performance of instrumented code something we think we need
>>> to fix?
>>>
>>
>> With frontend instrumentation it definitely is.
>>
>>> What's an acceptable compile-time overhead for running this pruning pass?
>>
>>
>> Instrumented builds are "special" anyway so a fair slowdown is probably
>> acceptable. I.e. this doesn't affect regular developer compilation times.
>> As a baseline, maybe compare compile time of `-fprofile-instr-generate -O2`
>> vs. `-O2`. If `-fprofile-instr-generate -O2` is 10% slower than the
>> control, then that indicates that a 5% extra slowdown is probably
>> reasonable for a substantial reduction in profiling overhead (which can
>> result in a qualitative improvement to the actual usability of the
>> instrumented program).
>>
>> -- Sean Silva
>>
>>
>>
>>> Is the general approach a non-starter for anybody?
>>>
>>> I'd like to get some feedback and gauge interest before pursuing this
>>> further. Possible next steps include benchmarking instrumented
>>> versions of clang and swift on the relevant test suites, running
>>> performance tests from lnt, running compile-time tests, and measuring
>>> any code size differences.
>>>
>>>
>>> thanks
>>> vedant
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>