[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
Xinliang David Li via llvm-dev
llvm-dev at lists.llvm.org
Thu Mar 10 22:34:12 PST 2016
I saw your example has a function call in the loop.

For this example, I think the reason is that we don't yet do speculative PRE,
which usually requires profile information. Note that the counter update is
done conditionally in a branch, while the counter update of block %8 is fully
optimized away.
David
On Thu, Mar 10, 2016 at 10:13 PM, Sean Silva <chisophugis at gmail.com> wrote:
>
>
> On Thu, Mar 10, 2016 at 9:42 PM, Xinliang David Li <xinliangli at gmail.com>
> wrote:
>
>>
>>
>> On Thu, Mar 10, 2016 at 8:33 PM, Sean Silva via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>>
>>>
>>> On Thu, Mar 10, 2016 at 7:21 PM, Vedant Kumar via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'd like to add a new pass to LLVM which removes redundant profile
>>>> counter
>>>> updates. The goal is to speed up code coverage testing and profile
>>>> generation
>>>> for PGO.
>>>>
>>>
>>> We may want to have a focused discussion about this goal, rather than a
>>> particular suggestion. There are a lot of things we can do. Off the top of
>>> my head, some are:
>>>
>>> 1. Add some sort of alias annotation (such as an independent TBAA root
>>> node) to all the counter increment memory instructions to tell the
>>> optimizer they don't interfere with the usual loads and stores (see the
>>> sketch after this list).
>>>
>>> 2. Communicate to the optimizer that counters can be registerized. In a
>>> loop like:
>>>
>>>   for (int i = 0; i < N; i++) {
>>>     if (foo())
>>>       bar();
>>>     else
>>>       baz();
>>>   }
>>>
>>> we perform O(N) counter increments (i.e., load, increment, store), last I
>>> checked. However, if the counters are in registers, then we only perform
>>> O(1) memory operations. This can dramatically reduce the pressure on the
>>> CPU's load/store units and also relieve cross-core cache-line ping-pong
>>> when two cores are executing the same code. Note that the latter benefit
>>> is attained even if we ultimately end up spilling the counters due to
>>> increased register pressure.
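>>>
>>> For 1, here's a minimal sketch of the tagging, assuming the standard
>>> MDBuilder TBAA helpers; tagCounterAccesses and the way the counter
>>> loads/stores get collected are hypothetical. (Caveat: a private TBAA root
>>> only proves disjointness against TBAA-tagged user accesses.)
>>>
>>>   #include "llvm/ADT/ArrayRef.h"
>>>   #include "llvm/IR/Instruction.h"
>>>   #include "llvm/IR/LLVMContext.h"
>>>   #include "llvm/IR/MDBuilder.h"
>>>   using namespace llvm;
>>>
>>>   // Tag every counter load/store with a TBAA tag hanging off a private
>>>   // root, so alias analysis can prove the counters never alias the
>>>   // program's own (TBAA-tagged) loads and stores.
>>>   static void tagCounterAccesses(LLVMContext &Ctx,
>>>                                  ArrayRef<Instruction *> CounterAccesses) {
>>>     MDBuilder MDB(Ctx);
>>>     MDNode *Root = MDB.createTBAARoot("__llvm_profile_counters");
>>>     MDNode *Ty = MDB.createTBAAScalarTypeNode("profile counter", Root);
>>>     MDNode *Tag = MDB.createTBAAStructTagNode(Ty, Ty, /*Offset=*/0);
>>>     for (Instruction *I : CounterAccesses)
>>>       I->setMetadata(LLVMContext::MD_tbaa, Tag);
>>>   }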
>>>
>>> I actually don't know what is preventing the usual optimization pipeline
>>> from getting 2 right.
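>>>
>>> Concretely, "getting 2 right" would mean the optimizer rewriting the loop
>>> roughly like this (hand-written sketch; the __profc_foo array and its
>>> layout are hypothetical stand-ins for the instrumentation's counters, and
>>> the rewrite is only sound if nothing in the loop can touch them):
>>>
>>>   #include <cstdint>
>>>
>>>   uint64_t __profc_foo[3]; // stand-in for the real counter array
>>>   bool foo(); void bar(); void baz();
>>>
>>>   void registerizedLoop(int N) {
>>>     // One load and one store per counter instead of O(N) memory ops;
>>>     // also avoids cross-core cache-line ping-pong on the counters.
>>>     uint64_t CThen = __profc_foo[1], CElse = __profc_foo[2];
>>>     for (int i = 0; i < N; i++) {
>>>       if (foo()) { ++CThen; bar(); }
>>>       else       { ++CElse; baz(); }
>>>     }
>>>     __profc_foo[1] = CThen;
>>>     __profc_foo[2] = CElse;
>>>   }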
>>>
>>
>> Call mod-ref. We need to teach the optimizer that a counter owned by the
>> current function (if the function can be proved non-recursive in some way)
>> cannot be modified by any other call.
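>>
>> Schematically (hypothetical names), the barrier looks like this:
>>
>>   #include <cstdint>
>>
>>   uint64_t __profc_f2[2]; // counters owned by this function
>>   void bar();             // opaque call: conservatively may clobber memory
>>
>>   void f2(int n) {
>>     for (int i = 0; i < n; i++) {
>>       __profc_f2[1]++; // reloaded and re-stored every iteration, because...
>>       bar();           // ...mod-ref says bar() might read or write __profc_f2
>>     }
>>   }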
>>
>
> I don't think that's a sufficient explanation. Consider the following
> example:
>
> Sean:~/tmp % cat testprofile.cpp
> int foo(int n) {
>   unsigned Ret = 42;
>   for (int i = 0; i < n; i++) {
>     if (i % 100) {
>       Ret += 789;
>     } else {
>       Ret *= (283 + Ret);
>     }
>   }
>   return Ret;
> }
>
> Sean:~/tmp % ~/pg/release/bin/clang++ -o - -fprofile-instr-generate
> testprofile.cpp -S -emit-llvm -O2 >foo.ll
> Sean:~/tmp % ~/pg/release/bin/opt -view-cfg foo.ll
>
>
>
> [image: CFG of foo() as rendered by -view-cfg; see PNG attachment:
> http://lists.llvm.org/pipermail/llvm-dev/attachments/20160310/b2dc0526/attachment-0001.png]
>
>
>
> -- Sean Silva
>
>
>>
>> David
>>
>>
>>
>>
>>>
>>>
>>>>
>>>> I'm sending this email out to describe my approach, share some early
>>>> results,
>>>> and gather feedback.
>>>>
>>>>
>>>> Problem Overview
>>>> ================
>>>>
>>>> A profile counter is redundant if it's incremented in exactly the same
>>>> basic
>>>> blocks as some other profile counter. Consider the following module:
>>>>
>>>> local void f1() {
>>>>   instrprof_increment(profc_f1);
>>>> }
>>>>
>>>> void f2() {
>>>>   instrprof_increment(profc_f2);
>>>>   f1();
>>>> }
>>>>
>>>> Once the inliner runs and deletes f1, we're left with:
>>>>
>>>> void f2() {
>>>>   instrprof_increment(profc_f2);
>>>>   instrprof_increment(profc_f1);
>>>> }
>>>>
>>>> Now we can say profc_f1 is redundant (or rather, an alias for profc_f2).
>>>>
>>>> I've noticed that frontend-based instrumentation can generate many
>>>> redundant
>>>> profile counters. This degrades performance and increases code size.
>>>> We can
>>>> address the problem by getting rid of redundant counter updates. The
>>>> trick is
>>>> to make sure we get back the same profiles.
>>>>
>>>>
>>>> Proposed Solution
>>>> =================
>>>>
>>>> I propose a pruning pass which takes the following steps:
>>>>
>>>> 1. Delete functions with local linkage and only one use, if that use
>>>> is in
>>>> a profile data record.
>>>>
>>>> These functions are left around by the inliner (it doesn't know
>>>> that
>>>> they're safe to delete). Deleting them reduces code size and
>>>> simplifies
>>>> subsequent analysis of profile counters.
>>>>
>>>> 2. Determine which profile counters are essential (a sketch of this
>>>> analysis appears below).
>>>>
>>>> 3. Erase all updates to redundant profile counters.
>>>>
>>>> 4. Emit the aliases into a new section in the binary.
>>>>
>>>> Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some changes
>>>> in compiler-rt are required to walk the alias section and fill in the
>>>> correct execution counts at program exit time (see the runtime sketch
>>>> below).
>>>>
>>>> This pass needs to be run after the inliner in order to be effective.
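>>>>
>>>> Here is a minimal sketch of the redundancy analysis in step 2, modeling
>>>> counters as integer IDs and their update sites as sets of basic-block
>>>> IDs. The names and representation are hypothetical, not the actual pass:
>>>>
>>>>   #include <map>
>>>>   #include <set>
>>>>   #include <vector>
>>>>
>>>>   // Two counters updated in exactly the same basic blocks are redundant
>>>>   // duplicates; keep one "essential" leader per distinct update set.
>>>>   std::map<int, int>
>>>>   findAliases(const std::vector<std::set<int>> &UpdateSites) {
>>>>     std::map<std::set<int>, int> Leader; // update set -> essential ctr
>>>>     std::map<int, int> Alias;            // redundant ctr -> its leader
>>>>     for (int C = 0; C != (int)UpdateSites.size(); ++C) {
>>>>       auto [It, IsNew] = Leader.try_emplace(UpdateSites[C], C);
>>>>       if (!IsNew)
>>>>         Alias[C] = It->second; // same blocks as an earlier counter
>>>>     }
>>>>     return Alias;
>>>>   }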
>>>>
>>>> The complexity of this pass is O(N*M), where N is the number of profile
>>>> counters, and M is the average number of updates per counter. In
>>>> practice it is
>>>> a bit faster, since we can skip the analysis of counters which are
>>>> discovered to
>>>> be redundant early on in the process.
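>>>>
>>>> For the compiler-rt side of step 4, the exit-time fix-up could look
>>>> something like this sketch. The section name and the linker-provided
>>>> bounds symbols are hypothetical and platform-specific:
>>>>
>>>>   #include <cstdint>
>>>>
>>>>   struct CounterAlias { uint64_t *Dst; uint64_t *Src; };
>>>>
>>>>   // Hypothetical linker-provided bounds of the alias section.
>>>>   extern CounterAlias __start___llvm_prf_alias[];
>>>>   extern CounterAlias __stop___llvm_prf_alias[];
>>>>
>>>>   // At program exit, copy each essential counter's value into the pruned
>>>>   // counter it stands in for, so the emitted profile is unchanged.
>>>>   static void fixupAliasedCounters() {
>>>>     for (CounterAlias *A = __start___llvm_prf_alias;
>>>>          A != __stop___llvm_prf_alias; ++A)
>>>>       *A->Dst = *A->Src;
>>>>   }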
>>>>
>>>
>>> I think a conceptually simpler design is something like:
>>>
>>>   for each CFG edge:
>>>     record which FE counters have ended up associated with it
>>>   remove FE counters
>>>   run IR instrumentation pass
>>>   emit a side table mapping IR instr counters to FE counters (more
>>>   generally: how to reconstruct FE counters from the IR counters)
>>>
>>> The problem is simply reduced to the IR instrumentation pass.
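>>>
>>> The side table could be as simple as, per FE counter, the list of IR
>>> counters whose sum reconstructs it. A sketch with a hypothetical
>>> representation:
>>>
>>>   #include <cstdint>
>>>   #include <vector>
>>>
>>>   // One entry per FE counter: indices of the IR counters whose sum
>>>   // reconstructs the FE count.
>>>   struct FECounterEntry { std::vector<unsigned> IRCounterIdx; };
>>>
>>>   uint64_t reconstructFECount(const FECounterEntry &E,
>>>                               const std::vector<uint64_t> &IRCounts) {
>>>     uint64_t Sum = 0;
>>>     for (unsigned Idx : E.IRCounterIdx)
>>>       Sum += IRCounts[Idx];
>>>     return Sum;
>>>   }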
>>>
>>>
>>>>
>>>>
>>>> Early Results
>>>> =============
>>>>
>>>> The pruning pass results in a 25% speed improvement in the example
>>>> program above (where f2 is called in a loop 10^8 times).
>>>>
>>>> Here is a slightly less contrived example:
>>>>
>>>> #include <vector>
>>>> #include <algorithm>
>>>> #include <cstdlib>
>>>>
>>>> static void escape(void *p) {
>>>>   asm volatile("" : : "g"(p) : "memory");
>>>> }
>>>>
>>>> int main(int argc, char **argv) {
>>>>   std::vector<int> V(atoi(argv[1]));
>>>>   escape(reinterpret_cast<void *>(V.data()));
>>>>   std::sort(V.begin(), V.end());
>>>>   return V[0];
>>>> }
>>>>
>>>> I get the following results on my desktop (10^8 elements, 5 runs each):
>>>>
>>>>   O3:                            0.262s
>>>>   O3 + PGOInstr:                 0.663s
>>>>   O3 + PGOInstr + Pruning:       0.606s (8.6% performance win, 672 aliases)
>>>>   O3 + CoverageInstr:            0.690s
>>>>   O3 + CoverageInstr + Pruning:  0.610s (11.6% performance win, 688 aliases)
>>>>
>>>>
>>>> Next Steps?
>>>> ===========
>>>>
>>>> Is the performance of instrumented code something we think we need to
>>>> fix?
>>>>
>>>
>>> With frontend instrumentation it definitely is.
>>>
>>>> What's an acceptable compile-time overhead for running this pruning pass?
>>>
>>>
>>> Instrumented builds are "special" anyway, so a fair slowdown is probably
>>> acceptable; i.e., this doesn't affect regular developer compilation times.
>>> As a baseline, maybe compare the compile time of `-fprofile-instr-generate
>>> -O2` vs. `-O2`. If `-fprofile-instr-generate -O2` is 10% slower than the
>>> control, that indicates that a 5% extra slowdown is probably reasonable
>>> for a substantial reduction in profiling overhead (which can result in a
>>> qualitative improvement to the actual usability of the instrumented
>>> program).
>>>
>>> -- Sean Silva
>>>
>>>
>>>
>>>> Is
>>>> the general approach a non-starter for anybody?
>>>>
>>>> I'd like to get some feedback and gauge interest before pursuing this
>>>> further.
>>>> Possible next steps include benchmarking instrumented versions of clang
>>>> and swift on the relevant test suites, running performance tests from
>>>> LNT, running compile-time tests, and measuring any code size differences.
>>>>
>>>>
>>>> thanks
>>>> vedant
>>>>
>>>
>>>
>>>
>>
>