<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Mar 10, 2016 at 9:42 PM, Xinliang David Li <span dir="ltr"><<a href="mailto:xinliangli@gmail.com" target="_blank">xinliangli@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><span class="">On Thu, Mar 10, 2016 at 8:33 PM, Sean Silva via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><span>On Thu, Mar 10, 2016 at 7:21 PM, Vedant Kumar via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Hi,<br>
<br>
I'd like to add a new pass to LLVM which removes redundant profile counter<br>
updates. The goal is to speed up code coverage testing and profile generation<br>
for PGO.<br></blockquote><div><br></div></span><div>We may want to have a focused discussion about this goal, rather than a particular suggestion. There are a lot of things we can do. Off the top of my head, some are:</div><div><br></div><div>1. add some sort of alias annotation (such as an independent TBAA root node) to all the counter increment memory instructions to tell the optimizer they don't interfere with the usual loads and stores.</div><div><br></div><div>2. communicate to the optimizer that counters can be registerized. In a loop like:</div><div>for (int i = 0; i < N; i++) {</div><div> if (foo())</div><div> bar();</div><div> else</div><div> baz();</div><div>}</div><div>we perform O(N) counter increments (i.e. load, increment, store) last I checked. However, if the counters are in registers, then we only perform O(1) memory operations. This can dramatically reduce the pressure on the CPU's load/store units and also relieve cross-core cache line ping-pong when two cores are executing the same code. Note that the latter benefit is attained even if we ultimately end up spilling the counters due to increased register pressure.</div><div><br></div><div>I actually don't know what is preventing the usual optimization pipeline from getting 2 right.</div></div></div></div></blockquote><div><br></div></span><div>Call Mod-ref. We need to teach the optimizer that the counter owned by the current function (if the function is proved to be non-recursive in some way) can not be modified by any other calls.</div></div></div></div></blockquote><div><br></div><div>I don't think that's a sufficient explanation. Consider the following example:</div><div><br></div><div><div>Sean:~/tmp % cat testprofile.cpp </div><div>int foo(int n) {</div><div> unsigned Ret = 42;</div><div> for (int i = 0; i < n; i++) {</div><div> if (i % 100) {</div><div> Ret += 789;</div><div> } else {</div><div> Ret *= (283 + Ret);</div><div> }</div><div> }</div><div> return Ret;</div><div>}</div><div><br></div><div>Sean:~/tmp % ~/pg/release/bin/clang++ -o - -fprofile-instr-generate testprofile.cpp -S -emit-llvm -O2 >foo.ll</div><div>Sean:~/tmp % ~/pg/release/bin/opt -view-cfg foo.ll </div></div><div><br></div><div><br></div><div><br></div><div><img src="cid:ii_153644e9a0e4e71c" alt="Inline image 2" style="margin-right: 0px;"><br></div><div><br></div><div><br></div><div><br></div><div>-- Sean Silva</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class=""><font color="#888888"><div><br></div><div>David</div></font></span><div><div class="h5"><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<br>
I'm sending this email out to describe my approach, share some early results,<br>
and gather feedback.<br>
<br>
<br>
Problem Overview<br>
================<br>
<br>
A profile counter is redundant if it's incremented in exactly the same basic<br>
blocks as some other profile counter. Consider the following module:<br>
<br>
local void f1() {<br>
instrprof_increment(profc_f1);<br>
}<br>
<br>
void f2() {<br>
instrprof_increment(profc_f2);<br>
f1();<br>
}<br>
<br>
Once the inliner runs and deletes f1, we're left with:<br>
<br>
void f2() {<br>
instrprof_increment(profc_f2);<br>
instrprof_increment(profc_f1);<br>
}<br>
<br>
Now we can say profc_f1 is redundant (or, an alias for profc_f2).<br>
<br>
I've noticed that frontend-based instrumentation can generate many redundant<br>
profile counters. This degrades performance and increases code size. We can<br>
address the problem by getting rid of redundant counter updates. The trick is<br>
to make sure we get back the same profiles.<br>
<br>
<br>
Proposed Solution<br>
=================<br>
<br>
I propose a pruning pass which takes the following steps:<br>
<br>
1. Delete functions with local linkage and only one use, if that use is in<br>
a profile data record.<br>
<br>
These functions are left around by the inliner (it doesn't know that<br>
they're safe to delete). Deleting them reduces code size and simplifies<br>
subsequent analysis of profile counters.<br>
<br>
2. Determine which profile counters are essential.<br>
<br>
3. Erase all updates to redundant profile counters.<br>
<br>
4. Emit the aliases into a new section in the binary.<br>
<br>
Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some changes<br>
in compiler-rt are required to walk the alias section and fill in the<br>
correct execution counts at program exit time.<br>
<br>
This pass needs to be run after the inliner in order to be effective.<br>
<br>
The complexity of this pass is O(N*M), where N is the number of profile<br>
counters, and M is the average number of updates per counter. In practice it is<br>
a bit faster, since we can skip the analysis of counters which are discovered to<br>
be redundant early on in the process.<br></blockquote><div><br></div></div></div><div>I think a conceptually simpler design is something like:</div><div><br></div><div>for each CFG edge:</div><div> record which FE counters have ended up associated with it</div><div>remove FE counters</div><div>run IR instrumentation pass</div><div>emit a side table mapping IR instr counters to FE counters (more generally: how to reconstruct FE counters from the IR counters)</div><div><br></div><div>The problem is simply reduced to the IR instrumentation pass.</div><span><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<br>
<br>
Early Results<br>
=============<br>
<br>
The pruning pass results in 25% speed improvement in the example program above<br>
(where f2 is called in a loop 10^8 times).<br>
<br>
Here is a slightly less contrived example:<br>
<br>
#include <vector><br>
#include <algorithm><br>
#include <cstdlib><br>
<br>
static void escape(void *p) {<br>
asm volatile("" : : "g"(p) : "memory");<br>
}<br>
<br>
int main(int argc, char **argv) {<br>
std::vector<int> V(atoi(argv[1]));<br>
escape(reinterpret_cast<void *>(V.data()));<br>
std::sort(V.begin(), V.end());</blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
return V[0];<br>
}<br>
<br>
I get the following results on my desktop (10^8 elements, 5 runs each):<br>
<br>
O3: 0.262s<br>
O3 + PGOInstr: 0.663s<br>
O3 + PGOInstr + Pruning: 0.606s (8.6% performance win, 672 aliases)<br>
O3 + CoverageInstr: 0.690s<br>
O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, 688 aliases)<br>
<br>
<br>
Next Steps?<br>
===========<br>
<br>
Is the performance of instrumented code something we think we need to fix?<br></blockquote><div><br></div></span><div>With frontend instrumentation it definitely is.</div><span><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
What's an acceptable compile-time overhead for running this pruning pass?</blockquote><div><br></div></span><div>Instrumented builds are "special" anyway so a fair slowdown is probably acceptable. I.e. this doesn't affect regular developer compilation times. As a baseline, maybe compare compile time of `-fprofile-instr-generate -O2` vs. `-O2`. If `-fprofile-instr-generate -O2` is 10% slower than the control, then that indicates that a 5% extra slowdown is probably reasonable for a substantial reduction in profiling overhead (which can result in a qualitative improvement to the actual usability of the instrumented program).</div><div><br></div><div>-- Sean Silva</div><span><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"> Is<br>
the general approach a non-starter for anybody?<br>
<br>
I'd like to get some feedback and gauge interest before pursuing this further.<br>
Possible next steps include benchmarking instrumented versions of clang and<br>
swift on the relevant test suites, running performance tests from lnt, running<br>
compile-time tests, and measuring any code size differences.<br>
<br>
<br>
thanks<br>
vedant<br>
<br>
_______________________________________________<br>
LLVM Developers mailing list<br>
<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>
<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>
</blockquote></span></div><br></div></div>
<br>_______________________________________________<br>
LLVM Developers mailing list<br>
<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>
<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>
<br></blockquote></div></div></div><br></div></div>
</blockquote></div><br></div></div>