<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 11, 2016 at 11:17 AM, Justin Bogner <span dir="ltr"><<a href="mailto:mail@justinbogner.com" target="_blank">mail@justinbogner.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">Xinliang David Li via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> writes:<br>
> Vedant, are you concerned about instrumented-run performance for PGO or<br>
> for coverage testing? For coverage testing, is coverage information enough,<br>
> or is count information also needed? Depending on the answers to these<br>
> questions, the solution may be very different.<br>
><br>
> If the answer is PGO, then a much better long-term solution is to migrate<br>
> to IR-based instrumentation. Not only does IR-based instrumentation place<br>
> counter updates 'optimally', but its early optimizations, including<br>
> pre-inlining of tiny functions, are very effective in reducing<br>
> instrumentation overhead, with the added benefit of better profile quality<br>
> due to context sensitivity. For more details and performance numbers, see<br>
> Rong's RFC about late instrumentation.<br>
><br>
> If the answer is PGO, but for various reasons the FE-based instrumentation<br>
> has to be used, then I think there is another, more effective way,<br>
> previously suggested by Sean: skip single-BB functions completely during<br>
> instrumentation. There are several reasons why profile data for single-BB<br>
> functions is not useful:<br>
> 1) they are usually small and beneficial to inline regardless of profile<br>
> data<br>
> 2) the BBs of the inline instances can get profile data from the caller<br>
> context<br>
> 3) the profile data for the out-of-line single-BB function is also useless<br>
><br>
> Rong has some data showing the effectiveness of this method -- not as good<br>
> as the pre-optimization approach, but IIRC still very good.<br>
><br>
> If the answer is coverage, and the actual count value does not really<br>
> matter, then a more effective way of reducing overhead is to teach the<br>
> instrumentation lowering pass to lower the counter update into a simple<br>
> store:<br>
><br>
> counter_1 = 1;<br>
<br>
</div></div>This won't work. The FE based instrumentation has implicit counters for<br>
various AST constructs that can only be worked out by doing math with<br>
the explicit counters. If they don't have actual counts, this doesn't<br>
work.<br></blockquote><div><br></div><div>This depends on how the FE picks regions to instrument. The key part is that the FE must not generate regions with 'minus' counter expressions; only 'add' is allowed. </div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<span class=""><br>
> Such stores from the inline instances of the same function can easily be<br>
> eliminated. This also helps multithreaded programs a lot when there is<br>
> heavy counter contention.<br>
><br>
> Another benefit is that the size of the counter can effectively be reduced<br>
> from 8 bytes to 1 byte.<br>
><br>
> The tricky situation is when you want coverage data but the count per line<br>
> is also important -- though I wonder whether having a special pass just to<br>
> handle this scenario is worth the effort.<br>
<br>
</span>Isn't "coverage with count-per-line" the whole point of a coverage<br>
feature?<br></blockquote><div><br></div><div>The point of coverage testing is to see if there are any lines of the code not 'executed' at runtime -- in that sense, count does not really matter. Whether it executes 1000 times or only 1 time is irrelevant. However it makes a difference if a line has '0' count. </div><div><br></div><div>Note that asan based coverage tool does not do counting either.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
It's also convenient to gather data for PGO and coverage in the same<br>
run, rather than having to build an instrumented binary twice and run the<br>
same tests/training runs twice. Performance of the FE-based instrumentation<br>
is pretty important for productivity reasons.<br>
<span class=""><br>
> Also, a side note: I think that longer term we should unify the three<br>
> instrumentation mechanisms into one: FE-based, IR-based, and the old gcda<br>
> instrumentation. IR-based is the most efficient implementation -- when<br>
> combined with the gcda profiling runtime, it can replace the current gcda<br>
> profiling, which is not efficient. It is also possible to use IR-based<br>
> instrumentation for coverage mapping (with the coverage map format mostly<br>
> unchanged), but the main challenge is passing source info etc.<br>
<br>
</span>I don't disagree, but I think that getting the same quality of coverage<br>
data from the IR based profiles as we do from the FE ones is a fairly<br>
large undertaking.<br></blockquote><div><br></div><div>Right -- I am thinking about this and there might be good ways to do it (e.g., let the FE prepare some of the map data with source info filled in, and let the backend fill in the counter expressions).</div><div><br></div><div>thanks,</div><div><br></div><div>David</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="HOEnZb"><div class="h5"><br>
> thanks,<br>
><br>
> David<br>
><br>
><br>
> On Thu, Mar 10, 2016 at 7:21 PM, Vedant Kumar via llvm-dev <<br>
> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br>
><br>
>> Hi,<br>
>><br>
>> I'd like to add a new pass to LLVM which removes redundant profile counter<br>
>> updates. The goal is to speed up code coverage testing and profile<br>
>> generation<br>
>> for PGO.<br>
>><br>
>> I'm sending this email out to describe my approach, share some early<br>
>> results,<br>
>> and gather feedback.<br>
>><br>
><br>
> From your example, it seems only one scenario is handled -- a local<br>
> function with a single call site? This seems quite narrow in scope. Before<br>
> pursuing it further, it would be better to measure the impact on larger<br>
> benchmarks.<br>
><br>
><br>
><br>
>> thanks<br>
>> vedant<br>
>><br>
>> _______________________________________________<br>
>> LLVM Developers mailing list<br>
>> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>
>> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>
>><br>
> _______________________________________________<br>
> LLVM Developers mailing list<br>
> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>
> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>
</div></div></blockquote></div><br></div></div>