<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 11, 2016 at 12:47 PM, Vedant Kumar <span dir="ltr"><<a href="mailto:vsk@apple.com" target="_blank">vsk@apple.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">There have been a lot of responses. I'll try to summarize the thread and respond<br>

to some of the questions/feedback.<br>

<br>

<br>

Summary<br>

=======<br>

<br>

1. We should teach GlobalDCE to strip out instrumented functions which the<br>

   inliner cannot delete.<br>

<br>

2. Sean suggests adding metadata to loads/stores of counters to convey that<br>

   they do not alias normal program data.<br>

<br>

   I'm not familiar with AA, but off-hand this seems like a good idea.<br>

<br>

3. Sean also suggests improving the optimizer to registerize counters when<br>

   possible.<br>

<br>

   Ditto, seems like a good idea.<br>

<br>

4. Sean has an alternate proposal for counter pruning which works by remapping<br>

   FE-generated counters to counters generated by the IR instrumentation pass.<br>

   David likes this proposal.<br>

<br></blockquote><div><br></div><div>I did not express preference about this translation scheme, but proposed using IR based instrumentation longer term :)</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

   I have some questions about this but am willing to try it. IMO, the pruning<br>

   pass I've suggested could complement this nicely -- I don't view them as<br>

   mutually exclusive (though I could be wrong).<br>

<br>

5. There seems to be consensus over the need to improve performance of<br>

   FE-instrumented programs.<br>

<br>

6. David is concerned that the proposed pruning pass is too narrow in scope,<br>

   and would like to see results on a larger benchmark. I'll try to get some<br>

   numbers.<br>

<br>

7. David mentioned some optimizations that are possible if we make coverage<br>

   information "yes or no" (per line). Justin pointed out that this doesn't<br>

   work for a few of Apple's use cases.<br></blockquote><div><br></div><div>What Justine said IIUC is that with the current way implicit counts are generated, it is not doable. This does not seem to be related to Apple's use cases.</div><div><br></div><div>David</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

<br>

Clarifications<br>

==============<br>

<br>

1. In my original email, two of my numbers were incorrect. I listed the<br>

   sizes of the __llvm_prf_alias sections, not the number of actual aliases.<br>

   The corrected lines are:<br>

<br>

O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, _42_ aliases)<br>

O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, _44_ aliases)<br>

<br>

2.<br>

<br>

>>  Determine which profile counters are essential.<br>

><br>

> What is an "essential counter"?<br>

<br>

The pass I proposed works like this:<br>

<br>

  for (PC : ProfileCounters) {<br>

    if (Update-Sites-Of(PC) == Update-Sites-Of(Other-PC)) {<br>

      mark PC as "essential"<br>

      mark Other-PC as "redundant" (i.e, alias of PC)<br>

    }<br>

  }<br>

<br>

Where Update-Sites-Of(PC) is the set of basic blocks in which PC is updated.<br>

If there are multiple counters updated in each update site for PC, they are<br>

all marked as aliases in one pass.<br>

<br>

<br>

3.<br>

<br>

>> Erase all updates to redundant profile counters.<br>

><br>

> When you are saying "erase", you need to actually replace the multiple counter increment with *a new* counter, right? I.e. when replacing:<br>

><br>

>     instrprof_increment(profc_f2);<br>

>     instrprof_increment(profc_f1);<br>

><br>

> you need to emit:<br>

><br>

>     instrprof_increment(profc_f2f1_fused);<br>

<br>

<br>

No. The pass only erases updates to counters which it can prove are<br>

truly redundant. There is no need to create a new (or fused) counter<br>

because profc_f1 == profc_f2 in all possible profiles.<br>

<br>

4.<br>

<br>

>> The complexity of this pass is O(N*M), where N is the number of profile<br>

>> counters, and M is the average number of updates per counter.<br>

><br>

> What you're describing here is not clear to me? But I may have been totally misunderstanding the whole concept :)<br>

<br>

Hopefully the pseudo-code helps? Alternatively I can put a WIP patch up<br>

on Phab.<br>

<br>

> It should be just O(N) where N is the number of instructions.<br>

<br>

The complexity of the pass is not upper-bounded by the number of<br>

instructions in the program because certain loads and stores can be visited<br>

more than once.<br>

<br>

<br>

FE to IR Counter Remapping<br>

==========================<br>

<br>

I have a question about this plan:<br>

<br>

> for each CFG edge:<br>

>     record which FE counters have ended up associated with it<br>

> remove FE counters<br>

> run IR instrumentation pass<br>

> emit a side table mapping IR instr counters to FE counters<br>

<br>

Currently, -instrprof happens early in the pipeline. IIUC this is done to<br>

allow the optimizer to work with load+add+stores, instead of profile update<br>

intrinsics.<br>

<br>

Say we introduce a counter remapping pass like the one Sean suggested. It<br>

should be run before -instrprof so that we don't waste time lowering a bunch<br>

of instrprof_increment intrinsics which we'll have to throw away later.<br>

<br>

But that means that the CFGs that the counter remapping pass operates on won't<br>

reflect changes made by the inliner (or any other optimizations which alter the<br>

CFG), right?<br>

<br>

ISTM the pruning pass I've proposed is useful whether we're doing FE-based<br>

instrumentation _or_ late instrumentation. Since it operates on loads+stores<br>

directly, it can clean up redundant counter increments at any point in the<br>

pipeline (after -instrprof).<br>

<br>

<br>

vedant<br>

<br>

<br>

> On Mar 10, 2016, at 7:48 PM, Mehdi Amini <<a href="mailto:mehdi.amini@apple.com">mehdi.amini@apple.com</a>> wrote:<br>

><br>

>><br>

>> On Mar 10, 2016, at 7:21 PM, Vedant Kumar via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br>

>><br>

>> Hi,<br>

>><br>

>> I'd like to add a new pass to LLVM which removes redundant profile counter<br>

>> updates. The goal is to speed up code coverage testing and profile generation<br>

>> for PGO.<br>

>><br>

>> I'm sending this email out to describe my approach, share some early results,<br>

>> and gather feedback.<br>

>><br>

>><br>

>> Problem Overview<br>

>> ================<br>

>><br>

>> A profile counter is redundant if it's incremented in exactly the same basic<br>

>> blocks as some other profile counter. Consider the following module:<br>

>><br>

>>   local void f1() {<br>

>>     instrprof_increment(profc_f1);<br>

>>   }<br>

>><br>

>>   void f2() {<br>

>>     instrprof_increment(profc_f2);<br>

>>     f1();<br>

>>   }<br>

>><br>

>> Once the inliner runs and deletes f1, we're left with:<br>

>><br>

>>   void f2() {<br>

>>     instrprof_increment(profc_f2);<br>

>>     instrprof_increment(profc_f1);<br>

>>   }<br>

>><br>

>> Now we can say profc_f1 is redundant (or, an alias for profc_f2).<br>

>><br>

>> I've noticed that frontend-based instrumentation can generate many redundant<br>

>> profile counters. This degrades performance and increases code size.  We can<br>

>> address the problem by getting rid of redundant counter updates. The trick is<br>

>> to make sure we get back the same profiles.<br>

>><br>

>><br>

>> Proposed Solution<br>

>> =================<br>

>><br>

>> I propose a pruning pass which takes the following steps:<br>

>><br>

>> 1. Delete functions with local linkage and only one use, if that use is in<br>

>>    a profile data record.<br>

>><br>

>>    These functions are left around by the inliner (it doesn't know that<br>

>>    they're safe to delete). Deleting them reduces code size and simplifies<br>

>>    subsequent analysis of profile counters.<br>

><br>

> I think that this is something global-DCE should be teached about (if it does not know already).<br>

><br>

><br>

>><br>

>> 2. Determine which profile counters are essential.<br>

><br>

> What is an "essential counter"?<br>

><br>

>><br>

>> 3. Erase all updates to redundant profile counters.<br>

><br>

> When you are saying "erase", you need to actually replace the multiple counter increment with *a new* counter, right? I.e. when replacing:<br>

><br>

>     instrprof_increment(profc_f2);<br>

>     instrprof_increment(profc_f1);<br>

><br>

> you need to emit:<br>

><br>

>     instrprof_increment(profc_f2f1_fused);<br>

><br>

><br>

>> 4. Emit the aliases into a new section in the binary.<br>

>><br>

>>    Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some changes<br>

>>    in compiler-rt are required to walk the alias section and fill in the<br>

>>    correct execution counts at program exit time.<br>

>><br>

>> This pass needs to be run after the inliner in order to be effective.<br>

>><br>

>> The complexity of this pass is O(N*M), where N is the number of profile<br>

>> counters, and M is the average number of updates per counter.<br>

><br>

> What you're describing here is not clear to me? But I may have been totally misunderstanding the whole concept :)<br>

><br>

><br>

> --<br>

> Mehdi<br>

><br>

><br>

>> In practice it is<br>

>> a bit faster, since we can skip the analysis of counters which are discovered to<br>

>> be redundant early on in the process.<br>

>><br>

>><br>

>> Early Results<br>

>> =============<br>

>><br>

>> The pruning pass results in 25% speed improvement in the example program above<br>

>> (where f2 is called in a loop 10^8 times).<br>

>><br>

>> Here is a slightly less contrived example:<br>

>><br>

>> #include <vector><br>

>> #include <algorithm><br>

>> #include <cstdlib><br>

>><br>

>> static void escape(void *p) {<br>

>>   asm volatile("" : : "g"(p) : "memory");<br>

>> }<br>

>><br>

>> int main(int argc, char **argv) {<br>

>>   std::vector<int> V(atoi(argv[1]));<br>

>>   escape(reinterpret_cast<void *>(V.data()));<br>

>>   std::sort(V.begin(), V.end());<br>

>>   return V[0];<br>

>> }<br>

>><br>

>> I get the following results on my desktop (10^8 elements, 5 runs each):<br>

>><br>

>> O3:                           0.262s<br>

>> O3 + PGOInstr:                0.663s<br>

>> O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, 672 aliases)<br>

>> O3 + CoverageInstr:           0.690s<br>

>> O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, 688 aliases)<br>

>><br>

>><br>

>> Next Steps?<br>

>> ===========<br>

>><br>

>> Is the performance of instrumented code something we think we need to fix?<br>

>> What's an acceptable compile-time overhead for running this pruning pass? Is<br>

>> the general approach a non-starter for anybody?<br>

>><br>

>> I'd like to get some feedback and gauge interest before pursuing this further.<br>

>> Possible next steps include benchmarking instrumented versions of clang and<br>

>> swift on the relevant test suites, running performance tests from lnt, running<br>

>> compile-time tests, and measuring any code size differences.<br>

>><br>

>><br>

>> thanks<br>

>> vedant<br>

>><br>

>> _______________________________________________<br>

>> LLVM Developers mailing list<br>

>> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

>> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>

<br>

</blockquote></div><br></div></div>