<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 11, 2016 at 2:50 PM, Vedant Kumar <span dir="ltr"><<a href="mailto:vsk@apple.com" target="_blank">vsk@apple.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>

> On Mar 11, 2016, at 2:25 PM, Justin Bogner <<a href="mailto:mail@justinbogner.com">mail@justinbogner.com</a>> wrote:<br>

><br>

> Vedant Kumar <<a href="mailto:vsk@apple.com">vsk@apple.com</a>> writes:<br>

>> There have been a lot of responses. I'll try to summarize the thread<br>

>> and respond to some of the questions/feedback.<br>

> ...<br>

>> FE to IR Counter Remapping<br>

>> ==========================<br>

>><br>

>> I have a question about this plan:<br>

>><br>

>>> for each CFG edge:<br>

>>>    record which FE counters have ended up associated with it<br>

>>> remove FE counters<br>

>>> run IR instrumentation pass<br>

>>> emit a side table mapping IR instr counters to FE counters<br>

>><br>

>> Currently, -instrprof happens early in the pipeline. IIUC this is done to<br>

>> allow the optimizer to work with load+add+stores, instead of profile update<br>

>> intrinsics.<br>

><br>

> It would be an interesting experiment to see what it would look like to<br>

> teach optimizations about the instrprof intrinsics and lower them much<br>

> later. I suspect knowing that these aren't just stores to random memory<br>

> would enable us to make good decisions in various places.<br>

<br>

</span>Do you think we could get good enough results by attaching !invariant.load or<br>

AA metadata to lowered profile counter updates?<br>

<span class=""><br>

<br>

> Of course, this might end up spreading too much special-case knowledge<br>

> through various optimizations and not be worth it.<br>

><br>

>> Say we introduce a counter remapping pass like the one Sean suggested. It<br>

>> should be run before -instrprof so that we don't waste time lowering a bunch<br>

>> of instrprof_increment intrinsics which we'll have to throw away later.<br>

>><br>

>> But that means that the CFGs that the counter remapping pass operates<br>

>> on won't reflect changes made by the inliner (or any other<br>

>> optimizations which alter the CFG), right?<br>

>><br>

>> ISTM the pruning pass I've proposed is useful whether we're doing FE-based<br>

>> instrumentation _or_ late instrumentation. Since it operates on loads+stores<br>

>> directly, it can clean up redundant counter increments at any point in the<br>

>> pipeline (after -instrprof).<br>

<br>

</span>I'd like to add an interesting data point to back this up. Revisiting the<br>

std::sort example, here's what I get with -fprofile-instrument=llvm (again<br>

using 10^8 array elements, and averaging over 5 runs):<br>

<br>

O3:                       0.262s<br>

O3 + LLVMInstr:           0.705s<br>

O3 + LLVMInstr + Pruning: 0.644s (47 counter alias mappings created)<br></blockquote><div><br></div><div>There is an LLVM pipeline change for LLVM instrumentation pending. Once that is in, the benefit shown here will probably disappear.</div><div><br></div><div>David</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

So, it *is* possible for us to see real performance improvements by running a<br>

pruning pass after late IR-based instrumentation.<br>

<br>

I still think we need more numbers before moving forward, and will work on<br>

that.<br>

<span class="HOEnZb"><font color="#888888"><br>

vedant</font></span></blockquote></div><br></div></div>