[llvm-dev] RFC: Pass to prune redundant profiling instrumentation

Fri Mar 11 14:50:14 PST 2016

> On Mar 11, 2016, at 2:25 PM, Justin Bogner <mail at justinbogner.com> wrote:
> 
> Vedant Kumar <vsk at apple.com> writes:
>> There have been a lot of responses. I'll try to summarize the thread
>> and respond to some of the questions/feedback.
> ...
>> FE to IR Counter Remapping
>> ==========================
>> 
>> I have a question about this plan:
>> 
>>> for each CFG edge:
>>>    record which FE counters have ended up associated with it
>>> remove FE counters
>>> run IR instrumentation pass
>>> emit a side table mapping IR instr counters to FE counters
>> 
>> Currently, -instrprof happens early in the pipeline. IIUC this is done to
>> allow the optimizer to work with load+add+stores, instead of profile update
>> intrinsics.
> 
> It would be an interesting experiment to see what it would look like to
> teach optimizations about the instrprof intrinsics and lower them much
> later. I suspect knowing that these aren't just stores to random memory
> would enable us to make good decisions in various places.

Do you think we could get good enough results by attaching !invariant.load or
AA metadata to lowered profile counter updates?

> Of course, this might end up spreading too much special case knowledge
> through various optimizations and not be worth it.
> 
>> Say we introduce a counter remapping pass like the one Sean suggested. It
>> should be run before -instrprof so that we don't waste time lowering a bunch
>> of instrprof_increment intrinsics which we'll have to throw away later.
>> 
>> But that means that the CFGs that the counter remapping pass operates
>> on won't reflect changes made by the inliner (or any other
>> optimizations which alter the CFG), right?
>> 
>> ISTM the pruning pass I've proposed is useful whether we're doing FE-based
>> instrumentation _or_ late instrumentation. Since it operates on loads+stores
>> directly, it can clean up redundant counter increments at any point in the
>> pipeline (after -instrprof).

I'd like to add an interesting data point to back this up. Revisiting the
std::sort example, here's what I get with -fprofile-instrument=llvm (again
using 10^8 array elements, and averaging over 5 runs):

O3:                       0.262s
O3 + LLVMInstr:           0.705s
O3 + LLVMInstr + Pruning: 0.644s (47 counter alias mappings created)

So, it *is* possible for us to see real performance improvements by running a
pruning pass after late IR-based instrumentation.
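
To make "counter alias mapping" a bit more concrete, here is a toy model of
the idea. It is only a sketch under simplifying assumptions, not the actual
pass: the real thing works on lowered loads+stores in LLVM IR, while this
models each basic block as a list of counter ids it increments. The names
`prune_counters`, `blocks` and `aliases` are hypothetical. Counters that are
incremented in exactly the same blocks, the same number of times, always hold
the same value, so one is kept and the rest are recorded as aliases of it.

```python
from collections import defaultdict


def prune_counters(blocks):
    """Toy model of redundant-counter pruning.

    blocks: dict mapping a block name to the list of counter ids that the
            block increments (hypothetical representation).
    Returns (pruned_blocks, aliases), where aliases maps each dropped
    counter id to the counter id that was kept in its place.
    """
    # For every counter, record how many times each block increments it.
    signature = defaultdict(lambda: defaultdict(int))
    for name, counters in blocks.items():
        for counter in counters:
            signature[counter][name] += 1

    # Counters with identical signatures are always updated together.
    groups = defaultdict(list)
    for counter, sig in signature.items():
        groups[frozenset(sig.items())].append(counter)

    # Keep the lowest-numbered counter of each group; alias the others to it.
    aliases = {}
    for members in groups.values():
        members.sort()
        for dropped in members[1:]:
            aliases[dropped] = members[0]

    # Remove the increments of every aliased counter.
    pruned = {
        name: [c for c in counters if c not in aliases]
        for name, counters in blocks.items()
    }
    return pruned, aliases


if __name__ == "__main__":
    # E.g. after inlining, several counters end up in the same block.
    example = {"entry": [0, 3], "loop": [1, 4, 5], "exit": [2]}
    pruned, aliases = prune_counters(example)
    print(pruned)
    print(aliases)
```

The alias table plays the role of the side table: whatever reads the profile
can recover the value of a dropped counter from the counter it aliases.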

I still think we need more numbers before moving forward, and will work on
that.

vedant