<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Apr 13, 2017 at 5:18 PM, Mikulin, Dmitry via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I’m taking a first look at InstCombine performance. I picked up the caching patch and ran a few experiments on one of our larger C++ apps. The size of the *.0.2.internalize.bc no-debug IR is ~ 30M. Here are my observations so far.<br>

<br>

Interestingly, caching produced a slight but measurable performance degradation of -O3 compile time.<br>

<br>

InstCombine takes about 35% of total execution time, of which ~20% originates from CGPassManager.<br>

<br>

ComputeKnownBits contributes 7.8%, but calls from InstCombine contribute only 2.6% to the total execution time. Caching only covers InstCombine use of KnownBits. This may explain limited gain or even slight degradation if KnownBits are not re-computed as often as we thought.<br></blockquote><div>....</div><div>Not entirely shocking. </div><div>I continue to believe we could compute it essentially once or twice, in the right iteration order, and have a performance gain here.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Most of the time is spent in instruction visitor routines. ICmpInst, LoadInst, CallInst, GetElementPtrInst and StoreInst are the top contributors.<br>

<br>

ICmpInst          6.1%<br>

LoadInst          5.5%<br>

CallInst          2.1%<br>

GetElementPtrInst 2.1%<br>

StoreInst         1.6%<br>

<br>

Out of 35% InstCombine time, about half is spent in the top 5 visitor routines.<br>

<br>

I wanted to see what transformations InstCombine actually performs. Using the -debug option turned out not to be very scalable. Never mind the large output size of the trace, running "opt -debug -instcombine" on anything other than a small IR is excruciatingly slow. Out of curiosity I profiled it too: 96% of the time is spent decoding and printing instructions. Is this a known problem?</blockquote><div><br></div><div>Yes.</div><div><br></div><div>The problem is that *every* value print call builds an assembly writer, which calls the module TypeFinder so that it can print the types. This walks everything in the module to find the types.</div><div>For large IR with a lot of types, this is *ridiculously* slow.</div><div><br></div><div>I.e., it's basically processing a large part of the module for *every operand* it prints.</div><div><br></div><div>For something like what you are doing, you can just hack it up to build the writer once, or not at all.</div><div><br></div><div>I've never understood why this doesn't annoy people more :)</div><div><br></div><div>As a hack, you can comment out AsmWriter.cpp:2137</div><div><br></div><div>It should fix what you are seeing.</div><div><br></div><div>In practice, most types seem to print fine even without doing this, so ... </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> If so, what are the alternatives for debugging large-scale problems? If not, it’s possibly another item to add to the to-do list.<br>

<br>

Back to InstCombine, from the profile it does not appear there’s an obvious magic bullet that can help drastically improve performance. I will take a closer look at visitor functions and see if there’s anything that can be done.<br>

<br>

Dmitry.<br>

<div class="HOEnZb"><div class="h5"><br>

<br>

> On Mar 22, 2017, at 6:45 PM, Davide Italiano via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br>

><br>

> On Wed, Mar 22, 2017 at 6:29 PM, Mikhail Zolotukhin via llvm-dev<br>

> <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br>

>><br>

>> In my testing, the results are not that impressive, but that's because I'm now focusing on Os. For me, even completely disabling all KnownBits-related patterns in InstCombine places the results very close to the noise level. In my original patch I also had some extra patterns moved under ExpensiveCombines, and that seems to make a difference too (without this part, or without the KnownBits part, I get results below 1%, which are not reported as regressions/improvements).<br>

>><br>

><br>

> Have you profiled a single InstCombine run to see where we actually<br>

> spend our cycles (as Sanjay did for his reduced testcase)?<br>

><br>

>> I realize that InstCombine doesn't usually do any harm, if we don't care about compile time, but that's only the case for O3 (to some extent), not for other optimization levels.<br>

><br>

> Regardless of the optimization level, I think compile-time<br>

> is important. Note, for example, that we run a (kinda) similar<br>

> pipeline at O3 and LTO (full, that is), where the impact of compile<br>

> time is much more evident. Also, while people are not generally bitten<br>

> by O3 compilation time, you may end up with terrible performance for<br>

> large TUs (and I unfortunately learned this the hard way).<br>

><br>

> --<br>

> Davide<br>

> ______________________________<wbr>_________________<br>

> LLVM Developers mailing list<br>

> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/llvm-dev</a><br>

<br>


</div></div></blockquote></div><br></div></div>