[llvm-dev] Saving Compile Time in InstCombine

Fri Apr 14 10:39:27 PDT 2017

> On Apr 13, 2017, at 7:43 PM, Davide Italiano <davide at freebsd.org> wrote:
> 
> On Thu, Apr 13, 2017 at 5:18 PM, Mikulin, Dmitry
> <dmitry.mikulin at sony.com> wrote:
>> I’m taking a first look at InstCombine performance. I picked up the caching patch and ran a few experiments on one of our larger C++ apps. The size of the *.0.2.internalize.bc no-debug IR is ~30M. Here are my observations so far.
>> 
>> Interestingly, caching produced a slight but measurable performance degradation of -O3 compile time.
>> 
>> InstCombine takes about 35% of total execution time, of which ~20% originates from CGPassManager.
>> 
> 
> It's because we run instcombine as we inline (see
> addFunctionSimplificationPasses()) IIRC. We don't quite do this at LTO
> time (FullLTO) because it's too expensive compile-time wise. ThinLTO
> runs it.
> 
>> ComputeKnownBits contributes 7.8%, but calls from InstCombine contribute only 2.6% to the total execution time. Caching only covers InstCombine's use of KnownBits. This may explain the limited gain, or even the slight degradation, if KnownBits are not recomputed as often as we thought.
>> 
>> Most of the time is spent in instruction visitor routines. ICmpInst, LoadInst, CallInst, GetElementPtrInst and StoreInst are the top contributors.
>> 
>> ICmpInst          6.1%
>> LoadInst          5.5%
>> CallInst          2.1%
>> GetElementPtrInst 2.1%
>> StoreInst         1.6%
>> 
>> Out of 35% InstCombine time, about half is spent in the top 5 visitor routines.
>> 
> 
> So, from your preliminary analysis, walking the matchers seems to be
> expensive; at least, that is what you're saying?

Looks like it. Apart from computeKnownBits, most functions at the top of the InstCombine profile are instruction visitors.
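
As a quick sanity check on the figures quoted above, the sketch below sums the five visitor shares and compares them with the 35% InstCombine total, then does the same for the InstCombine portion of ComputeKnownBits. All numbers are copied from the profile in this thread; the variable names are just for illustration.

```python
# Shares of total -O3 execution time, in percent, copied from the profile above.
visitor_share = {
    "ICmpInst": 6.1,
    "LoadInst": 5.5,
    "CallInst": 2.1,
    "GetElementPtrInst": 2.1,
    "StoreInst": 1.6,
}
instcombine_share = 35.0            # InstCombine as a share of total time
known_bits_total = 7.8              # ComputeKnownBits, all callers
known_bits_from_instcombine = 2.6   # ComputeKnownBits called from InstCombine

# Combined share of the top five visitors, and what fraction of
# InstCombine's own time that represents.
top5 = sum(visitor_share.values())
top5_fraction = top5 / instcombine_share

# Fraction of ComputeKnownBits time that the InstCombine-only cache can touch.
kb_fraction = known_bits_from_instcombine / known_bits_total

print(f"top-5 visitors: {top5:.1f}% of total, {top5_fraction:.0%} of InstCombine")
print(f"KnownBits reachable by the cache: {kb_fraction:.0%} of its time")
```

So the top five visitors add up to roughly half of InstCombine, and the cache can only affect about a third of the ComputeKnownBits time.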

> Is this a run with debug info? i.e. are you passing -g to the per-TU
> pipeline? I'm inclined to think this is mostly an additive effect of
> adding matchers here and there that don't really hurt small testcases
> but we pay the debt over time (in particular for LTO). Side note, I
> noticed (and others did as well) that instcombine is way slower with
> `-g` on (one of the reasons could be that we walk much longer use
> lists, due to the dbg uses). Do you have numbers for instcombine run
> on IR with and without debug info?

I do have the numbers for the same app with and without debug info. The results above are for the no-debug version.

Total execution time of -O3 is 34% slower with debug info. The size of the debug IR is 162M vs 39M no-debug. Both profiles look relatively similar, with the exception of the bitcode writer and the verifier taking a larger share in the -g case.

Looking at InstCombine, it’s 23% slower. One notable thing is that CallInst takes a significantly larger share with -g: 13s vs 5s without, which translates to about half of the InstCombine slowdown. Need to understand why. ComputeKnownBits takes about the same time, and other visitors have elevated times, I would guess due to the need to propagate debug info.
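
For a rough sense of scale, here is a back-of-the-envelope sketch of what those figures imply. Only the 5s, 13s, 23%, 162M and 39M values come from the measurements; the InstCombine slowdown and baseline computed below are implied estimates that lean on the "about half" approximation, not measured numbers.

```python
# Measured values quoted in this message.
callinst_nodebug_s = 5.0      # CallInst visitor time without -g
callinst_debug_s = 13.0       # CallInst visitor time with -g
instcombine_slowdown = 0.23   # InstCombine is 23% slower with -g
ir_debug_mb = 162.0           # size of the debug IR
ir_nodebug_mb = 39.0          # size of the no-debug IR

# Extra time spent in the CallInst visitor with -g.
callinst_delta_s = callinst_debug_s - callinst_nodebug_s

# ASSUMPTION: "about half" is taken as exactly 0.5, so the values below
# are approximate and implied, not measured.
implied_slowdown_s = callinst_delta_s / 0.5
implied_baseline_s = implied_slowdown_s / instcombine_slowdown

# How much larger the debug IR is than the no-debug IR.
ir_ratio = ir_debug_mb / ir_nodebug_mb

print(f"CallInst delta: {callinst_delta_s:.0f}s")
print(f"implied InstCombine slowdown: ~{implied_slowdown_s:.0f}s")
print(f"implied no-debug InstCombine time: ~{implied_baseline_s:.0f}s")
print(f"debug IR is {ir_ratio:.1f}x the no-debug IR")
```

The point is that the IR grows about fourfold with -g while InstCombine time grows by less than a quarter, and one visitor accounts for a disproportionate part of that growth.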

> 
>> I wanted to see what transformations InstCombine actually performs. Using the -debug option turned out not to be very scalable. Never mind the large output size of the trace, running "opt -debug -instcombine" on anything other than a small IR is excruciatingly slow. Out of curiosity I profiled it too: 96% of the time is spent decoding and printing instructions. Is this a known problem? If so, what are the alternatives for debugging large-scale problems? If not, it’s possibly another item to add to the to-do list.
>> 
> 
> You may consider adding statistics (those should be much more
> scalable) although more coarse.
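
To illustrate why statistics scale better than a full trace, here is a minimal, language-neutral sketch: instead of printing every instruction, it only bumps a counter per visited opcode. In LLVM itself this is the role of the STATISTIC macro together with opt's -stats option; the event list and the tally() helper below are hypothetical and carry no data from the profile.

```python
from collections import Counter

# Hypothetical stream of (opcode, changed) visitor events, for illustration only.
events = [
    ("icmp", True),
    ("load", False),
    ("icmp", True),
    ("call", True),
    ("load", True),
]

def tally(events):
    """Count visits and successful combines per opcode without printing anything."""
    visited = Counter()
    combined = Counter()
    for opcode, changed in events:
        visited[opcode] += 1
        if changed:
            combined[opcode] += 1
    return visited, combined

visited, combined = tally(events)

# One summary line per opcode at the end, instead of one line per instruction.
for opcode in sorted(visited):
    print(f"{opcode:5s} visited={visited[opcode]} combined={combined[opcode]}")
```

The cost per event is a single increment, so the overhead stays flat as the IR grows, at the price of losing the per-instruction detail that -debug gives.
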
> 
> Thanks!
> 
> -- 
> Davide
> 
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare
>