[llvm-dev] Saving Compile Time in InstCombine

Davide Italiano via llvm-dev llvm-dev at lists.llvm.org
Thu Apr 13 19:43:46 PDT 2017

On Thu, Apr 13, 2017 at 5:18 PM, Mikulin, Dmitry
<dmitry.mikulin at sony.com> wrote:
> I’m taking a first look at InstCombine performance. I picked up the caching patch and ran a few experiments on one of our larger C++ apps. The size of the *.0.2.internalize.bc no-debug IR is ~30 MB. Here are my observations so far.
> Interestingly, caching produced a slight but measurable degradation in -O3 compile time.
> InstCombine takes about 35% of total execution time, of which ~20% originates from CGPassManager.

That's because, IIRC, we run instcombine as we inline (see
addFunctionSimplificationPasses()). We don't quite do this at full LTO
time because it's too expensive compile-time-wise; ThinLTO does run
it.

> ComputeKnownBits contributes 7.8%, but calls from InstCombine contribute only 2.6% to the total execution time. Caching only covers InstCombine's use of KnownBits. This may explain the limited gain, or even the slight degradation, if KnownBits are not re-computed as often as we thought.
> Most of the time is spent in instruction visitor routines. CmpInst, LoadInst, CallInst, GetElementPtrInst and StoreInst are the top contributors.
> ICmpInst          6.1%
> LoadInst          5.5%
> CallInst          2.1%
> GetElementPtrInst 2.1%
> StoreInst         1.6%
> Out of 35% InstCombine time, about half is spent in the top 5 visitor routines.

So, if I read your preliminary analysis correctly, walking the
matchers seems to be the expensive part; is that what you're saying?
Is this a run with debug info, i.e. are you passing -g to the per-TU
pipeline? I'm inclined to think this is mostly an additive effect:
adding matchers here and there doesn't really hurt small test cases,
but we pay the debt over time (in particular for LTO). As a side note,
I noticed (and others did as well) that instcombine is way slower with
`-g` on; one reason could be that we walk much longer use lists, due
to the debug uses. Do you have numbers for instcombine run on IR with
and without debug info?

> I wanted to see what transformations InstCombine actually performs. Using the -debug option turned out not to be very scalable. Never mind the large output size of the trace, running "opt -debug -instcombine" on anything other than a small IR is excruciatingly slow. Out of curiosity I profiled it too: 96% of the time is spent decoding and printing instructions. Is this a known problem? If so, what are the alternatives for debugging large-scale problems? If not, it's possibly another item to add to the to-do list.

You may consider adding statistics; those should be much more
scalable, although coarser.



"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare
