[llvm-dev] Saving Compile Time in InstCombine

Craig Topper via llvm-dev llvm-dev at lists.llvm.org
Thu Apr 13 21:46:22 PDT 2017


I've looked a little on where time is spent. I can provide more info when
I'm back at work tomorrow.

visitGetElementPtr spends most of its time in SimplifyGEPInst. I'll have to
look back at what it's doing in there.

Loads and stores spend much of their time in getOrEnforceKnownAlignment
which uses computeKnownBits. I think loads also spend time in
FindAvailableLoadedValue.
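As a rough illustration of what a computeKnownBits-style analysis does (a toy sketch of the idea only, not LLVM's actual KnownBits implementation), it tracks which bits of a value are provably 0 or provably 1 and propagates that through each operation:

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the known-bits idea: track which bits of a value are
// provably 0 (Zero) or provably 1 (One), and propagate through operations.
struct KnownBits {
  uint64_t Zero = 0;
  uint64_t One = 0;
};

// AND: a result bit is known 1 only if it is known 1 in both inputs,
// and known 0 if it is known 0 in either input.
KnownBits computeAnd(KnownBits A, KnownBits B) {
  return {A.Zero | B.Zero, A.One & B.One};
}

// Shift left by a constant: known bits move up; the low C bits become known 0.
KnownBits computeShl(KnownBits A, unsigned C) {
  uint64_t LowZeros = (C == 0) ? 0 : ((uint64_t{1} << C) - 1);
  return {(A.Zero << C) | LowZeros, A.One << C};
}
```

Each query like this can recurse through the operands of the value being analyzed, which is why hot callers such as getOrEnforceKnownAlignment show up prominently in profiles.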

I haven't looked at calls or icmps in much detail.



~Craig

On Thu, Apr 13, 2017 at 8:27 PM, Daniel Berlin via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

>
>
> On Thu, Apr 13, 2017 at 5:18 PM, Mikulin, Dmitry via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> I’m taking a first look at InstCombine performance. I picked up the
>> caching patch and ran a few experiments on one of our larger C++ apps. The
>> size of the *.0.2.internalize.bc no-debug IR is ~ 30M. Here are my
>> observations so far.
>>
>> Interestingly, caching produced a slight but measurable degradation in
>> -O3 compile time.
>>
>> InstCombine takes about 35% of total execution time, of which ~20%
>> originates from CGPassManager.
>>
>> computeKnownBits contributes 7.8%, but calls from InstCombine contribute
>> only 2.6% of the total execution time. Caching covers only InstCombine's
>> use of KnownBits. This may explain the limited gain, or even the slight
>> degradation, if KnownBits values are not recomputed as often as we thought.
>>
> ....
> Not entirely shocking.
> I continue to believe we could compute it essentially once or twice, in
> the right iteration order, and have a performance gain here.
>
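A minimal sketch of that idea (hypothetical, not an actual LLVM patch): visit the instructions once in def-before-use order, compute each result from the already-cached results of its operands, and answer later queries from the cache instead of recomputing:

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: compute an analysis result for every node exactly
// once, visiting defs before uses, so each node's result is built from the
// already-cached results of its operands.
struct Node {
  std::vector<const Node *> Operands;
};

class OnePassAnalysis {
  // The cached result per node; here just the depth of its expression DAG.
  std::unordered_map<const Node *, int> Cache;

public:
  int Computations = 0; // how many times we actually computed a result

  // Order must list operands before their users (def-before-use).
  void run(const std::vector<const Node *> &Order) {
    for (const Node *N : Order) {
      int Depth = 0;
      for (const Node *Op : N->Operands)
        Depth = std::max(Depth, Cache.at(Op) + 1); // operands already cached
      Cache[N] = Depth;
      ++Computations;
    }
  }

  // Queries are O(1) lookups; nothing is ever recomputed.
  int query(const Node *N) const { return Cache.at(N); }
};
```

The computation count stays at one per node no matter how many times a result is queried; the hard part in practice is invalidating the cache correctly as InstCombine mutates the IR.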
>
>>
>> Most of the time is spent in instruction visitor routines. CmpInst,
>> LoadInst, CallInst, GetElementPtrInst and StoreInst are the top
>> contributors.
>>
>> ICmpInst          6.1%
>> LoadInst          5.5%
>> CallInst          2.1%
>> GetElementPtrInst 2.1%
>> StoreInst         1.6%
>>
>> Of the 35% of time spent in InstCombine, about half goes to the top 5
>> visitor routines.
>>
>> I wanted to see what transformations InstCombine actually performs. Using
>> the -debug option turned out not to be very scalable. Never mind the large
>> output size of the trace, running "opt -debug -instcombine" on anything
>> other than a small IR is excruciatingly slow. Out of curiosity I profiled
>> it too: 96% of the time is spent decoding and printing instructions. Is
>> this a known problem?
>
>
> Yes
>
> The problem is that *every* value print call builds an assembly writer,
> which invokes the module type finder so that it can print the types. That
> walks everything in the module to find the types; for large IR with many
> types this is *ridiculously* slow.
>
> In other words, it's basically reprocessing a large part of the module for
> *every operand* it prints.
>
> You can, for something like what you are doing, just hack it up to build
> it once or not at all.
>
> I've never understood why this doesn't annoy people more :)
>
> As a hack, you can comment out AsmWriter.cpp:2137.
>
> It should fix what you are seeing.
>
> In practice, most types seem to print fine even without doing this, so ...
>
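The fix Daniel sketches amounts to hoisting the expensive module walk out of the per-operand print path. A toy stand-in (hypothetical names, not the real AsmWriter API):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for the problem described above (hypothetical names, not the
// real AsmWriter API): building the type index walks the whole module.
struct Module {
  std::vector<std::string> Types; // stand-in for all types in the module
};

int IndexBuilds = 0; // count how many times the expensive walk runs

struct TypeIndex {
  std::vector<std::string> Found;
  explicit TypeIndex(const Module &M) : Found(M.Types) { ++IndexBuilds; }
};

// Slow path, as in the thread: a fresh module-wide index per operand printed.
std::string printOperandSlow(const Module &M, const std::string &Op) {
  TypeIndex Idx(M); // rebuilt on every call
  (void)Idx;
  return Op; // actual printing elided
}

// Fast path: build the index once and reuse it for every print.
std::string printOperandFast(const TypeIndex &, const std::string &Op) {
  return Op;
}
```

Hoisting the index construction out of the print calls turns N module walks into one, which is the difference between the 96%-of-time-printing profile reported above and a usable -debug run.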
>> If so, what are the alternatives for debugging large-scale problems? If
>> not, it’s possibly another item to add to the to-do list.
>>
>> Back to InstCombine, from the profile it does not appear there’s an
>> obvious magic bullet that can help drastically improve performance. I will
>> take a closer look at visitor functions and see if there’s anything that
>> can be done.
>>
>> Dmitry.
>>
>>
>> > On Mar 22, 2017, at 6:45 PM, Davide Italiano via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>> >
>> > On Wed, Mar 22, 2017 at 6:29 PM, Mikhail Zolotukhin via llvm-dev
>> > <llvm-dev at lists.llvm.org> wrote:
>> >>
>> >> In my testing the results are not that impressive, but that's because
>> >> I'm now focusing on Os. For me, even completely disabling all
>> >> KnownBits-related patterns in InstCombine places the results very close
>> >> to the noise level. In my original patch I also had some extra patterns
>> >> moved under ExpensiveCombines, and that seems to make a difference too
>> >> (without this part, or without the KnownBits part, I get results below
>> >> 1%, which are not reported as regressions/improvements).
>> >>
>> >
>> > Have you profiled a single InstCombine run to see where we actually
>> > spend our cycles (as Sanjay did for his reduced testcase)?
>> >
>> >> I realize that InstCombine doesn't usually do any harm, if we don't
>> >> care about compile time, but that's only the case for O3 (to some
>> >> extent), not for other optimization levels.
>> >
>> > Independently of the optimization level, I think compile time is
>> > important. Note, for example, that we run a (kinda) similar pipeline at
>> > O3 and LTO (full, that is), where the impact of compile time is much
>> > more evident. Also, while people are not generally bitten by O3
>> > compilation time, you may end up with terrible performance for large
>> > TUs (and I unfortunately learned this the hard way).
>> >
>> > --
>> > Davide
>> > _______________________________________________
>> > LLVM Developers mailing list
>> > llvm-dev at lists.llvm.org
>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>
>