[llvm-dev] Saving Compile Time in InstCombine

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Wed Mar 22 06:36:25 PDT 2017


On 03/20/2017 11:51 PM, Gerolf Hoflehner wrote:
>
>> On Mar 17, 2017, at 6:12 PM, David Majnemer via llvm-dev 
>> <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>>
>> Honestly, I'm not a huge fan of this change as-is. The set of 
>> transforms that were added behind ExpensiveChecks seems awfully 
>> strange and many would not lead the reader to believe that they are 
>> expensive at all (the SimplifyDemandedInstructionBits and 
>> foldICmpUsingKnownBits calls being the obvious expensive routines).
>>
>> The purpose of many of InstCombine's xforms is to canonicalize the IR 
>> to make life easier for downstream passes and analyses.
>
> As we get further along with compile-time improvements one question we 
> need to ask ourselves more frequently is about the effectiveness of 
> optimizations/passes. For example -  in this case - how can we make an 
> educated assessment that running the combiner N times is a good 
> cost/benefit investment of compute resources? The questions below are 
> meant to figure out what technologies/instrumentations/etc could help 
> towards a more data-driven decision process when it comes to the 
> effectiveness of optimizations. Instcombiner might just be an 
> inspirational use case to see what is possible in that direction.
>
> The combiner is invoked in full multiple times. But is it really 
> necessary to run all of it for that purpose? After instcombine has run 
> once, is there a mapping from transformation -> combines? I suspect 
> most transformations could invoke a subset of combines to 
> re-canonicalize. Or, if there were a (cheap) verifier for canonical IR, 
> it could invoke a specific canonicalization routine. Instrumenting the 
> instcombiner and checking which patterns actually kick in (for 
> different invocations) might give insight into how the combiner could 
> be structured so that only a subset of patterns needs to be checked.
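A minimal sketch of the kind of instrumentation suggested above: a per-invocation table counting which patterns fire. This is illustrative only; all names are hypothetical, and real InstCombine instrumentation would more likely use LLVM's `STATISTIC` machinery than a map of strings.

```cpp
#include <map>
#include <string>

// Hypothetical per-invocation pattern-hit counters. Each pattern's
// success path would call recordHit; comparing tables across the
// pipeline's combiner invocations shows which patterns still fire late.
using PatternCounts = std::map<std::string, unsigned>;

struct CombinerStats {
  std::map<std::string, PatternCounts> PerInvocation;

  // Record that `Pattern` fired during combiner run `Invocation`.
  void recordHit(const std::string &Invocation, const std::string &Pattern) {
    ++PerInvocation[Invocation][Pattern];
  }

  // How many times `Pattern` fired during `Invocation` (0 if never).
  unsigned hits(const std::string &Invocation,
                const std::string &Pattern) const {
    auto I = PerInvocation.find(Invocation);
    if (I == PerInvocation.end())
      return 0;
    auto P = I->second.find(Pattern);
    return P == I->second.end() ? 0 : P->second;
  }
};
```

A table like this, dumped after each scheduled run, would give the data-driven basis the question asks for: patterns that never fire after the first invocation are candidates for gating.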
>>
>> InstCombine internally relies on knowing what transforms it may or 
>> may not perform. This is important: canonicalizations may duel 
>> endlessly if we get this wrong; the order of the combines is also 
>> important for exactly the same reason (SelectionDAG deals with this 
>> problem in a different way with its pattern complexity field).
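The "duel endlessly" failure mode is easy to reproduce outside LLVM: take two rewrite rules that are each locally plausible but mutually inverse, and a fixed-point driver over them never terminates. The sketch below uses simplified string rewriting rather than real IR (all names hypothetical); a combiner avoids the oscillation by fixing one canonical direction per pair.

```cpp
#include <string>

// Two rewrites that are each "canonicalizing" in isolation but mutually
// inverse. If both are enabled, they undo each other forever.
inline std::string rewriteMulToShl(const std::string &E) {
  return E == "x*2" ? "x<<1" : E;
}
inline std::string rewriteShlToMul(const std::string &E) {
  return E == "x<<1" ? "x*2" : E;
}

// Fixed-point driver with an iteration cap. Returns the number of steps
// taken; hitting the cap means the rule pair never converged.
inline unsigned runToFixedPoint(std::string E, unsigned Cap) {
  unsigned Steps = 0;
  while (Steps < Cap) {
    std::string Next = rewriteMulToShl(E);
    if (Next == E)
      Next = rewriteShlToMul(E);
    if (Next == E)
      break; // true fixed point: neither rule applies
    E = Next;
    ++Steps;
  }
  return Steps;
}
```

Dropping either rule (or ordering them so only one direction is ever taken, as SelectionDAG's pattern complexity field effectively does) restores termination.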
>
> Can you elaborate on this “duel endlessly” with specific examples? 
> This is out of curiosity. There must be verifiers that check that this 
> cannot happen, or an implementation strategy that guarantees it. 
> GlobalISel will run into the same or a similar question when it gets 
> far enough to replace SelectionDAG.
>>
>> Another concern with moving seemingly arbitrary combines under 
>> ExpensiveCombines is that it will make it that much harder to 
>> understand what is and is not canonical at a given point during the 
>> execution of the optimizer.
>
>>
>> I'd be much more interested in a patch which caches the result of 
>> frequently called ValueTracking functionality like ComputeKnownBits, 
>> ComputeSignBit, etc. which often doesn't change but is not 
>> intelligently reused. I imagine that the performance win might be 
>> quite comparable.
>
> Can you back this up with measurements? Caching schemes are tricky. Is 
> there a way to evaluate whether the result of ComputeKnownBits etc. is 
> actually effective, meaning the result is used and yields faster 
> instructions? E.g. it might well be that only the first invocation of 
> instcombine benefits from computing the bits.
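The shape of the cache being discussed can be sketched with a simplified stand-in for `llvm::KnownBits` (the names and the integer value-ID keying below are illustrative, not the real patch). The sketch makes the "tricky" part concrete: any transform that changes a value's operands must invalidate its entry, or stale known bits will drive wrong folds.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>

// Simplified stand-in for llvm::KnownBits: which bits are known 0 / 1.
struct KnownBits {
  uint64_t Zero = 0, One = 0;
};

// Hypothetical memoization of an expensive analysis, keyed by a value id.
// Correctness hinges on invalidate() being called whenever a transform
// mutates the value or anything it transitively depends on.
class KnownBitsCache {
  std::unordered_map<unsigned, KnownBits> Cache;
  unsigned Computations = 0; // how often the expensive analysis ran

public:
  KnownBits get(unsigned ValueID,
                const std::function<KnownBits(unsigned)> &Compute) {
    auto It = Cache.find(ValueID);
    if (It != Cache.end())
      return It->second; // cache hit: no recomputation
    ++Computations;
    return Cache[ValueID] = Compute(ValueID);
  }

  void invalidate(unsigned ValueID) { Cache.erase(ValueID); }
  unsigned computations() const { return Computations; }
};
```

Counting `computations()` against total `get()` calls is one cheap way to answer the measurement question above: a low hit rate would mean the cache cannot pay for its invalidation bookkeeping.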

To (hopefully) make it easier to answer this question, I've posted my 
(work-in-progress) patch which adds a known-bits cache to InstCombine. I 
rebased it yesterday, so it should be fairly easy to apply: 
https://reviews.llvm.org/D31239 - Seeing what this does to the 
performance of the benchmarks mentioned in this thread (among others) 
would certainly be interesting.

  -Hal

>
>
>> Such a patch would have the benefit of keeping the set of available 
>> transforms constant throughout the pipeline while bringing execution 
>> time down; I wouldn't be at all surprised if caching the 
>> ValueTracking functions resulted in a bigger time savings.
>>
>> On Fri, Mar 17, 2017 at 5:49 PM, Hal Finkel via llvm-dev 
>> <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>>
>>
>>     On 03/17/2017 04:30 PM, Mehdi Amini via llvm-dev wrote:
>>>
>>>>     On Mar 17, 2017, at 11:50 AM, Mikhail Zolotukhin via llvm-dev
>>>>     <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>>>>
>>>>     Hi,
>>>>
>>>>     One of the most time-consuming passes in LLVM middle-end is
>>>>     InstCombine (see e.g. [1]). It is a very powerful pass capable
>>>>     of doing all the crazy stuff, and new patterns are being
>>>>     constantly introduced there. The problem is that we often use
>>>>     it just as a clean-up pass: it's scheduled 6 times in the
>>>>     current pass pipeline, and each time it's invoked it checks all
>>>>     known patterns. It sounds ok for O3, where we try to squeeze as
>>>>     much performance as possible, but it is too excessive for other
>>>>     opt-levels. InstCombine has an ExpensiveCombines parameter to
>>>>     address that - but I think it's underused at the moment.
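The compile-time argument above reduces to simple arithmetic: the combiner's pattern walk happens once per scheduled invocation, so gating rarely-firing patterns shrinks every run. A hypothetical cost model (illustrative numbers and names, not actual InstCombine counts):

```cpp
// Hypothetical cost model for the pipeline behaviour described above:
// the combiner is scheduled several times, and each invocation checks
// its full pattern set unless expensive patterns are gated off.
inline unsigned patternsChecked(unsigned Invocations, unsigned CheapPatterns,
                                unsigned ExpensivePatterns,
                                bool ExpensiveCombines) {
  unsigned PerRun =
      CheapPatterns + (ExpensiveCombines ? ExpensivePatterns : 0);
  return Invocations * PerRun;
}
```

With 6 invocations, gating (say) 20 expensive patterns out of 120 cuts the checks performed at -Os from 720 to 600, which is the mechanism behind the ~1% compile-time wins reported below.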
>>>
>>>     Yes, “ExpensiveCombines” was added recently (4.0? 3.9?),
>>>     but I believe it has always been intended to be extended the way
>>>     you’re doing it. So I support this effort :)
>>
>>     +1
>>
>>     Also, did your profiling reveal why the other combines are
>>     expensive? Among other things, I'm curious if the expensive ones
>>     tend to spend a lot of time in ValueTracking (getting known bits
>>     and similar)?
>>
>>      -Hal
>>
>>
>>>
>>>     CC: David for the general direction on InstCombine though.
>>>
>>>
>>>     Mehdi
>>>
>>>
>>>
>>>>
>>>>     Trying to find out which patterns are important and which are
>>>>     rare, I profiled clang using CTMark and got the following
>>>>     coverage report:
>>>>     <InstCombine_covreport.html>
>>>>     (beware, the file is ~6MB).
>>>>
>>>>     Guided by this profile I moved some patterns under the "if
>>>>     (ExpensiveCombines)" check, which expectedly happened to be
>>>>     neutral for runtime performance, but improved compile-time. The
>>>>     testing results are below (measured for Os).
>>>>
>>>>     Performance Improvements - Compile Time
>>>>
>>>>     Test                          Δ        Previous   Current    σ
>>>>     CTMark/sqlite3/sqlite3        -1.55%    6.8155     6.7102   0.0081
>>>>     CTMark/mafft/pairlocalalign   -1.05%    8.0407     7.9559   0.0193
>>>>     CTMark/ClamAV/clamscan        -1.02%   11.3893    11.2734   0.0081
>>>>     CTMark/lencod/lencod          -1.01%   12.8763    12.7461   0.0244
>>>>     CTMark/SPASS/SPASS            -1.01%   12.5048    12.3791   0.0340
>>>>
>>>>
>>>>     Performance Improvements - Compile Time
>>>>
>>>>     Test                                                 Δ        Previous   Current   σ
>>>>     External/SPEC/CINT2006/403.gcc/403.gcc               -1.64%   54.0801    53.1930   -
>>>>     External/SPEC/CINT2006/400.perlbench/400.perlbench   -1.25%   19.1481    18.9091   -
>>>>     External/SPEC/CINT2006/445.gobmk/445.gobmk           -1.01%   15.2819    15.1274   -
>>>>
>>>>
>>>>
>>>>     Do such changes make sense? The patch doesn't change O3, but it
>>>>     does change Os and could potentially change performance there
>>>>     (though I didn't see any changes in my tests).
>>>>
>>>>     The patch is attached for reference; if we decide to go for
>>>>     it, I'll upload it to Phabricator:
>>>>
>>>>     <0001-InstCombine-Move-some-infrequent-patterns-under-if-E.patch>
>>>>
>>>>
>>>>     Thanks,
>>>>     Michael
>>>>
>>>>     [1]:
>>>>     http://lists.llvm.org/pipermail/llvm-dev/2016-December/108279.html
>>>>     <http://lists.llvm.org/pipermail/llvm-dev/2016-December/108279.html>
>>>>
>>>>     _______________________________________________
>>>>     LLVM Developers mailing list
>>>>     llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>>>>     http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>     <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>>>
>>>
>>>
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory