[llvm-dev] Saving Compile Time in InstCombine
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Wed Mar 22 14:23:03 PDT 2017
On 03/22/2017 01:34 PM, Sanjay Patel wrote:
>> To (hopefully) make it easier to answer this question, I've posted
>> my (work-in-progress) patch which adds a known-bits cache to InstCombine.
>> I rebased it yesterday, so it should be fairly easy to apply:
>> https://reviews.llvm.org/D31239 - Seeing what this does to the
>> performance of the benchmarks mentioned in this thread (among others)
>> would certainly be interesting.
>
> Thanks! I only have the one rough data point based on PR32037, but
> caching does good things for compile-time on that example.
>
> Trunk r298514 compiled release on macOS running on Haswell 4GHz:
> $ time ./opt -O2 row_common.bc -S -o /dev/null
>
> user 0m0.302s
> user 0m0.300s
> user 0m0.296s
> user 0m0.299s
> user 0m0.296s
>
> With your patch applied:
>
> user 0m0.264s
> user 0m0.269s
> user 0m0.269s
> user 0m0.268s
> user 0m0.268s
>
> So the time for all of -O2 has dropped to 89.6% of the baseline
> (a speedup of ~11.6%).
> A profile shows time spent in InstCombine->computeKnownBits dropped
> from 58 ms to 15 ms (which lines up with the overall time drop), so
> we're about 4x faster in ValueTracking with the caching.
Yay :-)

Unfortunately, I won't have time to work on this in the near
future, but if someone would like to pick this up and fix the nsw/nuw
invalidation issue (which is exposed in a few of the regression tests
that fail with the patch applied), that would be great.
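The caching-plus-invalidation idea can be sketched as follows. This is only an illustration of the scheme, with hypothetical stand-ins for llvm::Value and the KnownBits pair, not the actual D31239 code:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for llvm::Value; the real patch caches KnownBits
// results keyed by Value* inside InstCombine.
struct Value {
  uint64_t KnownZero = 0; // bits known to be 0
  uint64_t KnownOne = 0;  // bits known to be 1
};

class KnownBitsCache {
  std::unordered_map<const Value *, std::pair<uint64_t, uint64_t>> Cache;

public:
  unsigned Misses = 0; // counts expensive recomputations, for illustration

  std::pair<uint64_t, uint64_t> compute(const Value *V) {
    auto It = Cache.find(V);
    if (It != Cache.end())
      return It->second; // hit: skip the expensive recursive walk
    ++Misses;
    // Placeholder for the real recursive computeKnownBits() traversal.
    auto Result = std::make_pair(V->KnownZero, V->KnownOne);
    Cache.emplace(V, Result);
    return Result;
  }

  // The invalidation issue mentioned above: if a transform strips nsw/nuw
  // flags from an instruction, cached facts derived from those flags go
  // stale, so the entry (and, transitively, entries for its users) must
  // be erased rather than reused.
  void invalidate(const Value *V) { Cache.erase(V); }
};
```

The subtle part is exactly the invalidation: a hit is only sound if nothing the cached result depended on (operands, wrapping flags) has changed since it was computed.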
-Hal
>
> On Wed, Mar 22, 2017 at 7:36 AM, Hal Finkel via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>
> On 03/20/2017 11:51 PM, Gerolf Hoflehner wrote:
>>
>>> On Mar 17, 2017, at 6:12 PM, David Majnemer via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>
>>> Honestly, I'm not a huge fan of this change as-is. The set of
>>> transforms that were added behind ExpensiveChecks seems awfully
>>> strange and many would not lead the reader to believe that they
>>> are expensive at all (the SimplifyDemandedInstructionBits and
>>> foldICmpUsingKnownBits calls being the obvious expensive routines).
>>>
>>> The purpose of many of InstCombine's xforms is to canonicalize
>>> the IR to make life easier for downstream passes and analyses.
>>
>> As we get further along with compile-time improvements one
>> question we need to ask ourselves more frequently is about the
>> effectiveness of optimizations/passes. For example - in this
>> case - how can we make an educated assessment that running the
>> combiner N times is a good cost/benefit investment of compute
>> resources? The questions below are meant to figure out what
>> technologies/instrumentations/etc could help towards a more
>> data-driven decision process when it comes to the effectiveness
>> of optimizations. Instcombiner might just be an inspirational use
>> case to see what is possible in that direction.
>>
>> The combiner is invoked in full multiple times. But is it really
>> necessary to run all of it for that purpose? After instcombine is
>> run once, is there a mapping from transformation -> combines? I
>> suspect most transformations could invoke a subset of combines to
>> re-canonicalize. Or, if there were a (cheap) verifier for
>> canonical IR, it could invoke a specific canonicalization
>> routine. Instrumenting the instcombiner and checking which
>> patterns actually kick in (for different invocations) might give
>> insight into how the combiner could be restructured so that
>> only a subset of patterns needs to be checked.
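A data-driven look at this could start from LLVM's existing STATISTIC macro and `opt -stats` output, which give whole-run counters; splitting counts per combiner invocation would show which patterns still fire on later, already-canonicalized runs. A standalone sketch of that bookkeeping (with hypothetical pattern names, not real InstCombine counters):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Sketch of per-invocation pattern-hit counting. Unlike a single global
// counter, keying on (invocation number, pattern name) distinguishes
// patterns that only matter on the first run from ones that keep firing.
class PatternProfile {
  std::map<std::pair<int, std::string>, unsigned> Hits;

public:
  void record(int Invocation, const std::string &Pattern) {
    ++Hits[{Invocation, Pattern}];
  }

  unsigned hits(int Invocation, const std::string &Pattern) const {
    auto It = Hits.find({Invocation, Pattern});
    return It == Hits.end() ? 0 : It->second;
  }
};
```

A pattern whose hit count drops to zero after invocation 0 is a candidate for gating out of the later clean-up runs.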
>>>
>>> InstCombine internally relies on knowing what transforms it may
>>> or may not perform. This is important: canonicalizations may
>>> duel endlessly if we get this wrong; the order of the combines
>>> is also important for exactly the same reason (SelectionDAG
>>> deals with this problem in a different way with its pattern
>>> complexity field).
>>
>> Can you elaborate on this “duel endlessly” with specific
>> examples? This is out of curiosity. There must be verifiers that
>> check that this cannot happen, or an implementation strategy that
>> guarantees it. GlobalISel will run into the same/similar
>> question when it gets far enough to replace SelectionDAG.
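As an illustration of the failure mode (a toy example, not actual InstCombine transforms): suppose one rule rewrites `x + C` to `x - (-C)` and another rewrites `x - C` to `x + (-C)`. Each undoes the other, so a fixed-point driver never converges:

```cpp
#include <cassert>

// Toy "IR": the expression is either x + C or x - C.
struct Expr {
  bool IsAdd;
  int C;
};

// Two individually plausible "canonicalizations" that are mutually
// inverse: each one looks like progress locally, but together they
// ping-pong forever.
bool addToSub(Expr &E) {
  if (!E.IsAdd)
    return false;
  E.IsAdd = false; // x + C  ->  x - (-C)
  E.C = -E.C;
  return true;
}

bool subToAdd(Expr &E) {
  if (E.IsAdd)
    return false;
  E.IsAdd = true;  // x - C  ->  x + (-C)
  E.C = -E.C;
  return true;
}

// Fixed-point driver with a safety cap; with both rules enabled it
// always hits the cap instead of converging.
int runToFixedPoint(Expr &E, int MaxIters) {
  for (int I = 0; I < MaxIters; ++I) {
    bool Changed = addToSub(E);
    Changed |= subToAdd(E);
    if (!Changed)
      return I; // converged
  }
  return MaxIters; // the rules dueled endlessly
}
```

Declaring one direction canonical (keeping only one of the two rules) makes the loop converge in a single pass, which is exactly why InstCombine has to know globally which form it prefers.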
>>>
>>> Another concern with moving seemingly arbitrary combines under
>>> ExpensiveCombines is that it will make it that much harder to
>>> understand what is and is not canonical at a given point during
>>> the execution of the optimizer.
>>
>>>
>>> I'd be much more interested in a patch which caches the result
>>> of frequently called ValueTracking functionality like
>>> ComputeKnownBits, ComputeSignBit, etc. which often doesn't
>>> change but is not intelligently reused. I imagine that the
>>> performance win might be quite comparable.
>>
>> Can you back this up with measurements? Caching schemes are
>> tricky. Is there a way to evaluate whether the results of
>> ComputeKnownBits etc. are actually effective, meaning the result is
>> used and yields faster instructions? E.g. it might well be that
>> only the first instance of instcombine benefits from computing
>> the bits.
>
> To (hopefully) make it easier to answer this question, I've posted
> my (work-in-progress) patch which adds a known-bits cache to
> InstCombine. I rebased it yesterday, so it should be fairly easy
> to apply: https://reviews.llvm.org/D31239 - Seeing what this does
> to the performance of the benchmarks mentioned in this thread
> (among others) would certainly be interesting.
>
> -Hal
>
>
>>
>>
>>> Such a patch would have the benefit of keeping the set of
>>> available transforms constant throughout the pipeline while
>>> bringing execution time down; I wouldn't be at all surprised if
>>> caching the ValueTracking functions resulted in a bigger time
>>> savings.
>>>
>>> On Fri, Mar 17, 2017 at 5:49 PM, Hal Finkel via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>
>>>
>>> On 03/17/2017 04:30 PM, Mehdi Amini via llvm-dev wrote:
>>>>
>>>>> On Mar 17, 2017, at 11:50 AM, Mikhail Zolotukhin via
>>>>> llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> One of the most time-consuming passes in the LLVM middle-end
>>>>> is InstCombine (see e.g. [1]). It is a very powerful pass
>>>>> capable of doing all sorts of crazy stuff, and new patterns are
>>>>> constantly being introduced there. The problem is that we
>>>>> often use it just as a clean-up pass: it's scheduled 6
>>>>> times in the current pass pipeline, and each time it's
>>>>> invoked it checks all known patterns. That's fine for O3,
>>>>> where we try to squeeze out as much performance as possible,
>>>>> but it is excessive for other opt-levels. InstCombine
>>>>> has an ExpensiveCombines parameter to address that - but I
>>>>> think it's underused at the moment.
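The gating idea can be sketched like so. This is a toy model of the ExpensiveCombines split, with made-up stand-ins for the cheap and expensive pattern checks, not real InstCombine code:

```cpp
#include <cassert>

// Toy model of InstCombine's ExpensiveCombines parameter: cheap pattern
// checks always run; a costly known-bits-style analysis runs only when
// the flag is set (e.g. at -O3). The "patterns" here are invented.
int countAppliedCombines(unsigned X, bool ExpensiveCombines) {
  int Applied = 0;

  // Cheap check: a single bit test.
  if ((X & 1) == 0)
    ++Applied;

  if (ExpensiveCombines) {
    // Stand-in for an expensive walk such as computeKnownBits: count the
    // set bits to decide whether X is a power of two (or zero).
    unsigned Bits = 0;
    for (unsigned V = X; V; V >>= 1)
      Bits += V & 1;
    if (Bits <= 1)
      ++Applied;
  }
  return Applied;
}
```

The trade-off being discussed in this thread is exactly which checks belong inside that `if`: a pattern that rarely fires but is cheap to test costs little, while a rarely-firing pattern that triggers a recursive analysis is a good gating candidate.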
>>>>
>>>> Yes, “ExpensiveCombines” was added recently (4.0? 3.9?), but I
>>>> believe it has always been intended to be extended in the way
>>>> you’re doing it. So I support this effort :)
>>>
>>> +1
>>>
>>> Also, did your profiling reveal why the other combines are
>>> expensive? Among other things, I'm curious if the expensive
>>> ones tend to spend a lot of time in ValueTracking (getting
>>> known bits and similar)?
>>>
>>> -Hal
>>>
>>>
>>>>
>>>> CC: David for the general direction on InstCombine though.
>>>>
>>>>
>>>> —
>>>> Mehdi
>>>>
>>>>
>>>>
>>>>>
>>>>> Trying to find out, which patterns are important, and
>>>>> which are rare, I profiled clang using CTMark and got the
>>>>> following coverage report:
>>>>> <InstCombine_covreport.html>
>>>>> (beware, the file is ~6MB).
>>>>>
>>>>> Guided by this profile I moved some patterns under the "if
>>>>> (ExpensiveCombines)" check, which expectedly happened to
>>>>> be neutral for runtime performance, but improved
>>>>> compile-time. The testing results are below (measured for Os).
>>>>>
>>>>> Performance Improvements - Compile Time
>>>>>
>>>>>   Benchmark                      Δ        Previous  Current  σ
>>>>>   CTMark/sqlite3/sqlite3         -1.55%   6.8155    6.7102   0.0081
>>>>>   CTMark/mafft/pairlocalalign    -1.05%   8.0407    7.9559   0.0193
>>>>>   CTMark/ClamAV/clamscan         -1.02%   11.3893   11.2734  0.0081
>>>>>   CTMark/lencod/lencod           -1.01%   12.8763   12.7461  0.0244
>>>>>   CTMark/SPASS/SPASS             -1.01%   12.5048   12.3791  0.0340
>>>>>
>>>>>
>>>>> Performance Improvements - Compile Time
>>>>>
>>>>>   Benchmark                                            Δ        Previous  Current  σ
>>>>>   External/SPEC/CINT2006/403.gcc/403.gcc               -1.64%   54.0801   53.1930  -
>>>>>   External/SPEC/CINT2006/400.perlbench/400.perlbench   -1.25%   19.1481   18.9091  -
>>>>>   External/SPEC/CINT2006/445.gobmk/445.gobmk           -1.01%   15.2819   15.1274  -
>>>>>
>>>>>
>>>>>
>>>>> Do such changes make sense? The patch doesn't change O3,
>>>>> but it does change Os and could potentially affect
>>>>> performance there (though I didn't see any changes in my
>>>>> tests).
>>>>>
>>>>> The patch is attached for the reference, if we decide to
>>>>> go for it, I'll upload it to phab:
>>>>>
>>>>> <0001-InstCombine-Move-some-infrequent-patterns-under-if-E.patch>
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Michael
>>>>>
>>>>> [1]:
>>>>> http://lists.llvm.org/pipermail/llvm-dev/2016-December/108279.html
>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>>
>>>>
>>>
>>> --
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory