[llvm-dev] Saving Compile Time in InstCombine
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Wed Mar 22 14:23:03 PDT 2017
On 03/22/2017 01:34 PM, Sanjay Patel wrote:
>> To (hopefully) make it easier to answer this question, I've posted
>> my (work-in-progress) patch which adds a known-bits cache to InstCombine.
>> I rebased it yesterday, so it should be fairly easy to apply:
>> https://reviews.llvm.org/D31239 - Seeing what this does to the
>> performance of the benchmarks mentioned in this thread (among others)
>> would certainly be interesting.
>
> Thanks! I only have the one rough data point based on PR32037, but
> caching does good things for compile-time on that example.
>
> Trunk r298514 compiled release on macOS running on Haswell 4GHz:
> $ time ./opt -O2 row_common.bc -S -o /dev/null
>
> user 0m0.302s
> user 0m0.300s
> user 0m0.296s
> user 0m0.299s
> user 0m0.296s
>
> With your patch applied:
>
> user 0m0.264s
> user 0m0.269s
> user 0m0.269s
> user 0m0.268s
> user 0m0.268s
>
> So the time for all of -O2 has dropped to 89.6% of the baseline
> (a speedup of ~11.6%).
> A profile shows time spent in InstCombine->computeKnownBits dropped
> from 58 ms to 15 ms (which lines up with the overall time drop), so
> we're about 4x faster in ValueTracking with the caching.
Yay :-)

Unfortunately, I won't have time to work on this in the near
future, but if someone would like to pick this up and fix the nsw/nuw
invalidation issue (which is exposed in a few of the regression tests
that fail with the patch applied), that would be great.
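The caching-plus-invalidation idea can be sketched as follows. This is only an illustration of the scheme, with hypothetical stand-ins for llvm::Value and the KnownBits pair, not the actual D31239 code:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for llvm::Value; the real patch caches KnownBits
// results keyed by Value* inside InstCombine.
struct Value {
  uint64_t KnownZero = 0; // bits known to be 0
  uint64_t KnownOne = 0;  // bits known to be 1
};

class KnownBitsCache {
  std::unordered_map<const Value *, std::pair<uint64_t, uint64_t>> Cache;

public:
  unsigned Misses = 0; // counts expensive recomputations, for illustration

  std::pair<uint64_t, uint64_t> compute(const Value *V) {
    auto It = Cache.find(V);
    if (It != Cache.end())
      return It->second; // hit: skip the expensive recursive walk
    ++Misses;
    // Placeholder for the real recursive computeKnownBits() traversal.
    auto Result = std::make_pair(V->KnownZero, V->KnownOne);
    Cache.emplace(V, Result);
    return Result;
  }

  // The invalidation issue mentioned above: if a transform strips nsw/nuw
  // flags from an instruction, cached facts derived from those flags go
  // stale, so the entry (and, transitively, entries for its users) must
  // be erased rather than reused.
  void invalidate(const Value *V) { Cache.erase(V); }
};
```

The subtle part is exactly the invalidation: a hit is only sound if nothing the cached result depended on (operands, wrapping flags) has changed since it was computed.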
-Hal
>
> On Wed, Mar 22, 2017 at 7:36 AM, Hal Finkel via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>
> On 03/20/2017 11:51 PM, Gerolf Hoflehner wrote:
>>
>>> On Mar 17, 2017, at 6:12 PM, David Majnemer via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>
>>> Honestly, I'm not a huge fan of this change as-is. The set of
>>> transforms that were added behind ExpensiveChecks seems awfully
>>> strange and many would not lead the reader to believe that they
>>> are expensive at all (the SimplifyDemandedInstructionBits and
>>> foldICmpUsingKnownBits calls being the obvious expensive routines).
>>>
>>> The purpose of many of InstCombine's xforms is to canonicalize
>>> the IR to make life easier for downstream passes and analyses.
>>
>> As we get further along with compile-time improvements one
>> question we need to ask ourselves more frequently is about the
>> effectiveness of optimizations/passes. For example - in this
>> case - how can we make an educated assessment that running the
>> combiner N times is a good cost/benefit investment of compute
>> resources? The questions below are meant to figure out what
>> technologies/instrumentations/etc could help towards a more
>> data-driven decision process when it comes to the effectiveness
>> of optimizations. Instcombiner might just be an inspirational use
>> case to see what is possible in that direction.
>>
>> The combiner is invoked in full multiple times. But is it really
>> necessary to run all of it for that purpose? After instcombine is
>> run once, is there a mapping from transformation -> combines? I
>> suspect most transformations could invoke a subset of combines to
>> re-canonicalize. Or, if there were a (cheap) verifier for
>> canonical IR, it could invoke a specific canonicalization
>> routine. Instrumenting the instcombiner and checking which
>> patterns actually kick in (for different invocations) might give
>> insight into how the combiner could be restructured so that
>> only a subset of patterns needs to be checked.
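A data-driven look at this could start from LLVM's existing STATISTIC macro and `opt -stats` output, which give whole-run counters; splitting counts per combiner invocation would show which patterns still fire on later, already-canonicalized runs. A standalone sketch of that bookkeeping (with hypothetical pattern names, not real InstCombine counters):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Sketch of per-invocation pattern-hit counting. Unlike a single global
// counter, keying on (invocation number, pattern name) distinguishes
// patterns that only matter on the first run from ones that keep firing.
class PatternProfile {
  std::map<std::pair<int, std::string>, unsigned> Hits;

public:
  void record(int Invocation, const std::string &Pattern) {
    ++Hits[{Invocation, Pattern}];
  }

  unsigned hits(int Invocation, const std::string &Pattern) const {
    auto It = Hits.find({Invocation, Pattern});
    return It == Hits.end() ? 0 : It->second;
  }
};
```

A pattern whose hit count drops to zero after invocation 0 is a candidate for gating out of the later clean-up runs.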
>>>
>>> InstCombine internally relies on knowing what transforms it may
>>> or may not perform. This is important: canonicalizations may
>>> duel endlessly if we get this wrong; the order of the combines
>>> is also important for exactly the same reason (SelectionDAG
>>> deals with this problem in a different way with its pattern
>>> complexity field).
>>
>> Can you elaborate on this “duel endlessly” with specific
>> examples? This is out of curiosity. There must be verifiers that
>> check that this cannot happen, or an implementation strategy that
>> guarantees it. GlobalISel will run into the same/similar
>> question when it gets far enough to replace SelectionDAG.
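As an illustration of the failure mode (a toy example, not actual InstCombine transforms): suppose one rule rewrites `x + C` to `x - (-C)` and another rewrites `x - C` to `x + (-C)`. Each undoes the other, so a fixed-point driver never converges:

```cpp
#include <cassert>

// Toy "IR": the expression is either x + C or x - C.
struct Expr {
  bool IsAdd;
  int C;
};

// Two individually plausible "canonicalizations" that are mutually
// inverse: each one looks like progress locally, but together they
// ping-pong forever.
bool addToSub(Expr &E) {
  if (!E.IsAdd)
    return false;
  E.IsAdd = false; // x + C  ->  x - (-C)
  E.C = -E.C;
  return true;
}

bool subToAdd(Expr &E) {
  if (E.IsAdd)
    return false;
  E.IsAdd = true;  // x - C  ->  x + (-C)
  E.C = -E.C;
  return true;
}

// Fixed-point driver with a safety cap; with both rules enabled it
// always hits the cap instead of converging.
int runToFixedPoint(Expr &E, int MaxIters) {
  for (int I = 0; I < MaxIters; ++I) {
    bool Changed = addToSub(E);
    Changed |= subToAdd(E);
    if (!Changed)
      return I; // converged
  }
  return MaxIters; // the rules dueled endlessly
}
```

Declaring one direction canonical (keeping only one of the two rules) makes the loop converge in a single pass, which is exactly why InstCombine has to know globally which form it prefers.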
>>>
>>> Another concern with moving seemingly arbitrary combines under
>>> ExpensiveCombines is that it will make it that much harder to
>>> understand what is and is not canonical at a given point during
>>> the execution of the optimizer.
>>
>>>
>>> I'd be much more interested in a patch which caches the result
>>> of frequently called ValueTracking functionality like
>>> ComputeKnownBits, ComputeSignBit, etc. which often doesn't
>>> change but is not intelligently reused. I imagine that the
>>> performance win might be quite comparable.
>>
>> Can you back this up with measurements? Caching schemes are
>> tricky. Is there a way to evaluate whether the results of
>> ComputeKnownBits etc. are actually effective, meaning the result is
>> used and yields faster instructions? E.g. it might well be that
>> only the first instance of instcombine benefits from computing
>> the bits.
>
> To (hopefully) make it easier to answer this question, I've posted
> my (work-in-progress) patch which adds a known-bits cache to
> InstCombine. I rebased it yesterday, so it should be fairly easy
> to apply: https://reviews.llvm.org/D31239 - Seeing what this does
> to the performance of the benchmarks mentioned in this thread
> (among others) would certainly be interesting.
>
> -Hal
>
>
>>
>>
>>> Such a patch would have the benefit of keeping the set of
>>> available transforms constant throughout the pipeline while
>>> bringing execution time down; I wouldn't be at all surprised if
>>> caching the ValueTracking functions resulted in a bigger time
>>> savings.
>>>
>>> On Fri, Mar 17, 2017 at 5:49 PM, Hal Finkel via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>
>>>
>>> On 03/17/2017 04:30 PM, Mehdi Amini via llvm-dev wrote:
>>>>
>>>>> On Mar 17, 2017, at 11:50 AM, Mikhail Zolotukhin via
>>>>> llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> One of the most time-consuming passes in the LLVM middle-end
>>>>> is InstCombine (see e.g. [1]). It is a very powerful pass
>>>>> capable of doing all sorts of crazy stuff, and new patterns are
>>>>> constantly being introduced there. The problem is that we
>>>>> often use it just as a clean-up pass: it's scheduled 6
>>>>> times in the current pass pipeline, and each time it's
>>>>> invoked it checks all known patterns. That's fine for O3,
>>>>> where we try to squeeze out as much performance as possible,
>>>>> but it is excessive for other opt-levels. InstCombine
>>>>> has an ExpensiveCombines parameter to address that - but I
>>>>> think it's underused at the moment.
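The gating idea can be sketched like so. This is a toy model of the ExpensiveCombines split, with made-up stand-ins for the cheap and expensive pattern checks, not real InstCombine code:

```cpp
#include <cassert>

// Toy model of InstCombine's ExpensiveCombines parameter: cheap pattern
// checks always run; a costly known-bits-style analysis runs only when
// the flag is set (e.g. at -O3). The "patterns" here are invented.
int countAppliedCombines(unsigned X, bool ExpensiveCombines) {
  int Applied = 0;

  // Cheap check: a single bit test.
  if ((X & 1) == 0)
    ++Applied;

  if (ExpensiveCombines) {
    // Stand-in for an expensive walk such as computeKnownBits: count the
    // set bits to decide whether X is a power of two (or zero).
    unsigned Bits = 0;
    for (unsigned V = X; V; V >>= 1)
      Bits += V & 1;
    if (Bits <= 1)
      ++Applied;
  }
  return Applied;
}
```

The trade-off being discussed in this thread is exactly which checks belong inside that `if`: a pattern that rarely fires but is cheap to test costs little, while a rarely-firing pattern that triggers a recursive analysis is a good gating candidate.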
>>>>
>>>> Yes, “ExpensiveCombines” was added recently (4.0? 3.9?), but I
>>>> believe it has always been intended to be extended in the way
>>>> you’re doing it. So I support this effort :)
>>>
>>> +1
>>>
>>> Also, did your profiling reveal why the other combines are
>>> expensive? Among other things, I'm curious if the expensive
>>> ones tend to spend a lot of time in ValueTracking (getting
>>> known bits and similar)?
>>>
>>> -Hal
>>>
>>>
>>>>
>>>> CC: David for the general direction on InstCombine though.
>>>>
>>>>
>>>> —
>>>> Mehdi
>>>>
>>>>
>>>>
>>>>>
>>>>> Trying to find out, which patterns are important, and
>>>>> which are rare, I profiled clang using CTMark and got the
>>>>> following coverage report:
>>>>> <InstCombine_covreport.html>
>>>>> (beware, the file is ~6MB).
>>>>>
>>>>> Guided by this profile I moved some patterns under the "if
>>>>> (ExpensiveCombines)" check, which expectedly happened to
>>>>> be neutral for runtime performance, but improved
>>>>> compile-time. The testing results are below (measured for Os).
>>>>>
>>>>> Performance Improvements - Compile Time
>>>>>
>>>>>   Benchmark                      Δ        Previous  Current  σ
>>>>>   CTMark/sqlite3/sqlite3         -1.55%   6.8155    6.7102   0.0081
>>>>>   CTMark/mafft/pairlocalalign    -1.05%   8.0407    7.9559   0.0193
>>>>>   CTMark/ClamAV/clamscan         -1.02%   11.3893   11.2734  0.0081
>>>>>   CTMark/lencod/lencod           -1.01%   12.8763   12.7461  0.0244
>>>>>   CTMark/SPASS/SPASS             -1.01%   12.5048   12.3791  0.0340
>>>>>
>>>>>
>>>>> Performance Improvements - Compile Time
>>>>>
>>>>>   Benchmark                                            Δ        Previous  Current  σ
>>>>>   External/SPEC/CINT2006/403.gcc/403.gcc               -1.64%   54.0801   53.1930  -
>>>>>   External/SPEC/CINT2006/400.perlbench/400.perlbench   -1.25%   19.1481   18.9091  -
>>>>>   External/SPEC/CINT2006/445.gobmk/445.gobmk           -1.01%   15.2819   15.1274  -
>>>>>
>>>>>
>>>>>
>>>>> Do such changes make sense? The patch doesn't change O3,
>>>>> but it does change Os and could potentially affect
>>>>> performance there (though I didn't see any changes in my
>>>>> tests).
>>>>>
>>>>> The patch is attached for the reference, if we decide to
>>>>> go for it, I'll upload it to phab:
>>>>>
>>>>> <0001-InstCombine-Move-some-infrequent-patterns-under-if-E.patch>
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Michael
>>>>>
>>>>> [1]:
>>>>> http://lists.llvm.org/pipermail/llvm-dev/2016-December/108279.html
>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>>
>>>>
>>>
>>> --
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory