[PATCHES] A module inliner pass with a greedy call site queue

Jiangning Liu liujiangning1 at gmail.com
Sun Aug 10 18:17:05 PDT 2014


Hi Gerolf,

The performance/code_size data I provided was collected with the command line
options "-O3 -target arm64-linux-gnueabi -mcpu=cortex-a57 -fvectorize
-fslp-vectorize -ffast-math", so you can say it is just O3.

Thanks,
-Jiangning


2014-08-09 6:12 GMT+08:00 Gerolf Hoflehner <ghoflehner at apple.com>:

> Hi,
>
> I’m not an inlining expert. I have no prejudice or preference about
> top-down/bottom-up etc., and tend to favor flexibility.
>
> I would love to see more data for this hard heuristic problem:
>
> -what about compile-time?
> -did you get a chance to look into vpr? I’m curious about the specific
> explanation for the gain. Is this for O3 LTO PGO, O3 LTO or just O3?
> -could you share data on SPEC2006 for the ref input set? 2006 has a much
> larger code footprint than 2000 and should reward us with more insight.
>
> Thanks
> Gerolf
>
>
>
>
> On Aug 7, 2014, at 12:14 AM, Jiangning Liu <liujiangning1 at gmail.com>
> wrote:
>
> Hi Yin,
>
> Sorry, previously I didn't notice the command line option "-mllvm
> -inline-perf-mode=true", because your test case doesn't use it. So now I
> have measured performance on Cortex-A57 again with the command line options
> "-mllvm -greedy-inliner=true -mllvm -inline-perf-mode=true".
>
> spec2000      greedy_inliner_perf  threshold_1000_perf  greedy_inliner_code_size  threshold_1000_code_size
> 164.gzip       0.00%               -0.78%                6.25%                    14.55%
> 175.vpr       -4.09%               -3.14%                1.84%                    14.49%
> 176.gcc        0.83%                0.83%                0.08%                    33.16%
> 181.mcf        0.00%                0.00%                3.58%                    19.58%
> 186.crafty    -0.93%                1.85%               -0.94%                    14.38%
> 197.parser    -1.61%               -2.24%               -0.04%                     1.48%
> 252.eon       -7.30%               -6.52%                2.64%                     6.42%
> 253.perlbmk   -2.38%               -3.76%               -1.75%                     2.22%
> 254.gap        0.00%               -1.72%                2.93%                    18.44%
> 255.vortex    -1.04%               -4.19%                3.40%                    47.07%
> 256.bzip2      1.40%               -1.83%                2.23%                    10.11%
> 300.twolf      1.87%               -0.36%               -1.87%                    23.48%
> 177.mesa      -3.36%               -2.52%                0.48%                    35.04%
> 179.art        1.37%                0.00%                0.45%                     9.26%
> 183.equake    -4.35%               -5.80%                1.67%                    23.23%
> 188.ammp       0.35%                1.75%                0.07%                     6.69%
>
> So now this performance result looks quite promising!
>
> For the xxx_perf columns, a negative number means the running time is
> reduced and performance is better.
> For the xxx_code_size columns, the number is for the .text section only.
>
> From this result we can see:
>
> 1) The greedy inliner obtains a similar performance improvement to setting
> the threshold to 1000 with the original inliner.
> 2) Compared with the significant code size bloat of the threshold=1000
> configuration, the code size change of the greedy inliner is quite limited on average.
>
> Thanks,
> -Jiangning
>
>
> 2014-08-05 17:11 GMT+08:00 Jiangning Liu <liujiangning1 at gmail.com>:
>
>> Hi Yin,
>>
>> I don't see a performance improvement on Cortex-A57 for eon with your
>> patch; the spec2000/int data is as below (negative is good):
>>
>> 164.gzip 0.00%
>> 175.vpr -4.55%
>> 176.gcc 0.83%
>> 181.mcf 0.00%
>> 186.crafty 0.00%
>> 197.parser -2.26%
>> 252.eon 1.46%
>> 253.perlbmk 5.24%
>> 254.gap 0.88%
>> 255.vortex -0.52%
>> 256.bzip2 0.47%
>> 300.twolf 1.87%
>>
>> Thanks,
>> -Jiangning
>>
>>
>>
>> 2014-08-05 10:52 GMT+08:00 Jiangning Liu <liujiangning1 at gmail.com>:
>>
>> Yin,
>>>
>>> I got the following "make check-all" failure.
>>>
>>> /home/jialiu01/llvm/llvm/tools/clang/test/Driver/greedy-inliner.c:8:11:
>>> error: expected string not found in input
>>> // CHECK: Greedy Inliner
>>>           ^
>>> <stdin>:1:1: note: scanning from here
>>> clang (LLVM option parsing): for the -print-after option: Cannot find
>>> option named 'greedy-inliner'!
>>> ^
>>>
>>> Can you confirm that this is an issue?
>>>
>>> As for performance, I don't have the data on Cortex-A57 yet, and I will
>>> let you know as soon as I get the result. For Cortex-A53, I have never
>>> tried it before.
>>>
>>> Thanks,
>>> -Jiangning
>>>
>>>
>>>
>>> 2014-08-05 6:06 GMT+08:00 Yin Ma <yinma at codeaurora.org>:
>>>
>>> Hi All,
>>>>
>>>>
>>>>
>>>> Thanks to Jiangning for the comprehensive testing of the Greedy inliner. I am
>>>> aware of Chandler's discussion about rewriting the pass manager in order to
>>>> overcome the limitations of the current inliner, and of the intention to work
>>>> toward the perfect solution.
>>>>
>>>>
>>>>
>>>> But we had to provide an inliner solution to address some LLVM
>>>> performance degradations compared to GCC; that is how the greedy inliner was
>>>> born. This inliner is a module pass, so it does not have the SCC<->function
>>>> analysis problem. Note that the Greedy inliner is a flexible solution: it
>>>> can be set up to use a bottom-up, top-down, or other custom order
>>>> (that is the purpose of using a queue with sorted weights).
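>>>>
>>>> To make the idea concrete, here is a minimal C++ sketch of such a
>>>> weighted call-site queue. This is only an illustration of the concept,
>>>> not the actual patch; the type and field names are made up.
>>>>
>>>> #include <queue>
>>>> #include <vector>
>>>>
>>>> struct CallSiteEntry {
>>>>   void *Call;    // stand-in for a handle to the call instruction
>>>>   double Weight; // higher weight == considered for inlining earlier
>>>> };
>>>>
>>>> struct ByWeight {
>>>>   bool operator()(const CallSiteEntry &A, const CallSiteEntry &B) const {
>>>>     return A.Weight < B.Weight; // max-heap: pop the heaviest call site first
>>>>   }
>>>> };
>>>>
>>>> // The inliner repeatedly pops the heaviest call site, so biasing the
>>>> // weights toward callers (top-down), callees (bottom-up), loop depth,
>>>> // or profile counts changes the visit order without changing the
>>>> // framework itself.
>>>> using GreedyCallSiteQueue =
>>>>     std::priority_queue<CallSiteEntry, std::vector<CallSiteEntry>, ByWeight>;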
>>>>
>>>>
>>>>
>>>> Regarding code size, for our internal, very large C++ code base, the
>>>> Greedy inliner did a better job than the SCC inliner at -Os. It was able
>>>> to inline more functions than the SCC inliner without increasing code size.
>>>> In one instance the files generated by the two inliners were of quite
>>>> similar size; however, looking at the number of entries in the symbol
>>>> table, the Greedy inliner version had 540880 entries, while the SCC inliner
>>>> version had 619639 entries. This was achieved by setting weights to favor a
>>>> top-down order. Chandler, if you have any large code base examples in mind,
>>>> I would like to try them.
>>>>
>>>>
>>>>
>>>> Regarding performance, the Greedy inliner has also shown better
>>>> results than the SCC inliner. I already reported the gains for
>>>> SPEC2000 (eon 16%, mesa 5%) without any degradation on the other tests;
>>>> Jiangning also verified this independently. This was achieved by setting
>>>> weights to favor call sites in loops.
>>>>
>>>>
>>>>
>>>> For virtual dispatch, we didn't see any C++ virtual dispatch problems
>>>> exposed when evaluating the Greedy inliner, because the greedy inliner reuses
>>>> the SCC inliner's logic for the local decisions. If anyone has a test case or
>>>> program in mind for this, I can try to run it and report the findings.
>>>>
>>>>
>>>>
>>>> I like the suggestion from Hal to have a "more in-depth discussion on
>>>> the goals of inlining, and how we do, or plan to, achieve them." Since we
>>>> now have two concrete inliner solutions, how about a BOF discussion at the
>>>> LLVM dev conference? I can send a proposal.
>>>>
>>>>
>>>>
>>>> What do you guys think?
>>>>
>>>>
>>>>
>>>> Here are some details on the scenarios we considered when tuning the
>>>> greedy inliner, and other possible future scenarios. The first one is the
>>>> A <- B <- C case mentioned earlier, where the call to B is inside a loop in
>>>> A. Inlining B into A should have higher priority and be considered before
>>>> inlining C into B:
>>>>
>>>> A() {
>>>>   for (...) { call B(); }
>>>> }
>>>>
>>>> B() {
>>>>   call C();
>>>> }
>>>>
>>>>
>>>>
>>>> The second is the case where A calls B many times, and one of those calls
>>>> is inside a loop and needs to be inlined. The call to B() inside the loop
>>>> should have higher priority than the other calls to B in A(); the other
>>>> calls to B may not be beneficial to inline when tuning for code size:
>>>>
>>>> A() {
>>>>   call B();
>>>>   call B();
>>>>   call B();
>>>>   for (...) { call B(); }
>>>>   call B();
>>>> }
>>>>
>>>>
>>>>
>>>> The third is a series of consecutive calls; on the architecture we
>>>> target, we don't want to inline them. The inliner should make this
>>>> decision with a global view:
>>>>
>>>> A() {
>>>>   if (...) {
>>>>     call B();
>>>>     call B();
>>>>     call B();
>>>>     call B();
>>>>   } else if (...) {
>>>>     call B();
>>>>     call B();
>>>>     call B();
>>>>     call B();
>>>>   } else ...
>>>>   ...
>>>> }
>>>>
>>>>
>>>>
>>>> The next one is a future scenario that supports profile-based decisions.
>>>> I considered this case, but it is not implemented in the current version of
>>>> the greedy inliner. Block frequency info can be used in the weight
>>>> computation to guide the order and the decision.
>>>>
>>>>
>>>>
>>>> The key to taking into account top-down/bottom-up differences and the
>>>> scenarios described above is an inliner framework built around a global
>>>> queue with sorted weights. It is a very flexible framework, and any future
>>>> LLVM inliner solution we decide on should support this type of feature.
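>>>>
>>>> As a purely hypothetical illustration (not the exact heuristic in the
>>>> patch), a weight function covering the scenarios above might combine loop
>>>> depth, an optional profile-based block frequency, and a small top-down bias:
>>>>
>>>> // Hypothetical weight computation; names and constants are made up.
>>>> double computeCallSiteWeight(unsigned LoopDepth, double BlockFreq,
>>>>                              unsigned CallerDepthInCallGraph,
>>>>                              bool HaveProfile) {
>>>>   double W = 1.0;
>>>>   W *= 1.0 + 10.0 * LoopDepth;               // favor call sites in loops
>>>>   if (HaveProfile)
>>>>     W *= BlockFreq;                          // future: profile-guided ordering
>>>>   W += 0.1 / (1.0 + CallerDepthInCallGraph); // slight top-down bias
>>>>   return W;
>>>> }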
>>>>
>>>>
>>>>
>>>> Yin
>>>>
>>>>
>>>>
>>>> From: llvm-commits-bounces at cs.uiuc.edu [mailto:llvm-commits-bounces at cs.uiuc.edu] On Behalf Of Chandler Carruth
>>>> Sent: Sunday, August 03, 2014 11:51 PM
>>>> To: Jiangning Liu
>>>> Cc: Jiangning Liu; Commit Messages and Patches for LLVM
>>>>
>>>> Subject: Re: [PATCHES] A module inliner pass with a greedy call site
>>>> queue
>>>>
>>>>
>>>>
>>>> Just a brief note...
>>>>
>>>>
>>>>
>>>> On Sun, Aug 3, 2014 at 11:42 PM, Jiangning Liu <liujiangning1 at gmail.com>
>>>> wrote:
>>>>
>>>> 1. I measured the code size impact of Yin's patch; overall I don't see a
>>>> code size regression.
>>>>
>>>>
>>>>
>>>> 1) For the following C++ programs in SPEC, we have the following data.
>>>>
>>>>
>>>>
>>>> -O2 result:
>>>>
>>>>
>>>>
>>>> spec           old_text_section  old_data_section  new_text_section  new_data_section  text_percentage  data_percentage
>>>> 252.eon          302848            2232              297301            2312            -1.83%            3.58%
>>>> 450.soplex       366474            1536              389164            1656             6.19%            7.81%
>>>> 453.povray       898032           12632              850444           12632            -5.30%            0.00%
>>>> 471.omnetpp      685516            9136              693349            9128             1.14%           -0.09%
>>>> 473.astar         38999             860               41011             860             5.16%            0.00%
>>>> 483.xalancbmk   4282478          139376             4414286          139376             3.08%            0.00%
>>>> sum             6574347          165772             6685555          165964             1.69%            0.12%
>>>>
>>>>
>>>> SPEC is highly misleading w.r.t. code size. Also, there are several
>>>> regressions in code size in addition to improvements. It would be useful to
>>>> get measurements from larger code bases.
>>>>
>>>
>>>
>>