[PATCHES] A module inliner pass with a greedy call site queue

Jiangning Liu liujiangning1 at gmail.com
Sun Aug 3 23:42:44 PDT 2014


Hi,

1. I measured the code size impact of Yin's patch; overall, I don't see a
code size regression.

1) For the following C++ programs in SPEC, we have the following data (old
vs. new text and data section sizes, with percentage changes):

-O2 result:

spec             old_text   old_data   new_text   new_data   text_%   data_%
252.eon            302848       2232     297301       2312   -1.83%    3.58%
450.soplex         366474       1536     389164       1656    6.19%    7.81%
453.povray         898032      12632     850444      12632   -5.30%    0.00%
471.omnetpp        685516       9136     693349       9128    1.14%   -0.09%
473.astar           38999        860      41011        860    5.16%    0.00%
483.xalancbmk     4282478     139376    4414286     139376    3.08%    0.00%
sum               6574347     165772    6685555     165964    1.69%    0.12%

-Os result:

spec             old_text   old_data   new_text   new_data   text_%   data_%
252.eon            282163       2256     280966       2280   -0.42%    1.06%
450.soplex         311515       1544     294294       1568   -5.53%    1.55%
453.povray         759599      12608     745275      12608   -1.89%    0.00%
471.omnetpp        657335       9144     652195       9128   -0.78%   -0.17%
473.astar           32439        860      32847        860    1.26%    0.00%
483.xalancbmk     3935159     139392    3854060     139376   -2.06%   -0.01%
sum               5978210     165804    5859637     165820   -1.98%    0.01%

2) For bzip2, which Nick cares about, I do see a code size reduction, as
shown below.

-O2

spec             old_text   old_data   new_text   new_data   text_%   data_%
256.bzip2           41076       3892      40188       3892   -2.16%    0.00%
401.bzip2           68435       3844      66787       3844   -2.41%    0.00%

-Os

spec             old_text   old_data   new_text   new_data   text_%   data_%
256.bzip2           29750       3892      28894       3892   -2.88%    0.00%
401.bzip2           51845       3844      51429       3844   -0.80%    0.00%

2. I spent some time studying the original inlining algorithm and the new
one provided by Yin, and I have some thoughts:

1) It's clear that this new inlining algorithm doesn't really aim to
*replace* the original one; I can see the following code calls the
original SCC inliner for local decision making:

    // Build a single-node "SCC" containing just the caller, so that the
    // existing SCC inliner can be run on it.
    CallGraphNode Node(Caller);
    std::vector<CallGraphNode*> NodeVec;
    NodeVec.push_back(&Node);
    CallGraphSCC SCC(NULL);
    SCC.initialize(&NodeVec[0], &NodeVec[0] + NodeVec.size());
    // Point the SCC inliner at the call site picked by the greedy queue
    // and let it make the local inlining decision.
    FuncInliner->setPreferredCallSite(CS);
    FuncInliner->setBonusThreshold(BonusThreshold);
    bool LocalChanged = FuncInliner->runOnSCC(SCC);
    Changed |= LocalChanged;

So the special tunings from the SCC inliner should be inherited
accordingly; in particular, my understanding is that the tuning for "the
impact of inlining on virtual dispatch" in the current SCC inliner is kept
naturally. Maybe my understanding is wrong here, so Yin can confirm.

2) I can see the point that Chandler is working on improving the current
pass manager to help solve this B-into-A inlining issue for A->B->C, but I
personally think this new GreedyInliner model is still meaningful, because

A. It can solve the performance issues of 252.eon and 177.mesa. At present,
we have a big gap against GCC on those two benchmarks, and we want this
improvement very much for our customers and stakeholders. In reality, I
would personally like to see us raise the performance bar without hurting
code size, and I don't see any code size regression for them at all:

spec             old_text   old_data   new_text   new_data   text_%   data_%
177.mesa           483677        972     479625        972   -0.84%    0.00%
252.eon            302848       2232     297301       2312   -1.83%    3.58%

B. I can see this GreedyInliner doesn't change the current infrastructure
at all; it is simply an extension of the current SCC inliner, so I don't
see any blocking issue for Chandler's pass manager improvement. In the long
run, if the new pass manager infrastructure gives even better results for
both performance and code size, we can still remove this pass. But at
least, again, raising the bar is important for us to build high quality
products with LLVM.

C. From the GreedyInliner design itself, I do see some valuable points,
although we all know inlining is a heuristics problem and no heuristic can
be perfect. For example, as James mentioned, it can take into account
important factors such as loop nest depth. My understanding is that this
GreedyInliner essentially provides a more flexible call site visiting order
together with some enhanced modeling rules; if the ordering were simply the
SCC order, it would degrade to the original SCC inliner. A sketch of this
idea follows.
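
To make that concrete, here is a minimal sketch of what a greedy call site
queue could look like. This is my own illustration, not code from the
patch: the weight function (a loop-depth bonus, following James's point)
and all names here are hypothetical placeholders.

  // NOTE: illustrative sketch only -- names and the weight function are
  // hypothetical, not taken from Yin's patch.
  #include <queue>
  #include <vector>
  #include "llvm/Analysis/LoopInfo.h"
  #include "llvm/IR/CallSite.h"

  struct WeightedCS {
    llvm::CallSite CS;
    unsigned Weight; // estimated inline benefit, scaled by loop depth
  };

  struct LessWeight {
    bool operator()(const WeightedCS &A, const WeightedCS &B) const {
      return A.Weight < B.Weight; // priority_queue pops the max first
    }
  };

  // Hypothetical weight: call sites in deeper loop nests are visited
  // first. The scaling constant is a placeholder, not a tuned value.
  unsigned ComputeWeight(llvm::CallSite CS, const llvm::LoopInfo &LI) {
    unsigned Depth = LI.getLoopDepth(CS.getInstruction()->getParent());
    return (Depth + 1) * 100;
  }

  // Sites would be built by walking the module and calling ComputeWeight
  // for each call site found.
  void InlineGreedily(std::vector<WeightedCS> &Sites) {
    std::priority_queue<WeightedCS, std::vector<WeightedCS>, LessWeight>
        Queue(LessWeight(), Sites);
    while (!Queue.empty()) {
      WeightedCS Best = Queue.top();
      Queue.pop();
      // Try to inline Best.CS; on success, re-weight and re-queue the
      // call sites exposed or affected by the inlining.
    }
  }

If the weights simply reproduced the SCC post-order, this queue would
collapse back to the original bottom-up behavior, which is exactly the
degeneration mentioned above.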

3. Therefore, my suggestions to Yin are:

1) Define an even clearer call site queue interface, so that different
inlining heuristics can be supported as building blocks, no matter whether
they are top-down or bottom-up. For example, to be specific, the following
code should be abstracted behind such an interface.

  // Collect all call sites and compute function size.
  for (Module::iterator it = M.begin(); it != M.end(); ++it) {
    Function* F = &*it;
    if (F->isDeclaration())
      continue;

    unsigned long BBCount;
    unsigned long InstCount;
    FunctionCallCount[F] = CollectFunctionCallSites(F, CallSites, BBCount,
                                                    InstCount);
    unsigned long FS = InstCount + BasicBlockSizeCost * BBCount;
    FunctionSizes[F] = FS;

    // Linkonce functions are only inlined into other functions, so
    // they are not counted toward module size growth.
    if (!F->hasLinkOnceLinkage())
      TotalSize += FS;
  }

For the SCC inliner, the interface can be implemented simply as the SCC
visitation order. For the greedy inliner, it can be implemented using
ComputeCallSitesWeight(...) + GetBestCallSite(...). A rough sketch of what
such an interface might look like follows.
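
As an illustration only (the names are mine, not from the patch, and the
greedy variant is only outlined in a comment):

  #include <deque>
  #include "llvm/IR/CallSite.h"

  // Hypothetical abstraction: each inliner flavor supplies its own call
  // site visiting order behind a common interface.
  class CallSiteQueue {
  public:
    virtual ~CallSiteQueue() {}
    virtual void add(llvm::CallSite CS) = 0;  // seed or re-queue a site
    virtual bool empty() const = 0;
    virtual llvm::CallSite next() = 0;        // next site to consider
  };

  // Bottom-up flavor: a FIFO over call sites in SCC visitation order,
  // which reproduces what the current SCC inliner effectively does.
  class SCCOrderQueue : public CallSiteQueue {
    std::deque<llvm::CallSite> Q;
  public:
    virtual void add(llvm::CallSite CS) { Q.push_back(CS); }
    virtual bool empty() const { return Q.empty(); }
    virtual llvm::CallSite next() {
      llvm::CallSite CS = Q.front();
      Q.pop_front();
      return CS;
    }
  };

  // Greedy flavor: add() would feed ComputeCallSitesWeight(...), and
  // next() would be implemented with GetBestCallSite(...).

The driver loop would then talk only to CallSiteQueue, making top-down,
bottom-up, or weighted visiting orders interchangeable policies.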

2) Collect more benchmark data for more targets, e.g. x86, and for more
benchmark suites like SPEC2006, Geekbench, and EEMBC. I can help measure
the performance impact on Cortex-A57.

Thanks,
-Jiangning



2014-07-31 23:04 GMT+08:00 Hal Finkel <hfinkel at anl.gov>:

> ----- Original Message -----
> > From: "Chandler Carruth" <chandlerc at google.com>
> > To: "Yin Ma" <yinma at codeaurora.org>
> > Cc: "James Molloy" <james.molloy at arm.com>, "Nick Lewycky" <
> nicholas at mxc.ca>, "Hal Finkel" <hfinkel at anl.gov>,
> > apazos at codeaurora.org, "Jiangning Liu" <Jiangning.Liu at arm.com>, "Commit
> Messages and Patches for LLVM"
> > <llvm-commits at cs.uiuc.edu>
> > Sent: Wednesday, July 30, 2014 5:52:19 PM
> > Subject: Re: [PATCHES] A module inliner pass with a greedy call site
> queue
> >
> > On Wed, Jul 30, 2014 at 3:46 PM, Yin Ma <yinma at codeaurora.org> wrote:
> >
> > We found the current SCC inliner cannot inline a critical function in
> > one of the SPEC2000 benchmarks unless we increase the threshold to a
> > very large number. Say A calls B calls C: the SCC inliner will start
> > with B -> C and inline C into B; however, the performance gain comes
> > from inlining B into A, where B is called in a loop of A. After
> > inlining C into B, B becomes very large, so B cannot be inlined into A
> > with the default inline threshold.
> > This is a well-known and studied limitation of the inliner. I gave a
> > talk at a prior dev meeting about it. There are specific ways to
> > address it within the existing inlining framework that I've been
> > working on for some time, but it requires infrastructure changes to
> > implement.
> >
>
> I realize that you've discussed this in the past, but perhaps a recap
> would be helpful. As I recall, the pass-manager rewrite will enable the
> inliner to make use of function-level analysis. Specifically, this will
> include making use of profiling data so that, with appropriate profiling
> data, we can avoid inlining cold call sites.
>
> On a theoretical basis, I'm not sure that addresses the same issue as
> top-down inlining, although there is obviously some overlap in effect.
> Specifically, if a goal of inlining is to push the critical path length as
> close as possible to the target's limit for efficient dispatch, but no
> farther, then I fail to see how pure bottom-up inlining can ever achieve
> that as well as a top-down approach. That having been said, I think that
> bottom-up inlining seems much more likely to expose optimization
> opportunities.
>
> In short, while I'm not at all convinced that Yin's heuristic is the right
> approach, I feel that a more in-depth discussion on the goals of inlining,
> and how we do, or plan to, achieve them, would be beneficial.
>
> Thanks again,
> Hal
>
> >
> > I'm still very concerned that the code size impact and other impacts
> > have not been widely analyzed, but only narrowly analyzed. For
> > example, LLVM's inliner has historically shown a code size advantage
> > of over 10% across a wide range of benchmark C++ applications, large
> > and small. I don't think we want to sacrifice that.
> >
> > I also don't think you should underestimate the impact of inlining on
> > virtual dispatch. The collapsing of indirect calls to direct calls
> > is *very* important and has been carefully tuned in LLVM's current
> > inliner.
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory