[PATCH] D94001: [CSSPGO] Call site prioritized BFS inlining for sample PGO

Wenlei He via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Jan 5 12:49:50 PST 2021


wenlei added a comment.

In D94001#2479793 <https://reviews.llvm.org/D94001#2479793>, @davidxl wrote:

> A few high level questions:
>
> 1. Can the size bloat problem be handled in llvm-profgen time ? Basically using hotness information to prune/merge profiles properly?

Yes, to some degree. llvm-profgen can do two things: 1) Prune and merge profiles using hotness, without knowledge of inlining. We are already doing this for the baseline without this change, but it is not enough to limit inlining, since the pruning has to be conservative and is not selective enough. 2) In the best case, if we could predict all inline decisions, llvm-profgen could prepare (promote and merge) the profile so that only the context profiles needed for inlining are kept, in which case inlining would be bounded by the profile output from llvm-profgen. However, llvm-profgen doesn't have the inline cost model, so it's hard to prepare context profiles perfectly. It's best to leave it to the sample profile inliner to decide which context profiles are useful, and which should be promoted and merged.
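To illustrate the "promote and merge" idea, here is a minimal sketch (all names are invented for illustration, not llvm-profgen's actual API): when a call site is not inlined, its context profile's sample counts are folded into the callee's context-less base profile so the samples are not lost.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical representation: per-location sample counts.
using Profile = std::map<std::string /*location*/, uint64_t /*samples*/>;

// Promote a context profile by merging its counts into the callee's
// base (context-less) profile.
void promoteAndMerge(Profile &Base, const Profile &Context) {
  for (const auto &KV : Context)
    Base[KV.first] += KV.second; // accumulate samples per location
}
```

Doing this accurately requires knowing which call sites will actually be inlined, which is why leaving the decision to the inliner (which has the cost model) works better than guessing in llvm-profgen.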

We do plan to implement some top-down global inlining estimation in llvm-profgen and adjust profiles accordingly, though, for ThinLTO. This is because the sample profile inliner is not global under ThinLTO, and letting llvm-profgen do some preparation would help with cross-module context profile adjustment. (We could also do something in the ThinLink step, but that adds more complexity and may also hurt compile time.)

> 2. What is the intuition behind BFS order?

This is to achieve more balanced inlining. BFS with a priority queue always picks the most beneficial call site to inline within several levels of the call graph from the current function, while DFS may go deep down a particular call path without knowing whether there are more beneficial candidates on other paths.
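As a rough sketch of the idea (names and structures invented here, not the actual SampleProfileLoader code): candidates go into a max-heap keyed by benefit (sample count below), the globally best one is popped and "inlined", and its callees are pushed back as new candidates, so a hot sibling is always preferred over a deeper but colder path.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <queue>
#include <string>
#include <vector>

// Hypothetical inline candidate: callee name plus a profile-derived
// benefit score (sample count used as the benefit here).
struct CallSite {
  std::string Callee;
  uint64_t Count;
};

// Max-heap ordering: the call site with the highest count pops first.
struct ByCount {
  bool operator()(const CallSite &A, const CallSite &B) const {
    return A.Count < B.Count;
  }
};

// Prioritized BFS: repeatedly pop the most beneficial candidate across
// all discovered levels, then enqueue its callees, up to a budget.
std::vector<std::string>
prioritizedInline(const std::vector<CallSite> &Roots,
                  const std::map<std::string, std::vector<CallSite>> &Callees,
                  unsigned Budget) {
  std::priority_queue<CallSite, std::vector<CallSite>, ByCount> Q;
  for (const auto &CS : Roots)
    Q.push(CS);
  std::vector<std::string> Inlined;
  while (!Q.empty() && Inlined.size() < Budget) {
    CallSite CS = Q.top();
    Q.pop();
    Inlined.push_back(CS.Callee);
    auto It = Callees.find(CS.Callee);
    if (It != Callees.end())
      for (const auto &Child : It->second)
        Q.push(Child); // newly exposed call sites compete globally
  }
  return Inlined;
}
```

With two roots A (count 10) and B (count 100), a plain DFS starting at A would inline A's subtree first; the prioritized queue inlines B first regardless of which root it was discovered under.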

> 3. How often does the size limit get triggered with the new inliner?

Most functions in SPEC didn't hit the limit. There is one outlier, gobmk: without the cap it's disastrous for both performance and code size - its call graph is very dense with small functions, and an unbounded profile-guided inliner would go wild. I can also double check on other workloads.

> 4. what is the largest improvement in spec06? Any internal benchmark data?

The largest improvements came from povray (11%) and gobmk (10%), followed by a few others in the low single digits.

When we made this change, we hadn't yet tried internal workloads, and when we later did, this change was grouped together with other changes. So unfortunately, I don't have data on internal workloads for this change alone. We can give it a try on some internal benchmarks, though we'd have to use our internal fork, which has other stuff not yet upstreamed.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D94001/new/

https://reviews.llvm.org/D94001
