[PATCH] D99351: [CSSPGO] Top-down processing order based on full profile.

Thu Mar 25 09:09:28 PDT 2021

hoy created this revision.
Herald added subscribers: wenlei, hiraditya.
hoy requested review of this revision.
Herald added a project: LLVM.
Herald added a subscriber: llvm-commits.

Use profiled call edges to augment the top-down order. There are cases
where the top-down order computed from the static call graph doesn't
reflect the real execution order. For example:

1. Incomplete static call graph due to unknown indirect call targets.
   Adjusting the order by considering indirect call edges from the
   profile can enable the inlining of indirect call targets by allowing
   the caller to be processed before them.

2. Mutual call edges in an SCC. The static processing order computed for
   an SCC may not reflect the call contexts in the context-sensitive
   profile, and thus may cause potential inlining to be overlooked. The
   function order within an SCC is adjusted to a top-down order based
   on the profile to favor more inlining.

3. Transitive indirect call edges due to inlining. When a callee function
   is inlined into a caller function in LTO prelink, every call edge
   originating from the callee is transferred to the caller. If any
   of the transferred edges is indirect, the original profiled indirect
   edge, even if considered, would not enforce a top-down order from the
   caller to the potential indirect call target in LTO postlink, since
   the inlined callee is gone from the static call graph.

4. Case #3 can happen even for direct call targets, due to functions
   defined in header files. Header functions, when included into source
   files, are defined multiple times but only one definition survives
   due to ODR. Therefore, the LTO prelink inlining done on those dropped
   definitions can be useless, as it is based on a local file scope.
   More importantly, the inlinee, once fully inlined into a to-be-dropped
   inliner, will have no profile to consume when its outlined version is
   compiled. This can lead to a profile-less prelink compilation for the
   outlined version of the inlinee function, which may be called from
   external modules. While this isn't easy to fix, we rely on the
   postlink AutoFDO pipeline to optimize the inlinee. Since the surviving
   copy of the inliner (defined in headers) can be inlined in its local
   scope in prelink, it may not exist in the merged IR in postlink, and
   we'll need the profiled call edges to enforce a top-down order for the
   rest of the functions.

Considering those cases, a profiled call graph completely independent of
the static call graph is constructed based on profile data, where
function objects are not even needed to handle case #3 and case #4.
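
As a rough illustration of the idea (not the code in this patch), the
sketch below builds a call graph keyed purely by function name from
profiled caller/callee pairs and derives a top-down order from it by
reversing the SCC order that Tarjan's algorithm produces. ProfiledEdge,
ProfiledGraph, buildProfiledGraph and topDownOrder are hypothetical names
invented for this example; they are not LLVM APIs, and the real
implementation may order functions within an SCC differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for one profiled call edge; not an LLVM type.
struct ProfiledEdge {
  std::string Caller;
  std::string Callee;
  uint64_t Count; // sample count; not used by the ordering below
};

// Name-keyed call graph. No function objects are required, so functions
// that were inlined away or dropped can still appear as nodes.
using ProfiledGraph = std::map<std::string, std::set<std::string>>;

ProfiledGraph buildProfiledGraph(const std::vector<ProfiledEdge> &Edges) {
  ProfiledGraph G;
  for (const ProfiledEdge &E : Edges) {
    G[E.Caller].insert(E.Callee);
    G[E.Callee]; // make sure leaf callees become nodes too
  }
  return G;
}

// Tarjan's algorithm emits SCCs callees-first (bottom-up). Reversing the
// emitted sequence gives a top-down order. Assumes every callee is also a
// key of G, which buildProfiledGraph guarantees.
std::vector<std::string> topDownOrder(const ProfiledGraph &G) {
  std::map<std::string, int> Index, LowLink;
  std::set<std::string> OnStack;
  std::vector<std::string> Stack, BottomUp;
  int Next = 0;

  std::function<void(const std::string &)> Visit =
      [&](const std::string &N) {
        Index[N] = LowLink[N] = Next++;
        Stack.push_back(N);
        OnStack.insert(N);
        for (const std::string &Callee : G.at(N)) {
          if (!Index.count(Callee)) {
            Visit(Callee);
            LowLink[N] = std::min(LowLink[N], LowLink[Callee]);
          } else if (OnStack.count(Callee)) {
            LowLink[N] = std::min(LowLink[N], Index[Callee]);
          }
        }
        if (LowLink[N] == Index[N]) {
          // N is the root of an SCC; pop all of its members.
          std::string M;
          do {
            M = Stack.back();
            Stack.pop_back();
            OnStack.erase(M);
            BottomUp.push_back(M);
          } while (M != N);
        }
      };

  for (const auto &KV : G)
    if (!Index.count(KV.first))
      Visit(KV.first);

  assert(BottomUp.size() == G.size() && "every node is emitted exactly once");
  std::reverse(BottomUp.begin(), BottomUp.end());
  return BottomUp;
}
```

For the profiled edges main->a, a->b, b->a and b->c, topDownOrder returns
main, a, b, c: the caller main precedes the SCC {a, b}, which in turn
precedes the leaf c. This holds even if the a->b edge is an indirect call
that the static call graph cannot see.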

I'm seeing an average 0.4% perf win on SPEC2017. For certain benchmarks such as xalancbmk, the win is bigger, about 2.5%.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D99351

Files:
  llvm/include/llvm/Transforms/IPO/SampleContextTracker.h
  llvm/lib/Transforms/IPO/SampleContextTracker.cpp
  llvm/lib/Transforms/IPO/SampleProfile.cpp
  llvm/test/Transforms/SampleProfile/profile-context-order.ll
  llvm/test/Transforms/SampleProfile/profile-topdown-order.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D99351.333321.patch
Type: text/x-patch
Size: 18973 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20210325/c7cf5118/attachment.bin>