[all-commits] [llvm/llvm-project] 7d096f: [CSSPGO] Support of CS profiles in extended binary...

Wed Feb 3 19:29:18 PST 2021

  Branch: refs/heads/release/12.x
  Home:   https://github.com/llvm/llvm-project
  Commit: 7d096f9bb350429628c6befce8f94dba4bbc6db9
      https://github.com/llvm/llvm-project/commit/7d096f9bb350429628c6befce8f94dba4bbc6db9
  Author: Hongtao Yu <hoy at fb.com>
  Date:   2021-02-03 (Wed, 03 Feb 2021)

  Changed paths:
    M llvm/include/llvm/ProfileData/SampleProf.h
    M llvm/include/llvm/ProfileData/SampleProfReader.h
    M llvm/lib/ProfileData/SampleProfReader.cpp
    M llvm/lib/ProfileData/SampleProfWriter.cpp
    M llvm/lib/Transforms/IPO/SampleContextTracker.cpp
    M llvm/test/Transforms/SampleProfile/profile-context-tracker.ll
    A llvm/test/tools/llvm-profdata/Inputs/cs-sample.proftext
    A llvm/test/tools/llvm-profdata/cs-sample-profile.test
    M llvm/tools/llvm-profdata/llvm-profdata.cpp
    M llvm/tools/llvm-profgen/ProfileGenerator.cpp

  Log Message:
  -----------
  [CSSPGO] Support of CS profiles in extended binary format.

This change brings up support of context-sensitive profiles in the format of extended binary. Existing sample profile reader/writer/merger code is being tweaked to reflect the fact of bracketed input contexts, like (`[...]`). The paired brackets are also needed in extbinary profiles because we don't yet have an otherwise good way to tell calling contexts apart from regular function names since the context delimiter `@` can somehow serve as a part of the C++ mangled names.

Reviewed By: wmi, wenlei

Differential Revision: https://reviews.llvm.org/D95547

(cherry picked from commit 7e99bddfeaab2713a8bb6ca538da25b66e6efc59)

  Commit: f2cabaac9525ba4b86301136e21ec9aad6aaf326
      https://github.com/llvm/llvm-project/commit/f2cabaac9525ba4b86301136e21ec9aad6aaf326
  Author: Hongtao Yu <hoy at fb.com>
  Date:   2021-02-03 (Wed, 03 Feb 2021)

  Changed paths:
    M llvm/lib/ProfileData/SampleProfReader.cpp
    M llvm/lib/Transforms/IPO/SampleContextTracker.cpp
    A llvm/test/Transforms/SampleProfile/Inputs/pseudo-probe-inline.prof
    A llvm/test/Transforms/SampleProfile/pseudo-probe-inline.ll

  Log Message:
  -----------
  [CSSPGO] Tweaking inlining with pseudo probes.

Fixing up a couple places where `getCallSiteIdentifier` is needed to support pseudo-probe-based callsites.

Also fixing an issue in the extbinary profile reader where the metadata section is not fully scanned based on the number of profiles loaded only for the current module.

Reviewed By: wmi, wenlei

Differential Revision: https://reviews.llvm.org/D95791

(cherry picked from commit 224fee8219bb3aed34f13ce40935e1b3ede90a0f)

  Commit: b9fa16f2234edddf6e0f449a0e7b646ee9046cf3
      https://github.com/llvm/llvm-project/commit/b9fa16f2234edddf6e0f449a0e7b646ee9046cf3
  Author: Hongtao Yu <hoy at fb.com>
  Date:   2021-02-03 (Wed, 03 Feb 2021)

  Changed paths:
    M clang/include/clang/Driver/Options.td
    M clang/lib/Driver/ToolChains/CommonArgs.cpp
    A clang/test/Driver/pseudo-probe-lto.c

  Log Message:
  -----------
  [CSSPGO] Passing the clang driver switch -fpseudo-probe-for-profiling to the linker.

As titled.

Reviewed By: wmi, wenlei

Differential Revision: https://reviews.llvm.org/D95271

(cherry picked from commit d3e2e3740d0730cb6788c771bb01a8f3e935bf2e)

  Commit: 27ff658e97528540e4425c0cb6400f3e5355f53a
      https://github.com/llvm/llvm-project/commit/27ff658e97528540e4425c0cb6400f3e5355f53a
  Author: Wenlei He <aktoon at gmail.com>
  Date:   2021-02-03 (Wed, 03 Feb 2021)

  Changed paths:
    M llvm/include/llvm/Transforms/IPO/SampleContextTracker.h
    M llvm/lib/Transforms/IPO/SampleContextTracker.cpp
    M llvm/lib/Transforms/IPO/SampleProfile.cpp
    A llvm/test/Transforms/SampleProfile/Inputs/indirect-call-csspgo.prof
    A llvm/test/Transforms/SampleProfile/csspgo-inline-debug.ll
    A llvm/test/Transforms/SampleProfile/csspgo-inline-icall.ll
    A llvm/test/Transforms/SampleProfile/csspgo-inline.ll
    M llvm/test/Transforms/SampleProfile/profile-context-tracker-debug.ll
    M llvm/test/Transforms/SampleProfile/profile-context-tracker.ll
    M llvm/test/Transforms/SampleProfile/pseudo-probe-inline.ll

  Log Message:
  -----------
  [CSSPGO] Call site prioritized inlining for sample PGO

This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC. The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`).

Motivation

With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO.
 - It doesn't take inline candidate size into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat.
 - The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately.
 - It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inacurate hotness heuristic. With pseduo-probe and the change above, this is no longer an issue for CSSPGO.

New FDO Inliner

Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed.
 - Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655.
 - Size Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check.
 - Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites.
 - BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path.

Note that the new inliner avoids repeatedly evaluating same set of call site, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO).

Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO.

Tunings and knobs

I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO.

Results

Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it).

Differential Revision: https://reviews.llvm.org/D94001

(cherry picked from commit 6bae5973c476e16dbbc82030d65c7859a6628e89)

  Commit: c2f3f45b5c5bd6f9b86a766fc40130b34acb8293
      https://github.com/llvm/llvm-project/commit/c2f3f45b5c5bd6f9b86a766fc40130b34acb8293
  Author: Wenlei He <aktoon at gmail.com>
  Date:   2021-02-03 (Wed, 03 Feb 2021)

  Changed paths:
    M llvm/lib/Transforms/IPO/SampleProfile.cpp
    M llvm/test/Transforms/SampleProfile/pseudo-probe-inline.ll
    M llvm/test/Transforms/SampleProfile/remarks.ll

  Log Message:
  -----------
  [CSSPGO] Factor out common part for CSSPGO inline and AFDO inline

Refactoring SampleProfileLoader::inlineHotFunctions to use helpers from CSSPGO inlining and reduce similar code in the inlining loop, plus minor cleanup for AFDO path.

This is resubmit of D95024, with build break and overtighten assertion fixed.

Test Plan:

(cherry picked from commit 1645f465be85223e9f5b6303a3e5e0e491fd819f)

  Commit: a9157c5628dc89b13936bbc8eef261cb02d63d40
      https://github.com/llvm/llvm-project/commit/a9157c5628dc89b13936bbc8eef261cb02d63d40
  Author: Hongtao Yu <hoy at fb.com>
  Date:   2021-02-03 (Wed, 03 Feb 2021)

  Changed paths:
    M clang/test/CodeGen/pseudo-probe-emit.c
    M llvm/include/llvm/IR/IntrinsicInst.h
    M llvm/include/llvm/IR/Intrinsics.td
    M llvm/include/llvm/IR/PseudoProbe.h
    M llvm/include/llvm/Passes/StandardInstrumentations.h
    M llvm/include/llvm/ProfileData/SampleProf.h
    M llvm/include/llvm/Transforms/IPO/SampleProfileProbe.h
    M llvm/lib/IR/PseudoProbe.cpp
    M llvm/lib/Passes/PassBuilder.cpp
    M llvm/lib/Passes/PassRegistry.def
    M llvm/lib/Passes/StandardInstrumentations.cpp
    M llvm/lib/Transforms/IPO/SampleProfile.cpp
    M llvm/lib/Transforms/IPO/SampleProfileProbe.cpp
    A llvm/test/Transforms/SampleProfile/Inputs/pseudo-probe-update.prof
    M llvm/test/Transforms/SampleProfile/pseudo-probe-emit-inline.ll
    M llvm/test/Transforms/SampleProfile/pseudo-probe-emit.ll
    M llvm/test/Transforms/SampleProfile/pseudo-probe-inline.ll
    M llvm/test/Transforms/SampleProfile/pseudo-probe-profile.ll
    A llvm/test/Transforms/SampleProfile/pseudo-probe-update.ll
    A llvm/test/Transforms/SampleProfile/pseudo-probe-verify.ll

  Log Message:
  -----------
  [CSSPGO] Introducing distribution factor for pseudo probe.

Sample re-annotation is required in LTO time to achieve a reasonable post-inline profile quality. However, we have seen that such LTO-time re-annotation degrades profile quality. This is mainly caused by preLTO code duplication that is done by passes such as loop unrolling, jump threading, indirect call promotion etc, where samples corresponding to a source location are aggregated multiple times due to the duplicates. In this change we are introducing a concept of distribution factor for pseudo probes so that samples can be distributed for duplicated probes scaled by a factor. We hope that optimizations duplicating code well-maintain the branch frequency information (BFI) based on which probe distribution factors are calculated. Distribution factors are updated at the end of preLTO pipeline to reflect an estimated portion of the real execution count.

This change also introduces a pseudo probe verifier that can be run after each IR passes to detect duplicated pseudo probes.

A saturated distribution factor stands for 1.0. A pesudo probe will carry a factor with the value ranged from 0.0 to 1.0. A 64-bit integral distribution factor field that represents [0.0, 1.0] is associated to each block probe. Unfortunately this cannot be done for callsite probes due to the size limitation of a 32-bit Dwarf discriminator. A 7-bit distribution factor is used instead.

Changes are also needed to the sample profile inliner to deal with prorated callsite counts. Call sites duplicated by PreLTO passes, when later on inlined in LTO time, should have the callees’s probe prorated based on the Prelink-computed distribution factors. The distribution factors should also be taken into account when computing hotness for inline candidates. Also, Indirect call promotion results in multiple callisites. The original samples should be distributed across them. This is fixed by adjusting the callisites' distribution factors.

Reviewed By: wmi

Differential Revision: https://reviews.llvm.org/D93264

(cherry picked from commit 3d89b3cbec230633e8228787819b15116c1a1730)

Compare: https://github.com/llvm/llvm-project/compare/f5602e0bf31a...a9157c5628dc