[all-commits] [llvm/llvm-project] 45b155: [BOLT] using jump weights in profi

Wed Jan 11 14:35:12 PST 2023

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 45b155924e1662dc69883d14149908434f77094f
      https://github.com/llvm/llvm-project/commit/45b155924e1662dc69883d14149908434f77094f
  Author: spupyrev <spupyrev at fb.com>
  Date:   2023-01-11 (Wed, 11 Jan 2023)

  Changed paths:
    M llvm/include/llvm/Transforms/Utils/SampleProfileInference.h
    M llvm/lib/Transforms/Utils/SampleProfileInference.cpp
    M llvm/test/Transforms/SampleProfile/profile-context-tracker.ll

  Log Message:
  -----------
  [BOLT] using jump weights in profi

We want to use profile inference (profi) in BOLT for stale profile matching.
This is the second change for existing usages of profi (e.g., CSSPGO):

(i) Added the ability to provide (estimated) jump weights for the algorithm. The
goal of the algorithm is to create a valid control flow for a given function
(that is, one in which incoming counts equal outgoing counts for every basic
block while minimally modifying the original input block and jump weights). The
input jump weights will be provided based on collected LBR profiles in BOLT.

(ii) Added the corresponding options to ProfiParams.

(iii) Slightly modified / simplified the construction of the flow network in profi
so as it utilizes fewer auxiliary nodes. This is done by introducing parallel
edges to the network (which is supported by MMF) and reduces the size of the
network from 3*|V| to 2*|V|, where |V| is the number of basic blocks in the
function.

**Inference (profile quality) impact:**
The diff is supposed to be a no-op for the inferred counts. However, our
implementation of MCF is not fully deterministic and might return different
results depending on the input network model. Since we changed the model
construction, there are a few differences in comparison to the original
implementation. I checked manually on an internal benchmark and see a minor
difference (+/- 1 count for certain basic blocks) in just a dozen of instances
(out of 10000+ input functions). Hence, the diff is highly unlikely to have an
impact for existing prod workloads.

**Runtime impact:**
I measure up to 10% speedup for block-only (ie CSSPGO/AutoFDO) inference and up
to 50% speedup for block+jump inference (ie BOLT) in comparison to the original
unoptimized version.

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D139870