[all-commits] [llvm/llvm-project] b00fc1: profi - a flow-based profile inference algorithm: ...

Tue Nov 23 09:09:01 PST 2021

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: b00fc198224efa038a7469e068dd920b3f1aba75
      https://github.com/llvm/llvm-project/commit/b00fc198224efa038a7469e068dd920b3f1aba75
  Author: spupyrev <spupyrev at fb.com>
  Date:   2021-11-23 (Tue, 23 Nov 2021)

  Changed paths:
    A llvm/include/llvm/Transforms/Utils/SampleProfileInference.h
    M llvm/include/llvm/Transforms/Utils/SampleProfileLoaderBaseImpl.h
    M llvm/lib/Transforms/IPO/SampleProfile.cpp
    M llvm/lib/Transforms/Utils/CMakeLists.txt
    A llvm/lib/Transforms/Utils/SampleProfileInference.cpp
    M llvm/lib/Transforms/Utils/SampleProfileLoaderBaseUtil.cpp
    A llvm/test/Transforms/SampleProfile/Inputs/profile-inference.prof
    A llvm/test/Transforms/SampleProfile/profile-inference.ll

  Log Message:
  -----------
  profi - a flow-based profile inference algorithm: Part I (out of 3)

The benefits of sampling-based PGO crucially depend on the quality of profile
data. This diff implements a flow-based algorithm, called profi, that helps to
overcome the inaccuracies in a profile after it is collected.

Profi is an extended and significantly re-engineered version of the classic
MCMF (min-cost max-flow) approach suggested by Levin, Newman, and Haber [2008,
Complementing missing and inaccurate profiling using a minimum cost circulation
algorithm]. It
models profile inference as an optimization problem on a control-flow graph with
the objectives and constraints capturing the desired properties of profile data.
Profi addresses three important challenges:
- "fixing" errors in profiles caused by sampling;
- converting basic block counts to edge frequencies (branch probabilities);
- dealing with "dangling" blocks having no samples in the profile.
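
As a minimal sketch of the kind of result these three goals aim at (not the
code in SampleProfileInference.cpp), the C++ below takes edge flows that are
assumed to be already inferred for a diamond-shaped CFG, checks flow
conservation, and derives a branch probability. The names Edge, sumFlow,
isConsistent, branchProbability, and diamondExample are hypothetical, as are
the counts: block 2 stands in for a "dangling" block whose count of 30 follows
from its neighbors. The sketch also assumes a single entry with no incoming
edges and a single exit with no outgoing edges.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical types for illustration; these are not the LLVM data structures.
struct Edge {
  int Src;
  int Dst;
  uint64_t Flow; // inferred execution count of the edge
};

// Sum the flow on edges entering (Incoming = true) or leaving a block.
uint64_t sumFlow(const std::vector<Edge> &Edges, int Block, bool Incoming) {
  uint64_t Total = 0;
  for (const Edge &E : Edges)
    if ((Incoming ? E.Dst : E.Src) == Block)
      Total += E.Flow;
  return Total;
}

// Flow conservation: every block other than entry and exit must receive as
// much flow as it sends, and the flow leaving the entry must reach the exit.
bool isConsistent(const std::vector<Edge> &Edges, int NumBlocks, int Entry,
                  int Exit) {
  for (int B = 0; B < NumBlocks; ++B) {
    if (B == Entry || B == Exit)
      continue;
    if (sumFlow(Edges, B, true) != sumFlow(Edges, B, false))
      return false;
  }
  return sumFlow(Edges, Entry, false) == sumFlow(Edges, Exit, true);
}

// Branch probability of Src -> Dst: edge flow over total flow leaving Src.
double branchProbability(const std::vector<Edge> &Edges, int Src, int Dst) {
  uint64_t Out = sumFlow(Edges, Src, false);
  if (Out == 0)
    return 0.0; // no flow leaves Src, so no meaningful probability
  uint64_t Taken = 0;
  for (const Edge &E : Edges)
    if (E.Src == Src && E.Dst == Dst)
      Taken += E.Flow;
  return static_cast<double>(Taken) / static_cast<double>(Out);
}

// Diamond CFG: 0 -> {1, 2} -> 3. Made-up sampled counts: block 0 = 100,
// block 1 = 70, block 3 = 100; block 2 has no samples ("dangling").
std::vector<Edge> diamondExample() {
  return {{0, 1, 70}, {0, 2, 30}, {1, 3, 70}, {2, 3, 30}};
}
```

With these made-up flows, isConsistent(diamondExample(), 4, 0, 3) holds, the
dangling block 2 receives a count of 30, and the branch 0 -> 1 is taken with
probability 70/100.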

The main implementation and the required docs are in
SampleProfileInference.cpp. The worst-case time complexity is quadratic in the
number of blocks in a function, O(|V|^2). However, thanks to careful
engineering, extensive evaluation shows that the running time is (slightly)
super-linear in practice. In particular, instances with 1000 blocks are solved
within 0.1 second.

The algorithm has been extensively tested internally on prod workloads,
significantly improving the quality of generated profile data and providing
speedups ranging from 0% to 5%. For "smaller" benchmarks (SPEC06/17), it
generally improves performance (with a few outliers), but extra work in the
compiler might be needed to re-tune existing optimization passes that rely on
profile counts.

Reviewed By: wenlei, hoy

Differential Revision: https://reviews.llvm.org/D109860