[PATCH] D124490: [InstrProf] Minimal Block Coverage

Tue Mar 14 17:53:56 PDT 2023

ellis added a comment.

In D124490#4153293 <https://reviews.llvm.org/D124490#4153293>, @MaskRay wrote:

> In `BlockCoverageInference.cpp`, `for (auto &BB : F) { ...  getReachableAvoiding` is strictly quadratic and I think that may be problematic.
> There are quite a few programs (e.g. Chisel generated C++) which contain functions with many basic blocks (say, >1000).
> Also see D137184 <https://reviews.llvm.org/D137184>: a workaround for some programs with too many critical edges.
>
> Applying a strict quadratic algorithm is a compile time pitfall. I understand that it may be fine for some mobile applications.
> There are certainly other quadratic algorithms such as an `O((V+E)*c)` algorithm in GCC gcov (V: number of vertices; E: number of edges; c: number of elementary circuits),
> an O(V^2) implementation of Semi-NCA algorithm.
> For them, the quadratic time complexity is a theoretic upper bound which is really really difficult to achieve in practice (if ever achievable considering that in practice most CFGs are reducible).
>
> Note that the optimality in the number of instrumented basic blocks is not required.
> It does not necessarily lead to optimal performance since we have discarded execution count information.
> Is it possible to use a faster algorithm which may instrument a few more basic blocks (let's arbitrarily say 1.2-approximation is good enough in practice)?

We analyzed the runtime on several real-world programs that contain large functions with many basic blocks. For functions with fewer than 1.5K blocks, the total runtime of the `pgo-instr-gen` pass was under 5 seconds. For functions with between 4K and 10K basic blocks, the runtime ranged from 150 to 850 seconds.

Note that the intended use case is to instrument minimal block coverage on binaries where minimizing binary size is a priority, so that we can better control the machine outliner. Those binaries are unlikely to have functions this large because inlining is likely restricted to reduce code size. For the special cases where functions are large (say, >1.5K blocks), I'd like to bail out of this algorithm and instead instrument every block. This keeps the code complexity down while preventing build time regressions in the pathological cases.
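
To make the proposed bail-out concrete, here is a rough, self-contained sketch. It is not the code in `BlockCoverageInference.cpp`. The `CFG` type, `reachableAvoiding`, `chooseStrategy`, `countReachabilityWork`, and the `MaxBlocksForInference` cutoff are all hypothetical names chosen for illustration, and the 1500 value only mirrors the "say, >1.5K blocks" figure above. It shows why one reachability query per block is quadratic, and how a block-count check could skip that work entirely.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical stand-in for a function's CFG: Succs[i] lists the successors
// of block i. Block 0 plays the role of the entry block.
using CFG = std::vector<std::vector<std::size_t>>;

// Hypothetical cutoff, mirroring the "say, >1.5K blocks" figure.
constexpr std::size_t MaxBlocksForInference = 1500;

enum class Strategy { InstrumentAll, InferMinimal };

// Blocks reachable from Start without passing through Avoid.
// A single call is O(V + E).
std::vector<bool> reachableAvoiding(const CFG &Succs, std::size_t Start,
                                    std::size_t Avoid) {
  std::vector<bool> Seen(Succs.size(), false);
  if (Start == Avoid)
    return Seen;
  std::queue<std::size_t> Work;
  Seen[Start] = true;
  Work.push(Start);
  while (!Work.empty()) {
    std::size_t BB = Work.front();
    Work.pop();
    for (std::size_t S : Succs[BB]) {
      if (S != Avoid && !Seen[S]) {
        Seen[S] = true;
        Work.push(S);
      }
    }
  }
  return Seen;
}

// Large functions fall back to instrumenting every block.
Strategy chooseStrategy(const CFG &Succs) {
  return Succs.size() > MaxBlocksForInference ? Strategy::InstrumentAll
                                              : Strategy::InferMinimal;
}

// Runs one reachability query per block, which is O(V * (V + E)) overall,
// and returns the total number of blocks visited. Returns 0 when the
// function is over the cutoff, because the queries are skipped.
std::size_t countReachabilityWork(const CFG &Succs) {
  if (chooseStrategy(Succs) == Strategy::InstrumentAll)
    return 0;
  std::size_t Visited = 0;
  for (std::size_t BB = 0; BB < Succs.size(); ++BB) {
    std::vector<bool> Seen = reachableAvoiding(Succs, 0, BB);
    for (bool B : Seen)
      Visited += B ? 1 : 0;
  }
  return Visited;
}
```

The check runs before any per-block query, so a function over the cutoff costs only a size comparison. The real pass would presumably make the cutoff configurable; that detail is left out here.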

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124490/new/

https://reviews.llvm.org/D124490