[llvm-dev] [InstrProfiling] Lightweight Instrumentation

Mon Oct 18 11:28:55 PDT 2021

By the way, it is easier to view this RFC in google groups at
https://groups.google.com/g/llvm-dev/c/r03Z6JoN7d4.

On Mon, Oct 18, 2021 at 10:27 AM Ellis Hoag <ellis.sparky.hoag at gmail.com>
wrote:

> *RFC: Lightweight Instrumentation*
>
> Hi all,
>
> Our team at Facebook would like to propose a lightweight variant of IR
> instrumentation PGO for use in the mobile space. IRPGO is a proven
> technology in LLVM that can boost performance for server workloads.
> However, the larger binary resulting from instrumentation significantly
> limits its use for mobile applications. In this proposal, we introduce a
> few changes to IRPGO to reduce the instrumented binary size, making it
> suitable for PGO on mobile devices.
>
> This proposal is driven by the same need behind the earlier MIP (machine
> IR profile) prototype <https://reviews.llvm.org/D104060>. But unlike MIP
> where there is significant divergence from IRPGO, this proposed lightweight
> instrumentation fits into the existing IRPGO framework with a few
> extensions to achieve a smaller instrumented binary.
>
> We’d like to share the new design and results from our prototype and get
> feedback.
>
> Best,
> Ellis, Kyungwoo, and Wenlei
>
> Motivation
> In the mobile space, profile guided optimization can have an outsized
> impact on performance, just as it does for server workloads. However,
> conventional instrumentation increases binary size and code size by as
> much as 50%, which limits its use for mobile applications for two reasons:
>
>    - Mobile applications are very sensitive to total binary size as
>    larger binaries take longer to download and use more space on devices.
>    There could be a hard size limit for over-the-air (OTA) updates for this
>    reason.
>    - When code (.text) size increases, it takes longer for applications
>    to start up and could also degrade runtime performance due to more page
>    faults on devices with limited RAM.
>
> Reducing the size overhead from instrumentation would make IRPGO usable
> for mobile applications so we could send instrumented binaries through OTA
> updates in production environments, collect representative production
> profiles, and apply PGO.
>
> Overview
> The size overhead from IRPGO mainly comes from two things: 1) metadata
> for mapping raw counts back to IR/CFG, which has to stay with the binary,
> and 2) the increased .text size due to the inserted instrumentation code
> and less effective optimization after instrumentation. Two extensions are
> proposed, one to reduce the size overhead from each of the above:
>
>    - We allow the use of debug info (DWARF) as alternative metadata for
>    mapping counts to IR, aka profile correlation. Debug info is extractable
>    from the binary, so this metadata doesn’t need to be shipped to mobile
>    devices. Debug info has been used extensively for sampling-based PGO
>    in LLVM, so it has reasonable quality to support profile correlation.
>    - We add the flexibility to allow coarse-grained instrumentation that
>    1) inserts probes only at function entry instead of in each block (or
>    the blocks decided by MST placement), and 2) optionally offers a
>    coverage mode using one-byte booleans, in addition to today’s counting
>    mode using 8-byte counters.
>
> The extensions offer a spectrum of trade-off choices from the most
> accurate PGO to something very lightweight that can be used in mobile
> space. With debug info extracted and using function entry coverage mode,
> the size increase can be reduced from close to 50% down to below 5%
> (measured with clang self-build PGO).
>
> Extractable Metadata
> With today’s IRPGO, the instrumentation runtime dumps out a profraw
> profile at the end of training. The runtime creates a header and appends
> data from the __llvm_prf_data, __llvm_prf_cnts, and __llvm_prf_names
> sections to create a profraw profile. The __llvm_prf_data section
> contains references to each function’s profile data (in __llvm_prf_cnts)
> and name (in __llvm_prf_names), so these sections are needed to
> correlate profile data to the functions they instrument.
>
> Some kind of metadata to correlate counts back to IR (specifically CFG
> blocks) is unavoidable. One way to reduce binary size is to make this
> metadata extractable so it doesn’t have to be shipped to mobile devices. We
> could make __llvm_prf_data and __llvm_prf_names extractable, but the cost
> would be non-trivial and it would be a breaking change. On the other hand,
> debug info is extractable from the binary and already does a very good job
> of maintaining the mapping between addresses and source locations / symbols.
> Sample PGO depends entirely on debug info for profile correlation. So we
> picked debug info as the alternative for extractable metadata.
>
> In our proposed instrumentation, we create a special global struct, e.g.,
> __profc__Z3foov, to hold counters for a particular function. The
> __llvm_prf_cnts data section holds all of these structs and serves as a
> placeholder for raw profile counters. In our final instrumented binary, we
> only have probe instructions and raw profile data without any
> instrumentation metadata, i.e., there are no __llvm_prf_names or
> __llvm_prf_data sections but we still have a __llvm_prf_cnts section. At
> runtime, we dump the __llvm_prf_cnts section to a file without any
> processing after profiling. To differentiate from IRPGO, the output from
> the runtime is called proflite and we can add another VARIANT_MASK_ flag to
> the Version field of the profile header.
>
> At llvm-profdata post-processing time, we use debug info to correlate our
> raw profile data as follows. First, we identify an instrumented function
> and look up its special global struct that holds counters
> (__profc__Z3foov) in the debug info. The debug info tells us the address
> of that symbol in the binary, from which we compute its offset from the
> start of the __llvm_prf_cnts section. Then we use that offset to read the
> function entry and block counters from the proflite file. Finally, we
> populate the profdata output for each function following the existing
> format.
>
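The offset-based lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the prototype's code: the function name correlate_counters is made up, a plain dict stands in for the DWARF symbol lookup, the proflite file is treated as the bare __llvm_prf_cnts bytes with no header, and counters are assumed to be little-endian 8-byte integers.

```python
import struct

def correlate_counters(proflite: bytes, cnts_section_start: int, symbols: dict) -> dict:
    """Map each counter struct to its counter values.

    symbols is a hypothetical stand-in for what debug info would provide:
    counter-struct name -> (address in the binary, number of 8-byte counters).
    """
    profile = {}
    for name, (address, num_counters) in symbols.items():
        # Offset of this function's counters within the dumped section.
        offset = address - cnts_section_start
        # Assumption: little-endian unsigned 64-bit counters, no file header.
        counters = struct.unpack_from(f"<{num_counters}Q", proflite, offset)
        profile[name] = list(counters)
    return profile

# Toy data: foo has three counters, bar has one, laid out back to back.
section_start = 0x10000
raw = struct.pack("<3Q", 7, 3, 4) + struct.pack("<1Q", 12)
debug_info = {
    "__profc__Z3foov": (0x10000, 3),
    "__profc__Z3barv": (0x10018, 1),  # 0x18 = 24 bytes after foo's struct
}
result = correlate_counters(raw, section_start, debug_info)
```

Here result maps __profc__Z3foov to [7, 3, 4] and __profc__Z3barv to [12]. A real implementation would also need to handle the profile header and the coverage modes, which this sketch leaves out.
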
> Value profile is not going to be supported with extractable metadata right
> now, though we believe it can also be added following a similar scheme.
>
> To improve debug info quality for profile correlation,
> -fdebug-info-for-profiling from AutoFDO can be used. We could also use
> pseudo-probe from CSSPGO
> <https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s/m/hBrJaOWVAwAJ> as
> the alternative metadata, which is also fully extractable.
>
> We propose a new flag
> -fprofile-generate-correlate=[profdata|debug-info|pseudo-probe] to choose
> what metadata to use for profile correlation. The options are to
> correlate with today’s IRPGO metadata, kept in its own sections
> (__llvm_prf_data and __llvm_prf_names), with debug info, or with
> pseudo-probe.
>
> Coarse-grained Instrumentation
> In addition to reducing metadata size (__llvm_prf_names and
> __llvm_prf_data), we can also reduce .text size and __llvm_prf_cnts
> size. We do this by 1) instrumenting only function entries instead of
> each block and 2) lowering precision by tracking single-byte coverage
> data rather than 8-byte counters. This is a trade-off between profile
> quality and binary size.
>
> Function profile vs block profile and counting mode vs coverage mode can
> all be selected independently using our proposed flag
> -fprofile-generate-mode=[func-cov|block-cov|func-cnt|block-cnt], and they
> can work with both extractable metadata and IRPGO’s correlation
> method. func-cov and block-cov use single-byte booleans for coverage data
> while func-cnt and block-cnt use 8-byte counters. block-cnt represents
> today’s IRPGO and is the default.
>
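To make the size trade-off between the four modes concrete, here is a rough Python estimate of the __llvm_prf_cnts payload. It is a sketch under simplifying assumptions: the helper name counters_section_bytes is made up, block modes are modeled as one probe per block (an upper bound, since MST placement instruments fewer blocks), and padding and alignment are ignored.

```python
# Bytes per probe for each proposed mode: booleans for coverage, 8-byte counters otherwise.
PROBE_BYTES = {"func-cov": 1, "block-cov": 1, "func-cnt": 8, "block-cnt": 8}

def counters_section_bytes(mode: str, blocks_per_function: list) -> int:
    """Estimate counter storage for a module, given the block count of each function."""
    width = PROBE_BYTES[mode]
    entry_only = mode.startswith("func")
    total = 0
    for num_blocks in blocks_per_function:
        # Function modes keep one probe per function; block modes are
        # approximated here as one probe per block.
        probes = 1 if entry_only else num_blocks
        total += width * probes
    return total

# Toy module with three functions of 4, 10, and 1 basic blocks.
module = [4, 10, 1]
sizes = {mode: counters_section_bytes(mode, module) for mode in PROBE_BYTES}
```

For this toy module the estimate is 3 bytes for func-cov, 15 for block-cov, 24 for func-cnt, and 120 for block-cnt. The .text savings from emitting fewer probes are not modeled.
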
> When using a profile generated from modes other than block-cnt,
> additional profile inference is needed before the counts can be consumed by
> optimizations. Such inference is done during profile loading and so it’s
> transparent to optimizations.
>
>    - For block coverage mode, we will use coverage info to seed block
>    count inference, and leverage static branch probability at the same time to
>    produce a CFG profile that honors zero count blocks and converts live block
>    coverage data into synthetic counts.
>    - For function count mode, we will derive a CFG profile entirely from
>    static branch probability, then scale the CFG profile based on function
>    entry count.
>    - Function coverage mode is handled similarly to function count mode.
>    For covered/live functions, we will first derive a CFG profile entirely
>    from static branch probability, then scale that CFG profile by a constant.
>
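The function count case in the list above can be sketched as follows. This is a simplified illustration, not the proposed implementation: the function name infer_block_counts and the CFG encoding are made up, the branch probabilities are supplied by hand rather than by a static heuristic, and the CFG is assumed to be acyclic (loops would need the kind of loop scaling that block frequency analysis performs).

```python
def infer_block_counts(cfg: dict, entry: str, entry_count: int) -> dict:
    """cfg maps block -> list of (successor, static branch probability)."""
    # Reverse post-order of a DFS is a topological order for an acyclic CFG.
    order, visited = [], set()

    def visit(block):
        if block in visited:
            return
        visited.add(block)
        for succ, _ in cfg.get(block, []):
            visit(succ)
        order.append(block)

    visit(entry)
    order.reverse()

    # Propagate relative frequency from the entry block along each edge.
    freq = {block: 0.0 for block in order}
    freq[entry] = 1.0
    for block in order:
        for succ, prob in cfg.get(block, []):
            freq[succ] += freq[block] * prob

    # Scale the relative frequencies by the measured function entry count.
    return {block: round(f * entry_count) for block, f in freq.items()}

# Toy diamond CFG with a 75/25 branch out of the entry block.
cfg = {
    "entry": [("then", 0.75), ("else", 0.25)],
    "then": [("exit", 1.0)],
    "else": [("exit", 1.0)],
    "exit": [],
}
counts = infer_block_counts(cfg, "entry", 100)
```

With an entry count of 100 this yields 100 for entry, 75 for then, 25 for else, and 100 for exit. For function coverage mode, the same routine could be called with a fixed constant as entry_count for each covered function.
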
> Experiments showed that mobile applications can still benefit from PGO
> even with coarse-grained function entry profiles. The profile is less
> precise, but the smaller binary makes it possible for mobile to use PGO.
>
> Workflow
> Since these are extensions that share the same underlying PGO
> framework, the workflow for lightweight PGO is very similar to existing
> IRPGO.
>
> The diagram below has the PGO workflow today (shown in red) in comparison
> with the workflow for lightweight instrumentation (shown in green). We
> first create an instrumentation build that produces a raw profile at
> runtime. Then we use the llvm-profdata tool to convert that raw profile
> to a profile that the compiler can consume in the PGO build. The main
> difference for lightweight instrumentation is that we create an
> instrumentation build with debug info and we use that debug info to create
> our final profile.
>
> [image: image.png]
>
> Prototype & Results
> We have a proof of concept
> <https://github.com/ellishg/llvm-project/commits/instr-correlate-debug-info>
> using DWARF as the extractable metadata and single-byte function coverage
> instrumentation. We measured code size by building Clang with and without
> instrumentation using -Oz and no value profiling. Our lightweight
> instrumented Clang binary is only 4 MB (3.48%) larger than a
> non-instrumented binary. By comparison, a Clang binary built with today’s
> PGO instrumentation is 54 MB (46.96%) larger. If we used debug info to
> correlate normal instrumentation (without value profiling) instead of just
> function coverage, we would expect to see an overhead of 43.2 MB
> (37.5%). We don’t have performance data on Clang experiments using the
> prototype since not all components are implemented. However, an earlier
> alternative implementation (similar to MIP) delivered a good performance
> boost for mobile applications.
>
> [image: table-large.jpg]
>
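As a quick sanity check on the figures quoted above, each (MB, percent) pair implies a size for the non-instrumented Clang binary. The short Python calculation below only uses the numbers from the text; the labels are descriptive and the small differences between results are presumably due to rounding in the reported values.

```python
# (size increase in MB, size increase in percent) as quoted in the RFC.
reported = {
    "function coverage + debug info": (4.0, 3.48),
    "today's IRPGO": (54.0, 46.96),
    "normal instrumentation + debug info (expected)": (43.2, 37.5),
}

# Implied non-instrumented binary size: increase / (percent / 100).
implied_baseline_mb = {
    name: delta_mb / (percent / 100.0)
    for name, (delta_mb, percent) in reported.items()
}

for name, baseline in implied_baseline_mb.items():
    print(f"{name}: baseline of about {baseline:.1f} MB")
```

All three pairs point to a baseline of roughly 115 MB, so the reported overheads are mutually consistent.
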
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 230544 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20211018/e1e290ec/attachment-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: table-large.jpg
Type: image/jpeg
Size: 310107 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20211018/e1e290ec/attachment-0001.jpg>