[llvm-dev] [RFC] Computing, storing, and restoring conservative call graphs with LLVM

Sun Jul 11 20:24:14 PDT 2021

Hi all,

Please see the updates to the RFC below.

The terms “type hash”, “type identifier”, and “type id” are used
interchangeably.

The original RFC describes a use case, which is efficient stack trace
collection and reconstruction via offline reverse call graph
traversal. Besides call graph storing and reconstruction as described
in this RFC, additional LLVM features will be introduced for this use
case including 1) ASan features to collect and report compressed stack
traces, 2) a stand-alone tool to reconstruct stack traces based on the
compressed trace and the call graph. A separate RFC will be sent
shortly for these.

Updates
=======
  - Added implementation details on computing and storing call graph
    (see “Implementation” section).
  - Added details on restoring the call graph (see “Updates on
    restoring the call graph” section).  Proposed to introduce an
    option to llvm-objdump (–call-graph-info) instead of having a new
    stand-alone tool.  Added the output layout.

Implementation
==============

The construction scheme will consist of the following parts:

1. Creating type ids for indirect calls and targets at clang front-end

    Generalized type identifiers will be computed for each indirect
    call and target using
    clang::CodeGenModule::CreateMetadataIdentifierGeneralized().
    These type ids will be passed to the middle-end as metadata.  The
    type information is deliberately computed at the front-end from
    C/C++ types instead of IR types due to two main reasons.  First,
    it provides extra precision since some type information is lost at
    IR.  Second, it ensures consistent type identifiers between object
    files compiled at different times since C/C++ standards require
    language-level types to match.

2. Propagating indirect call type ids from the middle-end (LLVM IR) to
   the back-end (machine instructions at AsmPrinter) with CallSiteInfo

    To form the call graph section, the type ids for indirect calls
    and targets are needed at the machine-instruction level.  However,
    the metadata is not propagated from LLVM IR to machine
    instructions.  For indirect targets, machine functions can be
    mapped back to IR functions.  This enables retrieving the type id
    for a machine function from the metadata of the IR function.
    However, there is no such machine-to-IR mapping for arbitrary
    instructions.  To address this,  CallSiteInfo (originally used for
    mapping call arguments to forwarding registers) will be extended
    to include call type ids, so that, they can be retrieved at the
    back-end for indirect calls.

3. Creating comdats for functions whose symbols will be referenced
   from the call graph section

    Comdats will be used to enable dead-stripping of the call graph
    entries.  An LLVM pass will be implemented and enabled to create a
    comdat for each function with at least one call graph entry.  The
    entries of each function will be emitted to the section created
    using its comdat.  As a result, the call graph entries will be
    discarded if the linked functions get removed.

4. Emitting call graph section

    The type ids will be retrieved and the call graph sections will be
    created at the assembly printer as described above.  A special
    type id (0) will be assigned to indirect targets lacking the
    metadata (possibly due to metadata loss), so that a complete list
    of indirect targets can be recovered from the call graph section.
    This makes indirect targets distinguishable from others, which
    provides additional precision.  Labels will be emitted for
    indirect targets and calls, and their values will be emitted to
    the call graph section as pointers to these labels.  Gathering all
    these pieces, the call graph sections will be constructed at the
    assembly printer following the described layout.

Updates on restoring the call graph
===================================

Instead of creating a new stand-alone LLVM tool, a new feature will be
introduced to llvm-objdump (--call-graph-info) to extract and
serialize the call graph from the binary.  This will enable code reuse
for disassembly.

The output will have the following layout:

  INDIRECT TARGETS TYPES
  TypeId1 FuncEntryAddr1 FuncEntryAddr2 ... FuncEntryAddrN
  TypeId2 ...

  INDIRECT CALLS TYPES
  TypeId1 ICallSite1 ICallSite2 ... ICallSiteN
  TypeId2 ...

  INDIRECT CALL SITES
  CallerFuncEntryAddr1 ICallSite1 ICallSite2 ... ICallSiteN
  CallerFuncEntryAddr2 ...

  DIRECT CALL SITES
  CallerFuncEntryAddr1 DCallSite1 DCallTarget1 ... DCallSiteN DCallTargetN
  CallerFuncEntryAddr2 ...

  FUNCTION SYMBOLS
  FuncEntryAddr1 FuncName1
  FuncEntryAddr2 FuncName2
  ...

The information for the indirect target and call type identifiers will
be retrieved from the call graph section. The call sites will be
extracted from the disassembly. The function symbols will be extracted
from the symbol table.

If an indirect call has no type id, it will still appear in “INDIRECT
CALL SITES” list but not in  “INDIRECT CALL TYPES” list.  In such
cases, receiver edges can be drawn to all indirect targets from such
call sites for completeness of the call graph,

Indirect targets without a type id are assigned a special id (0) in
the call graph section construction.  For completeness, these
functions can be taken as receiver to any indirect call.

Best,
Necip Fazil Yildiran

On Thu, Jun 10, 2021 at 3:48 PM Necip Yildiran <necip at google.com> wrote:

> Hi all,
>
> Please find the RFC for computing, storing, and restoring conservative
> call graphs with LLVM below.
>
> This is an early version of the proposed feature, and we are still
> exploring the ways to implement it.  Comments and suggestions are
> welcome.
>
>
> Summary
> =======
>
> Inferring indirect call targets from a binary is challenging without
> source-level information.  Hence, reconstruction of a fine-grained
> call graph from the binary is unfeasible for indirect/virtual calls.
> To address this, we propose to implement a feature in LLVM to collect
> the necessary information to construct the call graph later while the
> source information is present (i.e., during compilation), and store it
> in a non-code section of the binary.  Additionally, we propose to
> implement a stand-alone tool, llvm-discg, or a feature in llvm-objdump
> to extract and serialize the call graph.  Together, these will allow
> recovering a conservative (i.e., possibly bigger than actual to avoid
> missing information) and high-precision call graph only from the
> binary.
>
> In the following, we discuss the motivation for the proposed changes,
> and describe how we plan to construct, store, and reconstruct the call
> graph.
>
>
> Background and motivation
> =========================
>
> Source code provides rich semantic information such as types, which is
> mostly lost when it is compiled to a binary.  A coarse-grained call
> graph can be recovered when indirect calls are allowed to target any
> function in the program, resulting in poor precision.  On the other
> hand, source-level information can easily increase the precision of
> the call graphs, e.g., by restricting the target of an indirect call
> to functions with matching type.
>
> It is infeasible to fully reconstruct the call graph for the binary
> even with the source in hand.  The call graph for the source is
> possibly different than for the binary due to removed or inserted
> function calls, e.g., tail call optimizations or new function calls to
> libc.  Therefore, construction of a call graph during compilation is
> favorable.  Since the call graph belongs to a specific compilation of
> the source, a non-code section in the binary is the right scope to
> store such information.
>
>
> Use case
> ========
>
> Many program analysis tools (bug detectors, profilers, etc.) collect
> call stack traces during execution, to later present them in
> diagnostic reports.  In some cases, like with AddressSanitizer [1],
> stack trace collection is not a performance critical operation since
> the rest of the tool’s overhead is much higher.  In other cases, such
> as statistical heap profilers, it is also not a performance bottleneck
> since the profilers sample a small fraction of events.
>
> For memory safety error detectors that use the upcoming Arm Memory
> Tagging Extension (Arm MTE, [2]), however, stack trace collection will
> become a significant bottleneck, with respect to CPU and RAM usage,
> because stack traces will need to be collected on every malloc() and
> free() call. Reducing this overhead will enable collecting stack
> traces and improve quality of diagnostics for MTE-enabled dynamic
> analysis in production.
>
> Our goal is to achieve efficient stack trace collection and
> reconstruction via offline reverse call graph traversal.  It involves
> emitting stack trace hash (instead of the full trace) during runtime;
> and reconstructing the full stack trace offline from a stack trace
> hash and a call graph when needed.  The proposed LLVM features will
> enable this by reconstructing an accurate call graph from the binary.
>
>
> Construction of the conservative call graph
> ===========================================
>
> The call graph for direct calls is trivial to obtain.  For indirect or
> virtual calls, the conservative assumption will be made: all functions
> with matching signature (with relaxations, e.g., see
> -fsanitize-cfi-icall-generalize-pointers [3]) will be assumed to be a
> possible target.  To further narrow down, a function will not be
> considered as a target unless it has external linkage or its address
> is taken.  Further narrowing for C++ virtual calls can be done later.
>
> The possible targets of indirect calls will be split into disjoint
> classes of unique function signatures.  In the call graph, an indirect
> call will have an edge to the receiver function class with matching
> signature to reflect the possible targets for the call.
>
> The call graph will be saved in a non-code section of the binary,
> i.e., .callgraph.  The functions and call sites will be identified
> with unique function entry and call site labels in the machine code,
> which can be translated into instruction addresses.  The hash of
> function signatures will be used to identify and refer to the disjoint
> classes of indirect targets.
>
> To achieve this in LLVM, a new flag will be introduced, which
> constructs and emits the call graph in the compiler backend, possibly
> implemented in llvm/lib/codeGen/AsmPrinter/AsmPrinter.cpp.
>
>
> Call graph layout in the binary
> ===============================
>
> For direct calls, no additional information will be stored in the
> binary.  For indirect calls, the call graph will be represented in two
> pieces:
>
>   * a non-code section, .callgraph, mapping the function signature
> hashes to the list of target functions entries and indirect/virtual
> call sites.  The first word is a version number to account for
> potential changes in the format.
>
>  * labels in .text, indicating the target function entries and
> indirect/virtual call sites. The .callgraph section will have pointers
> to these labels.  This will not affect the executable code.
>
>
> The layout is as follows:
>
> <.callgraph section>
> FormatVersionNumber, TypeHash1,
> FuncEntryLabelCount1, .LFunc1Entry, .LFunc2Entry, ..., .LFuncNEntry,
> CallSiteLabelCount1,  .LIndirectCallSite1, ..., .LIndirectCallSiteM,
> FormatVersionNumber, TypeHash2, …
> FuncEntryLabelCount2, ...,
> CallSiteLabelCount2,  ...,
> ...
>
> <.text section>
> ...
> .LFunc1Entry:           ; unique label for target function entry for
> func1()
>     push %rbp
>     ...
>
> .LIndirectCallSite1     ; unique label for indirect function call 1
>     callq Y(%rbp)
> ...
>
>
> Because the linker may concatenate .callgraph sections with different
> format version numbers, we repeat the format version number for each
> TypeHash list.  This avoids any special logic for the .callgraph
> section in the linker.
>
>
> Restoring the call graph
> ========================
>
> To extract the call graph from the binary and serialize, a new LLVM
> tool, llvm-discg, will be implemented using LLVM disassembler library,
> or a new feature will be introduced into llvm-objdump.  The direct
> call graph will be reconstructed from the disassembly without needing
> the stored call graph data.  The indirect call graph will be
> reconstructed from the .callgraph section and the disassembly.
>
> For the indirect call graph, the labels in the .callgraph section will
> be resolved to instruction addresses.  Then, disjoint classes of
> function entries and call sites will be recovered based on their type
> hashes.  For a call site with type hash H, edges will be made to all
> function entries with the same type hash H.  Repeating the procedure
> for all call sites, the conservative indirect call graph will be
> reconstructed.
>
>
> Future Direction
> ================
>
> The construction mode of the call graph can be parametrized to tune
> the trade-off between the completeness and soundness of the call graph
> based on additional heuristics.
>
>
> References
> ==========
>
> [1] https://clang.llvm.org/docs/AddressSanitizer.html
> [2]
> https://security.googleblog.com/2019/08/adopting-arm-memory-tagging-extension.html
> [3]
> https://clang.llvm.org/docs/ControlFlowIntegrity.html#fsanitize-cfi-icall-generalize-pointers
>
>
> Best,
> Necip Fazil Yildiran
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210711/c01ad3df/attachment.html>