[llvm-dev] [RFC] Computing, storing, and restoring conservative call graphs with LLVM

Thu Jun 10 12:48:19 PDT 2021

Hi all,

Please find the RFC for computing, storing, and restoring conservative
call graphs with LLVM below.

This is an early version of the proposed feature, and we are still
exploring the ways to implement it.  Comments and suggestions are
welcome.

Summary
=======

Inferring indirect call targets from a binary is challenging without
source-level information.  Hence, reconstruction of a fine-grained
call graph from the binary is unfeasible for indirect/virtual calls.
To address this, we propose to implement a feature in LLVM to collect
the necessary information to construct the call graph later while the
source information is present (i.e., during compilation), and store it
in a non-code section of the binary.  Additionally, we propose to
implement a stand-alone tool, llvm-discg, or a feature in llvm-objdump
to extract and serialize the call graph.  Together, these will allow
recovering a conservative (i.e., possibly bigger than actual to avoid
missing information) and high-precision call graph only from the
binary.

In the following, we discuss the motivation for the proposed changes,
and describe how we plan to construct, store, and reconstruct the call
graph.

Background and motivation
=========================

Source code provides rich semantic information such as types, which is
mostly lost when it is compiled to a binary.  A coarse-grained call
graph can be recovered when indirect calls are allowed to target any
function in the program, resulting in poor precision.  On the other
hand, source-level information can easily increase the precision of
the call graphs, e.g., by restricting the target of an indirect call
to functions with matching type.

It is infeasible to fully reconstruct the call graph for the binary
even with the source in hand.  The call graph for the source is
possibly different than for the binary due to removed or inserted
function calls, e.g., tail call optimizations or new function calls to
libc.  Therefore, construction of a call graph during compilation is
favorable.  Since the call graph belongs to a specific compilation of
the source, a non-code section in the binary is the right scope to
store such information.

Use case
========

Many program analysis tools (bug detectors, profilers, etc.) collect
call stack traces during execution, to later present them in
diagnostic reports.  In some cases, like with AddressSanitizer [1],
stack trace collection is not a performance critical operation since
the rest of the tool’s overhead is much higher.  In other cases, such
as statistical heap profilers, it is also not a performance bottleneck
since the profilers sample a small fraction of events.

For memory safety error detectors that use the upcoming Arm Memory
Tagging Extension (Arm MTE, [2]), however, stack trace collection will
become a significant bottleneck, with respect to CPU and RAM usage,
because stack traces will need to be collected on every malloc() and
free() call. Reducing this overhead will enable collecting stack
traces and improve quality of diagnostics for MTE-enabled dynamic
analysis in production.

Our goal is to achieve efficient stack trace collection and
reconstruction via offline reverse call graph traversal.  It involves
emitting stack trace hash (instead of the full trace) during runtime;
and reconstructing the full stack trace offline from a stack trace
hash and a call graph when needed.  The proposed LLVM features will
enable this by reconstructing an accurate call graph from the binary.

Construction of the conservative call graph
===========================================

The call graph for direct calls is trivial to obtain.  For indirect or
virtual calls, the conservative assumption will be made: all functions
with matching signature (with relaxations, e.g., see
-fsanitize-cfi-icall-generalize-pointers [3]) will be assumed to be a
possible target.  To further narrow down, a function will not be
considered as a target unless it has external linkage or its address
is taken.  Further narrowing for C++ virtual calls can be done later.

The possible targets of indirect calls will be split into disjoint
classes of unique function signatures.  In the call graph, an indirect
call will have an edge to the receiver function class with matching
signature to reflect the possible targets for the call.

The call graph will be saved in a non-code section of the binary,
i.e., .callgraph.  The functions and call sites will be identified
with unique function entry and call site labels in the machine code,
which can be translated into instruction addresses.  The hash of
function signatures will be used to identify and refer to the disjoint
classes of indirect targets.

To achieve this in LLVM, a new flag will be introduced, which
constructs and emits the call graph in the compiler backend, possibly
implemented in llvm/lib/codeGen/AsmPrinter/AsmPrinter.cpp.

Call graph layout in the binary
===============================

For direct calls, no additional information will be stored in the
binary.  For indirect calls, the call graph will be represented in two
pieces:

  * a non-code section, .callgraph, mapping the function signature
hashes to the list of target functions entries and indirect/virtual
call sites.  The first word is a version number to account for
potential changes in the format.

 * labels in .text, indicating the target function entries and
indirect/virtual call sites. The .callgraph section will have pointers
to these labels.  This will not affect the executable code.

The layout is as follows:

<.callgraph section>
FormatVersionNumber, TypeHash1,
FuncEntryLabelCount1, .LFunc1Entry, .LFunc2Entry, ..., .LFuncNEntry,
CallSiteLabelCount1,  .LIndirectCallSite1, ..., .LIndirectCallSiteM,
FormatVersionNumber, TypeHash2, …
FuncEntryLabelCount2, ...,
CallSiteLabelCount2,  ...,
...

<.text section>
...
.LFunc1Entry:           ; unique label for target function entry for func1()
    push %rbp
    ...

.LIndirectCallSite1     ; unique label for indirect function call 1
    callq Y(%rbp)
...

Because the linker may concatenate .callgraph sections with different
format version numbers, we repeat the format version number for each
TypeHash list.  This avoids any special logic for the .callgraph
section in the linker.

Restoring the call graph
========================

To extract the call graph from the binary and serialize, a new LLVM
tool, llvm-discg, will be implemented using LLVM disassembler library,
or a new feature will be introduced into llvm-objdump.  The direct
call graph will be reconstructed from the disassembly without needing
the stored call graph data.  The indirect call graph will be
reconstructed from the .callgraph section and the disassembly.

For the indirect call graph, the labels in the .callgraph section will
be resolved to instruction addresses.  Then, disjoint classes of
function entries and call sites will be recovered based on their type
hashes.  For a call site with type hash H, edges will be made to all
function entries with the same type hash H.  Repeating the procedure
for all call sites, the conservative indirect call graph will be
reconstructed.

Future Direction
================

The construction mode of the call graph can be parametrized to tune
the trade-off between the completeness and soundness of the call graph
based on additional heuristics.

References
==========

[1] https://clang.llvm.org/docs/AddressSanitizer.html
[2] https://security.googleblog.com/2019/08/adopting-arm-memory-tagging-extension.html
[3] https://clang.llvm.org/docs/ControlFlowIntegrity.html#fsanitize-cfi-icall-generalize-pointers

Best,
Necip Fazil Yildiran