[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports

Dan Liew via llvm-dev llvm-dev at lists.llvm.org
Tue Oct 6 18:11:45 PDT 2020


# Summary

Currently the Sanitizer family of runtime bug-finding tools (e.g.
Address Sanitizer) provides useful reports of problems upon detection.
This RFC proposes adding tools to:

1. Parse Sanitizer reports into structured data to make interfacing
with other tools simpler.
2. Take the Sanitizer reports and “Symbolicate” them. That is, add
missing symbol information (function name, source file, line number)
to the structured data version of the report.

The initial stubs for the proposal in this RFC are provided in this
patch: https://reviews.llvm.org/D88938 .

Any thoughts on this RFC or on the patch would be appreciated.

# Issues with the existing solutions

* An official parser for sanitizer reports does not exist. Currently
we just tell our users to implement their own (e.g. [1]). This creates
an unnecessary duplication of effort.
* The existing symbolizer (asan_symbolize.py) only works with ASan
reports and doesn’t support other sanitizers like TSan.
* The architecture of the existing symbolizer makes it cumbersome to
support inline frames.
* The architecture of the existing symbolizer is sequential, which
prevents performing batched symbolication of stack frames.

# Tools

The proposed tools would be sub-tools of a new llvm-xsan tool.

E.g.

llvm-xsan <subtool>

Sub-tools will support nesting of sub-tools to allow building
ergonomic tools. E.g.:

llvm-xsan asan <asan subtool>

* The tools would be part of compiler-rt and will optionally ship with
this project.
* The tools will be considered experimental while being incrementally
developed on the master branch.
* Functionality of the tools will be maintained via tests in compiler-rt.

llvm-xsan could also be used as a vehicle for shipping other Sanitizer
tools in the toolchain in the future.

## Parsing tool

Sanitizer reports are primarily meant to be human readable;
consequently the reports are not structured data (e.g. JSON). This
means that Sanitizer reports are not conveniently machine-readable.

A request [2] was made in the past to teach the sanitizers to emit a
machine-readable format for reports. This request was denied but an
alternative was proposed where a tool could be provided to convert the
human readable Sanitizer reports into a structured data format. This
proposal will implement this alternative.

My proposal is that we implement a parser for Sanitizer reports that
converts them into structured data. In particular:

* The tool is tied to the Clang/compiler-rt runtime that it ships
with. This means the tool will parse Sanitizer reports that come from
binaries built using the corresponding Clang. However the tool is not
required to parse Sanitizer reports that come from different versions
of Clang.
* The tool can also output a schema that describes the structured data
format. This schema would be versioned and would be allowed to change
once the tool moves out of the experimental stage.
* The format of the human readable Sanitizer reports is allowed to
change but the parser should be correspondingly changed when this
happens. This will be enforced with tests.

The parsing tools would be subtools of the asan, tsan, ubsan subtools.
This would require the user to explicitly communicate the report type
ahead of time. Command line invocation would look something like:

```
llvm-xsan asan parse < asan_report.txt > asan_report.json
llvm-xsan tsan parse < tsan_report.txt > tsan_report.json
llvm-xsan ubsan parse < ubsan_report.txt > ubsan_report.json
```

The structured data format would be JSON. The schema details still
need to be worked out but the schema will need to cover every type of
issue that a Sanitizer can find.
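
To make the idea concrete, here is a minimal Python sketch of the kind
of conversion the parse subtool would perform. It only handles
unsymbolicated stack frame lines of the form
`#N 0xPC  (module+0xOFFSET)`. The JSON field names (`index`, `pc`,
`module`, `module_offset`) and the sample report text are assumptions
made for illustration; they are not the proposed schema, which is
still to be worked out.

```python
import json
import re

# Matches an unsymbolicated frame line such as:
#   "    #1 0x4f2a31  (/usr/bin/app+0xf2a31)"
FRAME_RE = re.compile(
    r"^\s*#(?P<index>\d+)\s+(?P<pc>0x[0-9a-fA-F]+)\s+"
    r"\((?P<module>.+)\+(?P<offset>0x[0-9a-fA-F]+)\)\s*$"
)


def parse_frames(report_text):
    """Return one dict per unsymbolicated frame line in the report."""
    frames = []
    for line in report_text.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            continue  # not a frame line (header, summary, ...)
        frames.append({
            # Field names are hypothetical, not the real schema.
            "index": int(match.group("index")),
            "pc": match.group("pc"),
            "module": match.group("module"),
            "module_offset": match.group("offset"),
        })
    return frames


# Hand-written sample text for illustration only.
SAMPLE = """\
==1234==ERROR: AddressSanitizer: heap-use-after-free on address 0x602000000010
    #0 0x4f2a31  (/usr/bin/app+0xf2a31)
    #1 0x7f3c1a2b4c  (/lib/libfoo.so+0x1b4c)
"""

print(json.dumps({"frames": parse_frames(SAMPLE)}, indent=2))
```

A real parser would also need to capture the issue type, memory
addresses, thread information and so on for every kind of report.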

## Symbolication tool

Sanitizer reports include detailed stack traces which show the program
counter (PC) for each frame. PCs are typically not useful to a
developer. Instead they are likely more interested in the function
name, source file and line number that correspond to each of the PCs.
The process of finding the function name, source file and line number
that correspond to a PC is known as “Symbolication”.

There are two approaches to symbolication, online and offline. Online
symbolication performs symbolication in the process where the issue
was found by invoking an external tool (e.g. llvm-symbolizer) to
“symbolize” each of the PCs. Offline symbolication performs
symbolication outside the process where the issue was found. The
Sanitizers perform online symbolication by default. This process needs
the debug information to be available at runtime. However this
information might be missing. For example:

* The instrumented binary might have been stripped of debug info (e.g.
to reduce binary size).
* The PC points inside a system library which has no available debug info.
* The instrumented binary was built on a different machine. On Apple
platforms debug info lives outside the binary (inside “.dSYM” bundles)
so these might not be copied across from the build machine.

In these cases online symbolication fails and we are left with a
sanitizer report that is extremely hard for a developer to read.

To turn the unsymbolicated Sanitizer report into something useful for
a developer, offline symbolication is necessary. However, the existing
infrastructure (asan_symbolize.py) for doing this has some
deficiencies.

* Only Address Sanitizer reports are supported.
* The current implementation processes each stack frame sequentially.
This does not fit well in contexts where we would like to symbolicate
multiple PCs at a time.
* The current implementation doesn’t provide a way to handle inline
frames (i.e. a PC maps to two or more source locations).

These problems can be resolved by building new tools on top of the
structured data format. This gives a nice separation of concerns
because parsing the report is now separate from symbolicating the PCs
in it.
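
As a rough sketch of what this separation enables, the Python below
takes already-parsed frames, groups them by module so that each module
is symbolicated in a single batch, and stores a list of locations per
frame so that inline frames have somewhere to live. The frame fields,
the `symbolize_module` callback and the `fake_backend` lookup table
are all hypothetical stand-ins, not part of the patch.

```python
from collections import defaultdict


def symbolicate_batched(frames, symbolize_module):
    """Attach source locations to frames, one backend call per module.

    `symbolize_module(module, offsets)` is a hypothetical backend that
    returns a dict mapping each offset it can resolve to a list of
    locations (more than one entry when code was inlined).
    """
    by_module = defaultdict(list)
    for frame in frames:
        by_module[frame["module"]].append(frame)

    for module, module_frames in by_module.items():
        offsets = [f["module_offset"] for f in module_frames]
        results = symbolize_module(module, offsets)  # one batched call
        for frame in module_frames:
            # Unresolved offsets get an empty list of locations.
            frame["locations"] = results.get(frame["module_offset"], [])
    return frames


def fake_backend(module, offsets):
    """Stand-in for a real symbolizer; the table is made up."""
    table = {
        ("/usr/bin/app", "0xf2a31"): [
            # Innermost (inlined) location first, then its caller.
            {"function": "get", "file": "vec.h", "line": 12},
            {"function": "main", "file": "main.cpp", "line": 40},
        ],
    }
    return {o: table[(module, o)] for o in offsets if (module, o) in table}


frames = [
    {"module": "/usr/bin/app", "module_offset": "0xf2a31"},
    {"module": "/lib/libfoo.so", "module_offset": "0x1b4c"},
]
for frame in symbolicate_batched(frames, fake_backend):
    print(frame["module_offset"], frame["locations"])
```

Because the input is structured data, the same function would work
unchanged for frames that came from an ASan, TSan or UBSan report.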

The symbolication tools would be subtools of the asan, tsan, ubsan
subtools. This would require the user to explicitly communicate the
report type ahead of time. Command line invocation would look
something like:

```
llvm-xsan asan symbolicate < asan_report.json > asan_report_symbolicated.json
llvm-xsan tsan symbolicate < tsan_report.json > tsan_report_symbolicated.json
llvm-xsan ubsan symbolicate < ubsan_report.json > ubsan_report_symbolicated.json
```

There are multiple ways to perform symbolication (some of which are
platform specific). As with asan_symbolize.py, the plan would be to
support multiple symbolication backends (that can also be chained
together) that are specified via command line options.
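
One possible shape for chaining, sketched in Python: each backend is
tried in order and the first one that resolves the address wins. The
backend signature and the two example backends (with their made-up
lookup tables) are assumptions for illustration only; they do not
describe how asan_symbolize.py or the proposed tool implements
chaining.

```python
def chain_backends(*backends):
    """Return a symbolizer that tries each backend in order.

    Each backend is a callable (module, offset) -> list of locations,
    returning an empty list when it cannot resolve the address.
    """
    def symbolize(module, offset):
        for backend in backends:
            locations = backend(module, offset)
            if locations:
                return locations
        return []  # no backend could resolve this address
    return symbolize


def debug_info_backend(module, offset):
    """Hypothetical backend with full debug info for one binary."""
    known = {
        ("/usr/bin/app", "0xf2a31"): [
            {"function": "main", "file": "main.cpp", "line": 40},
        ],
    }
    return known.get((module, offset), [])


def symbol_table_backend(module, offset):
    """Hypothetical fallback: function names only, no file or line."""
    known = {("/lib/libfoo.so", "0x1b4c"): "foo_init"}
    name = known.get((module, offset))
    if name is None:
        return []
    return [{"function": name, "file": None, "line": None}]


symbolize = chain_backends(debug_info_backend, symbol_table_backend)
print(symbolize("/usr/bin/app", "0xf2a31"))
print(symbolize("/lib/libfoo.so", "0x1b4c"))
print(symbolize("/lib/libbar.so", "0x10"))
```

The order of the backends on the command line would then determine
which source of symbol information takes priority.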

[1] https://github.com/dobin/asanparser/blob/master/asanparser.py
[2] https://github.com/google/sanitizers/issues/268

Thanks,
Dan.