[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports

Wed Oct 7 18:07:23 PDT 2020

On Wed, 7 Oct 2020 at 10:38, Petr Hosek <phosek at google.com> wrote:
>
> We ran into the same issues you described, and the solution we came up with is the Fuchsia symbolizer markup format; see https://fuchsia.dev/fuchsia-src/reference/kernel/symbolizer_markup. Despite its name, nothing about the format is Fuchsia-specific: it should be generally usable and has already been adopted by other systems such as RTEMS.
>
> The symbolizer markup should address many of the issues you mentioned:
> * It's already available in sanitizer_common and supports all sanitizers, see https://github.com/llvm/llvm-project/blob/fccea7f372cbd33376d2c776f34a0c6925982981/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_markup.cpp
> * It supports inline frames, which was the most recent change to the markup, based on our experience with the sanitizer rollout; see https://cs.opensource.google/fuchsia/fuchsia/+/db6e2155d125c389bfc43bafe2f140231da0b6d0
> * It's designed for offline and batched symbolization.
>
> The advantage over emitting JSON directly is that the markup format is line-delimited, which simplifies emission and parsing. It's also more compact, and it can be easily embedded in other formats (even JSON), which is important in our use case.
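
To make the "line delimited" point concrete, here is a minimal sketch of
pulling markup elements out of ordinary log lines. It assumes the
`{{{tag:field:...}}}` element syntax described in the linked reference;
the sample log lines, the module name and the build ID are made up for
illustration, and `parse_line` is a hypothetical helper, not part of any
existing tool.

```python
import re

# Markup elements are delimited by triple braces and can sit anywhere in a line.
ELEMENT = re.compile(r"\{\{\{(.*?)\}\}\}")

def parse_line(line):
    """Return a list of (tag, fields) tuples, one per markup element in the line."""
    elements = []
    for body in ELEMENT.findall(line):
        tag, *fields = body.split(":")
        elements.append((tag, fields))
    return elements

# Illustrative log lines only; the values are invented.
log = [
    "[123.456] {{{module:0:libfoo.so:elf:83238ab56ba10497}}}",
    "[123.457] {{{bt:0:0x12345678}}}",
    "plain log text with no markup",
]

for line in log:
    print(parse_line(line))
```

Because each line can be handled on its own, a consumer can pass
non-markup text through untouched and symbolize the rest later.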

The approach you've outlined is a really great way to handle offline
symbolization. However, it only solves part of what I want to solve. I
also want to have a description of the ASan report that is
machine-readable. Having a machine-readable description of the ASan
report allows you to do things like:

* Perform some automated bug-triage. E.g. work out which frame(s)
might be responsible based on the stack trace and the bug-type.
* Create custom user interfaces to display ASan reports.
* Simplify ingesting ASan reports into a database. Such a database
could be used for de-duplication of reports and gathering statistics.

There are probably other things too but these are the first things
that come to mind.
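
As a rough illustration of the triage and de-duplication ideas, here is
a minimal sketch. The JSON layout and every field name in it
(`bug_type`, `access`, `stack`, and so on) are assumptions made up for
this example, not an existing or proposed sanitizer output format, and
the helper functions are hypothetical.

```python
import hashlib
import json

# Hypothetical machine-readable report; all field names are assumptions.
report_json = """
{
  "bug_type": "heap-use-after-free",
  "access": {"kind": "READ", "size": 4},
  "pc": "0x10000f00",
  "stack": [
    {"function": "__asan_report_load4", "module": "libclang_rt.asan.dylib"},
    {"function": "use", "module": "app"},
    {"function": "main", "module": "app"}
  ]
}
"""

RUNTIME_PREFIX = "libclang_rt."

def user_frames(report):
    """Frames that do not belong to the sanitizer runtime."""
    return [f for f in report["stack"] if not f["module"].startswith(RUNTIME_PREFIX)]

def first_user_frame(report):
    """Crude triage: blame the first frame outside the sanitizer runtime."""
    frames = user_frames(report)
    return frames[0] if frames else None

def dedup_key(report, depth=3):
    """Bucket reports by bug type plus the top few non-runtime function names."""
    names = [f["function"] for f in user_frames(report)[:depth]]
    material = report["bug_type"] + "|" + "|".join(names)
    return hashlib.sha256(material.encode()).hexdigest()[:16]

report = json.loads(report_json)
print(first_user_frame(report)["function"], dedup_key(report))
```

The key deliberately ignores addresses such as `pc`, so the same bug hit
in two runs with different load addresses lands in the same bucket.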

> Currently, the markup is consumed by our symbolizer which is a thin wrapper around llvm-symbolizer, but I planned on eventually proposing and implementing support for this format directly in llvm-symbolizer. We support emitting JSON output in our symbolizer wrapper which would be great to have in llvm-symbolizer as well and is in line with the plan to support JSON output in various LLVM tools that has been repeatedly discussed in the past.
>
> Our hope has been that this markup could eventually be adopted by other platforms, and I'd be interested to hear your thoughts. I understand that it may not be a fit for your use cases, but I'd also be interested to hear if there are ways to make it usable for you.
>

Does this JSON output only describe the stacktraces or does it
describe other parts of the ASan report too (e.g. bug type, pc,
read/write, access size, shadow memory contents)?

> Regarding offline symbolization, we use it by default in Fuchsia, and our symbolizer wrapper fetches debug info on demand from our symbol server. We originally used a custom scheme, but recently we started switching to debuginfod, which is being quickly adopted by various binary tools in the GNU ecosystem. I'd like to implement debuginfod support directly in LLVM (see also the recent thread about HTTP client/server libraries in LLVM) and integrate it into tools like llvm-symbolizer, which is also important for bringing llvm-symbolizer on par with addr2line. This would address the offline symbolization use case in a way that doesn't require new tools.

I didn't realise that addr2line could talk to debuginfod so that
sounds like a sensible thing to support in llvm-symbolizer. For Apple
platforms I think we mostly use `atos` instead of llvm-symbolizer
because it supports Swift demangling, but there may be other reasons
that I'm unaware of.