[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports

Kostya Serebryany via llvm-dev llvm-dev at lists.llvm.org
Wed Oct 7 16:24:19 PDT 2020


On Wed, Oct 7, 2020 at 10:23 AM Dan Liew <dan at su-root.co.uk> wrote:

> Hi,
>
> On Tue, 6 Oct 2020 at 18:31, David Blaikie <dblaikie at gmail.com> wrote:
> >
> > My 2c would be to push back a bit more on the "let's not have a machine
> > readable format, but instead parse the human readable format" - it seems
> > like that's going to make the human readable format/parsing fairly
> > brittle/hard to change (I mean, having the parser in tree will help, for
> > sure).
>
> I was operating under the assumption that the decision made in
> https://github.com/google/sanitizers/issues/268 was still the status
> quo. That was six years ago though so I'll let Kostya chime in here if
> he now thinks differently about this.
>

My opinion on the matter hasn't changed, nor has the motivation.
I am opposed to making the sanitizer run-time any more complex,
and I prefer the approach proposed here: a separate, adjacently maintained
parser.

On top of the previous motivation, here is some more.
We are going to have more sanitizer-like things in the near future (Arm MTE
is one of them) that are not necessarily going to be in LLVM and that will
not emit JSON (and they shouldn't: we don't want any such thing in a
production run-time).
But we can support those things with a separate parser.

I have a mild preference for having the parser written as a C++ library with
a C interface rather than in Python, so that it can be used programmatically
without launching a sub-process.
But I don't insist (especially given that the code is already written).

--kcc


>
> Even if we go down the route of having the sanitizers supporting
> machine-readable output I'd still like there to be an in-tree tool
> that supports doing offline symbolication on the machine readable
> output. So there still might be a case for having the proposed
> "llvm-xsan" tool in-tree.
>
> > It'd be interesting to know more about what problems the Valgrind XML
> > format has had and how/whether different solutions would address/avoid
> > those problems. Also might be good to hear about how other tools are
> > parsing the output - whether or not/how they might benefit if it were
> > machine readable to begin with.
>
> Huh. I didn't know Valgrind had an XML format so I can't really
> comment on that (yet).
>
> On my side I can say we have at least two use cases inside Apple where
> we are parsing ASan reports and each use case ended up implementing
> their own parser.
>
> >
> > But, yeah, if that's the direction - having an in-tree tool with fairly
> > narrow uses could be nice. One action to convert human readable reports to
> > json, another to symbolize such a report, a simple tool to render the
> > (symbolized or not) data back into human readable form - then sets it up
> > for other tools to consume that json and, say, render it in a GUI, perform
> > other diagnostics/analysis on the report, etc.
>
> I hadn't thought about a tool to re-render reports in human readable
> form. That's a good idea.
>
>
> > On Tue, Oct 6, 2020 at 6:12 PM Dan Liew via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> >>
> >> # Summary
> >>
> >> Currently the Sanitizer family of runtime bug finding tools (e.g.
> >> Address Sanitizer) provides useful reports of problems upon detection.
> >> This RFC proposes adding tools to
> >>
> >> 1. Parse Sanitizer reports into structured data to make interfacing
> >> with other tools simpler.
> >> 2. Take the Sanitizer reports and “Symbolicate” them. That is, add
> >> missing symbol information (function name, source file, line number)
> >> to the structured data version of the report.
> >>
> >> The initial stubs for the proposal in this RFC are provided in this
> >> patch: https://reviews.llvm.org/D88938 .
> >>
> >> Any thoughts on this RFC or the patch would be appreciated.
> >>
> >> # Issues with the existing solutions
> >>
> >> * An official parser for sanitizer reports does not exist. Currently
> >> we just tell our users to implement their own (e.g. [1]). This creates
> >> an unnecessary duplication of effort.
> >> * The existing symbolizer (asan_symbolize.py) only works with ASan
> >> reports and doesn’t support other sanitizers like TSan.
> >> * The architecture of the existing symbolizer makes it cumbersome to
> >> support inline frames.
> >> * The architecture of the existing symbolizer is sequential which
> >> prevents performing batched symbolication of stack frames.
> >>
> >> # Tools
> >>
> >> The proposed tools would be sub-tools of a new llvm-xsan tool.
> >>
> >> E.g.
> >>
> >> llvm-xsan <subtool>
> >>
> >> Sub-tools will support nesting of sub-tools to allow building
> >> ergonomic tools. E.g.:
> >>
> >> llvm-xsan asan <asan subtool>
> >>
> >> * The tools would be part of compiler-rt and will optionally ship with
> >> this project.
> >> * The tools will be considered experimental while being incrementally
> >> developed on the master branch.
> >> * Functionality of the tools will be maintained via tests in
> >> compiler-rt.
> >>
> >> llvm-xsan could also be used as a vehicle for shipping other Sanitizer
> >> tools in the toolchain in the future.
> >>
> >> ## Parsing tool
> >>
> >> Sanitizer reports are primarily meant to be human readable,
> >> consequently the reports are not structured data (e.g. JSON). This
> >> means that Sanitizer reports are not conveniently machine-readable.
> >>
> >> A request [2] was made in the past to teach the sanitizers to emit a
> >> machine-readable format for reports. This request was denied but an
> >> alternative was proposed where a tool could be provided to convert the
> >> human readable Sanitizer reports into a structured data format. This
> >> proposal will implement this alternative.
> >>
> >> My proposal is that we implement a parser for Sanitizer reports that
> >> converts them into structured data. In particular:
> >>
> >> * The tool is tied to the Clang/compiler-rt runtime that it ships
> >> with. This means the tool will parse Sanitizer reports that come from
> >> binaries built using the corresponding Clang. However the tool is not
> >> required to parse Sanitizer reports that come from different versions
> >> of Clang.
> >> * The tool can also output a schema that describes the structured data
> >> format. This schema would be versioned and would be allowed to change
> >> once the tool moves out of the experimental stage.
> >> * The format of the human readable Sanitizer reports is allowed to
> >> change but the parser should be correspondingly changed when this
> >> happens. This will be enforced with tests.
> >>
> >> The parsing tools would be subtools of the asan, tsan, ubsan subtools.
> >> This would require the user to explicitly communicate the report type
> >> ahead of time. Command line invocation would look something like:
> >>
> >> ```
> >> llvm-xsan asan parse < asan_report.txt > asan_report.json
> >> llvm-xsan tsan parse < tsan_report.txt > tsan_report.json
> >> llvm-xsan ubsan parse < ubsan_report.txt > ubsan_report.json
> >> ```
> >>
> >> The structured data format would be JSON. The schema details still
> >> need to be worked out but the schema will need to cover every type of
> >> issue that a Sanitizer can find.
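
To make the parsing step above concrete, here is a minimal Python sketch. It is not the code from D88938. It only extracts stack frames of the form `#N 0xPC (module+0xOFFSET)`, optionally with `in <function>`, and the JSON field names (`schema_version`, `frames`, `module_offset`, and so on) are assumptions made for illustration, since the RFC leaves the schema undecided. A real parser would also have to cover the error header, allocation/deallocation stacks, and function names containing spaces.

```python
import json
import re

# Matches frames such as:
#   "    #0 0x4f5b1c  (/tmp/a.out+0x4f5b1c)"
#   "    #1 0x7f0a in main (/tmp/a.out+0x1234)"
# The captured field names are hypothetical, not an agreed schema.
FRAME_RE = re.compile(
    r"^\s*#(?P<index>\d+)\s+(?P<pc>0x[0-9a-fA-F]+)"
    r"(?:\s+in\s+(?P<function>\S+))?"
    r"\s+\((?P<module>.+)\+(?P<offset>0x[0-9a-fA-F]+)\)\s*$"
)


def parse_frames(report_text):
    """Return a list of frame dicts for every stack-frame line in the report."""
    frames = []
    for line in report_text.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            continue  # not a stack-frame line (header, blank line, etc.)
        frames.append({
            "index": int(match.group("index")),
            "pc": match.group("pc"),
            "function": match.group("function"),  # None if unsymbolicated
            "module": match.group("module"),
            "module_offset": match.group("offset"),
        })
    return frames


SAMPLE = """\
==1234==ERROR: AddressSanitizer: heap-use-after-free on address 0x602000000010
    #0 0x4f5b1c  (/tmp/a.out+0x4f5b1c)
    #1 0x7f0a1c021b96  (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
"""

if __name__ == "__main__":
    report = {"schema_version": 0, "frames": parse_frames(SAMPLE)}
    print(json.dumps(report, indent=2))
```

Keeping the module path and module-relative offset as separate fields is what would let a later symbolication step run without re-reading the text report.
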
> >>
> >> ## Symbolication tool
> >>
> >> Sanitizer reports include detailed stack traces which show the program
> >> counter (PC) for each frame. PCs are typically not useful to a
> >> developer. Instead they are likely more interested in the function
> >> name, source file and line number that correspond to each of the PCs.
> >> The process of finding the function name, source file and line number
> >> that correspond to a PC is known as “Symbolication”.
> >>
> >> There are two approaches to symbolication, online and offline. Online
> >> symbolication performs Symbolication in the process where the issue
> >> was found by invoking an external tool (e.g. llvm-symbolizer) to
> >> “symbolize” each of the PCs. Offline symbolication performs
> >> symbolication outside the process where the issue was found. The
> >> Sanitizers perform online symbolication by default. This process needs
> >> the debug information to be available at runtime. However this
> >> information might be missing. For example:
> >>
> >> * The instrumented binary might have been stripped of debug info (e.g.
> >> to reduce binary size).
> >> * The PC points inside a system library which has no available debug
> >> info.
> >> * The instrumented binary was built on a different machine. On Apple
> >> platforms debug info lives outside the binary (inside “.dSYM” bundles)
> >> so these might not be copied across from the build machine.
> >>
> >> In these cases online symbolication fails and we are left with a
> >> sanitizer report that is extremely hard for a developer to read.
> >>
> >> To turn the unsymbolicated Sanitizer report into something useful for
> >> a developer, offline symbolication is necessary. However, the existing
> >> infrastructure (asan_symbolize.py) for doing this has some
> >> deficiencies.
> >>
> >> * Only Address Sanitizer reports are supported.
> >> * The current implementation processes each stack frame sequentially.
> >> This does not fit well in contexts where we would like to symbolicate
> >> multiple PCs at a time.
> >> * The current implementation doesn’t provide a way to handle inline
> >> frames (i.e. a PC maps to two or more source locations).
> >>
> >> These problems can be resolved by building new tools on top of the
> >> structured data format. This gives a nice separation of concerns
> >> because parsing the report is now separate from symbolicating the PCs
> >> in it.
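
As a rough illustration of how structured data could address the batching and inline-frame points above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the frame fields (`module`, `module_offset`), the `locations` list, and the `lookup_batch` callback are hypothetical and do not come from the patch. The idea is to group offsets per module, resolve each group in one call, and let each frame carry a list of locations so that one PC can map to several inlined source locations.

```python
from collections import defaultdict


def symbolicate(frames, lookup_batch):
    """Attach source locations to frames using one batched lookup per module.

    lookup_batch(module, offsets) is a hypothetical callback returning
    {offset: [location, ...]}. More than one location for an offset
    represents inlined frames that share a single PC.
    """
    by_module = defaultdict(list)
    for frame in frames:
        by_module[frame["module"]].append(frame["module_offset"])

    resolved = {}
    for module, offsets in by_module.items():
        # Deduplicate so each offset is looked up once per module.
        unique_offsets = sorted(set(offsets))
        for offset, locations in lookup_batch(module, unique_offsets).items():
            resolved[(module, offset)] = locations

    # Frames with no result get an empty list rather than being dropped.
    return [
        {**frame,
         "locations": resolved.get((frame["module"], frame["module_offset"]), [])}
        for frame in frames
    ]


def fake_lookup(module, offsets):
    """Stand-in backend with canned data, used only to exercise the sketch."""
    table = {
        ("/tmp/a.out", "0x10"): [
            {"function": "inlined_helper", "file": "a.c", "line": 3},
            {"function": "main", "file": "a.c", "line": 9},
        ],
    }
    return {o: table[(module, o)] for o in offsets if (module, o) in table}
```

With this shape, a frame whose PC cannot be resolved simply keeps an empty `locations` list, so a second pass with a different backend remains possible.
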
> >>
> >> The symbolication tools would be subtools of the asan, tsan, ubsan
> >> subtools. This would require the user to explicitly communicate the
> >> report type ahead of time. Command line invocation would look
> >> something like:
> >>
> >> ```
> >> llvm-xsan asan symbolicate < asan_report.json > asan_report_symbolicated.json
> >> llvm-xsan tsan symbolicate < tsan_report.json > tsan_report_symbolicated.json
> >> llvm-xsan ubsan symbolicate < ubsan_report.json > ubsan_report_symbolicated.json
> >> ```
> >>
> >> There are multiple ways to perform symbolication (some of which are
> >> platform specific). As with asan_symbolize.py, the plan would be to
> >> support multiple symbolication backends (that can also be chained
> >> together) that are specified via command line options.
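
One way to picture chained backends is the following minimal Python sketch. The backend names and the `(module, offset) -> list of locations` signature are hypothetical, chosen only for illustration; the RFC does not define a backend interface or the command line options that would select one. Each backend is tried in order and the first non-empty answer wins.

```python
def chain_backends(*backends):
    """Combine backends into one lookup that tries each in order.

    Each backend is a hypothetical callable taking (module, offset) and
    returning a list of locations, or an empty list if it cannot resolve
    the address.
    """
    def lookup(module, offset):
        for backend in backends:
            locations = backend(module, offset)
            if locations:
                return locations
        return []  # no backend could resolve this address
    return lookup


def debug_info_backend(module, offset):
    """Stand-in for a backend with full debug info for a single binary."""
    if module == "/tmp/a.out" and offset == "0x10":
        return [{"function": "main", "file": "a.c", "line": 9}]
    return []


def symbol_table_backend(module, offset):
    """Stand-in for a coarser fallback that only knows function names."""
    if module == "/lib/libc.so.6":
        return [{"function": "__libc_start_main", "file": None, "line": None}]
    return []


if __name__ == "__main__":
    lookup = chain_backends(debug_info_backend, symbol_table_backend)
    print(lookup("/tmp/a.out", "0x10"))
    print(lookup("/lib/libc.so.6", "0x20"))
    print(lookup("/opt/unknown.so", "0x30"))
```

Ordering the chain from most to least precise means a coarse fallback only answers when nothing better is available.
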
> >>
> >> [1] https://github.com/dobin/asanparser/blob/master/asanparser.py
> >> [2] https://github.com/google/sanitizers/issues/268
> >>
> >> Thanks,
> >> Dan.
> >> _______________________________________________
> >> LLVM Developers mailing list
> >> llvm-dev at lists.llvm.org
> >> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>