[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports

Wed Oct 7 08:17:33 PDT 2020

I agree with this.  We should just support a machine readable format, 
and build a tooling ecosystem around that.

Just make sure to include a version id in the format from the beginning 
so that we can change it.  :)
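
To make the version id point concrete, here is a minimal Python sketch of what a consumer could do with it. The field name `schema_version` and the set of supported versions are assumptions for illustration only; the RFC has not defined the schema yet.

```python
import json

# Hypothetical: the report format versions this consumer understands.
SUPPORTED_VERSIONS = {1}


def load_report(text):
    """Parse a JSON report and reject versions we do not understand."""
    report = json.loads(text)
    version = report.get("schema_version")  # assumed field name
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported report schema version: {version!r}")
    return report
```

With the id present from day one, an old consumer fails loudly on a newer report instead of silently misreading it.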

Philip

On 10/6/20 6:31 PM, David Blaikie via llvm-dev wrote:
> My 2c would be to push back a bit more on the "let's not have a 
> machine readable format, but instead parse the human readable format" 
> - it seems like that's going to make the human readable format/parsing 
> fairly brittle/hard to change (I mean, having the parser in tree will 
> help, for sure). It'd be interesting to know more about what problems 
> the valgrind XML format has had and how/whether different solutions 
> would address/avoid those problems. Also might be good to hear about 
> how other tools are parsing the output - whether or not/how they might 
> benefit if it were machine readable to begin with.
>
> But, yeah, if that's the direction - having an in-tree tool with 
> fairly narrow uses could be nice. One action to convert human readable 
> reports to JSON, another to symbolize such a report, and a simple tool to 
> render the (symbolized or not) data back into human readable form. 
> That then sets it up for other tools to consume that JSON and, say, 
> render it in a GUI, perform other diagnostics/analysis on the report, etc.
>
> On Tue, Oct 6, 2020 at 6:12 PM Dan Liew via llvm-dev 
> <llvm-dev at lists.llvm.org> wrote:
>
>     # Summary
>
>     Currently the Sanitizer family of runtime bug finding tools (e.g.
>     Address Sanitizer) provides useful reports of problems upon detection.
>     This RFC proposes adding tools to:
>
>     1. Parse Sanitizer reports into structured data to make interfacing
>     with other tools simpler.
>     2. Take the Sanitizer reports and “Symbolicate” them. That is, add
>     missing symbol information (function name, source file, line number)
>     to the structured data version of the report.
>
>     The initial stubs for the proposal in this RFC are provided in this
>     patch: https://reviews.llvm.org/D88938 .
>
>     Any thoughts on this RFC or the patch would be appreciated.
>
>     # Issues with the existing solutions
>
>     * An official parser for sanitizer reports does not exist. Currently
>     we just tell our users to implement their own (e.g. [1]). This creates
>     an unnecessary duplication of effort.
>     * The existing symbolizer (asan_symbolize.py) only works with ASan
>     reports and doesn’t support other sanitizers like TSan.
>     * The architecture of the existing symbolizer makes it cumbersome to
>     support inline frames.
>     * The architecture of the existing symbolizer is sequential which
>     prevents performing batched symbolication of stack frames.
>
>     # Tools
>
>     The proposed tools would be sub-tools of a new llvm-xsan tool.
>
>     E.g.
>
>     llvm-xsan <subtool>
>
>     Sub-tools will support nesting of sub-tools to allow building
>     ergonomic tools. E.g.:
>
>     llvm-xsan asan <asan subtool>
>
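The nesting described above can be sketched with Python's `argparse`. This is only an illustration of the command-line shape: the RFC does not say what language the tool would be written in, and the `parse` and `symbolicate` actions are the ones proposed further down in the RFC.

```python
import argparse


def build_parser():
    """Build a two-level CLI: llvm-xsan <sanitizer> <action>."""
    parser = argparse.ArgumentParser(prog="llvm-xsan")
    tools = parser.add_subparsers(dest="sanitizer", required=True)
    for name in ("asan", "tsan", "ubsan"):
        sanitizer = tools.add_parser(name)
        # Each sanitizer sub-tool carries its own nested sub-tools.
        actions = sanitizer.add_subparsers(dest="action", required=True)
        actions.add_parser("parse")
        actions.add_parser("symbolicate")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.sanitizer, args.action)
```

Other Sanitizer tools could later be added as further entries at either level without changing the existing commands.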
>     * The tools would be part of compiler-rt and will optionally ship with
>     this project.
>     * The tools will be considered experimental while being incrementally
>     developed on the master branch.
>     * Functionality of the tools will be maintained via tests in
>     compiler-rt.
>
>     llvm-xsan could also be used as a vehicle for shipping other Sanitizer
>     tools in the toolchain in the future.
>
>     ## Parsing tool
>
>     Sanitizer reports are primarily meant to be human readable,
>     consequently the reports are not structured data (e.g. JSON). This
>     means that Sanitizer reports are not conveniently machine-readable.
>
>     A request [2] was made in the past to teach the sanitizers to emit a
>     machine-readable format for reports. This request was denied but an
>     alternative was proposed where a tool could be provided to convert the
>     human readable Sanitizer reports into a structured data format. This
>     proposal will implement this alternative.
>
>     My proposal is that we implement a parser for Sanitizer reports that
>     converts them into structured data. In particular:
>
>     * The tool is tied to the Clang/compiler-rt runtime that it ships
>     with. This means the tool will parse Sanitizer reports that come from
>     binaries built using the corresponding Clang. However the tool is not
>     required to parse Sanitizer reports that come from different versions
>     of Clang.
>     * The tool can also output a schema that describes the structured data
>     format. This schema would be versioned and would be allowed to change
>     once the tool moves out of the experimental stage.
>     * The format of the human readable Sanitizer reports is allowed to
>     change but the parser should be correspondingly changed when this
>     happens. This will be enforced with tests.
>
>     The parsing tools would be subtools of the asan, tsan, ubsan subtools.
>     This would require the user to explicitly communicate the report type
>     ahead of time. Command line invocation would look something like:
>
>     ```
>     llvm-xsan asan parse < asan_report.txt > asan_report.json
>     llvm-xsan tsan parse < tsan_report.txt > tsan_report.json
>     llvm-xsan ubsan parse < ubsan_report.txt > ubsan_report.json
>     ```
>
>     The structured data format would be JSON. The schema details still
>     need to be worked out but the schema will need to cover every type of
>     issue that a Sanitizer can find.
>
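As a rough idea of what the `parse` step involves, here is a small Python sketch that turns the unsymbolized part of an ASan report into JSON. The output layout (`bug_type`, `stacks`, `pc`, `module`, `offset`) is made up for illustration and is not the schema the RFC would define. It only handles frames printed as `#N 0xPC  (module+0xOFFSET)` and takes the first word after `AddressSanitizer:` as the bug type.

```python
import json
import re
import sys

# "==<pid>==ERROR: AddressSanitizer: <bug type> ..."
HEADER_RE = re.compile(r"^==(?P<pid>\d+)==ERROR: AddressSanitizer: (?P<bug_type>\S+)")
# Unsymbolized frame, e.g. "    #0 0x4f5a3c  (/tmp/a.out+0x4f5a3c)"
FRAME_RE = re.compile(
    r"^\s*#(?P<index>\d+)\s+(?P<pc>0x[0-9a-fA-F]+)"
    r"\s+\((?P<module>.+)\+(?P<offset>0x[0-9a-fA-F]+)\)\s*$"
)


def parse_asan_report(text):
    """Collect the bug type and every stack trace found in the report."""
    report = {"sanitizer": "asan", "bug_type": None, "stacks": []}
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            report["bug_type"] = header.group("bug_type")
            continue
        frame = FRAME_RE.match(line)
        if not frame:
            continue  # anything this sketch does not understand is skipped
        # Frame #0 marks the start of a new stack trace.
        if int(frame.group("index")) == 0 or not report["stacks"]:
            report["stacks"].append([])
        report["stacks"][-1].append({
            "pc": frame.group("pc"),
            "module": frame.group("module"),
            "offset": frame.group("offset"),
        })
    return report


if __name__ == "__main__":
    json.dump(parse_asan_report(sys.stdin.read()), sys.stdout, indent=2)
```

A real parser would also need the access/allocation/free context around each stack and the symbolized frame forms, which is exactly the kind of detail a shared in-tree tool would save everyone from reimplementing.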
>     ## Symbolication tool
>
>     Sanitizer reports include detailed stack traces which show the program
>     counter (PC) for each frame. PCs are typically not useful to a
>     developer, who is likely more interested in the function name,
>     source file and line number that correspond to each of the PCs.
>     The process of finding the function name, source file and line number
>     that correspond to a PC is known as “Symbolication”.
>
>     There are two approaches to symbolication, online and offline. Online
>     symbolication performs symbolication in the process where the issue
>     was found by invoking an external tool (e.g. llvm-symbolizer) to
>     “symbolize” each of the PCs. Offline symbolication performs
>     symbolication outside the process where the issue was found. The
>     Sanitizers perform online symbolication by default. This process needs
>     the debug information to be available at runtime. However this
>     information might be missing. For example:
>
>     * The instrumented binary might have been stripped of debug info (e.g.
>     to reduce binary size).
>     * The PC points inside a system library which has no available
>     debug info.
>     * The instrumented binary was built on a different machine. On Apple
>     platforms debug info lives outside the binary (inside “.dSYM” bundles)
>     so these might not be copied across from the build machine.
>
>     In these cases online symbolication fails and we are left with a
>     sanitizer report that is extremely hard for a developer to read.
>
>     To turn the unsymbolicated Sanitizer report into something useful for
>     a developer, offline symbolication is necessary. However, the existing
>     infrastructure (asan_symbolize.py) for doing this has some
>     deficiencies.
>
>     * Only Address Sanitizer reports are supported.
>     * The current implementation processes each stackframe sequentially.
>     This does not fit well in contexts where we would like to symbolicate
>     multiple PCs at a time.
>     * The current implementation doesn’t provide a way to handle inline
>     frames (i.e. a PC maps to two or more source locations).
>
>     These problems can be resolved by building new tools on top of the
>     structured data format. This gives a nice separation of concerns
>     because parsing the report is now separate from symbolicating the PCs
>     in it.
>
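Batching and inline frames both become straightforward once the frames are structured data. The Python sketch below groups frames by module so that each module needs a single `llvm-symbolizer` invocation, and keeps every inlined frame reported for an address. The frame keys `module` and `offset` are hypothetical; the output parsing assumes llvm-symbolizer's default output style (function line, then location line, with a blank line between addresses).

```python
import subprocess
from collections import defaultdict


def group_by_module(frames):
    """Map each module path to the list of offsets to look up in it."""
    groups = defaultdict(list)
    for frame in frames:
        groups[frame["module"]].append(frame["offset"])
    return dict(groups)


def parse_symbolizer_output(output):
    """Split the symbolizer output into one result per address.

    Each result is a list of (function, location) pairs; it has more
    than one pair when the address maps to inlined calls.
    """
    results = []
    for block in output.strip().split("\n\n"):
        lines = block.splitlines()
        results.append(list(zip(lines[0::2], lines[1::2])))
    return results


def symbolicate_module(module, offsets, symbolizer="llvm-symbolizer"):
    """Symbolicate every offset of one module with a single invocation."""
    cmd = [symbolizer, f"--obj={module}", *offsets]
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_symbolizer_output(done.stdout)
```

Because the grouping happens on the parsed report rather than on text lines, the same code would apply to ASan, TSan and UBSan reports alike.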
>     The symbolication tools would be subtools of the asan, tsan, ubsan
>     subtools. This would require the user to explicitly communicate the
>     report type ahead of time. Command line invocation would look
>     something like:
>
>     ```
>     llvm-xsan asan symbolicate < asan_report.json > asan_report_symbolicated.json
>     llvm-xsan tsan symbolicate < tsan_report.json > tsan_report_symbolicated.json
>     llvm-xsan ubsan symbolicate < ubsan_report.json > ubsan_report_symbolicated.json
>     ```
>
>     There are multiple ways to perform symbolication (some of which are
>     platform specific). Like asan_symbolize.py the plan would be to
>     support multiple symbolication backends (that can also be chained
>     together) that are specified via command line options.
>
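Chaining backends could look roughly like the Python sketch below: each backend is tried in order and the first one that produces a result wins. The class name, the callable-based backend interface and the example backends are all hypothetical; nothing here is taken from the proposed tool.

```python
class ChainedSymbolizer:
    """Try each backend in order until one returns a result."""

    def __init__(self, backends):
        # Each backend is a callable: (module, offset) -> result or None.
        self.backends = list(backends)

    def symbolize(self, module, offset):
        for backend in self.backends:
            result = backend(module, offset)
            if result is not None:
                return result
        return None  # no backend could symbolize this frame


if __name__ == "__main__":
    # Toy backends standing in for real ones (e.g. a platform-specific
    # tool first, with a generic fallback second).
    def known_only(module, offset):
        return "main" if (module, offset) == ("/tmp/a.out", "0x10") else None

    def fallback(module, offset):
        return f"{module}+{offset}"

    chain = ChainedSymbolizer([known_only, fallback])
    print(chain.symbolize("/tmp/a.out", "0x10"))
    print(chain.symbolize("/lib/libc.so.6", "0x20"))
```

The order of the backends on the command line would then simply be the order of the list.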
>     [1] https://github.com/dobin/asanparser/blob/master/asanparser.py
>     [2] https://github.com/google/sanitizers/issues/268
>
>     Thanks,
>     Dan.
>     _______________________________________________
>     LLVM Developers mailing list
>     llvm-dev at lists.llvm.org
>     https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev