[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports
Philip Reames via llvm-dev
llvm-dev at lists.llvm.org
Wed Oct 7 08:17:33 PDT 2020
I agree with this. We should just support a machine readable format,
and build a tooling ecosystem around that.
Just make sure to include a version id in the format from the beginning
so that we can change it. :)
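
To make the version-id point concrete, here is a minimal sketch of how a consumer might gate on it. The `schema_version` field name and the "major.minor" convention are assumptions for illustration only; the RFC does not define either.

```python
import json

# Hypothetical: the only major schema version this consumer understands.
SUPPORTED_MAJOR = 1


def load_report(text):
    """Parse a JSON report and reject schema versions we do not understand."""
    report = json.loads(text)
    version = report.get("schema_version")  # hypothetical field name
    if version is None:
        raise ValueError("report has no schema_version field")
    # Treat the part before the first dot as the major version.
    major = int(str(version).split(".")[0])
    if major != SUPPORTED_MAJOR:
        raise ValueError(f"unsupported schema version: {version}")
    return report
```

With the field present from the first release, a consumer can fail loudly on an incompatible report instead of misreading it.
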
Philip
On 10/6/20 6:31 PM, David Blaikie via llvm-dev wrote:
> My 2c would be to push back a bit more on the "let's not have a
> machine readable format, but instead parse the human readable format"
> - it seems like that's going to make the human readable format/parsing
> fairly brittle/hard to change (I mean, having the parser in tree will
> help, for sure). It'd be interesting to know more about what problems
> the valgrind XML format has had and how/whether different solutions
> would address/avoid those problems. Also might be good to hear about
> how other tools are parsing the output - whether or not/how they might
> benefit if it were machine readable to begin with.
>
> But, yeah, if that's the direction - having an in-tree tool with
> fairly narrow uses could be nice. One action to convert human readable
> reports to json, another to symbolize such a report, a simple tool to
> render the (symbolized or not) data back into human readable form
> - then sets it up for other tools to consume that json and, say,
> render it in a GUI, perform other diagnostics/analysis on the report, etc.
>
> On Tue, Oct 6, 2020 at 6:12 PM Dan Liew via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
> # Summary
>
> Currently the Sanitizer family of runtime bug finding tools (e.g.
> Address Sanitizer) provide useful reports of problems upon detection.
> This RFC proposes adding tools to
>
> 1. Parse Sanitizer reports into structured data to make interfacing
> with other tools simpler.
> 2. Take the Sanitizer reports and “Symbolicate” them. That is, add
> missing symbol information (function name, source file, line number)
> to the structured data version of the report.
>
> The initial stubs for the proposal in this RFC are provided in this
> patch: https://reviews.llvm.org/D88938 .
>
> Any thoughts on this RFC or on the patch would be appreciated.
>
> # Issues with the existing solutions
>
> * An official parser for sanitizer reports does not exist. Currently
> we just tell our users to implement their own (e.g. [1]). This creates
> an unnecessary duplication of effort.
> * The existing symbolizer (asan_symbolize.py) only works with ASan
> reports and doesn’t support other sanitizers like TSan.
> * The architecture of the existing symbolizer makes it cumbersome to
> support inline frames.
> * The architecture of the existing symbolizer is sequential which
> prevents performing batched symbolication of stack frames.
>
> # Tools
>
> The proposed tools would be sub-tools of a new llvm-xsan tool.
>
> E.g.
>
> llvm-xsan <subtool>
>
> Sub-tools will support nesting of sub-tools to allow building
> ergonomic tools. E.g.:
>
> llvm-xsan asan <asan subtool>
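
The nesting described here can be illustrated with a small command-line sketch. This uses Python's `argparse` purely for illustration and says nothing about how the real tool would be implemented; the `parse` and `symbolicate` action names are taken from the invocation examples in this RFC.

```python
import argparse


def build_parser():
    """Build a two-level sub-command parser: <sanitizer> <action>."""
    parser = argparse.ArgumentParser(prog="llvm-xsan")
    sanitizers = parser.add_subparsers(dest="sanitizer", required=True)
    for name in ("asan", "tsan", "ubsan"):
        sub = sanitizers.add_parser(name)
        # Each sanitizer sub-tool carries its own nested sub-tools.
        actions = sub.add_subparsers(dest="action", required=True)
        actions.add_parser("parse")
        actions.add_parser("symbolicate")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.sanitizer, args.action)
```

Because each sanitizer owns its nested parser, a sub-tool that only makes sense for one sanitizer can be added without touching the others.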
>
> * The tools would be part of compiler-rt and will optionally ship with
> this project.
> * The tools will be considered experimental while being incrementally
> developed on the master branch.
> * Functionality of the tools will be maintained via tests in
> compiler-rt.
>
> llvm-xsan could be also used as a vehicle for shipping other Sanitizer
> tools in the toolchain in the future.
>
> ## Parsing tool
>
> Sanitizer reports are primarily meant to be human readable,
> consequently the reports are not structured data (e.g. JSON). This
> means that Sanitizer reports are not conveniently machine-readable.
>
> A request [2] was made in the past to teach the sanitizers to emit a
> machine-readable format for reports. This request was denied but an
> alternative was proposed where a tool could be provided to convert the
> human readable Sanitizer reports into a structured data format. This
> proposal will implement this alternative.
>
> My proposal is that we implement a parser for Sanitizer reports that
> converts them into structured data. In particular:
>
> * The tool is tied to the Clang/compiler-rt runtime that it ships
> with. This means the tool will parse Sanitizer reports that come from
> binaries built using the corresponding Clang. However the tool is not
> required to parse Sanitizer reports that come from different versions
> of Clang.
> * The tool can also output a schema that describes the structured data
> format. This schema would be versioned and would be allowed to change
> once the tool moves out of the experimental stage.
> * The format of the human readable Sanitizer reports is allowed to
> change but the parser should be correspondingly changed when this
> happens. This will be enforced with tests.
>
> The parsing tools would be subtools of the asan, tsan, ubsan subtools.
> This would require the user to explicitly communicate the report type
> ahead of time. Command line invocation would look something like:
>
> ```
> llvm-xsan asan parse < asan_report.txt > asan_report.json
> llvm-xsan tsan parse < tsan_report.txt > tsan_report.json
> llvm-xsan ubsan parse < ubsan_report.txt > ubsan_report.json
> ```
>
> The structured data format would be JSON. The schema details still
> need to be worked out but the schema will need to cover every type of
> issue that a Sanitizer can find.
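
As a rough idea of what the parsing step involves, the sketch below pulls stack frames out of report text. It is an assumption-laden illustration: it only recognises two simplified frame shapes (`#N 0xPC  (module+0xoffset)` and `#N 0xPC in function file:line:col`), it ignores everything else in a report, and the output keys are hypothetical since the schema is not yet defined.

```python
import re

# Matches frame lines such as:
#   "    #0 0x4f5b2a  (/tmp/a.out+0xf5b2a)"
#   "    #1 0x4f5c10 in main /tmp/uaf.c:7:3"
FRAME_RE = re.compile(
    r"^\s*#(?P<index>\d+) (?P<pc>0x[0-9a-f]+)"
    r"(?: in (?P<function>\S+))?"
    r"\s+(?P<location>.+)$"
)
MODULE_RE = re.compile(r"^\((?P<module>.+)\+(?P<offset>0x[0-9a-f]+)\)$")


def parse_frames(report_text):
    """Return a list of frame dicts (hypothetical layout) found in the text."""
    frames = []
    for line in report_text.splitlines():
        m = FRAME_RE.match(line)
        if not m:
            continue  # not a frame line (headers, shadow memory dump, ...)
        frame = {
            "index": int(m["index"]),
            "pc": m["pc"],
            "function": m["function"],  # None when unsymbolicated
        }
        location = m["location"].strip()
        mod = MODULE_RE.match(location)
        if mod:
            # Unsymbolicated frame: keep module and offset for later lookup.
            frame["module"] = mod["module"]
            frame["offset"] = mod["offset"]
        else:
            frame["source"] = location
        frames.append(frame)
    return frames
```

Passing the result to `json.dumps` would give the kind of output the `parse` commands above are meant to write. A real parser would also need to keep the separate stack traces in a report apart (for example the access, free, and allocation stacks in a use-after-free), which this flat list does not do.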
>
> ## Symbolication tool
>
> Sanitizer reports include detailed stack traces which show the program
> counter (PC) for each frame. PCs are typically not useful to a
> developer. Instead they are likely more interested in the function
> name, source file and line number that correspond to each of the PCs.
> The process of finding the function name, source file and line number
> that correspond to a PC is known as “Symbolication”.
>
> There are two approaches to symbolication, online and offline. Online
> symbolication performs symbolication in the process where the issue
> was found by invoking an external tool (e.g. llvm-symbolizer) to
> “symbolize” each of the PCs. Offline symbolication performs
> symbolication outside the process where the issue was found. The
> Sanitizers perform online symbolication by default. This process needs
> the debug information to be available at runtime. However this
> information might be missing. For example:
>
> * The instrumented binary might have been stripped of debug info (e.g.
> to reduce binary size).
> * The PC points inside a system library which has no available
> debug info.
> * The instrumented binary was built on a different machine. On Apple
> platforms debug info lives outside the binary (inside “.dSYM” bundles)
> so these might not be copied across from the build machine.
>
> In these cases online symbolication fails and we are left with a
> sanitizer report that is extremely hard for a developer to read.
>
> To turn the unsymbolicated Sanitizer report into something useful for
> a developer, offline symbolication is necessary. However, the existing
> infrastructure (asan_symbolize.py) for doing this has some
> deficiencies.
>
> * Only Address Sanitizer reports are supported.
> * The current implementation processes each stack frame sequentially.
> This does not fit well in contexts where we would like to symbolicate
> multiple PCs at a time.
> * The current implementation doesn’t provide a way to handle inline
> frames (i.e. a PC maps to two or more source locations).
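
The last two points can be sketched together: group frames by module so that each module is handed to a backend once, and let every PC map to a list of locations so inline frames fit naturally. The frame layout (`module`, `offset`, `locations` keys) and the `symbolize_module` callback signature are assumptions made for this example.

```python
from collections import defaultdict


def batch_symbolicate(frames, symbolize_module):
    """Symbolicate frames with one backend call per module.

    `frames` is a list of dicts with "module" and "offset" keys.
    `symbolize_module(module, offsets)` is a caller-supplied backend that
    returns, for each offset, a list of source locations. More than one
    entry in that list means the PC falls inside inlined code.
    """
    # Remember which frame indices belong to which module.
    by_module = defaultdict(list)
    for i, frame in enumerate(frames):
        by_module[frame["module"]].append(i)

    # Copy the frames so the input is left untouched.
    result = [dict(f) for f in frames]
    for module, indices in by_module.items():
        offsets = [frames[i]["offset"] for i in indices]
        locations = symbolize_module(module, offsets)
        for i, locs in zip(indices, locations):
            result[i]["locations"] = locs
    return result
```

A backend that wraps an external symbolizer could then pay the start-up and debug-info loading cost once per module rather than once per frame.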
>
> These problems can be resolved by building new tools on top of the
> structured data format. This gives a nice separation of concerns
> because parsing the report is now separate from symbolicating the PCs
> in it.
>
> The symbolication tools would be subtools of the asan, tsan, ubsan
> subtools. This would require the user to explicitly communicate the
> report type ahead of time. Command line invocation would look
> something like:
>
> ```
> llvm-xsan asan symbolicate < asan_report.json > asan_report_symbolicated.json
> llvm-xsan tsan symbolicate < tsan_report.json > tsan_report_symbolicated.json
> llvm-xsan ubsan symbolicate < ubsan_report.json > ubsan_report_symbolicated.json
> ```
>
> There are multiple ways to perform symbolication (some of which are
> platform specific). Like asan_symbolize.py the plan would be to
> support multiple symbolication backends (that can also be chained
> together) that are specified via command line options.
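
One possible reading of "chained together" is a fallback chain: each backend is asked only about the offsets that earlier backends could not resolve. The sketch below assumes a hypothetical backend signature `backend(module, offsets)` that returns one list of locations per offset, with an empty list (or `None`) meaning "not resolved". None of these names come from the RFC or from asan_symbolize.py.

```python
def chain_backends(*backends):
    """Combine backends so each offset is resolved by the first one that can."""

    def symbolize_module(module, offsets):
        results = [[] for _ in offsets]
        for backend in backends:
            # Only forward the offsets that are still unresolved.
            pending = [i for i, locs in enumerate(results) if not locs]
            if not pending:
                break
            answers = backend(module, [offsets[i] for i in pending])
            for i, locs in zip(pending, answers):
                results[i] = locs or []
        return results

    return symbolize_module
```

The order of the arguments would then correspond to the order in which backends are given on the command line, with platform-specific backends first and a generic one as the last resort.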
>
> [1] https://github.com/dobin/asanparser/blob/master/asanparser.py
> [2] https://github.com/google/sanitizers/issues/268
>
> Thanks,
> Dan.
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev