[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports

Wed Oct 7 04:14:54 PDT 2020

There was a refactoring a few years after [2] that organized all ASan
reports into simple structs so they can be viewed in a debugger. It should be
quite straightforward to serialize them into JSON.
If this is going to be part of compiler-rt and we have to maintain it, I'd
prefer to maintain direct JSON serialization rather than a report->JSON
converter.

On Tue, 6 Oct 2020 at 18:32, David Blaikie via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> My 2c would be to push back a bit more on the "let's not have a machine
> readable format, but instead parse the human readable format" - it seems
> like that's going to make the human readable format/parsing fairly
> brittle/hard to change (I mean, having the parser in tree will help, for
> sure). It'd be interesting to know more about what problems the valgrind
> XML format has had and how/whether different solutions would address/avoid
> those problems. Also might be good to hear about how other tools are
> parsing the output - whether or not/how they might benefit if it were
> machine readable to begin with.
>
> But, yeah, if that's the direction - having an in-tree tool with fairly
> narrow uses could be nice. One action to convert human readable reports to
> json, another to symbolize such a report, a simple tool to render the
> (symbolized or not) data back into human readable form - then sets it up
> for other tools to consume that json and, say, render it in a GUI, perform
> other diagnostics/analysis on the report, etc.
>
> On Tue, Oct 6, 2020 at 6:12 PM Dan Liew via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> # Summary
>>
>> Currently the Sanitizer family of runtime bug finding tools (e.g.
>> Address Sanitizer) provide useful reports of problems upon detection.
>> This RFC proposes adding tools to
>>
>> 1. Parse Sanitizer reports into structured data to make interfacing
>> with other tools simpler.
>> 2. Take the Sanitizer reports and “Symbolicate” them. That is, add
>> missing symbol information (function name, source file, line number)
>> to the structured data version of the report.
>>
>> The initial stubs for the proposal in this RFC are provided in this
>> patch: https://reviews.llvm.org/D88938 .
>>
>> Any thoughts on this RFC or the patch would be appreciated.
>>
>> # Issues with the existing solutions
>>
>> * An official parser for sanitizer reports does not exist. Currently
>> we just tell our users to implement their own (e.g. [1]). This creates
>> an unnecessary duplication of effort.
>> * The existing symbolizer (asan_symbolize.py) only works with ASan
>> reports and doesn’t support other sanitizers like TSan.
>> * The architecture of the existing symbolizer makes it cumbersome to
>> support inline frames.
>> * The architecture of the existing symbolizer is sequential which
>> prevents performing batched symbolication of stack frames.
>>
>> # Tools
>>
>> The proposed tools would be sub-tools of a new llvm-xsan tool.
>>
>> E.g.
>>
>> llvm-xsan <subtool>
>>
>> Sub-tools will support nesting of sub-tools to allow building
>> ergonomic tools. E.g.:
>>
>> llvm-xsan asan <asan subtool>
>>
>> * The tools would be part of compiler-rt and will optionally ship with
>> this project.
>> * The tools will be considered experimental while being incrementally
>> developed on the master branch.
>> * Functionality of the tools will be maintained via tests in
>> compiler-rt.
>>
>> llvm-xsan could also be used as a vehicle for shipping other Sanitizer
>> tools in the toolchain in the future.
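
To make the nested sub-tool layout concrete, here is a minimal sketch using
Python's argparse. It only illustrates the `llvm-xsan <sanitizer> <action>`
command shape from the RFC; it is not how the actual tool is implemented, and
`build_parser` and `dispatch` are hypothetical names.

```python
import argparse

# Sub-tool and action names taken from the RFC's example invocations.
SANITIZERS = ("asan", "tsan", "ubsan")
ACTIONS = ("parse", "symbolicate")


def build_parser():
    """Build a two-level command line: llvm-xsan <sanitizer> <action>."""
    parser = argparse.ArgumentParser(prog="llvm-xsan")
    tools = parser.add_subparsers(dest="tool", required=True)
    for name in SANITIZERS:
        tool = tools.add_parser(name)
        # Each sanitizer sub-tool carries its own nested sub-tools.
        actions = tool.add_subparsers(dest="action", required=True)
        for action in ACTIONS:
            actions.add_parser(action)
    return parser


def dispatch(argv):
    """Return the (sanitizer, action) pair selected on the command line."""
    args = build_parser().parse_args(argv)
    return args.tool, args.action
```

Calling `dispatch(["asan", "parse"])` selects the ASan parsing sub-tool; a
real tool would go on to read the report from stdin at that point.
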
>>
>> ## Parsing tool
>>
>> Sanitizer reports are primarily meant to be human readable,
>> consequently the reports are not structured data (e.g. JSON). This
>> means that Sanitizer reports are not conveniently machine-readable.
>>
>> A request [2] was made in the past to teach the sanitizers to emit a
>> machine-readable format for reports. This request was denied but an
>> alternative was proposed where a tool could be provided to convert the
>> human readable Sanitizer reports into a structured data format. This
>> proposal will implement this alternative.
>>
>> My proposal is that we implement a parser for Sanitizer reports that
>> converts them into structured data. In particular:
>>
>> * The tool is tied to the Clang/compiler-rt runtime that it ships
>> with. This means the tool will parse Sanitizer reports that come from
>> binaries built using the corresponding Clang. However the tool is not
>> required to parse Sanitizer reports that come from different versions
>> of Clang.
>> * The tool can also output a schema that describes the structured data
>> format. This schema would be versioned and would be allowed to change
>> once the tool moves out of the experimental stage.
>> * The format of the human readable Sanitizer reports is allowed to
>> change but the parser should be correspondingly changed when this
>> happens. This will be enforced with tests.
>>
>> The parsing tools would be subtools of the asan, tsan, ubsan subtools.
>> This would require the user to explicitly communicate the report type
>> ahead of time. Command line invocation would look something like:
>>
>> ```
>> llvm-xsan asan parse < asan_report.txt > asan_report.json
>> llvm-xsan tsan parse < tsan_report.txt > tsan_report.json
>> llvm-xsan ubsan parse < ubsan_report.txt > ubsan_report.json
>> ```
>>
>> The structured data format would be JSON. The schema details still
>> need to be worked out but the schema will need to cover every type of
>> issue that a Sanitizer can find.
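
As a rough illustration of the parsing step, the sketch below extracts only
unsymbolicated stack frames of the form `#N 0xPC (module+0xoffset)` from a
report and emits JSON. The schema has not been designed yet, so the field
names (`schema_version`, `frames`, `index`, `pc`, `module`, `module_offset`)
are assumptions made up for this example, and a real parser would also need
to handle symbolized frames, error headers, and so on.

```python
import json
import re

# Matches unsymbolicated frames such as:
#     #0 0x4f5e3a (/tmp/a.out+0x4f5e3a)
FRAME_RE = re.compile(
    r"^\s*#(?P<index>\d+)\s+(?P<pc>0x[0-9a-fA-F]+)\s+"
    r"\((?P<module>.+)\+(?P<offset>0x[0-9a-fA-F]+)\)\s*$"
)


def parse_frames(report_text):
    """Collect every unsymbolicated frame line; other lines are skipped."""
    frames = []
    for line in report_text.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            continue
        frames.append({
            "index": int(match.group("index")),
            "pc": match.group("pc"),
            "module": match.group("module"),
            "module_offset": match.group("offset"),
        })
    return frames


def report_to_json(report_text):
    """Wrap the frames in a versioned envelope (hypothetical schema)."""
    document = {"schema_version": 0, "frames": parse_frames(report_text)}
    return json.dumps(document, indent=2)
```

Keeping addresses as hex strings rather than integers is a choice made here
so that 64-bit values survive JSON consumers that only have doubles.
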
>>
>> ## Symbolication tool
>>
>> Sanitizer reports include detailed stack traces which show the program
>> counter (PC) for each frame. PCs are typically not useful to a
>> developer. Instead they are likely more interested in the function
>> name, source file and line number that correspond to each of the PCs.
>> The process of finding the function name, source file and line number
>> that correspond to a PC is known as “Symbolication”.
>>
>> There are two approaches to symbolication, online and offline. Online
>> symbolication performs symbolication in the process where the issue
>> was found by invoking an external tool (e.g. llvm-symbolizer) to
>> “symbolize” each of the PCs. Offline symbolication performs
>> symbolication outside the process where the issue was found. The
>> Sanitizers perform online symbolication by default. This process needs
>> the debug information to be available at runtime. However this
>> information might be missing. For example:
>>
>> * The instrumented binary might have been stripped of debug info (e.g.
>> to reduce binary size).
>> * The PC points inside a system library which has no available debug info.
>> * The instrumented binary was built on a different machine. On Apple
>> platforms debug info lives outside the binary (inside “.dSYM” bundles)
>> so these might not be copied across from the build machine.
>>
>> In these cases online symbolication fails and we are left with a
>> sanitizer report that is extremely hard for a developer to read.
>>
>> To turn the unsymbolicated Sanitizer report into something useful for
>> a developer, offline symbolication is necessary. However, the existing
>> infrastructure (asan_symbolize.py) for doing this has some
>> deficiencies.
>>
>> * Only Address Sanitizer reports are supported.
>> * The current implementation processes each stack frame sequentially.
>> This does not fit well in contexts where we would like to symbolicate
>> multiple PCs at a time.
>> * The current implementation doesn’t provide a way to handle inline
>> frames (i.e. a PC maps to two or more source locations).
>>
>> These problems can be resolved by building new tools on top of the
>> structured data format. This gives a nice separation of concerns
>> because parsing the report is now separate from symbolicating the PCs
>> in it.
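
A sketch of what batching and inline-frame support could look like once the
frames are structured data: group the frames by module, ask a backend to
resolve all offsets of a module in one call, and let every frame carry a list
of source locations. Everything here is hypothetical, including the frame
field names and the `lookup_batch(module, offsets)` backend interface; the
in-memory table stands in for real debug info.

```python
from collections import defaultdict


def symbolicate(frames, lookup_batch):
    """Attach a list of source locations to each frame.

    lookup_batch(module, offsets) returns {offset: [location, ...]}, where
    more than one location means the PC sits inside inlined code.
    """
    # Group offsets by module so each module is queried once, not per frame.
    by_module = defaultdict(list)
    for frame in frames:
        by_module[frame["module"]].append(frame["module_offset"])

    resolved = {}
    for module, offsets in by_module.items():
        for offset, locations in lookup_batch(module, offsets).items():
            resolved[(module, offset)] = locations

    # Frames the backend could not resolve get an empty location list.
    return [
        {**frame,
         "locations": resolved.get((frame["module"], frame["module_offset"]), [])}
        for frame in frames
    ]


# Stand-in debug info: one PC that maps to an inlined callee plus its caller.
FAKE_DEBUG_INFO = {
    ("/tmp/a.out", "0x4f5e3a"): [
        {"function": "helper", "file": "a.c", "line": 3},
        {"function": "main", "file": "a.c", "line": 10},
    ],
}


def fake_lookup_batch(module, offsets):
    """Resolve a whole batch of offsets for one module from the fake table."""
    return {
        offset: FAKE_DEBUG_INFO[(module, offset)]
        for offset in offsets
        if (module, offset) in FAKE_DEBUG_INFO
    }
```

Because the symbolicator never touches the report text, the same function
would work for ASan, TSan, and UBSan frames as long as they share the frame
representation.
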
>>
>> The symbolication tools would be subtools of the asan, tsan, ubsan
>> subtools. This would require the user to explicitly communicate the
>> report type ahead of time. Command line invocation would look
>> something like:
>>
>> ```
>> llvm-xsan asan symbolicate < asan_report.json > asan_report_symbolicated.json
>> llvm-xsan tsan symbolicate < tsan_report.json > tsan_report_symbolicated.json
>> llvm-xsan ubsan symbolicate < ubsan_report.json > ubsan_report_symbolicated.json
>> ```
>>
>> There are multiple ways to perform symbolication (some of which are
>> platform specific). Like asan_symbolize.py the plan would be to
>> support multiple symbolication backends (that can also be chained
>> together) that are specified via command line options.
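
One possible reading of "chained" backends, sketched under the assumption
that a backend is a callable taking a module and a list of offsets and
returning whatever it could resolve: each backend is tried in order and only
sees the offsets that earlier backends left unresolved. The names
`chain_backends` and `table_backend` are invented for this example.

```python
def chain_backends(backends):
    """Combine backends; later ones only see what earlier ones missed."""
    def lookup_batch(module, offsets):
        resolved = {}
        remaining = list(offsets)
        for backend in backends:
            if not remaining:
                break
            found = backend(module, remaining)
            resolved.update(found)
            remaining = [o for o in remaining if o not in found]
        return resolved
    return lookup_batch


def table_backend(table):
    """Build a toy backend from a {(module, offset): locations} dict."""
    def backend(module, offsets):
        return {o: table[(module, o)] for o in offsets if (module, o) in table}
    return backend
```

For example, a backend reading local debug info could be placed first, with a
slower backend that fetches symbols from elsewhere as the fallback.
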
>>
>> [1] https://github.com/dobin/asanparser/blob/master/asanparser.py
>> [2] https://github.com/google/sanitizers/issues/268
>>
>> Thanks,
>> Dan.
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>