<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Oct 7, 2020 at 10:23 AM Dan Liew <<a href="mailto:dan@su-root.co.uk" target="_blank">dan@su-root.co.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>
<br>
On Tue, 6 Oct 2020 at 18:31, David Blaikie <<a href="mailto:dblaikie@gmail.com" target="_blank">dblaikie@gmail.com</a>> wrote:<br>
><br>
> My 2c would be to push back a bit more on the "let's not have a machine readable format, but instead parse the human readable format" - it seems like that's going to make the human readable format/parsing fairly brittle/hard to change (I mean, having the parser in tree will help, for sure).<br>
<br>
I was operating under the assumption that the decision made in<br>
<a href="https://github.com/google/sanitizers/issues/268" rel="noreferrer" target="_blank">https://github.com/google/sanitizers/issues/268</a> was still the status<br>
quo. That was six years ago though so I'll let Kostya chime in here if<br>
he now thinks differently about this.<br></blockquote><div><br></div><div>My opinion on the matter didn't change, nor did the motivation. </div><div>I am opposed to making the sanitizer run-time any more complex,</div><div>and I prefer the approach proposed here: separate, adjacently maintained parser. </div><div><br></div><div>On top of the previous motivation, here is some more. </div><div>We are going to have more sanitizer-like things in the near future (Arm MTE is one of them), </div><div>that are not necessarily going to be in LLVM and that will not emit JSON. <br></div><div>(and they shouldn't: we don't want any such thing in a production run-time).</div><div>But we can support those things with a separate parser. </div><div><br></div><div>I have a mild preference to have the parser written as a C++ library, with C interface. </div><div>Not in python, so that it can be used programmatically w/o launching a sub-process. </div><div>But I don't insist (especially given the code is written already) </div><div><br></div><div>--kcc </div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
Even if we go down the route of having the sanitizers supporting<br>
machine-readable output I'd still like there to be an in-tree tool<br>
that supports doing offline symbolication on the machine readable<br>
output. So there still might be a case for having the proposed<br>
"llvm-xsan" tool in-tree.<br>
<br>
> It'd be interesting to know more about what problems the Valgrind XML format has had and how/whether different solutions would address/avoid those problems. Also might be good to hear about how other tools are parsing the output - whether or not/how they might benefit if it were machine readable to begin with.<br>
<br>
Huh. I didn't know Valgrind had an XML format so I can't really<br>
comment on that (yet).<br>
<br>
On my side I can say we have at least two use cases inside Apple where<br>
we are parsing ASan reports and each use case ended up implementing<br>
their own parser.<br>
<br>
><br>
> But, yeah, if that's the direction - having an in-tree tool with fairly narrow uses could be nice. One action to convert human readable reports to json, another to symbolize such a report, a simple tool to render the (symbolized or not) data back into human readable form - then sets it up for other tools to consume that json and, say, render it in a GUI, perform other diagnostics/analysis on the report, etc.<br>
<br>
I hadn't thought about a tool to re-render reports in human readable<br>
form. That's a good idea.<br>
<br>
<br>
> On Tue, Oct 6, 2020 at 6:12 PM Dan Liew via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br>
>><br>
>> # Summary<br>
>><br>
>> Currently the Sanitizer family of runtime bug finding tools (e.g.<br>
>> Address Sanitizer) provide useful reports of problems upon detection.<br>
>> This RFC proposes adding tools to<br>
>><br>
>> 1. Parse Sanitizer reports into structured data to make interfacing<br>
>> with other tools simpler.<br>
>> 2. Take the Sanitizer reports and “Symbolicate” them. That is, add<br>
>> missing symbol information (function name, source file, line number)<br>
>> to the structured data version of the report.<br>
>><br>
>> The initial stubs for the proposal in this RFC are provided in this<br>
>> patch: <a href="https://reviews.llvm.org/D88938" rel="noreferrer" target="_blank">https://reviews.llvm.org/D88938</a> .<br>
>><br>
>> Any thoughts on this RFC or the patch would be appreciated.<br>
>><br>
>> # Issues with the existing solutions<br>
>><br>
>> * An official parser for sanitizer reports does not exist. Currently<br>
>> we just tell our users to implement their own (e.g. [1]). This creates<br>
>> an unnecessary duplication of effort.<br>
>> * The existing symbolizer (asan_symbolize.py) only works with ASan<br>
>> reports and doesn’t support other sanitizers like TSan.<br>
>> * The architecture of the existing symbolizer makes it cumbersome to<br>
>> support inline frames.<br>
>> * The architecture of the existing symbolizer is sequential, which<br>
>> prevents batched symbolication of stack frames.<br>
>><br>
>> # Tools<br>
>><br>
>> The proposed tools would be sub-tools of a new llvm-xsan tool.<br>
>><br>
>> E.g.<br>
>><br>
>> llvm-xsan &lt;subtool&gt;<br>
>><br>
>> Sub-tools will support nesting of sub-tools to allow building<br>
>> ergonomic tools. E.g.:<br>
>><br>
>> llvm-xsan asan &lt;asan subtool&gt;<br>
>><br>
>> * The tools would be part of compiler-rt and will optionally ship with<br>
>> this project.<br>
>> * The tools will be considered experimental while being incrementally<br>
>> developed on the master branch.<br>
>> * Functionality of the tools will be maintained via tests in compiler-rt.<br>
>><br>
>> llvm-xsan could be also used as a vehicle for shipping other Sanitizer<br>
>> tools in the toolchain in the future.<br>
>><br>
>> ## Parsing tool<br>
>><br>
>> Sanitizer reports are primarily meant to be human readable,<br>
>> consequently the reports are not structured data (e.g. JSON). This<br>
>> means that Sanitizer reports are not conveniently machine-readable.<br>
>><br>
>> A request [2] was made in the past to teach the sanitizers to emit a<br>
>> machine-readable format for reports. This request was denied but an<br>
>> alternative was proposed where a tool could be provided to convert the<br>
>> human readable Sanitizer reports into a structured data format. This<br>
>> proposal will implement this alternative.<br>
>><br>
>> My proposal is that we implement a parser for Sanitizer reports that<br>
>> converts them into structured data. In particular:<br>
>><br>
>> * The tool is tied to the Clang/compiler-rt runtime that it ships<br>
>> with. This means the tool will parse Sanitizer reports that come from<br>
>> binaries built using the corresponding Clang. However the tool is not<br>
>> required to parse Sanitizer reports that come from different versions<br>
>> of Clang.<br>
>> * The tool can also output a schema that describes the structured data<br>
>> format. This schema would be versioned and would be allowed to change<br>
>> once the tool moves out of the experimental stage.<br>
>> * The format of the human readable Sanitizer reports is allowed to<br>
>> change but the parser should be correspondingly changed when this<br>
>> happens. This will be enforced with tests.<br>
>><br>
>> The parsing tools would be subtools of the asan, tsan, ubsan subtools.<br>
>> This would require the user to explicitly communicate the report type<br>
>> ahead of time. Command line invocation would look something like:<br>
>><br>
>> ```<br>
>> llvm-xsan asan parse < asan_report.txt > asan_report.json<br>
>> llvm-xsan tsan parse < tsan_report.txt > tsan_report.json<br>
>> llvm-xsan ubsan parse < ubsan_report.txt > ubsan_report.json<br>
>> ```<br>
>><br>
>> The structured data format would be JSON. The schema details still<br>
>> need to be worked out but the schema will need to cover every type of<br>
>> issue that a Sanitizer can find.<br>
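>><br>
>> As a rough illustration of the parse step (not the implementation in<br>
>> the patch), the sketch below turns unsymbolized stack frame lines into<br>
>> JSON. It assumes frames look like "#N 0xPC (module+0xOFFSET)"; the key<br>
>> names "frame", "pc", "module" and "module_offset" are hypothetical,<br>
>> since the real schema has not been worked out yet.<br>
<pre>
```python
import json
import re

# Matches an unsymbolized frame such as:
#   "    #1 0x7f3c2a1e0b96  (/lib/libc.so.6+0x21b96)"
# Groups: frame number, PC, module path, offset within the module.
FRAME_RE = re.compile(
    r"^\s*#(\d+)\s+(0x[0-9a-fA-F]+)\s+\((.+)\+(0x[0-9a-fA-F]+)\)\s*$"
)


def parse_frames(report_text):
    """Return a list of dicts, one per unsymbolized stack frame found."""
    frames = []
    for line in report_text.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            continue  # Not a frame line (header, shadow bytes, etc.).
        number, pc, module, offset = match.groups()
        frames.append({
            "frame": int(number),
            "pc": pc,
            "module": module,
            "module_offset": offset,  # Hypothetical key name.
        })
    return frames


# Made-up sample input, shaped like the start of an ASan report.
SAMPLE_REPORT = """\
==1234==ERROR: AddressSanitizer: heap-use-after-free on address 0x602000000010
    #0 0x4f1a2b  (/tmp/a.out+0xf1a2b)
    #1 0x7f3c2a1e0b96  (/lib/libc.so.6+0x21b96)
"""

if __name__ == "__main__":
    print(json.dumps({"frames": parse_frames(SAMPLE_REPORT)}, indent=2))
```
</pre>
>> A real parser would also need to cover the report header, the<br>
>> allocation and deallocation stacks, and already-symbolized frames.<br>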
>><br>
>> ## Symbolication tool<br>
>><br>
>> Sanitizer reports include detailed stack traces which show the program<br>
>> counter (PC) for each frame. PCs are typically not useful to a<br>
>> developer. Instead they are likely more interested in the function<br>
>> name, source file and line number that correspond to each of the PCs.<br>
>> The process of finding the function name, source file and line number<br>
>> that correspond to a PC is known as “Symbolication”.<br>
>><br>
>> There are two approaches to symbolication, online and offline. Online<br>
>> symbolication performs Symbolication in the process where the issue<br>
>> was found by invoking an external tool (e.g. llvm-symbolizer) to<br>
>> “symbolize” each of the PCs. Offline symbolication performs<br>
>> symbolication outside the process where the issue was found. The<br>
>> Sanitizers perform online symbolication by default. This process needs<br>
>> the debug information to be available at runtime. However this<br>
>> information might be missing. For example:<br>
>><br>
>> * The instrumented binary might have been stripped of debug info (e.g.<br>
>> to reduce binary size).<br>
>> * The PC points inside a system library which has no available debug info.<br>
>> * The instrumented binary was built on a different machine. On Apple<br>
>> platforms debug info lives outside the binary (inside “.dSYM” bundles)<br>
>> so these might not be copied across from the build machine.<br>
>><br>
>> In these cases online symbolication fails and we are left with a<br>
>> sanitizer report that is extremely hard for a developer to read.<br>
>><br>
>> To turn the unsymbolicated Sanitizer report into something useful for<br>
>> a developer, offline symbolication is necessary. However, the existing<br>
>> infrastructure (asan_symbolize.py) for doing this has some<br>
>> deficiencies.<br>
>><br>
>> * Only Address Sanitizer reports are supported.<br>
>> * The current implementation processes each stackframe sequentially.<br>
>> This does not fit well in contexts where we would like to symbolicate<br>
>> multiple PCs at a time.<br>
>> * The current implementation doesn’t provide a way to handle inline<br>
>> frames (i.e. a PC maps to two or more source locations).<br>
>><br>
>> These problems can be resolved by building new tools on top of the<br>
>> structured data format. This gives a nice separation of concerns<br>
>> because parsing the report is now separate from symbolicating the PCs<br>
>> in it.<br>
>><br>
>> The symbolication tools would be subtools of the asan, tsan, ubsan<br>
>> subtools. This would require the user to explicitly communicate the<br>
>> report type ahead of time. Command line invocation would look<br>
>> something like:<br>
>><br>
>> ```<br>
>> llvm-xsan asan symbolicate < asan_report.json > asan_report_symbolicated.json<br>
>> llvm-xsan tsan symbolicate < tsan_report.json > tsan_report_symbolicated.json<br>
>> llvm-xsan ubsan symbolicate < ubsan_report.json > ubsan_report_symbolicated.json<br>
>> ```<br>
>><br>
>> There are multiple ways to perform symbolication (some of which are<br>
>> platform specific). Like asan_symbolize.py the plan would be to<br>
>> support multiple symbolication backends (that can also be chained<br>
>> together) that are specified via command line options.<br>
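>><br>
>> To make the batching and chaining idea concrete, here is a minimal<br>
>> sketch under stated assumptions. The frame keys ("module",<br>
>> "module_offset", "locations") and both backends are hypothetical<br>
>> stand-ins, not part of the patch or of any existing tool. Each PC maps<br>
>> to a list of locations so that inline frames can be represented.<br>
<pre>
```python
from collections import defaultdict


def symbolicate(frames, backends):
    """Attach a "locations" list to each frame dict and return the frames.

    frames: dicts with "module" and "module_offset" keys (hypothetical schema).
    backends: callables taking (module, offsets) and returning a dict that
    maps each offset they could resolve to a list of source locations. A list
    with several entries represents inline frames for a single PC.
    """
    # Batch: group offsets by module so each backend is asked once per module.
    by_module = defaultdict(set)
    for frame in frames:
        by_module[frame["module"]].add(frame["module_offset"])

    resolved = {}
    for module, offsets in by_module.items():
        remaining = set(offsets)
        # Chain: later backends only see what earlier ones failed to resolve.
        for backend in backends:
            if not remaining:
                break
            for offset, locations in backend(module, sorted(remaining)).items():
                resolved[(module, offset)] = locations
                remaining.discard(offset)

    for frame in frames:
        key = (frame["module"], frame["module_offset"])
        frame["locations"] = resolved.get(key, [])
    return frames


def fake_debug_info_backend(module, offsets):
    """Stand-in backend with a hard-coded table instead of real debug info."""
    table = {
        ("/tmp/a.out", "0xf1a2b"): [
            {"function": "inlined_helper", "file": "a.c", "line": 3},
            {"function": "main", "file": "a.c", "line": 10},
        ],
    }
    return {o: table[(module, o)] for o in offsets if (module, o) in table}


def fake_export_table_backend(module, offsets):
    """Stand-in fallback that only produces a placeholder function name."""
    name = "unknown_in_" + module.rsplit("/", 1)[-1]
    return {o: [{"function": name}] for o in offsets}
```
</pre>
>> Calling symbolicate(frames, [fake_debug_info_backend,<br>
>> fake_export_table_backend]) tries the first backend for every offset in<br>
>> a module at once and falls back to the second only for what is left.<br>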
>><br>
>> [1] <a href="https://github.com/dobin/asanparser/blob/master/asanparser.py" rel="noreferrer" target="_blank">https://github.com/dobin/asanparser/blob/master/asanparser.py</a><br>
>> [2] <a href="https://github.com/google/sanitizers/issues/268" rel="noreferrer" target="_blank">https://github.com/google/sanitizers/issues/268</a><br>
>><br>
>> Thanks,<br>
>> Dan.<br>
</blockquote></div></div>