<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Oct 7, 2020 at 10:23 AM Dan Liew <<a href="mailto:dan@su-root.co.uk" target="_blank">dan@su-root.co.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>

<br>

On Tue, 6 Oct 2020 at 18:31, David Blaikie <<a href="mailto:dblaikie@gmail.com" target="_blank">dblaikie@gmail.com</a>> wrote:<br>

><br>

> My 2c would be to push back a bit more on the "let's not have a machine readable format, but instead parse the human readable format" - it seems like that's going to make the human readable format/parsing fairly brittle/hard to change (I mean, having the parser in tree will help, for sure).<br>

<br>

I was operating under the assumption that the decision made in<br>

<a href="https://github.com/google/sanitizers/issues/268" rel="noreferrer" target="_blank">https://github.com/google/sanitizers/issues/268</a> was still the status<br>

quo. That was six years ago though so I'll let Kostya chime in here if<br>

he now thinks differently about this.<br></blockquote><div><br></div><div>My opinion on the matter hasn't changed, nor has the motivation. </div><div>I am opposed to making the sanitizer run-time any more complex,</div><div>and I prefer the approach proposed here: a separate, adjacently maintained parser. </div><div><br></div><div>On top of the previous motivation, here is some more. </div><div>We are going to have more sanitizer-like things in the near future (Arm MTE is one of them), </div><div>that are not necessarily going to be in LLVM and that will not emit JSON.   <br></div><div>(and they shouldn't: we don't want any such thing in a production run-time).</div><div>But we can support those things with a separate parser. </div><div><br></div><div>I have a mild preference for having the parser written as a C++ library with a C interface,  </div><div>not in Python, so that it can be used programmatically without launching a sub-process. </div><div>But I don't insist (especially given that the code is already written). </div><div><br></div><div>--kcc </div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Even if we go down the route of having the sanitizers supporting<br>

machine-readable output I'd still like there to be an in-tree tool<br>

that supports doing offline symbolication on the machine readable<br>

output. So there still might be a case for having the proposed<br>

"llvm-xsan" tool in-tree.<br>

<br>

> It'd be interesting to know more about what problems the valgrind XML format has had and how/whether different solutions would address/avoid those problems. Also might be good to hear about how other tools are parsing the output - whether or not/how they might benefit if it were machine readable to begin with.<br>

<br>

Huh. I didn't know Valgrind had an XML format so I can't really<br>

comment on that (yet).<br>

<br>

On my side I can say we have at least two use cases inside Apple where<br>

we are parsing ASan reports and each use case ended up implementing<br>

their own parser.<br>

<br>

><br>

> But, yeah, if that's the direction - having an in-tree tool with fairly narrow uses could be nice. One action to convert human readable reports to json, another to symbolize such a report, a simple tool to render the (symbolized or not) data back into human readable form - then sets it up for other tools to consume that json and, say, render it in a GUI, perform other diagnostics/analysis on the report, etc.<br>

<br>

I hadn't thought about a tool to re-render reports in human readable<br>

form. That's a good idea.<br>

<br>

<br>

> On Tue, Oct 6, 2020 at 6:12 PM Dan Liew via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br>

>><br>

>> # Summary<br>

>><br>

>> Currently the Sanitizer family of runtime bug finding tools (e.g.<br>

>> Address Sanitizer) provide useful reports of problems upon detection.<br>

>> This RFC proposes adding tools to<br>

>><br>

>> 1. Parse Sanitizer reports into structured data to make interfacing<br>

>> with other tools simpler.<br>

>> 2. Take the Sanitizer reports and “Symbolicate” them. That is, add<br>

>> missing symbol information (function name, source file, line number)<br>

>> to the structured data version of the report.<br>

>><br>

>> The initial stubs for the proposal in this RFC are provided in this<br>

>> patch: <a href="https://reviews.llvm.org/D88938" rel="noreferrer" target="_blank">https://reviews.llvm.org/D88938</a> .<br>

>><br>

>> Any thoughts on this RFC or the patch would be appreciated.<br>

>><br>

>> # Issues with the existing solutions<br>

>><br>

>> * An official parser for sanitizer reports does not exist. Currently<br>

>> we just tell our users to implement their own (e.g. [1]). This creates<br>

>> an unnecessary duplication of effort.<br>

>> * The existing symbolizer (asan_symbolize.py) only works with ASan<br>

>> reports and doesn’t support other sanitizers like TSan.<br>

>> * The architecture of the existing symbolizer makes it cumbersome to<br>

>> support inline frames.<br>

>> * The architecture of the existing symbolizer is sequential which<br>

>> prevents performing batched symbolication of stack frames.<br>

>><br>

>> # Tools<br>

>><br>

>> The proposed tools would be sub-tools of a new llvm-xsan tool.<br>

>><br>

>> E.g.<br>

>><br>

>> llvm-xsan <subtool><br>

>><br>

>> Sub-tools will support nesting of sub-tools to allow building<br>

>> ergonomic tools. E.g.:<br>

>><br>

>> llvm-xsan asan <asan subtool><br>

>><br>

>> * The tools would be part of compiler-rt and will optionally ship with<br>

>> this project.<br>

>> * The tools will be considered experimental while being incrementally<br>

>> developed on the master branch.<br>

>> * Functionality of the tools will be maintained via tests in compiler-rt.<br>

>><br>

>> llvm-xsan could also be used as a vehicle for shipping other Sanitizer<br>

>> tools in the toolchain in the future.<br>
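The nested sub-tool layout described above could be wired up roughly as follows. This is only a sketch using Python's argparse for illustration: the `parse` and `symbolicate` leaf names are taken from the invocations shown later in this RFC, while `build_parser` and `main` are hypothetical names and not part of any existing tool.<br>

```python
import argparse

# Sanitizer sub-tools and the leaf actions nested under each of them.
SANITIZERS = ("asan", "tsan", "ubsan")
ACTIONS = ("parse", "symbolicate")


def build_parser():
    """Build a two-level parser: llvm-xsan SANITIZER ACTION."""
    parser = argparse.ArgumentParser(prog="llvm-xsan")
    sanitizers = parser.add_subparsers(dest="sanitizer", required=True)
    for name in SANITIZERS:
        sub = sanitizers.add_parser(name)
        # Each sanitizer sub-tool carries its own nested set of sub-tools.
        actions = sub.add_subparsers(dest="action", required=True)
        for action in ACTIONS:
            actions.add_parser(action)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # A real tool would dispatch to a handler here; this sketch only
    # echoes which sub-tool was selected.
    print(f"{args.sanitizer} {args.action}")
    return args
```

Calling `main(["asan", "parse"])` selects the asan parser in this sketch. Adding another Sanitizer tool later would only require a new entry at the appropriate nesting level.<br>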

>><br>

>> ## Parsing tool<br>

>><br>

>> Sanitizer reports are primarily meant to be human readable,<br>

>> consequently the reports are not structured data (e.g. JSON). This<br>

>> means that Sanitizer reports are not conveniently machine-readable.<br>

>><br>

>> A request [2] was made in the past to teach the sanitizers to emit a<br>

>> machine-readable format for reports. This request was denied but an<br>

>> alternative was proposed where a tool could be provided to convert the<br>

>> human readable Sanitizer reports into a structured data format. This<br>

>> proposal will implement this alternative.<br>

>><br>

>> My proposal is that we implement a parser for Sanitizer reports that<br>

>> converts them into structured data. In particular:<br>

>><br>

>> * The tool is tied to the Clang/compiler-rt runtime that it ships<br>

>> with. This means the tool will parse Sanitizer reports that come from<br>

>> binaries built using the corresponding Clang. However the tool is not<br>

>> required to parse Sanitizer reports that come from different versions<br>

>> of Clang.<br>

>> * The tool can also output a schema that describes the structured data<br>

>> format. This schema would be versioned and would be allowed to change<br>

>> once the tool moves out of the experimental stage.<br>

>> * The format of the human readable Sanitizer reports is allowed to<br>

>> change but the parser should be correspondingly changed when this<br>

>> happens. This will be enforced with tests.<br>

>><br>

>> The parsing tools would be subtools of the asan, tsan, ubsan subtools.<br>

>> This would require the user to explicitly communicate the report type<br>

>> ahead of time. Command line invocation would look something like:<br>

>><br>

>> ```<br>

>> llvm-xsan asan parse < asan_report.txt > asan_report.json<br>

>> llvm-xsan tsan parse < tsan_report.txt > tsan_report.json<br>

>> llvm-xsan ubsan parse < ubsan_report.txt > ubsan_report.json<br>

>> ```<br>

>><br>

>> The structured data format would be JSON. The schema details still<br>

>> need to be worked out but the schema will need to cover every type of<br>

>> issue that a Sanitizer can find.<br>
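To make the parsing step concrete, here is a minimal sketch of turning the stack frames of an unsymbolicated ASan report into JSON. The JSON field names (`issue_type`, `frames`, `index`, `pc`, `module`, `module_offset`) and the function name `parse_asan_report` are assumptions made up for this example, since the schema has not been worked out yet. Only the error header and frame lines are handled; a real parser would need to cover the rest of the report.<br>

```python
import json
import re

# Header line, e.g. "==1==ERROR: AddressSanitizer: heap-use-after-free on ..."
ISSUE_RE = re.compile(r"ERROR: AddressSanitizer: (\S+)")
# Frame line, e.g. "    #0 0x4008f1  (/tmp/a.out+0x4008f1)"
FRAME_RE = re.compile(r"^\s*#(\d+) (0x[0-9a-f]+)\s*(.*)$")
# Module and offset, e.g. "(/tmp/a.out+0x4008f1)"
MODULE_RE = re.compile(r"\((.+)\+(0x[0-9a-f]+)\)")


def parse_asan_report(text):
    """Extract the issue type and stack frames from report text."""
    report = {"issue_type": None, "frames": []}
    for line in text.splitlines():
        issue = ISSUE_RE.search(line)
        if issue and report["issue_type"] is None:
            report["issue_type"] = issue.group(1)
            continue
        match = FRAME_RE.match(line)
        if not match:
            continue
        frame = {"index": int(match.group(1)), "pc": match.group(2)}
        module = MODULE_RE.search(match.group(3))
        if module:
            # Module path and offset are what offline symbolication needs.
            frame["module"] = module.group(1)
            frame["module_offset"] = module.group(2)
        report["frames"].append(frame)
    return report


def to_json(text):
    """Serialize the parsed report as indented JSON."""
    return json.dumps(parse_asan_report(text), indent=2)
```

Keeping the module path and offset for each frame is what would let a later step symbolicate the report without access to the original process.<br>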

>><br>

>> ## Symbolication tool<br>

>><br>

>> Sanitizer reports include detailed stack traces which show the program<br>

>> counter (PC) for each frame. PCs are typically not useful to a<br>

>> developer, who is likely more interested in the function<br>

>> name, source file and line number that correspond to each of the PCs.<br>

>> The process of finding the function name, source file and line number<br>

>> that correspond to a PC is known as “Symbolication”.<br>

>><br>

>> There are two approaches to symbolication, online and offline. Online<br>

>> symbolication performs Symbolication in the process where the issue<br>

>> was found by invoking an external tool (e.g. llvm-symbolizer) to<br>

>> “symbolize” each of the PCs. Offline symbolication performs<br>

>> symbolication outside the process where the issue was found. The<br>

>> Sanitizers perform online symbolication by default. This process needs<br>

>> the debug information to be available at runtime. However this<br>

>> information might be missing. For example:<br>

>><br>

>> * The instrumented binary might have been stripped of debug info (e.g.<br>

>> to reduce binary size).<br>

>> * The PC points inside a system library which has no available debug info.<br>

>> * The instrumented binary was built on a different machine. On Apple<br>

>> platforms debug info lives outside the binary (inside “.dSYM” bundles)<br>

>> so these might not be copied across from the build machine.<br>

>><br>

>> In these cases online symbolication fails and we are left with a<br>

>> sanitizer report that is extremely hard for a developer to read.<br>

>><br>

>> To turn the unsymbolicated Sanitizer report into something useful for<br>

>> a developer, offline symbolication is necessary. However, the existing<br>

>> infrastructure (asan_symbolize.py) for doing this has some<br>

>> deficiencies.<br>

>><br>

>> * Only Address Sanitizer reports are supported.<br>

>> * The current implementation processes each stack frame sequentially.<br>

>> This does not fit well in contexts where we would like to symbolicate<br>

>> multiple PCs at a time.<br>

>> * The current implementation doesn’t provide a way to handle inline<br>

>> frames (i.e. a PC maps to two or more source locations).<br>

>><br>

>> These problems can be resolved by building new tools on top of the<br>

>> structured data format. This gives a nice separation of concerns<br>

>> because parsing the report is now separate from symbolicating the PCs<br>

>> in it.<br>

>><br>

>> The symbolication tools would be subtools of the asan, tsan, ubsan<br>

>> subtools. This would require the user to explicitly communicate the<br>

>> report type ahead of time. Command line invocation would look<br>

>> something like:<br>

>><br>

>> ```<br>

>> llvm-xsan asan symbolicate < asan_report.json > asan_report_symbolicated.json<br>

>> llvm-xsan tsan symbolicate < tsan_report.json > tsan_report_symbolicated.json<br>

>> llvm-xsan ubsan symbolicate < ubsan_report.json > ubsan_report_symbolicated.json<br>

>> ```<br>

>><br>

>> There are multiple ways to perform symbolication (some of which are<br>

>> platform specific). Like asan_symbolize.py the plan would be to<br>

>> support multiple symbolication backends (that can also be chained<br>

>> together) that are specified via command line options.<br>
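One way to picture chained backends together with batching and inline frames is sketched below. Everything here is hypothetical: a "backend" is modelled as a function that takes a batch of PCs and returns the ones it could resolve, and each PC maps to a list of source locations so that an inlined call can contribute more than one entry. The lookup tables stand in for real backends such as llvm-symbolizer.<br>

```python
def table_backend(table):
    """Build a toy backend that resolves PCs from a fixed lookup table."""
    def lookup(pcs):
        # Return only the PCs this backend knows about.
        return {pc: table[pc] for pc in pcs if pc in table}
    return lookup


def symbolicate(pcs, backends):
    """Resolve a batch of PCs by trying each backend in order."""
    resolved = {}
    remaining = list(pcs)
    for backend in backends:
        if not remaining:
            break
        # Each backend receives the whole unresolved batch in one call.
        resolved.update(backend(remaining))
        remaining = [pc for pc in remaining if pc not in resolved]
    # PCs that no backend could resolve map to an empty list.
    for pc in remaining:
        resolved[pc] = []
    return resolved


# The first PC has two locations: an inlined helper and its caller.
app = table_backend({
    "0x4008f1": [("inlined_helper", "util.h", 12), ("main", "a.c", 5)],
})
libc = table_backend({
    "0x2409": [("__libc_start_main", "libc-start.c", 308)],
})
result = symbolicate(["0x4008f1", "0x2409", "0xdead"], [app, libc])
```

In this sketch `result["0x4008f1"]` holds two locations (the inline frame and its caller), `result["0x2409"]` is resolved by the second backend in the chain, and `result["0xdead"]` is an empty list because no backend knew about it.<br>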

>><br>

>> [1] <a href="https://github.com/dobin/asanparser/blob/master/asanparser.py" rel="noreferrer" target="_blank">https://github.com/dobin/asanparser/blob/master/asanparser.py</a><br>

>> [2] <a href="https://github.com/google/sanitizers/issues/268" rel="noreferrer" target="_blank">https://github.com/google/sanitizers/issues/268</a><br>

>><br>

>> Thanks,<br>

>> Dan.<br>


</blockquote></div></div>