<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Wed, Nov 30, 2016 at 3:26 AM Renato Golin &lt;<a href="mailto:renato.golin@linaro.org">renato.golin@linaro.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 30 November 2016 at 05:08, Dean Michael Berris via llvm-dev<br class="gmail_msg">

&lt;<a href="mailto:llvm-dev@lists.llvm.org" class="gmail_msg" target="_blank">llvm-dev@lists.llvm.org</a>&gt; wrote:<br class="gmail_msg">

> - Is there a preference between the two options provided above?<br class="gmail_msg">

> - Any other alternatives we should consider?<br class="gmail_msg">

> - Which parts of which options do you prefer, and is there a synthesis of either of those options that appeals to you?<br class="gmail_msg">

<br class="gmail_msg">

Hi Dean,<br class="gmail_msg">

<br class="gmail_msg">

I haven't followed the XRay project that closely, but I have been<br class="gmail_msg">

around file formats being formed and either of your two approaches<br class="gmail_msg">

(which are pretty standard) will fail in different ways. But that's<br class="gmail_msg">

ok, because the "fixes" work, they're just not great.<br class="gmail_msg">

<br class="gmail_msg">

If you take the LLVM IR, there were lots of changes, but we always<br class="gmail_msg">

aimed to have one canonical representation. Not just in the syntax of<br class="gmail_msg">

each instruction/construct, but also in how to represent complex behaviour in<br class="gmail_msg">

the same series of instructions, so that all back-ends can identify<br class="gmail_msg">

and work with it. Of course, the second (semantic) level is less<br class="gmail_msg">

stringent than the first (syntactical), but we try to make it as<br class="gmail_msg">

strict as possible.<br class="gmail_msg">

<br class="gmail_msg">

This hasn't come for free. The two main costs were destructive<br class="gmail_msg">

semantics, for example when we lower C++ classes into arrays and<br class="gmail_msg">

change all the accesses to jumbled reads and writes because IR readers<br class="gmail_msg">

don't need to understand the ABI of all targets, and backwards<br class="gmail_msg">

incompatibility, for example when we completely changed how exception<br class="gmail_msg">

handling is lowered (from special basic blocks to special constructs<br class="gmail_msg">

as heads/tails of common basic blocks). That price was cheaper than<br class="gmail_msg">

the alternative, but it's still not free.<br class="gmail_msg">

<br class="gmail_msg">

Another approach I followed was SwissProt [1], a manually curated<br class="gmail_msg">

machine readable text file with protein information for cross<br class="gmail_msg">

referencing. To cut to the chase, they introduced "line types"<br class="gmail_msg">

with strict formatting for the most common information, and one line<br class="gmail_msg">

type called "comment" where free text was allowed, for additional<br class="gmail_msg">

information. With time, adding a new line type became impossible, so<br class="gmail_msg">

all new fields ended up being added in the comment lines, with a<br class="gmail_msg">

pseudo-strict formatting, which was (probably still is) a nightmare<br class="gmail_msg">

for parsers and humans alike.<br class="gmail_msg">

<br class="gmail_msg">

Between the two, the LLVM IR policy for changes is orders of magnitude<br class="gmail_msg">

better. I suggest you follow that.<br class="gmail_msg">

<br class="gmail_msg">

I also suggest you don't keep multiple canonical representations, and<br class="gmail_msg">

create tools to convert from any other to the canonical format.<br class="gmail_msg">

<br class="gmail_msg">

Finally, I'd separate the design in two phases:<br class="gmail_msg">

<br class="gmail_msg">

1. Experimental, where the canonical form changes constantly in light<br class="gmail_msg">

of new input and there are no backwards/forwards compatibility<br class="gmail_msg">

guarantees at all. This is where all of you get creative and try to<br class="gmail_msg">

sort out the problems in the best way possible.<br class="gmail_msg">

2. Stable, when most of the problems have been solved and you now document<br class="gmail_msg">

a final stable version of the representation. Every new input will<br class="gmail_msg">

have to be represented as a combination of existing ones, so make them<br class="gmail_msg">

generic enough. If a real change is needed, make sure you have a process<br class="gmail_msg">

that identifies versions and compatibility (for example, a version tag<br class="gmail_msg">

on every dump) and lets the canonical tool know about all of the<br class="gmail_msg">

issues.<br class="gmail_msg">

<br class="gmail_msg">

This last point is important if you want to continue reading old files<br class="gmail_msg">

that don't have the compatibility issue, warn when they do but it's<br class="gmail_msg">

irrelevant, or error when they do and it'll produce garbage. You can<br class="gmail_msg">

also write more efficient conversion tools.<br class="gmail_msg">
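
<br class="gmail_msg">

The read/warn/error handling described above could be sketched roughly as<br class="gmail_msg">

follows. This is a minimal Python sketch under stated assumptions, not XRay's<br class="gmail_msg">

actual tooling: the version numbers, the KNOWN_ISSUES table and the<br class="gmail_msg">

check_version function are all hypothetical names made up for illustration.<br class="gmail_msg">

```python
from enum import Enum


class Compat(Enum):
    OK = "ok"        # no known issue affects this dump
    WARN = "warn"    # a known issue exists but is irrelevant to the result
    ERROR = "error"  # reading this dump would produce garbage


# Hypothetical: the newest version of the canonical form the tool writes.
CURRENT_VERSION = 3

# Hypothetical table of known issues, keyed by the version tag on each dump.
KNOWN_ISSUES = {
    1: Compat.ERROR,  # assumed: version 1 records cannot be interpreted safely
    2: Compat.WARN,   # assumed: version 2 misses a field the tool can default
}


def check_version(tag: int) -> Compat:
    """Classify a dump by its version tag before any record is read."""
    if tag in KNOWN_ISSUES:
        return KNOWN_ISSUES[tag]
    if tag == CURRENT_VERSION:
        return Compat.OK
    # Unknown tag (for example written by a newer tool): refuse to guess.
    return Compat.ERROR
```

A reader would call check_version on the tag first, then continue, print a<br class="gmail_msg">

warning, or stop, depending on the result.<br class="gmail_msg">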

<br class="gmail_msg">

From what I understood of XRay, you could in theory keep the data<br class="gmail_msg">

for years on a tape somewhere in the attic, and want to read it later<br class="gmail_msg">

to compare to a current run, so being compatible is important, but<br class="gmail_msg">

having a canonical form that can be converted to and from other forms<br class="gmail_msg">

is more important, or the comparison tools will get really messy<br class="gmail_msg">

really quickly.<br class="gmail_msg"></blockquote><div><br></div><div>Not sure I quite follow here - perhaps there's some misunderstanding.<br><br>My mental model here is that the formats are semantically equivalent, with a common in-memory representation (like the LLVM IR APIs). It doesn't/shouldn't complicate a comparison tool to support both LLVM IR and bitcode input (or some other hypothetical formats that are semantically equivalent and that we could integrate into a common reading API).<br><br>Is there something different here?<br><br>What I'm picturing is that we need an API for reading all these formats, and there are two ways to use it. Either we use that API only in the conversion tool, and users then have to run the conversion tool before running the tool they want. Or we sink that API into a common place and all tools use it to load inputs, which makes the user experience simpler (users don't have to run an extra conversion step/tool) and doesn't seem like it should make the development experience more complicated/messy/difficult.<br><br>- Dave</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br class="gmail_msg">

Hope that helps,<br class="gmail_msg">

<br class="gmail_msg">

cheers,<br class="gmail_msg">

--renato<br class="gmail_msg">

<br class="gmail_msg">

<br class="gmail_msg">

[1] <a href="http://web.expasy.org/docs/swiss-prot_guideline.html" rel="noreferrer" class="gmail_msg" target="_blank">http://web.expasy.org/docs/swiss-prot_guideline.html</a><br class="gmail_msg">

</blockquote></div></div>
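<div dir="ltr"><br>The "common reading API" option from the reply above could look something like the sketch below. It is only an illustration under loud assumptions: JSON and CSV stand in for whatever the real on-disk formats are, and Record, READERS and load are hypothetical names, not an existing XRay or LLVM interface.<br><br></div>

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    """Hypothetical canonical in-memory record shared by every tool."""
    function_id: int
    timestamp: int


def _read_json(data: bytes) -> List[Record]:
    # Assumed layout: a JSON array of objects with the two fields below.
    return [Record(r["function_id"], r["timestamp"]) for r in json.loads(data)]


def _read_csv(data: bytes) -> List[Record]:
    # Assumed layout: one "function_id,timestamp" pair per line.
    rows = [line.split(",") for line in data.decode().splitlines() if line]
    return [Record(int(fid), int(ts)) for fid, ts in rows]


# Hypothetical registry: one reader per format, all producing Records.
READERS = {"json": _read_json, "csv": _read_csv}


def load(data: bytes, fmt: str) -> List[Record]:
    """Single entry point every tool calls, whatever the input format."""
    try:
        reader = READERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None
    return reader(data)
```

<div dir="ltr"><br>A comparison tool written against load only ever sees Record values, so supporting another semantically equivalent format means adding one reader to the registry rather than touching each tool.<br></div>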