[llvm-dev] RFC: XRay in the LLVM Library

Wed Nov 30 16:17:38 PST 2016

> On 30 Nov. 2016, at 22:26, Renato Golin <renato.golin at linaro.org> wrote:
> 
> On 30 November 2016 at 05:08, Dean Michael Berris via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> - Is there a preference between the two options provided above?
>> - Any other alternatives we should consider?
>> - Which parts of which options do you prefer, and is there a synthesis of either of those options that appeals to you?
> 
> Hi Dean,
> 
> I haven't followed the XRay project that closely, but I have been
> around file formats being formed and either of your two approaches
> (which are pretty standard) will fail in different ways. But that's
> ok, because the "fixes" work, they're just not great.
> 
> If you take the LLVM IR, there were lots of changes, but we always
> aimed to have one canonical representation. Not just at the syntax of
> each instruction/construct, but how to represent complex behaviour in
> the same series of instructions, so that all back-ends can identify
> and work with it. Of course, the second (semantic) level is less
> stringent than the first (syntactical), but we try to make it as
> strict as possible.
> 
> This hasn't come for free. The two main costs were destructive
> semantics, for example when we lower C++ classes into arrays and
> change all the access to jumbled reads and writes because IR readers
> don't need to understand the ABI of all targets, and backwards
> incompatibility, for example when we completely changed how exception
> handling is lowered (from special basic blocks to special constructs
> as heads/tails of common basic blocks). That price was cheaper than
> the alternative, but it's still not free.
> 
> Another approach I followed was SwissProt [1], a manually curated
> machine readable text file with protein information for cross
> referencing. Cutting to the chase, they introduced "line types"
> with strict formatting for the most common information, and one line
> type called "comment" where free text was allowed, for additional
> information. With time, adding a new line type became impossible, so
> all new fields ended up being added in the comment lines, with a
> pseudo-strict formatting, which was (probably still is) a nightmare
> for parsers and humans alike.
> 
> Between the two, the LLVM IR policy for changes is orders of magnitude
> better. I suggest you follow that.
> 
> I also suggest you don't keep multiple canonical representations, and
> create tools to convert from any other to the canonical format.

Thanks Renato! Just so I understand this one sentence (to disambiguate), you meant:

1) Don't have multiple canonical forms, just have one.
2) Create tools that will convert to/from that one canonical format.

I think this closely follows the Option B mental model that I had, with the only difference being that the canonical reader is a library made part of LLVM "when it's ready", as you suggest later. Would that be accurate?
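
To make sure we mean the same thing, here is a rough sketch of how I read "one canonical form plus converters". It is only an illustration: the record fields and both input formats are invented for the example and are not XRay's actual formats.

```python
import json
from dataclasses import dataclass, asdict


# Hypothetical canonical record; the fields are illustrative, not XRay's.
@dataclass(frozen=True)
class CanonicalRecord:
    func_id: int
    kind: str  # "enter" or "exit"
    tsc: int   # timestamp counter value


def from_csv_line(line: str) -> CanonicalRecord:
    """Convert a made-up 'func_id,kind,tsc' text line to the canonical form."""
    func_id, kind, tsc = line.strip().split(",")
    return CanonicalRecord(int(func_id), kind, int(tsc))


def from_json(text: str) -> CanonicalRecord:
    """Convert a made-up JSON object to the canonical form."""
    obj = json.loads(text)
    return CanonicalRecord(int(obj["fn"]), obj["type"], int(obj["ts"]))


def to_json(record: CanonicalRecord) -> str:
    """Export the canonical form; analysis code only ever sees CanonicalRecord."""
    return json.dumps(asdict(record), sort_keys=True)


# Two different inputs end up as the same canonical record.
a = from_csv_line("7,enter,1000\n")
b = from_json('{"fn": 7, "type": "enter", "ts": 1000}')
```

The point of the shape is that every analysis tool is written once against the canonical record, and supporting a new input or output format means adding one converter rather than touching the tools.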

> 
> Finally, I'd separate the design in two phases:
> 
> 1. Experimental, where the canonical form changes constantly in light
> of new input and there are no backwards/forwards compatibility
> guarantees at all. This is where all of you get creative and try to
> sort out the problems in the best way possible.
> 2. Stable, when most of the problems have been solved, and you now document
> a final stable version of the representation. Every new input will
> have to be represented as a combination of existing ones, so make them
> generic enough. If a real change is needed, make sure you have a process
> that identifies versions and compatibility (for example, having a
> version tag on every dump), and let the canonical tool know about all of
> the issues.
> 
> This last point is important if you want to continue reading old files
> that don't have the compatibility issue, warn when they do but it's
> irrelevant, or error when they do and it would produce garbage. You can
> also write more efficient conversion tools.
> 

I like this suggestion -- thanks!

So in essence we can treat the current implementation as experimental, and make that abundantly clear in any point release that includes XRay functionality. Is there an obvious place where this ought to be documented (aside from the documentation at http://llvm.org/docs/XRay.html)?

XRay trace file headers already contain a version identifier, intended to tell a reader precisely how to interpret the data that follows.
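
As a minimal sketch of how a reader could apply the read / warn / error policy you describe using such a version field: the header layout, the version numbers, and the `check_header` name below are all assumptions made up for the example, not the real XRay header or any existing API.

```python
import struct

# Hypothetical header layout for illustration only (NOT the real XRay layout):
# a little-endian uint16 version followed by a uint16 record type.
HEADER = struct.Struct("<HH")

CURRENT_VERSION = 2          # assumed: newest version this reader understands
DEPRECATED_VERSIONS = {1}    # assumed: still readable, but worth a warning


def check_header(data: bytes):
    """Return (version, warning_or_None); raise ValueError if unreadable."""
    if len(data) < HEADER.size:
        raise ValueError("truncated header")
    version, _record_type = HEADER.unpack_from(data)
    if version == CURRENT_VERSION:
        # Fully supported: read without complaint.
        return version, None
    if version in DEPRECATED_VERSIONS:
        # Readable, but tell the user the file is in an older format.
        return version, f"version {version} is old; consider converting"
    # Unknown version: refuse rather than produce garbage.
    raise ValueError(f"unsupported trace version {version}")
```

A tool would call `check_header` on the first bytes of a trace before decoding any records, so an unknown version fails early instead of being misread.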

> From what I understood of this XRay, you could in theory keep the data
> for years in a tape somewhere in the attic, and want to read it later
> to compare to a current run, so being compatible is important, but
> having a canonical form that can be converted to and from other forms
> is more important, or the comparison tools will get really messy
> really quickly.
> 

Yep, this is definitely one of the goals, which is why we're being very careful about what we write down in the traces, optimising for efficient writing and smaller traces at the cost of potential complexity in the analysis tooling.

> Hope that helps,

Definitely does, thanks again!

Cheers

-- Dean