[LLVMdev] [Proposal] Annotated assembly output

Sat Oct 13 21:38:45 PDT 2012

On 13/10/12 17:24, Sean Silva wrote:
> Another question: What kind of documentation are you planning to
> produce for this feature?
>
> -- Sean Silva
>
> On Fri, Oct 12, 2012 at 4:36 PM, Jim Grosbach<grosbach at apple.com>  wrote:
>>
>> On Oct 12, 2012, at 1:07 PM, Sean Silva<silvas at purdue.edu>  wrote:
>>
>>> Hi Jim, thanks for the response. That pretty much clears up my primary
>>> concern. +1 for keeping the C API small/stable/robust :)
>>>
>>> Having multiple hand-implemented parsers accepting the output, I think
>>> it would be wise to have an official "conformance suite" for the
>>> syntax so that external implementors can sleep more soundly with their
>>> implementation; if I were implementing a parser for it, having such a
>>> "conformance suite" would certainly help me feel better. The syntax is
>>> pretty simple, so the whole suite can probably fit in one file.
>>
>> That's an excellent suggestion. We'll see what we can do.
>>
>> Thanks again for the feedback!
>>
>> -Jim
>>
>>>
>>> -- Sean Silva
>>>
>>> On Fri, Oct 12, 2012 at 1:57 PM, Jim Grosbach<grosbach at apple.com>  wrote:
>>>> Hi Sean,
>>>>
>>>> Thanks for the feedback! Exactly the sort of discussion I was hoping to get started.
>>>>
>>>> On Oct 12, 2012, at 10:12 AM, Sean Silva<silvas at purdue.edu>  wrote:
>>>>
>>>>> How is the client supposed to make use of this markup information?
I've implemented a binary representation for arbitrarily nested 
structured data, which I call "storyboard data" - the files usually end 
with ".sbd", part of my v3c-storyboard project in SourceForge: 
http://sourceforge.net/projects/v3c-storyboard/.

Their text equivalent is a storyboard text file, which, again by 
convention, ends with ".sbt".

Here's an example:

program
( name("hello-world")
, contents
     ( puts
         ( class(function-prototype)
         , returns(int)
         , parameters
             ( str(type(pointer(const(char))))
             )
         )
     , main
         ( class(function)
         , returns(int)
         , parameters
             ( argc(type(int))
             , argv(type(array(pointer(char))))
             )
         , body
             ( puts("Hello, world!")
             , return(0)
             )
         )
     )
)
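
To make the shape of that text concrete, here is a rough Python sketch that reads it into a tree of (name, children) pairs. The grammar is only inferred from the example above (a name or quoted string, optionally followed by a parenthesised, comma-separated list of children); tokenize and parse_node are hypothetical names and this is not the actual v3c-storyboard reader.

```python
import re

# Assumed token shapes, inferred from the example only:
# a double-quoted string, one of ( ) , or a bare word such as function-prototype.
TOKEN = re.compile(r'\s*("(?:[^"\\]|\\.)*"|[(),]|[^\s(),"]+)')


def tokenize(text):
    """Split storyboard-style text into a flat list of tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unexpected input at offset %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_node(tokens, i=0):
    """Parse one node starting at tokens[i]; return ((name, children), next_index)."""
    name = tokens[i]
    i += 1
    children = []
    if i < len(tokens) and tokens[i] == "(":
        i += 1
        while True:
            child, i = parse_node(tokens, i)
            children.append(child)
            if tokens[i] == ",":
                i += 1
            elif tokens[i] == ")":
                i += 1
                break
            else:
                raise ValueError("expected ',' or ')' but got %r" % tokens[i])
    return (name, children), i


if __name__ == "__main__":
    sample = 'program ( name("hello-world") , contents ( main ( returns(int) ) ) )'
    tree, _ = parse_node(tokenize(sample))
    print(tree)
```

Quoted strings keep their quotes in this sketch, so a consumer can still tell a string literal from a name.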

The current downloadable version can walk the sbd created from this 
data and output C code for compilation.
Version 0.2.0-05 (which I'll release in a week or two) has a 
"hello-world3-test" program that uses an LLVM Module to dump IR that's 
assembled into a hello world program and run, looking for that "Hello, 
world!" output for the test to pass.

The 0.2.0-05 "hello-world4-test" program interprets the sbd. The C++ 
program "calls" the "main" function defined in the sbd and the 
implementation uses libffi to create closures for functions defined by 
the sbd and call interfaces for "real" functions, like "puts".

The sbd API can be called from C and C++.

All the demos/tests use C++ as it saves typing, but as you may suspect 
from looking at the project's other parts, I plan to implement a 
graphical user interface (GUI) to interact with sbds.

That GUI will at least in part be implemented by sbds, once the 
interpreter can interact with Qt's C++ classes, which I'm about to start.

Like LLVM I've got one global repository for type information and functions.

The plan is to have multiple domains, where a domain could, for example, 
implement a domain editor that allows the user to create, edit and 
execute other domains.

These are all building blocks to address the fundamental problem of 
software development - text.

Intel are working on a "Kinect"-like camera that can be used to scan 
hand gestures in real time.

I want to use that (or something like it) to develop software.
>>>>
>>>> Target-independent introspection of the assembly. A simple example is color-coded output in a GUI disassembly display. All registers show up one color, all memory references another, and immediates yet another, and other such simple things. More interestingly, the client could use the markup to simplify implementation of mouse-over introspection of register values without having to know anything about the assembly syntax. The only target hook required would be "get the value of the register named 'foo'" since identifying the register names in the asm string is handled by the markup. Or, getting a bit fancier, visualizing assembly data flow, with def-use chains for a register marked with arrows, again likely triggered via mouseover of a register name. The key bit here is that this is doable without the client having any knowledge of the target assembly syntax itself.
>>>>
>>>>> At first glance it seems like client code will just devolve into a pile
>>>>> of regex insanity. Why not use an existing standardized markup, like
>>>>> XML (not that I'm that fond of XML)?
>>>>
>>>> Plain regex would be a very bad way to handle this. Client code should be very simple, just looking for the '<' characters to find annotations. A parser to recognize the markup and ignore it all should be almost trivial.
>>>>
>>>> XML is basically just massive overkill for this. The idea is a lightweight annotation system that a client can easily strip off while paying attention to the bits and pieces it cares about.
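
As a minimal sketch of the "strip it off" case, assuming the syntax from the original proposal quoted further down ('<' tag-and-modifiers ':' text '>', with a doubled '<<' or '>>' standing for a literal character): strip_markup is a hypothetical helper, not part of LLVM.

```python
def strip_markup(s):
    """Return the assembly text with all annotations removed.

    Assumes the proposed syntax: '<' header ':' text '>', where a doubled
    '<<' or '>>' is an escaped literal. Malformed input (a '<' with no
    following ':') raises ValueError from str.index.
    """
    out = []
    i, n = 0, len(s)
    while i < n:
        c = s[i]
        if c in "<>" and s[i + 1:i + 2] == c:
            out.append(c)             # escaped literal '<' or '>'
            i += 2
        elif c == "<":
            i = s.index(":", i) + 1   # skip the tag name and modifiers
        elif c == ">":
            i += 1                    # end of an annotation, nothing to emit
        else:
            out.append(c)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(strip_markup("ldr <reg gpr:r0>, <mem regoffset:[<reg gpr:sp>, <imm:#4>]>"))
```

One caveat of this sketch: it always reads '>>' as an escaped '>', so two annotations that close back to back would be misread. The proposal text does not say how that case is resolved.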
>>>>
>>>>> At a higher level, why not expose an API for iterating over
>>>>> (potentially annotated) tokens which can be programmatically
>>>>> inspected. So what you expose to clients is an AnnotatedAsmTok. Given
>>>>> an AnnotatedAsmTok, they can call "getAnnotation()", or
>>>>> "getRawText()". A textual representation which can be read into this
>>>>> form might be useful, but we should provide the parser.
>>>>
>>>> We could. It's just outside the scope of what we're looking to do on the initial implementation. Note that it does get a bit more complicated since we're not just annotating tokens, but regions of text, and the annotations can (and often will be) nested.
>>>>
>>>>> I guess what I think needs a bit more explanation is why you chose to
>>>>> go the "markup" route, instead of a normal programmatic API.
>>>>
>>>> To keep the surface area of the C API as minimal as possible and robust against changes in what's marked up and how. Consider the interface in EnhancedDisassembly.h, for an example of what we specifically want to avoid (and obsolete).
>>>>
>>>>> Maybe you
>>>>> could also include a couple use cases that capture your "vision" for
>>>>> this functionality, and maybe a tiny bit of sample code doing
>>>>> something interesting with a very rough initial interface (if it seems
>>>>> more natural, since you're talking about a C API, you can just assume
>>>>> bindings and write the example in your scripting language of choice).
>>>>>
>>>>
>>>> Does the description up above sufficiently answer this? FWIW, one of the bits of example "how do I use this?" code I want as part of the project is a pretty-printed disassembly. Specifically, llvm-objdump will produce annotated disassembly and there will be a standalone tool that will take that text as input and use the markup to produce a pretty-printed output (as HTML, ANSI color codes or whatever).
>>>>
>>>> A quick real-world example of where this can get used is colorized disassembly in LLDB without LLDB having to re-implement an assembly parser to do it.
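
A rough sketch of what such a pretty-printer could look like for ANSI terminals, assuming the proposed '<tag modifiers:text>' syntax with doubled '<<' / '>>' as escapes. The colour table and the colorize name are arbitrary choices for illustration; nothing here is an LLVM or LLDB API.

```python
# Arbitrary colour choices for the three tag names mentioned in the proposal.
COLORS = {
    "reg": "\033[32m",  # green
    "imm": "\033[33m",  # yellow
    "mem": "\033[36m",  # cyan
}
RESET = "\033[0m"


def colorize(s):
    """Replace annotations with ANSI colour codes, keeping the assembly text."""
    out = []
    stack = []  # colour codes of the enclosing annotations, innermost last
    i, n = 0, len(s)
    while i < n:
        c = s[i]
        if c in "<>" and s[i + 1:i + 2] == c:
            out.append(c)                      # escaped literal character
            i += 2
        elif c == "<":
            colon = s.index(":", i)
            header = s[i + 1:colon].split()
            tag = header[0] if header else ""
            # Unknown tags are ignored: they keep the enclosing colour.
            color = COLORS.get(tag, stack[-1] if stack else RESET)
            stack.append(color)
            out.append(color)
            i = colon + 1
        elif c == ">":
            if stack:
                stack.pop()
            # Fall back to the enclosing annotation's colour, if any.
            out.append(stack[-1] if stack else RESET)
            i += 1
        else:
            out.append(c)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(colorize("ldr <reg gpr:r0>, <mem regoffset:[<reg gpr:sp>, <imm:#4>]>"))
```

The colouring code never looks at the instruction syntax itself, only at the tag names, so the same function works for any target that emits the markup.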
>>>>
>>>> -Jim
>>>>
>>>>
>>>>> -- Sean Silva
>>>>>
>>>>> On Fri, Oct 12, 2012 at 12:51 PM, Jim Grosbach<grosbach at apple.com>  wrote:
>>>>>> The following is a brief proposal for annotated assembly (and disassembly) output. Kevin Enderby and I have been discussing this a bit and are interested in getting broader feedback from interested folks.
>>>>>>
>>>>>>    LLVM Rich Assembly Output
>>>>>>
>>>>>> LLVM's (dis)assembly output is currently very raw. Consumers have limited ability to introspect the instructions' textual representation or to reformat for a more user friendly display. A lot of the actual instruction semantics are contained in the MCInstrDesc for the opcode, but that's not sufficient to reference into individual portions of the instruction text. For clients like disassemblers, list file generators, and pretty-printers, more is necessary than the raw instructions and the ability to print them.
>>>>>>
>>>>>> The intent is for the vast majority of the new functionality to not require new APIs, but to be in the assembly text itself via markup annotations. The markup is simple enough in syntax to be robust even in the case of version mismatches between consumers and producers. That is, the syntax generally does not carry semantics beyond "this text has an annotation," so consumers can simply ignore annotations they do not understand or do not care about.
>>>>>>
>>>>>> ** Instruction Annotations
>>>>>>
>>>>>> Annotated assembly display will supply contextual markup to help clients more efficiently implement things like pretty printers. Most markup will be target independent, so clients can effectively provide good display without any target specific knowledge.
>>>>>>
>>>>>> Annotated assembly goes through the normal instruction printer, but optionally includes contextual tags on portions of the instruction string. An annotation is any '<' '>' delimited section of text(1).
>>>>>>
>>>>>> annotation: '<' tag-name tag-modifier-list ':' annotated-text '>'
>>>>>> tag-name: identifier
>>>>>> tag-modifier-list: comma delimited identifier list
>>>>>>
>>>>>> The tag name is an identifier which gives the type of the annotation. For the first pass, this will be very simple, with memory references, registers, and immediates having the tag names "mem", "reg", and "imm", respectively.
>>>>>>
>>>>>> The tag modifier list is typically additional target-specific context, such as register class.
>>>>>>
>>>>>> Clients should accept and ignore any tag names or tag modifiers they do not understand, allowing the annotations to grow in richness without breaking older clients.
>>>>>>
>>>>>> For example, a possible annotation of an ARM load of a stack-relative location might be annotated as:
>>>>>>
>>>>>>    ldr <reg gpr:r0>, <mem regoffset:[<reg gpr:sp>, <imm:#4>]>
>>>>>>
>>>>>>
>>>>>> 1: For assembly dialects in which '<' and/or '>' are legal tokens, a literal token is escaped by following immediately with a repeat of the character.  For example, a literal '<' character is output as '<<' in an annotated assembly string.
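
For anyone wanting to experiment with the grammar above, here is a small recursive-descent sketch in Python that turns an annotated string into a nested structure. The Annotation class and parse_annotated function are hypothetical names, and the handling of the header (first word is the tag name, the rest is a comma-separated modifier list) is an assumption read off the ARM example rather than a specified rule.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    tag: str
    modifiers: list
    children: list = field(default_factory=list)  # str or Annotation items

    def text(self):
        """The annotated text with all nested markup removed."""
        return "".join(c if isinstance(c, str) else c.text() for c in self.children)


def _parse_items(s, i, top):
    items, buf = [], []

    def flush():
        if buf:
            items.append("".join(buf))
            buf.clear()

    n = len(s)
    while i < n:
        c = s[i]
        if c in "<>" and s[i + 1:i + 2] == c:
            buf.append(c)                       # '<<' or '>>' is an escaped literal
            i += 2
        elif c == "<":
            flush()
            colon = s.index(":", i)             # ValueError if the header is unterminated
            header = s[i + 1:colon].split(None, 1)
            tag = header[0] if header else ""
            mods = [m.strip() for m in header[1].split(",")] if len(header) > 1 else []
            children, i = _parse_items(s, colon + 1, top=False)
            items.append(Annotation(tag, mods, children))
        elif c == ">":
            if top:
                raise ValueError("unbalanced '>' at offset %d" % i)
            flush()
            return items, i + 1
        else:
            buf.append(c)
            i += 1
    if not top:
        raise ValueError("unterminated annotation")
    flush()
    return items, i


def parse_annotated(s):
    """Return a list of plain strings and (possibly nested) Annotation objects."""
    items, _ = _parse_items(s, 0, top=True)
    return items


if __name__ == "__main__":
    for item in parse_annotated("ldr <reg gpr:r0>, <mem regoffset:[<reg gpr:sp>, <imm:#4>]>"):
        print(repr(item))
```

A client that only cares about registers can walk the result and pick out items whose tag is "reg", ignoring every other tag and modifier, which is the forward-compatibility behaviour the proposal asks of clients.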
>>>>>>
>>>>>>
>>>>>> ** C API Details
>>>>>>
>>>>>> Some intended consumers of this information use the C API, so a new C API function, "LLVMDisasmInstructionAnnotated", will be added to the disassembler to disassemble an instruction with annotations.
>>>>>>
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>
>>