[LLVMdev] [Proposal] Annotated assembly output

Fri Oct 12 09:51:19 PDT 2012

The following is a brief proposal for annotated assembly (and disassembly) output. Kevin Enderby and I have been discussing this a bit and are interested in getting broader feedback from interested folks.

    LLVM Rich Assembly Output

LLVM's (dis)assembly output is currently very raw. Consumers have limited ability to introspect the instructions' textual representation or to reformat for a more user friendly display. A lot of the actual instruction semantics are contained in the MCInstrDesc for the opcode, but that's not sufficient to reference into individual portions of the instruction text. For clients like disassemblers, list file generators, and pretty-printers, more is necessary than the raw instructions and the ability to print them.

The intent is for the vast majority of the new functionality to not require new APIS, but to be in the assembly text itself via markup annotations. The markup is simple enough in syntax to be robust even in the case of version mismatches between consumers and producers. That is, the syntax generally does not carry semantics beyond "this text has an annotation," so consumers can simply ignore annotations they do not understand or do not care about.

** Instruction Annotations

Annoated assembly display will supply contextual markup to help clients more efficiently implement things like pretty printers. Most markup will be target independent, so clients can effectively provide good display without any target specific knowledge.

Annotated assembly goes through the normal instruction printer, but optionally includes contextual tags on portions of the instruction string. An annotation is any '<' '>' delimited section of text(1).

annotation: '<' tag-name tag-modifier-list ':' annotated-text '>'
tag-name: identifier
tag-modifier-list: comma delimited identifier list

The tag name is an identifier which gives the type of the annotation. For the first pass, this will be very simple, with memory references, registers, and immediates having the tag names "mem", "reg", and "imm", respectively.

The tag modifier list is typically additional target-specific context, such as register class.

Clients should accept and ignore any tag names or tag modifiers they do not understand, allowing the annotations to grow in richness without breaking older clients.

For example, a possible annotation of an ARM load of a stack-relative location might be annotated as:

    ldr <reg gpr:r0>, <mem regoffset:[<reg gpr:sp>, <imm:#4>]>

1: For assembly dialects in which '<' and/or '>' are legal tokens, a literal token is escaped by following immediately with a repeat of the character.  For example, a literal '<' character is output as '<<' in an annotated assembly string.

** C API Details

Some intended consumers of this information use the C API, therefore a new C API function for the disassembler will be added to disassemble an instruction with annotations, "LLVMDisasmInstructionAnnotated.".