[llvm-dev] [RFC][binutils] Machine-readable output from Binutils - possible GSOC project?

Tue Jan 14 03:24:21 PST 2020

On 10/01/2020 11:55, James Henderson via llvm-dev wrote:
> Hi all,
> 
> I was giving some thought as to possible project ideas I could propose 
> for this year’s Google Summer of Code, with regards to the LLVM 
> Binutils. One idea that I had was something discussed at last year’s 
> Euro LLVM developer meeting, namely machine-readable output from the 
> LLVM Binutils. Before I actually start advertising this as an open 
> project, I wanted to ask a few questions:
> 
>  1. Are people still interested in this? If so, what is the typical use
>     case you’d use the result of this project for? Why would this be
>     better than the existing llvm-readobj output (if applicable)?
>  2. Which tool(s) and feature(s) would you most want this for? I
>     personally think this should just be another output style for
>     llvm-readobj. Does anybody have any different opinion there?
>  3. Is there any additional tooling in relation to this project that you
>     think would be important to be a part of this project, e.g. a lit
>     function to query the output?
>  4. How might this interact with obj2yaml? Could the new output
>     ultimately be used to replace it?
>  5. Is there a priority for a specific format (e.g. ELF, DWARF, COFF)?
>  6. Would anybody be interested in co-mentoring such a project?

I wonder if machine-readable output from the tools is actually the 
correct approach.  When I have needed something similar, for example 
when parsing traces from a CPU debug interface and mapping them to 
places in the object code, I have used the same underlying libraries 
that these tools use in LLVM to get much richer output.

When I have done so, I have found that there is a huge amount of 
boilerplate involved.  I would be much more interested in moving a lot 
of the logic in these tools into some higher-level (API-stable) library 
abstractions (with scripting-language bindings) and then reimplementing 
the tools in terms of those libraries.

If at all possible, I'd rather not use these via a serialisation format.

For example, consider disassembly.  There are three levels of representation:

1. The binary encoding of the instruction.
2. The semantic decoding of the operation, the input and output
   operands, including information about the kind of instruction (e.g.
   branch, load, store).
3. The text representation.
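
To make the separation concrete, here is a rough sketch of how the three
layers could be modelled.  Everything in it (the class names, the fields,
the `Kind` categories, the example address) is hypothetical and chosen
for illustration; it is not an existing LLVM API.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Tuple


class Kind(Enum):
    # Hypothetical coarse instruction classes, for illustration only.
    ARITHMETIC = auto()
    BRANCH = auto()
    LOAD = auto()
    STORE = auto()
    OTHER = auto()


@dataclass(frozen=True)
class Encoding:
    """Layer 1: where the instruction lives and its raw bytes."""
    address: int
    raw: bytes


@dataclass(frozen=True)
class Semantics:
    """Layer 2: what the instruction does, independent of any syntax."""
    opcode: str
    kind: Kind
    reads: Tuple[str, ...] = ()
    writes: Tuple[str, ...] = ()
    immediates: Tuple[int, ...] = ()


@dataclass(frozen=True)
class Instruction:
    encoding: Encoding
    semantics: Semantics
    text: str  # Layer 3: the formatted assembly


def changes_control_flow(insn: Instruction) -> bool:
    # Answerable from layer 2 alone, without parsing any text.
    return insn.semantics.kind is Kind.BRANCH


def accesses_memory(insn: Instruction) -> bool:
    return insn.semantics.kind in (Kind.LOAD, Kind.STORE)


# A single-byte x86 nop at an arbitrary example address.
nop = Instruction(Encoding(0x1000, b"\x90"), Semantics("nop", Kind.OTHER), "nop")
```

The point of the split is that a consumer interested in control flow or
memory accesses only ever looks at the middle layer and never has to
parse the text.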

In most of the cases where I've wanted machine-readable objdump output,
what I actually wanted was part of 2.  Consider this line from objdump:

16bed:       48 83 c3 01             add    $0x1,%rbx

It has an address in the binary, the hex of the instruction, and the 
formatted assembly for the instruction.  The first two are pretty easy 
to encode in something like YAML, but would the last bit be just a 
string?  A format string with some more explicit values?  Would that be 
sufficient to know that this is an operation that reads and writes %rbx, 
uses a constant as another operand, and does not modify memory or 
control flow?
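
As a minimal sketch of the difference, the Python below first recovers
what the text line already gives you, then shows a hand-written record of
the kind of information I would actually want.  The regular expression,
the space-separated layout it assumes, and all of the field names
(`reads`, `writes`, `accesses_memory`, and so on) are my own assumptions
for illustration, not an existing or proposed output format.

```python
import json
import re

LINE = "   16bed:       48 83 c3 01             add    $0x1,%rbx"

# address ':' hex byte pairs, then the formatted assembly.
PATTERN = re.compile(
    r"^\s*(?P<addr>[0-9a-f]+):\s+(?P<bytes>(?:[0-9a-f]{2} )+)\s*(?P<text>.*)$"
)


def parse_objdump_line(line):
    """Recover the address, the raw bytes (as hex) and the assembly text."""
    m = PATTERN.match(line)
    if m is None:
        raise ValueError(f"unrecognised line: {line!r}")
    return {
        "address": int(m.group("addr"), 16),
        "bytes": m.group("bytes").replace(" ", ""),
        # Collapse the column padding; this is still just an opaque string.
        "text": " ".join(m.group("text").split()),
    }


record = parse_objdump_line(LINE)

# Hand-written, hypothetical semantic annotation for the same instruction.
# Nothing in the text output provides this directly.
semantics = {
    "opcode": "add",
    "reads": ["rbx"],
    "writes": ["rbx", "rflags"],  # x86 add also updates the flags register
    "immediates": [1],
    "accesses_memory": False,
    "changes_control_flow": False,
}

print(json.dumps({**record, "semantics": semantics}, indent=2))
```

The first dictionary is easy to produce from the existing output; the
second needs the decoded instruction, which is what the libraries already
have and the text throws away.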

David