[LLVMdev] GSoC Proposal: Table-Driven Decompilation

Wed Apr 4 07:54:55 PDT 2012

On 4/4/2012 12:08 AM, Charles Davis wrote:
> Proposal:
> Since its humble beginnings in 2001, LLVM has grown from a simple compiler toolkit to an entire family of build tools. Currently, it includes an assembler, a disassembler, a JIT, a C compiler, a debugger, an archiver, various tools for analyzing object files, and even a linker. In fact, just about the only tool missing from this set (aside from various language compilers) is a decompiler--a tool to turn machine code back into LLVM IR. This project proposes adding such a tool.
>
> Some of the information needed to produce such a tool is already present--in the form of target description files, some of which contain patterns used to transform LLVM IR--or, more accurately, a selection DAG--into machine code. Since a decompiler is largely a compiler working in reverse, it should be conceivable to use these patterns to transform machine code back into a selection DAG--and transform that, in turn, back into raw LLVM IR. To actually read machine code, the decompiler will use the MC disassembler to produce MCInst objects, which can be transformed back into CodeGen's MachineInstr representation, so it can be fed through the selection DAG in reverse.
>
> Some DAG->machine code transformations aren't controlled by TableGen patterns, but by custom transformations implemented as C++ code. For those transformations, custom C++ code performing the reverse transformation will be necessary.

This strikes me as an extremely ambitious project, and I wonder how much 
you could actually get done in a summer.

Take the MCInst->MachineInstr conversion. To do this phase properly, 
you have to reconstruct functions, basic blocks, and control flow, as 
well as identify global variables and their initializers (should they 
have any). I am not an expert in the state of the art, but some 
experiments with IDA Free indicate that identifying the latter 
correctly is hard. Consider arrays where the code refers to particular 
elements: you get references to the middle of the array, which a tool 
would generally turn into a new symbol at that offset... arrays of 
structs are even worse!
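
To make the array problem concrete, here is a minimal sketch in Python. 
It is not how IDA or any real tool works; naive_symbolize, the section 
bounds, and the addresses are all made up for illustration. It assumes 
the only evidence available is the set of data addresses the code 
refers to, and it starts a new object at each one.

```python
def naive_symbolize(section_start, section_end, referenced_addrs):
    """Split [section_start, section_end) at every referenced address.

    Returns a list of (address, size) pairs, one per guessed object.
    This is a deliberately naive heuristic, not a real tool's algorithm.
    """
    cuts = {section_start}
    cuts.update(a for a in referenced_addrs if section_start <= a < section_end)
    ordered = sorted(cuts)
    ends = ordered[1:] + [section_end]
    return [(lo, hi - lo) for lo, hi in zip(ordered, ends)]


# Hypothetical layout: `int arr[4]` occupies 0x1000..0x1010, and the code
# references both &arr[0] (0x1000) and &arr[2] (0x1008).
guessed = naive_symbolize(0x1000, 0x1010, [0x1000, 0x1008])
print([(hex(addr), size) for addr, size in guessed])
# prints [('0x1000', 8), ('0x1008', 8)]
```

The single 16-byte array comes back as two unrelated 8-byte objects, 
and nothing in the machine code alone says which reading is right.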

Another issue is getting the implicit register uses correct. If you 
leave these out, you get a MachineFunction that you can't safely run 
any code generation optimizations on. This isn't 
necessarily a problem, but it's something which would need warning 
labels in big, flashing neon signs. In any case, tracking the implicit 
use of registers is critical to being able to do register "deallocation" 
and recovering SSA values for the machine code. It's not clear to me 
that you could let this fall out during the SelectionDAG process.
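
As a rough illustration of why the implicit operands matter, the sketch 
below links each register use in a straight-line block to the 
instruction that last defined that register, once with implicit 
operands and once without. The MI record and def_use_edges are 
hypothetical stand-ins, not LLVM's MachineInstr API; the operand lists 
follow the x86 MUL instruction, which implicitly reads EAX and writes 
EAX, EDX and EFLAGS.

```python
from typing import NamedTuple, Tuple


class MI(NamedTuple):
    """Toy machine instruction; a stand-in, not LLVM's MachineInstr."""
    name: str
    defs: Tuple[str, ...] = ()
    uses: Tuple[str, ...] = ()
    implicit_defs: Tuple[str, ...] = ()
    implicit_uses: Tuple[str, ...] = ()


def def_use_edges(block, with_implicit=True):
    """Return (defining_index, using_index, register) for every use.

    Only handles a single straight-line block; defining_index is None
    when the register was not defined earlier in the block.
    """
    last_def = {}
    edges = []
    for idx, mi in enumerate(block):
        uses = mi.uses + (mi.implicit_uses if with_implicit else ())
        defs = mi.defs + (mi.implicit_defs if with_implicit else ())
        for reg in uses:
            edges.append((last_def.get(reg), idx, reg))
        for reg in defs:
            last_def[reg] = idx
    return edges


block = [
    MI("mov eax, 6", defs=("eax",)),
    MI("mov ecx, 7", defs=("ecx",)),
    MI("mul ecx", uses=("ecx",),
       implicit_uses=("eax",), implicit_defs=("eax", "edx", "eflags")),
    MI("mov ebx, eax", defs=("ebx",), uses=("eax",)),
]

print(def_use_edges(block, with_implicit=True))
# prints [(1, 2, 'ecx'), (0, 2, 'eax'), (2, 3, 'eax')]
print(def_use_edges(block, with_implicit=False))
# prints [(1, 2, 'ecx'), (0, 3, 'eax')]
```

Without the implicit operands, the final mov appears to read the 
constant 6 from instruction 0 rather than the product from the mul, so 
any SSA value recovered from that edge would be wrong.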

Finally, it's not clear to me that going through a selection DAG would 
gain you much over just doing a straight MachineInstr->LLVM IR 
transformation. However, this is not my area of expertise, so I won't 
comment any further in this regard.

-- 
Joshua Cranmer
News submodule owner
DXR coauthor