[LLVMdev] GSoC Proposal: Table-Driven Decompilation

Thu Apr 5 08:18:49 PDT 2012

On Apr 4, 2012, at 8:54 AM, Joshua Cranmer wrote:

> On 4/4/2012 12:08 AM, Charles Davis wrote:
>> Proposal:
>> Since its humble beginnings in 2001, LLVM has grown from a simple compiler toolkit to an entire family of build tools. Currently, it includes an assembler, a disassembler, a JIT, a C compiler, a debugger, an archiver, various tools for analyzing object files, and even a linker. In fact, just about the only tool missing from this set (aside from various language compilers) is a decompiler--a tool to turn machine code back into LLVM IR. This project proposes adding such a tool.
>> 
>> Some of the information needed to produce such a tool is already present--in the form of target description files, some of which contain patterns used to transform LLVM IR--or, more accurately, a selection DAG--into machine code. Since a decompiler is largely a compiler working in reverse, it should be conceivable to use these patterns to transform machine code back into a selection DAG--and transform that, in turn, back into raw LLVM IR. To actually read machine code, the decompiler will use the MC disassembler to produce MCInst objects, which can be transformed back into CodeGen's MachineInstr representation, so it can be fed through the selection DAG in reverse.
>> 
>> Some DAG->machine code transformations aren't controlled by TableGen patterns, but by custom transformations implemented as C++ code. For those transformations, custom C++ code performing the reverse transformation will be necessary.
> 
> This strikes me as an extremely ambitious project, and I wonder how much 
> you could actually get done in a summer.
> 
> Take the MCInst->MachineInstr conversion. In order to properly do this 
> phase, you have to reconstruct functions, basic blocks, control flow... 
> as well as identifying global variables and their initializers (should 
> they have any). I am not an expert in the state of the art, but some 
> experiments with IDA Free indicate that identifying the latter correctly 
> is hard (consider arrays where you have references to particular elements 
> of the array: you have references to the middle of the array, which would 
> generally construct a new symbol offset... arrays of structs are even 
> worse!).
Maybe I should devote my SoC project to just this, then? Perhaps I could write a simple MCInst->MachineInstr transformer that doesn't handle anything complex, such as global variables or arrays?
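
To make the "reconstruct basic blocks" part concrete, here is a minimal sketch of the classic leader-finding step over a linear run of decoded instructions. DecodedInst and findBlockLeaders are hypothetical names invented for this illustration; they are not the MCInst or MachineInstr API, and the sketch assumes direct branch targets are already known. Indirect branches, jump tables, and targets that land in the middle of an instruction are exactly the hard cases it leaves out.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <set>
#include <vector>

// Hypothetical, simplified stand-in for a decoded instruction.
// This is NOT llvm::MCInst; it only carries what the sketch needs.
struct DecodedInst {
  uint64_t Address;                     // address of the instruction
  uint64_t Size;                        // encoded length in bytes
  bool IsTerminator;                    // branch, jump, or return
  std::optional<uint64_t> BranchTarget; // direct target, if statically known
};

// Returns the addresses that begin basic blocks ("leaders"):
//   1. the first instruction,
//   2. every direct branch target inside the decoded range,
//   3. every instruction that immediately follows a terminator.
// Assumes Insts is sorted by address and contiguous.
std::set<uint64_t> findBlockLeaders(const std::vector<DecodedInst> &Insts) {
  std::set<uint64_t> Leaders;
  if (Insts.empty())
    return Leaders;

  const uint64_t Begin = Insts.front().Address;
  const uint64_t End = Insts.back().Address + Insts.back().Size;
  Leaders.insert(Begin);

  for (const DecodedInst &I : Insts) {
    assert(I.Size > 0 && "decoded instruction must have a nonzero size");
    if (!I.IsTerminator)
      continue;
    // Rule 2: a direct target that falls inside this range starts a block.
    if (I.BranchTarget && *I.BranchTarget >= Begin && *I.BranchTarget < End)
      Leaders.insert(*I.BranchTarget);
    // Rule 3: whatever follows a terminator starts a block.
    const uint64_t Next = I.Address + I.Size;
    if (Next < End)
      Leaders.insert(Next);
  }
  return Leaders;
}
```

Each leader then opens a block that runs up to the next leader. Control-flow edges come from the terminators, which is where a real MCInst->MachineInstr pass would need target-specific knowledge.
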
> 
> Another issue you would have is getting the implicit register usages 
> correct... if you leave these out, you would get a MachineFunction that 
> you can't do any code generation optimizations on safely. This isn't 
> necessarily a problem, but it's something which would need warning 
> labels in big, flashing neon signs. In any case, tracking the implicit 
> use of registers is critical to being able to do register "deallocation" 
> and recovering SSA values for the machine code. It's not clear to me 
> that you could let this fall out during the SelectionDAG process.
True.
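
A small sketch of why the implicit operands matter, under stated assumptions: ToyInst and liveIns are hypothetical names made up for this example (not llvm::MachineInstr), and the block is straight-line code. The motivating case is 32-bit x86 DIV, which names only the divisor explicitly but reads EDX:EAX and writes EAX and EDX.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified instruction record. NOT llvm::MachineInstr.
struct ToyInst {
  std::string Mnemonic;
  std::vector<std::string> Uses;         // registers read via explicit operands
  std::vector<std::string> Defs;         // registers written via explicit operands
  std::vector<std::string> ImplicitUses; // registers read without being named
  std::vector<std::string> ImplicitDefs; // registers written without being named
};

// Computes the registers that are read before being written in a
// straight-line block. With IncludeImplicit == false the implicit
// operands are ignored, which models a conversion that dropped them.
std::set<std::string> liveIns(const std::vector<ToyInst> &Block,
                              bool IncludeImplicit) {
  std::set<std::string> Defined, LiveIn;
  auto NoteUses = [&](const std::vector<std::string> &Regs) {
    for (const std::string &R : Regs)
      if (!Defined.count(R))
        LiveIn.insert(R);
  };
  for (const ToyInst &I : Block) {
    assert(!I.Mnemonic.empty() && "every instruction needs a mnemonic");
    // Reads happen before writes within a single instruction.
    NoteUses(I.Uses);
    if (IncludeImplicit)
      NoteUses(I.ImplicitUses);
    Defined.insert(I.Defs.begin(), I.Defs.end());
    if (IncludeImplicit)
      Defined.insert(I.ImplicitDefs.begin(), I.ImplicitDefs.end());
  }
  return LiveIn;
}
```

For a block containing only "div ecx", the version that tracks implicit operands reports eax, ecx, and edx as live-in, while the version that ignores them reports only ecx. Any later pass trusting the second answer could clobber EAX or EDX before the divide.
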
> 
> Finally, it's not clear to me that using a selection DAG would gain 
> you much benefit over just doing a straight MachineInstr->LLVM IR 
> transformation. However, this is not my area of expertise, so I won't 
> comment any further in this regard.
Perhaps after I finish MCInst->MachineInstr conversion, I should do a MachineInstr->LLVM IR transformation--or maybe I should just do MCInst->LLVM IR directly! I wanted to use the SelectionDAG because I wanted to take advantage of the patterns stored in the target description files, but now that I think about it, I'm not sure I'll get much out of it, either…

Thanks for your input.

Chip