[LLVMdev] Using LLVM for decompiling.

John Criswell criswell at illinois.edu
Mon May 7 10:13:16 PDT 2012


On 5/7/12 11:45 AM, James Courtier-Dutton wrote:
> On 7 May 2012 16:31, John Criswell<criswell at illinois.edu>  wrote:
>> On 5/7/12 5:47 AM, James Courtier-Dutton wrote:
>>> Hi,
>>>
>>> I am writing a decompiler. I was wondering if some of LLVM could be
>>> used for a decompiler.
>>> There are several stages in the decompiler process.
>>> 1) Take binary and create a higher level representation of it. Like RTL.
>>> 2) The output is then broken into blocks or nodes, each block ends in
>>> a CALL, JMP, RET, or 2-way or multiway conditional JMP.
>>
>> I'm not sure that there's anything that will help you with this step for
>> LLVM.  The closest I can think of is Qemu, and I think that uses dynamic
>> binary translation (i.e., you have to run the binary program).
>>
> No problem. I have already coded (1) and (2)
> https://github.com/jcdutton/libbeauty
> It uses a tiny VM to help it do 1 and 2, so will eventually be able to
> also handle self modifying code.
> It is similar to qemu in that respect, except that it is not so
> concerned with real time execution of the code that qemu provides.
>>> 3) The blocks or nodes are then analyzed for structure in order to
>>> extract loop information and if...then...else information.
>>
>> Given that you've completed steps one and two (i.e., you've converted the
>> binary instructions to LLVM IR and then discovered basic blocks), then yes,
>> LLVM's current analysis passes should help you with this third step.  LLVM
>> has passes that normalize loops, identify loops in local control-flow
>> graphs, identify dominators/post-dominators, etc.
> Great, which bit of the LLVM source code does this bit (3)?

There are various passes: use http://llvm.org/docs/Passes.html to see 
which analyses exist in LLVM and then use the doxygen docs 
(http://llvm.org/doxygen/hierarchy.html) to get exact information on how 
to use them.

Examples of the above passes include llvm::LoopInfo, 
llvm::DominatorTree, and llvm::PostDominatorTree.


>
>>
>>> 4) Once structure is obtained, data types can be analyzed.
>>
>> The only thing for LLVM which could help here is a type-inference/points-to
>> analysis called DSA.  However, since you're reversing everything from binary
>> code, I doubt DSA's type-inference will work well, so I don't think it will
>> find many (if any) high-level types like structs or arrays of structs.
>>
>> You might be able to build a more sophisticated analysis yourself, but
>> you'll pretty much be on your own.
> I agree, my need to discover data types is not a function that a
> compiler needs to do.

Actually, tools like the SAFECode compiler might benefit from more 
aggressive type-reconstruction.  What we've seen is that certain LLVM 
transforms (I think the loop transforms and instcombine) "remove" type 
information in the IR.  For example, a transform will change a 
getelementptr instruction that indexes into an array of structures into 
a getelementptr instruction that treats the same memory object as an 
array of bytes.  Both instructions do the same thing, but the former has 
information that make type-inference much easier.

There are two ways to fix the issue: either modify the transforms to 
maintain the type information when possible, or work harder to 
reconstruct it when it gets lost.  The former is better, I think, but 
the latter may still be useful.

In short, your efforts may benefit more than just you.

Just out of curiosity, what approach do you plan to take for this problem?

> The Source code has already told the compiler about the data types.
> I will work on this on my own, once I have (3) done.
>
>>
>>> 5) Lastly, source code is output in C or C++ or whatever is needed.
>>
>> LLVM might have facilities for converting LLVM IR to C or C++ code (the C
>> backend was recently removed; there might be a C++ backend, but I'm not
>> sure).  However, they are primarily designed for systems for which LLVM does
>> not provide a native code generator, so the C/C++ code they output isn't
>> very readable.
> Is the reason that the C/C++ code is not very readable, because there
> is not high enough metadata in the LLVM IR to do it?
> I.e. Lack of structure.
> Or was it because it was only designed to be input to another
> compiler, so pretty structure like for loops etc, was not necessary?

The latter; it was only designed to provide output to be consumed by 
another C compiler.

> Why was the C backend removed?

No one used it, and it was bit rotting.  If someone wants to bring it 
back to life, patches are welcome.

-- John T.

>
> James




More information about the llvm-dev mailing list