[LLVMdev] Using LLVM for decompiling.

Mon May 7 10:05:26 PDT 2012

On Mon, May 7, 2012 at 9:45 AM, James Courtier-Dutton
<james.dutton at gmail.com> wrote:
> On 7 May 2012 16:31, John Criswell <criswell at illinois.edu> wrote:
>> On 5/7/12 5:47 AM, James Courtier-Dutton wrote:
>>>
>>> Hi,
>>>
>>> I am writing a decompiler. I was wondering if some of LLVM could be
>>> used for a decompiler.
>>> There are several stages in the decompiler process.
>>> 1) Take the binary and create a higher-level representation of it, like RTL.
>>> 2) The output is then broken into blocks or nodes, each block ends in
>>> a CALL, JMP, RET, or 2-way or multiway conditional JMP.
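
Step 2 above can be sketched roughly as follows. This is only a minimal
illustration in Python, not code taken from libbeauty or LLVM. The
instruction format `(address, opcode, targets)`, the opcode names `JCC` and
`SWITCH`, and the function name `split_blocks` are all hypothetical. It
applies the usual "leader" rule: a new block starts at the first
instruction, at every jump target, and after every block-ending instruction.

```python
# Hypothetical instruction format: (address, opcode, targets), where
# targets lists the jump destinations (empty for non-branch instructions).
# JCC stands for a 2-way conditional jump, SWITCH for a multiway jump.
TERMINATORS = {"CALL", "JMP", "RET", "JCC", "SWITCH"}


def split_blocks(instructions):
    """Split a linear instruction list into blocks using the leader rule."""
    if not instructions:
        return []
    addresses = {addr for addr, _, _ in instructions}
    leaders = {instructions[0][0]}
    for i, (addr, op, targets) in enumerate(instructions):
        if op in TERMINATORS:
            # The instruction following a terminator starts a new block.
            if i + 1 < len(instructions):
                leaders.add(instructions[i + 1][0])
            # Jump targets inside this code start a new block. A CALL
            # target is assumed to belong to another function, so skip it.
            if op != "CALL":
                leaders.update(t for t in targets if t in addresses)
    blocks, current = [], []
    for ins in instructions:
        if ins[0] in leaders and current:
            blocks.append(current)
            current = []
        current.append(ins)
    blocks.append(current)
    return blocks


code = [
    (0, "MOV", []),
    (1, "CMP", []),
    (2, "JCC", [5]),
    (3, "ADD", []),
    (4, "JMP", [1]),
    (5, "RET", []),
]
for block in split_blocks(code):
    print([addr for addr, _, _ in block])
```

Ending blocks at CALL follows the description in the quoted text; many
compilers instead let a call sit in the middle of a basic block, so that
choice is specific to this sketch.
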
>>
>>
>> I'm not sure that there's anything that will help you with this step for
>> LLVM.  The closest I can think of is Qemu, and I think that uses dynamic
>> binary translation (i.e., you have to run the binary program).
>>
> No problem. I have already coded (1) and (2):
> https://github.com/jcdutton/libbeauty
> It uses a tiny VM to help it do (1) and (2), so it will eventually be
> able to handle self-modifying code as well.
> It is similar to qemu in that respect, except that it is not
> concerned with the real-time execution of the code that qemu provides.
>>
>>> 3) The blocks or nodes are then analyzed for structure in order to
>>> extract loop information and if...then...else information.
>>
>>
>> Given that you've completed steps one and two (i.e., you've converted the
>> binary instructions to LLVM IR and then discovered basic blocks), then yes,
>> LLVM's current analysis passes should help you with this third step.  LLVM
>> has passes that normalize loops, identify loops in local control-flow
>> graphs, identify dominators/post-dominators, etc.
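
To make the terms above concrete, here is a minimal sketch in Python of how
dominators and natural loops can be computed on a control-flow graph. It is
a textbook-style formulation and is not meant to show how LLVM's passes are
implemented. The CFG representation (a dict mapping each block name to its
successor list) and the function names are hypothetical, and the sketch
assumes that every block is a key of the dict and is reachable from the
entry block.

```python
def predecessors(cfg):
    """Invert a successor map into a predecessor map."""
    preds = {n: set() for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)
    return preds


def dominators(cfg, entry):
    """Iterate dom[n] = {n} | intersection of dom[p] over predecessors p."""
    nodes = set(cfg)
    preds = predecessors(cfg)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def natural_loops(cfg, entry):
    """Map each back edge (tail, header) to the set of blocks in its loop."""
    dom = dominators(cfg, entry)
    preds = predecessors(cfg)
    loops = {}
    for tail, succs in cfg.items():
        for header in succs:
            if header in dom[tail]:  # back edge: header dominates tail
                body = {header}
                stack = [tail]
                while stack:
                    n = stack.pop()
                    if n not in body:
                        body.add(n)
                        stack.extend(preds[n])
                loops[(tail, header)] = body
    return loops


# A while-loop shaped CFG: entry -> cond -> (body -> cond | exit)
cfg = {
    "entry": ["cond"],
    "cond": ["body", "exit"],
    "body": ["cond"],
    "exit": [],
}
print(natural_loops(cfg, "entry"))
```

A block whose loop header has two successors, one inside the loop and one
outside, is the usual starting point for recovering a `while` or `for`
construct in step 3.
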
> Great, which part of the LLVM source code does step (3)?
>
>>
>>
>>> 4) Once structure is obtained, data types can be analyzed.
>>
>>
>> The only thing for LLVM which could help here is a type-inference/points-to
>> analysis called DSA.  However, since you're reversing everything from binary
>> code, I doubt DSA's type-inference will work well, so I don't think it will
>> find many (if any) high-level types like structs or arrays of structs.
>>
>> You might be able to build a more sophisticated analysis yourself, but
>> you'll pretty much be on your own.
> I agree; discovering data types is not something that a compiler
> needs to do.
> The source code has already told the compiler about the data types.
> I will work on this on my own, once I have (3) done.
>
>>
>>
>>> 5) Lastly, source code is output in C or C++ or whatever is needed.
>>
>>
>> LLVM might have facilities for converting LLVM IR to C or C++ code (the C
>> backend was recently removed; there might be a C++ backend, but I'm not
>> sure).  However, they are primarily designed for systems for which LLVM does
>> not provide a native code generator, so the C/C++ code they output isn't
>> very readable.
> Is the C/C++ code not very readable because the LLVM IR does not
> carry enough high-level metadata to do it,
> i.e. it lacks structure?
> Or is it because the backend was only designed to produce input for
> another compiler, so pretty structure like for loops etc. was not
> necessary?

The old C backend was designed purely to produce input for another
compiler; you could easily write a version with much prettier output.

> Why was the C backend removed?

Basically, it was unmaintained, the output was broken, and nobody was
willing to spend the time to fix it.

-Eli