[LLVMdev] Using LLVM for decompiling.

Mon May 7 10:08:38 PDT 2012

On 5/7/2012 11:45 AM, James Courtier-Dutton wrote:
> On 7 May 2012 16:31, John Criswell<criswell at illinois.edu>  wrote:
>> Given that you've completed steps one and two (i.e., you've converted the
>> binary instructions to LLVM IR and then discovered basic blocks), then yes,
>> LLVM's current analysis passes should help you with this third step.  LLVM
>> has passes that normalize loops, identify loops in local control-flow
>> graphs, identify dominators/post-dominators, etc.
> Great, which bit of the LLVM source code does this bit (3)?

Several of the passes in Analysis or Transforms. Those passes only
operate on LLVM IR, though, which your current library doesn't appear
to be emitting.
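
For a sense of what the dominator analysis mentioned above computes,
here is a minimal, self-contained Python sketch of the textbook iterative
algorithm. It is not LLVM's code: LLVM's passes are C++, operate on LLVM
IR, and its own dominator analysis is implemented differently. The
function name compute_dominators, the dict-of-lists CFG encoding and the
block names are all made up for illustration, and every block is assumed
to be reachable from the entry block.

```python
def compute_dominators(cfg, entry):
    """Return {block: set of blocks that dominate it}.

    cfg is a hypothetical encoding: block name -> list of successor names.
    Assumes every block is reachable from `entry`.
    """
    # Collect every block, including ones that only appear as successors.
    blocks = set(cfg)
    for succs in cfg.values():
        blocks.update(succs)

    # Invert the successor lists to get predecessor lists.
    preds = {b: [] for b in blocks}
    for b, succs in cfg.items():
        for s in succs:
            preds[s].append(b)

    # Start from "everything dominates everything" and shrink to a fixpoint.
    dom = {b: set(blocks) for b in blocks}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == entry:
                continue
            incoming = [dom[p] for p in preds[b]]
            # A block is dominated by itself plus whatever dominates
            # all of its predecessors.
            new = set.intersection(*incoming) if incoming else set()
            new = new | {b}
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom


# A single while-style loop: entry -> header -> (body -> header | exit).
example_cfg = {
    "entry": ["header"],
    "header": ["body", "exit"],
    "body": ["header"],
    "exit": [],
}
example_dom = compute_dominators(example_cfg, "entry")
```

In this example "header" dominates both "body" and "exit", which is the
kind of fact a loop-finding or structuring pass builds on.
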
>> LLVM might have facilities for converting LLVM IR to C or C++ code (the C
>> backend was recently removed; there might be a C++ backend, but I'm not
>> sure).  However, they are primarily designed for systems for which LLVM does
>> not provide a native code generator, so the C/C++ code they output isn't
>> very readable.
> Is the reason that the C/C++ code is not very readable, because there
> is not high enough metadata in the LLVM IR to do it?
> I.e. Lack of structure.
> Or was it because it was only designed to be input to another
> compiler, so pretty structure like for loops etc, was not necessary?
> Why was the C backend removed?
>
The original reason for the C-backend was to be able to use LLVM for
optimizations on architectures that it didn't have code generators for.
However, several subsequent changes to the IR were never carried over to
the C-backend code generator, so it got to the point where it tended to
work only on small, trivial code samples.

It was never intended to be a decompiler, so the output was never
particularly readable, and it didn't bother trying to structure if
statements (IIRC, it was pretty much entirely goto-based). I earlier
linked to a project which purports to be able to output more readable
code as a backend (to try to do source-to-source translations for OpenCL
kernels), but I haven't had the time to investigate it in great detail,
so I don't know how well it holds up on real code.

Decompiling control flow is fairly easy to do (google "control flow
structuring" or similar terms and you'll turn up a few troves of papers
which give fairly clear details on how to do it). What makes decompiling
extremely difficult is rather that stripped, optimized executables don't
give you reliable ways to find functions (or parameters, for that
matter), and that typing and variable information is completely gone.
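
To illustrate the "fairly easy" half of that claim, here is a rough
Python sketch of the first step most control flow structuring schemes
take: finding loops in a CFG. It is a simplified textbook approach, not
taken from any particular paper or from LLVM. It assumes a reducible CFG
in which every block is reachable from the entry, so that the edges a
depth-first search finds pointing back into its own stack are exactly the
loop back edges. The name find_loops, the CFG encoding and the block
names are hypothetical.

```python
def find_loops(cfg, entry):
    """Return {loop header: set of blocks in that loop}.

    cfg is a hypothetical encoding: block name -> list of successor names.
    Assumes a reducible CFG with every block reachable from `entry`.
    Uses recursion, so it is only meant for small illustrative graphs.
    """
    # Predecessor lists, needed to walk backwards from a loop's tail.
    preds = {}
    for b, succs in cfg.items():
        preds.setdefault(b, [])
        for s in succs:
            preds.setdefault(s, []).append(b)

    back_edges = []
    state = {}  # block -> "active" while on the DFS stack, then "done"

    def dfs(block):
        state[block] = "active"
        for succ in cfg.get(block, []):
            if succ not in state:
                dfs(succ)
            elif state[succ] == "active":
                # Edge into a block still on the stack: a back edge.
                back_edges.append((block, succ))
        state[block] = "done"

    dfs(entry)

    # The loop for a back edge tail -> header is the header plus every
    # block that can reach the tail without passing through the header.
    loops = {}
    for tail, header in back_edges:
        body = loops.setdefault(header, {header})
        worklist = [tail]
        while worklist:
            b = worklist.pop()
            if b not in body:
                body.add(b)
                worklist.extend(preds[b])
    return loops


# Two nested loops: "outer" contains "inner".
nested_cfg = {
    "entry": ["outer"],
    "outer": ["inner", "exit"],
    "inner": ["inner_body", "outer_latch"],
    "inner_body": ["inner"],
    "outer_latch": ["outer"],
    "exit": [],
}
nested_loops = find_loops(nested_cfg, "entry")
```

A structuring pass could then emit each header as a while loop around
its body, with nesting recovered from which loop bodies contain which
headers. None of this helps with the hard part, which is recovering
functions, parameters and types in the first place.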

-- 
Joshua Cranmer
News submodule owner
DXR coauthor