[LLVMdev] help decompiling x86 ASM to LLVM IR

Tue Mar 12 10:01:21 PDT 2013

On 3/12/13 11:39 AM, Óscar Fuentes wrote:
> James Courtier-Dutton <james.dutton at gmail.com> writes:
>
>> I am looking to decompile x86 ASM to LLVM IR.
>> The original C is this:
>> int test61 ( unsigned value ) {
>>          int ret;
>>          if (value < 1)
>>                  ret = 0x40;
>>          else
>>                  ret = 0x61;
>>          return ret;
>> }
>>
>> It compiles with GCC -O2 to (rather cleverly removing any branches):
>> 0000000000000000 <test61>:
>>     0:   83 ff 01                cmp    $0x1,%edi
>>     3:   19 c0                   sbb    %eax,%eax
>>     5:   83 e0 df                and    $0xffffffdf,%eax
>>     8:   83 c0 61                add    $0x61,%eax
>>     b:   c3                      retq
>>
>> How would I represent the SBB instruction in LLVM IR?
>> Would I have to first convert the ASM to something like:
>>     0000000000000000 <test61>:
>>     0:                   cmp    $0x1,%edi        Block A
>>     1:                   jb     4:               Block A
>>     2:                   mov    $0x61,%eax       Block B
>>     3:                   jmp    5:               Block B
>>     4:                   mov    $0x40,%eax       Block C
>>     5:                   retq                    Block D  (Due to join point)
>>
>> ...before I could convert it to LLVM IR?
>> I.e. rewrite it in such a way as to not need the SBB instruction.
>>
>> The aim is to be able to then recompile it to maybe a different target.
>> The aim is to go from binary -> LLVM IR -> binary for cases where the
>> C source code is not available or has been lost.
>>
>> I.e. a binary is available for 32-bit x86; retarget it to ARM or x86-64.
>> The LLVM IR should be target agnostic, but would permit the
>> retargeting task without having to rebuild an AST and the structure
>> of a C or C++ source program.
>>
>> Any comments?
> This is not possible, except for specific cases.
>
> Consider this code:
>
> long foo(long *p) {
>    ++p;
>    return *p;
> }
>
> The x86 machine code would do something like
>
> add $4, %eax
>
> for `++p', but for x86_64 it would be
>
> add $8, %rax
>
> But you can't know that without looking at the original C code.

This is a bad example.  The stride of `++p' is sizeof(long), which is
fixed by the data model rather than by the instruction set.  A compiler
generating LP64 code for x86_64 would produce the 8-byte add above for
the given C code, while an ILP32 compiler for x86_64 would generate
something more akin to the 32-bit x86 code.  It should be possible to
statically convert such a simple program from one instruction set to
another (provided that they're not funky instruction sets with 11-bit
words).
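
As a rough illustration of what a static translation of the two small
examples in this thread might look like, here is a minimal sketch in
portable C.  Everything in it is an assumption made for illustration:
the names test61_lifted, foo_lifted and load32, the explicit `cf'
variable standing in for the x86 carry flag, and the flat little-endian
byte array standing in for 32-bit guest memory.  In LLVM IR the
`sbb %eax,%eax' step would plausibly become an `icmp ult' followed by a
`sext' (or a `select'), but that mapping is not spelled out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference: the original C function from the thread. */
int test61(unsigned value) {
    return (value < 1) ? 0x40 : 0x61;
}

/* Instruction-by-instruction model of the GCC -O2 output.
 * `cf' is a hypothetical stand-in for the x86 carry flag. */
int test61_lifted(uint32_t edi) {
    uint32_t cf  = (edi < 1u);   /* cmp $0x1,%edi : CF = borrow of edi - 1   */
    uint32_t eax = 0u - cf;      /* sbb %eax,%eax : eax - eax - CF = 0 or ~0 */
    eax &= 0xffffffdfu;          /* and $0xffffffdf,%eax                     */
    eax += 0x61u;                /* add $0x61,%eax (wraps modulo 2^32)       */
    return (int)eax;
}

/* Little-endian 32-bit load from a flat model of guest memory. */
static uint32_t load32(const uint8_t *mem, uint32_t addr) {
    return (uint32_t)mem[addr]
         | ((uint32_t)mem[addr + 1] << 8)
         | ((uint32_t)mem[addr + 2] << 16)
         | ((uint32_t)mem[addr + 3] << 24);
}

/* Model of the ILP32 `foo': the stride stays 4 bytes on any host,
 * because it is written with fixed-width types instead of `long'.
 * The final cast assumes the usual two's-complement conversion. */
int32_t foo_lifted(const uint8_t *mem, uint32_t p) {
    p += 4u;                     /* ++p with a 4-byte guest long */
    return (int32_t)load32(mem, p);
}

/* Self-check: the lifted code agrees with the source on a few inputs. */
void check_test61_equivalence(void) {
    const uint32_t samples[] = {0u, 1u, 2u, 0x7fffffffu, 0xffffffffu};
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(test61_lifted(samples[i]) == test61(samples[i]));
}
```

Because the sketch uses only fixed-width integer types, it keeps the
guest's 4-byte `long' no matter which target it is compiled for, which
is the point about data models made above.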

That said, converting machine code from one machine to another is, I
believe, an undecidable problem for arbitrary code.  Certainly
self-modifying code can be a problem.  There's no type information,
either, so optimizations that rely on it can't be done.  Anything
that uses memory-mapped I/O or I/O ports is going to be a real
challenge, and system calls won't work the same way on a different
architecture.  There are probably other gotchas of which I am not
aware.  In short, it's an exercise fraught with danger, and there will
always be a program that breaks your translator.

Most systems that do binary translation do it dynamically (i.e., they 
grab a set of instructions, translate them to the new instruction set, 
and then cache the translation for reuse as the program runs).  They are 
essentially machine code interpreters enhanced with Just-In-Time 
compilation for speed.
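
The translate-and-cache loop described above can be sketched in a few
lines of C.  This is a toy under stated assumptions, not how any real
translator is built: the guest instruction set (OP_ADD, OP_AND,
OP_HALT), the Translator and CachedBlock types, and the direct-mapped
cache are all hypothetical, and "translation" here only predecodes
guest instructions into host function pointers rather than emitting
machine code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy guest ISA: an opcode plus a 32-bit immediate. */
enum { OP_ADD, OP_AND, OP_HALT };
typedef struct { uint8_t op; uint32_t imm; } GuestInsn;

/* A "translated" instruction: a host function pointer and its operand. */
typedef uint32_t (*HostOp)(uint32_t acc, uint32_t imm);
typedef struct { HostOp fn; uint32_t imm; } HostInsn;

#define MAX_BLOCK   16
#define CACHE_SLOTS 8

typedef struct {
    int      valid;
    size_t   guest_pc;            /* guest address this block starts at */
    size_t   len;
    HostInsn code[MAX_BLOCK];
} CachedBlock;

typedef struct {
    CachedBlock slots[CACHE_SLOTS];
    unsigned    translations;     /* number of cache misses so far */
} Translator;

static uint32_t host_add(uint32_t a, uint32_t i) { return a + i; }
static uint32_t host_and(uint32_t a, uint32_t i) { return a & i; }

/* Return the cached translation for `pc', translating on a miss. */
static CachedBlock *lookup_or_translate(Translator *t,
                                        const GuestInsn *prog, size_t pc) {
    CachedBlock *b = &t->slots[pc % CACHE_SLOTS];
    if (b->valid && b->guest_pc == pc)
        return b;                                 /* hit: reuse as-is */

    b->valid = 1;
    b->guest_pc = pc;
    b->len = 0;
    for (size_t i = pc; prog[i].op != OP_HALT; i++) {
        assert(b->len < MAX_BLOCK);               /* toy limit on block size */
        b->code[b->len].fn  = (prog[i].op == OP_ADD) ? host_add : host_and;
        b->code[b->len].imm = prog[i].imm;
        b->len++;
    }
    t->translations++;
    return b;
}

/* Execute the guest block starting at `pc' on accumulator `acc'. */
uint32_t run_block(Translator *t, const GuestInsn *prog,
                   size_t pc, uint32_t acc) {
    CachedBlock *b = lookup_or_translate(t, prog, pc);
    for (size_t i = 0; i < b->len; i++)
        acc = b->code[i].fn(acc, b->code[i].imm);
    return acc;
}
```

Running the same block twice with a zero-initialised Translator should
translate it only once; the second call is served from the cache, which
is where a dynamic translator gets its speed back.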

-- John T.