[LLVMdev] help decompiling x86 ASM to LLVM IR
John Criswell
criswell at illinois.edu
Tue Mar 12 10:01:21 PDT 2013
On 3/12/13 11:39 AM, Óscar Fuentes wrote:
> James Courtier-Dutton <james.dutton at gmail.com> writes:
>
>> I am looking to decompile x86 ASM to LLVM IR.
>> The original C is this:
>> int test61 ( unsigned value ) {
>> int ret;
>> if (value < 1)
>> ret = 0x40;
>> else
>> ret = 0x61;
>> return ret;
>> }
>>
>> It compiles with GCC -O2 to (rather cleverly removing any branches):
>> 0000000000000000 <test61>:
>> 0: 83 ff 01 cmp $0x1,%edi
>> 3: 19 c0 sbb %eax,%eax
>> 5: 83 e0 df and $0xffffffdf,%eax
>> 8: 83 c0 61 add $0x61,%eax
>> b: c3 retq
>>
>> How would I represent the SBB instruction in LLVM IR?
>> Would I have to first convert the ASM to something like:
>> 0000000000000000 <test61>:
>> 0: cmp $0x1,%edi Block A
>> 1: jb 4: Block A
>> 2: mov $0x61,%eax Block B
>> 3: jmp 5: Block B
>> 4: mov $0x40,%eax Block C
>> 5: retq Block D (Due to join point)
>>
>> ...before I could convert it to LLVM IR ?
>> I.e. Re-write it in such a way as to not need the SBB instruction.
>>
>> The aim is to be able to then recompile it to maybe a different target.
>> The aim is to go from binary -> LLVM IR -> binary for cases where the
>> C source code is not available or lost.
>>
>> I.e. binary available for 32-bit x86. Retarget it to ARM or 64-bit x86.
>> The LLVM IR should be target agnostic, and would permit the
>> retargeting task without having to build an AST and structure as a C or
>> C++ source code program.
>>
>> Any comments?
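One way to see what the SBB sequence computes is to model each
instruction in C. The sketch below is an illustration only; the names
test61_lifted, test61_reference and check_equivalence are made up for
it. The key observation is that "sbb %eax,%eax" computes eax - eax - CF,
which is just -CF, so the result is 0 or 0xffffffff depending on the
carry flag left by the cmp. In LLVM IR terms that shape would presumably
be an "icmp ult" whose i1 result is sign-extended to i32 (or fed to a
"select"), with no flags register needed.

```c
#include <assert.h>
#include <stdint.h>

/* The original C function from the question. */
static int test61_reference(unsigned value) {
    int ret;
    if (value < 1)
        ret = 0x40;
    else
        ret = 0x61;
    return ret;
}

/* Hypothetical instruction-by-instruction model of the GCC -O2 output. */
static uint32_t test61_lifted(uint32_t edi) {
    uint32_t cf = (edi < 1u);  /* cmp $0x1,%edi: CF = borrow out of edi - 1 */
    uint32_t eax = 0u - cf;    /* sbb %eax,%eax: eax - eax - CF = -CF       */
    eax &= 0xffffffdfu;        /* and $0xffffffdf,%eax                      */
    eax += 0x61u;              /* add $0x61,%eax (wraps modulo 2^32)        */
    return eax;
}

/* Spot-check that both forms agree on a few boundary values. */
static void check_equivalence(void) {
    const uint32_t samples[] = { 0u, 1u, 2u, 0x7fffffffu, 0x80000000u, 0xffffffffu };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert((uint32_t)test61_reference(samples[i]) == test61_lifted(samples[i]));
}
```

With CF set, the value goes 0xffffffff, then 0xffffffdf, then wraps to
0x40; with CF clear it goes 0, 0, 0x61. So the branchy rewrite is not a
prerequisite for expressing this in a flag-free form.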
> This is not possible, except for specific cases.
>
> Consider this code:
>
> long foo(long *p) {
> ++p;
> return *p;
> }
>
> The X86 machine code would do something like
>
> add $4, %eax
>
> for `++p', but for x86_64 it would be
>
> add $8, %rax
>
> But you can't know that without looking at the original C code.

This is a bad example. A compiler targeting an LP64 data model would
generate the 8-byte add above on x86_64 for the given C code. An ILP32
compiler for x86_64 would generate something more akin to the 32-bit x86
code given above. It should be possible to statically convert such a
simple program from one instruction set to another (provided that
they're not funky instruction sets with 11-bit words).
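
To make the data-model point concrete, here is a minimal sketch of what
a statically lifted version of the 32-bit foo() might look like. It
assumes the translator keeps the guest's ILP32 layout: guest addresses
are 32-bit offsets into a byte array, and a guest long is 4 bytes,
little-endian. The function name lifted_foo and its parameters are
hypothetical. Because the stride of 4 describes guest memory rather than
the host's sizeof(long), the same code behaves identically whatever
host it is recompiled for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lifted form of the 32-bit code for:
 *     long foo(long *p) { ++p; return *p; }
 * guest_mem models the guest address space; p is a guest address. */
static uint32_t lifted_foo(const uint8_t *guest_mem, size_t mem_size, uint32_t p) {
    /* The load below touches bytes p+4 .. p+7, so require them in range. */
    assert(mem_size >= 8 && p <= mem_size - 8);

    p += 4;  /* add $4, %eax: the guest's sizeof(long), not the host's */

    /* Load a 4-byte little-endian value, as the x86 guest would. */
    return (uint32_t)guest_mem[p]
         | (uint32_t)guest_mem[p + 1] << 8
         | (uint32_t)guest_mem[p + 2] << 16
         | (uint32_t)guest_mem[p + 3] << 24;
}
```

The return value is the raw 32-bit pattern that would end up in %eax;
widening it to a host long is a separate decision the translator has to
make at the boundary with native code.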

That said, converting machine code from one machine to another is, I
believe, an undecidable problem for arbitrary code. Certainly
self-modifying code can be a problem. There's no type information,
either, so optimizations that may rely on it can't be done. Anything
that uses memory-mapped I/O or I/O ports is going to cause a real
challenge, and system calls won't work the same way on a different
architecture. There are probably other gotchas of which I am not
aware. In short, it's an exercise fraught with danger, and there will
always be a program that breaks your translator.

Most systems that do binary translation do it dynamically (i.e., they
grab a set of instructions, translate them to the new instruction set,
and then cache the translation for reuse as the program runs). They are
essentially machine code interpreters enhanced with Just-In-Time
compilation for speed.
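
The translate-then-cache loop described above can be sketched roughly as
follows. Everything here is a toy assumption made for illustration: the
two-byte guest instruction format, the opcode names, the cache layout,
and the use of host function pointers as a stand-in for emitted host
machine code. A real dynamic translator would generate native
instructions instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy guest ISA: each instruction is {opcode, operand}. */
enum { G_ADD = 0, G_SUB = 1, G_HALT = 2 };

typedef void (*host_op)(int32_t *acc, uint8_t operand);
static void host_add(int32_t *acc, uint8_t operand) { *acc += operand; }
static void host_sub(int32_t *acc, uint8_t operand) { *acc -= operand; }

#define MAX_BLOCK   16
#define CACHE_SLOTS 64

typedef struct {
    int      valid;
    uint32_t guest_pc;            /* guest address this block was built from */
    size_t   len;
    host_op  ops[MAX_BLOCK];      /* the "translated" instructions           */
    uint8_t  operands[MAX_BLOCK];
} translated_block;

static translated_block cache[CACHE_SLOTS];
static unsigned translations_done;  /* counts cache misses */

/* Translate guest instructions starting at pc, up to the next G_HALT. */
static translated_block *translate(const uint8_t *code, uint32_t pc) {
    translated_block *tb = &cache[pc % CACHE_SLOTS];
    tb->valid = 1;
    tb->guest_pc = pc;
    tb->len = 0;
    while (code[pc] != G_HALT && tb->len < MAX_BLOCK) {
        assert(code[pc] == G_ADD || code[pc] == G_SUB);
        tb->ops[tb->len] = (code[pc] == G_ADD) ? host_add : host_sub;
        tb->operands[tb->len] = code[pc + 1];
        tb->len++;
        pc += 2;
    }
    assert(code[pc] == G_HALT);  /* toy limitation: block must fit in MAX_BLOCK */
    translations_done++;
    return tb;
}

/* Run the block at pc, translating only if it is not already cached. */
static int32_t run_block(const uint8_t *code, uint32_t pc, int32_t acc) {
    translated_block *tb = &cache[pc % CACHE_SLOTS];
    if (!tb->valid || tb->guest_pc != pc)
        tb = translate(code, pc);
    for (size_t i = 0; i < tb->len; i++)
        tb->ops[i](&acc, tb->operands[i]);
    return acc;
}
```

Note that the cache is keyed only on the guest address, so if the guest
overwrote its own code the cached translation would go stale. That is
one reason self-modifying code is awkward for translators of this kind.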
-- John T.