[LLVMdev] help decompiling x86 ASM to LLVM IR

Tue Mar 12 09:39:54 PDT 2013

James Courtier-Dutton <james.dutton at gmail.com> writes:

> I am looking to decompile x86 ASM to LLVM IR.
> The original C is this:
> int test61 ( unsigned value ) {
>         int ret;
>         if (value < 1)
>                 ret = 0x40;
>         else
>                 ret = 0x61;
>         return ret;
> }
>
> It compiles with GCC -O2 to (rather cleverly removing any branches):
> 0000000000000000 <test61>:
>    0:   83 ff 01                cmp    $0x1,%edi
>    3:   19 c0                   sbb    %eax,%eax
>    5:   83 e0 df                and    $0xffffffdf,%eax
>    8:   83 c0 61                add    $0x61,%eax
>    b:   c3                      retq
>
> How would I represent the SBB instruction in LLVM IR?
> Would I have to first convert the ASM to something like:
>    0000000000000000 <test61>:
>    0:                   cmp    $0x1,%edi        Block A
>    1:                   jb     4:               Block A
>    2:                   mov    $0x61,%eax       Block B
>    3:                   jmp    5:               Block B
>    4:                   mov    $0x40,%eax       Block C
>    5:                   retq                    Block D  (Due to join point)
>
> ...before I could convert it to LLVM IR?
> I.e. rewrite it in such a way as to not need the SBB instruction.
>
> The aim is to be able to then recompile it to maybe a different target.
> The aim is to go from binary -> LLVM IR -> binary for cases where the
> C source code is not available or lost.
>
> I.e. binary available for x86 32 bit.  Re-target it to ARM or x86-64bit.
> The LLVM IR should be target agnostic, but would permit the
> re-targeting task without having to build AST and structure as a C or
> C++ source code program.
>
> Any comments?

This is not possible in general, only in specific cases.

Consider this code:

long foo(long *p) {
  ++p;
  return *p;
}

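The 32-bit x86 machine code would do something like

add $4,%eax

for `++p', because a long is typically 4 bytes there, but for x86_64 it
would be

add $8,%rax

because a long is typically 8 bytes. But you can't know that the constant
is a pointer stride, rather than a plain integer, without looking at the
original C code.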

And that's the simplest case.
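
To make the point concrete, here is a small C sketch that reports how many
bytes `++p' advances a long pointer on whatever target it is compiled for.
The helper name long_stride is made up for illustration. Assuming the usual
System V ABIs, the result is 4 on 32-bit x86 and 8 on x86-64; the C source
is identical in both cases, and only the compiler knows which constant to
emit.

```c
#include <stddef.h>

/* Same function as in the example above. */
long foo(long *p) {
    ++p;        /* advances by sizeof(long) bytes, whatever that is here */
    return *p;
}

/* Hypothetical helper: byte distance between two adjacent longs,
 * i.e. the constant the compiler adds for `++p' on this target. */
size_t long_stride(void) {
    long pair[2] = {0, 0};
    return (size_t)((char *)(pair + 1) - (char *)pair);
}
```

A decompiler sees only the emitted constant. Nothing in the instruction
says it came from sizeof(long) and has to change when the target changes.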

The gist is that the assembly code does not contain enough semantic
information, such as types and their sizes, to be retargeted in general.
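
On the narrower question of SBB itself: a single instruction like this can
be modelled by making the carry flag an explicit value, with no need to
reintroduce branches. The sketch below does that in C for the quoted
four-instruction sequence. The names test61_lifted and check_equivalence
are made up for illustration, and the check is a spot test on a few values,
not a proof. In LLVM IR the same steps could plausibly be written with
icmp ult, sext, and and add. This only shows that the arithmetic is
expressible; it does nothing about the type-size problem described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The original source function, for comparison. */
int test61(unsigned value) {
    return value < 1 ? 0x40 : 0x61;
}

/* Hypothetical instruction-by-instruction model of the GCC output,
 * with the carry flag held in an ordinary variable. */
int test61_lifted(uint32_t edi) {
    uint32_t cf  = edi < 1u;   /* cmp $0x1,%edi : CF = unsigned borrow of edi - 1 */
    uint32_t eax = 0u - cf;    /* sbb %eax,%eax : eax - eax - CF = 0 or 0xffffffff */
    eax &= 0xffffffdfu;        /* and $0xffffffdf,%eax */
    eax += 0x61u;              /* add $0x61,%eax : wraps to 0x40 when CF was set */
    return (int)eax;           /* retq */
}

/* Spot-check that the model agrees with the source on a few inputs. */
void check_equivalence(void) {
    const uint32_t samples[] = {0u, 1u, 2u, 0x7fffffffu, 0x80000000u, 0xffffffffu};
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(test61(samples[i]) == test61_lifted(samples[i]));
}
```

With an input of 0 the model returns 0x40, and with any other input it
returns 0x61, matching the C source.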