[LLVMdev] help decompiling x86 ASM to LLVM IR

James Courtier-Dutton james.dutton at gmail.com
Tue Mar 12 11:17:21 PDT 2013

On 12 March 2013 17:10, Joshua Cranmer 🐧 <Pidgeot18 at gmail.com> wrote:
> On 3/12/2013 11:55 AM, James Courtier-Dutton wrote:
>> 2) From the binary, I would know if it was for 32bit or 64bit.
>> 3) I could then use (1) and (2) to know if "add %rax, 8" is "p = p +
>> 1" (64bit long), or "p = p + 2(32bit long)"
>> So, I think your "It is not possible" is a bit too black and white.
> No, it's AI-hard, as evidenced that porting programs from 32-bit to 64-bit
> at the source-code level is nontrivial for large projects with lots of
> developers. And you only have less information at assembly level.

So, let's take the source-code level case.
You can write a program at source-code level that will compile unchanged
to produce either a 32-bit application or a 64-bit application.
Proof of this is almost any Linux-based distro, which is
available with both 32-bit and 64-bit applications.
So, ask a different question:
Instead of porting a 32-bit program to 64-bit, can you port the 32-bit
program to one that will work equally well whether it is compiled for a
32-bit target or a 64-bit target?
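
To make the width dependence concrete, here is a minimal sketch of the
arithmetic behind the quoted "add %rax, 8" example: portable source only
says "p = p + 1", and the byte offset in the binary depends on the element
size. Going the other way, once the element size is known, the byte offset
can be turned back into an element count. The helper name
elements_advanced is hypothetical, not part of any existing tool.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of elements a pointer advances when byte_offset is added to it,
 * given the size of the pointed-to element on the source target.
 * Returns -1 if the offset is not a whole multiple of the element size,
 * which marks the spot as "unsure" rather than guessing.
 */
long elements_advanced(size_t byte_offset, size_t elem_size)
{
    assert(elem_size > 0);              /* precondition: a real element size */
    if (byte_offset % elem_size != 0)
        return -1;
    return (long)(byte_offset / elem_size);
}
```

With an 8-byte element, elements_advanced(8, 8) gives 1 (p = p + 1); with
a 4-byte element, elements_advanced(8, 4) gives 2 (p = p + 2).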

First steps in this might be looking at every use of "int" and "long"
and replacing them with int32_t or int64_t, i.e. replacing
target-specific types with target-agnostic types.
So, if the binary is 32-bit, int will be 32-bit, so change the source code
to say "int32_t" instead of "int".
If the binary is 32-bit, and on that target long is 32-bit, change
the source code to say "int32_t" instead of "long".
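
The type replacement described above can be sketched as a small lookup.
This is an illustration under stated assumptions, not my decompiler's
actual code: the enum data_model and the function fixed_width_name are
hypothetical names, and only the common ILP32 and LP64 data models are
covered (64-bit Windows, where long stays 32-bit, is left out).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Data model of the binary being decompiled (hypothetical enum). */
enum data_model {
    MODEL_ILP32,   /* int, long and pointers are 32-bit */
    MODEL_LP64     /* int is 32-bit, long and pointers are 64-bit */
};

/*
 * Map a target-specific C type name to its fixed-width spelling for the
 * given data model. Returns NULL for names this sketch does not handle,
 * so the caller can flag them for manual inspection.
 */
const char *fixed_width_name(const char *c_type, enum data_model model)
{
    assert(c_type != NULL);
    if (strcmp(c_type, "int") == 0)
        return "int32_t";                 /* 32-bit in both models */
    if (strcmp(c_type, "unsigned int") == 0)
        return "uint32_t";
    if (strcmp(c_type, "long") == 0)
        return model == MODEL_ILP32 ? "int32_t" : "int64_t";
    if (strcmp(c_type, "unsigned long") == 0)
        return model == MODEL_ILP32 ? "uint32_t" : "uint64_t";
    return NULL;
}
```

For a 32-bit binary, fixed_width_name("long", MODEL_ILP32) yields
"int32_t"; the same "long" in an LP64 binary yields "int64_t".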

I know that there will be special cases that are difficult to handle.
I don't expect 100%. I am looking to write a tool that can do, say, 80%
of the work.
I believe that I could recognise blocks that we know will work, and
highlight the "unsure" sections of the code, for closer inspection.
I am hoping to be able to highlight target-agnostic code and
target-specific code, and to automate the target-agnostic parts.

My current decompiler does statistical analysis in order to identify types.
E.g. this register at this instruction is most likely an int32_t, but
might be a uint32_t, and is definitely not a uint64_t.
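
A very rough sketch of what such evidence counting could look like is
below. It is not the algorithm my decompiler uses: the evidence kinds,
the scoring, and all names (cand, evidence, type_score, observe,
most_likely) are assumptions made up for illustration. In particular,
treating an access width as a hard exclusion is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Candidate types for one register at one instruction. */
enum cand { T_INT32, T_UINT32, T_INT64, T_UINT64, T_COUNT };

/* Hypothetical kinds of evidence collected from nearby instructions. */
enum evidence {
    EV_WIDTH_32,      /* accessed through a 32-bit name such as %eax */
    EV_WIDTH_64,      /* accessed through a 64-bit name such as %rax */
    EV_SIGNED_USE,    /* e.g. a signed compare-and-branch or idiv */
    EV_UNSIGNED_USE   /* e.g. an unsigned compare-and-branch or div */
};

struct type_score {
    int score[T_COUNT];      /* votes in favour of each candidate */
    int excluded[T_COUNT];   /* non-zero means "definitely not" */
};

/* Fold one observation into the running scores. */
void observe(struct type_score *ts, enum evidence ev)
{
    assert(ts != NULL);
    switch (ev) {
    case EV_WIDTH_32:
        ts->excluded[T_INT64] = ts->excluded[T_UINT64] = 1;
        break;
    case EV_WIDTH_64:
        ts->excluded[T_INT32] = ts->excluded[T_UINT32] = 1;
        break;
    case EV_SIGNED_USE:
        ts->score[T_INT32]++;
        ts->score[T_INT64]++;
        break;
    case EV_UNSIGNED_USE:
        ts->score[T_UINT32]++;
        ts->score[T_UINT64]++;
        break;
    }
}

/*
 * Best-scoring candidate that has not been excluded, or T_COUNT if every
 * candidate is excluded. Ties go to the lowest-numbered candidate.
 */
enum cand most_likely(const struct type_score *ts)
{
    enum cand best = T_COUNT;
    assert(ts != NULL);
    for (int i = 0; i < T_COUNT; i++) {
        if (ts->excluded[i])
            continue;
        if (best == T_COUNT || ts->score[i] > ts->score[best])
            best = (enum cand)i;
    }
    return best;
}
```

Starting from a zeroed struct type_score, one EV_WIDTH_32, two
EV_SIGNED_USE and one EV_UNSIGNED_USE observation leave int32_t as the
most likely type, uint32_t still possible, and uint64_t excluded.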

So, it is not black and white. I want it to work, say, 80% of the time,
and at least highlight where the remaining 20% is, so that I can do
manual work on it.
