[LLVMdev] help decompiling x86 ASM to LLVM IR

Tue Mar 12 11:56:34 PDT 2013

On 3/12/2013 1:17 PM, James Courtier-Dutton wrote:
> So, if we take the source-code level case.
> You can write a source-code level program that will compile unchanged
> to produce a 32-bit application or a 64-bit application.
> Proof of this is just looking at almost any Linux based distro
> available in 32-bit or 64-bit applications.
> So, if you then ask a different question:
> Instead of porting a 32-bit program to 64-bit, port the 32-bit program
> to a program that will work equally well if compiled for 32-bit target
> or 64-bit target?
That's still impossible. In C++, it's trivial to write code like this:
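#include <cstddef>
#include <cstdint>

// Primary template: declared only, so unsupported sizes fail to compile.
template<size_t size>
struct AlignedStorage;

// Explicit specialization selected when pointers are 4 bytes wide.
template<>
struct AlignedStorage<4> {
   union {
     uint32_t aligner;  // forces 4-byte size and alignment
     uint8_t element;
   };
};

// Explicit specialization selected when pointers are 8 bytes wide.
template<>
struct AlignedStorage<8> {
   union {
     uint64_t aligner;  // forces 8-byte size and alignment
     uint8_t element;
   };
};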

...
AlignedStorage<sizeof(void*)> storage;

With templates, you end up compiling literally different code depending 
on the size of a pointer. A macro can do the same thing. This isn't academic:
<http://dxr.mozilla.org/search?tree=mozilla-central&q=regexp%3A%2F%23if.*SIZEOF_%2F&redirect=true> 
[1]. (Note: this is in a code base that already uses the intN_t types 
almost everywhere instead of plain int/long/etc.)
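
A minimal sketch of the macro variant, to make the point concrete. The 
names HashWord, kMultiplier and mixPointerBits are hypothetical and not 
taken from mozilla-central; the only standard piece is UINTPTR_MAX from 
<cstdint>. The preprocessor picks one branch before the compiler proper 
runs, so the other branch leaves no trace in the binary:

```cpp
#include <cstdint>

// Hypothetical example; none of these names come from a real code base.
// UINTPTR_MAX is a macro, so the preprocessor can branch on pointer width.
#if UINTPTR_MAX == 0xFFFFFFFFu
typedef std::uint32_t HashWord;
constexpr HashWord kMultiplier = 0x9E3779B9u;
#elif UINTPTR_MAX == 0xFFFFFFFFFFFFFFFFu
typedef std::uint64_t HashWord;
constexpr HashWord kMultiplier = 0x9E3779B97F4A7C15u;
#else
#error "unsupported pointer width"
#endif

// Same source line, but a different multiply constant and shift amount end
// up in the binary depending on which branch above was taken.
constexpr HashWord mixPointerBits(HashWord x) {
  return (x * kMultiplier) >> (sizeof(HashWord) * 4);
}
```

Given only the 32-bit binary, the 64-bit constant and shift simply do not 
exist anywhere to be recovered.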

This is a concern even before we get to optimizations that exploit 
identical representations to deduplicate code on different branches, or 
the fact that inlining sizeof() operations as constants has profound 
second-order effects on the code, such as radically altering structure 
layout. (A quick look at a recent paper indicates that the precision of 
structural typing on binary programs, even when you collapse all types 
of the same size, is 90%. That is an upper bound on your effectiveness.)
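
To illustrate the sizeof() point, here is a small sketch with a 
hypothetical ListNode struct. The sizes in the comments assume the usual 
x86 and x86-64 ABIs; other targets may differ:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical struct, for illustration only.
struct ListNode {
  ListNode* next;      // 4 bytes on typical 32-bit targets, 8 on 64-bit
  std::int32_t value;  // 4 bytes everywhere
};

// Assumed typical results:
//   32-bit: offsetof(value) == 4, sizeof(ListNode) == 8
//   64-bit: offsetof(value) == 8, sizeof(ListNode) == 16 (tail padding)
constexpr std::size_t kValueOffset = offsetof(ListNode, value);

// sizeof() is folded at compile time, so the binary only contains a bare
// multiply by 8 (or 16), with nothing tying that number back to ListNode.
constexpr std::size_t bytesForNodes(std::size_t count) {
  return count * sizeof(ListNode);
}
```

A decompiler looking at the 32-bit binary sees "count * 8" and a load at 
offset 4. It has to guess that both constants derive from the pointer 
size before it can emit code that is also correct for a 64-bit target.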

> First steps in this might be looking at every use of "int" and "long"
> and replace them with int32_t and int64_t. I.e. replace target
> specific types with target agnostic types.
> So, if the binary is 32bit, int will be 32bit, change the source code
> to say "int32_t" instead of "int".
> if the binary is 32bit, and on that target long will be 32bit, change
> the source code to say "int32_t".

In 3 million lines of code, there are:
* >1000 uses of size_t
* 857 uses of ptrdiff_t
* >1000 uses of intptr_t and uintptr_t
* 839 uses of ssize_t

I am assuming that all of these are intended to be explicitly 
pointer-sized integer variables. In addition, there are over 504 
distinct unions, which are only a subset of the places where types are 
used polymorphically. I'm not counting uses of reinterpret_cast or 
static_cast (or C-style type punning), which add up to several 
thousand more possible combinations.
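
A sketch of why these pointer-sized integers cannot just be rewritten as 
int32_t because the binary happened to be 32-bit. The helper names are 
hypothetical; the low-bit tagging trick is only one common use of 
uintptr_t, chosen here as an example:

```cpp
#include <cstdint>

// Hypothetical helpers: stash a flag in the low bit of a pointer-sized
// integer, which is one typical reason code uses uintptr_t at all.
constexpr std::uintptr_t setTag(std::uintptr_t bits) {
  return bits | static_cast<std::uintptr_t>(1);
}
constexpr std::uintptr_t clearTag(std::uintptr_t bits) {
  return bits & ~static_cast<std::uintptr_t>(1);
}

// Models the naive rewrite: does a pointer-sized value survive being
// squeezed through uint32_t, the type a 32-bit binary appears to use?
constexpr bool survivesUint32(std::uintptr_t bits) {
  return static_cast<std::uintptr_t>(static_cast<std::uint32_t>(bits)) == bits;
}
```

On a 32-bit target survivesUint32() is always true, so the binary gives no 
hint that the variable was meant to track the pointer size. On a 64-bit 
target it is false for any value with high bits set, so the rewritten 
program silently truncates addresses.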

> I know that there will be special cases that are difficult to handle.
> I don't expect 100%. I am looking to write a tool that can do say 80%
> of the work.
You are *very* optimistic to assume that you can well-type 80% of the 
program given only the binary code of the program. I think DSA (Data 
Structure Analysis) managed, given LLVM IR mid-optimization, to determine 
80% of the objects accessed by loads/stores to be a type more precise 
than a bag of bytes on SPEC2000, which isn't a particularly hard 
benchmark compared to real-world programs.

> So, it is not black and white. I want it to work say 80% of the time,
> but at least highlight where the remaining 20% is, and do manual work
> on it.

I am assuming a lot about your background knowledge here, but the fact 
that you were not aware of qemu as prior art, together with some of your 
choices of words, leads me to believe that you have not looked very hard 
into prior research on static analysis of either C code or binary code. 
That is not a recipe for success.

[1] Shameless DXR plug: we support regex searches :-)

-- 
Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist