[LLVMdev] Advice - llvm as binary to binary translator ?

Harry Metcalfe H.S.Metcalfe at sussex.ac.uk
Tue Jun 24 01:28:38 PDT 2008


Hi Eric,

I'm currently writing an IA-32 to LLVM IR translator. I'm only midway
through, but I can certainly say that there have been more difficulties
than I anticipated when I began!

I think it is a reasonable approach, perhaps especially in your case,
since you already have an emulator. For IA-32 code, fully automatic
static translation is equivalent to the halting problem, largely
because indirect branch targets and the boundary between code and data
can't always be determined statically -- though perhaps that isn't true
for your architecture (what architecture are you using?). A dynamic
phase is therefore necessary for me -- if it is for you too, you'll have
a leg up.
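
To make the "dynamic phase" concrete, here is a rough sketch of the
translate-on-demand loop I have in mind -- not my actual code, and all
the names are invented: cache translated blocks by guest PC and only
translate a block the first time control actually reaches it, so
indirect branch targets never have to be resolved ahead of time.

    #include <cstdint>
    #include <functional>
    #include <unordered_map>

    using GuestAddr = uint32_t;
    // A translated block, when run, returns the guest PC to continue at.
    using TranslatedBlock = std::function<GuestAddr()>;

    // Stub: a real translator would decode guest code at pc, build an
    // LLVM function for it and JIT it. Here it just "halts".
    TranslatedBlock translateBlock(GuestAddr /*pc*/) {
        return [] { return GuestAddr{0}; };
    }

    void run(GuestAddr entry) {
        std::unordered_map<GuestAddr, TranslatedBlock> cache;
        for (GuestAddr pc = entry; pc != 0; ) {   // 0 used as a "halt" marker
            auto it = cache.find(pc);
            if (it == cache.end())                // first visit: translate lazily
                it = cache.emplace(pc, translateBlock(pc)).first;
            pc = it->second();                    // run the block, follow the edge
        }
    }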

Self-modifying code is both hideous and unusual, and very difficult to
deal with. I'm leaving it to one side.

General thoughts: are you sure that LLVM IR is suitable? You may be
better off with a lower-level representation. At least in my case, LLVM
enforces a level of structure that doesn't exist in machine code. That's
something you'll also probably have to deal with.
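
One way I cope with that mismatch -- sketched below against a recent
LLVM C++ API, with made-up register and block names -- is to model each
guest register as a stack slot and translate instructions as naive
loads and stores, then let the mem2reg pass recover SSA form afterwards
rather than trying to build it while decoding:

    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Support/raw_ostream.h"

    int main() {
        llvm::LLVMContext ctx;
        llvm::Module mod("translated", ctx);
        llvm::IRBuilder<> b(ctx);

        auto *fnTy = llvm::FunctionType::get(llvm::Type::getVoidTy(ctx), false);
        auto *fn = llvm::Function::Create(fnTy, llvm::Function::ExternalLinkage,
                                          "block_0x401000", &mod);
        b.SetInsertPoint(llvm::BasicBlock::Create(ctx, "entry", fn));

        auto *i32 = llvm::Type::getInt32Ty(ctx);
        // One stack slot per guest register: no SSA bookkeeping while decoding.
        auto *eax = b.CreateAlloca(i32, nullptr, "eax");
        auto *ebx = b.CreateAlloca(i32, nullptr, "ebx");

        // "mov eax, 5" ; "add eax, ebx" translated naively as loads/stores.
        b.CreateStore(llvm::ConstantInt::get(i32, 5), eax);
        auto *sum = b.CreateAdd(b.CreateLoad(i32, eax), b.CreateLoad(i32, ebx));
        b.CreateStore(sum, eax);
        b.CreateRetVoid();

        mod.print(llvm::outs(), nullptr);   // dump the IR; mem2reg cleans it up
        return 0;
    }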

Its type system also hampers modification of translated code, so it's
advantageous to ensure that you won't need to change any code once it
has been translated. This is of particular importance when you're
trying to figure out the bounds of an array, and things like that: a
change to the size of an array is a change of its type, so every use
of the array has to be rewritten, which means it's much easier just to
get the size right in the first place. I'm currently in the process of
altering my code so that a lot more analysis takes place before
translation even begins!
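
To see why, note that in LLVM IR the length of an array is part of its
type, so "resizing" really means introducing a new, incompatible type
and rewriting everything that refers to the old one. A trivial
illustration (again against a recent LLVM C++ API):

    #include "llvm/IR/DerivedTypes.h"
    #include "llvm/IR/LLVMContext.h"

    int main() {
        llvm::LLVMContext ctx;
        auto *i32   = llvm::Type::getInt32Ty(ctx);
        auto *small = llvm::ArrayType::get(i32, 4);   // [4 x i32]
        auto *large = llvm::ArrayType::get(i32, 8);   // [8 x i32]
        // Distinct, incompatible types: a global or alloca of [4 x i32]
        // cannot be "grown" in place; it must be replaced and every user
        // of it rewritten to mention the new type.
        return small == large;   // returns 0: the two types are not equal
    }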

Finally, how will you deal with memory accesses and aliasing? This is
certainly the thorniest problem, and it's the one my dynamic phase
exists to solve.
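
The blunt baseline -- sketched below with invented names, not my actual
runtime -- is to map the whole guest address space onto one flat host
array and funnel every translated load and store through it. That is
correct but tells the optimizer nothing: any two guest accesses may
alias, which is exactly what a dynamic analysis phase can try to
refine.

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // One flat array standing in for the entire guest address space.
    // (No bounds checking in this sketch.)
    std::vector<uint8_t> guestMem(16 * 1024 * 1024);

    uint32_t guestLoad32(uint32_t addr) {
        uint32_t v;
        std::memcpy(&v, &guestMem[addr], sizeof v);  // memcpy avoids host alignment traps
        return v;
    }

    void guestStore32(uint32_t addr, uint32_t v) {
        std::memcpy(&guestMem[addr], &v, sizeof v);
    }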

Do email me off-list if you like -- it sounds like we're pursuing
similar lines of inquiry!

Harry



On Sat, 2008-06-21 at 21:53 -0400, Erik Buck wrote:
> First, is there a way to search the archives for this list?  I
> apologize in advance if I have stepped on a FAQ.
> 
> My goal is to execute legacy binary machine code from a very old,
> one-of-a-kind computer on a variety of modern computers.  I already wrote
> an emulator for the legacy machine that executes the old machine  
> code.  However, my emulator is just an interpreter and therefore has  
> some limitations:
> 
> - The emulator spends a lot of time in an executive loop that fetches  
> legacy instructions, decodes them, and jumps to appropriate C  
> functions that emulate each legacy instruction.  The executive loop  
> also has to handle emulated interrupts, support single-step debugging,  
> etc.
> 
> - The emulator is compiled and run on only a few modern hardware/ 
> operating system combinations.  The emulator is fairly portable, but  
> extensive optimizations on some platforms restrict capabilities on  
> other platforms.
> 
> - The emulator executes the legacy machine code unmodified, which is
> good, but that means opportunities for optimization are lost.  The  
> legacy machine code is full of dead code, jumps to jumps, redundant  
> sub-expressions, unnecessary memory accesses, etc.  Back in the old  
> days, compilers really didn't optimize at all.  They generated  
> horrible code that was sometimes hand modified.
> 
> My idea is to convert my emulator into a translator that emits LLVM IR  
> either directly or via calls to the LLVM library.  I would then  
> execute the result via JIT or native code compilation...
> 
> Is this a reasonable approach?
> Can this approach be used even when the legacy code is self-modifying?
> After a code modification, a re-translation and re-JIT would be needed.
> 
> Are there any general suggestions?
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev



