[LLVMdev] converting x86 instructions to LLVM instructions

Tue Sep 29 12:29:24 PDT 2009

Hi Marius,

On Tue, Sep 29, 2009 at 6:05 PM, Marius <wishinet at googlemail.com> wrote:
> * Timo Juhani Lindfors (timo.lindfors at iki.fi) wrote:
>> Hi,
>>
>> Alexandre Gouraud <alexandre.gouraud at enst-bretagne.fr> writes:
>> > if it does not already exist, could it mean it is nonsense, and if so, why?
>>
>> Why don't you compile your program directly to LLVM bitcode?
> - In security-testing you sometimes apply black boxing.
Once you use the structure of the machine code of the system under
test to generate test cases, it is no longer black box testing though :)

> I've had a similar idea lately.
> http://www.crazylazy.info/blog/content/x86-differently-vine-and-llvm-klee
>
> x86 in general for reverse engineering purposes isn't very useful.
> If you could use LLVM-qemu to get an intermediate representation of a
> specific binary and selectively execute functions symbolically, you'd
> have a "fuzzer" that reaches code-paths - in any case. That's a much
> deeper verification. If you read the KLEE research paper and take a look
> at the number of overlooked bugs they were able to identify, this could
> be very effective.
I agree, this is an interesting idea.

> I don't know how to modify llvm-qemu to translate x86 to LLVM IL. This
> is not trivial: qemu is a very limited "emulation". The "target" x86
> won't have MSRs and specific instructions. The abstraction level is
> higher.
Actually quite the opposite is true :) The emulation is very accurate;
otherwise it would not be possible to take an arbitrary operating system
and run it without modification in full system emulation mode. This
also requires an accurate emulation of other things, e.g. the
MMU. After all, the authors of the "Selective Symbolic Execution"
paper have shown that llvm-qemu is suited for this purpose.

Essentially, when llvm-qemu translates a basic block of machine code,
you get a semantically equivalent version of that code in the form of
LLVM IR, which operates on a structure representing the machine state
(a bunch of registers and some additional state). Regardless of how you
translate machine code to LLVM IR, you somehow need to model the machine
state. I highly doubt that LLVM IR generated by llvm-qemu looks much
different from LLVM IR generated by a hand-written frontend that goes
directly from machine code to LLVM IR.
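
To illustrate the idea of translated code operating on a machine-state
structure, here is a minimal sketch in Python. It is not llvm-qemu's
actual state layout or output: the CPUState class, its fields, the
block_0x1000 function and the addresses are all hypothetical, and only
two flags are modelled. The function stands in for what a translated
basic block does, namely read and update fields of the state.

```python
from dataclasses import dataclass, field

MASK32 = 0xFFFFFFFF


@dataclass
class CPUState:
    # Hypothetical, heavily simplified machine state (assumption:
    # a real emulator tracks many more registers and flags).
    regs: dict = field(
        default_factory=lambda: {r: 0 for r in ("eax", "ebx", "ecx", "edx")}
    )
    zf: int = 0   # zero flag
    cf: int = 0   # carry flag
    eip: int = 0  # address of the next block to execute


def block_0x1000(cpu):
    """Hand-written stand-in for a translated block:
           add eax, ebx
           jz  0x2000        ; falls through to 0x1008 otherwise
    Addresses are illustrative only."""
    total = cpu.regs["eax"] + cpu.regs["ebx"]
    cpu.regs["eax"] = total & MASK32          # 32-bit wraparound
    cpu.cf = int(total > MASK32)              # carry out of bit 31
    cpu.zf = int(cpu.regs["eax"] == 0)        # result was zero
    cpu.eip = 0x2000 if cpu.zf else 0x1008    # pick the successor block
    return cpu


cpu = CPUState()
cpu.regs["eax"], cpu.regs["ebx"] = 2, 3
block_0x1000(cpu)
print(cpu.regs["eax"], hex(cpu.eip))
```

Whether the translation comes from an emulator or from a hand-written
frontend, the generated code ends up having this shape: loads from and
stores to some representation of registers and flags.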

> However for unspecific targets it might scale. Marking variables
> as symbolic in LLVM bytecode however...
Well, as your input is machine code you somehow need to specify in
which register you want to put your symbolic value (or at which memory
address). Then you need to map it to LLVM IR, which at least in the
register case is rather straightforward.
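
As a rough sketch of the register case, the Python below places a
symbolic value into one register of a modelled machine state and then
runs a stand-in for a translated block. Everything here is hypothetical
(Sym, Add, add32, make_state, mark_register_symbolic, translated_block);
in a KLEE-based setup the corresponding step would presumably be a call
such as klee_make_symbolic on the register's slot in the state
structure, but that wiring is an assumption and not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sym:
    """A named symbolic 32-bit value (hypothetical helper)."""
    name: str


@dataclass(frozen=True)
class Add:
    """An unevaluated 32-bit addition of two operands."""
    lhs: object
    rhs: object


def add32(a, b):
    """Add concretely if both operands are ints, else build an expression."""
    if isinstance(a, int) and isinstance(b, int):
        return (a + b) & 0xFFFFFFFF
    return Add(a, b)


def make_state():
    # Hypothetical machine state: just four general-purpose registers.
    return {"eax": 0, "ebx": 0, "ecx": 0, "edx": 0}


def mark_register_symbolic(state, reg, name):
    """Replace the concrete contents of `reg` with a fresh symbolic value."""
    if reg not in state:
        raise KeyError(f"unknown register: {reg}")
    state[reg] = Sym(name)


def translated_block(state):
    # Stand-in for the translation of:  add eax, ebx
    state["eax"] = add32(state["eax"], state["ebx"])


state = make_state()
state["ebx"] = 4
mark_register_symbolic(state, "eax", "input")
translated_block(state)
print(state["eax"])  # an expression over the symbolic input, not a number
```

The memory case needs the same kind of mapping, from a guest address to
wherever the emulator keeps that byte, which is why it is less direct
than the register case.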

> In any case it would be interesting to be able to translate x86 to LLVM
> IR. If somebody wants to give that a try let's make a plan ;).

Cheers,

Tilmann