[LLVMdev] MC disassembler for ARM

Fan Dawei fandawei.s at gmail.com
Fri Jun 8 09:40:30 PDT 2012


Yes, I got it. Thanks for the reply!

I'm considering making the transformation Instr -> LLVM IR part of the
instruction definition in the td file, then using tablegen to generate the
code automatically, just as it does for the disassembler. That would bypass
MCInst entirely.

All suggestions are welcome!

Thanks,
Dawei

On Fri, Jun 8, 2012 at 12:18 PM, Jim Grosbach <grosbach at apple.com> wrote:

> That depends on how you define "one ARM instruction." It's not a clear cut
> thing. For example, is "add r1, r2, r3" the same ARM instruction as "add
> r1, r2, #4"? What is a distinct instruction and what's a variant encoding
> of the same instruction is often entirely a matter of convenience.
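
To make the point concrete, here is a minimal Python sketch (not LLVM code;
the function name classify_add is made up for illustration). It inspects the
A32 data-processing layout: both forms of "add" share the opcode field 0100
and differ only in the I bit (bit 25) and in how operand 2 is encoded. It is
not a full decoder, and it reports the ADR alias (Rn = pc) as a plain add.

```python
def classify_add(word):
    """Classify a 32-bit A32 word as one of the ADD forms, or "other".

    Only the data-processing layout is checked; this is a sketch, not a
    complete ARM decoder.
    """
    if (word >> 28) & 0xF == 0xF:        # cond 1111 is the unconditional space
        return "other"
    if (word >> 26) & 0b11 != 0b00:      # bits 27:26 must be 00
        return "other"
    if (word >> 21) & 0b1111 != 0b0100:  # opcode field 0100 is ADD
        return "other"
    if (word >> 25) & 1:                 # I bit set: operand 2 is an immediate
        return "add (immediate)"
    bit4 = (word >> 4) & 1
    bit7 = (word >> 7) & 1
    if bit4 and bit7:                    # multiplies and extra load/stores
        return "other"
    if bit4:                             # shift amount comes from a register
        return "add (register-shifted register)"
    return "add (register)"


print(classify_add(0xE0821003))  # add r1, r2, r3
print(classify_add(0xE2821004))  # add r1, r2, #4
```

Whether these count as one instruction with two operand encodings or as
several distinct instructions is exactly the choice a td file has to make.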
>
> -Jim
>
>
>
> On Jun 8, 2012, at 6:40 AM, Fan Dawei <fandawei.s at gmail.com> wrote:
>
> Hi Jim,
>
> Thanks for the reply. I'm sorry I didn't make myself clear enough.
>
> The MCInst created by MCDisassembler depends on the instructions defined
> in the td files. These instructions do not have a one-to-one mapping to ARM
> instructions. There are usually one or more instructions defined in the td
> files that correspond to one actual ARM instruction.
>
> Thanks,
> David
>
> On Thu, Jun 7, 2012 at 1:27 PM, Jim Grosbach <grosbach at apple.com> wrote:
>
>>
>> On Jun 7, 2012, at 7:53 AM, Fan Dawei <fandawei.s at gmail.com> wrote:
>>
>> Hi Tim,
>>
>> Thanks a lot for your help! I'm very grateful.
>>
>> libc.so is a prelinked library, so I'll build a non-prelinked one and try
>> again.
>>
>> I'm now at the start of a binary translation project. I want to convert
>> ARM binary code [*] to LLVM IR, which is then translated to binary for our
>> MIPS-like architecture. That's why I'm looking for a decoder for ARM
>> binaries.
>>
>> The ARMMCDisassembler is production quality, as Evan told me. That's
>> why I'm so interested in it. However, I realized today that it might not
>> be a good choice. Although the disassembled MCInsts have a clean and simple
>> interface, the opcodes in them are auto-generated from the instruction
>> description files. There are a large number of them, and they do not have
>> a one-to-one correspondence to ARM instructions. I think it is not a good
>> idea for our translator to rely on the implementation of the LLVM ARM
>> back-end. So I have to find another decoder or implement one ourselves.
>>
>>
>> Every MCInst created by the MCDisassembler will have a one-to-one mapping
>> to an actual ARM instruction.
>>
>>
>> Thanks,
>> David
>>
>> [*] In most cases, the targets are the shared libraries in Android APKs
>> developed with the NDK, like libangraybird.so. I think most of them are
>> pre-linked, which is bad for us: because there are no $a, $t and $d
>> symbols, we cannot figure out statically which regions are ARM code and
>> which are Thumb code.
>>
>>
>> On Thu, Jun 7, 2012 at 8:11 PM, Tim Northover <t.p.northover at gmail.com> wrote:
>>
>>> Hi David,
>>>
>>> On Thu, Jun 7, 2012 at 10:17 AM, Fan Dawei <fandawei.s at gmail.com> wrote:
>>> > Could you please tell me more about the $a, $t and $d symbols? How are
>>> > these symbols used to define different regions? Where can I find these
>>> > symbols in an ELF object file?
>>>
>>> At the start of each range of ARM code, an assembler or compiler
>>> should produce a "$a" symbol with that address, and put it (naturally
>>> enough) in the ELF symbol-table. Similarly each stretch of Thumb code
>>> gets a "$t" and each data a "$d".
>>>
>>> For example if I assemble:
>>>
>>>    .arm
>>>    mov r0, r3
>>>    ldr r2, Lit
>>> Lit:
>>>    .word 42
>>>    add r0, r0, r0
>>>    .thumb
>>>    mov r5, r2
>>>
>>> then the symbol table contains these entries:
>>>     4: 00000000     0 NOTYPE  LOCAL  DEFAULT    1 $a
>>>     [...]
>>>     6: 00000008     0 NOTYPE  LOCAL  DEFAULT    1 $d
>>>     7: 0000000c     0 NOTYPE  LOCAL  DEFAULT    1 $a
>>>     8: 00000010     0 NOTYPE  LOCAL  DEFAULT    1 $t
>>>
>>> which shows that an ARM region begins at offset 0x0, a data one at
>>> offset 0x8, we switch back to ARM at 0xc and finally Thumb takes over
>>> at 0x10.
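
The reading of the symbol table above can be sketched in Python. This is a
rough illustration only: the function name regions_from_symtab is made up, it
assumes readelf-style lines with exactly eight fields, it assumes all symbols
belong to one section, and the section size 0x12 is an assumption (the final
Thumb mov is taken to be 2 bytes). Mapping symbols may also carry a ".suffix"
after the letter, so the pattern allows for that.

```python
import re

# Mapping symbols are "$a", "$t" or "$d", optionally followed by ".<anything>".
MAPPING_SYMBOL = re.compile(r"^\$([atd])(\..*)?$")
KIND = {"a": "ARM", "t": "Thumb", "d": "data"}


def regions_from_symtab(lines, section_size):
    """Turn readelf-style symbol lines into (start, end, kind) regions."""
    marks = []
    for line in lines:
        fields = line.split()
        # Expected layout: Num: Value Size Type Bind Vis Ndx Name
        if len(fields) != 8:
            continue
        match = MAPPING_SYMBOL.match(fields[7])
        if match:
            marks.append((int(fields[1], 16), KIND[match.group(1)]))
    marks.sort()
    regions = []
    for i, (start, kind) in enumerate(marks):
        # A region runs until the next mapping symbol or the section end.
        end = marks[i + 1][0] if i + 1 < len(marks) else section_size
        regions.append((start, end, kind))
    return regions


symtab = [
    "4: 00000000     0 NOTYPE  LOCAL  DEFAULT    1 $a",
    "6: 00000008     0 NOTYPE  LOCAL  DEFAULT    1 $d",
    "7: 0000000c     0 NOTYPE  LOCAL  DEFAULT    1 $a",
    "8: 00000010     0 NOTYPE  LOCAL  DEFAULT    1 $t",
]
for start, end, kind in regions_from_symtab(symtab, 0x12):
    print(f"{start:#06x}-{end:#06x} {kind}")
```

A disassembler could use such a region list to pick the ARM or Thumb decoder
for each address range and to skip data altogether.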
>>>
>>> GNU objdump hides the symbols by default when printing the
>>> symbol-table (you can give it the --special-syms option to show them),
>>> but readelf shows them always.
>>>
>>> If you want the really deep details, they're fully documented in the
>>> ARM ELF ABI here (section 4.6.5):
>>>
>>>
>>> http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044d/IHI0044D_aaelf.pdf
>>>
>>> Which is all nice to know, but I'm afraid it probably doesn't offer an
>>> immediate solution to the undefined instructions:
>>> + libc.so isn't a relocatable object file (well, it is dynamically,
>>> but that doesn't count).
>>> + llvm-objdump ignores them anyway at the moment, as far as I can tell.
>>>
>>> Tim.
>>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>>
>>
>
>