[LLVMdev] X86 Disassembler

Sat Aug 22 16:31:28 PDT 2009

On Aug 19, 2009, at 4:39 PM, Sean Callanan wrote:
> thanks for your comments.  I'll respond to them individually.  I've  
> attached a new revision of the patch that addresses them.  Patch  
> built and tested against SVN 79487, with the additional attached fix  
> that fixes an Intel table bug.

Thanks Sean, comments below.  Are you sure you attached the updated  
patch?  I still see a lot of std::endl's etc.

>> 3. In include/llvm/Support/MemoryObject.h, you changed all  
>> "uintptr_t" to "uint64_t". Is there a good reason for this? If not,  
>> please change them back.'
>
> Yes, there is.  If you're disassembling in a memory space that's  
> larger than yours (for example, on a computer where LLVM is built 32- 
> bit but the source of the bytes is a 64-bit address space), then I  
> don't want to be limited by the local process's notion of uintptr_t.

Makes perfect sense, please add a comment in the MemoryObject doxygen  
comment that explains this.  You can split this change out of your  
megapatch and commit it separately.

>> 13. You use "std::ostream" a lot. Would it be appropriate to use  
>> "raw_ostream" instead?
>
> To make sure that the table definition file looks pretty, I use  
> <iomanip>, and none of those functions work with raw_ostreams.  I  
> actually only create three std::ostringstreams – during table  
> generation – so for now I'm going to leave that the way it is.

I actually find iomanip to be pretty ugly for simple things.  llvm/ 
Support/Format.h lets you do things like:

OS << "mynumber: " << format("%4.5f", 1234.412) << '\n';

with raw_ostreams.


Using "printf style" format strings is a lot more pretty than:

+#define BYTE_FLAGS std::hex << std::setw(2) << std::setfill('0') <<  
std::right

:)


Not a critical issue, but please put a space after control flow  
keywords like if/while.
      while(current - address < size && current < limit) {
        if(readByte(current, &buf[(current - address)]))


Please stick to plain ascii letters in source files:
+ * The raison d'être for this header file is to provide those  
definitions that



One meta comment: it is generally preferable to split out patches into  
small pieces.  Please propose the APIs without the X86 implementation  
for the next round, it should be much easier to get that piece in than  
doing the whole patch at once.




Some micro level coding stuff:

+#include <llvm/MC/MCInst.h>
+#include <llvm/Support/MemoryObject.h>
+#include <llvm/Support/raw_ostream.h>

Please use "" instead of <>.  You can also just forward declare all  
these classes instead of #including their headers.

+#include "assert.h"

Please use the c++ version of c headers when possible: <cassert>.


+  MCDisassembler(MemoryObject& region,
+                 raw_ostream& vStream) {

Please use "foo_t &f" instead of "foo_t& f".

+  virtual ~MCDisassembler() {
+  }

Make sure each class with a virtual method has at least one method  
defined out of line:
http://llvm.org/docs/CodingStandards.html#ll_virtual_anch




Some bigger stuff:

include/llvm/MC/MCDisassembler.h seems to have four copies of itself  
in the same file.


In terms of API design:

+  /// Constructor     - Performs initial setup for the disassembler.   
This may
+  ///                   do the work of disassembly, or may  
(especially on
+  ///                   fixed-instruction-length CPUs) set things up  
for lazy
+  ///                   disassembly.
+  ///
+  /// @param region   - The memory object to use as a source for  
machine code.
+  /// @param vStream  - The stream to print warnings and diagnostic  
messages on.
+  MCDisassembler(MemoryObject& region,
+                 raw_ostream& vStream) {
+  }

Forcing diagnostics to get rendered to a stream is somewhat  
unfortunate.  Different clients may want different information.  Would  
it be possible to use error codes, or make each action (like  
getInstruction) fill in an error code struct with possible failure info?

The MCDisassembler API is also somewhat strange to me.  Right now it  
is designed to analyze a memory region when constructed and contain  
them.  It seems that there is a more useful low level API which would  
look something like this:

class MCDisassembler {
public:

   virtual ~MCDisassembler();

   /// DisassembleInstruction - Disassemble the first instruction in  
the specified region, printing the disassembled instruction to the  
specified raw_ostream, and returning the size of the instruction in  
bytes.  On error, this returns zero and fills in ErrorInfo with a  
human readable description of the error.
   virtual unsigned DisassembleInstruction(MemoryObject &region,  
raw_ostream &OS, std::string &ErrorInfo) = 0;

}

Having this sort of stateless API means that higher level clients  
(which are stateful) can be built on top of them without adding  
overhead to clients who don't care about the state.



+++ include/llvm/Support/Indenter.h	(revision 0)
@@ -0,0 +1,72 @@
+//===- Indenter.h - Output indenter -----------------------------*- C+ 
+ -*-===//

I'm not a big fan of manipulators like this (as David Greene can tell  
you ;-).  I just added a new raw_ostream::indent method, so now  
instead of stuff like this:

+void DisassemblerTables::emitContextDecision(
+  std::ostream& o2,
+  Indenter& i2,
...

+  o2 << i2.indent() << "struct ContextDecision " << name << " = {" <<  
std::endl;
+  i2.push();
+  o2 << i2.indent() << "{ /* opcodeDecisions */" << std::endl;
+  i2.push();

You can use:

+void DisassemblerTables::emitContextDecision(
+  raw_ostream &o2,
+  unsigned &Indent2,

+  o2.indent(Indent2++) << "struct ContextDecision " << name << " =  
{\n";
+  o2.indent(Indent2++) << "{ /* opcodeDecisions */\n";

Which reads better, is more efficient, and scales to more than 256  
levels of indentation.


+#include <fcntl.h>      // for fcntl, open
+#include <stdio.h>      // for perror
+#include <sys/types.h>  // for read
+#include <sys/uio.h>    // for read
+#include <unistd.h>     // for read


unistd etc won't work on windows.  Please use the llvm/Support/ 
MemoryBuffer.h API for reading files.  The LineReader is easier to  
implement with MemoryBuffer because you just scan for \n and \r in the  
memory range.



+std::string stringFromMCInst_x86(const MCInst& insn,
+                                 const Target* target,
+                                 std::string& triple) {

Please mark this static so that you can shrink the anon namespace.



+  X86ATTAsmPrinter* asmPrinter =
+    dynamic_cast<X86ATTAsmPrinter*>(functionPass);

Please don't use dynamic_cast: we're trying to eliminate RTTI a C cast  
should be fine.



+int HexDisassembler::disassemble(const Target* target,
+                                 LineReader& reader) {
...

+
+    if(targetName == "x86") {
+      disasm = new X86Disassembler::X86_32Disassembler(obj, logstream);
+      triple = "x86-apple-darwin";
+    }
+    else if(targetName == "x86-64") {
+      disasm = new X86Disassembler::X86_64Disassembler(obj, logstream);
+      triple = "x86_64-apple-darwin";
+    }
+    else {

Instead of doing something like this, the X86 disassembler should  
register itself as part of the "Target" API the same way that  
asmprinters, MCAsmInfo, targets, etc do.  The X86 specific  
disassembler should go in lib/Target/X86/Disassembler.  Take a look at  
how the Target/X86/AsmParser stuff registers itself for a good example.

--- tools/llvm-mc/HexDisassembler.h	(revision 0)
+++ tools/llvm-mc/HexDisassembler.h	(revision 0)

+#include <fstream>
+#include <string>

the fstream #include is not needed.  It might make sense to change the  
disassemble APIs to take a MemoryBuffer, and have the llvm-mc.cpp  
driver just handle all the file reading stuff.


I haven't reviewed the X86 specific changes yet, it would be good to  
get the high level design sorted out first, I'll review it in the next  
patch.  Please commit the MemoryObject.h changes (int64_t + comments)  
whenever you feel like it, and propose the basic API changes without  
the X86 specific part next.

Thanks for working on this Sean, it's great work!

-Chris