[llvm-commits] [llvm] r78378 - in /llvm/trunk: lib/Target/X86/AsmParser/X86AsmParser.cpp test/MC/AsmParser/x86_instructions.s utils/TableGen/AsmMatcherEmitter.cpp

Fri Aug 7 10:34:57 PDT 2009

On Aug 7, 2009, at 1:26 AM, Daniel Dunbar wrote:
> URL: http://llvm.org/viewvc/llvm-project?rev=78378&view=rev
> Log:
> llvm-mc/AsmMatcher: Move to a slightly more sane matching design.
> - Still not very sane, but a least its not 60k lines on X86. :)

Looks like a great step to me! :)

> 2. Separate the matching of operands to an instruction from the  
> construction of
>    the MCInst. In theory this can be done during matching, but since  
> the number
>    of variations is small I think it makes sense to decompose the  
> problems.

We found this to be a pretty big code size win for dagisel stuff.

> 4. The comment at the top of AsmMatcherEmitter.cpp explains the  
> approach I'm
>    moving towards for handling ambiguous instructions. The high-bit  
> is to infer
>    a partial ordering of the operand classes (and force the user to  
> specify one
>    if we can't) and use that to resolve ambiguities.

Ok.

Should test/MC/AsmParser/x86_instructions.s use filecheck?

> +++ llvm/trunk/lib/Target/X86/AsmParser/X86AsmParser.cpp Fri Aug  7  
> 03:26:05 2009
>
> +  bool isToken(const StringRef &Str) const {
> +    return Kind == Token && Str == getToken();
> +  }

A common case in the generated code is isToken("foo").  Maybe it would  
make sense to introduce a new isTokenStr templated method so that  
strlen isn't called?

> +bool X86ATTAsmParser::ParseInstruction(const StringRef &Name,  
> MCInst &Inst) {
> -  SmallVector<X86Operand, 3> Operands;
> +  SmallVector<X86Operand, 4> Operands;

Feel free to go crazy and bump this up to 8 or something. :)

> +++ llvm/trunk/utils/TableGen/AsmMatcherEmitter.cpp Fri Aug  7  
> 03:26:05 2009
> @@ -10,17 +10,87 @@
> // This tablegen backend emits a target specifier matcher for  
> converting parsed
> // assembly operands in the MCInst structures.
> //
> +// The input to the target specific matcher is a list of literal  
> tokens and
> +// operands. The target specific parser should generally eliminate  
> any syntax
> +// which is not relevant for matching; for example, comma tokens  
> should have
> +// already been consumed and eliminated by the parser. Most  
> instructions will
> +// end up with a single literal token (the instruction name) and  
> some number of
> +// operands.
> +//
> +// Some example inputs, for X86:
> +//   'addl' (immediate ...) (register ...)
> +//   'add' (immediate ...) (memory ...)
> +//   'call' '*' %epc

this makes a lot of sense to me.  X86ATTAsmParser::ParseOperand will  
have to change a bit though to handle the '*' as its own X86Operand.

> +//  - It may depend on the subtarget or the assembler context.  
> Instructions
> +//    which are invalid for the current mode, but otherwise  
> unambiguous (e.g.,
> +//    an SSE instruction in a file being assembled for i486) should  
> be accepted
> +//    and rejected by the assembler front end. However, if the  
> proper encoding
> +//    for an instruction is dependent on the assembler context then  
> the matcher
> +//    is responsible for selecting the correct machine instruction  
> for the
> +//    current mode.

Do we really need to support "pretend i386 doesn't have sse  
instructions"?  Do you expect a lack of this to cause a problem in  
practice?

> +// The matching is divided into two distinct phases:
> +//
> +//   1. Classification: Each operand is mapped to the unique set  
> which (a)
> +//      contains it, and (b) is the largest such subset for which a  
> single
> +//      instruction could match all members.
> +//
> +//      For register classes, we can generate these subgroups  
> automatically. For
> +//      arbitrary operands, we expect the user to define the  
> classes and their
> +//      relations to one another (for example, 8-bit signed  
> immediates as a
> +//      subset of 32-bit immediates).
> +//
> +//      By partitioning the operands in this way, we guarantee that  
> for any
> +//      tuple of classes, any single instruction must match either  
> all or none
> +//      of the sets of operands which could classify to that tuple.
> +//
> +//      In addition, the subset relation amongst classes induces a  
> partial order
> +//      on such tuples, which we use to resolve ambiguities.
> +//

> +//      FIXME: What do we do if a crazy case shows up where this is  
> the wrong
> +//      resolution?

Allow a c++ hack to fix it? :)  Another option is to allow explicit  
"priority" setting on an instruction pattern.  We do this with  
"addedcomplexity" for isel matching when the normal heuristic isn't  
what we want.

> +namespace {
> +  cl::opt<std::string>
> +  MatchOneInstr("match-one-instr", cl::desc("Match only the named  
> instruction"),
> +              cl::init(""));
> +}

Please just use 'static'.

> /// FlattenVariants - Flatten an .td file assembly string by  
> selecting the
> /// variant at index \arg N.
> static std::string FlattenVariants(const std::string &AsmString,

Can this be shared with the asmwriter emitter?

> /// TokenizeAsmString - Tokenize a simplified assembly string.
> +static void TokenizeAsmString(const StringRef &AsmString,
>                               SmallVectorImpl<StringRef> &Tokens) {

as you know, this is X86 specific.  When we bring up the next target,  
I wonder if this info could be moved to the .td files, so that the  
X86AsmLexer and the tblgen parser could be driven by the same point of  
truth.

> +static bool IsAssemblerInstruction(const StringRef &Name,
> +                                   const CodeGenInstruction &CGI,
> +                                   const SmallVectorImpl<StringRef>  
> &Tokens) {
> +  // Ignore psuedo ops.
> +  //
> +  // FIXME: This is a hack.
> +  if (const RecordVal *Form = CGI.TheDef->getValue("Form"))
> +    if (Form->getValue()->getAsString() == "Pseudo")
> +      return false;
> +
> +  // Ignore "PHI" node.
> +  //
> +  // FIXME: This is also a hack.
> +  if (Name == "PHI")
> +    return false;

We should probably add a target-independent "isPseudoOp" flag to  
the .td files to handle both of these cases, along with things like  
INLINEASM, EH_LABEL, etc.

> +
> +  // Ignore instructions with no .s string.
> +  //
> +  // FIXME: What are these?
> +  if (CGI.AsmString.empty())
> +    return false;

pseudoops :)

> +  // FIXME: Hack; ignore any instructions with a newline in them.
> +  if (std::find(CGI.AsmString.begin(),
> +                CGI.AsmString.end(), '\n') != CGI.AsmString.end())
> +    return false;

Seems reasonable for now. :)

> +  // Ignore instructions with attributes, these are always fake  
> instructions for
> +  // simplifying codegen.
> +  //
> +  // FIXME: Is this true?
> +  //
> +  // Also, we ignore instructions which reference the operand  
> multiple times;
> +  // this implies a constraint we would not currently honor. These  
> are
> +  // currently always fake instructions for simplifying codegen.
> +  //
> +  // FIXME: Encode this assumption in the .td, so we can error out  
> here.
> +  std::set<std::string> OperandNames;
> +  for (unsigned i = 1, e = Tokens.size(); i < e; ++i) {
> +    if (Tokens[i][0] == '$' &&
> +        std::find(Tokens[i].begin(),
> +                  Tokens[i].end(), ':') != Tokens[i].end()) {
> +      DEBUG({
> +          errs() << "warning: '" << Name << "': "
> +                 << "ignoring instruction; operand with attribute '"
> +                 << Tokens[i] << "', \n";
> +        });

For warnings, how about using:

SrcMgr.PrintMessage(CGI.TheDef->getLoc(), "whatever", "warning");

It would be nice to add a "PrintWarning" helper like we have  
"PrintError".

> +namespace {
>
> +struct OperandListLess {
> +  bool operator()(const
> +                  std::pair<const CodeGenInstruction::OperandInfo*,  
> unsigned> &

probably worth using a typedef for this crazy long name to avoid  
wrapping so much :)

> +void AsmMatcherEmitter::run(raw_ostream &OS) {
> +  CodeGenTarget Target;
> +  const std::vector<CodeGenRegister> &Registers =  
> Target.getRegisters();
> +  Record *AsmParser = Target.getAsmParser();
> +  std::string ClassName = AsmParser- 
> >getValueAsString("AsmParserClassName");
> +
> +  std::string Namespace = Registers[0].TheDef- 
> >getValueAsString("Namespace");
>
> -  // Generate the top level match function.
> +  EmitSourceFileHeader("Assembly Matcher Source Fragment", OS);
> +
> +  // Emit the function to match a register name to number.
>
>   OS << "bool " << Target.getName() << ClassName
> -     << "::MatchInstruction(const StringRef &Name, "
> +     << "::MatchRegisterName(const StringRef &Name, unsigned  
> &RegNo) {\n";
> +
> +  // FIXME: TableGen should have a fast string matcher generator.
> +  for (unsigned i = 0, e = Registers.size(); i != e; ++i) {
> +    const CodeGenRegister &Reg = Registers[i];
> +    if (Reg.TheDef->getValueAsString("AsmName").empty())
> +      continue;
> +
> +    OS << "  if (Name == \""
> +       << Reg.TheDef->getValueAsString("AsmName") << "\")\n"
> +       << "    return RegNo=" << i + 1 << ", false;\n";
> +  }
> +  OS << "  return true;\n";
> +  OS << "}\n\n";

Please split these out into "EmitRegisterNameMatcher()",  
"EmitFooMatcher()" type of functions, just for structure.

Overall, this is great work Daniel!

-Chris