[PATCH] MIR Parsing: Introduce a MI Lexing class.

Wed Jun 17 22:09:10 PDT 2015

REPOSITORY
  rL LLVM

================
Comment at: lib/CodeGen/MIRParser/MILexer.cpp:32-36
@@ +31,7 @@
+mitok::Kind MILexer::error(SMLoc Loc, const Twine &Msg) {
+  // TODO: Get the proper location in the MIR file, not just a location inside
+  // the string.
+  const char *Ptr = Loc.getPointer();
+  assert(Ptr >= BufStart && Ptr <= BufEnd);
+  Error = SMDiagnostic(SM, SMLoc(), SM.getMemoryBuffer(SM.getMainFileID())
+                                        ->getBufferIdentifier(),
----------------
Since we ultimately want to associate this back to the YAML file, SourceMgr doesn't seem like a very good fit here. It would be a lot more convenient for the lexer to be purely StringRef based; then we can add some simple newline and character counting to issue a custom diagnostic back in the YAML (or reuse SourceMgr just for that part, outside the core lexing).
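
To illustrate the "simple newline and character counting", here is a minimal sketch. It uses std::string_view as a stand-in for StringRef so it compiles on its own; `lineAndColumn` and `offsetOf` are hypothetical names, not existing LLVM APIs, and mapping the result back to a position in the enclosing YAML file would still need the offset of the string within that file.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>

// Hypothetical helper: 1-based (line, column) of the byte at Offset in Source.
std::pair<unsigned, unsigned> lineAndColumn(std::string_view Source,
                                            std::size_t Offset) {
  assert(Offset <= Source.size() && "offset out of range");
  unsigned Line = 1, Col = 1;
  for (std::size_t I = 0; I < Offset; ++I) {
    if (Source[I] == '\n') {
      ++Line;
      Col = 1;
    } else {
      ++Col;
    }
  }
  return {Line, Col};
}

// Hypothetical helper: offset of a token's range within the source it was
// lexed from. Both views must point into the same underlying buffer.
std::size_t offsetOf(std::string_view Source, std::string_view TokRange) {
  return static_cast<std::size_t>(TokRange.data() - Source.data());
}
```

With this, the lexer itself never needs to know about files; the caller turns `Tok.Range` into a line and column only when it actually has to report an error.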

In the past I have found this core lexing interface to be useful (supplemented by a sugar "Lexer" class in the header):
```
struct Token {
  ... Kind (which can be an error), etc. ...
  StringRef Range;
};
StringRef lex(StringRef Range, Token &OutTok);
// (or lexImpl or whatever).
```

The return value is a new range, a suffix of the old range, containing the remaining yet-to-be-lexed characters.
Some useful invariants to maintain are:
```
Token Tok, OtherTok;
StringRef R = ...., NewR;
NewR = lex(R, Tok);
assert(lex(Tok.Range, OtherTok).size() == 0); // Entire range of a valid token is consumed.
assert(Tok == OtherTok); // The exact token is recovered by re-lexing.
```
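
For concreteness, a small self-contained sketch of such a `lex` function on which both invariants hold. The token kinds and the tiny grammar (identifiers, integers, commas) are made up for illustration and are not the MIR token set; std::string_view stands in for StringRef.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// Illustrative token kinds only; not the kinds used by the patch.
enum class TokenKind { Eof, Error, Identifier, Integer, Comma };

struct Token {
  TokenKind Kind = TokenKind::Eof;
  std::string_view Range; // Exact characters of the token, no whitespace.
};

bool operator==(const Token &A, const Token &B) {
  return A.Kind == B.Kind && A.Range == B.Range;
}

static bool isDigit(char C) {
  return std::isdigit(static_cast<unsigned char>(C)) != 0;
}
static bool isIdentStart(char C) {
  return std::isalpha(static_cast<unsigned char>(C)) != 0 || C == '_';
}
static bool isIdentChar(char C) { return isIdentStart(C) || isDigit(C); }

// Lexes one token from the front of R and returns the unlexed suffix of R.
std::string_view lex(std::string_view R, Token &OutTok) {
  while (!R.empty() && std::isspace(static_cast<unsigned char>(R.front())))
    R.remove_prefix(1);
  if (R.empty()) {
    OutTok = Token{TokenKind::Eof, R};
    return R;
  }
  TokenKind Kind = TokenKind::Error; // Unknown characters lex as 1-char errors.
  std::size_t Len = 1;
  char C = R.front();
  if (C == ',') {
    Kind = TokenKind::Comma;
  } else if (isDigit(C)) {
    Kind = TokenKind::Integer;
    while (Len < R.size() && isDigit(R[Len]))
      ++Len;
  } else if (isIdentStart(C)) {
    Kind = TokenKind::Identifier;
    while (Len < R.size() && isIdentChar(R[Len]))
      ++Len;
  }
  OutTok = Token{Kind, R.substr(0, Len)};
  return R.substr(Len);
}
```

Because `Range` excludes the skipped whitespace, re-lexing `Tok.Range` consumes it entirely and yields the same kind and range, which is what the two asserts above check.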

A convenient Lexer class can then be trivially built around this in the header (alongside token kind definitions and such). Everything else is in the .cpp file and nicely decoupled.
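
A rough sketch of what that header-side wrapper could look like. The `lex` below is only a stand-in that splits on whitespace so the example compiles by itself; the method names `token`, `advance`, and `atEOF` are assumptions, not an existing interface.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

enum class TokenKind { Eof, Word }; // Illustrative kinds only.

struct Token {
  TokenKind Kind = TokenKind::Eof;
  std::string_view Range;
};

// Stand-in for the real core lexer: whitespace-separated words.
inline std::string_view lex(std::string_view R, Token &OutTok) {
  std::size_t I = 0;
  while (I < R.size() && std::isspace(static_cast<unsigned char>(R[I])))
    ++I;
  std::size_t J = I;
  while (J < R.size() && !std::isspace(static_cast<unsigned char>(R[J])))
    ++J;
  OutTok = Token{J == I ? TokenKind::Eof : TokenKind::Word, R.substr(I, J - I)};
  return R.substr(J);
}

// The sugar class: holds the remaining range and the current token,
// and forwards all real work to lex().
class Lexer {
  std::string_view Remaining;
  Token Cur;

public:
  explicit Lexer(std::string_view Source) : Remaining(Source) { advance(); }
  const Token &token() const { return Cur; }
  void advance() { Remaining = lex(Remaining, Cur); }
  bool atEOF() const { return Cur.Kind == TokenKind::Eof; }
};
```

The class carries no lexing logic of its own, so swapping in the real core function would not change it.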

When I've used this in the past, the first thing to do in lexImpl is to stuff the incoming StringRef into a trivial "Cursor" class that has a .peek() method which checks for EOF and returns '\0', otherwise the char at the cursor. This eliminates *a lot* of repeated "!isEOF() && is...." checks (your patch already has two of them: "while (!isEOF() && isspace(*CurPtr))" and "while (!isEOF() && isIdentifierChar(*CurPtr))"); these can then be written e.g. `while (isspace(C.peek()))`. The main body of lexImpl then becomes something like:

```
StringRef lexImpl(StringRef R, Token &OutTok) {
  Cursor C(R);
  skipWhitespace(C);
  if (C.isEOF())
    return StringRef();
  if (Cursor RC = maybeLexDelimiter(C, OutTok))
    return RC.remaining();
  if (Cursor RC = maybeLexIdentifier(C, OutTok))
    return RC.remaining();
  if (Cursor RC = maybeLexNumber(C, OutTok))
    return RC.remaining();
  ....
}
```
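
A minimal sketch of the Cursor class and one of the maybeLex helpers, assuming a default-constructed Cursor means "no match" and converts to false. Everything here apart from `peek`, `isEOF`, `remaining`, `skipWhitespace`, and `maybeLexIdentifier` is a hypothetical name, the helper writes out plain text instead of a Token to keep the example short, and std::string_view stands in for StringRef.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

class Cursor {
  const char *Ptr = nullptr;
  const char *End = nullptr;
  bool Valid = false;

public:
  Cursor() = default; // Invalid cursor, used to signal "did not match".
  explicit Cursor(std::string_view R)
      : Ptr(R.data()), End(R.data() + R.size()), Valid(true) {}

  explicit operator bool() const { return Valid; }
  bool isEOF() const { return Ptr == End; }
  // Returns '\0' at EOF, so character-class checks fail without a separate
  // isEOF() test. An embedded NUL looks the same; use isEOF() to tell apart.
  char peek() const { return isEOF() ? '\0' : *Ptr; }
  void advance() {
    if (!isEOF())
      ++Ptr;
  }
  std::string_view remaining() const {
    return std::string_view(Ptr, static_cast<std::size_t>(End - Ptr));
  }
  // Characters between this cursor and a later cursor C on the same buffer.
  std::string_view upto(const Cursor &C) const {
    return std::string_view(Ptr, static_cast<std::size_t>(C.Ptr - Ptr));
  }
};

static bool isIdentStart(char C) {
  return std::isalpha(static_cast<unsigned char>(C)) != 0 || C == '_';
}
static bool isIdentChar(char C) {
  return std::isalnum(static_cast<unsigned char>(C)) != 0 || C == '_';
}

void skipWhitespace(Cursor &C) {
  while (std::isspace(static_cast<unsigned char>(C.peek())))
    C.advance();
}

// Returns a cursor past the identifier on success, an invalid cursor otherwise.
Cursor maybeLexIdentifier(Cursor C, std::string_view &OutText) {
  if (!isIdentStart(C.peek()))
    return Cursor();
  Cursor Start = C;
  while (isIdentChar(C.peek()))
    C.advance();
  OutText = Start.upto(C);
  return C;
}
```

Taking the cursor by value means a failed helper leaves the caller's cursor untouched, which is what lets the `if (Cursor RC = ...)` chain fall through to the next alternative.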

http://reviews.llvm.org/D10521