[PATCH] MIR Parsing: Introduce a MI Lexing class.

Alex Lorenz arphaman at gmail.com
Thu Jun 18 09:59:53 PDT 2015


Comment at: lib/CodeGen/MIRParser/MILexer.cpp:32-36
@@ +31,7 @@
+mitok::Kind MILexer::error(SMLoc Loc, const Twine &Msg) {
+  // TODO: Get the proper location in the MIR file, not just a location inside
+  // the string.
+  const char *Ptr = Loc.getPointer();
+  assert(Ptr >= BufStart && Ptr <= BufEnd);
+  Error = SMDiagnostic(SM, SMLoc(), SM.getMemoryBuffer(SM.getMainFileID())
+                                        ->getBufferIdentifier(),
silvas wrote:
> Since we ultimately want to associate this back to the YAML file, using SourceMgr here doesn't seem like a very good fit. It would be a lot more convenient to have the lexer be purely StringRef based; then we can add some simple newline and character counting to issue a custom diagnostic back in the YAML (or reuse SourceMgr just for that part, outside the core lexing).
> In the past I have found this core lexing interface to be useful (supplemented by a sugar "Lexer" class in the header):
> ```
> struct Token {
>   ... Kind (which can be an error), etc. ...
>   StringRef Range;
> };
> StringRef lex(StringRef Range, Token &OutTok);
> // (or lexImpl or whatever).
> ```
> The return value is a new range, a suffix of the old range, containing the remaining yet-to-be-lexed characters.
> Some useful invariants to maintain are:
> ```
> Token Tok, OtherTok;
> StringRef R = ...., NewR;
> NewR = lex(R, Tok);
> assert(lex(Tok.Range, OtherTok).size() == 0); // Entire range of a valid token is consumed.
> assert(Tok == OtherTok); // The exact token is recovered by re-lexing.
> ```
> A convenient Lexer class can then be trivially built around this in the header (alongside token kind definitions and such). Everything else is in the .cpp file and nicely decoupled.
> When I've used this in the past, the first thing to do in lexImpl is to stuff the incoming StringRef into a trivial "Cursor" class that has a .peek() method which checks for EOF and returns '\0', otherwise the char at the cursor. This eliminates *a lot* of repeated "!isEOF() && is...." checks (your patch already has two of them: "while (!isEOF() && isspace(*CurPtr))" and "while (!isEOF() && isIdentifierChar(*CurPtr))"); these can then be written e.g. `while (isspace(C.peek()))`. The main body of lexImpl then becomes something like:
> ```
> StringRef lexImpl(StringRef R, Token &OutTok) {
>   Cursor C(R);
>   skipWhitespace(C);
>   if (C.isEOF())
>     return StringRef();
>   if (Cursor RC = maybeLexDelimiter(C, OutTok))
>     return RC.remaining();
>   if (Cursor RC = maybeLexIdentifier(C, OutTok))
>     return RC.remaining();
>   if (Cursor RC = maybeLexNumber(C, OutTok))
>     return RC.remaining();
>   ....
> }
> ```
This would work for me. I'll put up an updated patch that implements a lexer using this kind of approach later today.
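
For reference, the interface described above could look roughly like the following. This is only a minimal standalone sketch, not the updated patch: it uses `std::string_view` in place of `StringRef` so that it compiles without LLVM, and the token kinds, the `Eof` token returned at end of input, and the `checkInvariants` helper are all assumptions made up for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <string_view>

// Hypothetical token kinds, for illustration only.
enum class Kind { Error, Eof, Identifier, Integer, Comma };

struct Token {
  Kind K = Kind::Error;
  std::string_view Range; // Slice of the input covered by this token.
  bool operator==(const Token &O) const {
    return K == O.K && Range == O.Range;
  }
};

// Wraps the not-yet-lexed input. peek() returns '\0' at end of input, so
// callers need no separate EOF check before each character test.
class Cursor {
  std::string_view Rest;

public:
  explicit Cursor(std::string_view S) : Rest(S) {}
  bool isEOF() const { return Rest.empty(); }
  char peek() const { return isEOF() ? '\0' : Rest.front(); }
  void advance() {
    if (!isEOF())
      Rest.remove_prefix(1);
  }
  std::string_view remaining() const { return Rest; }
};

static bool isSpace(char C) {
  return std::isspace(static_cast<unsigned char>(C)) != 0;
}
static bool isDigit(char C) {
  return std::isdigit(static_cast<unsigned char>(C)) != 0;
}
static bool isIdentStart(char C) {
  return std::isalpha(static_cast<unsigned char>(C)) != 0 || C == '_';
}
static bool isIdentChar(char C) { return isIdentStart(C) || isDigit(C); }

// Lexes one token from R into OutTok and returns the suffix of R that is
// still to be lexed.
std::string_view lex(std::string_view R, Token &OutTok) {
  Cursor C(R);
  while (isSpace(C.peek()))
    C.advance();

  std::string_view Start = C.remaining();
  if (C.isEOF()) {
    OutTok.K = Kind::Eof;
  } else if (isIdentStart(C.peek())) {
    OutTok.K = Kind::Identifier;
    while (isIdentChar(C.peek()))
      C.advance();
  } else if (isDigit(C.peek())) {
    OutTok.K = Kind::Integer;
    while (isDigit(C.peek()))
      C.advance();
  } else if (C.peek() == ',') {
    OutTok.K = Kind::Comma;
    C.advance();
  } else {
    // Unknown character: consume it so the caller always makes progress.
    OutTok.K = Kind::Error;
    C.advance();
  }
  OutTok.Range = Start.substr(0, Start.size() - C.remaining().size());
  return C.remaining();
}

// Lexes all of R and asserts, for every token, the two invariants quoted
// above. Returns the number of tokens seen before end of input.
int checkInvariants(std::string_view R) {
  int Count = 0;
  for (;;) {
    Token Tok, Again;
    R = lex(R, Tok);
    if (Tok.K == Kind::Eof)
      return Count;
    assert(lex(Tok.Range, Again).empty()); // Whole token range is consumed.
    assert(Again == Tok);                  // Re-lexing recovers the token.
    ++Count;
  }
}
```

Since every token carries its own range into the original string, mapping an error back to a line and column in the YAML file can stay outside the core lexing, as suggested above.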
