[PATCH] D108308: [WIP] Cleanup identifier parsing.

Wed Aug 25 12:18:00 PDT 2021

aaron.ballman added a reviewer: rsmith.
aaron.ballman added a subscriber: rsmith.
aaron.ballman added a comment.

In general, I'm in favor of these changes. They help identify (pun *totally* intended) where we're improperly expecting ASCII identifiers in places, which can hopefully be addressed in follow-up work. @rsmith, do you have any concerns with this direction?

Can you remove the [WIP] from the title so it's clear that this is no longer in progress? Also, I'd recommend slapping an NFC in the title somewhere to make it clear there's no functional changes intended.

================
Comment at: clang/include/clang/Lex/Lexer.h:702
   // Helper functions to lex the remainder of a token of the specific type.
-  bool LexIdentifier         (Token &Result, const char *CurPtr);
+  bool LexIdentifierContinue(Token &Result, const char *CurPtr);
   bool LexNumericConstant    (Token &Result, const char *CurPtr);
----------------
Should this be `LexUnicodeIdentifierContinue()`? If so, perhaps it can also be moved up to line 578 so it's near the "start" function?

Or does this function handle both Unicode and ASCII identifiers? If so, the comments could probably be updated.

================
Comment at: clang/lib/Lex/Lexer.cpp:1758
+bool Lexer::LexIdentifierContinue(Token &Result, const char *CurPtr) {
+  // Match [_A-Za-z0-9]*, we have already matched [_A-Za-z$]
+  unsigned Size;
----------------
Is the comment here still accurate? Might be worth rewriting in prose rather than regex?

================
Comment at: clang/lib/Lex/Lexer.cpp:1762
+    unsigned char C = *CurPtr;
+    // Fast path
+    if (isAsciiIdentifierContinue(C)) {
----------------

================
Comment at: clang/lib/Lex/Lexer.cpp:1767
+    }
+    // Slow path: handle trigraph, unicode codepoints, UCNs
+    C = getCharAndSize(CurPtr, Size);
----------------

================
Comment at: clang/lib/Lex/Lexer.cpp:1783-1788
+    if (C == '\\' && tryConsumeIdentifierUCN(CurPtr, Size, Result)) {
       continue;
-    } else if (!isASCII(C) && tryConsumeIdentifierUTF8Char(CurPtr)) {
-      C = getCharAndSize(CurPtr, Size);
+    }
+    if (!isASCII(C) && tryConsumeIdentifierUTF8Char(CurPtr)) {
       continue;
     }
----------------

================
Comment at: clang/lib/Lex/Lexer.cpp:1789
     }
+    // Neither an expected unicode codepoint nor a UCN
+    break;
----------------

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D108308/new/

https://reviews.llvm.org/D108308