[cfe-dev] [Request for approval] Identifier text in raw lexing

Tue Dec 14 23:52:28 PST 2010

On Dec 7, 2010, at 1:32 AM, Abramo Bagnara wrote:
>> We are looking for a performance win (and secondarily for a simmetric
>> interface wrt literals) for our raw lexer clients (currently they need
>> to use something like the above to get the text for each identifier):
>> 
>>  token_begin = sm.getSpellingLoc(token_begin);
>> 
>>  const char* current_char = sm.getCharacterData(token_begin);
>>  const char* end = current_char + length;
>> 
>>  while (current_char < end) {
>>    char c = clang::Lexer::getCharAndSizeNoWarn(current_char, length, lo);
>>    output += c;
>>    current_char += length;
>>  }
> 
> I've attached the working patch for approval.

Hi Abramo,

This approach looks ok in principle, but I have some nit-pics about your patch :)

+++ lib/Parse/ParseExprCXX.cpp	(working copy)
@@ -130,7 +130,7 @@
       SourceLocation TemplateKWLoc = ConsumeToken();
       
       UnqualifiedId TemplateName;
-      if (Tok.is(tok::identifier)) {
+      if (Tok.isIdentifier()) {
         // Consume the identifier.

You seem to have search and replaced all uses of tok::identifier with isIdentifier.  However, raw_identifiers can't get into the parser and can't get into most places in the preprocessor.  Please keep these as checks of just tok::identifier.  Only code that can actually get both should use "isIdentifier".  Also, for the same reason, please rename isIdentifier to isAnyIdentifier.


+++ lib/Lex/Preprocessor.cpp	(working copy)
@@ -369,11 +369,11 @@
 // Lexer Event Handling.
 //===----------------------------------------------------------------------===//
 
-/// LookUpIdentifierInfo - Given a tok::identifier token, look up the
+/// LookUpIdentifierInfo - Given a (raw) identifier token, look up the
 /// identifier information for the token and install it into the token.
 IdentifierInfo *Preprocessor::LookUpIdentifierInfo(Token &Identifier,
                                                    const char *BufPtr) const {
-  assert(Identifier.is(tok::identifier) && "Not an identifier!");
+  assert(Identifier.isIdentifier() && "Not an identifier!");
   assert(Identifier.getIdentifierInfo() == 0 && "Identinfo already exists!");
 

This gets passed in a raw_identifier if I understand things correctly.  Please change the assert to verify that.  Given your changes, there is no need to pass in BufPtr either.  Please eliminate this argument, getting the pointer from the token instead.

+++ lib/Lex/Lexer.cpp	(working copy)
@@ -473,7 +473,7 @@
       // we don't have an identifier table available. Instead, just look at
       // the raw identifier to recognize and categorize preprocessor directives.
       TheLexer.LexFromRawLexer(TheTok);
-      if (TheTok.getKind() == tok::identifier && !TheTok.needsCleaning()) {
+      if (TheTok.isIdentifier() && !TheTok.needsCleaning()) {
         const char *IdStart = Buffer->getBufferStart() 
                             + TheTok.getLocation().getRawEncoding() - 1;
         llvm::StringRef Keyword(IdStart, TheTok.getLength());

Likewise, this seems like it should only check for a raw_identifier.  The code can be dramatically simplified by using the "const char*" in the token.  Please do.


Similarly, the code in Preprocessor::SkipExcludedConditionalBlock shouldn't have to use:

    const char *RawCharData = SourceMgr.getCharacterData(Tok.getLocation(),
                                                         &Invalid);

it can now use the const char* in the token.  While I was initially skeptical of the approach, I think it will lead to a nice cleanup of existing code (and is the right thing to do), but it is important not to scatter "isIdentifiers" everywhere, and important to make use of the new capabilities!

Thanks for working on this,

-Chris