[cfe-dev] Confusing comment on LexTokenInternal

Mon Jul 6 14:10:47 PDT 2009

I'm looking at how to add additional literals once Unicode types are supported, which naturally takes me to the LexTokenInternal method of the Lexer class.  The comment describing the function talks about returning true/false, yet the function is void, so it cannot return anything.  I suspect the API has evolved since the comment was written, but it makes me wonder about the validity of other parts of this comment, such as the assumed null-termination of the buffer.

As this code is clearly marked as performance sensitive, I am being quite careful before proposing changes for multiple string/character literal types.  The simple drop-through for 'L' will no longer work as we have 10(!) possible literal prefixes in C++0x:

  <empty>
  L
  R
  U
  u
  u8
  LR
  UR
  uR
  u8R

Also, the R variants only apply to string literals, not character literals.

We must preserve enough info to continue parsing as an identifier if we do not find the ' or " character.  In the odd case where we find a ' directly following an R, I believe we parse the chars up to and including the R as an identifier (maybe a macro), and then start a fresh narrow-character literal token at the ', with no intervening whitespace.

Yuck :(
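
To make the prefix handling and the identifier fallback above concrete, here is a minimal sketch of the decision I have in mind.  It is not Clang's code: TokKind, Tok and lexStart are hypothetical names, the literal body is not scanned, and it assumes the buffer really is null-terminated so that peeking one char past a matched prefix char is safe.  It also assumes every non-raw prefix may precede either quote, which should be checked against the draft (u8 in particular).

```cpp
#include <cctype>
#include <cstddef>

// Hypothetical token kinds for this sketch; not Clang's tok:: enumeration.
enum class TokKind {
  Identifier, CharLiteral, StringLiteral, RawStringLiteral, Unknown
};

struct Tok {
  TokKind kind;
  // Identifier: length of the whole identifier.
  // Literal: length of the prefix plus the opening quote (body not scanned).
  std::size_t length;
};

static bool isIdentStart(char c) {
  return std::isalpha(static_cast<unsigned char>(c)) || c == '_';
}

static bool isIdentBody(char c) {
  return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Assumes p points into a null-terminated buffer.
inline Tok lexStart(const char *p) {
  std::size_t n = 0;

  // Optional encoding prefix: L, U, u or u8.
  if (p[n] == 'L' || p[n] == 'U') {
    ++n;
  } else if (p[n] == 'u') {
    ++n;
    if (p[n] == '8')
      ++n;
  }

  // Optional raw marker, accepted only when a double quote follows.
  if (p[n] == 'R' && p[n + 1] == '"')
    return {TokKind::RawStringLiteral, n + 2};
  if (p[n] == '"')
    return {TokKind::StringLiteral, n + 1};
  if (p[n] == '\'')
    return {TokKind::CharLiteral, n + 1};

  // No quote found.  Nothing has been consumed, so rescan from p as an
  // identifier; any prefix chars we peeked at become part of it.
  if (!isIdentStart(p[0]))
    return {TokKind::Unknown, 1};
  std::size_t len = 1;
  while (isIdentBody(p[len]))
    ++len;
  return {TokKind::Identifier, len};
}
```

With this, lexStart on LR'a' yields an identifier of length 2 (LR), and calling it again at the ' starts a fresh narrow-character literal.  The cost on the hot path is a handful of extra compares for identifiers beginning with L, U, u or R, which is the part I would want to measure before proposing anything.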

I also notice that characters beyond the 7-bit ASCII range are deemed 'unknown' rather than potential identifiers.  Ideally we should be checking for a valid UTF-8 sequence and, in that case, allowing it through as an identifier (consuming the full encoded code point).  This seems a prerequisite to adding UCN support for identifiers, as UCNs will (presumably) map to the equivalent UTF-8 sequence.  However, I am not sure how far this investigation should go into the GCC ABI and how it describes the mangling of identifiers with extended characters.
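
As a rough illustration of the UTF-8 check, the sketch below returns the length of a well-formed multi-byte sequence so the lexer could consume it in one step.  utf8SequenceLength is a hypothetical helper, not an existing Clang function.  It only tests well-formedness (rejecting overlong forms, surrogates and values beyond U+10FFFF); whether the decoded code point is actually permitted in an identifier is a separate question it does not answer.  It again relies on the buffer being null-terminated.

```cpp
#include <cstddef>

// Returns the byte length (2 to 4) of a well-formed multi-byte UTF-8
// sequence starting at s, or 0 if s starts with ASCII or an ill-formed
// sequence.  Assumes s points into a null-terminated buffer: a NUL is never
// a continuation byte, so the checks stop before reading past the end.
inline std::size_t utf8SequenceLength(const char *s) {
  const unsigned char *p = reinterpret_cast<const unsigned char *>(s);
  auto isCont = [](unsigned char c) { return (c & 0xC0) == 0x80; };

  unsigned char b0 = p[0];
  if (b0 < 0x80)
    return 0; // 7-bit ASCII is handled by the existing fast paths

  if (b0 >= 0xC2 && b0 <= 0xDF) // two bytes; 0xC0 and 0xC1 would be overlong
    return isCont(p[1]) ? 2 : 0;

  if (b0 >= 0xE0 && b0 <= 0xEF) { // three bytes
    if (!isCont(p[1]) || !isCont(p[2]))
      return 0;
    if (b0 == 0xE0 && p[1] < 0xA0)
      return 0; // overlong encoding
    if (b0 == 0xED && p[1] > 0x9F)
      return 0; // UTF-16 surrogate range
    return 3;
  }

  if (b0 >= 0xF0 && b0 <= 0xF4) { // four bytes
    if (!isCont(p[1]) || !isCont(p[2]) || !isCont(p[3]))
      return 0;
    if (b0 == 0xF0 && p[1] < 0x90)
      return 0; // overlong encoding
    if (b0 == 0xF4 && p[1] > 0x8F)
      return 0; // beyond U+10FFFF
    return 4;
  }

  return 0; // stray continuation byte or invalid lead byte
}
```

A result of 0 for a non-ASCII byte would keep today's behaviour (an 'unknown' token), while a non-zero result would let the identifier path absorb that many bytes.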

AlisdairM