[cfe-dev] Confusing comment on LexTokenInternal

Mon Jul 6 22:10:09 PDT 2009

On Jul 6, 2009, at 2:10 PM, AlisdairM(public) wrote:

> I'm looking at how to add additional literals once Unicode types are  
> supported, which naturally takes me to the LexTokenInternal method  
> of the Lexer class.  The comment describing the function talks about  
> returning true/false, yet the function is void so cannot return  
> anything.

Fixed, thanks!

> I suspect the API has evolved since the comment was written, but it  
> makes me wonder about the validity of other parts of this comment,  
> such as assumed null-termination of the buffer.

Everything else in that function's comment still holds.  Buffers are  
still required to be nul terminated.
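
To illustrate why the nul terminator matters, here is a minimal sketch of a sentinel-based scan. It is not Clang's code: isIdentifierBody and scanIdentifier are hypothetical names made up for this example. The point is that the inner loop needs no end-of-buffer comparison, because '\0' is not an identifier character and stops the scan on its own.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of Clang): true for [A-Za-z0-9_].
inline bool isIdentifierBody(char C) {
  return (C >= 'a' && C <= 'z') || (C >= 'A' && C <= 'Z') ||
         (C >= '0' && C <= '9') || C == '_';
}

// Hypothetical helper (not part of Clang): returns the number of
// identifier characters starting at Ptr.  There is deliberately no
// "Ptr != BufferEnd" test: the buffer is assumed to end with '\0',
// which fails isIdentifierBody and terminates the loop.
inline std::size_t scanIdentifier(const char *Ptr) {
  assert(Ptr && "buffer must be non-null and nul terminated");
  const char *Start = Ptr;
  while (isIdentifierBody(*Ptr))
    ++Ptr;
  return static_cast<std::size_t>(Ptr - Start);
}
```

If the terminator were missing, the same loop would read past the end of the buffer, which is why the requirement has to stay in the comment.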

> As this code is clearly marked as performance sensitive, I am being  
> quite careful before proposing changes for multiple string/character  
> literal types.  The simple drop-through for 'L' will no longer work  
> as we have 10(!) possible literal prefixes in C++0x:
>
>  <empty>
>  L
>  R
>  U
>  u
>  u8
>  LR
>  UR
>  uR
>  u8R
>
> Also, the R variants only apply to string literals, not character  
> literals.

Ok, eww :).

> We must preserve enough info to continue parsing as an identifier if  
> we do not find the ' or " character.  In the odd case we find a '  
> following an R I believe we parse the chars up to and including R as  
> an identifier (maybe a macro) and start a fresh narrow-character  
> literal token with the ' and no intervening whitespace.
>
> Yuck :(

This is probably after the first translation phase, so you also have  
to think about newlines and trigraphs, double yuck :)
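
A rough sketch of the prefix check discussed above, assuming the ten prefixes exactly as listed in the quoted message (the final standard may differ).  LiteralKind, PrefixMatch and classifyLiteralPrefix are hypothetical names, not Clang APIs, and the sketch ignores trigraphs and escaped newlines entirely.  It looks ahead without consuming anything, so the caller can still lex an identifier when no quote follows.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical types for illustration only.
enum class LiteralKind { None, Char, String, RawString };

struct PrefixMatch {
  LiteralKind Kind;
  std::size_t PrefixLen; // characters before the opening quote
};

// Assumes P points into a nul-terminated buffer.  Returns Kind == None
// when the text at P does not start a literal; the caller then lexes
// an identifier (or whatever else applies) from P as usual.
inline PrefixMatch classifyLiteralPrefix(const char *P) {
  assert(P && "buffer must be non-null and nul terminated");
  const char *Cur = P;

  // Optional encoding prefix: L, U, u or u8.
  if (*Cur == 'L' || *Cur == 'U') {
    ++Cur;
  } else if (*Cur == 'u') {
    ++Cur;
    if (*Cur == '8')
      ++Cur;
  }

  // Optional raw marker; per the list above it is valid only on strings.
  bool Raw = false;
  if (*Cur == 'R') {
    Raw = true;
    ++Cur;
  }

  std::size_t Len = static_cast<std::size_t>(Cur - P);
  if (*Cur == '"')
    return {Raw ? LiteralKind::RawString : LiteralKind::String, Len};
  if (*Cur == '\'' && !Raw)
    return {LiteralKind::Char, Len};

  // No quote, or R followed by ': not a prefixed literal.
  return {LiteralKind::None, 0};
}
```

With this shape, R'a' comes back as None, so R is lexed as an identifier and the ' then starts a fresh narrow character literal, which matches the odd case described in the quoted text.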

> I also notice that characters beyond the 7-bit ASCII range are  
> deemed 'unknown' rather than potential identifiers.  Ideally we  
> should be checking for a valid UTF-8 sequence and allowing through  
> as an identifier in that case (consuming the full glyph).  This  
> seems a prerequisite to adding UCN support for identifiers, as UCNs  
> will (presumably) map to the equivalent UTF-8 sequence.  However, I  
> am not sure how much further investigation here should look into GCC  
> ABI to describe mangling of identifiers with extended characters.

Yes, characters in the 128->255 range should definitely be considered  
potential identifiers at some point (when we support Unicode better).
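
As a sketch of what "checking for a valid UTF-8 sequence" could look like, the hypothetical helper below (utf8SequenceLength is not a Clang function) returns how many bytes one well-formed sequence occupies, or 0 if the bytes are malformed.  It only checks encoding validity; deciding whether the decoded code point is actually allowed in an identifier is a separate question that this does not attempt.  It assumes a nul-terminated buffer, so a truncated sequence is caught by the continuation-byte test rather than by a length check.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper for illustration.  Returns the byte length
// (1 to 4) of the UTF-8 sequence at Ptr, or 0 for the nul terminator
// and for malformed input.
inline std::size_t utf8SequenceLength(const char *Ptr) {
  assert(Ptr && "buffer must be non-null and nul terminated");
  const unsigned char *P = reinterpret_cast<const unsigned char *>(Ptr);
  unsigned char Lead = P[0];

  if (Lead < 0x80)
    return Lead ? 1 : 0; // plain ASCII, or end of buffer

  std::size_t Len;
  unsigned long CodePoint;
  unsigned long Min; // smallest code point that needs Len bytes
  if ((Lead & 0xE0) == 0xC0) {
    Len = 2; CodePoint = Lead & 0x1F; Min = 0x80;
  } else if ((Lead & 0xF0) == 0xE0) {
    Len = 3; CodePoint = Lead & 0x0F; Min = 0x800;
  } else if ((Lead & 0xF8) == 0xF0) {
    Len = 4; CodePoint = Lead & 0x07; Min = 0x10000;
  } else {
    return 0; // stray continuation byte or invalid lead byte
  }

  for (std::size_t I = 1; I < Len; ++I) {
    // A nul terminator is not a continuation byte, so this test also
    // stops us before reading past the end of the buffer.
    if ((P[I] & 0xC0) != 0x80)
      return 0;
    CodePoint = (CodePoint << 6) | (P[I] & 0x3F);
  }

  if (CodePoint < Min)
    return 0; // overlong encoding
  if (CodePoint > 0x10FFFF)
    return 0; // beyond the Unicode range
  if (CodePoint >= 0xD800 && CodePoint <= 0xDFFF)
    return 0; // surrogate halves are not valid in UTF-8
  return Len;
}
```

A lexer could call something like this from the "unknown character" path and, on a non-zero result, consume that many bytes as part of an identifier.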

I would strongly recommend decomposing the problems you're working on  
into orthogonal pieces.  Please attack Unicode before (and  
independently of) raw strings, or raw strings independently of  
Unicode.  In LLVM and Clang, we strongly prefer incremental patches  
that get us going in the right direction over massive patches that  
implement an entire feature.

-Chris