[cfe-dev] Confusing comment on LexTokenInternal

Tue Jul 7 11:32:43 PDT 2009

> -----Original Message-----
> From: Chris Lattner [mailto:clattner at apple.com]
> Sent: 07 July 2009 18:25
> To: AlisdairM (public)
> Cc: 'clang-dev Developers'
> Subject: Re: [cfe-dev] Confusing comment on LexTokenInternal

> > I want to kill tok::wide_string_literal and somehow stuff the
> > encoding into tok::string_literal (char, char16_t, char32_t, wchar_t
> > or u8 special. Options for other languages may be appropriate too).
> > Any advice on how to approach this appreciated.
> 
> Makes sense to me!  Do you actually need to encode this in the
> *Token*?  Could you just have StringLiteralParser determine these
> properties?

I guess that makes sense.  The lexer doesn't want to attribute any meaning to the prefix/suffix on the literal, merely find meaningful bounds.  String literal concatenation happens in the parser, and this is probably the first time we really care about representation.

So I guess the first step is to kill tok::wide_string_literal, kill the Boolean flag to Lexer::LexStringLiteral, and carry the prefix (in the token's SourceRange) through the APIs into StringLiteralParser.

Then perform a matching change for character literals.

Then we can look into adding Unicode character types, or recognising suffixes for user-defined literals as part of the same token.  Note: a user-defined literal is effectively a disguised function call syntax, although they might become more 'literal-like' when someone (not me!) gets around to implementing constexpr.
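
To make the first step above concrete, here is a rough sketch of how the string parser could recover the encoding from the token's full spelling, so the lexer only has to find the bounds.  Every name here (LiteralEncoding, PrefixInfo, classifyPrefix) is invented for illustration and is not an existing clang API; it also assumes the spelling starts at the first character of the prefix.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical names throughout; none of these are existing clang types.
enum class LiteralEncoding { Char, Wide, UTF8, UTF16, UTF32 };

struct PrefixInfo {
  LiteralEncoding Encoding;
  std::size_t PrefixLength; // characters consumed by the encoding prefix
};

// Inspect the start of the token spelling and report the encoding prefix.
// A following 'R' (raw string marker) is deliberately not counted here.
inline PrefixInfo classifyPrefix(std::string_view Spelling) {
  auto StartsWith = [&](std::string_view P) {
    return Spelling.substr(0, P.size()) == P;
  };
  // "u8" must be tested before "u", otherwise it would be read as UTF-16.
  if (StartsWith("u8")) return {LiteralEncoding::UTF8, 2};
  if (StartsWith("u"))  return {LiteralEncoding::UTF16, 1};
  if (StartsWith("U"))  return {LiteralEncoding::UTF32, 1};
  if (StartsWith("L"))  return {LiteralEncoding::Wide, 1};
  return {LiteralEncoding::Char, 0};
}
```

The same helper would probably serve character literals too, since the prefixes are shared, which is part of why doing this outside the token kind looks attractive.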

> > Second, I want to include the start/end range of the contents of a
> > raw string literal - minus the delimiters.  Again, this must be done
> > by the lexer so suggest stuffing the information somewhere into the
> > token.  This suggests separate tokens for raw and non-raw (cooked?!)
> > string literals.
> 
> I think that the main lexer should just get the full extent of the
> token and store it in the token (as it does with all other tokens).
> The string literal parser would then "relex" the contents when needed
> to remove the delimiters etc.

This means duplicating some work, but probably not too much as the delimiters are limited to 16 chars max.  The premature optimiser in me wants to do the work (and maintain it!) once and no more, but I'm not about to fight the data structures to make it happen - that is rarely a good sign.
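
For what it's worth, the "relex" for a raw string looks cheap.  A sketch of the idea, assuming the lexer has already handed over the complete token spelling with no user-defined suffix attached; BodyRange and findRawStringBody are made-up names, not anything in the tree:

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper types; offsets are [Begin, End) into the spelling.
struct BodyRange {
  std::size_t Begin;
  std::size_t End;
};

// Given the full spelling of a raw string token, e.g. R"xy(body)xy",
// locate the body between the delimiters. Returns nullopt if malformed.
inline std::optional<BodyRange> findRawStringBody(std::string_view Spelling) {
  // The opening quote must directly follow the 'R' marker.
  std::size_t Quote = Spelling.find('"');
  if (Quote == std::string_view::npos || Quote == 0 ||
      Spelling[Quote - 1] != 'R')
    return std::nullopt;

  // The delimiter runs from the quote to the first '(' (at most 16 chars).
  std::size_t Open = Spelling.find('(', Quote + 1);
  if (Open == std::string_view::npos)
    return std::nullopt;
  std::size_t DelimLen = Open - (Quote + 1);
  if (DelimLen > 16)
    return std::nullopt;
  std::string_view Delim = Spelling.substr(Quote + 1, DelimLen);

  // The token must end with ')' + delimiter + '"'.
  std::size_t TailLen = DelimLen + 2;
  if (Spelling.size() < Open + 1 + TailLen)
    return std::nullopt;
  std::size_t Close = Spelling.size() - TailLen;
  if (Spelling[Close] != ')' || Spelling.substr(Close + 1, DelimLen) != Delim ||
      Spelling.back() != '"')
    return std::nullopt;

  return BodyRange{Open + 1, Close};
}
```

The only repeated work is one scan for the opening parenthesis and a comparison of at most 16 delimiter characters at the tail, since the lexer has already established where the token ends.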

If that sounds right, then it is time for me to stop talking and start cleaning up some patches ;¬)

AlisdairM