[cfe-dev] Confusing comment on LexTokenInternal

Tue Jul 7 10:24:56 PDT 2009

On Jul 7, 2009, at 9:18 AM, AlisdairM(public) wrote:
>> unicode.  In LLVM and Clang, we strongly prefer incremental patches
>> that get us going in the right direction over massive patches that
>> implement an entire feature.
>
> Yes, that is definitely the plan for submitting patches - I'm still  
> trying to make sure I understand the likely solution space so I  
> increment towards the right answer with minimal fuss. Once I know  
> the end goal, I can pick the patch of least resistance to get there  
> <g>

Great!

>>> As this code is clearly marked as performance sensitive, I am being
>>> quite careful before proposing changes for multiple string/character
>>> literal types.  The simple drop-through for 'L' will no longer work
>>> as we have 10(!) possible literal prefixes in C++0x:
>>>
>>> <empty>
>>> L
>>> R
>>> U
>>> u
>>> u8
>>> LR
>>> UR
>>> uR
>>> u8R
>>>
>>> Also, the R variants only apply to string literals, not character
>>> literals.
>>
>> Ok, eww :).
>
> Oh, and it gets worse!  I've not doubled this again to support user- 
> defined-string-literals, which will also compound the number of  
> character, floating point and integer literals we define.  If I  
> follow the existing scheme we will go from 2 string literal token  
> types (tok::string_literal and tok::wide_string_literal) to 20!

Ok, if this is the case, it is probably better to go from two token  
types to one (just string_literal) and have the literal parser stuff  
actually do the categorization.  I think the interesting clients are  
all using the literal parser anyway.

> So what does this mean in practice?
>
> I want to kill tok::wide_string_literal and somehow stuff the  
> encoding into tok::string_literal (char, char16_t, char32_t, wchar_t  
> or u8 special). Options for other languages may be appropriate too.   
> Any advice on how to approach this appreciated.

Makes sense to me!  Do you actually need to encode this in the  
*Token*?  Could you just have StringLiteralParser determine these  
properties?
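
As a rough illustration of that idea, here is a minimal sketch of  
deriving the encoding and rawness from the token's spelling after the  
fact, rather than from the token kind.  Every name in it (Encoding,  
LiteralInfo, classifyStringLiteral) is hypothetical and is not Clang's  
actual StringLiteralParser interface; it also assumes the lexer has  
already checked that the spelling is a well-formed string literal.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical types for illustration only; not Clang's API.
enum class Encoding { Ordinary, Wide, UTF8, UTF16, UTF32 };

struct LiteralInfo {
  Encoding encoding;      // character type selected by the prefix
  bool isRaw;             // true for the R variants
  std::size_t prefixLen;  // number of characters before the opening quote
};

// Classify a string literal from its full spelling, e.g. u8R"x(abc)x".
// Assumes the spelling contains an opening quote and a valid prefix.
inline LiteralInfo classifyStringLiteral(const std::string &spelling) {
  std::size_t quote = spelling.find('"');
  assert(quote != std::string::npos && "not a string literal");
  std::string prefix = spelling.substr(0, quote);

  LiteralInfo info{Encoding::Ordinary, false, quote};

  // A trailing R marks a raw string; strip it to leave the encoding part.
  if (!prefix.empty() && prefix.back() == 'R') {
    info.isRaw = true;
    prefix.pop_back();
  }

  if (prefix == "L")
    info.encoding = Encoding::Wide;    // wchar_t
  else if (prefix == "u8")
    info.encoding = Encoding::UTF8;    // UTF-8 encoded char
  else if (prefix == "u")
    info.encoding = Encoding::UTF16;   // char16_t
  else if (prefix == "U")
    info.encoding = Encoding::UTF32;   // char32_t
  else
    assert(prefix.empty() && "unknown literal prefix");
  return info;
}
```

With something along these lines, all ten prefixes collapse into one  
token kind, and only clients that care about the encoding pay for the  
classification.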

> Second, I want to include the start/end range of the contents of a  
> raw string literal - minus the delimiters.  Again, this must be done  
> by the lexer so I suggest stuffing the information somewhere into the  
> token.  This suggests separate tokens for raw and non-raw (cooked?!)  
> string literals.

I think that the main lexer should just get the full extent of the  
token and store it in the token (as it does with all other tokens).   
The string literal parser would then "relex" the contents when needed  
to remove the delimiters etc.
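
A minimal sketch of that "relex" step, assuming the token stores the  
full spelling of the literal and that raw strings use the  
R"delim(...)delim" form standardized in C++11.  Range and rawStringBody  
are hypothetical names, not existing Clang functions.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper type: half-open range [begin, end) into the spelling.
struct Range {
  std::size_t begin;
  std::size_t end;
};

// Given the full spelling of a raw string literal such as R"x(a)b)x",
// compute the range of the contents between the delimiters.
// Returns false if the spelling is not a well-formed raw string literal.
inline bool rawStringBody(const std::string &spelling, Range &out) {
  std::size_t quote = spelling.find('"');
  // The character just before the opening quote must be the R marker.
  if (quote == std::string::npos || quote == 0 || spelling[quote - 1] != 'R')
    return false;

  std::size_t open = spelling.find('(', quote + 1);
  if (open == std::string::npos)
    return false;

  // The delimiter is whatever sits between the quote and the '('.
  std::string delim = spelling.substr(quote + 1, open - quote - 1);
  std::string closer = ")" + delim + "\"";

  // The literal must end with )delim" after the opening parenthesis.
  if (spelling.size() < open + 1 + closer.size())
    return false;
  std::size_t close = spelling.size() - closer.size();
  if (spelling.compare(close, closer.size(), closer) != 0)
    return false;

  out.begin = open + 1;
  out.end = close;
  return true;
}
```

The main lexer stays simple under this scheme: it only has to find the  
end of the token, and the delimiter bookkeeping happens once, in the  
literal parser.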

-Chris