[cfe-dev] UCNs/extended characters revisited

Sat Jul 11 13:40:16 PDT 2009

On Jul 10, 2009, at 2:01 PM, AlisdairM(public) wrote:

> i/ The extended characters must form a valid UTF-8 code point
> ii/ Not all code points in the basic character plane are valid for  
> use in identifiers.  While C++ might not act on punctuation in  
> different alphabets, it should still not allow it in identifiers.
>
> (i) is easily and efficiently checked.
> (ii) requires a large database of valid/invalid code points to look  
> up against - and frankly this part ought to be written by a Unicode  
> specialist who can validate all the corner cases.  At the moment  
> that database is around 2.5Mb, although that contains more info than  
> we need for simple validation of identifiers.  We will still need a  
> reasonable sized lookup though, ideally encoded into some kind of  
> sparse bit-vector.  I recommend doing this validation in parse or  
> sema, and for performance simply allowing lex to pass along tokens  
> without splitting on the invalid characters.
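
To make (i) concrete, here is a minimal sketch of a well-formedness
check for a single UTF-8 sequence. The function name
validUtf8SequenceLength is hypothetical (it is not an existing clang
API), and the rejection rules are my assumption of what "valid" should
mean here: no overlong forms, no surrogates, nothing above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns the length in bytes (1-4) of the
// well-formed UTF-8 sequence starting at s, or 0 if the bytes are not
// well-formed. n is the number of bytes available at s.
size_t validUtf8SequenceLength(const unsigned char *s, size_t n) {
  if (n == 0)
    return 0;

  unsigned char b0 = s[0];
  if (b0 < 0x80)
    return 1; // plain ASCII

  size_t len;
  uint32_t cp;  // code point being decoded
  uint32_t min; // smallest code point allowed for this length
  if ((b0 & 0xE0) == 0xC0) {
    len = 2; cp = b0 & 0x1F; min = 0x80;
  } else if ((b0 & 0xF0) == 0xE0) {
    len = 3; cp = b0 & 0x0F; min = 0x800;
  } else if ((b0 & 0xF8) == 0xF0) {
    len = 4; cp = b0 & 0x07; min = 0x10000;
  } else {
    return 0; // stray continuation byte or invalid lead byte
  }

  if (n < len)
    return 0; // truncated sequence

  for (size_t i = 1; i < len; ++i) {
    if ((s[i] & 0xC0) != 0x80)
      return 0; // not a continuation byte
    cp = (cp << 6) | (s[i] & 0x3F);
  }

  if (cp < min)
    return 0; // overlong encoding
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0; // UTF-16 surrogate, not a scalar value
  if (cp > 0x10FFFF)
    return 0; // beyond the Unicode code space

  return len;
}
```

This only answers "is this byte sequence well-formed"; whether the
decoded code point is acceptable in an identifier is question (ii).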

As for (ii), I assume you simply want to disallow characters whose
general category is P* (i.e., Pd, Ps, etc.), no? The full list of such
characters for Unicode 5.1 can be found in
<http://www.unicode.org/Public/UNIDATA/extracted/DerivedGeneralCategory.txt>.
Do you want to embed data from a particular version of Unicode, or
would you rather track whatever version of Unicode is supported by the
host environment? If the former, a simple inversion list would suffice
for a binary test such as this; if the latter, you could either call
directly into a host API to classify each character, or preprocess the
host's data once to avoid calling the API repeatedly. I'm happy to
assist with any of this.
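
For illustration, here is a minimal sketch of the inversion-list
approach. The names kPunctBoundaries and isListedPunctuation are
hypothetical, and the table is a tiny hand-picked excerpt rather than
real generated data; an actual table would be produced from
DerivedGeneralCategory.txt for whichever Unicode version is chosen.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>

// An inversion list is a sorted array of boundaries. Even indices start
// a range that is in the set, odd indices start a range that is not.
// Illustrative excerpt only (assumed, not generated from the UCD):
//   U+00A1          inverted exclamation mark
//   U+2010..U+2027  dashes, quotation marks, daggers, bullets, ...
//   U+3001..U+3003  ideographic comma, full stop, ditto mark
static const uint32_t kPunctBoundaries[] = {
    0x00A1, 0x00A2,
    0x2010, 0x2028,
    0x3001, 0x3004,
};

// Returns true if cp falls inside one of the listed ranges.
bool isListedPunctuation(uint32_t cp) {
  const uint32_t *first = std::begin(kPunctBoundaries);
  const uint32_t *last = std::end(kPunctBoundaries);
  // Count the boundaries that are <= cp. An odd count means the most
  // recent boundary opened a range, so cp is inside the set.
  return (std::upper_bound(first, last, cp) - first) % 2 == 1;
}
```

Lookup is a binary search over the boundaries, so the cost grows with
the logarithm of the number of ranges, and the table itself is just a
flat array of integers that can be emitted by a script.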

Ned