[cfe-dev] UCNs/extended characters revisited

Ned Holbrook ned at panic.com
Sat Jul 11 13:40:16 PDT 2009

On Jul 10, 2009, at 2:01 PM, AlisdairM(public) wrote:

> i/ The extended characters must form a valid UTF-8 code point
> ii/ Not all code points in the basic character plane are valid for  
> use in identifiers.  While C++ might not act on punctuation in  
> different alphabets, it should still not allow it in identifiers.
> (i) is easily and efficiently checked.
> (ii) requires a large database of valid/invalid code points to look  
> up against - and frankly this part ought to be written by a Unicode  
> specialist who can validate all the corner cases.  At the moment  
> that database is around 2.5Mb, although that contains more info than  
> we need for simple validation of identifiers.  We will still need a  
> reasonable sized lookup though, ideally encoded into some kind of  
> sparse bit-vector.  I recommend doing this validation in parse or  
> sema, and for performance simply allowing lex to pass along tokens  
> without splitting on the invalid characters.

As for (ii), I assume you simply want to disallow characters having
general category P* (i.e., Pd, Ps, and so on), no? The full list of such
characters for Unicode 5.1 can be found in
<http://www.unicode.org/Public/UNIDATA/extracted/DerivedGeneralCategory.txt>.
Do you want to embed data from a particular version of Unicode, or
would you rather track whatever version of Unicode is supported by the  
host environment? If the former, for a simple binary test such as this  
you could always use a simple inversion list; if the latter, you could  
either call directly into a host API to classify each character or you  
could preprocess these data to avoid calling the API repeatedly. I'm  
happy to assist with any of this.
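
To make the inversion-list option concrete, here is a minimal sketch rather than a proposed patch. An inversion list is a sorted array of code points at which membership flips: entries at even indices start a range and entries at odd indices end it (exclusive). The names kPunctBoundaries, isPunctuation and identifierHasNoPunctuation are hypothetical, and the table is a small hand-picked excerpt of P* ranges, not the full set one would generate from DerivedGeneralCategory.txt.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>

// Illustrative excerpt only (an assumption for this sketch): a real table
// would be generated from DerivedGeneralCategory.txt for a chosen Unicode
// version. Each pair is [start, end) of a run of P* code points.
// ASCII '_' (U+005F, category Pc) is deliberately left out here.
static const std::uint32_t kPunctBoundaries[] = {
    0x0021, 0x0024,  // ! " #
    0x0025, 0x002B,  // % & ' ( ) *
    0x002C, 0x0030,  // , - . /
    0x2010, 0x2028,  // hyphens, dashes, quotation marks, daggers, bullets
    0x3001, 0x3004,  // ideographic comma, full stop, ditto mark
};

// Binary test against the inversion list: count the boundaries <= cp.
// An odd count means cp lies inside one of the [start, end) ranges.
bool isPunctuation(std::uint32_t cp) {
  const std::uint32_t *first = std::begin(kPunctBoundaries);
  const std::uint32_t *last = std::end(kPunctBoundaries);
  const std::ptrdiff_t n = std::upper_bound(first, last, cp) - first;
  return (n & 1) != 0;
}

// Hypothetical helper: true if none of the already-decoded code points of
// an identifier falls in the punctuation table above.
bool identifierHasNoPunctuation(const std::uint32_t *cps, std::size_t n) {
  assert(cps != nullptr || n == 0);
  return std::none_of(cps, cps + n, isPunctuation);
}
```

The lookup is a single binary search over a flat array, so the cost per character is logarithmic in the number of ranges and the table stays small because runs of adjacent punctuation collapse into one pair. Note that a blanket P* test would also reject '_' (category Pc), so a real check would need to exempt whatever connector punctuation identifiers are meant to allow.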

