[cfe-dev] UCNs/extended characters revisitted

Sat Jul 11 14:23:24 PDT 2009

> -----Original Message-----
> From: Ned Holbrook [mailto:ned at panic.com]
> Sent: 11 July 2009 21:40
> To: AlisdairM(public)
> Cc: 'clang-dev Developers'
> Subject: Re: [cfe-dev] UCNs/extended characters revisitted
> 
> On Jul 10, 2009, at 2:01 PM, AlisdairM(public) wrote:
> 
> > i/ The extended characters must form a valid UTF-8 code point
> > ii/ Not all code points in the basic character plane are valid for
> > use in identifiers.  While C++ might not act on punctuation in
> > different alphabets, it should still not allow it in identifiers.
> >
> > (i) is easily and efficiently checked.
> > (ii) requires a large database of valid/invalid code points to look
> > up against - and frankly this part ought to be written by a Unicode
> > specialist who can validate all the corner cases.  At the moment
> > that database is around 2.5Mb, although that contains more info than
> > we need for simple validation of identifiers.  We will still need a
> > reasonable sized lookup though, ideally encoded into some kind of
> > sparse bit-vector.  I recommend doing this validation in parse or
> > sema, and for performance simply allowing lex to pass along tokens
> > without splitting on the invalid characters.
> 
> As for (ii), I assume you simply want to disallow characters having
> general category P* (ie: Pd, Ps, etc.), no? The full list of such
> characters for Unicode 5.1 can be found in
> <http://www.unicode.org/Public/UNIDATA/extracted/DerivedGeneralCategory.txt>.
> Do you want to embed data from a particular version of Unicode, or
> would you rather track whatever version of Unicode is supported by the
> host environment? If the former, for a simple binary test such as this
> you could always use a simple inversion list; if the latter, you could
> either call directly into a host API to classify each character or you
> could preprocess these data to avoid calling the API repeatedly. I'm
> happy to assist with any of this.

The rule here for C++0x is covered by 2.11p1 [lex.name]:

"Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in Annex A of TR 10176:2003."

Note that TR 10176 is one of the ISO standards that is freely available from the ISO web site.
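
For what it's worth, a range check of this kind could be sketched as a binary search over a sorted table of code point ranges. This is only a sketch: the names CodePointRange, kAllowedRanges and isAllowedInIdentifier are hypothetical, and the three ranges below are illustrative placeholders, not a transcription of Annex A.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>

// Inclusive range of code points [first, last].
struct CodePointRange {
  std::uint32_t first;
  std::uint32_t last;
};

// PLACEHOLDER DATA: illustrative ranges only, NOT the real Annex A table.
// The real table would be generated from TR 10176:2003 Annex A.
// Must be sorted in ascending order and non-overlapping.
static const CodePointRange kAllowedRanges[] = {
    {0x00C0, 0x00D6},
    {0x00D8, 0x00F6},
    {0x0100, 0x017F},
};

// Returns true if cp falls inside one of the ranges in kAllowedRanges.
bool isAllowedInIdentifier(std::uint32_t cp) {
  const CodePointRange *begin = std::begin(kAllowedRanges);
  const CodePointRange *end = std::end(kAllowedRanges);

  // Sanity check (debug builds only): table is in ascending order.
  assert(std::is_sorted(begin, end,
                        [](const CodePointRange &a, const CodePointRange &b) {
                          return a.last < b.first;
                        }));

  // Find the first range whose upper bound is not below cp.
  const CodePointRange *it = std::lower_bound(
      begin, end, cp, [](const CodePointRange &r, std::uint32_t value) {
        return r.last < value;
      });

  // cp is allowed only if that range also starts at or before cp.
  return it != end && it->first <= cp;
}
```

A sorted range table keeps the lookup at O(log n) and is much smaller than the full Unicode database; a sparse bit-vector would be an alternative if lookups turn out to be hot.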

Eli has also pointed out the governing rules for C99, and I really hope they are similar enough not to be an issue!

As I said, my initial plan is simply to accept all code points in the basic character plane, other than those already covered by the basic ASCII range.  I'm really hoping someone else (hint hint<G>) will provide that last level of validation, although in the meantime I can add a validation hook that always returns 'true'.
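
Roughly, the initial plan could look like the sketch below. None of these names exist in clang today; IdentifierValidator, acceptAll and isExtendedIdentifierChar are hypothetical, and I am assuming the lexer has already decoded the code point.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical hook type: returns true if the code point may appear in
// an identifier. A real implementation would consult the Annex A ranges.
using IdentifierValidator = bool (*)(std::uint32_t codePoint);

// Stand-in validator that accepts everything, to be replaced later.
bool acceptAll(std::uint32_t) { return true; }

// Decides whether a decoded code point is treated as an extended
// identifier character. Assumption: ASCII is handled by the existing
// lexer paths, so it is rejected here rather than re-validated.
bool isExtendedIdentifierChar(std::uint32_t cp,
                              IdentifierValidator validate = acceptAll) {
  assert(validate != nullptr);
  if (cp < 0x80)
    return false; // basic ASCII range: not our concern
  if (cp > 0xFFFF)
    return false; // outside the basic plane: not accepted initially
  return validate(cp);
}
```

Swapping in a real validator later is then a one-line change at the call site, and the lexer itself does not need to know where the data comes from.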

The other issue once we allow UCNs is the tracking of column numbers, which mostly matters for formatting our error messages.  Internally, I recommend everything stays as it is now, tracking UTF-8 code *units*.  We should convert column numbers to indices based on code *points* at the time we report an error, and leave the user's rendering system to cope with combining multiple code points into single glyphs, although that means our ^ and ~~~~~ markers may be a little out of sync in awkward cases.  Fundamentally, I don't think there is any way to resolve that; those markers should ultimately be rendered by an IDE rather than by our code anyway.
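
The conversion itself is cheap. A minimal sketch, assuming the line is already known to be valid UTF-8 and that columns are zero-based (codePointColumn is a hypothetical helper, not existing clang code):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Converts a zero-based column measured in UTF-8 code units (bytes) into
// a zero-based column measured in code points, by counting the bytes
// before byteColumn that start a new code point.
// Assumption: 'line' holds valid UTF-8.
std::size_t codePointColumn(const std::string &line, std::size_t byteColumn) {
  assert(byteColumn <= line.size());
  std::size_t column = 0;
  for (std::size_t i = 0; i < byteColumn; ++i) {
    unsigned char c = static_cast<unsigned char>(line[i]);
    // Continuation bytes have the form 10xxxxxx; every other byte
    // begins a new code point.
    if ((c & 0xC0) != 0x80)
      ++column;
  }
  return column;
}
```

For the line "int caf\xC3\xA9 = 1;" the '=' sits at byte column 10 but code point column 9, because the two-byte e-acute counts once. Combining sequences would still count as several code points, which is the residual misalignment mentioned above.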

AlisdairM