[cfe-dev] UCNs/extended characters revisited

Sun Jul 12 14:56:27 PDT 2009

On Jul 10, 2009, at 2:01 PM, AlisdairM(public) wrote:

> OK, while my first Unicode patch is stewing let's consider how to  
> support UCNs.
>
> Currently Clang effectively requires source files to be UTF-8
> encoded.  In fact it mostly requires files without any extended
> characters at all, but it does translate UCNs in string literals to
> UTF-8, so I propose that UTF-8 is recognised as the formal internal
> representation.

Yep, that's the plan.  If we want to support other input formats (e.g.
EBCDIC or UTF-16) we can always translate the memory buffer to UTF-8
before the lexer starts going at it.  This is a separate project from
handling UCNs of course :).
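
A minimal sketch of that pre-lexing translation step, assuming the input has already been loaded as UTF-16 code units.  The names transcodeUTF16ToUTF8 and appendUTF8 are hypothetical and are not Clang APIs; this only illustrates the idea of normalising the buffer before the lexer sees it.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point.
// Assumes CP <= 0x10FFFF and CP is not a surrogate.
static void appendUTF8(std::string &Out, uint32_t CP) {
  if (CP < 0x80) {
    Out += static_cast<char>(CP);
  } else if (CP < 0x800) {
    Out += static_cast<char>(0xC0 | (CP >> 6));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    Out += static_cast<char>(0xE0 | (CP >> 12));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else {
    Out += static_cast<char>(0xF0 | (CP >> 18));
    Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  }
}

// Hypothetical pre-lexing step: transcode a UTF-16 buffer to UTF-8.
// Returns false if the input contains an unpaired surrogate.
bool transcodeUTF16ToUTF8(const std::u16string &In, std::string &Out) {
  Out.clear();
  for (std::size_t I = 0, E = In.size(); I != E; ++I) {
    uint32_t CP = In[I];
    if (CP >= 0xD800 && CP <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      if (I + 1 == E)
        return false;
      uint32_t Lo = In[I + 1];
      if (Lo < 0xDC00 || Lo > 0xDFFF)
        return false;
      CP = 0x10000 + ((CP - 0xD800) << 10) + (Lo - 0xDC00);
      ++I;
    } else if (CP >= 0xDC00 && CP <= 0xDFFF) {
      return false; // Low surrogate with no preceding high surrogate.
    }
    appendUTF8(Out, CP);
  }
  return true;
}
```

With something like this in front of it, the lexer itself only ever has to deal with one encoding.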

> Now when lexing, any time we hit an extended character we describe
> it as an unknown token.  My first proposal is that we recognise
> there is no punctuation to be recognised from characters above
> 0x7F, so essentially we can add an arbitrarily long string of such
> characters to an identifier without worrying about breaking the parse.

Ok.
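
To make the proposal concrete, here is a rough sketch of a permissive identifier scan that treats every byte above 0x7F as part of the identifier.  The function names are hypothetical and this is not how Clang's Lexer is actually structured; it does no UTF-8 validation and no check of which code points are legal, which is exactly the permissiveness being proposed.

```cpp
#include <cstddef>
#include <string>

// ASCII identifier characters, plus (permissively) any byte >= 0x80.
static bool isIdentifierBody(unsigned char C) {
  return (C >= 'a' && C <= 'z') || (C >= 'A' && C <= 'Z') ||
         (C >= '0' && C <= '9') || C == '_' || C >= 0x80;
}

// Same as above, except that an identifier may not start with a digit.
static bool isIdentifierStart(unsigned char C) {
  return isIdentifierBody(C) && !(C >= '0' && C <= '9');
}

// Hypothetical helper: returns the length in bytes of the identifier
// starting at Pos, or 0 if no identifier starts there.
std::size_t lexIdentifier(const std::string &Buf, std::size_t Pos) {
  if (Pos >= Buf.size() || !isIdentifierStart(Buf[Pos]))
    return 0;
  std::size_t End = Pos + 1;
  while (End < Buf.size() && isIdentifierBody(Buf[End]))
    ++End;
  return End - Pos;
}
```

Because no ASCII punctuation has its high bit set, the scan still stops at every operator and separator the parser cares about.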

> There are two issues at this point though:
>
> i/ The extended characters must form valid UTF-8 sequences
> ii/ Not all code points in the basic character plane are valid for  
> use in identifiers.  While C++ might not act on punctuation in  
> different alphabets, it should still not allow it in identifiers.
>
> (i) is easily and efficiently checked.
> (ii) requires a large database of valid/invalid code points to look  
> up against - and frankly this part ought to be written by a Unicode  
> specialist who can validate all the corner cases.  At the moment
> that database is around 2.5 MB, although that contains more info than
> we need for simple validation of identifiers.  We will still need a
> reasonably sized lookup though, ideally encoded into some kind of
> sparse bit-vector.  I recommend doing this validation in parse or
> sema, and for performance simply allowing lex to pass along tokens
> without splitting on the invalid characters.
>
> So my suggestion is to pass on (ii) for now, and accept a broader  
> range of identifiers than strictly allowed.  We might issue a  
> warning the first time we see an extended character in an identifier  
> (per translation unit) that extended character support currently  
> permits characters that may become illegal in future versions.

Adding a 2.5 MB database to clang sounds like a non-starter.  When we
actually care enough about this, hopefully there will be a better way
to go.  Neil, do you know of a good way to do this sort of check?
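
One possible shape for a compact check, offered only as a sketch: a sorted table of inclusive code point ranges searched with std::lower_bound.  The ranges below are illustrative placeholders chosen for the example (a few Latin-1 and Greek letter runs), not the real list from any standard, and isAllowedInIdentifier is a hypothetical name.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>

struct CodePointRange {
  uint32_t First, Last; // Inclusive bounds.
};

// ILLUSTRATIVE PLACEHOLDER DATA, sorted and non-overlapping.  A real
// table would be generated from the ranges the language standard lists.
static const CodePointRange AllowedRanges[] = {
    {0x00C0, 0x00D6}, // Latin-1 letters before the multiplication sign
    {0x00D8, 0x00F6}, // Latin-1 letters between U+00D7 and U+00F7
    {0x00F8, 0x00FF}, // Latin-1 letters after the division sign
    {0x0391, 0x03A1}, // Greek capital Alpha..Rho
    {0x03B1, 0x03C9}, // Greek small alpha..omega
};

// Hypothetical check: is CP inside one of the allowed ranges?
bool isAllowedInIdentifier(uint32_t CP) {
  // Find the first range whose upper bound is not below CP.
  const CodePointRange *I = std::lower_bound(
      std::begin(AllowedRanges), std::end(AllowedRanges), CP,
      [](const CodePointRange &R, uint32_t V) { return R.Last < V; });
  return I != std::end(AllowedRanges) && I->First <= CP;
}
```

Each range costs 8 bytes here, so a table of this form should stay far below the size of a full character database, and the lookup is a binary search that could run in Sema rather than in the lexer's hot path.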

> If I can get permission for this approach, UCNs follow fairly
> easily, simply encoding into the same extended UTF-8 code points.
> At this point we would have most of the UCN support required for
> C99/C++98-03, with the exception that we are a little permissive in
> what we accept.  Technically, that's an extension, as we should
> translate all valid portable programs ;¬)

Makes sense to me.  There is another question of canonicalization
though: at which stage should a UCN be translated into its
corresponding UTF-8 character?  Should this be done when the lexer
forms the IdentifierInfo?
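
For illustration, a sketch of what that canonicalization could look like if it were applied to an identifier's spelling before the identifier table lookup.  canonicalizeUCNs and its helpers are hypothetical names, not existing Clang functions, and the sketch only checks that the UCN is well formed and names a non-surrogate code point, not that the code point is permitted in identifiers.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Value of one hex digit, or -1 if C is not a hex digit.
static int hexValue(char C) {
  if (C >= '0' && C <= '9') return C - '0';
  if (C >= 'a' && C <= 'f') return C - 'a' + 10;
  if (C >= 'A' && C <= 'F') return C - 'A' + 10;
  return -1;
}

// Append the UTF-8 encoding of CP (assumed to be a valid scalar value).
static void appendUTF8(std::string &Out, uint32_t CP) {
  if (CP < 0x80) {
    Out += static_cast<char>(CP);
  } else if (CP < 0x800) {
    Out += static_cast<char>(0xC0 | (CP >> 6));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    Out += static_cast<char>(0xE0 | (CP >> 12));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else {
    Out += static_cast<char>(0xF0 | (CP >> 18));
    Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  }
}

// Hypothetical helper: rewrite every \uXXXX or \UXXXXXXXX in Spelling
// as UTF-8.  Returns false on a malformed or out-of-range UCN.
bool canonicalizeUCNs(const std::string &Spelling, std::string &Out) {
  Out.clear();
  for (std::size_t I = 0; I < Spelling.size();) {
    if (Spelling[I] != '\\') {
      Out += Spelling[I++];
      continue;
    }
    if (I + 1 >= Spelling.size())
      return false;
    std::size_t Digits =
        Spelling[I + 1] == 'u' ? 4 : Spelling[I + 1] == 'U' ? 8 : 0;
    if (Digits == 0 || I + 2 + Digits > Spelling.size())
      return false;
    uint32_t CP = 0;
    for (std::size_t J = 0; J < Digits; ++J) {
      int V = hexValue(Spelling[I + 2 + J]);
      if (V < 0)
        return false;
      CP = (CP << 4) | static_cast<uint32_t>(V);
    }
    if (CP > 0x10FFFF || (CP >= 0xD800 && CP <= 0xDFFF))
      return false;
    appendUTF8(Out, CP);
    I += 2 + Digits;
  }
  return true;
}
```

If the canonical spelling is what gets hashed, an identifier written with a UCN and the same identifier written with the raw UTF-8 character would end up as a single table entry.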

-Chris