[cfe-dev] UCNs/extended characters revisited

Fri Jul 10 14:01:08 PDT 2009

OK, while my first Unicode patch is stewing, let's consider how to support UCNs (universal character names).

Currently Clang effectively requires source files to be UTF-8 encoded.  In fact it mostly requires files without any extended characters at all, but it translates UCNs in string literals to UTF-8, so I propose that UTF-8 be recognised as the formal internal representation.

Now when lexing, any time we hit an extended character we describe it as an unknown token.  My first proposal is that we recognise that there is no punctuation to be recognised among characters above 0x7F, so essentially we can add an arbitrarily long string of such characters to an identifier without worrying about breaking the parse.  There are two issues at this point though:

i/ The extended characters must form valid UTF-8 sequences, each encoding a code point
ii/ Not all code points in the basic character plane are valid for use in identifiers.  While C++ might not act on punctuation in different alphabets, it should still not allow it in identifiers.
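
To make the lexing proposal concrete, here is a minimal sketch of an identifier scanner that simply treats every byte above 0x7F as part of the identifier.  It is not Clang's lexer; scanIdentifier and its helpers are hypothetical names invented for illustration, and it deliberately does neither of the two checks listed above.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helpers for illustration; these are not Clang's lexer API.
inline bool isAsciiIdentStart(unsigned char c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

inline bool isAsciiIdentBody(unsigned char c) {
  return isAsciiIdentStart(c) || (c >= '0' && c <= '9');
}

// Any byte above 0x7F belongs to an extended character.  No C or C++
// punctuator uses such bytes, so absorbing them cannot split a token wrongly.
inline bool isExtendedByte(unsigned char c) { return c >= 0x80; }

// Returns the length in bytes of the identifier at the front of `src`,
// or 0 if no identifier starts there.  Extended bytes are accepted
// unconditionally; validation is left to a later stage.
inline std::size_t scanIdentifier(std::string_view src) {
  if (src.empty()) return 0;
  unsigned char first = static_cast<unsigned char>(src[0]);
  if (!isAsciiIdentStart(first) && !isExtendedByte(first)) return 0;
  std::size_t n = 1;
  while (n < src.size()) {
    unsigned char c = static_cast<unsigned char>(src[n]);
    if (!isAsciiIdentBody(c) && !isExtendedByte(c)) break;
    ++n;
  }
  return n;
}
```

The scanner works on raw bytes, so the lexer's fast path gains only one extra comparison per character.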

(i) is easily and efficiently checked.
(ii) requires a large database of valid/invalid code points to look up against - and frankly this part ought to be written by a Unicode specialist who can validate all the corner cases.  At the moment that database is around 2.5Mb, although that contains more info than we need for simple validation of identifiers.  We will still need a reasonably sized lookup though, ideally encoded into some kind of sparse bit-vector.  I recommend doing this validation in parse or sema, and for performance simply allowing lex to pass along tokens without splitting on the invalid characters.
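
As a rough sketch of both checks: decodeUtf8 below covers (i), rejecting overlong forms, surrogates and truncated sequences, and isAllowedInIdentifier covers (ii) with a binary search over a sorted range table.  The range table is a stand-in I have swapped in for the sparse bit-vector, and its three entries are illustrative placeholders only, not the table from either standard.  All names here are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string_view>

// Check (i): decode one code point from the front of `s`.  On success stores
// it in `cp` and returns the bytes consumed (1-4); returns 0 if ill-formed.
inline std::size_t decodeUtf8(std::string_view s, std::uint32_t &cp) {
  if (s.empty()) return 0;
  unsigned char b0 = static_cast<unsigned char>(s[0]);
  if (b0 < 0x80) { cp = b0; return 1; }

  std::size_t len;
  std::uint32_t min;  // smallest code point that legitimately needs `len` bytes
  if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
  else return 0;  // stray continuation byte or invalid lead byte

  if (s.size() < len) return 0;  // truncated sequence
  for (std::size_t i = 1; i < len; ++i) {
    unsigned char b = static_cast<unsigned char>(s[i]);
    if ((b & 0xC0) != 0x80) return 0;  // not a continuation byte
    cp = (cp << 6) | (b & 0x3F);
  }
  if (cp < min) return 0;                      // overlong encoding
  if (cp > 0x10FFFF) return 0;                 // beyond the Unicode range
  if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogate
  return len;
}

struct CodePointRange { std::uint32_t first, last; };

// ILLUSTRATIVE PLACEHOLDER: a real table would be generated from the
// standard's annex.  Must stay sorted and non-overlapping.
constexpr CodePointRange kAllowedRanges[] = {
    {0x00C0, 0x00D6}, {0x00D8, 0x00F6}, {0x0391, 0x03A1},
};

// Check (ii): binary search for the last range starting at or before `cp`.
inline bool isAllowedInIdentifier(std::uint32_t cp) {
  auto it = std::upper_bound(
      std::begin(kAllowedRanges), std::end(kAllowedRanges), cp,
      [](std::uint32_t v, const CodePointRange &r) { return v < r.first; });
  if (it == std::begin(kAllowedRanges)) return false;
  --it;
  return cp <= it->last;
}
```

With these placeholder ranges U+00E9 is accepted while U+00D7 (the multiplication sign) is rejected, which is the kind of distinction (ii) is about.  A bit-vector would trade a larger table for a constant-time lookup.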

So my suggestion is to pass on (ii) for now, and accept a broader range of identifiers than strictly allowed.  We might issue a warning the first time we see an extended character in an identifier (per translation unit) that extended character support currently permits characters that may become illegal in future versions.

If I can get permission for this approach, UCNs follow fairly easily, simply encoding into the same UTF-8 sequences used for extended characters.  At this point we would have most of the UCN support required for C99/C++98-03, with the exception that we are a little permissive in what we accept.  Technically, that's an extension, as we should translate all valid portable programs ;¬)
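
For illustration, translating a UCN into UTF-8 might look like the sketch below.  translateUcn and appendUtf8 are hypothetical names, not existing Clang functions, and the sketch does not enforce the standards' extra restrictions on which code points a UCN may name.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Appends the UTF-8 encoding of `cp` to `out`.  Returns false for surrogates
// and for values beyond U+10FFFF, which have no valid UTF-8 encoding.
inline bool appendUtf8(std::uint32_t cp, std::string &out) {
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return true;
}

// Parses a UCN ("\uXXXX" or "\UXXXXXXXX") at the front of `s` and appends its
// UTF-8 form to `out`.  Returns the characters consumed, or 0 on failure.
inline std::size_t translateUcn(std::string_view s, std::string &out) {
  if (s.size() < 2 || s[0] != '\\') return 0;
  std::size_t digits;
  if (s[1] == 'u') digits = 4;
  else if (s[1] == 'U') digits = 8;
  else return 0;
  if (s.size() < 2 + digits) return 0;

  std::uint32_t cp = 0;
  for (std::size_t i = 0; i < digits; ++i) {
    char c = s[2 + i];
    std::uint32_t v;
    if (c >= '0' && c <= '9') v = static_cast<std::uint32_t>(c - '0');
    else if (c >= 'a' && c <= 'f') v = static_cast<std::uint32_t>(c - 'a' + 10);
    else if (c >= 'A' && c <= 'F') v = static_cast<std::uint32_t>(c - 'A' + 10);
    else return 0;  // not a hexadecimal digit
    cp = (cp << 4) | v;
  }
  if (!appendUtf8(cp, out)) return 0;
  return 2 + digits;
}
```

Once a UCN has been rewritten this way, an identifier spelled with \u00e9 and one spelled with the raw UTF-8 bytes for the same character compare equal byte for byte.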

AlisdairM