[cfe-dev] UCNs/extended characters revisited

Fri Jul 10 15:25:14 PDT 2009

On Fri, Jul 10, 2009 at 2:01 PM, AlisdairM(public)<public at alisdairm.net> wrote:
> OK, while my first Unicode patch is stewing let's consider how to support UCNs.
>
> Currently Clang effectively requires source files to be UTF-8 encoded.  In fact it mostly requires files without any extended characters at all, but it translates UCNs in string literals to UTF-8, so I propose that UTF-8 is recognised as the formal internal representation.

Right, that's what we've been intending to implement.

> Now when lexing, any time we hit an extended character we describe it as an unknown token.  My first proposal is that we recognise there is no punctuation to be recognised from characters above 0x7F, so essentially we can add an arbitrarily long string of such characters to an identifier without worrying about breaking the parse.  There are two issues at this point though:
>
> i/ The extended characters must form a valid UTF-8 encoding of a code point
> ii/ Not all code points in the basic character plane are valid for use in identifiers.  While C++ might not act on punctuation in different alphabets, it should still not allow it in identifiers.
>
> (i) is easily and efficiently checked.
> (ii) requires a large database of valid/invalid code points to look up against - and frankly this part ought to be written by a Unicode specialist who can validate all the corner cases.  At the moment that database is around 2.5Mb, although that contains more info than we need for simple validation of identifiers.  We will still need a reasonably sized lookup though, ideally encoded into some kind of sparse bit-vector.  I recommend doing this validation in parse or sema, and for performance simply allowing lex to pass along tokens without splitting on the invalid characters.
>
> So my suggestion is to pass on (ii) for now, and accept a broader range of identifiers than strictly allowed.  We might issue a warning the first time we see an extended character in an identifier (per translation unit) that extended character support currently permits characters that may become illegal in future versions.

Sounds like a reasonable proposal, except that we can't delay the
validation until Parser/Sema (consider the case of a macro whose name
contains an extended character).
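
To make the macro point concrete: the preprocessor has to decide where an
identifier ends before it can look the name up as a macro, and a macro name
may be expanded away without ever reaching Parser/Sema. Below is a rough
sketch of doing both checks while lexing. It is not Clang's actual lexer;
decodeUTF8, isAllowedExtended, lexIdentifier and the kAllowed table are all
hypothetical names, and the three ranges are placeholders for illustration
rather than the real list of permitted code points. A sorted range table
with binary search stands in for the sparse bit-vector suggested above.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string_view>

// Check (i): decode one UTF-8 sequence starting at s[pos] (pos < s.size()).
// Returns its length in bytes, or 0 if the sequence is malformed
// (truncated, bad continuation byte, overlong, surrogate, or > U+10FFFF).
static std::size_t decodeUTF8(std::string_view s, std::size_t pos,
                              char32_t &cp) {
  auto byteAt = [&](std::size_t i) {
    return static_cast<unsigned char>(s[i]);
  };
  unsigned char b0 = byteAt(pos);
  std::size_t len = 0;
  char32_t min = 0; // smallest code point that needs this many bytes
  if (b0 < 0x80) { cp = b0; return 1; }
  else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
  else return 0; // stray continuation byte or invalid lead byte
  if (pos + len > s.size()) return 0; // truncated sequence
  for (std::size_t i = 1; i < len; ++i) {
    unsigned char b = byteAt(pos + i);
    if ((b & 0xC0) != 0x80) return 0;
    cp = (cp << 6) | (b & 0x3F);
  }
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
  return len;
}

// Check (ii): sorted, non-overlapping ranges of permitted code points.
// ASSUMPTION: placeholder ranges only; a real table would be generated
// from the relevant standard's list.
struct Range { char32_t lo, hi; };
static constexpr Range kAllowed[] = {
    {0x00C0, 0x00D6}, {0x00D8, 0x00F6}, {0x00F8, 0x00FF}};

static bool isAllowedExtended(char32_t cp) {
  // Binary search for the first range whose upper bound is >= cp.
  auto it = std::lower_bound(
      std::begin(kAllowed), std::end(kAllowed), cp,
      [](const Range &r, char32_t c) { return r.hi < c; });
  return it != std::end(kAllowed) && it->lo <= cp;
}

// Returns the byte length of the identifier at the start of s (0 if none).
// The identifier ends at the first malformed or disallowed character.
static std::size_t lexIdentifier(std::string_view s) {
  std::size_t pos = 0;
  while (pos < s.size()) {
    unsigned char c = static_cast<unsigned char>(s[pos]);
    if (c < 0x80) {
      bool alpha = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                   c == '_';
      bool digit = c >= '0' && c <= '9';
      if (!(alpha || (digit && pos > 0))) break;
      ++pos;
    } else {
      char32_t cp;
      std::size_t len = decodeUTF8(s, pos, cp);
      if (len == 0 || !isAllowedExtended(cp)) break;
      pos += len;
    }
  }
  return pos;
}
```

With this table, the UTF-8 bytes for U+00E9 extend an identifier, while
U+00D7 (the multiplication sign) ends it, so the preprocessor sees the same
identifier boundaries that later phases would.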

-Eli