[cfe-commits] [PATCH] Support for universal character names in identifiers

Tue Nov 27 14:37:33 PST 2012

On Tue, Nov 27, 2012 at 2:25 PM, Richard Smith <richard at metafoo.co.uk> wrote:
> I had a look at supporting UTF-8 in source files, and came up with the
> attached approach. getCharAndSize maps UTF-8 characters down to a char with
> the high bit set, representing the class of the character rather than the
> character itself. (I've not done any performance measurements yet, and the
> patch is generally far from being ready for review).
>
> Have you considered using a similar approach for lexing UCNs? We already
> land in getCharAndSizeSlow, so it seems like it'd be cheap to deal with them
> there. Also, validating the codepoints early would allow us to recover
> better (for instance, from UCNs encoding whitespace or elements of the basic
> source character set).

That would affect the spelling of the tokens, and I don't think the C
or C++ standard actually allows us to do that.  Evil testcase:

#define CONCAT(a,b) a ## b
#define \U000100010\u00FD 1
#if !CONCAT(\, U000100010\u00FD)
#error "This should never happen"
#endif
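
For anyone reading the testcase closely: \U takes exactly eight hex digits, so
the macro name is spelled \U00010001, then a literal 0, then \u00FD. Below is a
rough sketch of decoding a UCN for validation while leaving the token's
spelling untouched. It is not taken from either patch and is not Clang code;
decodeUCN, DecodedUCN, hexValue and isHexDigit are made-up names for
illustration, and it does not check whether the codepoint is allowed in an
identifier.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper type (not part of Clang): the result of decoding one UCN.
struct DecodedUCN {
  std::uint32_t codepoint; // value named by the hex digits
  std::size_t length;      // number of source characters the UCN occupies
};

// True if c is an ASCII hexadecimal digit.
static bool isHexDigit(char c) {
  return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') ||
         (c >= 'A' && c <= 'F');
}

// Value of one hexadecimal digit; the caller must pass a valid digit.
static std::uint32_t hexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  assert(c >= 'A' && c <= 'F' && "not a hex digit");
  return c - 'A' + 10;
}

// Try to decode a UCN (\uXXXX or \UXXXXXXXX) at the start of text.
// The input is only read, never rewritten, so the spelling is preserved.
std::optional<DecodedUCN> decodeUCN(std::string_view text) {
  if (text.size() < 2 || text[0] != '\\') return std::nullopt;

  std::size_t digits;
  if (text[1] == 'u') digits = 4;
  else if (text[1] == 'U') digits = 8;
  else return std::nullopt;

  if (text.size() < 2 + digits) return std::nullopt;

  std::uint32_t value = 0;
  for (std::size_t i = 0; i < digits; ++i) {
    char c = text[2 + i];
    if (!isHexDigit(c)) return std::nullopt;
    value = (value << 4) | hexValue(c);
  }
  return DecodedUCN{value, 2 + digits};
}
```

Called on the spelling of the macro name above, decodeUCN consumes the first
10 characters as codepoint 0x10001; the next character is the plain 0, and a
second call starting at the following backslash consumes 6 characters as 0xFD.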

-Eli