[cfe-commits] [PATCH] Support for universal character names in identifiers

James Dennett jdennett at google.com
Tue Dec 18 23:35:55 PST 2012


On Tue, Nov 27, 2012 at 2:37 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
> On Tue, Nov 27, 2012 at 2:25 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>> I had a look at supporting UTF-8 in source files, and came up with the
>> attached approach. getCharAndSize maps UTF-8 characters down to a char with
>> the high bit set, representing the class of the character rather than the
>> character itself. (I've not done any performance measurements yet, and the
>> patch is generally far from being ready for review).
>>
>> Have you considered using a similar approach for lexing UCNs? We already
>> land in getCharAndSizeSlow, so it seems like it'd be cheap to deal with them
>> there. Also, validating the codepoints early would allow us to recover
>> better (for instance, from UCNs encoding whitespace or elements of the basic
>> source character set).
>
> That would affect the spelling of the tokens, and I don't think the C
> or C++ standard actually allows us to do that.  Evil testcase:
>
> #define CONCAT(a,b) a ## b
> #define \U000100010\u00FD 1
> #if !CONCAT(\, U000100010\u00FD)
> #error "This should never happen"
> #endif

For this particular case it doesn't matter: "If a character sequence
that matches the syntax of a universal-character-name is produced by
token concatenation (16.3.3), the behavior is undefined."  (2.2 Phases
of Translation [lex.phases], paragraph 1, list item 4.)

For what it's worth, the standard also says "An implementation may use
any internal encoding, so long as an actual extended character
encountered in the source file, and the same extended character
expressed in the source file as a universal-character-name (i.e.,
using the \uXXXX notation), are handled equivalently except where this
replacement is reverted in a raw string literal."

-- James
