[cfe-commits] [PATCH] Support for universal character names in identifiers

Eli Friedman eli.friedman at gmail.com
Tue Nov 27 15:33:30 PST 2012


On Tue, Nov 27, 2012 at 3:01 PM, Richard Smith <richard at metafoo.co.uk> wrote:
> On Tue, Nov 27, 2012 at 2:37 PM, Eli Friedman <eli.friedman at gmail.com>
> wrote:
>>
>> On Tue, Nov 27, 2012 at 2:25 PM, Richard Smith <richard at metafoo.co.uk>
>> wrote:
>> > I had a look at supporting UTF-8 in source files, and came up with the
>> > attached approach. getCharAndSize maps UTF-8 characters down to a char
>> > with
>> > the high bit set, representing the class of the character rather than
>> > the
>> > character itself. (I've not done any performance measurements yet, and
>> > the
>> > patch is generally far from being ready for review).
>> >
>> > Have you considered using a similar approach for lexing UCNs? We already
>> > land in getCharAndSizeSlow, so it seems like it'd be cheap to deal with
>> > them
>> > there. Also, validating the codepoints early would allow us to recover
>> > better (for instance, from UCNs encoding whitespace or elements of the
>> > basic
>> > source character set).
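
[A rough sketch of the class-marker idea described above; the names, marker values, and toy classification here are illustrative, not taken from the actual patch. The key point is that the returned char encodes the character's class, while the token's spelling is still taken from the underlying source buffer:]

```cpp
#include <cassert>

// Hypothetical sketch: map a UTF-8 encoded character to a single char
// with the high bit set, representing the *class* of the character
// rather than the character itself. This marker never appears in a
// token's spelling; spellings come from the source buffer.
enum : unsigned char {
  kIdentifierPart = 0x80,  // codepoint valid in an identifier
  kWhitespace     = 0x81,  // codepoint is non-ASCII whitespace
  kInvalid        = 0x82,  // malformed or disallowed sequence
};

// Returns a class marker for the UTF-8 sequence starting at P, and the
// number of bytes consumed via Size. Classification is deliberately toy.
char getCharClassMarker(const char *P, unsigned &Size) {
  unsigned char Lead = (unsigned char)P[0];
  if (Lead < 0x80) { Size = 1; return P[0]; }  // plain ASCII: unchanged
  // Determine sequence length from the lead byte.
  if ((Lead & 0xE0) == 0xC0)      Size = 2;
  else if ((Lead & 0xF0) == 0xE0) Size = 3;
  else if ((Lead & 0xF8) == 0xF0) Size = 4;
  else { Size = 1; return (char)kInvalid; }
  // Decode the codepoint (continuation bytes not validated here).
  unsigned CP = Lead & (0x3F >> (Size - 1));
  for (unsigned I = 1; I != Size; ++I)
    CP = (CP << 6) | ((unsigned char)P[I] & 0x3F);
  if (CP == 0x00A0 || CP == 0x2028 || CP == 0x2029)
    return (char)kWhitespace;      // a few non-ASCII space characters
  return (char)kIdentifierPart;    // everything else: identifier-ish
}
```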
>>
>> That would affect the spelling of the tokens, and I don't think the C
>> or C++ standard actually allows us to do that.
>
>
> If I understand you correctly, you're concerned that we would get the wrong
> string in the token's spelling? When we build a token, we take the
> characters from the underlying source buffer, not the value returned by
> getCharAndSize.

Oh, I see... so the idea is to hack up getCharAndSize instead of
calling isUCNAfterSlash/ConsumeUCNAfterSlash where we expect a UCN,
and have it return a marker which essentially means "saw a UCN".
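
[A minimal sketch of that marker approach; the function name, marker value, and fallback behavior are assumptions for illustration, not the actual Clang code. The slow path consumes a whole \uXXXX or \UXXXXXXXX escape and returns a single "saw a UCN" byte, while the token's spelling still comes from the source buffer:]

```cpp
#include <cassert>
#include <cctype>

const char kUCNMarker = (char)0x83;  // assumed "saw a UCN" marker value

// Hypothetical slow-path helper: if P starts a syntactically valid UCN,
// consume the entire escape and return the marker; otherwise lex the
// single character as-is.
char getCharAndSizeSlowUCN(const char *P, unsigned &Size) {
  if (P[0] != '\\' || (P[1] != 'u' && P[1] != 'U')) {
    Size = 1;
    return P[0];
  }
  unsigned NumHex = (P[1] == 'u') ? 4 : 8;
  for (unsigned I = 0; I != NumHex; ++I)
    if (!isxdigit((unsigned char)P[2 + I])) {
      Size = 1;              // not a valid UCN; lex the '\' alone
      return P[0];
    }
  Size = 2 + NumHex;         // '\', 'u' or 'U', and the hex digits
  return kUCNMarker;
}
```

[Under this scheme the lexer sees one marker character per UCN and decides later, based on the kind of token being formed, whether the codepoint is acceptable.]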

Seems like a workable approach; I don't think it actually helps any
with error recovery (I'm pretty sure we can't diagnose anything
without knowing what kind of token we're forming), but I think it will
make the patch simpler.  I'll try to hack up a new version of my
patch.

-Eli
