[cfe-commits] [PATCH] Support for universal character names in identifiers

Tue Dec 18 20:40:09 PST 2012

On Tue, Nov 27, 2012 at 5:04 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
> On Tue, Nov 27, 2012 at 3:33 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
>> On Tue, Nov 27, 2012 at 3:01 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>>> On Tue, Nov 27, 2012 at 2:37 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
>>>>
>>>> On Tue, Nov 27, 2012 at 2:25 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>>>> > I had a look at supporting UTF-8 in source files, and came up with the
>>>> > attached approach. getCharAndSize maps UTF-8 characters down to a char
>>>> > with the high bit set, representing the class of the character rather
>>>> > than the character itself. (I've not done any performance measurements
>>>> > yet, and the patch is generally far from being ready for review.)
>>>> >
>>>> > Have you considered using a similar approach for lexing UCNs? We already
>>>> > land in getCharAndSizeSlow, so it seems like it'd be cheap to deal with
>>>> > them there. Also, validating the codepoints early would allow us to
>>>> > recover better (for instance, from UCNs encoding whitespace or elements
>>>> > of the basic source character set).
>>>>
>>>> That would affect the spelling of the tokens, and I don't think the C
>>>> or C++ standard actually allows us to do that.
>>>
>>>
>>> If I understand you correctly, you're concerned that we would get the wrong
>>> string in the token's spelling? When we build a token, we take the
>>> characters from the underlying source buffer, not the value returned by
>>> getCharAndSize.
>>
>> Oh, I see... so the idea is to hack up getCharAndSize: instead of
>> calling isUCNAfterSlash/ConsumeUCNAfterSlash where we expect a UCN,
>> we use a marker which essentially means "saw a UCN".
>>
>> Seems like a workable approach; I don't think it actually helps any
>> with error recovery (I'm pretty sure we can't diagnose anything
>> without knowing what kind of token we're forming), but I think it will
>> make the patch simpler.  I'll try to hack up a new version of my
>> patch.
>
> Attached.

And, I've discovered a rather large weakness of this approach:
actually writing a correct implementation of getCharAndSizeSlow which
returns a special value for UCNs is painful at best.  I might have to
abandon this route.
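
For reference, the idea discussed above can be sketched roughly as follows.
This is a standalone illustration, not the attached patch and not Clang's
actual lexer: the names getCharAndSizeSketch, ucnLength, kUCNMarker and
lexIdentifierSpelling are hypothetical, and the sketch deliberately ignores
trigraphs and escaped newlines, both of which the real getCharAndSizeSlow
has to handle. It only shows that the lexer can see a "saw a UCN" marker
while the token spelling still comes from the underlying source buffer.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical marker meaning "saw a UCN". It is kept outside the 0..255
// byte range here so it cannot collide with a raw source byte.
constexpr int kUCNMarker = 256;

struct CharAndSize {
  int Ch;            // character (or marker) the lexer acts on
  std::size_t Size;  // bytes consumed from the source buffer
};

// Returns the byte length of a well-formed \uXXXX or \UXXXXXXXX starting at
// Ptr, or 0 if there is none. No trigraph or escaped-newline handling.
std::size_t ucnLength(const char *Ptr, const char *End) {
  if (End - Ptr < 2 || Ptr[0] != '\\') return 0;
  std::size_t Digits = Ptr[1] == 'u' ? 4 : Ptr[1] == 'U' ? 8 : 0;
  if (Digits == 0 || static_cast<std::size_t>(End - Ptr) < 2 + Digits)
    return 0;
  for (std::size_t I = 0; I < Digits; ++I)
    if (!std::isxdigit(static_cast<unsigned char>(Ptr[2 + I]))) return 0;
  return 2 + Digits;
}

// Maps a whole UCN to the marker; every other byte is returned unchanged.
CharAndSize getCharAndSizeSketch(const char *Ptr, const char *End) {
  if (std::size_t Len = ucnLength(Ptr, End)) return {kUCNMarker, Len};
  return {static_cast<unsigned char>(*Ptr), 1};
}

// Scans an identifier from the start of Src. The spelling is copied from the
// source buffer, so the UCN text itself is preserved in the result.
std::string lexIdentifierSpelling(const std::string &Src) {
  const char *Ptr = Src.data(), *End = Ptr + Src.size();
  const char *Start = Ptr;
  while (Ptr < End) {
    CharAndSize C = getCharAndSizeSketch(Ptr, End);
    bool IsIdent = C.Ch == kUCNMarker || C.Ch == '_' || std::isalnum(C.Ch);
    if (!IsIdent) break;
    Ptr += C.Size;
  }
  return std::string(Start, Ptr);
}
```

With this sketch, lexIdentifierSpelling("caf\\u00e9 = 1") yields the
spelling caf\u00e9 unchanged, while a malformed escape such as \u12 simply
ends the identifier at the backslash.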

-Eli