[cfe-commits] [PATCH] Support for universal character names in identifiers

Richard Smith richard at metafoo.co.uk
Tue Nov 27 15:01:05 PST 2012


On Tue, Nov 27, 2012 at 2:37 PM, Eli Friedman <eli.friedman at gmail.com> wrote:

> On Tue, Nov 27, 2012 at 2:25 PM, Richard Smith <richard at metafoo.co.uk>
> wrote:
> > I had a look at supporting UTF-8 in source files, and came up with the
> > attached approach. getCharAndSize maps UTF-8 characters down to a char
> > with the high bit set, representing the class of the character rather
> > than the character itself. (I've not done any performance measurements
> > yet, and the patch is generally far from being ready for review.)
> >
> > Have you considered using a similar approach for lexing UCNs? We already
> > land in getCharAndSizeSlow, so it seems like it'd be cheap to deal with
> > them there. Also, validating the codepoints early would allow us to
> > recover better (for instance, from UCNs encoding whitespace or elements
> > of the basic source character set).
>
> That would affect the spelling of the tokens, and I don't think the C
> or C++ standard actually allows us to do that.


If I understand you correctly, you're concerned that we would get the wrong
string in the token's spelling? When we build a token, we take the
characters from the underlying source buffer, not the value returned by
getCharAndSize.
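To make the distinction concrete: the idea is that getCharAndSize can hand the
fast-path lexer a one-byte *class* for a multi-byte character, while the token's
spelling still comes from the raw source buffer. A minimal sketch of that
mapping follows; the class names, values, and the tiny code-point tables here
are illustrative assumptions, not the encoding from the actual patch:

```cpp
// Hypothetical character classes a lexer might care about. The real patch's
// encoding is not shown in this thread; these values are placeholders that
// only demonstrate the "char with the high bit set" trick.
enum CharClass : unsigned char {
  CC_IdentifierPart = 0x80, // code point valid in an identifier
  CC_Whitespace     = 0x81, // Unicode whitespace
  CC_Invalid        = 0x82, // not permitted in source
};

// Sketch of a getCharAndSize-style helper: decode one UTF-8 sequence starting
// at P, report its byte length in Size, and return either the ASCII character
// itself or a single high-bit char naming the code point's class.
char classifyUTF8(const char *P, unsigned &Size) {
  unsigned char B = static_cast<unsigned char>(P[0]);
  if (B < 0x80) { // plain ASCII: return the character unchanged
    Size = 1;
    return static_cast<char>(B);
  }
  // Lead byte determines the sequence length; accumulate the code point.
  unsigned Len = (B >= 0xF0) ? 4 : (B >= 0xE0) ? 3 : 2;
  unsigned CP = B & (0x3F >> (Len - 1));
  for (unsigned I = 1; I != Len; ++I)
    CP = (CP << 6) | (static_cast<unsigned char>(P[I]) & 0x3F);
  Size = Len;

  if (CP == 0x00A0 || CP == 0x2028 || CP == 0x2029)
    return static_cast<char>(CC_Whitespace);
  // Real code would consult the full C11/C++11 identifier tables here;
  // this stub only knows the two code points from the thread's testcase.
  if (CP == 0x00FD || CP == 0x10001)
    return static_cast<char>(CC_IdentifierPart);
  return static_cast<char>(CC_Invalid);
}
```

The point of the trick is that downstream switch-on-char lexer code keeps
working unmodified: a class byte with the high bit set can never collide with
ASCII, and the spelling is recovered from the buffer, not from this value.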


> Evil testcase:
>
> #define CONCAT(a,b) a ## b
> #define \U000100010\u00FD 1
> #if !CONCAT(\, U000100010\u00FD)
> #error "This should never happen"
> #endif
>
> -Eli
>
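A note on why that testcase bites (my reading, not spelled out in the thread):
the universal-character-name only comes into existence when `##` pastes the
lone `\` onto `U000100010\u00FD`, a formation the C and C++ standards leave
undefined, so a lexer that eagerly rewrote UCN spellings could diverge here.
A tame, well-defined analogue of the same pasting mechanism:

```cpp
// '##' builds an identifier that did not exist as a single token before
// macro expansion. The evil version in the mail instead pastes '\' onto the
// rest of a UCN, which (unlike this example) is undefined behavior.
#define CONCAT(a, b) a##b

int CONCAT(fo, o) = 1; // expands to: int foo = 1;
```

Here the pasted identifier is an ordinary one, so the program is valid; the
testcase deliberately picks a paste whose result is only a token if the
compiler mishandles UCN spellings.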
