I had a look at supporting UTF-8 in source files and came up with the attached approach: getCharAndSize maps each UTF-8 character down to a char with the high bit set, which represents the class of the character rather than the character itself. (I've not done any performance measurements yet, and the patch is generally far from ready for review.)<div>

<br></div><div>Have you considered using a similar approach for lexing UCNs? We already land in getCharAndSizeSlow, so it seems like it'd be cheap to deal with them there. Also, validating the codepoints early would allow us to recover better (for instance, from UCNs encoding whitespace or elements of the basic source character set).<br>
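To make the early-validation idea concrete, here is a rough sketch, not code from either patch. The names decodeUCN, classifyUCN and UCNKind are hypothetical, and the range check assumes the C99 6.4.3 constraint (nothing below 00A0 except 0024, 0040 and 0060, and nothing in D800 through DFFF); the C++ rules differ slightly.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of clang: decodes a UCN spelled \uXXXX or
// \UXXXXXXXX at the start of Text. Returns false if the spelling is malformed.
bool decodeUCN(const std::string &Text, std::uint32_t &CodePoint) {
  if (Text.size() < 2 || Text[0] != '\\')
    return false;
  std::size_t NumDigits;
  if (Text[1] == 'u')
    NumDigits = 4;
  else if (Text[1] == 'U')
    NumDigits = 8;
  else
    return false;
  if (Text.size() < 2 + NumDigits)
    return false;

  std::uint32_t Value = 0;
  for (std::size_t I = 0; I != NumDigits; ++I) {
    char C = Text[2 + I];
    std::uint32_t Digit;
    if (C >= '0' && C <= '9')
      Digit = C - '0';
    else if (C >= 'a' && C <= 'f')
      Digit = C - 'a' + 10;
    else if (C >= 'A' && C <= 'F')
      Digit = C - 'A' + 10;
    else
      return false; // Not a hex digit, so this is not a UCN.
    Value = (Value << 4) | Digit;
  }
  CodePoint = Value;
  return true;
}

// Hypothetical classification, so the caller can pick a recovery strategy.
enum class UCNKind { Valid, BasicOrControl, Surrogate, OutOfRange };

// Assumes the C99 6.4.3 constraint on which code points a UCN may name.
UCNKind classifyUCN(std::uint32_t CodePoint) {
  if (CodePoint >= 0xD800 && CodePoint <= 0xDFFF)
    return UCNKind::Surrogate;
  if (CodePoint > 0x10FFFF)
    return UCNKind::OutOfRange;
  // Below 0xA0 only '$' (0x24), '@' (0x40) and grave accent (0x60) are allowed.
  if (CodePoint < 0xA0 && CodePoint != 0x24 && CodePoint != 0x40 &&
      CodePoint != 0x60)
    return UCNKind::BasicOrControl;
  return UCNKind::Valid;
}
```

With something like this in the slow path, a UCN that names a space (\u0020) would come back as BasicOrControl, and the lexer could diagnose it and then treat it as the character it encodes instead of producing a stray token.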

<br><div class="gmail_quote">On Fri, Nov 16, 2012 at 9:33 PM, Richard Smith <span dir="ltr"><<a href="mailto:richard@metafoo.co.uk" target="_blank">richard@metafoo.co.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="HOEnZb"><div class="h5">On Fri, Nov 16, 2012 at 6:53 PM, Eli Friedman <<a href="mailto:eli.friedman@gmail.com">eli.friedman@gmail.com</a>> wrote:<br>

> On Thu, Nov 15, 2012 at 8:30 PM, Richard Smith <<a href="mailto:richard@metafoo.co.uk">richard@metafoo.co.uk</a>> wrote:<br>

>> On Thu, Nov 15, 2012 at 7:17 PM, Eli Friedman <<a href="mailto:eli.friedman@gmail.com">eli.friedman@gmail.com</a>> wrote:<br>

>>> Patch attached.  Adds support for universal character names in identifiers, e.g.:<br>

>>><br>

>>> char * \u00FC = "u-umlaut";<br>

>>><br>

>>> Not that it's particularly useful, but it's a longstanding hole in our<br>

>>> C99 support.<br>

>>><br>

>>> The general outline of the approach is that the spelling of the<br>

>>> identifier token contains the UCN, but the IdentifierInfo for the<br>

>>> identifier token contains pure UTF-8.  I think this is reasonable<br>

>>> given the C phases of translation, and consistent with the way we<br>

>>> handle UCNs in other contexts.<br>

>><br>

>> This seems like a good approach to me.<br>

>><br>

>>> I'm intentionally leaving out most of the support for universal<br>

>>> character names in user-defined literals, to try and reduce the size<br>

>>> of the patch.<br>

>><br>

>> Index: include/clang/Lex/Lexer.h<br>

>> ===================================================================<br>

>> --- include/clang/Lex/Lexer.h   (revision 168014)<br>

>> +++ include/clang/Lex/Lexer.h   (working copy)<br>

>> @@ -573,6 +573,10 @@<br>

>>    void cutOffLexing() { BufferPtr = BufferEnd; }<br>

>><br>

>>    bool isHexaLiteral(const char *Start, const LangOptions &LangOpts);<br>

>> +<br>

>> +  bool isUCNAfterSlash(const char *CurPtr, unsigned Size, unsigned SizeTmp[5]);<br>

>> +  void ConsumeUCNAfterSlash(const char *&CurPtr, unsigned SizeTmp[5],<br>

>> +                            Token &Result);<br>

>><br>

>> These [5]s should be [9]s. Also, how about wrapping the unsigned[9] in<br>

>> a struct so it doesn't have to be repeated in so many places, or at<br>

>> least passing it by reference so we'll get a compile error if the<br>

>> caller's array is the wrong size?<br>

>><br>

>> Index: include/clang/Lex/Token.h<br>

>> ===================================================================<br>

>> --- include/clang/Lex/Token.h   (revision 168014)<br>

>> +++ include/clang/Lex/Token.h   (working copy)<br>

>> @@ -74,9 +74,10 @@<br>

>>      StartOfLine   = 0x01,  // At start of line or only after whitespace.<br>

>>      LeadingSpace  = 0x02,  // Whitespace exists before this token.<br>

>>      DisableExpand = 0x04,  // This identifier may never be macro expanded.<br>

>> -    NeedsCleaning = 0x08,   // Contained an escaped newline or trigraph.<br>

>> +    NeedsCleaning = 0x08,  // Contained an escaped newline or trigraph.<br>

>>      LeadingEmptyMacro = 0x10, // Empty macro exists before this token.<br>

>> -    HasUDSuffix = 0x20     // This string or character literal has a ud-suffix.<br>

>> +    HasUDSuffix = 0x20,    // This string or character literal has a ud-suffix.<br>

>> +    HasUCN = 0x40          // This identifier contains a UCN<br>

>><br>

>> Missing full stop. ;-)<br>

>><br>

>> The set of permitted characters appears to be correct only for C11 and<br>

>> C++11: it seems that C99 (+TR1,2,3) and C++98 (+TC1) permitted smaller<br>

>> sets (and not even the same smaller set!). C++98 used the list from<br>

>> ISO/IEC PDTR 10176 and C99 used ISO/IEC TR 10176:1998 (surprisingly,<br>

>> C++03 didn't move from the PDTR to the 1998 TR). If you're doing this<br>

>> to have a complete C99 (and C++98, modulo 'export') implementation,<br>

>> then maybe you care about this... :)<br>

><br>

> I'll have to check whether I care about this.<br>

><br>

>> +          if (UCNIdentifierBuffer.empty() ? !isAllowedInitiallyIDChar(UcnVal) :<br>

>> +                                            !isAllowedIDChar(UcnVal)) {<br>

>> +            StringRef CurCharacter = CleanedStr.substr(i, NumChars);<br>

>> +            Diag(Identifier, diag::err_ucn_invalid_in_id) << CurCharacter;<br>

>><br>

>> It'd be nice for the diagnostic to be different for UCNs which can't<br>

>> appear at all versus UCNs which can't appear at the start of an<br>

>> identifier.<br>

>><br>

>>> I know this patch is a little lacking in terms of tests, but I'm not<br>

>>> really sure what tests we need; suggestions welcome.<br>

>><br>

>> UCNs which resolve to characters in the basic source character set.<br>

>> Identifier emission in diagnostics.<br>

>> Stringization of tokens containing UCNs. (If I'm reading this right,<br>

>> we have a pre-existing bug here, in that characters outside the basic<br>

>> source character set must be converted into UCNs in the resulting<br>

>> string literal.)<br>

><br>

> You mean like the following?<br>

><br>

> #define X "\u00FC"<br>

> #define X "ü"<br>

><br>

> This is valid in C++, but not C. :(<br>

<br>

</div></div>That, and also this:<br>

<br>

#define STR(x) #x<br>

const char *p = STR("ü");<br>

<br>

... which appears to be required to produce something like<br>

"\"\\u00FC\"" in C++, but to produce "\"ü\"" in C. This:<br>

<br>

const char *q = STR("\u00FC");<br>

<br>

... produces "\"\\u00FC\"" in C++, and produces an<br>

implementation-defined choice of "\"\\u00FC\"" and "\"\u00FC\"" in C.<br>

<div class="im"><br>

>> ud-suffixes for integer and floating-point.<br>

><br>

> Not working, but I'll add tests anyway.<br>

><br>

>> Do you want to ExtWarn on this in C89?<br>

><br>

> Err, actually, I think we need to disable this completely for C89;<br>

> IIRC, it's possible to write a valid C89 program which contains<br>

> something which looks like a UCN.<br>

<br>

</div>Hmm, OK, although the \ couldn't be converted to a token, so it'd need<br>

to be removed during preprocessing or used as an operand to # or<br>

similar.<br>

</blockquote></div><br></div>
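On the quoted suggestion about the SizeTmp parameters, a small sketch of the two options, under the assumption that the array really needs nine entries. UCNSizes, sumByRef and sumWrapped are hypothetical names for illustration, not identifiers from the patch.

```cpp
#include <cstddef>

// A parameter written as "unsigned SizeTmp[9]" is adjusted to "unsigned *",
// so the bound is ignored and a caller's unsigned[5] is accepted silently.
unsigned sumDecayed(unsigned SizeTmp[9], std::size_t Count) {
  unsigned Total = 0;
  for (std::size_t I = 0; I != Count; ++I)
    Total += SizeTmp[I];
  return Total;
}

// Option 1: take the array by reference. Only an unsigned[9] can bind here,
// so passing an unsigned[5] becomes a compile error.
unsigned sumByRef(const unsigned (&SizeTmp)[9]) {
  unsigned Total = 0;
  for (unsigned Size : SizeTmp)
    Total += Size;
  return Total;
}

// Option 2: wrap the array in a struct (hypothetical name), so the bound is
// written once and the type can be passed around by reference.
struct UCNSizes {
  unsigned Sizes[9];
};

unsigned sumWrapped(const UCNSizes &Tmp) {
  return sumByRef(Tmp.Sizes);
}
```

Given "unsigned Five[5];", the call sumDecayed(Five, 9) compiles and reads past the end of the array, whereas sumByRef(Five) is rejected at compile time. The struct has the extra benefit that the declarations in Lexer.h would not need to repeat the bound.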