[cfe-commits] [PATCH] Support for universal character names in identifiers

Fri Nov 16 21:33:49 PST 2012

On Fri, Nov 16, 2012 at 6:53 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
> On Thu, Nov 15, 2012 at 8:30 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>> On Thu, Nov 15, 2012 at 7:17 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
>>> Patch attached.  Adds support universal character names in identifiers, e.g.:
>>>
>>> char * \u00FC = "u-umlaut";
>>>
>>> Not that it's particularly useful, but it's a longstanding hole in our
>>> C99 support.
>>>
>>> The general outline of the approach is that the spelling of the
>>> identifier token contains the UCN, but the IdentifierInfo for the
>>> identifier token contains pure UTF-8.  I think this is reasonable
>>> given the C phases of translation, and consistent with the way we
>>> handle UCNs in other contexts.
>>
>> This seems like a good approach to me.
>>
>>> I'm intentionally leaving out most of the support for universal
>>> character names in user-defined literals, to try and reduce the size
>>> of the patch.
>>
>> Index: include/clang/Lex/Lexer.h
>> ===================================================================
>> --- include/clang/Lex/Lexer.h   (revision 168014)
>> +++ include/clang/Lex/Lexer.h   (working copy)
>> @@ -573,6 +573,10 @@
>>    void cutOffLexing() { BufferPtr = BufferEnd; }
>>
>>    bool isHexaLiteral(const char *Start, const LangOptions &LangOpts);
>> +
>> +  bool isUCNAfterSlash(const char *CurPtr, unsigned Size, unsigned SizeTmp[5]);
>> +  void ConsumeUCNAfterSlash(const char *&CurPtr, unsigned SizeTmp[5],
>> +                            Token &Result);
>>
>> These [5]s should be [9]s. Also, how about wrapping the unsigned[9] in
>> a struct so it doesn't have to be repeated in so many places, or at
>> least passing it by reference so we'll get a compile error if the
>> caller's array is the wrong size?
>>
>> Index: include/clang/Lex/Token.h
>> ===================================================================
>> --- include/clang/Lex/Token.h   (revision 168014)
>> +++ include/clang/Lex/Token.h   (working copy)
>> @@ -74,9 +74,10 @@
>>      StartOfLine   = 0x01,  // At start of line or only after whitespace.
>>      LeadingSpace  = 0x02,  // Whitespace exists before this token.
>>      DisableExpand = 0x04,  // This identifier may never be macro expanded.
>> -    NeedsCleaning = 0x08,   // Contained an escaped newline or trigraph.
>> +    NeedsCleaning = 0x08,  // Contained an escaped newline or trigraph.
>>      LeadingEmptyMacro = 0x10, // Empty macro exists before this token.
>> -    HasUDSuffix = 0x20     // This string or character literal has a ud-suffix.
>> +    HasUDSuffix = 0x20,    // This string or character literal has a ud-suffix.
>> +    HasUCN = 0x40          // This identifier contains a UCN
>>
>> Missing full stop. ;-)
>>
>> The set of permitted characters appears to be correct only for C11 and
>> C++11: it seems that C99 (+TR1,2,3) and C++98 (+TC1) permitted smaller
>> sets (and not even the same smaller set!). C++98 used the list from
>> ISO/IEC PDTR 10176 and C99 used ISO/IEC TR 10176:1998 (surprisingly,
>> C++03 didn't move from the PDTR to the 1998 TR). If you're doing this
>> to have a complete C99 (and C++98, modulo 'export') implementation,
>> then maybe you care about this... :)
>
> I'll have to check whether I care about this.
>
>> +          if (UCNIdentifierBuffer.empty() ? !isAllowedInitiallyIDChar(UcnVal) :
>> +                                            !isAllowedIDChar(UcnVal)) {
>> +            StringRef CurCharacter = CleanedStr.substr(i, NumChars);
>> +            Diag(Identifier, diag::err_ucn_invalid_in_id) << CurCharacter;
>>
>> It'd be nice for the diagnostic to be different for UCNs which can't
>> appear at all versus UCNs which can't appear at the start of an
>> identifier.
>>
>>> I know this patch is a little lacking in terms of tests, but I'm not
>>> really sure what tests we need; suggestions welcome.
>>
>> UCNs which resolve to characters in the basic source character set.
>> Identifier emission in diagnostics.
>> Stringization of tokens containing UCNs. (If I'm reading this right,
>> we have a pre-existing bug here, in that characters outside the basic
>> source character set must be converted into UCNs in the resulting
>> string literal.)
>
> You mean like the following?
>
> #define X "\u00FC"
> #define X "ü"
>
> This is valid in C++, but not C. :(

That, and also this:

#define STR(x) #x
const char *p = STR("ü");

... which appears to be required to produce something like
"\"\\u00FC\"" in C++, but to produce "\"ü\"" in C. This:

const char *q = STR("\u00FC");

... produces "\"\\u00FC\"" in C++, and produces an
implementation-defined choice of "\"\\u00FC\"" and "\"\u00FC\"" in C.

>> ud-suffixes for integer and floating-point.
>
> Not working, but I'll add tests anyway.
>
>> Do you want to ExtWarn on this in C89?
>
> Err, actually, I think we need to disable this completely for C89;
> IIRC, it's possible to write a valid C89 program which contains
> something which looks like a UCN.

Hmm, OK, although the \ couldn't be converted to a token, so it'd need
to be removed during preprocessing or used as an operand to # or
similar.