[cfe-commits] [PATCH] Support for universal character names in identifiers
Eli Friedman
eli.friedman at gmail.com
Fri Nov 16 18:53:30 PST 2012
On Thu, Nov 15, 2012 at 8:30 PM, Richard Smith <richard at metafoo.co.uk> wrote:
> On Thu, Nov 15, 2012 at 7:17 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
>> Patch attached. Adds support universal character names in identifiers, e.g.:
>>
>> char * \u00FC = "u-umlaut";
>>
>> Not that it's particularly useful, but it's a longstanding hole in our
>> C99 support.
>>
>> The general outline of the approach is that the spelling of the
>> identifier token contains the UCN, but the IdentifierInfo for the
>> identifier token contains pure UTF-8. I think this is reasonable
>> given the C phases of translation, and consistent with the way we
>> handle UCNs in other contexts.
>
> This seems like a good approach to me.
>
>> I'm intentionally leaving out most of the support for universal
>> character names in user-defined literals, to try and reduce the size
>> of the patch.
>
> Index: include/clang/Lex/Lexer.h
> ===================================================================
> --- include/clang/Lex/Lexer.h (revision 168014)
> +++ include/clang/Lex/Lexer.h (working copy)
> @@ -573,6 +573,10 @@
> void cutOffLexing() { BufferPtr = BufferEnd; }
>
> bool isHexaLiteral(const char *Start, const LangOptions &LangOpts);
> +
> + bool isUCNAfterSlash(const char *CurPtr, unsigned Size, unsigned SizeTmp[5]);
> + void ConsumeUCNAfterSlash(const char *&CurPtr, unsigned SizeTmp[5],
> + Token &Result);
>
> These [5]s should be [9]s. Also, how about wrapping the unsigned[9] in
> a struct so it doesn't have to be repeated in so many places, or at
> least passing it by reference so we'll get a compile error if the
> caller's array is the wrong size?
>
> Index: include/clang/Lex/Token.h
> ===================================================================
> --- include/clang/Lex/Token.h (revision 168014)
> +++ include/clang/Lex/Token.h (working copy)
> @@ -74,9 +74,10 @@
> StartOfLine = 0x01, // At start of line or only after whitespace.
> LeadingSpace = 0x02, // Whitespace exists before this token.
> DisableExpand = 0x04, // This identifier may never be macro expanded.
> - NeedsCleaning = 0x08, // Contained an escaped newline or trigraph.
> + NeedsCleaning = 0x08, // Contained an escaped newline or trigraph.
> LeadingEmptyMacro = 0x10, // Empty macro exists before this token.
> - HasUDSuffix = 0x20 // This string or character literal has a ud-suffix.
> + HasUDSuffix = 0x20, // This string or character literal has a ud-suffix.
> + HasUCN = 0x40 // This identifier contains a UCN
>
> Missing full stop. ;-)
>
> The set of permitted characters appears to be correct only for C11 and
> C++11: it seems that C99 (+TR1,2,3) and C++98 (+TC1) permitted smaller
> sets (and not even the same smaller set!). C++98 used the list from
> ISO/IEC PDTR 10176 and C99 used ISO/IEC TR 10176:1998 (surprisingly,
> C++03 didn't move from the PDTR to the 1998 TR). If you're doing this
> to have a complete C99 (and C++98, modulo 'export') implementation,
> then maybe you care about this... :)
I'll have to check whether I care about this.
> + if (UCNIdentifierBuffer.empty() ? !isAllowedInitiallyIDChar(UcnVal) :
> + !isAllowedIDChar(UcnVal)) {
> + StringRef CurCharacter = CleanedStr.substr(i, NumChars);
> + Diag(Identifier, diag::err_ucn_invalid_in_id) << CurCharacter;
>
> It'd be nice for the diagnostic to be different for UCNs which can't
> appear at all versus UCNs which can't appear at the start of an
> identifier.
>
>> I know this patch is a little lacking in terms of tests, but I'm not
>> really sure what tests we need; suggestions welcome.
>
> UCNs which resolve to characters in the basic source character set.
> Identifier emission in diagnostics.
> Stringization of tokens containing UCNs. (If I'm reading this right,
> we have a pre-existing bug here, in that characters outside the basic
> source character set must be converted into UCNs in the resulting
> string literal.)
You mean like the following?
#define X "\u00FC"
#define X "ΓΌ"
This is valid in C++, but not C. :(
> ud-suffixes for integer and floating-point.
Not working, but I'll add tests anyway.
> Do you want to ExtWarn on this in C89?
Err, actually, I think we need to disable this completely for C89;
IIRC, it's possible to write a valid C89 program which contains
something which looks like a UCN.
-Eli
More information about the cfe-commits
mailing list