[cfe-commits] [PATCH] Support for universal character names in identifiers

Thu Nov 15 20:30:17 PST 2012

On Thu, Nov 15, 2012 at 7:17 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
> Patch attached.  Adds support universal character names in identifiers, e.g.:
>
> char * \u00FC = "u-umlaut";
>
> Not that it's particularly useful, but it's a longstanding hole in our
> C99 support.
>
> The general outline of the approach is that the spelling of the
> identifier token contains the UCN, but the IdentifierInfo for the
> identifier token contains pure UTF-8.  I think this is reasonable
> given the C phases of translation, and consistent with the way we
> handle UCNs in other contexts.

This seems like a good approach to me.

> I'm intentionally leaving out most of the support for universal
> character names in user-defined literals, to try and reduce the size
> of the patch.

Index: include/clang/Lex/Lexer.h
===================================================================

--- include/clang/Lex/Lexer.h	(revision 168014)
+++ include/clang/Lex/Lexer.h	(working copy)
@@ -573,6 +573,10 @@
   void cutOffLexing() { BufferPtr = BufferEnd; }

   bool isHexaLiteral(const char *Start, const LangOptions &LangOpts);
+
+  bool isUCNAfterSlash(const char *CurPtr, unsigned Size, unsigned SizeTmp[5]);
+  void ConsumeUCNAfterSlash(const char *&CurPtr, unsigned SizeTmp[5],
+                            Token &Result);

These [5]s should be [9]s. Also, how about wrapping the unsigned[9] in
a struct so it doesn't have to be repeated in so many places, or at
least passing it by reference so we'll get a compile error if the
caller's array is the wrong size?

Index: include/clang/Lex/Token.h
===================================================================
--- include/clang/Lex/Token.h	(revision 168014)
+++ include/clang/Lex/Token.h	(working copy)
@@ -74,9 +74,10 @@
     StartOfLine   = 0x01,  // At start of line or only after whitespace.
     LeadingSpace  = 0x02,  // Whitespace exists before this token.
     DisableExpand = 0x04,  // This identifier may never be macro expanded.
-    NeedsCleaning = 0x08,   // Contained an escaped newline or trigraph.
+    NeedsCleaning = 0x08,  // Contained an escaped newline or trigraph.
     LeadingEmptyMacro = 0x10, // Empty macro exists before this token.
-    HasUDSuffix = 0x20     // This string or character literal has a ud-suffix.
+    HasUDSuffix = 0x20,    // This string or character literal has a ud-suffix.
+    HasUCN = 0x40          // This identifier contains a UCN

Missing full stop. ;-)

The set of permitted characters appears to be correct only for C11 and
C++11: it seems that C99 (+TR1,2,3) and C++98 (+TC1) permitted smaller
sets (and not even the same smaller set!). C++98 used the list from
ISO/IEC PDTR 10176 and C99 used ISO/IEC TR 10176:1998 (surprisingly,
C++03 didn't move from the PDTR to the 1998 TR). If you're doing this
to have a complete C99 (and C++98, modulo 'export') implementation,
then maybe you care about this... :)

+          if (UCNIdentifierBuffer.empty() ? !isAllowedInitiallyIDChar(UcnVal) :
+                                            !isAllowedIDChar(UcnVal)) {
+            StringRef CurCharacter = CleanedStr.substr(i, NumChars);
+            Diag(Identifier, diag::err_ucn_invalid_in_id) << CurCharacter;

It'd be nice for the diagnostic to be different for UCNs which can't
appear at all versus UCNs which can't appear at the start of an
identifier.

> I know this patch is a little lacking in terms of tests, but I'm not
> really sure what tests we need; suggestions welcome.

UCNs which resolve to characters in the basic source character set.
Identifier emission in diagnostics.
Stringization of tokens containing UCNs. (If I'm reading this right,
we have a pre-existing bug here, in that characters outside the basic
source character set must be converted into UCNs in the resulting
string literal.)
ud-suffixes for integer and floating-point.

Do you want to ExtWarn on this in C89?