[cfe-commits] [PATCH] Support for universal character names in identifiers

Tue Nov 27 17:04:25 PST 2012

On Tue, Nov 27, 2012 at 3:33 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
> On Tue, Nov 27, 2012 at 3:01 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>> On Tue, Nov 27, 2012 at 2:37 PM, Eli Friedman <eli.friedman at gmail.com>
>> wrote:
>>>
>>> On Tue, Nov 27, 2012 at 2:25 PM, Richard Smith <richard at metafoo.co.uk>
>>> wrote:
>>> > I had a look at supporting UTF-8 in source files, and came up with the
>>> > attached approach. getCharAndSize maps UTF-8 characters down to a char
>>> > with
>>> > the high bit set, representing the class of the character rather than
>>> > the
>>> > character itself. (I've not done any performance measurements yet, and
>>> > the
>>> > patch is generally far from being ready for review).
>>> >
>>> > Have you considered using a similar approach for lexing UCNs? We already
>>> > land in getCharAndSizeSlow, so it seems like it'd be cheap to deal with
>>> > them
>>> > there. Also, validating the codepoints early would allow us to recover
>>> > better (for instance, from UCNs encoding whitespace or elements of the
>>> > basic
>>> > source character set).
>>>
>>> That would affect the spelling of the tokens, and I don't think the C
>>> or C++ standard actually allows us to do that.
>>
>>
>> If I understand you correctly, you're concerned that we would get the wrong
>> string in the token's spelling? When we build a token, we take the
>> characters from the underlying source buffer, not the value returned by
>> getCharAndSize.
>
> Oh, I see... so the idea is to hack up getCharAndSize: instead of
> calling isUCNAfterSlash/ConsumeUCNAfterSlash where we expect a UCN, it
> returns a marker which essentially means "saw a UCN".
>
> Seems like a workable approach; I don't think it actually helps any
> with error recovery (I'm pretty sure we can't diagnose anything
> without knowing what kind of token we're forming), but I think it will
> make the patch simpler.  I'll try to hack up a new version of my
> patch.

Attached.
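
For anyone skimming rather than applying the patch, here is a rough
standalone sketch of the marker idea discussed above. It is not the patch
itself: peekCharAndSize, lexIdentifierSpelling, decodeUCN and
ExtendedCharMarker are made-up names for illustration, and the sketch
assumes no trigraphs, no escaped newlines and no raw UTF-8 bytes. The
point it tries to show is that the lexer only sees "some extended
character", while the spelling still comes straight from the source bytes
and the code point is decoded later.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>

// Hypothetical marker, mirroring the (char)0x80 the patch returns. Note that
// it is a char: where plain char is signed, comparing a char against the int
// literal 0x80 never matches, so compare against '\x80' instead.
const char ExtendedCharMarker = '\x80';

// Hypothetical stand-in for getCharAndSize. Classifies the character at Ptr
// and reports how many source bytes it covers. A well-formed \uXXXX or
// \UXXXXXXXX is reported as the marker; any other backslash stays a backslash.
char peekCharAndSize(const char *Ptr, unsigned &Size) {
  Size = 1;
  if (Ptr[0] != '\\')
    return Ptr[0];
  unsigned NumHexDigits;
  if (Ptr[1] == 'u')
    NumHexDigits = 4;
  else if (Ptr[1] == 'U')
    NumHexDigits = 8;
  else
    return '\\';
  // The loop stops at the first non-hex byte, so it never reads past a NUL.
  for (unsigned i = 0; i < NumHexDigits; ++i)
    if (!std::isxdigit(static_cast<unsigned char>(Ptr[2 + i])))
      return '\\';
  Size = 2 + NumHexDigits;
  return ExtendedCharMarker;
}

// Scans identifier-body characters starting at Ptr and returns the spelling.
// The spelling is copied from the buffer, not rebuilt from the returned chars,
// so the UCN text survives unchanged. (Start-of-identifier rules are ignored.)
std::string lexIdentifierSpelling(const char *Ptr) {
  const char *Start = Ptr;
  for (;;) {
    unsigned Size;
    char C = peekCharAndSize(Ptr, Size);
    bool IsBody = C == ExtendedCharMarker || C == '_' ||
                  std::isalnum(static_cast<unsigned char>(C));
    if (!IsBody)
      break;
    Ptr += Size;
  }
  return std::string(Start, Ptr);
}

// Decodes a UCN already recognized by peekCharAndSize; Size is the byte count
// it reported (6 for \u, 10 for \U). Validation would happen after this step.
std::uint32_t decodeUCN(const char *Ptr, unsigned Size) {
  std::uint32_t Value = 0;
  for (unsigned i = 2; i < Size; ++i) {
    char D = Ptr[i];
    unsigned Digit = (D >= '0' && D <= '9')   ? D - '0'
                     : (D >= 'a' && D <= 'f') ? D - 'a' + 10
                                              : D - 'A' + 10;
    Value = (Value << 4) | Digit;
  }
  return Value;
}
```

With this sketch, lexing the text a\u00FD() yields the seven-character
spelling a\u00FD, and \uarecool is not treated as a UCN because it does
not have four hex digits after the u.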

-Eli
-------------- next part --------------
Index: test/Preprocessor/ucn-pp-identifier.c
===================================================================

--- test/Preprocessor/ucn-pp-identifier.c	(revision 0)
+++ test/Preprocessor/ucn-pp-identifier.c	(revision 0)
@@ -0,0 +1,110 @@
+// RUN: %clang_cc1 %s -fsyntax-only -std=c99 -pedantic -verify -Wundef
+
+#define \u00FC
+#define a\u00FD() 0
+#ifndef \u00FC
+#error "This should never happen"
+#endif
+
+#if a\u00FD()
+#error "This should never happen"
+#endif
+
+#if a\U000000FD()
+#error "This should never happen"
+#endif
+
+// Check that we allow UCNs in preprocessing numbers.
+// (Why exactly C allows them, I have no idea, but those are the rules)
+#define CONCAT(a,b) a ## b
+#define \U000100010\u00FD 1
+#if !CONCAT(\U00010001, 0\u00FD)
+#error "This should never happen"
+#endif
+
+// Check concatenating a '\' with the rest of a UCN.  (Also a little weird,
+// but apparently allowed in C.)
+#if !CONCAT(\, U000100010\u00FD)
+#error "This should never happen"
+#endif
+
+// Check that we don't accept all uses of \u and \U as UCNs.
+// (Again, sort of weird, but part of the rules)
+#if \uarecool // expected-error {{invalid token at start of a preprocessor expression}}
+#endif
+#if \U0001000  // expected-error {{invalid token at start of a preprocessor expression}}
+#endif
+
+// Make sure we reject disallowed UCNs
+#define \ufffe // expected-error {{character '\ufffe' cannot be used as a universal character name in an identifer}}
+#define \U10000000  // expected-error {{character '\U10000000' cannot be used as a universal character name in an identifer}}
+#define \u0061  // expected-error {{character '\u0061' cannot be used as a universal character name in an identifer}}
+// FIXME: Not clear what our behavior should be here; \u0024 is "$".
+#define a\u0024  // expected-error {{character '\u0024' cannot be used as a universal character name in an identifer}}
+
+#if \u0110 // expected-warning {{'?' is not defined, evaluates to 0}}
+#endif
+
+
+#define \u0110 1 / 0
+#if \u0110
+#endif
+
+#define STRINGIZE(X) # X
+
+extern int check_size[sizeof(STRINGIZE(\u0112)) == 3 ? 1 : -1];
+// RUN: %clang_cc1 %s -fsyntax-only -std=c99 -pedantic -verify -Wundef
+
+#define \u00FC
+#define a\u00FD() 0
+#ifndef \u00FC
+#error "This should never happen"
+#endif
+
+#if a\u00FD()
+#error "This should never happen"
+#endif
+
+#if a\U000000FD()
+#error "This should never happen"
+#endif
+
+// Check that we allow UCNs in preprocessing numbers.
+// (Why exactly C allows them, I have no idea, but those are the rules)
+#define CONCAT(a,b) a ## b
+#define \U000100010\u00FD 1
+#if !CONCAT(\U00010001, 0\u00FD)
+#error "This should never happen"
+#endif
+
+// Check concatenating a '\' with the rest of a UCN.  (Also a little weird,
+// but apparently allowed in C.)
+#if !CONCAT(\, U000100010\u00FD)
+#error "This should never happen"
+#endif
+
+// Check that we don't accept all uses of \u and \U as UCNs.
+// (Again, sort of weird, but part of the rules)
+#if \uarecool // expected-error {{invalid token at start of a preprocessor expression}}
+#endif
+#if \U0001000  // expected-error {{invalid token at start of a preprocessor expression}}
+#endif
+
+// Make sure we reject disallowed UCNs
+#define \ufffe // expected-error {{character '\ufffe' cannot be used as a universal character name in an identifer}}
+#define \U10000000  // expected-error {{character '\U10000000' cannot be used as a universal character name in an identifer}}
+#define \u0061  // expected-error {{character '\u0061' cannot be used as a universal character name in an identifer}}
+// FIXME: Not clear what our behavior should be here; \u0024 is "$".
+#define a\u0024  // expected-error {{character '\u0024' cannot be used as a universal character name in an identifer}}
+
+#if \u0110 // expected-warning {{'?' is not defined, evaluates to 0}}
+#endif
+
+
+#define \u0110 1 / 0
+#if \u0110
+#endif
+
+#define STRINGIZE(X) # X
+
+extern int check_size[sizeof(STRINGIZE(\u0112)) == 3 ? 1 : -1];
Index: test/CXX/over/over.oper/over.literal/p8.cpp
===================================================================
--- test/CXX/over/over.oper/over.literal/p8.cpp	(revision 168748)
+++ test/CXX/over/over.oper/over.literal/p8.cpp	(working copy)
@@ -7,8 +7,7 @@
 
 void operator "" _km(long double); // ok
 string operator "" _i18n(const char*, std::size_t); // ok
-// FIXME: This should be accepted once we support UCNs
-template<char...> int operator "" \u03C0(); // ok, UCN for lowercase pi // expected-error {{expected identifier}}
+template<char...> int operator "" \u03C0(); // ok, UCN for lowercase pi // expected-warning {{reserved}}
 float operator ""E(const char *); // expected-error {{invalid suffix on literal}} expected-warning {{reserved}}
 float operator " " B(const char *); // expected-error {{must be '""'}} expected-warning {{reserved}}
 string operator "" 5X(const char *, std::size_t); // expected-error {{expected identifier}}
Index: include/clang/Basic/DiagnosticLexKinds.td
===================================================================
--- include/clang/Basic/DiagnosticLexKinds.td	(revision 168748)
+++ include/clang/Basic/DiagnosticLexKinds.td	(working copy)
@@ -93,8 +93,14 @@
   "multi-character character constant">, InGroup<MultiChar>;
 def ext_four_char_character_literal : Extension<
   "multi-character character constant">, InGroup<FourByteMultiChar>;
-  
 
+
+def err_ucn_invalid_in_id : Error<
+  "character '%0' cannot be used as a universal character name "
+  "in an identifer">;
+def err_ucn_invalid_at_start_id : Error<
+  "character '%0' cannot be used at the start of an identifer">;
+
 // Literal
 def ext_nonstandard_escape : Extension<
   "use of non-standard escape character '\\%0'">;
Index: include/clang/Lex/Lexer.h
===================================================================
--- include/clang/Lex/Lexer.h	(revision 168748)
+++ include/clang/Lex/Lexer.h	(working copy)
@@ -473,7 +473,7 @@
   /// can return false for characters that end up being the same, but it will
   /// never return true for something that needs to be mapped.
   static bool isObviouslySimpleCharacter(char C) {
-    return C != '?' && C != '\\';
+    return C != '?' && C != '\\' && (signed char)C >= 0;
   }
 
   /// getAndAdvanceChar - Read a single 'character' from the specified buffer,
@@ -573,6 +573,10 @@
   void cutOffLexing() { BufferPtr = BufferEnd; }
 
   bool isHexaLiteral(const char *Start, const LangOptions &LangOpts);
+
+  bool isUCNAfterSlash(const char *CurPtr, unsigned &Size);
+  static bool isUCNAfterSlashNoWarn(const char* CurPtr, unsigned &Size,
+                                    const LangOptions &LangOpts);
 };
 
 
Index: include/clang/Lex/Token.h
===================================================================
--- include/clang/Lex/Token.h	(revision 168748)
+++ include/clang/Lex/Token.h	(working copy)
@@ -74,9 +74,10 @@
     StartOfLine   = 0x01,  // At start of line or only after whitespace.
     LeadingSpace  = 0x02,  // Whitespace exists before this token.
     DisableExpand = 0x04,  // This identifier may never be macro expanded.
-    NeedsCleaning = 0x08,   // Contained an escaped newline or trigraph.
+    NeedsCleaning = 0x08,  // Contained an escaped newline or trigraph.
     LeadingEmptyMacro = 0x10, // Empty macro exists before this token.
-    HasUDSuffix = 0x20     // This string or character literal has a ud-suffix.
+    HasUDSuffix = 0x20,    // This string or character literal has a ud-suffix.
+    HasUCN = 0x40          // This identifier contains a UCN.
   };
 
   tok::TokenKind getKind() const { return (tok::TokenKind)Kind; }
Index: lib/Lex/Lexer.cpp
===================================================================
--- lib/Lex/Lexer.cpp	(revision 168748)
+++ lib/Lex/Lexer.cpp	(working copy)
@@ -336,10 +336,12 @@
   // NOTE: this has to be checked *before* testing for an IdentifierInfo.
   if (Tok.is(tok::raw_identifier))
     TokStart = Tok.getRawIdentifierData();
-  else if (const IdentifierInfo *II = Tok.getIdentifierInfo()) {
-    // Just return the string from the identifier table, which is very quick.
-    Buffer = II->getNameStart();
-    return II->getLength();
+  else if (!(Tok.getFlags() & Token::HasUCN)) {
+    if (const IdentifierInfo *II = Tok.getIdentifierInfo()) {
+      // Just return the string from the identifier table, which is very quick.
+      Buffer = II->getNameStart();
+      return II->getLength();
+    }
   }
 
   // NOTE: this can be checked even after testing for an IdentifierInfo.
@@ -1341,7 +1343,6 @@
 ///   2. If this is an escaped newline (potentially with whitespace between
 ///      the backslash and newline), implicitly skip the newline and return
 ///      the char after it.
-///   3. If this is a UCN, return it.  FIXME: C++ UCN's?
 ///
 /// This handles the slow/uncommon case of the getCharAndSize method.  Here we
 /// know that we can accumulate into Size, and that we have already incremented
@@ -1357,6 +1358,12 @@
     ++Size;
     ++Ptr;
 Slash:
+    // Check for UCN; if we find one, return an extended-character note.
+    if (isUCNAfterSlash(Ptr, Size)) {
+      if (Tok) Tok->setFlag(Token::HasUCN);
+      return (char)0x80;
+    }
+
     // Common case, backslash-char where the char is not whitespace.
     if (!isWhitespace(Ptr[0])) return '\\';
 
@@ -1403,6 +1410,13 @@
     }
   }
 
+  // If we're outside ASCII, just return an extended-character note.
+  // (We'll validate that the character is valid later.)
+  if ((signed char)Ptr[0] < 0) {
+    ++Size;
+    return (char)0x80;
+  }
+
   // If this is neither, return a single character.
   ++Size;
   return *Ptr;
@@ -1422,6 +1436,10 @@
     ++Size;
     ++Ptr;
 Slash:
+    // Check for UCN; if we find one, return an extended-character note.
+    if (isUCNAfterSlashNoWarn(Ptr, Size, LangOpts))
+      return (char)0x80;
+
     // Common case, backslash-char where the char is not whitespace.
     if (!isWhitespace(Ptr[0])) return '\\';
 
@@ -1457,6 +1475,13 @@
     }
   }
 
+  // If we're outside ASCII, just return an extended-character note.
+  // (We'll validate that the character is valid later.)
+  if ((signed char)Ptr[0] < 0) {
+    ++Size;
+    return (char)0x80;
+  }
+
   // If this is neither, return a single character.
   ++Size;
   return *Ptr;
@@ -1466,6 +1491,57 @@
 // Helper methods for lexing.
 //===----------------------------------------------------------------------===//
 
+bool Lexer::isUCNAfterSlash(const char* CurPtr, unsigned &Size) {
+  if (!LangOpts.CPlusPlus && !LangOpts.C99)
+    return false;
+  unsigned CharSize;
+  unsigned SizeTmp = Size;
+  char FirstChar = getCharAndSize(CurPtr, CharSize);
+  CurPtr += CharSize;
+  SizeTmp += CharSize;
+  unsigned NumHexDigits;
+  if (FirstChar == 'u')
+    NumHexDigits = 4;
+  else if (FirstChar == 'U')
+    NumHexDigits = 8;
+  else
+    return false;
+  for (unsigned i = 0; i < NumHexDigits; ++i) {
+    if (!isxdigit(getCharAndSize(CurPtr, CharSize)))
+      return false;
+    CurPtr += CharSize;
+    SizeTmp += CharSize;
+  }
+  Size = SizeTmp;
+  return true;
+}
+
+bool Lexer::isUCNAfterSlashNoWarn(const char* CurPtr, unsigned &Size,
+                                  const LangOptions &LangOpts) {
+  if (!LangOpts.CPlusPlus && !LangOpts.C99)
+    return false;
+  unsigned CharSize;
+  unsigned SizeTmp = Size;
+  char FirstChar = getCharAndSizeNoWarn(CurPtr, CharSize, LangOpts);
+  CurPtr += CharSize;
+  SizeTmp += CharSize;
+  unsigned NumHexDigits;
+  if (FirstChar == 'u')
+    NumHexDigits = 4;
+  else if (FirstChar == 'U')
+    NumHexDigits = 8;
+  else
+    return false;
+  for (unsigned i = 0; i < NumHexDigits; ++i) {
+    if (!isxdigit(getCharAndSizeNoWarn(CurPtr, CharSize, LangOpts)))
+      return false;
+    CurPtr += CharSize;
+    SizeTmp += CharSize;
+  }
+  Size = SizeTmp;
+  return true;
+}
+
 /// \brief Routine that indiscriminately skips bytes in the source file.
 void Lexer::SkipBytes(unsigned Bytes, bool StartOfLine) {
   BufferPtr += Bytes;
@@ -1485,7 +1561,6 @@
 
   // Fast path, no $,\,? in identifier found.  '\' might be an escaped newline
   // or UCN, and ? might be a trigraph for '\', an escaped newline or UCN.
-  // FIXME: UCNs.
   //
   // TODO: Could merge these checks into a CharInfo flag to make the comparison
   // cheaper
@@ -1526,7 +1601,7 @@
       CurPtr = ConsumeChar(CurPtr, Size, Result);
       C = getCharAndSize(CurPtr, Size);
       continue;
-    } else if (!isIdentifierBody(C)) { // FIXME: UCNs.
+    } else if (C != '\x80' && !isIdentifierBody(C)) {
       // Found end of identifier.
       goto FinishIdentifier;
     }
@@ -1535,7 +1610,7 @@
     CurPtr = ConsumeChar(CurPtr, Size, Result);
 
     C = getCharAndSize(CurPtr, Size);
-    while (isIdentifierBody(C)) { // FIXME: UCNs.
+    while (isIdentifierBody(C)) {
       CurPtr = ConsumeChar(CurPtr, Size, Result);
       C = getCharAndSize(CurPtr, Size);
     }
@@ -1560,12 +1635,18 @@
   unsigned Size;
   char C = getCharAndSize(CurPtr, Size);
   char PrevCh = 0;
-  while (isNumberBody(C)) { // FIXME: UCNs in ud-suffix.
+  while (isNumberBody(C)) {
     CurPtr = ConsumeChar(CurPtr, Size, Result);
     PrevCh = C;
     C = getCharAndSize(CurPtr, Size);
   }
 
+  // Check for a UCN.
+  if (C == '\x80') {
+    CurPtr = ConsumeChar(CurPtr, Size, Result);
+    return LexNumericConstant(Result, CurPtr);
+  }
+
   // If we fell out, check for a sign, due to 1e+12.  If we have one, continue.
   if ((C == '-' || C == '+') && (PrevCh == 'E' || PrevCh == 'e')) {
     // If we are in Microsoft mode, don't continue if the constant is hex.
@@ -3208,9 +3289,13 @@
       Kind = tok::unknown;
     break;
 
-  case '\\':
-    // FIXME: UCN's.
-    // FALL THROUGH.
+  case '\x80': {
+    // Notify MIOpt that we read a non-whitespace/non-comment token.
+    MIOpt.ReadToken();
+
+    return LexIdentifier(Result, CurPtr);
+  }
+
   default:
     Kind = tok::unknown;
     break;
Index: lib/Lex/Preprocessor.cpp
===================================================================
--- lib/Lex/Preprocessor.cpp	(revision 168748)
+++ lib/Lex/Preprocessor.cpp	(working copy)
@@ -38,11 +38,13 @@
 #include "clang/Lex/CodeCompletionHandler.h"
 #include "clang/Lex/ModuleLoader.h"
 #include "clang/Lex/LiteralSupport.h"
+#include "clang/Basic/ConvertUTF.h"
 #include "clang/Basic/SourceManager.h"
 #include "clang/Basic/FileManager.h"
 #include "clang/Basic/TargetInfo.h"
 #include "llvm/ADT/APFloat.h"
 #include "llvm/ADT/SmallString.h"
+#include "llvm/ADT/STLExtras.h"
 #include "llvm/Support/MemoryBuffer.h"
 #include "llvm/Support/raw_ostream.h"
 #include "llvm/Support/Capacity.h"
@@ -399,7 +401,7 @@
                                           SmallVectorImpl<char> &Buffer,
                                           bool *Invalid) const {
   // NOTE: this has to be checked *before* testing for an IdentifierInfo.
-  if (Tok.isNot(tok::raw_identifier)) {
+  if (Tok.isNot(tok::raw_identifier) && !(Tok.getFlags() & Token::HasUCN)) {
     // Try the fast path.
     if (const IdentifierInfo *II = Tok.getIdentifierInfo())
       return II->getName();
@@ -497,6 +499,78 @@
 // Lexer Event Handling.
 //===----------------------------------------------------------------------===//
 
+static int HexDigitValue(char C) {
+  if (C >= '0' && C <= '9') return C-'0';
+  if (C >= 'a' && C <= 'f') return C-'a'+10;
+  return C-'A'+10;
+}
+
+namespace {
+  struct UCNCharRange {
+    unsigned Lower;
+    unsigned Upper;
+  };
+  UCNCharRange UCNAllowedCharRanges[] =
+      // 1
+    { { 0x00A8, 0x00A8 }, { 0x00AA, 0x00AA }, { 0x00AD, 0x00AD },
+      { 0x00AF, 0x00AF }, { 0x00B2, 0x00B5 }, { 0x00B7, 0x00BA },
+      { 0x00BC, 0x00BE }, { 0x00C0, 0x00D6 }, { 0x00D8, 0x00F6 },
+      { 0x00F8, 0x00FF },
+      // 2
+      { 0x0100, 0x167F }, { 0x1681, 0x180D }, { 0x180F, 0x1FFF },
+      // 3
+      { 0x200B, 0x200D }, { 0x202A, 0x202E }, { 0x203F, 0x2040 },
+      { 0x2054, 0x2054 }, { 0x2060, 0x206F },
+      // 4
+      { 0x2070, 0x218F }, { 0x2460, 0x24FF }, { 0x2776, 0x2793 },
+      { 0x2C00, 0x2DFF }, { 0x2E80, 0x2FFF },
+      // 5
+      { 0x3004, 0x3007 }, { 0x3021, 0x302F }, { 0x3031, 0x303F },
+      // 6
+      { 0x3040, 0xD7FF },
+      // 7
+      { 0xF900, 0xFD3D }, { 0xFD40, 0xFDCF }, { 0xFDF0, 0xFE44 },
+      { 0xFE47, 0xFFFD },
+      // 8
+      { 0x10000, 0x1FFFD }, { 0x20000, 0x2FFFD }, { 0x30000, 0x3FFFD },
+      { 0x40000, 0x4FFFD }, { 0x50000, 0x5FFFD }, { 0x60000, 0x6FFFD },
+      { 0x70000, 0x7FFFD }, { 0x80000, 0x8FFFD }, { 0x90000, 0x9FFFD },
+      { 0xA0000, 0xAFFFD }, { 0xB0000, 0xBFFFD }, { 0xC0000, 0xCFFFD },
+      { 0xD0000, 0xDFFFD }, { 0xE0000, 0xEFFFD } };
+}
+
+static bool isAllowedIDChar(unsigned c) {
+  unsigned LowPoint = 0;
+  unsigned HighPoint = llvm::array_lengthof(UCNAllowedCharRanges);
+  while (HighPoint != LowPoint) {
+    unsigned MidPoint = (HighPoint + LowPoint) / 2;
+    if (c < UCNAllowedCharRanges[MidPoint].Lower)
+      HighPoint = MidPoint;
+    else if (c > UCNAllowedCharRanges[MidPoint].Upper)
+      LowPoint = MidPoint + 1;
+    else
+      return true;
+  }
+  return false;
+}
+
+static bool isAllowedInitiallyIDChar(unsigned c) {
+  return !(0x0300 <= c && c <= 0x036F) &&
+         !(0x1DC0 <= c && c <= 0x1DFF) &&
+         !(0x20D0 <= c && c <= 0x20FF) &&
+         !(0xFE20 <= c && c <= 0xFE2F);
+}
+
+static void AppendCodePoint(unsigned Codepoint,
+                            llvm::SmallVectorImpl<char> &Str) {
+  char ResultBuf[4];
+  char *ResultPtr = ResultBuf;
+  bool Res = ConvertCodePointToUTF8(Codepoint, ResultPtr);
+  (void)Res;
+  assert(Res && "Unexpected conversion failure");
+  Str.append(ResultBuf, ResultPtr);
+}
+
 /// LookUpIdentifierInfo - Given a tok::raw_identifier token, look up the
 /// identifier information for the token and install it into the token,
 /// updating the token kind accordingly.
@@ -505,14 +579,56 @@
 
   // Look up this token, see if it is a macro, or if it is a language keyword.
   IdentifierInfo *II;
-  if (!Identifier.needsCleaning()) {
+  if (!Identifier.needsCleaning() && !(Identifier.getFlags() & Token::HasUCN)) {
     // No cleaning needed, just use the characters from the lexed buffer.
     II = getIdentifierInfo(StringRef(Identifier.getRawIdentifierData(),
                                            Identifier.getLength()));
   } else {
     // Cleaning needed, alloca a buffer, clean into it, then use the buffer.
     SmallString<64> IdentifierBuffer;
+    SmallString<64> UCNIdentifierBuffer;
     StringRef CleanedStr = getSpelling(Identifier, IdentifierBuffer);
+    if (Identifier.getFlags() & Token::HasUCN) {
+      for (unsigned i = 0, e = CleanedStr.size(); i != e; ++i) {
+        if (CleanedStr[i] == '\\') {
+          unsigned UcnVal;
+          unsigned NumChars;
+          if (CleanedStr[i+1] == 'u') {
+            UcnVal = (HexDigitValue(CleanedStr[i+2]) << 12) +
+                     (HexDigitValue(CleanedStr[i+3]) << 8) +
+                     (HexDigitValue(CleanedStr[i+4]) << 4) +
+                     (HexDigitValue(CleanedStr[i+5]));
+            NumChars = 6;
+          } else {
+            assert(CleanedStr[i+1] == 'U');
+            UcnVal = (HexDigitValue(CleanedStr[i+2]) << 28) +
+                     (HexDigitValue(CleanedStr[i+3]) << 24) +
+                     (HexDigitValue(CleanedStr[i+4]) << 20) +
+                     (HexDigitValue(CleanedStr[i+5]) << 16) +
+                     (HexDigitValue(CleanedStr[i+6]) << 12) +
+                     (HexDigitValue(CleanedStr[i+7]) << 8) +
+                     (HexDigitValue(CleanedStr[i+8]) << 4) +
+                     (HexDigitValue(CleanedStr[i+9]));
+            NumChars = 10;
+          }
+          if (!isAllowedIDChar(UcnVal)) {
+            StringRef CurCharacter = CleanedStr.substr(i, NumChars);
+            Diag(Identifier, diag::err_ucn_invalid_in_id) << CurCharacter;
+            UcnVal = 0xFFFD;
+          }
+          if (UCNIdentifierBuffer.empty() && !isAllowedInitiallyIDChar(UcnVal)) {
+            StringRef CurCharacter = CleanedStr.substr(i, NumChars);
+            Diag(Identifier, diag::err_ucn_invalid_at_start_id) << CurCharacter;
+            UcnVal = 0xFFFD;
+          }
+          AppendCodePoint(UcnVal, UCNIdentifierBuffer);
+          i += NumChars - 1;
+        } else {
+          UCNIdentifierBuffer.push_back(CleanedStr[i]);
+        }
+      }
+      CleanedStr = UCNIdentifierBuffer;
+    }
     II = getIdentifierInfo(CleanedStr);
   }