[cfe-commits] [PATCH] Support for universal character names in identifiers
Eli Friedman
eli.friedman at gmail.com
Tue Nov 27 17:04:25 PST 2012
On Tue, Nov 27, 2012 at 3:33 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
> On Tue, Nov 27, 2012 at 3:01 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>> On Tue, Nov 27, 2012 at 2:37 PM, Eli Friedman <eli.friedman at gmail.com>
>> wrote:
>>>
>>> On Tue, Nov 27, 2012 at 2:25 PM, Richard Smith <richard at metafoo.co.uk>
>>> wrote:
>>> > I had a look at supporting UTF-8 in source files, and came up with the
>>> > attached approach. getCharAndSize maps UTF-8 characters down to a char
>>> > with
>>> > the high bit set, representing the class of the character rather than
>>> > the
>>> > character itself. (I've not done any performance measurements yet, and
>>> > the
>>> > patch is generally far from being ready for review).
>>> >
>>> > Have you considered using a similar approach for lexing UCNs? We already
>>> > land in getCharAndSizeSlow, so it seems like it'd be cheap to deal with
>>> > them
>>> > there. Also, validating the codepoints early would allow us to recover
>>> > better (for instance, from UCNs encoding whitespace or elements of the
>>> > basic
>>> > source character set).
>>>
>>> That would affect the spelling of the tokens, and I don't think the C
>>> or C++ standard actually allows us to do that.
>>
>>
>> If I understand you correctly, you're concerned that we would get the wrong
>> string in the token's spelling? When we build a token, we take the
>> characters from the underlying source buffer, not the value returned by
>> getCharAndSize.
>
> Oh, I see... so the idea is to hack up getCharAndSize: instead of
> calling isUCNAfterSlash/ConsumeUCNAfterSlash where we expect a UCN,
> we use a marker which essentially means "saw a UCN".
>
> Seems like a workable approach; I don't think it actually helps any
> with error recovery (I'm pretty sure we can't diagnose anything
> without knowing what kind of token we're forming), but I think it will
> make the patch simpler. I'll try to hack up a new version of my
> patch.
Attached.
-Eli
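
For readers following the thread, the marker idea discussed above can be
sketched roughly as below. This is a simplified illustration and not the
attached patch: it works on a std::string instead of the Lexer's buffer
pointers, ignores trigraphs and escaped newlines, and assumes ASCII input.
The names ExtendedCharMarker, SketchToken and lexIdentifier are made up for
the example. The point it tries to show is that the character-reading helper
returns a marker for a UCN, while the token spelling is still copied from
the underlying buffer.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical marker value meaning "a UCN was seen at this position".
const char ExtendedCharMarker = '\x80';

// Pos points just past a backslash. Returns true for 'u' + 4 hex digits or
// 'U' + 8 hex digits; only on success is the consumed length added to Size.
bool isUCNAfterSlash(const std::string &Buf, std::size_t Pos,
                     std::size_t &Size) {
  if (Pos >= Buf.size())
    return false;
  std::size_t NumHexDigits;
  if (Buf[Pos] == 'u')
    NumHexDigits = 4;
  else if (Buf[Pos] == 'U')
    NumHexDigits = 8;
  else
    return false;
  if (Pos + 1 + NumHexDigits > Buf.size())
    return false;
  for (std::size_t i = 0; i != NumHexDigits; ++i)
    if (!std::isxdigit(static_cast<unsigned char>(Buf[Pos + 1 + i])))
      return false;
  Size += 1 + NumHexDigits;
  return true;
}

// Simplified stand-in for getCharAndSize: returns the marker for a UCN and
// the plain character otherwise. Size is the number of bytes covered.
char getCharAndSize(const std::string &Buf, std::size_t Pos,
                    std::size_t &Size) {
  assert(Pos < Buf.size() && "reading past the end of the buffer");
  Size = 1;
  if (Buf[Pos] == '\\' && isUCNAfterSlash(Buf, Pos + 1, Size))
    return ExtendedCharMarker;
  return Buf[Pos];
}

// Hypothetical token type: spelling plus a flag in the spirit of HasUCN.
struct SketchToken {
  std::string Spelling;
  bool HasUCN = false;
};

bool isIdentifierBody(char C) {
  return C == '_' || std::isalnum(static_cast<unsigned char>(C));
}

// Lexes an identifier starting at Start. The spelling comes from the
// buffer, not from the values returned by getCharAndSize.
SketchToken lexIdentifier(const std::string &Buf, std::size_t Start) {
  SketchToken Tok;
  std::size_t Pos = Start;
  while (Pos < Buf.size()) {
    std::size_t Size;
    char C = getCharAndSize(Buf, Pos, Size);
    if (C == ExtendedCharMarker)
      Tok.HasUCN = true;
    else if (!isIdentifierBody(C))
      break;
    Pos += Size;
  }
  Tok.Spelling = Buf.substr(Start, Pos - Start);
  return Tok;
}
```

Note that the marker is compared as a char ('\x80'), not as the integer
0x80, so the comparison behaves the same whether plain char is signed or
unsigned. Whether the code point is actually allowed in an identifier is
left to a later step, as in the discussion above.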
-------------- next part --------------
Index: test/Preprocessor/ucn-pp-identifier.c
===================================================================
--- test/Preprocessor/ucn-pp-identifier.c (revision 0)
+++ test/Preprocessor/ucn-pp-identifier.c (revision 0)
@@ -0,0 +1,110 @@
+// RUN: %clang_cc1 %s -fsyntax-only -std=c99 -pedantic -verify -Wundef
+
+#define \u00FC
+#define a\u00FD() 0
+#ifndef \u00FC
+#error "This should never happen"
+#endif
+
+#if a\u00FD()
+#error "This should never happen"
+#endif
+
+#if a\U000000FD()
+#error "This should never happen"
+#endif
+
+// Check that we allow UCNs in preprocessing numbers.
+// (Why exactly C allows them, I have no idea, but those are the rules)
+#define CONCAT(a,b) a ## b
+#define \U000100010\u00FD 1
+#if !CONCAT(\U00010001, 0\u00FD)
+#error "This should never happen"
+#endif
+
+// Check concatenating a '\' with the rest of a UCN. (Also a little weird,
+// but apparently allowed in C.)
+#if !CONCAT(\, U000100010\u00FD)
+#error "This should never happen"
+#endif
+
+// Check that we don't accept all uses of \u and \U as UCNs.
+// (Again, sort of weird, but part of the rules)
+#if \uarecool // expected-error {{invalid token at start of a preprocessor expression}}
+#endif
+#if \U0001000 // expected-error {{invalid token at start of a preprocessor expression}}
+#endif
+
+// Make sure we reject disallowed UCNs
+#define \ufffe // expected-error {{character '\ufffe' cannot be used as a universal character name in an identifer}}
+#define \U10000000 // expected-error {{character '\U10000000' cannot be used as a universal character name in an identifer}}
+#define \u0061 // expected-error {{character '\u0061' cannot be used as a universal character name in an identifer}}
+// FIXME: Not clear what our behavior should be here; \u0024 is "$".
+#define a\u0024 // expected-error {{character '\u0024' cannot be used as a universal character name in an identifer}}
+
+#if \u0110 // expected-warning {{'?' is not defined, evaluates to 0}}
+#endif
+
+
+#define \u0110 1 / 0
+#if \u0110
+#endif
+
+#define STRINGIZE(X) # X
+
+extern int check_size[sizeof(STRINGIZE(\u0112)) == 3 ? 1 : -1];
+// RUN: %clang_cc1 %s -fsyntax-only -std=c99 -pedantic -verify -Wundef
+
+#define \u00FC
+#define a\u00FD() 0
+#ifndef \u00FC
+#error "This should never happen"
+#endif
+
+#if a\u00FD()
+#error "This should never happen"
+#endif
+
+#if a\U000000FD()
+#error "This should never happen"
+#endif
+
+// Check that we allow UCNs in preprocessing numbers.
+// (Why exactly C allows them, I have no idea, but those are the rules)
+#define CONCAT(a,b) a ## b
+#define \U000100010\u00FD 1
+#if !CONCAT(\U00010001, 0\u00FD)
+#error "This should never happen"
+#endif
+
+// Check concatenating a '\' with the rest of a UCN. (Also a little weird,
+// but apparently allowed in C.)
+#if !CONCAT(\, U000100010\u00FD)
+#error "This should never happen"
+#endif
+
+// Check that we don't accept all uses of \u and \U as UCNs.
+// (Again, sort of weird, but part of the rules)
+#if \uarecool // expected-error {{invalid token at start of a preprocessor expression}}
+#endif
+#if \U0001000 // expected-error {{invalid token at start of a preprocessor expression}}
+#endif
+
+// Make sure we reject disallowed UCNs
+#define \ufffe // expected-error {{character '\ufffe' cannot be used as a universal character name in an identifer}}
+#define \U10000000 // expected-error {{character '\U10000000' cannot be used as a universal character name in an identifer}}
+#define \u0061 // expected-error {{character '\u0061' cannot be used as a universal character name in an identifer}}
+// FIXME: Not clear what our behavior should be here; \u0024 is "$".
+#define a\u0024 // expected-error {{character '\u0024' cannot be used as a universal character name in an identifer}}
+
+#if \u0110 // expected-warning {{'?' is not defined, evaluates to 0}}
+#endif
+
+
+#define \u0110 1 / 0
+#if \u0110
+#endif
+
+#define STRINGIZE(X) # X
+
+extern int check_size[sizeof(STRINGIZE(\u0112)) == 3 ? 1 : -1];
Index: test/CXX/over/over.oper/over.literal/p8.cpp
===================================================================
--- test/CXX/over/over.oper/over.literal/p8.cpp (revision 168748)
+++ test/CXX/over/over.oper/over.literal/p8.cpp (working copy)
@@ -7,8 +7,7 @@
void operator "" _km(long double); // ok
string operator "" _i18n(const char*, std::size_t); // ok
-// FIXME: This should be accepted once we support UCNs
-template<char...> int operator "" \u03C0(); // ok, UCN for lowercase pi // expected-error {{expected identifier}}
+template<char...> int operator "" \u03C0(); // ok, UCN for lowercase pi // expected-warning {{reserved}}
float operator ""E(const char *); // expected-error {{invalid suffix on literal}} expected-warning {{reserved}}
float operator " " B(const char *); // expected-error {{must be '""'}} expected-warning {{reserved}}
string operator "" 5X(const char *, std::size_t); // expected-error {{expected identifier}}
Index: include/clang/Basic/DiagnosticLexKinds.td
===================================================================
--- include/clang/Basic/DiagnosticLexKinds.td (revision 168748)
+++ include/clang/Basic/DiagnosticLexKinds.td (working copy)
@@ -93,8 +93,14 @@
"multi-character character constant">, InGroup<MultiChar>;
def ext_four_char_character_literal : Extension<
"multi-character character constant">, InGroup<FourByteMultiChar>;
-
+
+def err_ucn_invalid_in_id : Error<
+ "character '%0' cannot be used as a universal character name "
+ "in an identifer">;
+def err_ucn_invalid_at_start_id : Error<
+ "character '%0' cannot be used at the start of an identifer">;
+
// Literal
def ext_nonstandard_escape : Extension<
"use of non-standard escape character '\\%0'">;
Index: include/clang/Lex/Lexer.h
===================================================================
--- include/clang/Lex/Lexer.h (revision 168748)
+++ include/clang/Lex/Lexer.h (working copy)
@@ -473,7 +473,7 @@
/// can return false for characters that end up being the same, but it will
/// never return true for something that needs to be mapped.
static bool isObviouslySimpleCharacter(char C) {
- return C != '?' && C != '\\';
+ return C != '?' && C != '\\' && (signed char)C >= 0;
}
/// getAndAdvanceChar - Read a single 'character' from the specified buffer,
@@ -573,6 +573,10 @@
void cutOffLexing() { BufferPtr = BufferEnd; }
bool isHexaLiteral(const char *Start, const LangOptions &LangOpts);
+
+ bool isUCNAfterSlash(const char *CurPtr, unsigned &Size);
+ static bool isUCNAfterSlashNoWarn(const char* CurPtr, unsigned &Size,
+ const LangOptions &LangOpts);
};
Index: include/clang/Lex/Token.h
===================================================================
--- include/clang/Lex/Token.h (revision 168748)
+++ include/clang/Lex/Token.h (working copy)
@@ -74,9 +74,10 @@
StartOfLine = 0x01, // At start of line or only after whitespace.
LeadingSpace = 0x02, // Whitespace exists before this token.
DisableExpand = 0x04, // This identifier may never be macro expanded.
- NeedsCleaning = 0x08, // Contained an escaped newline or trigraph.
+ NeedsCleaning = 0x08, // Contained an escaped newline or trigraph.
LeadingEmptyMacro = 0x10, // Empty macro exists before this token.
- HasUDSuffix = 0x20 // This string or character literal has a ud-suffix.
+ HasUDSuffix = 0x20, // This string or character literal has a ud-suffix.
+ HasUCN = 0x40 // This identifier contains a UCN.
};
tok::TokenKind getKind() const { return (tok::TokenKind)Kind; }
Index: lib/Lex/Lexer.cpp
===================================================================
--- lib/Lex/Lexer.cpp (revision 168748)
+++ lib/Lex/Lexer.cpp (working copy)
@@ -336,10 +336,12 @@
// NOTE: this has to be checked *before* testing for an IdentifierInfo.
if (Tok.is(tok::raw_identifier))
TokStart = Tok.getRawIdentifierData();
- else if (const IdentifierInfo *II = Tok.getIdentifierInfo()) {
- // Just return the string from the identifier table, which is very quick.
- Buffer = II->getNameStart();
- return II->getLength();
+ else if (!(Tok.getFlags() & Token::HasUCN)) {
+ if (const IdentifierInfo *II = Tok.getIdentifierInfo()) {
+ // Just return the string from the identifier table, which is very quick.
+ Buffer = II->getNameStart();
+ return II->getLength();
+ }
}
// NOTE: this can be checked even after testing for an IdentifierInfo.
@@ -1341,7 +1343,6 @@
/// 2. If this is an escaped newline (potentially with whitespace between
/// the backslash and newline), implicitly skip the newline and return
/// the char after it.
-/// 3. If this is a UCN, return it. FIXME: C++ UCN's?
///
/// This handles the slow/uncommon case of the getCharAndSize method. Here we
/// know that we can accumulate into Size, and that we have already incremented
@@ -1357,6 +1358,12 @@
++Size;
++Ptr;
Slash:
+ // Check for UCN; if we find one, return an extended-character note.
+ if (isUCNAfterSlash(Ptr, Size)) {
+ if (Tok) Tok->setFlag(Token::HasUCN);
+ return (char)0x80;
+ }
+
// Common case, backslash-char where the char is not whitespace.
if (!isWhitespace(Ptr[0])) return '\\';
@@ -1403,6 +1410,13 @@
}
}
+ // If we're outside ASCII, just return an extended-character note.
+ // (We'll validate that the character is valid later.)
+ if ((signed char)Ptr[0] < 0) {
+ ++Size;
+ return (char)0x80;
+ }
+
// If this is neither, return a single character.
++Size;
return *Ptr;
@@ -1422,6 +1436,10 @@
++Size;
++Ptr;
Slash:
+ // Check for UCN; if we find one, return an extended-character note.
+ if (isUCNAfterSlashNoWarn(Ptr, Size, LangOpts))
+ return (char)0x80;
+
// Common case, backslash-char where the char is not whitespace.
if (!isWhitespace(Ptr[0])) return '\\';
@@ -1457,6 +1475,13 @@
}
}
+ // If we're outside ASCII, just return an extended-character note.
+ // (We'll validate that the character is valid later.)
+ if ((signed char)Ptr[0] < 0) {
+ ++Size;
+ return (char)0x80;
+ }
+
// If this is neither, return a single character.
++Size;
return *Ptr;
@@ -1466,6 +1491,57 @@
// Helper methods for lexing.
//===----------------------------------------------------------------------===//
+bool Lexer::isUCNAfterSlash(const char* CurPtr, unsigned &Size) {
+ if (!LangOpts.CPlusPlus && !LangOpts.C99)
+ return false;
+ unsigned CharSize;
+ unsigned SizeTmp = Size;
+ char FirstChar = getCharAndSize(CurPtr, CharSize);
+ CurPtr += CharSize;
+ SizeTmp += CharSize;
+ unsigned NumHexDigits;
+ if (FirstChar == 'u')
+ NumHexDigits = 4;
+ else if (FirstChar == 'U')
+ NumHexDigits = 8;
+ else
+ return false;
+ for (unsigned i = 0; i < NumHexDigits; ++i) {
+ if (!isxdigit(getCharAndSize(CurPtr, CharSize)))
+ return false;
+ CurPtr += CharSize;
+ SizeTmp += CharSize;
+ }
+ Size = SizeTmp;
+ return true;
+}
+
+bool Lexer::isUCNAfterSlashNoWarn(const char* CurPtr, unsigned &Size,
+ const LangOptions &LangOpts) {
+ if (!LangOpts.CPlusPlus && !LangOpts.C99)
+ return false;
+ unsigned CharSize;
+ unsigned SizeTmp = Size;
+ char FirstChar = getCharAndSizeNoWarn(CurPtr, CharSize, LangOpts);
+ CurPtr += CharSize;
+ SizeTmp += CharSize;
+ unsigned NumHexDigits;
+ if (FirstChar == 'u')
+ NumHexDigits = 4;
+ else if (FirstChar == 'U')
+ NumHexDigits = 8;
+ else
+ return false;
+ for (unsigned i = 0; i < NumHexDigits; ++i) {
+ if (!isxdigit(getCharAndSizeNoWarn(CurPtr, CharSize, LangOpts)))
+ return false;
+ CurPtr += CharSize;
+ SizeTmp += CharSize;
+ }
+ Size = SizeTmp;
+ return true;
+}
+
/// \brief Routine that indiscriminately skips bytes in the source file.
void Lexer::SkipBytes(unsigned Bytes, bool StartOfLine) {
BufferPtr += Bytes;
@@ -1485,7 +1561,6 @@
// Fast path, no $,\,? in identifier found. '\' might be an escaped newline
// or UCN, and ? might be a trigraph for '\', an escaped newline or UCN.
- // FIXME: UCNs.
//
// TODO: Could merge these checks into a CharInfo flag to make the comparison
// cheaper
@@ -1526,7 +1601,7 @@
CurPtr = ConsumeChar(CurPtr, Size, Result);
C = getCharAndSize(CurPtr, Size);
continue;
- } else if (!isIdentifierBody(C)) { // FIXME: UCNs.
+ } else if (C != '\x80' && !isIdentifierBody(C)) {
// Found end of identifier.
goto FinishIdentifier;
}
@@ -1535,7 +1610,7 @@
CurPtr = ConsumeChar(CurPtr, Size, Result);
C = getCharAndSize(CurPtr, Size);
- while (isIdentifierBody(C)) { // FIXME: UCNs.
+ while (isIdentifierBody(C)) {
CurPtr = ConsumeChar(CurPtr, Size, Result);
C = getCharAndSize(CurPtr, Size);
}
@@ -1560,12 +1635,18 @@
unsigned Size;
char C = getCharAndSize(CurPtr, Size);
char PrevCh = 0;
- while (isNumberBody(C)) { // FIXME: UCNs in ud-suffix.
+ while (isNumberBody(C)) {
CurPtr = ConsumeChar(CurPtr, Size, Result);
PrevCh = C;
C = getCharAndSize(CurPtr, Size);
}
+ // Check for a UCN.
+ if (C == '\x80') {
+ CurPtr = ConsumeChar(CurPtr, Size, Result);
+ return LexNumericConstant(Result, CurPtr);
+ }
+
// If we fell out, check for a sign, due to 1e+12. If we have one, continue.
if ((C == '-' || C == '+') && (PrevCh == 'E' || PrevCh == 'e')) {
// If we are in Microsoft mode, don't continue if the constant is hex.
@@ -3208,9 +3289,13 @@
Kind = tok::unknown;
break;
- case '\\':
- // FIXME: UCN's.
- // FALL THROUGH.
+ case '\x80': {
+ // Notify MIOpt that we read a non-whitespace/non-comment token.
+ MIOpt.ReadToken();
+
+ return LexIdentifier(Result, CurPtr);
+ }
+
default:
Kind = tok::unknown;
break;
Index: lib/Lex/Preprocessor.cpp
===================================================================
--- lib/Lex/Preprocessor.cpp (revision 168748)
+++ lib/Lex/Preprocessor.cpp (working copy)
@@ -38,11 +38,13 @@
#include "clang/Lex/CodeCompletionHandler.h"
#include "clang/Lex/ModuleLoader.h"
#include "clang/Lex/LiteralSupport.h"
+#include "clang/Basic/ConvertUTF.h"
#include "clang/Basic/SourceManager.h"
#include "clang/Basic/FileManager.h"
#include "clang/Basic/TargetInfo.h"
#include "llvm/ADT/APFloat.h"
#include "llvm/ADT/SmallString.h"
+#include "llvm/ADT/STLExtras.h"
#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Support/Capacity.h"
@@ -399,7 +401,7 @@
SmallVectorImpl<char> &Buffer,
bool *Invalid) const {
// NOTE: this has to be checked *before* testing for an IdentifierInfo.
- if (Tok.isNot(tok::raw_identifier)) {
+ if (Tok.isNot(tok::raw_identifier) && !(Tok.getFlags() & Token::HasUCN)) {
// Try the fast path.
if (const IdentifierInfo *II = Tok.getIdentifierInfo())
return II->getName();
@@ -497,6 +499,78 @@
// Lexer Event Handling.
//===----------------------------------------------------------------------===//
+static int HexDigitValue(char C) {
+ if (C >= '0' && C <= '9') return C-'0';
+ if (C >= 'a' && C <= 'f') return C-'a'+10;
+ return C-'A'+10;
+}
+
+namespace {
+ struct UCNCharRange {
+ unsigned Lower;
+ unsigned Upper;
+ };
+ UCNCharRange UCNAllowedCharRanges[] =
+ // 1
+ { { 0x00A8, 0x00A8 }, { 0x00AA, 0x00AA }, { 0x00AD, 0x00AD },
+ { 0x00AF, 0x00AF }, { 0x00B2, 0x00B5 }, { 0x00B7, 0x00BA },
+ { 0x00BC, 0x00BE }, { 0x00C0, 0x00D6 }, { 0x00D8, 0x00F6 },
+ { 0x00F8, 0x00FF },
+ // 2
+ { 0x0100, 0x167F }, { 0x1681, 0x180D }, { 0x180F, 0x1FFF },
+ // 3
+ { 0x200B, 0x200D }, { 0x202A, 0x202E }, { 0x203F, 0x2040 },
+ { 0x2054, 0x2054 }, { 0x2060, 0x206F },
+ // 4
+ { 0x2070, 0x218F }, { 0x2460, 0x24FF }, { 0x2776, 0x2793 },
+ { 0x2C00, 0x2DFF }, { 0x2E80, 0x2FFF },
+ // 5
+ { 0x3004, 0x3007 }, { 0x3021, 0x302F }, { 0x3031, 0x303F },
+ // 6
+ { 0x3040, 0xD7FF },
+ // 7
+ { 0xF900, 0xFD3D }, { 0xFD40, 0xFDCF }, { 0xFDF0, 0xFE44 },
+ { 0xFE47, 0xFFFD },
+ // 8
+ { 0x10000, 0x1FFFD }, { 0x20000, 0x2FFFD }, { 0x30000, 0x3FFFD },
+ { 0x40000, 0x4FFFD }, { 0x50000, 0x5FFFD }, { 0x60000, 0x6FFFD },
+ { 0x70000, 0x7FFFD }, { 0x80000, 0x8FFFD }, { 0x90000, 0x9FFFD },
+ { 0xA0000, 0xAFFFD }, { 0xB0000, 0xBFFFD }, { 0xC0000, 0xCFFFD },
+ { 0xD0000, 0xDFFFD }, { 0xE0000, 0xEFFFD } };
+}
+
+static bool isAllowedIDChar(unsigned c) {
+ unsigned LowPoint = 0;
+ unsigned HighPoint = llvm::array_lengthof(UCNAllowedCharRanges);
+ while (HighPoint != LowPoint) {
+ unsigned MidPoint = (HighPoint + LowPoint) / 2;
+ if (c < UCNAllowedCharRanges[MidPoint].Lower)
+ HighPoint = MidPoint;
+ else if (c > UCNAllowedCharRanges[MidPoint].Upper)
+ LowPoint = MidPoint + 1;
+ else
+ return true;
+ }
+ return false;
+}
+
+static bool isAllowedInitiallyIDChar(unsigned c) {
+ return !(0x0300 <= c && c <= 0x036F) &&
+ !(0x1DC0 <= c && c <= 0x1DFF) &&
+ !(0x20D0 <= c && c <= 0x20FF) &&
+ !(0xFE20 <= c && c <= 0xFE2F);
+}
+
+static void AppendCodePoint(unsigned Codepoint,
+ llvm::SmallVectorImpl<char> &Str) {
+ char ResultBuf[4];
+ char *ResultPtr = ResultBuf;
+ bool Res = ConvertCodePointToUTF8(Codepoint, ResultPtr);
+ (void)Res;
+ assert(Res && "Unexpected conversion failure");
+ Str.append(ResultBuf, ResultPtr);
+}
+
/// LookUpIdentifierInfo - Given a tok::raw_identifier token, look up the
/// identifier information for the token and install it into the token,
/// updating the token kind accordingly.
@@ -505,14 +579,56 @@
// Look up this token, see if it is a macro, or if it is a language keyword.
IdentifierInfo *II;
- if (!Identifier.needsCleaning()) {
+ if (!Identifier.needsCleaning() && !(Identifier.getFlags() & Token::HasUCN)) {
// No cleaning needed, just use the characters from the lexed buffer.
II = getIdentifierInfo(StringRef(Identifier.getRawIdentifierData(),
Identifier.getLength()));
} else {
// Cleaning needed, alloca a buffer, clean into it, then use the buffer.
SmallString<64> IdentifierBuffer;
+ SmallString<64> UCNIdentifierBuffer;
StringRef CleanedStr = getSpelling(Identifier, IdentifierBuffer);
+ if (Identifier.getFlags() & Token::HasUCN) {
+ for (unsigned i = 0, e = CleanedStr.size(); i != e; ++i) {
+ if (CleanedStr[i] == '\\') {
+ unsigned UcnVal;
+ unsigned NumChars;
+ if (CleanedStr[i+1] == 'u') {
+ UcnVal = (HexDigitValue(CleanedStr[i+2]) << 12) +
+ (HexDigitValue(CleanedStr[i+3]) << 8) +
+ (HexDigitValue(CleanedStr[i+4]) << 4) +
+ (HexDigitValue(CleanedStr[i+5]));
+ NumChars = 6;
+ } else {
+ assert(CleanedStr[i+1] == 'U');
+ UcnVal = (HexDigitValue(CleanedStr[i+2]) << 28) +
+ (HexDigitValue(CleanedStr[i+3]) << 24) +
+ (HexDigitValue(CleanedStr[i+4]) << 20) +
+ (HexDigitValue(CleanedStr[i+5]) << 16) +
+ (HexDigitValue(CleanedStr[i+6]) << 12) +
+ (HexDigitValue(CleanedStr[i+7]) << 8) +
+ (HexDigitValue(CleanedStr[i+8]) << 4) +
+ (HexDigitValue(CleanedStr[i+9]));
+ NumChars = 10;
+ }
+ if (!isAllowedIDChar(UcnVal)) {
+ StringRef CurCharacter = CleanedStr.substr(i, NumChars);
+ Diag(Identifier, diag::err_ucn_invalid_in_id) << CurCharacter;
+ UcnVal = 0xFFFD;
+ }
+ if (UCNIdentifierBuffer.empty() && !isAllowedInitiallyIDChar(UcnVal)) {
+ StringRef CurCharacter = CleanedStr.substr(i, NumChars);
+ Diag(Identifier, diag::err_ucn_invalid_at_start_id) << CurCharacter;
+ UcnVal = 0xFFFD;
+ }
+ AppendCodePoint(UcnVal, UCNIdentifierBuffer);
+ i += NumChars - 1;
+ } else {
+ UCNIdentifierBuffer.push_back(CleanedStr[i]);
+ }
+ }
+ CleanedStr = UCNIdentifierBuffer;
+ }
II = getIdentifierInfo(CleanedStr);
}