[cfe-commits] r173369 - in /cfe/trunk: include/clang/Basic/ConvertUTF.h include/clang/Basic/DiagnosticLexKinds.td include/clang/Lex/Lexer.h include/clang/Lex/Token.h lib/Lex/Lexer.cpp lib/Lex/Preprocessor.cpp test/CXX/over/over.oper/over.literal/p8.cpp test/CodeGen/ucn-identifiers.c test/FixIt/fixit-unicode.c test/Lexer/utf8-invalid.c test/Preprocessor/ucn-pp-identifier.c test/Sema/ucn-identifiers.c

Jordan Rose jordan_rose at apple.com
Thu Jan 24 13:38:21 PST 2013


On Jan 24, 2013, at 13:34, Dmitri Gribenko <gribozavr at gmail.com> wrote:

> On Thu, Jan 24, 2013 at 10:50 PM, Jordan Rose <jordan_rose at apple.com> wrote:
>> Author: jrose
>> Date: Thu Jan 24 14:50:46 2013
>> New Revision: 173369
>> 
>> URL: http://llvm.org/viewvc/llvm-project?rev=173369&view=rev
>> Log:
>> Handle universal character names and Unicode characters outside of literals.
>> 
>> This is a missing piece for C99 conformance.
>> 
>> This patch handles UCNs by adding a '\\' case to LexTokenInternal and
>> LexIdentifier -- if we see a backslash, we tentatively try to read in a UCN.
>> If the UCN is not syntactically well-formed, we fall back to the old
>> treatment: a backslash followed by an identifier beginning with 'u' (or 'U').
>> 
>> Because the spelling of an identifier with UCNs still has the UCN in it, we
>> need to convert that to UTF-8 in Preprocessor::LookUpIdentifierInfo.
>> 
>> Of course, valid code that does *not* use UCNs will see only a very minimal
>> performance hit (checks after each identifier for non-ASCII characters,
>> checks when converting raw_identifiers to identifiers that they do not
>> contain UCNs, and checks when getting the spelling of an identifier that it
>> does not contain a UCN).
>> 
>> This patch also adds basic support for actual UTF-8 in the source. This is
>> treated almost exactly the same as UCNs except that we consider stray
>> Unicode characters to be mistakes and offer a fixit to remove them.
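
To make the "tentatively read a UCN, otherwise fall back" behavior from the log concrete, here is a minimal C++ sketch. It is not the code from r173369: `hexDigitValue`, `tryReadUCN`, `appendUTF8`, and `expandUCNs` are hypothetical names, and the sketch works on a `std::string` rather than the lexer's buffer. As a simplification, it treats out-of-range values and surrogates as a failed read instead of diagnosing them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Value of a hex digit, or -1 if c is not one.
int hexDigitValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Hypothetical helper, not Clang's API. Tentatively reads \uXXXX or
// \UXXXXXXXX at text[pos]. On success it stores the code point, advances
// pos past the UCN, and returns true. On failure pos is left untouched so
// the caller can fall back to treating the backslash as ordinary text.
bool tryReadUCN(const std::string &text, std::size_t &pos,
                std::uint32_t &codePoint) {
  if (pos + 1 >= text.size() || text[pos] != '\\') return false;
  std::size_t numDigits;
  if (text[pos + 1] == 'u') numDigits = 4;
  else if (text[pos + 1] == 'U') numDigits = 8;
  else return false;
  if (pos + 2 + numDigits > text.size()) return false;

  std::uint32_t value = 0;
  for (std::size_t i = 0; i < numDigits; ++i) {
    int digit = hexDigitValue(text[pos + 2 + i]);
    if (digit < 0) return false;  // not syntactically well-formed
    value = (value << 4) | static_cast<std::uint32_t>(digit);
  }
  // Sketch-only simplification: reject surrogates and out-of-range values.
  if (value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF)) return false;
  codePoint = value;
  pos += 2 + numDigits;
  return true;
}

// Appends the UTF-8 encoding of a code point to out.
void appendUTF8(std::uint32_t cp, std::string &out) {
  assert(cp <= 0x10FFFF && "code point out of range");
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Rewrites every well-formed UCN in an identifier spelling to UTF-8 and
// copies everything else through unchanged.
std::string expandUCNs(const std::string &spelling) {
  std::string result;
  std::size_t pos = 0;
  while (pos < spelling.size()) {
    std::uint32_t cp = 0;
    if (tryReadUCN(spelling, pos, cp))
      appendUTF8(cp, result);
    else
      result += spelling[pos++];
  }
  return result;
}
```

With this sketch, `expandUCNs("caf\\u00E9")` yields the UTF-8 bytes for "café", while `expandUCNs("\\uXYZ")` comes back unchanged because the tentative read fails.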


>> +    // Instead of letting the parser complain about the unknown token,
>> +    // just warn that we don't have valid UTF-8, then drop the character.
> 
> The comment says 'just warn', but we throw an error here:
> 
>> +    if (!isLexingRawMode())
>> +      Diag(CurPtr, diag::err_invalid_utf8);


Yup. We're allowed to make this one an error because we get to map non-ASCII source characters down to ASCII however we want, and we can map them to an invalid ASCII character. At least, that was my understanding of Richard's comments.
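
For illustration, the behavior under discussion (report an error unless lexing in raw mode, then drop the offending byte) can be sketched as below. This is a standalone approximation under stated assumptions, not the code in Lexer.cpp: `utf8SequenceLength`, `SketchLexer`, `rawMode`, and `errorOffsets` are all hypothetical names, and recording an offset stands in for `Diag(CurPtr, diag::err_invalid_utf8)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Length of the well-formed UTF-8 sequence starting at s[pos], or 0 if the
// bytes there are not valid UTF-8 (bad lead byte, truncated or overlong
// sequence, surrogate, or value above U+10FFFF).
std::size_t utf8SequenceLength(const std::string &s, std::size_t pos) {
  assert(pos < s.size());
  unsigned char lead = static_cast<unsigned char>(s[pos]);
  std::size_t len;
  std::uint32_t cp, minCp;
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; minCp = 0x80; }
  else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; minCp = 0x800; }
  else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; minCp = 0x10000; }
  else return 0;
  if (pos + len > s.size()) return 0;
  for (std::size_t i = 1; i < len; ++i) {
    unsigned char c = static_cast<unsigned char>(s[pos + i]);
    if ((c & 0xC0) != 0x80) return 0;  // not a continuation byte
    cp = (cp << 6) | (c & 0x3F);
  }
  if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
  return len;
}

// Hypothetical stand-in for the lexer: keeps valid text, drops invalid
// bytes, and records an error offset for each one unless in raw mode.
struct SketchLexer {
  bool rawMode = false;
  std::vector<std::size_t> errorOffsets;

  std::string scan(const std::string &src) {
    std::string kept;
    std::size_t pos = 0;
    while (pos < src.size()) {
      std::size_t len = utf8SequenceLength(src, pos);
      if (len == 0) {
        if (!rawMode)
          errorOffsets.push_back(pos);  // stands in for the diagnostic
        ++pos;                          // drop the invalid byte
        continue;
      }
      kept.append(src, pos, len);
      pos += len;
    }
    return kept;
  }
};
```

In both modes the invalid byte is dropped from the result; the only difference is whether an error is recorded.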


