[cfe-commits] [PATCH] Support for universal character names in identifiers

Thu Jan 17 16:33:28 PST 2013

Another flaw here is that if a UCN is not a valid identifier character, LexTokenInternal reads it a second time, so we get the warnings twice. I was trying to avoid a NoWarn variant, but maybe it's necessary.
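
To make the double-warning problem concrete, here is a rough sketch rather than the code in the patch. Everything in it is hypothetical: the Diagnostics struct, the readUCNSketch name, and the warning text are stand-ins, and passing a null diagnostics pointer plays the role of a "NoWarn" variant.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for a diagnostics engine: it only collects messages.
struct Diagnostics {
  std::vector<std::string> Warnings;
  void warn(std::string Msg) { Warnings.push_back(std::move(Msg)); }
};

// Hypothetical simplified UCN reader (not Clang's readUCN).
// Decodes \uXXXX or \UXXXXXXXX at the start of Text. When Diags is null
// the read is silent, which is what a "NoWarn" variant would amount to.
std::optional<uint32_t> readUCNSketch(std::string_view Text,
                                      Diagnostics *Diags) {
  if (Text.size() < 2 || Text[0] != '\\' ||
      (Text[1] != 'u' && Text[1] != 'U'))
    return std::nullopt;

  const std::size_t NumDigits = (Text[1] == 'u') ? 4 : 8;
  if (Text.size() < 2 + NumDigits)
    return std::nullopt;

  uint32_t CodePoint = 0;
  for (std::size_t I = 0; I < NumDigits; ++I) {
    const char C = Text[2 + I];
    uint32_t Digit;
    if (C >= '0' && C <= '9')
      Digit = C - '0';
    else if (C >= 'a' && C <= 'f')
      Digit = C - 'a' + 10;
    else if (C >= 'A' && C <= 'F')
      Digit = C - 'A' + 10;
    else
      return std::nullopt;
    CodePoint = CodePoint * 16 + Digit;
  }

  // Made-up non-fatal warning, emitted on every diagnosing read.
  if (Diags)
    Diags->warn("universal character name used (hypothetical warning)");
  return CodePoint;
}
```

If the identifier path and the general token path both call this with a non-null Diags for the same UCN, the warning is recorded twice. If the identifier path passes nullptr and only the fallback path diagnoses, it is recorded once.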

Jordan

On Jan 17, 2013, at 11:31, Jordan Rose <jordan_rose at apple.com> wrote:

> How about this approach?
> - LexUnicode mirrors LexTokenInternal, dispatching to the proper lex method based on the first Unicode character in a token.
> - UCNs are validated in readUCN (called by LexTokenInternal and LexIdentifier). The specific identifier restrictions are checked in LexUnicode and LexIdentifier.
> - UCNs are recomputed in Preprocessor::LookUpIdentifierInfo because we start with the spelling info there, but all the validation has already happened.
> 
> With these known flaws:
> - the classification of characters in LexUnicode should be more efficient.
> - poor recovery for a non-identifier UCN in an identifier. Right now I just take that to mean "end of identifier", which is the most pedantically correct thing to do, but it's probably not what's intended.
> - still needs more tests, of course
> 
> FWIW, though, I'm not sure unifying literal Unicode and UCNs is actually a great idea. The case where it matters most (validation of identifier characters) is pretty easy to separate out into a helper function (and indeed it already is). The other cases (accepting Unicode whitespace and fixits for accidental Unicode) only make sense for literal Unicode, not escaped Unicode.
> 
> Anyway, what do you think?
> Jordan
> 
> <UCNs.patch>
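
For illustration only, the "non-identifier UCN means end of identifier" recovery described in the quoted message could look roughly like the sketch below. This is not the code in the patch: decodeUCN, isAllowedInIdentifier, and identifierLength are hypothetical names, the allowed-character check is a toy subset (a few Latin-1 letter ranges) rather than the real tables, and leading-digit rules are ignored.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper: decode \uXXXX or \UXXXXXXXX at the start of Text.
// On success, Length is set to the number of source characters consumed.
std::optional<uint32_t> decodeUCN(std::string_view Text, std::size_t &Length) {
  if (Text.size() < 2 || Text[0] != '\\' ||
      (Text[1] != 'u' && Text[1] != 'U'))
    return std::nullopt;
  const std::size_t NumDigits = (Text[1] == 'u') ? 4 : 8;
  if (Text.size() < 2 + NumDigits)
    return std::nullopt;
  uint32_t CodePoint = 0;
  for (std::size_t I = 0; I < NumDigits; ++I) {
    const char C = Text[2 + I];
    uint32_t Digit;
    if (C >= '0' && C <= '9')
      Digit = C - '0';
    else if (C >= 'a' && C <= 'f')
      Digit = C - 'a' + 10;
    else if (C >= 'A' && C <= 'F')
      Digit = C - 'A' + 10;
    else
      return std::nullopt;
    CodePoint = CodePoint * 16 + Digit;
  }
  Length = 2 + NumDigits;
  return CodePoint;
}

// Toy stand-in for the identifier-character tables: only U+00C0..U+00D6
// and U+00D8..U+00F6. The real restrictions cover far more ranges.
bool isAllowedInIdentifier(uint32_t CP) {
  return (CP >= 0x00C0 && CP <= 0x00D6) || (CP >= 0x00D8 && CP <= 0x00F6);
}

bool isASCIIIdentifierChar(char C) {
  return (C >= 'a' && C <= 'z') || (C >= 'A' && C <= 'Z') ||
         (C >= '0' && C <= '9') || C == '_';
}

// Returns how many source characters belong to the identifier that starts
// at Text. A malformed UCN, or a UCN that names a non-identifier
// character, simply ends the identifier.
std::size_t identifierLength(std::string_view Text) {
  std::size_t Pos = 0;
  while (Pos < Text.size()) {
    if (isASCIIIdentifierChar(Text[Pos])) {
      ++Pos;
      continue;
    }
    std::size_t Len = 0;
    const std::optional<uint32_t> CP = decodeUCN(Text.substr(Pos), Len);
    if (!CP || !isAllowedInIdentifier(*CP))
      break;
    Pos += Len;
  }
  return Pos;
}
```

With this sketch, `caf\u00E9` is one identifier, while in `a\u00D7b` the identifier stops after `a` because U+00D7 (the multiplication sign) is outside the toy table. That is the pedantic behavior, and it leaves the UCN to be lexed again by whatever handles the next token.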