[cfe-commits] [PATCH] Support for universal character names in identifiers

Jordan Rose jordan_rose at apple.com
Fri Jan 18 11:20:46 PST 2013


On Jan 17, 2013, at 17:43 , Richard Smith <richard at metafoo.co.uk> wrote:

> On Thu, Jan 17, 2013 at 11:31 AM, Jordan Rose <jordan_rose at apple.com> wrote:
>> How about this approach?
>> - LexUnicode mirrors LexTokenInternal, dispatching to the proper lex method based on the first Unicode character in a token.
>> - UCNs are validated in readUCN (called by LexTokenInternal and LexIdentifier). The specific identifier restrictions are checked in LexUnicode and LexIdentifier.
>> - UCNs are recomputed in Preprocessor::LookUpIdentifierInfo because we start with the spelling info there, but all the validation has already happened.
> 
> This seems reasonable to me. It looks like this will now correctly reject
> 
> #if 0
> \u0000
> #endif
> 
> ... which I believe the old approach accepted.
> 
>> With these known flaws:
>> - the classification of characters in LexUnicode should be more efficient.
>> - poor recovery for a non-identifier UCN in an identifier. Right now I just take that to mean "end of identifier", which is the most pedantically correct thing to do, but it's probably not what's intended.
>> - still needs more tests, of course
>> 
>> FWIW, though, I'm not sure unifying literal Unicode and UCNs is actually a great idea. The case where it matters most (validation of identifier characters) is pretty easy to separate out into a helper function (and indeed it already is). The other cases (accepting Unicode whitespace and fixits for accidental Unicode) only make sense for literal Unicode, not escaped Unicode.
> 
> For build systems which automatically convert UTF-8 to UCNs for gcc
> compatibility[0], it's not implausible that UCN-encoded smart quotes
> or whitespace might end up in Clang input.
> 
>  [0]: http://gcc.gnu.org/wiki/FAQ#utf8_identifiers
> 
>> Anyway, what do you think?
> 
> Generally, I like this. Some specific comments:
> 
> In readUCN, \u123 isn't an error, it's two separate tokens. Perhaps
> the diagnostics here should be warnings instead?

Since a lone backslash always produces an invalid token, it seems reasonable to leave these as errors, but yes, I now need to go back and not consume anything in that case.


> Likewise, in LexIdentifier and LexTokenInternal, you shouldn't call
> ConsumeChar until you know that you have actually got a UCN.
> 
> Likewise, in LexUnicode, err_non_ascii can be at most a warning; we're
> required to produce a preprocessing-token for each character which
> can't be the start of another token. The err_invalid_utf8 part in
> LexTokenInternal is fine though, since that's covered by the
> implementation-defined mapping from physical source file characters to
> the basic source character set.

*sigh* This seems like further justification for distinguishing UCNs from actual Unicode when handling the "accidental" case, but I will make sure UCNs as written never fall into that block. (Actually, I'll just go with your way for now and handle that QoI issue later.)



