[cfe-commits] [PATCH] Support for universal character names in identifiers

Thu Jan 17 17:43:25 PST 2013

On Thu, Jan 17, 2013 at 11:31 AM, Jordan Rose <jordan_rose at apple.com> wrote:
> How about this approach?
> - LexUnicode mirrors LexTokenInternal, dispatching to the proper lex method based on the first Unicode character in a token.
> - UCNs are validated in readUCN (called by LexTokenInternal and LexIdentifier). The specific identifier restrictions are checked in LexUnicode and LexIdentifier.
> - UCNs are recomputed in Preprocessor::LookUpIdentifierInfo because we start with the spelling info there, but all the validation has already happened.

This seems reasonable to me. It looks like this will now correctly reject

#if 0
\u0000
#endif

... which I believe the old approach accepted.

> With these known flaws:
> - the classification of characters in LexUnicode should be more efficient.
> - poor recovery for a non-identifier UCN in an identifier. Right now I just take that to mean "end of identifier", which is the most pedantically correct thing to do, but it's probably not what's intended.
> - still needs more tests, of course
>
> FWIW, though, I'm not sure unifying literal Unicode and UCNs is actually a great idea. The case where it matters most (validation of identifier characters) is pretty easy to separate out into a helper function (and indeed it already is). The other cases (accepting Unicode whitespace and fixits for accidental Unicode) only make sense for literal Unicode, not escaped Unicode.

For build systems which automatically convert UTF-8 to UCNs for gcc
compatibility[0], it's not implausible that UCN-encoded smart quotes
or whitespace might end up in Clang input.

  [0]: http://gcc.gnu.org/wiki/FAQ#utf8_identifiers

> Anyway, what do you think?

Generally, I like this. Some specific comments:

In readUCN, \u123 isn't an error, it's two separate tokens. Perhaps
the diagnostics here should be warnings instead?

Likewise, in LexIdentifier and LexTokenInternal, you shouldn't call
ConsumeChar until you know that you have actually got a UCN.

Likewise, in LexUnicode, err_non_ascii can be at most a warning; we're
required to produce a preprocessing-token for each character which
can't be the start of another token. The err_invalid_utf8 part in
LexTokenInternal is fine though, since that's covered by the
implementation-defined mapping from physical source file characters to
the basic source character set.