[cfe-commits] [PATCH] Support for universal character names in identifiers

Richard Smith richard at metafoo.co.uk
Fri Jan 18 11:36:57 PST 2013


On Fri, Jan 18, 2013 at 11:20 AM, Jordan Rose <jordan_rose at apple.com> wrote:
>
> On Jan 17, 2013, at 17:43 , Richard Smith <richard at metafoo.co.uk> wrote:
>
>> On Thu, Jan 17, 2013 at 11:31 AM, Jordan Rose <jordan_rose at apple.com> wrote:
>>> How about this approach?
>>> - LexUnicode mirrors LexTokenInternal, dispatching to the proper lex method based on the first Unicode character in a token.
>>> - UCNs are validated in readUCN (called by LexTokenInternal and LexIdentifier). The specific identifier restrictions are checked in LexUnicode and LexIdentifier.
>>> - UCNs are recomputed in Preprocessor::LookUpIdentifierInfo because we start with the spelling info there, but all the validation has already happened.
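(For anyone following the thread: the dispatch being described is
roughly the shape below. The helper name and the exact signatures are
illustrative only, not the actual patch.)

  void Lexer::LexUnicode(Token &Result, uint32_t CodePoint,
                         const char *CurPtr) {
    // Dispatch on the first Unicode character of the token.
    if (isAllowedInitiallyInIdentifier(CodePoint)) // hypothetical helper
      return LexIdentifier(Result, CurPtr);
    // Any other code point forms a preprocessing-token by itself.
    FormTokenWithChars(Result, CurPtr, tok::unknown);
  }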
>>
>> This seems reasonable to me. It looks like this will now correctly reject
>>
>> #if 0
>> \u0000
>> #endif
>>
>> ... which I believe the old approach accepted.
>>
>>> With these known flaws:
>>> - the classification of characters in LexUnicode should be more efficient.
>>> - poor recovery for a non-identifier UCN in an identifier. Right now I just take that to mean "end of identifier", which is the most pedantically correct thing to do, but it's probably not what's intended.
>>> - still needs more tests, of course
>>>
>>> FWIW, though, I'm not sure unifying literal Unicode and UCNs is actually a great idea. The case where it matters most (validation of identifier characters) is pretty easy to separate out into a helper function (and indeed it already is). The other cases (accepting Unicode whitespace and fixits for accidental Unicode) only make sense for literal Unicode, not escaped Unicode.
>>
>> For build systems which automatically convert UTF-8 to UCNs for gcc
>> compatibility[0], it's not implausible that UCN-encoded smart quotes
>> or whitespace might end up in Clang input.
>>
>>  [0]: http://gcc.gnu.org/wiki/FAQ#utf8_identifiers
>>
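As a concrete (made-up) illustration of what such a conversion can
hand us:

  int caf\u00E9 = 0;  // from: int café = 0; -- still a valid identifier
  int a\u2019s = 0;   // from a pasted smart quote; U+2019 is not a
                      // valid identifier character, but it now reaches
                      // the lexer as a UCN rather than as raw UTF-8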
>>> Anyway, what do you think?
>>
>> Generally, I like this. Some specific comments:
>>
>> In readUCN, \u123 isn't an error; it's two separate tokens. Perhaps
>> the diagnostics here should be warnings instead?
>
> Since a lone backslash always produces an invalid token, it seems reasonable to leave these as errors, but yes, I now need to go back and not consume anything in that case.

The issue is that it's not an invalid token at this phase of
translation. If the preprocessing-token is not converted to a token,
it's not an error. For instance, it might be skipped in a #if 0, or it
might be stringized, or it might be in a macro which is never
expanded.
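For instance, each of the following must be accepted (the macro name
is invented for illustration), because the '\' preprocessing-token is
never converted to a token:

  #if 0
  \u123
  #endif

  #define NEVER_EXPANDED \u123  // fine, as long as it's never expanded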

>> Likewise, in LexIdentifier and LexTokenInternal, you shouldn't call
>> ConsumeChar until you know that you have actually got a UCN.
>>
>> Likewise, in LexUnicode, err_non_ascii can be at most a warning; we're
>> required to produce a preprocessing-token for each character which
>> can't be the start of another token. The err_invalid_utf8 part in
>> LexTokenInternal is fine though, since that's covered by the
>> implementation-defined mapping from physical source file characters to
>> the basic source character set.
>
> *sigh* This seems like more justification to distinguish UCNs and actual Unicode to handle the "accidental" case, but I will make sure UCNs as written never fall into that block. (Actually, I'll just go with your way for now and handle that QoI issue later.)
>
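On the ConsumeChar point above, the shape I have in mind is to peek at
the whole escape before committing to it; a sketch (tryReadUCN is a
hypothetical helper, not something in the patch):

  // Peek: advance UCNEnd past "\uXXXX" only if the escape really is a
  // complete, valid UCN; otherwise leave everything untouched.
  uint32_t CodePoint;
  const char *UCNEnd = CurPtr;
  if (tryReadUCN(UCNEnd, CodePoint)) {
    CurPtr = UCNEnd;  // safe to consume now
    // ... continue lexing with CodePoint ...
  }
  // On failure CurPtr is unchanged, and '\' lexes as its own pp-token.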

One thing I missed before: please don't use iscntrl or isxdigit;
they're dependent on the current locale (and to be safe, don't use
isascii either).
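Locale-independent replacements are only a few lines each; something
along these lines (a sketch -- the names are mine, not the patch's):

  // ASCII-only, locale-independent stand-ins for <ctype.h> queries.
  static inline bool isHexDigitASCII(char C) {
    return (C >= '0' && C <= '9') ||
           (C >= 'a' && C <= 'f') ||
           (C >= 'A' && C <= 'F');
  }

  static inline bool isControlASCII(char C) {
    unsigned char UC = static_cast<unsigned char>(C);
    return UC < 0x20 || UC == 0x7F;  // C0 controls and DEL
  }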



